De-identification and Anonymization of Claims Narratives Using Hybrid Neural Architectures
Abstract
This paper addresses the challenge of de-identifying sensitive personal information in insurance claims narratives through hybrid neural architectures that combine multiple layers of language processing. The proposed methodology obfuscates, removes, or replaces personally identifying information while preserving the coherence and semantic structure of the text for downstream analysis. By unifying recurrent language models and attention-based mechanisms with external knowledge resources, the approach remains robust to the syntactic and contextual variation present in real-world claims data. The work builds on the premise that balancing contextual-embedding fidelity with explicit structured constraints can yield high-quality anonymized narratives that retain their factual essence while preventing the inference of personal attributes. Both theoretical and practical aspects of the approach are investigated, including novel representations that integrate logic-based constraints to capture domain-specific rules, and linear-algebraic mechanisms for managing large-scale embeddings efficiently. Experiments on de-identified insurance claim datasets confirm that the hybrid strategy surpasses single-model baselines in precision, recall, and downstream utility, and that strategically including domain-specific lexical constraints strengthens privacy guarantees. The findings point to a promising direction for anonymization frameworks that generalize across domain texts while offering strong theoretical guarantees of de-identification accuracy.
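To make the hybrid idea concrete, the following is a minimal sketch (not the authors' implementation) of how model-proposed entity spans can be combined with explicit, rule-based constraints acting as a deterministic safety net. The patterns and the stub tagger are illustrative assumptions standing in for a learned model.

```python
import re

# Assumed, illustrative patterns for explicit domain constraints;
# a real system would encode far richer logic-based rules.
RULE_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def stub_neural_tagger(text):
    """Stand-in for a learned NER model; returns (start, end, label) spans."""
    spans = []
    # Hypothetical heuristic mimicking a model's PERSON predictions.
    for m in re.finditer(r"\b(?:Mr\.|Ms\.|Dr\.)\s+[A-Z][a-z]+", text):
        spans.append((m.start(), m.end(), "PERSON"))
    return spans

def deidentify(text):
    # Collect model-proposed spans, then add deterministic rule matches.
    spans = stub_neural_tagger(text)
    for label, pattern in RULE_PATTERNS.items():
        spans.extend((m.start(), m.end(), label) for m in pattern.finditer(text))
    # Replace from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

claim = "Mr. Smith reported the loss on 03/14/2024; call 555-867-5309."
print(deidentify(claim))
# → [PERSON] reported the loss on [DATE]; call [PHONE].
```

Replacing spans with typed placeholders (rather than deleting them) is one way to preserve the narrative's semantic structure for downstream analysis, as the abstract emphasizes.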
License
Copyright (c) 2025 author

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.