OpenMed's Multilingual PII Detection: Open-Source De-Identification for European Healthcare
De-identifying patient data across multiple languages is a persistent challenge in European healthcare. Most organizations working with non-English clinical text end up translating to English before running PII detection, building fragile regex patterns per language, or running English-only models on non-English text and hoping for the best. Some skip de-identification entirely.
OpenMed, an open-source healthcare AI project, recently released something interesting: 105 language-specific PII detection models for French, German, and Italian.

What OpenMed Built
The project has open-sourced 105 PII detection models across French, German, and Italian, built to support HIPAA- and GDPR-aligned de-identification in healthcare settings. All are released under the Apache 2.0 license.
The release includes 35 fine-tuned models per language, spanning a broad range of transformer architectures: DeBERTa-v3, RoBERTa, ModernBERT, XLM-RoBERTa, multilingual specialists like mDeBERTa and EuroBERT, and biomedical variants including BioClinical-ModernBERT and Clinical-Longformer. Models range from 33 million to 600 million parameters, letting organizations choose their own trade-off between accuracy and inference speed.
Each language was trained on substantial datasets:
- French: 49,580 samples, top model achieving 97.97% F1
- German: 42,250 samples, top model achieving 97.61% F1
- Italian: 40,944 samples, top model achieving 97.28% F1
The models detect over 55 PII entity types per language, covering personal identifiers like national IDs and passport numbers, contact information, financial data such as IBANs and bank accounts, and network identifiers. What stands out is that each language handles its own native patterns. French models recognize the numéro de sécurité sociale and IBAN FR formats. German models handle the Sozialversicherungsnummer and IBAN DE. Italian models detect the Codice Fiscale and IBAN IT. These are language-native detection systems, not translations of English patterns.
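In practice, a de-identification step takes the entity spans a model emits and replaces them with type placeholders. Here is a minimal sketch of that masking step in Python, assuming spans in the shape produced by Hugging Face token-classification pipelines with entity aggregation (`start`, `end`, `entity_group`); the function name, the sample note, and the span values are illustrative, not part of the OpenMed release:

```python
def mask_pii(text, entities):
    """Replace detected PII spans with [TYPE] placeholders.

    `entities` follows the Hugging Face token-classification pipeline
    output shape: dicts with "start", "end", and "entity_group" keys.
    """
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Hypothetical model output for a French clinical note (illustrative values).
note = "Patient Jean Dupont, NIR 1 85 05 78 006 084 36."
spans = [
    {"start": 8, "end": 19, "entity_group": "PERSON"},
    {"start": 25, "end": 46, "entity_group": "SOCIAL_SECURITY_NUMBER"},
]
print(mask_pii(note, spans))
# → Patient [PERSON], NIR [SOCIAL_SECURITY_NUMBER].
```

Replacing spans right-to-left is the standard trick here: it keeps the character offsets of the remaining entities valid as the string shrinks or grows.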
Why This Is Worth Watching
The underlying problem is structural. French hospitals generate French clinical notes. German insurers process German claims. Italian research institutions work with Italian patient records. When organizations force this data through English-language pipelines, they introduce compounding errors at every stage: translation artifacts, lost context, and PII entities that simply do not exist in English formats.
This mirrors a broader pattern across multilingual AI. Just as LLMs demonstrate substantial performance disparities across languages in comprehension and generation, the tooling around those models, including safety and compliance infrastructure, tends to default to English. Privacy protection should not be a capability that degrades when you switch languages.
For healthcare specifically, the regulatory context adds urgency. GDPR requires robust data protection regardless of the language in which patient data is recorded. HIPAA mandates de-identification across all 18 Safe Harbor categories. Projects like OpenMed's are a step toward closing that gap between what compliance demands and what open-source tooling currently offers.
Open-Source Multilingual AI Infrastructure
OpenMed's release joins a growing body of open-source work aimed at closing the multilingual gap in AI infrastructure. Projects like AfriqueLLM, which adapted language models to 20 African languages through continued pre-training, and AI Singapore's SEA-LION for Southeast Asian languages, represent parallel efforts in different domains and regions.
What these projects share is a recognition that multilingual AI capability cannot be an afterthought. It needs to be built into the infrastructure layer: the models, the evaluation benchmarks, and the compliance tooling.
At Future Ethics, our own multilingual NLP work, including the SASEA Language Working Group focused on South and Southeast Asian languages, operates from the same premise. The languages and domains differ, but the challenge is consistent: building AI systems that work equitably across the languages people actually use.
Looking Forward
OpenMed's models are available on Hugging Face. For organizations working in European healthcare AI, clinical NLP, or privacy-preserving machine learning, the collection is worth exploring.
We will continue tracking and writing about projects like this as part of a broader effort to catalog the libraries, models, and services advancing multilingual AI infrastructure. If you are working on something in this space, we would like to hear from you.

Join our newsletter for AI safety news and research updates