
MIT scientists study the risk of memorization in the age of clinical AI

What is the point of patient privacy? The Hippocratic Oath, one of the world's earliest and best-known texts on medical ethics, states: “Whatever I see or hear in the lives of my patients, whether in connection with my professional practice or not, which ought not to be spoken of outside, I will keep secret, as considering all such things to be private.”

As privacy becomes increasingly scarce in the age of data-hungry algorithms and cyberattacks, medicine is one of the few remaining fields where confidentiality remains central to practice, allowing patients to trust their doctors with sensitive information.

But a new paper co-authored by MIT researchers examines how artificial intelligence models trained on de-identified electronic health records (EHRs) can store patient-specific information. The work, recently presented at the 2025 Conference on Neural Information Processing Systems (NeurIPS), recommends a rigorous testing setup to ensure that targeted prompts cannot leak information, and emphasizes that leaks should be evaluated in a health care context to determine whether they meaningfully compromise patient privacy.

Foundation models trained on EHRs are meant to generalize, drawing on many patient records to make better predictions. When a model instead memorizes, it relies on a single patient's record to produce its output, potentially violating that patient's privacy. Foundation models in particular are already known to be vulnerable to data leaks.

“Knowledge from these high-performance models can be a resource for many communities, but adversarial attackers can force a model to extract information from its training data,” says Sana Tonekaboni, a postdoctoral researcher at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and lead author of the paper. Given the risk that foundation models may also store private data, she notes, “this work is a step toward ensuring that there are practical assessment steps our community can take before models are released.”

To study the potential risks that EHR foundation models could pose in medicine, Tonekaboni turned to Marzyeh Ghassemi, an MIT associate professor who is a principal investigator at the Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic) and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). Ghassemi, a faculty member in MIT's Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science, leads the Healthy ML group, which focuses on robust machine learning in health care.

How much information does a malicious actor need in order to expose sensitive data, and what risks are associated with the leaked information? To assess this, the research team developed a series of tests that they hope will lay the foundation for future privacy assessments. The tests are intended to measure different forms of leakage and to gauge their practical risk to patients across varying levels of attacker knowledge.

“We really tried to focus on practicality here. If an attacker needs to know the date and value of a dozen lab tests from your file in order to extract information, the risk of harm is very low. If I already have access to that level of protected source data, why would I need to attack a large foundation model to get more?” says Ghassemi.

With the inevitable digitization of medical records, data breaches are becoming more common. Over the past 24 months, the U.S. Department of Health and Human Services has recorded 747 breaches of health information, each affecting more than 500 people, most of them classified as hacking/IT incidents.

Patients with certain medical conditions are particularly at risk because they are easy to recognize. “Even with de-identified data, it depends on what kind of information you reveal about the person,” says Tonekaboni. “Once you identify them, you know a lot more.”

In their structured tests, the researchers found that the more information an attacker has about a specific patient, the more likely the model is to leak information. They also demonstrated ways to distinguish cases of model generalization from cases of patient-level memorization in order to assess privacy risk properly.
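To make the idea concrete, here is a minimal, hypothetical Python sketch of such a probe. It is not the paper's test suite: the model object, its predict_next_events method, and the record format are assumptions invented for illustration. The sketch varies how many facts from a patient's record the attacker supplies as context, counts how often a held-out detail appears in the output, and compares real training records against matched synthetic ones to separate memorization from generalization.

# Hypothetical sketch of a targeted-prompt leakage probe.
# `model`, `predict_next_events`, and the record format are assumptions
# made for illustration; they are not from the paper.

from typing import Dict, List

def leak_rate(model, records: List[Dict], context_sizes=(1, 3, 6, 12)) -> Dict[int, float]:
    """For each context size, prompt the model with that many known facts
    from a patient's record and count how often a held-out, record-specific
    detail appears in the output."""
    rates = {}
    for k in context_sizes:
        leaks = 0
        for rec in records:
            context = rec["events"][:k]                   # facts the attacker already knows
            target = rec["events"][k]                     # held-out detail (e.g., a diagnosis code)
            output = model.predict_next_events(context)   # assumed API
            leaks += int(target in output)
        rates[k] = leaks / len(records)
    return rates

def memorization_gap(rates_real: Dict[int, float], rates_matched: Dict[int, float]) -> Dict[int, float]:
    """Compare leak rates on real training records against demographically
    matched synthetic records: a large gap at a given context size suggests
    patient-level memorization rather than population-level generalization."""
    return {k: rates_real[k] - rates_matched[k] for k in rates_real}

In this framing, a model that only reproduces population-level patterns should leak the held-out detail at roughly the same rate for real and matched records, while a memorizing model leaks noticeably more on the records it was trained on.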

The paper also emphasizes that some leaks are more damaging than others. For example, a model that reveals a patient's age or demographics could be characterized as a more benign leak than one that reveals more sensitive information, such as an HIV diagnosis or alcohol abuse.
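One way to make that distinction concrete is to weight leaked attributes by their sensitivity. The short sketch below is purely illustrative; the categories and weights are placeholders chosen for this example, not values from the paper.

# Minimal sketch of severity-weighted leak scoring; the categories and
# weights are illustrative placeholders, not values from the paper.

SENSITIVITY = {
    "age": 0.1,
    "demographics": 0.2,
    "lab_result": 0.5,
    "hiv_diagnosis": 1.0,
    "substance_use": 1.0,
}

def weighted_leak_score(leaked_fields):
    """Sum sensitivity weights over the fields a probe managed to extract,
    so that a demographic leak scores lower than a sensitive diagnosis."""
    return sum(SENSITIVITY.get(field, 0.5) for field in leaked_fields)

print(weighted_leak_score(["age", "demographics"]))   # relatively benign leak
print(weighted_leak_score(["hiv_diagnosis"]))         # high-risk leak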

The researchers note that such easily identifiable patients may require higher levels of protection. They plan to expand the work to be more interdisciplinary, bringing in clinicians, privacy experts, and legal experts.

“There’s a reason our health information is private,” Tonekaboni says. “There’s no reason for anyone else to know about it.”

This work was supported by the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, Wallenberg AI, the Knut and Alice Wallenberg Foundation, the U.S. National Science Foundation (NSF), a Gordon and Betty Moore Foundation award, a Google Research Scholar Award, and the Schmidt Sciences AI2050 program. Resources used to prepare this research were provided in part by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.
