MIT scientists investigate the risk of memorization in the age of clinical AI | MIT News

What is patient confidentiality? The Hippocratic Oath, one of the world's best-known medical ethics texts, reads: "Whatever I see or hear in the lives of my patients, whether related to my professional work or not, that should not be spoken of outside, I will keep it secret, as I consider all these things to be private."
As privacy becomes increasingly scarce in the age of data-hungry algorithms and cyberattacks, medicine is one of the few fields left where confidentiality remains at the core of practice, enabling patients to trust their doctors with sensitive information.
But a paper co-authored by MIT researchers investigates whether artificial intelligence models trained on de-identified electronic health records (EHRs) can memorize specific patient information. The work, which was recently presented at the 2025 Conference on Neural Information Processing Systems (NeurIPS), recommends rigorous vetting to ensure that targeted queries cannot extract patient information, and stresses that leaks should be assessed in a health care context to determine whether they genuinely compromise patient privacy.
Foundation models trained on EHRs typically aggregate information across many patient records to make better predictions. In "memorization," however, the model draws on a single patient's record to produce its output, potentially violating that patient's privacy. Notably, foundation models are already known to be prone to this kind of data leakage.
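To make the distinction between aggregation and memorization concrete, here is a minimal, hypothetical sketch of a probe, not the paper's actual protocol: hide one field of a patient's record, ask the model to fill it in, and check whether the answer tracks that specific patient far more closely than population statistics would. The `model.predict` call and the record format are assumptions made for illustration.

```python
# Hypothetical memorization probe (illustrative only; not the authors' method).
import numpy as np

def memorization_score(model, patient_record, population_values, hidden_field):
    """Compare the model's guess for a hidden field against the patient's
    true value versus what population statistics alone would suggest."""
    true_value = patient_record[hidden_field]
    partial = {k: v for k, v in patient_record.items() if k != hidden_field}

    guess = model.predict(partial, target=hidden_field)  # assumed API

    # Error a purely aggregating model would make: predict the population mean.
    population_error = abs(np.mean(population_values) - true_value)
    # Error of the model's guess relative to this specific patient.
    patient_error = abs(guess - true_value)

    # Scores near 1 mean the guess matches this patient far better than the
    # population baseline, which hints at memorization rather than aggregation.
    return 1.0 - patient_error / (population_error + 1e-8)
```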
"Information from these foundation models can be useful to many communities, but adversarial attacks can cause a model to reveal information from its training data," said Sana Tonekaboni, a postdoc at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and first author of the paper. Given the risk that foundation models can capture private data, she notes, "this work is a step toward ensuring that there are effective vetting steps our community can take before releasing models."
To study the potential risks of EHR-based models in medicine, Tonekaboni turned to MIT Associate Professor Marzyeh Ghassemi, a principal investigator at the Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic) and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). Ghassemi, a faculty member in MIT's Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science, runs the Healthy ML group, which focuses on robust machine learning in health.
How much information does a bad actor need to expose sensitive data, and what are the risks associated with leaked information? To answer these questions, the research team conducted a series of experiments that they hope will lay the groundwork for future privacy analyses. The tests are designed to measure different types of memorization and to evaluate the actual risk to patients by simulating the various stages of a possible attack.
"We've really tried to emphasize practicality here; if an attacker has to know the dates and values of a dozen laboratory tests in your record to extract information, there's very little risk of harm. If I already have access to that level of secure source data, why would I need to attack a large foundation model to find out more?" Ghassemi said.
With the inevitable digitization of medical records, data breaches have become commonplace. Over the past 24 months, the US Department of Health and Human Services has recorded 747 breaches of health information, each affecting 500 or more people, most of them classified as hacking/IT incidents.
Patients with unique conditions are particularly at risk, given how easy it is to pick them out. “Even with de-identified data, it depends on what kind of information you’re extracting about the person,” Tonekaboni said. “Once you identify them, you know a lot.”
In their systematic tests, the researchers found that the more information an attacker has about a particular patient, the more likely the model is to leak information. They also showed how to distinguish generic samples from patient-level samples in order to properly assess the privacy risk.
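An experiment of this general shape could, in principle, be sketched as follows; this is an illustrative harness rather than the authors' protocol, and `model.likelihood`, the record format, and the `sample_decoys` helper are all assumptions. It estimates how often an attacker can single out the true record as the number of attributes they already know grows.

```python
# Illustrative leak-rate experiment (an assumed setup, not the paper's exact tests).
import random

def leak_rate(model, patients, sample_decoys, known_counts, trials=100):
    """For each level of attacker knowledge k, estimate how often the model
    scores the true record above a set of plausible decoy records."""
    results = {}
    for k in known_counts:
        hits = 0
        for _ in range(trials):
            patient = random.choice(patients)
            # Attributes the attacker is assumed to already know.
            known = dict(random.sample(sorted(patient.items()), k))

            # Candidate records: the true one plus decoys from the same population.
            candidates = [patient] + sample_decoys(patient)

            # The attacker picks whichever candidate the model finds most
            # consistent with the known attributes (assumed scoring API).
            best = max(candidates, key=lambda c: model.likelihood(c, given=known))
            hits += (best is patient)
        results[k] = hits / trials
    # e.g. {2: 0.11, 5: 0.43, 12: 0.95}: more attacker knowledge, more leakage.
    return results
```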
The paper also emphasizes that some leaks are more harmful than others. For example, a model that reveals a patient's age or demographics may be considered a more benign leak than one that reveals more sensitive information, such as an HIV diagnosis or alcohol abuse.
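As a toy illustration of that idea (the field names and severity weights below are invented, not the paper's taxonomy), leaked attributes could be weighted by how sensitive they are in a clinical context before an overall harm score is reported.

```python
# Toy severity weighting of leaked attributes (assumed categories and weights).
SEVERITY = {
    "age": 1,             # broadly shared demographic; low harm on its own
    "sex": 1,
    "zip_code": 2,        # quasi-identifier; riskier in combination
    "lab_results": 3,
    "hiv_status": 5,      # stigmatizing diagnoses carry the highest weight
    "alcohol_use": 5,
}

def leak_severity(leaked_fields):
    """Aggregate a crude harm score for a set of leaked attributes."""
    return sum(SEVERITY.get(field, 3) for field in leaked_fields)

print(leak_severity(["age", "sex"]))              # 2: relatively benign
print(leak_severity(["zip_code", "hiv_status"]))  # 7: serious exposure
```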
Because patients with rare or unique conditions are easier to single out, the researchers note that they may require higher levels of protection. The team plans to expand the work to be more interdisciplinary, bringing in clinicians as well as privacy and legal experts.
“There’s a reason our health data is private,” Tonekaboni said. “There is no reason for others to know about it.”
This work was supported by the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard; the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation; the US National Science Foundation (NSF); a Gordon and Betty Moore Foundation award; a Google Research Scholar award; and the AI2050 program at Schmidt Sciences. Resources used in the preparation of this study were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and corporate sponsors of the Vector Institute.



