Illustration by Ana Kova

Healthcare organizations and hospitals in the United States all sit on a treasure trove: stockpiles of patient health data stored as electronic medical records. Those files show what people are sick with, how they were treated, and what happened next. Taken together, they’re hugely valuable resources for medical discovery.

Because of certain provisions of the Health Insurance Portability and Accountability Act (HIPAA), healthcare organizations are able to put that treasure trove to work. As long as they de-identify the records — removing information like patient names, locations, and phone numbers — they can give or sell the data to partners for research. They don’t need to get consent from patients to do it or even tell them about it.

More and more healthcare groups are taking advantage of those partnerships. The Mayo Clinic in Rochester, Minnesota, is working with startups to develop algorithms to diagnose and manage conditions based on health data. Fourteen US health systems formed a company to aggregate and sell de-identified data earlier this year. The healthcare company HCA announced a new data deal with Google in May.

There may be benefits to sharing this data — researchers can learn what types of treatments are best for people with certain medical conditions and develop tools to improve care. But there are risks to free-flowing data, says Eric Perakslis, chief science and digital officer at the Duke Clinical Research Institute. He outlined the ways the system could potentially harm patients in a recent New England Journal of Medicine article with Kenneth Mandl, director of the computational health informatics program at Boston Children’s Hospital.

“I’m a huge advocate for open data,” Perakslis says. “I think it’s very easy to get excited about the benefits. What we know with medical sciences, though, is that you don’t always understand the risks that come with the benefits until later.”

Perakslis talked to The Verge about what could go wrong and how to protect people from those risks.

This interview has been lightly edited for clarity.

When did healthcare organizations start taking advantage of their electronic medical records data in this way?

I want to say it was probably in 2017 or 2018. The thing that really pushed this into overdrive was the rise of privacy-preserving record linkage, which combines records from the same person without identifying them. The technologies are perfectly fine. But it almost makes a lot of people feel like, “Well, if I de-identified the data, I can almost do anything I want.”

Before those technologies, the only good way to do de-identification was to have a statistician do it. These technologies made it so almost anyone could do it. They’re not expensive. So the technology is ubiquitous, and it’s very easy to make deals and start marketing the data.

Who is using this data, and what’s it being used for?

If you look at ethical research, there are a lot of academic medical centers with great de-identified data sets and large research networks that have been well-monitored and well-designed. What’s happened beyond that is a place like an MRI center or a pharmacy has an agreement with a hospital, but that agreement doesn’t prohibit them from doing anything with that data. If you haven’t explicitly told them they can’t de-identify and sell data, they can. That’s the kind of thing we’d call a data leak.

So then there are these large data networks that are being formed, where people are putting their data in with other people’s, and then those big networks are trying to sell to pharma or government research labs or places like that. At the end of the day, everyone’s trying to sell to pharma because it’s so lucrative. In fairness, the pharma companies aren’t necessarily looking to do anything less than completely ethical, as far as getting data.

I think most healthcare institutions are interested in using data for profit and for research. I don’t think there’s anything wrong with that if you can actually say how you’re returning the benefit back to the core mission of the place.

So say I have my health record at a hospital that then decides to sell it to a private company that’s building a health database. It’s de-identified, so my name isn’t on it. In what ways does that put me at risk?

I’ve always kind of called de-identification a privacy placebo. It works about as well as the thermostat in a hotel room. There’s a lot of ways around it.

If it’s re-identified and the data is hacked or exposed, there are a few things that could go on. A lot of people will use stolen medical data to make fraudulent medical claims, and then what happens is the victim of the identity theft gets all these bills. Your medical record contains financial information, so there’s the financial risk of that. The other thing that can go on is if you had a condition that you didn’t want your family to know about, or your employer or something like that, it could be exposed.

We’ve gotten really good at not fixing anything when this happens. Once the data is out, it’s out.

Those are the risks. But what are the benefits? How well can data from health records actually be used to solve health problems?

The usefulness of it is absolutely overblown. I think that, at the end of the day, electronic health records have been shown to be pretty good billing systems. While there’s great research done on them, that doesn’t mean the research is easy. It just means more people have taken it on. Good research requires really skilled people to do it. It’s really easy to underestimate the complexity of the problem. I get people calling me all the time with big studies, wondering why we couldn’t just do it using electronic health records data. We can do a lot of research that way, but it isn’t always high-quality research.

I think there are benefits, it just matters where you’re looking. There are great open-source data initiatives that have just really democratized the ability for smart people everywhere to get access to good-quality data for their ideas.

I’m sure people are going to do great things. But I’ve had long conversations with people in this market, and many of them genuinely believe that what they’re doing is going to help patients. But they’re naive, and there will be gaps in their methods that will invalidate the research. It’s the “move fast, break things” mentality, which is wonderful, but please don’t move fast and break things in my daughter’s medical records.

Are there other ways to do better research that also gives patients more protection in the process?

There’s also the traditional medical research establishment that gets patients’ consent and uses the same technology in that consented way. And they get IRB approval. [Note: Institutional Review Boards, or IRBs, do ethics reviews of research that includes human subjects.]

That’s done by the government and nonprofits and also pharma. There’s great stuff out there. So I guess the question is: why do we need this whole other thing? If pharma is de-identifying and sharing data where the patients consented for research and it’s overseen by an IRB, if all of that is working, why do we need this other, risky thing? The IRB also looks at the validity of the research. Nobody’s looking at the validity of the research of this off-the-grid stuff that’s going on.

Do you think health institutions or regulatory agencies will adjust anything to block some of these data leaks or prevent some of the risks to patients?

Something like this is like any other kind of medical harm. An adverse event could be a destroyed credit score. I think there are parts of healthcare that take this very seriously, but I don’t think it’s second nature yet.

That’s partly because the game is always being upped. I think it’s very difficult to stay on the curve, especially in medicine. On one side, you have these speed-of-light tech and cybercrime processes going on, and the other is smart people trying to take care of patients better. And they’re just mismatched.

But I think the industry could help itself a little bit, and be more open, and say that they’ll do more with consent. Or [regulators] could make re-identification of data illegal. Something that actually protects the people who are going to suffer from this. I actually don’t believe there’s anything wrong with the technologies. It’s really more a matter of saying, “If we’re going to do this type of research, how do we ensure we’re protecting the people that might be harmed by it?”