The Doppelganger Effect in Big Data, Explained

This article is an excerpt from the Shortform book guide to "Everybody Lies" by Seth Stephens-Davidowitz.

What is the doppelganger effect in data? How is the method used to study people?

One big data technique that Seth Stephens-Davidowitz identifies is the doppelganger method. With it, researchers make predictions about one person by studying other people who are statistically similar to that person.

Learn more about the power of doppelgangers, as explained in Everybody Lies.

The Power of Doppelgangers

Stephens-Davidowitz explains that the doppelganger method was first developed by statistician and political forecaster Nate Silver, who used it to predict baseball players’ future performance. Silver realized that instead of mapping a player’s performance onto a generic career trajectory curve, it would be better to find the past players who were statistically most similar to the player in question and see how their careers unfolded. Stephens-Davidowitz calls these similar players doppelgangers, and they serve as the reference points for a prediction. The predictive power that comes from such matches is what’s referred to here as the doppelganger effect. For example, if you’re trying to decide whether to keep or trade your star hitter as he nears 30 years old, you can look at his doppelgangers to see whether they kept performing or declined in their 30s.
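To make the idea concrete, here is a minimal sketch of a doppelganger lookup as a nearest-neighbor search. It is not Silver's actual system: the player names, the three statistics, the outcome values, and the function names are all hypothetical, and the similarity measure (Euclidean distance on standardized statistics) is an assumption chosen for simplicity.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical past players. "features" are made-up stats through age 29:
# [batting average, home runs per season, stolen bases per season].
# "later_value" is a made-up change in batting average during their 30s.
PAST_PLAYERS = {
    "Player A": {"features": [0.300, 30, 5], "later_value": -0.010},
    "Player B": {"features": [0.295, 28, 7], "later_value": -0.020},
    "Player C": {"features": [0.250, 10, 40], "later_value": -0.060},
    "Player D": {"features": [0.310, 35, 3], "later_value": 0.000},
    "Player E": {"features": [0.240, 8, 35], "later_value": -0.050},
}


def zscore_columns(rows):
    """Rescale each column to mean 0 and std 1 so no single stat dominates."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against a constant column
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def find_doppelgangers(target, past_players, k=3):
    """Return the names of the k past players most similar to the target."""
    names = list(past_players)
    rows = [past_players[n]["features"] for n in names] + [target]
    scaled = zscore_columns(rows)
    target_scaled = scaled[-1]
    dists = [
        (sqrt(sum((a - b) ** 2 for a, b in zip(row, target_scaled))), name)
        for row, name in zip(scaled, names)
    ]
    return [name for _, name in sorted(dists)[:k]]


def predict_from_doppelgangers(target, past_players, k=3):
    """Predict the target's outcome as the average outcome of the doppelgangers."""
    twins = find_doppelgangers(target, past_players, k)
    return mean(past_players[n]["later_value"] for n in twins)


# The star hitter nearing 30 (also hypothetical).
star_hitter = [0.302, 31, 4]
twins = find_doppelgangers(star_hitter, PAST_PLAYERS, k=3)
predicted_change = predict_from_doppelgangers(star_hitter, PAST_PLAYERS, k=3)
print(twins, predicted_change)
```

Standardizing the columns first matters because raw statistics sit on very different scales: without it, home run counts would swamp batting average in the distance calculation.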

Stephens-Davidowitz suggests that the doppelganger method could improve other fields, such as medicine. He argues that if we gathered and compiled enough medical data, we could find doppelgangers for each patient, and doctors could use those doppelgangers to inform their medical decisions. For example, by comparing a patient to other similar patients, a computer could flag the early symptoms of a disease before they’re obvious to the doctor. He adds that a doppelganger system would also let patients find others similar to themselves and learn which treatments helped those doppelgangers.

(Shortform note: In Thank You for Being Late, Thomas Friedman mentions that IBM found a similar use for Watson, its computer system most famous for beating Ken Jennings and Brad Rutter at Jeopardy! IBM trained Watson to identify early-stage melanomas by looking at pictures of questionable skin lesions and comparing them to a database of cancerous and noncancerous lesions. The goal, according to an IBM researcher, is for computers to reduce the size of the haystack doctors have to sift through to find the needle of early cancer. Friedman further argues that by shifting the diagnostic burden onto Watson, doctors can focus on exercising the judgment and empathy that only humans provide. Other researchers have since expanded the technique, using AI to evaluate large expanses of patients’ skin for suspicious marks.)

The doppelganger method requires a high volume of information: You need enough people in your database to have a high likelihood of finding close matches, and you need enough different data points on those people to compare them meaningfully. Stephens-Davidowitz points out that the doppelganger technique, like many statistical and data science developments, started in baseball because baseball has far more comprehensive data (in terms of breadth, depth, and historical longevity) than most fields.
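The link between database size and match quality can be illustrated with a small simulation. This is a sketch on synthetic data, not anything from the book: the "people" are random points with five made-up features, and the names `closest_match_distance`, `population`, and `gaps` are hypothetical. It measures how far away the best available doppelganger is as the database grows.

```python
import random
from math import dist  # Euclidean distance, available in Python 3.8+


def closest_match_distance(target, database):
    """Distance from the target to the most similar person in the database."""
    return min(dist(target, person) for person in database)


rng = random.Random(0)  # fixed seed so the run is repeatable
N_FEATURES = 5          # assumed number of data points recorded per person

# Synthetic population: each person is a list of random feature values.
population = [[rng.random() for _ in range(N_FEATURES)] for _ in range(10_000)]
target = [rng.random() for _ in range(N_FEATURES)]

# Each larger database contains the smaller ones, so the best match
# can only stay the same or get closer as more people are added.
sizes = [10, 100, 1_000, 10_000]
gaps = {n: closest_match_distance(target, population[:n]) for n in sizes}

for n in sizes:
    print(f"{n:>6} people -> closest doppelganger is {gaps[n]:.3f} away")
```

Real data is not uniformly random, so the exact numbers mean little. The pattern is the point: with only a handful of records, the "closest" match may not be very similar at all.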

(Shortform note: Coincidentally, baseball also offers another example of the type of new data we saw earlier. Baseball analytics traditionally relied on players’ statistics (batting average, home runs, and so on) for insights. But recently, ballparks installed video-based tracking systems like PITCHf/x to record information like pitch velocity and spin rate, batted ball speed and trajectory, players’ running speed and ground covered, and so on. These new data types have opened up a whole new realm of performance analysis, showing that even in one of the most data-heavy industries imaginable, there are brand new types of data yet to be unearthed and studied.)
