PDF Summary: Everybody Lies, by Seth Stephens-Davidowitz


Below is a preview of the Shortform book summary of Everybody Lies by Seth Stephens-Davidowitz. Read the full comprehensive summary at Shortform.

1-Page PDF Summary of Everybody Lies

Can Google help us understand sexual diversity, detect unreported child abuse, and prevent hate crimes? Data scientist Seth Stephens-Davidowitz thinks so. In Everybody Lies, Stephens-Davidowitz argues that in daily life and traditional surveys, people tend to lie about their sexuality, prejudices, and emotional turmoil—but they willingly confess all of these secrets in their Google searches and other web activity.

Stephens-Davidowitz says this information can be used for the greater good—that it can inform better social policies, improve institutions like education and health care, promote social equity, and bring hidden injustice to light. But to do so, data researchers have to understand big data’s inherent strengths—and avoid its inherent weaknesses.

In this guide, we’ll explore the four benefits that Stephens-Davidowitz says give big data its power while balancing his optimistic viewpoint with recent analyses of big data’s potential for harm.

(continued)...

Respondents will probably be more honest in an online survey than an in-person survey—but that still doesn’t solve the problem of self-deception. Stephens-Davidowitz argues that we’re often poor judges of our own thoughts and behaviors because we don’t want to acknowledge the less savory aspects of ourselves.

Google Confessions

Google searches and other internet activity, on the other hand, reveal truths that might never come out in traditional data-gathering methods like surveys. For example, Stephens-Davidowitz shows that in states whose laws oppose gay marriage, the percentage of self-reported gay men is much lower than the estimated average across the whole population. But, he says, if you look at searches on Google and on porn sites, the percentage of male users looking for gay porn (or asking how to tell if they’re gay) is much closer to that average. Also, the percentage of gay men as defined by search results is roughly stable from state to state.

This suggests that search data is a more accurate—and honest—measure of gay male sexuality than traditional surveys. Similarly, Stephens-Davidowitz says that search results reveal truths about all kinds of topics that we have an incentive to lie about or hide in real life, such as:

1) Sexuality: Stephens-Davidowitz says that, in addition to the data on gay men, search results cut against common stereotypes about sexuality. For example, women are just as likely to ask Google why their husbands or boyfriends don’t want sex as men are to ask the same about their wives and girlfriends.

2) Prejudice: Stephens-Davidowitz argues that the prevalence of searches for racist terms and phrases reveals that there is a lot more explicit prejudice (as opposed to unconscious bias or systemic inequity) than traditional surveys suggest.

3) Child Abuse: During the 2007-2008 financial crisis, experts predicted a rise in child abuse and neglect, only to be surprised when reported cases fell. Stephens-Davidowitz shows that searches like “my mom beat me” went up in the hardest-hit areas—suggesting that abuse and neglect did increase, but that cases went unreported or uninvestigated because of reduced resources.

4) Abortion: Stephens-Davidowitz explains that searches about self-induced abortion are more common in states with restrictive abortion laws.

Why Do People Confess to Google?

One of the interesting things about Stephens-Davidowitz’s research is that many Google searches take the form of declarative statements rather than questions or strings of keywords. For example, he cites queries like “I regret having children”—statements that read more like diary entries than search terms. Stephens-Davidowitz acknowledges this oddity, but because he’s more interested in the topics of searches than in their syntax, he doesn’t spend much time exploring why so many people “talk” to Google this way.

One possible explanation has to do with how people construct their identities. In a paper examining the causes of dishonest self-reporting on surveys, Philip S. Brenner and John DeLamater propose that people might answer surveys in ways that reflect who they want to be rather than their actual actions. In other words, if honesty is important to you and you consider yourself to be an honest person, you might, if surveyed, misreport how much you lie—not because you’re embarrassed to admit to your occasional dishonesty, but because you subconsciously see the survey as a chance to reconfirm your “honest person” identity by reporting honest behavior.

Brenner and DeLamater’s hypothesis adds an interesting nuance to the idea that users see Google as a safe space. Just as certain questions—like, “Why won’t my partner have sex with me?”—might be hard to ask in public because they’re embarrassing, certain opinions might be hard to acknowledge if they cut against the ways we define ourselves. For example, if you regret having children, that regret might very well threaten your identity as a parent. In other words, just as Google offers a private way to look for information or advice you wouldn’t want to ask other people for, it also seems to offer a place to express thoughts that you might not even want to admit to yourself.

Reasons for Hope

Stephens-Davidowitz argues that although a lot of this data seems depressing, it also offers causes for hope. He gives three reasons:

First, he says that big data suggests you’re not alone. Internet searches show that far more people than you might think share the kinds of concerns, troubles, and interests that you would never admit to in public.

Second, he argues that big data can point out suffering that we wouldn’t otherwise notice, as with the child abuse data mentioned above.

Finally, he says, big data provides feedback we can use to solve problems. He explains that during an Obama speech decrying Islamophobia, Islamophobic searches actually increased—except after a line about Muslim athletes and soldiers, which seemed to prompt curious rather than hateful searches. Stephens-Davidowitz says that Obama’s speechwriters appeared to capitalize on this data by tailoring a future speech to focus on concrete examples of Muslim Americans rather than abstract calls for tolerance.

…And Reasons for Concern

While Stephens-Davidowitz focuses on the potential benefits of the uncomfortable truths people reveal online, it’s also worth pointing out how some of these truths can be exploited for more cynical purposes. For one thing, even as big data reveals the sheer diversity of our opinions, preferences, habits, and interests, it offers up that diversity as a boon to advertisers: Google Trends—the same resource behind much of Stephens-Davidowitz’s research—positions itself in part as a marketing tool, promising to help businesses craft better campaigns by revealing the topics their prospective customers search for.

Unfortunately, the targeted ad techniques that have grown out of big data can easily be used to create racially discriminatory ad campaigns—for example, by illegally excluding racial and ethnic minorities from housing ads. In extreme cases, some journalists have shown how easy it is to create ads specifically intended for “Jew haters” on Facebook while others have demonstrated how Google’s ad platform actually helps advertisers target racist search terms by offering related searches such as “black people ruin neighborhoods.” It’s important to note that Facebook and Google disable these sorts of abuses when they come to light, but recent studies suggest that the larger problem still exists.

Benefit #3: High Resolution

Stephens-Davidowitz argues that one of the powers of big data is that it allows you to zoom in on specific subsets of data, which in turn allows new insights and new types of studies. (Shortform note: This benefit derives from another of the three Vs—volume.)

High Definition Information

Big data allows us to zoom in because, by providing so many data points, it gives our information better resolution in the same way that a high-definition display improves resolution by including more pixels.

For instance, Stephens-Davidowitz describes using Wikipedia’s database to determine which geographical factors give people the best chance of succeeding in life. He identified the birth county of every American notable enough to warrant a Wikipedia entry, then cross-referenced that information with census and other data. He found that the most important factors for success are proximity to a big city, proximity to a major university, and proximity to an immigrant population. His point is that a study like this is only possible because he had enough information to zoom in on individual counties and compare them across numerous factors.

(Shortform note: A study like this points to another power of big data that Stephens-Davidowitz doesn’t explicitly discuss: the ease of cross-referencing different types of information. Much of Stephens-Davidowitz’s work involves combining and comparing data from different sources to draw out new insights—even from “old” information like census data.)
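To make the cross-referencing idea concrete, here’s a minimal sketch in Python of how a county-level study like this might be assembled. The file contents, names, and columns are invented for illustration; the point is simply that once independent datasets share a key (here, a county identifier), joining them and deriving per-capita measures takes only a few lines in standard tools like pandas.

```python
import pandas as pd

# Hypothetical inputs -- the real study derived these from Wikipedia
# and US Census records; the names and values here are made up.
notables = pd.DataFrame({
    "county_fips": ["36061", "17031", "36061", "06075"],
    "name": ["A. Author", "B. Scientist", "C. Artist", "D. Founder"],
})
census = pd.DataFrame({
    "county_fips": ["36061", "17031", "06075"],
    "population": [1_600_000, 5_200_000, 870_000],
    "near_big_city": [True, True, True],
    "immigrant_share": [0.29, 0.21, 0.34],
})

# Count notable people born in each county...
counts = (notables.groupby("county_fips").size()
          .rename("notable_count").reset_index())

# ...then join those counts onto the census attributes by county ID.
merged = census.merge(counts, on="county_fips", how="left")
merged["notable_count"] = merged["notable_count"].fillna(0)

# A derived per-capita rate -- the kind of measure you can then
# correlate with each county-level factor.
merged["notables_per_100k"] = (
    merged["notable_count"] / merged["population"] * 100_000
)
print(merged[["county_fips", "notables_per_100k", "immigrant_share"]])
```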

The Power of Doppelgangers

Another big data technique Stephens-Davidowitz identifies is what he calls the doppelganger method—making predictions about one person by studying other people who are statistically similar to them.

He explains that this method was first developed by statistician and political forecaster Nate Silver, who used it to predict baseball players’ future performances. Silver realized that instead of trying to map a player’s performance onto a generic career trajectory curve, it would be better to find the past players who were statistically most similar to the player in question. These similar players are what Stephens-Davidowitz calls doppelgangers, and finding them lets you use them as a reference for your predictions. For example, if you’re trying to decide whether to keep or trade your star hitter as he nears 30 years old, you can look at his doppelgangers to see whether they kept performing or declined in their 30s.

Stephens-Davidowitz suggests that the doppelganger method could improve other fields, such as medicine. He argues that if we gathered and compiled enough medical data, we could find doppelgangers for each patient, and doctors could use them to inform their medical decisions. For example, by comparing a patient to similar patients, a computer could flag the early symptoms of a disease before they’re obvious to the doctor. A doppelganger system would also let patients find people like themselves and learn which treatments helped them.
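The statistical core of the doppelganger method is a nearest-neighbor search: represent each person as a vector of measurements, then find the records closest to a target. Here’s a minimal sketch under that assumption, using random stand-in data; a real system would use far more variables and a more carefully chosen distance metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical database: one row per patient (or ballplayer),
# one column per measurement (age, lab values, stats, and so on).
database = rng.normal(size=(10_000, 8))
target = rng.normal(size=8)  # the person we want doppelgangers for

# Standardize columns so no single measurement dominates the distance.
mean, std = database.mean(axis=0), database.std(axis=0)
db_z = (database - mean) / std
target_z = (target - mean) / std

# Euclidean distance from the target to every record in the database.
dists = np.linalg.norm(db_z - target_z, axis=1)

# The k closest records are the target's "doppelgangers"; their later
# outcomes serve as the reference for predictions about the target.
k = 5
doppelgangers = np.argsort(dists)[:k]
print(doppelgangers, dists[doppelgangers])
```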

(Shortform note: In Thank You for Being Late, Thomas Friedman mentions that IBM found a similar use for Watson—their computer system most famous for beating Ken Jennings and Brad Rutter at Jeopardy!. They trained Watson to identify early-stage melanomas by looking at pictures of questionable skin lesions and comparing them to a database of cancerous and noncancerous lesions. The goal, according to an IBM researcher, is for computers to reduce the size of the haystack doctors have to sift through to find the needle of early cancer—Friedman further argues that by shifting the diagnostic burden onto Watson, doctors can focus on exercising the judgment and empathy that only humans provide. The technique has since been expanded by other researchers to use AI to evaluate large expanses of patients’ skin for suspicious marks.)

Similar to zooming in, finding doppelgangers requires a high volume of information—you need enough people in your database to have a high likelihood of finding matches, and you need enough different data points on those people to be able to compare them meaningfully. Stephens-Davidowitz points out that the doppelganger technique—like many statistical and data science developments—started in baseball because baseball has far more comprehensive data (in terms of breadth, depth, and historical longevity) than most fields.

(Shortform note: Coincidentally, baseball also offers another example of the type of new data we saw earlier. Baseball analytics traditionally relied on players’ statistics (batting average, home runs, and so on) for insights. But recently, ballparks installed video-based tracking systems like PITCHf/x to record information like pitch velocity and spin rate, batted ball speed and trajectory, players’ running speed and ground covered, and so on. These new data types have opened up a whole new realm of performance analysis, showing that even in one of the most data-heavy industries imaginable, there are brand new types of data yet to be unearthed and studied.)

Benefit #4: Easy Cause-Effect Studies

The final benefit Stephens-Davidowitz says big data has is that it makes it easy to perform causal research. Scientific studies typically try to find cause-effect relationships by performing experiments that determine what impact a given variable has in a specific situation. In the social sciences, this research traditionally involves recruiting volunteers, dividing them into two or more groups, exposing some of the groups to the variable, and comparing those experimental groups to the control group.

Stephens-Davidowitz points out that this traditional experimental process requires a lot of funding, time, and other resources—and these factors limit the number of experiments researchers can do as well as the scope of those experiments. He says that big data research eliminates these problems, thereby vastly expanding the research we can do.

A/B Testing

One way big data makes causal research easier is by enabling simple A/B testing. Stephens-Davidowitz explains that an A/B test entails randomly splitting users into groups and showing each group a different version of a product or feature. For example, a business developing a new webpage might code its site so that half of its visitors see a red background and half see a blue background. It could then track each group to see how long people stayed on the page, how many links they clicked, and whether they bought anything. By comparing the red group to the blue group, the company could determine which background color is more effective.
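To make the comparison step concrete, here’s a minimal sketch of how the results of such a test might be analyzed with a standard two-proportion z-test. The visitor and purchase counts are invented for the example.

```python
import math

# Hypothetical results: visitors randomly assigned to each background.
red_visitors, red_purchases = 10_000, 230
blue_visitors, blue_purchases = 10_000, 300

p_red = red_purchases / red_visitors
p_blue = blue_purchases / blue_visitors

# Two-proportion z-test: is the difference bigger than chance allows?
p_pool = (red_purchases + blue_purchases) / (red_visitors + blue_visitors)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / red_visitors + 1 / blue_visitors))
z = (p_blue - p_red) / se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"red {p_red:.2%}, blue {p_blue:.2%}, z = {z:.2f}, p = {p_value:.4f}")
```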

(Shortform note: Stephens-Davidowitz frames A/B testing as a way for data science to expand social science research, but the scientific benefits are not so clear cut. For one thing, most of the real-world applications of A/B testing seem to be limited to the corporate and political worlds—which might explain why Stephens-Davidowitz’s examples are limited to experiments with marketing and interface design. Moreover, when sites like Facebook have used A/B techniques to explore sociological questions, critics have questioned the ethics of manipulating users’ emotions or influencing their voting behavior in the name of research.)

Natural Experiments

Big data also makes causal research easier by allowing researchers to study pre-existing data (rather than running experiments to generate new data). Stephens-Davidowitz calls these kinds of studies natural experiments—a technique common in fields like economics and epidemiology where controlled experiments would be impossible or unethical. A natural experiment entails studying the results of natural processes such as disease outbreaks or market changes as though they were the results of randomized controlled experiments.

For example, Stephens-Davidowitz cites studies that examined whether attending an elite high school or college results in better outcomes than attending an ostensibly lesser school. It wouldn’t be ethical to ask schools to alter their admissions for the sake of an experiment, so the researchers turned to pre-existing data instead. To account for the fact that elite schools attract elite applicants, the researchers compared two groups: students who just made the admissions cutoff, and students who just missed it (or who were admitted but enrolled elsewhere). Using future salaries as the measure of success, the studies found similar results for both groups, suggesting that the schools themselves have little impact on their students’ future success.
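The logic of the cutoff comparison can be sketched in a few lines: treat students just above and just below the admissions threshold as if they were randomly assigned, then compare their average outcomes. The data below is simulated; in the real studies, the scores, cutoffs, and salaries came from existing records.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated records: admissions score and later salary per student.
# Salary depends on the score (abler students earn more) but, in this
# simulation, NOT on which side of the cutoff the student landed.
n, cutoff = 50_000, 70.0
scores = rng.normal(65, 10, size=n)
salaries = 30_000 + 600 * scores + rng.normal(0, 8_000, size=n)

# Natural-experiment comparison: students within a narrow band around
# the cutoff are nearly identical in ability, so any gap in average
# salaries would be attributable to attending the elite school.
band = 2.0
just_in = salaries[(scores >= cutoff) & (scores < cutoff + band)]
just_out = salaries[(scores < cutoff) & (scores >= cutoff - band)]

print(f"just admitted: n={len(just_in):5d}, mean salary ${just_in.mean():,.0f}")
print(f"just missed:   n={len(just_out):5d}, mean salary ${just_out.mean():,.0f}")
```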

As with zooming in and doppelganger studies, this kind of research is only possible thanks to big data’s high volume. Researchers needed to know a lot about the students in question: their application qualifications like test scores, their backgrounds, which colleges they applied to, which colleges they attended, and their career earnings.

(Shortform note: Researchers conducting natural experiments need to be especially careful to avoid some of the pitfalls mentioned earlier—namely, cherry-picking data to suit a hypothesis or, conversely, shaping a hypothesis to fit the data. Antifragile author Nassim Nicholas Taleb also warns about the danger of data scientists mistaking coincidental correlations for cause-effect relationships. Taleb argues that an easy way to confirm a causal relationship is by observing whether the proposed cause always comes before the proposed effect—something that’s harder to do when studying data after the fact rather than observing events as they happen.)

Data’s Drawbacks and Dangers

Even though Stephens-Davidowitz is openly enthusiastic about data studies, he’s aware that data has drawbacks and limitations and can lead to great harm if used unethically. In this section, we’ll look at some of the drawbacks and dangers Stephens-Davidowitz identifies and explore some cases where these dangers have come to pass since the book’s publication.

Drawbacks: When Data Gets in the Way

Stephens-Davidowitz warns that good data science isn’t just a matter of amassing a giant data set. When working with data, he says it’s important to keep data’s shortcomings in mind and not lose sight of the bigger picture.

Drawback #1: False Correlations

Stephens-Davidowitz says that when a dataset is too detailed, it can lead to predictive errors. The problem, he says, is the curse of dimensionality—a phenomenon whereby the more details a dataset contains, the more likely it is to suggest false positives when you look for predictive correlations.

Stephens-Davidowitz gives the example of flipping coins to try to predict the stock market. Say you flip a coin every day, record whether it was heads or tails, and then record whether the stock market went up or down that day. Stephens-Davidowitz says that if you perform this test using 1,000 coins for two years, it’s likely that by pure chance, at least one coin’s results will appear to correlate with market performance. Obviously this correlation is false. But Stephens-Davidowitz says this problem happens any time you test a lot of variables against a small number of outcomes—such as when trying to predict the stock market or link gene variations to disease likelihood.
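This is easy to verify by simulation. The sketch below generates 1,000 random coins and a random market series, then reports the best “predictor” among the coins; with this many candidates, the top coin typically matches the market well over half the time despite being pure noise.

```python
import numpy as np

rng = np.random.default_rng(42)

n_coins, n_days = 1_000, 500  # roughly two years of trading days
coins = rng.integers(0, 2, size=(n_coins, n_days))  # heads/tails per day
market = rng.integers(0, 2, size=n_days)            # market up/down per day

# Fraction of days each coin "agreed" with the market's direction.
match_rates = (coins == market).mean(axis=1)

best = match_rates.argmax()
print(f"best coin matched the market {match_rates[best]:.1%} of days")
# Any single coin is expected to match about 50% of the time, but the
# best of 1,000 coins typically lands around 56-58% -- an apparent
# edge that is entirely noise.
```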

(Shortform note: In addition to the risk of drawing conclusions based on random noise as Stephens-Davidowitz describes, the curse of dimensionality can make it hard to draw any meaningful conclusions at all. That happens when you classify data into so many parameters that all data points appear equidistant from each other and there are fewer “clusters” of data to draw your attention—in other words, you can no longer see useful similarities between items.)
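The “everything looks equidistant” effect is also easy to demonstrate: as the number of dimensions grows, the gap between the nearest and farthest points shrinks relative to the distances themselves. A quick sketch with random points:

```python
import numpy as np

rng = np.random.default_rng(7)

for dim in (2, 10, 100, 1_000):
    points = rng.random((500, dim))  # 500 random points in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much nearer is the nearest point than the
    # farthest? In high dimensions this ratio approaches 1, so
    # "nearest" stops being meaningfully near.
    print(f"dim={dim:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```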

Drawback #2: Data for Data’s Sake

Stephens-Davidowitz points out that it’s easy to fall in love with data for its own sake. When that happens, we’re likely to lose sight of what the data was supposed to be doing for us in the first place. He gives the example of standardized testing in education, which aims to make teaching and learning measurable by generating data on student outcomes. But in many cases, schools end up focusing on improving their test scores (which are tied to schools’ reputation and funding) by any means necessary—means that include limiting the curriculum in order to focus on test prep and, in extreme cases, cheating on the tests.

Stephens-Davidowitz says that studies suggest the best way to use data to measure teacher quality is to combine test scores with other factors like student evaluations and classroom observation. He says that many fields are finding that this combination of big data and traditional, small-scale information works better than focusing on big data alone.

Big Data vs. Small Data

Similarly, in Small Data, author and branding consultant Martin Lindstrom argues that big data on its own is misleading and that it should be coupled with what he calls “small data”—in-person observation of people’s desires and motivations.

Lindstrom gives the example of LEGO, which tried to address struggling sales by turning to big data research. That research convinced the company that millennials would be easily bored by toy building blocks, so the company simplified its sets in an attempt to offer instant gratification. This approach failed. But when market research interviews with actual children revealed that kids like mastering hobbies, LEGO found more success than ever before by making more complicated sets—an approach that directly contradicted what big data had told them.

Lindstrom’s definition of “small data” is not the only one. Other researchers use the term to describe small-scale measurements of specific attributes—such as wind direction sensors on wind turbines or smart bottle labels that track a medicine’s remaining shelf life. This kind of small data can work on its own (for example, by telling the turbine to adjust its blades to maximize electricity output) or integrate with big data techniques (for example, to track when, where, and why medicines expire on shelves).

Dangers: Exploitation and Privacy Invasion

Stephens-Davidowitz warns that big data can easily lend itself to exploitative practices by businesses and by governments.

Businesses Exploit Customers

Stephens-Davidowitz notes that while A/B testing can help businesses optimize their products and services by identifying the most effective design choices, it can also help them make those offerings more addictive. From a business perspective, he points out, a site like Facebook is ultimately designed to get you to spend more time on it. The more addictive the site and its services, the better, and A/B testing helps the company find the best ways to keep users hooked.

Similarly, Stephens-Davidowitz argues that businesses can use the doppelganger method to extract the maximum profit from their customers. He gives the example of casinos, which can use data about customers like you to predict your pain point—the point at which you lose enough money that you won’t come back to the casino for a while, if at all. Once they know your pain point, they can let you lose money until you’re approaching that point, then intervene to offer you a free dinner or other perks. They come across as generous when in reality they’re manipulating you by stopping you from gambling now so that you’ll come back sooner.

(Shortform note: Some critics warn that data-based business practices might have even farther-ranging consequences than these. For example, in 21 Lessons for the 21st Century, Yuval Noah Harari argues that the increasing prevalence of computer algorithms threatens to erode our free will. He suggests that the more we learn to trust computer-generated suggestions, the less we trust our own decision-making capabilities—taken to an extreme, he argues that this trend could lead to humans entrusting their entire lives to computers and their data-driven recommendations.)

Minority Report

Finally, Stephens-Davidowitz warns of the temptation to use data to make inappropriate predictions about individuals. He points to studies of loan applications that identify which words are most correlated with future defaults and which are most associated with paying back the loan. He points out that it would be unfair for a lender to use this information to deny a loan in any one particular case, based only on the words an applicant used. That’s because data can only identify statistical likelihoods, which tell us, for example, that many people who “swear to God I’ll pay back this loan” default; this insight says nothing about any specific loan applicant’s likelihood of default (whether they “swear to God” or not).

(Shortform note: And yet, as Harari points out, many banks and other institutions already use algorithms for just such purposes—a practice that potentially leads to new forms of discrimination. Similarly, some fitness trackers offer discounts on insurance premiums for meeting certain fitness goals—what they don’t tell you is that there’s nothing stopping insurance companies from raising your rates based on the data they collect.)

Likewise, Stephens-Davidowitz says, it may be possible to correlate, for instance, a rise in searches for racist terminology in a specific neighborhood with the likelihood of racially motivated violence in that neighborhood. He argues that it may even be prudent to use this information to allocate police resources. But he rejects the idea of acting against individuals—just because somebody’s Google search suggests an interest in committing a hate crime doesn’t mean that person will actually commit any crime.

Stephens-Davidowitz acknowledges that these lines can be fuzzy: If police know that someone has been Googling where to buy guns and ammunition, how to modify those guns to make them fully automatic, and for information on a nearby school, should they intervene directly against that person? Should they inform the school? The answers aren’t clear.

How Governments Already Exploit Big Data

Just as banks and insurance companies already use data to inform their decisions, governments and law enforcement already use data to surveil citizens and, in some cases, act against them. In Permanent Record, Edward Snowden details how he discovered a secret US government program for collecting data on individuals without warrants. Though this program’s public exposure was met with outrage and legal challenges, government surveillance and data collection has only expanded since Snowden blew the whistle in 2013. For example:

1) Data-mining company Palantir has come under criticism from human rights advocates and its own employees for maintaining a lucrative contract with US Immigration and Customs Enforcement—an agency that has used Palantir’s technology to detain and deport immigrants.

2) During the COVID-19 pandemic, the US and UK governments partnered with Palantir to track patients and map outbreaks.

3) With the overturning of Roe v. Wade, privacy advocates worried about the potential for the government to scrutinize pregnant women’s online activity and even track their locations in order to intervene against those who might seek abortions.

These real-life applications of big data bring to mind Harari’s concerns about who owns and has access to personal data. In 21 Lessons for the 21st Century, Harari argues that data in government hands poses the threat of authoritarian measures bordering on national mind control.

Bonus: Everybody Lies’s Insights and Curiosities

Everybody Lies is full of strange, interesting, and disturbing insights into human nature. Throughout this guide, we’ve focused on the logical core of Stephens-Davidowitz’s arguments about big data—which means ignoring most of the juicy tidbits in order to focus on his fundamental claims. Given how important (and fun) those tidbits are to the book as a whole, we’d be remiss not to include a few of the more interesting details that didn’t make it into the main body of the guide. For instance:

1) Conventional wisdom holds that most professional basketball players come from poor inner-city backgrounds—but in fact, for black players in particular, being born in a wealthy county doubles the chances of reaching the NBA. Stephens-Davidowitz explains that a wealthier background means better nutrition (which translates to height) and better interpersonal skills (which help players navigate the pressures and politics of professional sports).

2) According to textual analysis, American newspapers on average are more liberal than conservative—not because the papers themselves have a political agenda, but because they shape their content to appeal to the political biases of their audiences.

3) Women are twice as likely as men to search for porn that features violence against women.

4) Violent movies reduce the rate of violent crimes.

5) Super Bowl ads are so effective that companies are actually underpaying for them. The ad’s impact is strongest in the markets whose teams play in the game.

A Final Shortform Note: The Humanity of Data Analytics

It’s also worth ending on facts like these because, while they might seem trivial, they remind us of the human interest at the core of Everybody Lies. As Stephens-Davidowitz points out, data isn’t valuable in and of itself—its value lies in its potential to advance human ends (for good or ill, as we’ve discussed throughout this guide). Ultimately, it’s not about any specific technique or technology or even about big data per se; it’s about how new information—and new ways of working with information—can help us better understand ourselves and improve our lives. We can infer that Stephens-Davidowitz includes these fascinating insights, in part, to connect data with our humanity; that connection also previews where data science has gone since the book’s publication.

If we review the trends in data science since Everybody Lies’s 2017 publication, we see an increasing emphasis on combining data with human insight. In recent years, data scientists, technology professionals, and companies have emphasized more flexible and versatile data applications and better integration with human users. For example:

1) The fields of AI and machine learning have moved away from ever-larger data sets in favor of techniques like transfer learning, in which a model trained on one task is adapted to new but related tasks. This approach lets researchers apply AI to problems for which large data sets aren’t available or practical.

2) Similarly, developers have found that AI works best when it’s trained not just by large datasets, but also by human expertise. This approach not only improves the AI, but it also leads to a helpful division of labor as computers tackle routine tasks and humans only step in to solve more difficult problems.

3) Many companies have started pursuing data democratization—a practice of empowering everyone in the organization (not just specialized analysts) to understand, access, and work with data.

As technology continues to evolve, the specifics will keep changing; as we can see, data research already looks different from what we find in Everybody Lies. But as society becomes increasingly driven by data, it only becomes more important to consider the underlying human potential (for good and for harm) that Stephens-Davidowitz describes.
