How Is Social Science Data Collected? Big Data!

This article is an excerpt from the Shortform book guide to "Everybody Lies" by Seth Stephens-Davidowitz.

How can we use big data to study social science? How does data give us more insight into the social sciences?

Through search data, researchers can discover psychological and sociological information that traditional surveys couldn’t provide. Seth Stephens-Davidowitz, the author of Everybody Lies, uses Freud’s theories of sexuality as an example.

Read on to learn how big data can be used to collect social science data.

In addition to improving our natural intuition, big data studies can help make social science data more rigorous. Stephens-Davidowitz notes that traditionally, there’s a divide between hard sciences (such as physics and chemistry) and soft sciences (such as psychology and sociology). That divide boils down to differences in method and types of evidence, with critics accusing the social sciences of advancing theories that can’t be falsified.

Stephens-Davidowitz gives the example of Freud’s theories of sexuality, which Freud based on his own observations and interpretations rather than on experimental evidence. Stephens-Davidowitz shows how Google and Pornhub search data let us test these previously untestable ideas. He finds no evidence for Freud’s claim that phallic symbols in dreams reveal latent desires. On the other hand, he finds a surprising number of searches for parent-child incest videos, which suggests some truth to Freud’s Oedipal theory.

Big Data and the Replication Crisis

It’s worth considering the role of data-driven research in light of the ongoing replication crisis in science (psychology included), in which an alarming number of experimental findings can’t be reproduced in follow-up experiments.

On one hand, some of the explanations for the replication problem have to do with experimental sample sizes. In some cases, unreliable studies might use undersized samples, which increases the chances of finding false positives. Similarly, some scientists have been accused of consciously or unconsciously manipulating their data, for example, by ending an experiment as soon as they find a statistically significant result—a practice that likewise increases the chances of faulty findings.
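To see why stopping at the first significant result is a problem, here is a minimal simulation sketch. It is not taken from the book, and every name and setting in it is an assumption chosen for illustration: a simple z-test with known variance, samples of up to 100 observations, and a researcher who starts checking at 10. The data contain no real effect, so every "significant" result is a false positive.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% cutoff for a z-test with known unit variance


def is_significant(total, n):
    """z-test of 'the true mean is 0' given the sum of n unit-variance draws."""
    return abs(total / math.sqrt(n)) > Z_CRIT


def false_positive_rate(peek, trials=2000, min_n=10, max_n=100, seed=0):
    """Share of no-effect experiments that end up reporting an effect.

    peek=False: test once, after all max_n observations are collected.
    peek=True:  test after every observation from min_n onward and count
                the experiment as a "finding" if any of those tests passes.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = 0.0
        crossed_early = False
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # pure noise: the null hypothesis is true
            if peek and n >= min_n and is_significant(total, n):
                crossed_early = True  # a researcher who peeks would stop here
        hits += crossed_early or is_significant(total, max_n)
    return hits / trials


if __name__ == "__main__":
    print("test once at the end:  ", false_positive_rate(peek=False))
    print("stop at first 'success':", false_positive_rate(peek=True))
```

Under these assumptions, the single planned test should flag roughly 5% of the no-effect experiments, while the peeking strategy flags noticeably more, even though the underlying data are identical in both runs.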

On the surface, data science seems to offer an antidote to these problems. Big data, by definition, operates with very large sample sizes, which decreases the chance of stumbling across small-scale fluke results or cherry-picking data.

On the other hand, data-based studies can introduce their own problems. For example, data researchers often start with the data and then formulate hypotheses to explain the patterns they find—a reversal of the traditional scientific method. This approach introduces the risk of mistaking large-scale random variation for meaningful patterns (a problem Stephens-Davidowitz calls the “curse of dimensionality” and which we discuss later in this guide).
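The risk of mistaking random variation for a pattern can be illustrated with a short sketch. This is an illustrative assumption rather than an analysis from the book: it generates one random "outcome" and 1,000 unrelated random "predictors" (hypothetical stand-ins for the many variables a data researcher might scan), then counts how many predictors correlate with the outcome strongly enough to look significant. The cutoff of 0.361 is the approximate 5% critical value for a correlation across 30 observations.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chance_correlations(n_obs=30, n_vars=1000, threshold=0.361, seed=0):
    """Count noise predictors whose |r| with a noise outcome exceeds threshold.

    Returns (number of predictors flagged, largest |r| observed).
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    correlations = []
    for _ in range(n_vars):
        # Each predictor is independent noise, so any correlation is a fluke.
        predictor = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        correlations.append(pearson(predictor, outcome))
    flagged = sum(1 for r in correlations if abs(r) > threshold)
    strongest = max(abs(r) for r in correlations)
    return flagged, strongest


if __name__ == "__main__":
    flagged, strongest = chance_correlations()
    print(f"{flagged} of 1000 noise variables look 'significant'")
    print(f"strongest chance correlation: {strongest:.2f}")
```

Because each predictor has about a 5% chance of clearing the cutoff by luck, scanning 1,000 of them should turn up dozens of apparent "patterns." A researcher who forms a hypothesis only after seeing which variables stand out has no easy way to tell these flukes from real effects.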

Finally, as suggested above, data research is susceptible to some of the same problems that plague traditional methods and lead to cherry-picking and confirmation bias—problems such as arbitrary definitions, flexible hypotheses, and liberal thresholds for experimental success. In traditional science and data science alike, there’s always the chance of researchers designing studies and interpreting results in ways that support their pet ideas (or their field’s established theory).

None of this is to dismiss data science or, for that matter, traditional science. But it’s not a given that data studies will bring more (real or perceived) rigor to the social sciences, especially when scientific rigor is itself under fire.
