As a life-long comic book fan, I can’t help but think of Brainiac from the Superman comics whenever I hear discussions of big data. Much like Brainiac, we have become so obsessed with our quest to collect data that we have neglected to think through why we are collecting it and what we will use it for. While rushing to collect data on everyone and everything, we have lost sight of the old cliché “quality over quantity.” Big data and predictive analytics are invaluable additions to the toolkit we use to understand complex problems, but we need to remember that they should be deployed in conjunction with a much wider kit of decision-making tools.
We are collecting more data than we can possibly process in the near future, and the motivation for this is unclear. According to IBM, 2.5 quintillion bytes of data are generated every day, and 90% of the data in the world has been generated and recorded in the last two years. While this statistic is impressive and speaks to the monumental nature of recent technological advances, a stat that’s always conspicuously missing is what percentage of the data collected is actually useful at this point in time. As the cost of collecting and storing this data drops dramatically, it seems that we are simply hoping it will become valuable at some point in the future. Big data is made up of structured data (tables) and unstructured data (text), and this is the first time in history we can “easily” analyze the latter. In analyzing big data, the field is looking for a needle in a haystack that matches a different needle in another haystack. The more effectively and simply analysts can collect, organize, and sort the hay (data), the easier it is to find the needle, or critical piece of information. But in the rush to gather and analyze this unstructured data, not much thought has gone into the quality of the tools we use to gather data, the quality of the insights we glean from them, or the integrity of the information itself.
Big data benefits from an illusion of impartiality due to the mechanized nature of collection and a misunderstanding of statistics. It’s important to remember that the insights we derive from data reflect information, qualities, and goals that people have deemed valuable. Essentially, the machines learn based on the inputs we give them. While this is not inherently problematic, humans have biases based on their life experiences that dictate which types of data are prioritized over others. For example, many HR data analytics companies have used big data to claim that a key metric in job success is a person’s postal code. On the surface, this can seem like a harmless statement: postal codes might correlate with job performance because the closer you live to work, the more likely you are to interact with coworkers and have more in common to talk about. Additionally, one could assume that reduced commuting time reduces stress. The danger is that if one accepts these correlations at face value, other hidden assumptions and biases can be ignored. For example, postal codes are correlated with affluence, or perhaps the performance review at this company isn’t as robust as it should be. If we aren’t careful, hubris about the factual nature of big data can act as a catalyst for confirmation bias. Studies have already hinted that poor people may be seeing a different internet than rich people (e.g., a poor person who searches the word “apple” may see a picture of the fruit while a rich person may see an iPhone). We need to be careful of spurious correlations and the unintended consequences of these advancements.
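The postal-code trap above can be demonstrated with a toy simulation. This is a hypothetical sketch, not real HR data: it invents a hidden confounder (“affluence”) that drives both a postal-code desirability score and job performance, so the two appear correlated even though neither causes the other. Controlling for the confounder makes the correlation vanish.

```python
import random
import statistics

random.seed(42)

def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

n = 5000
# Hidden confounder: affluence (hypothetical, standard normal).
affluence = [random.gauss(0, 1) for _ in range(n)]
# Postal-code "desirability" score: driven by affluence plus noise.
postal = [a + random.gauss(0, 1) for a in affluence]
# Job performance: also driven by affluence plus noise -- NOT by postal code.
performance = [a + random.gauss(0, 1) for a in affluence]

# Naive analysis: postal code looks "predictive" of performance.
print("raw correlation:", round(pearson(postal, performance), 2))

# Control for the confounder: correlate the residuals after removing affluence.
postal_resid = [p - a for p, a in zip(postal, affluence)]
perf_resid = [q - a for q, a in zip(performance, affluence)]
print("after controlling for affluence:", round(pearson(postal_resid, perf_resid), 2))
```

A naive model trained on the raw data would happily use postal code as a feature; the residual check shows the signal belongs entirely to the confounder, which an algorithm never sees unless an analyst thinks to look for it.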
Seemingly innocuous personal information can render people little more than a collection of data points. Data that’s collected now may seem useless, but in the future something as mundane as your tweets, Facebook posts, and web browsing history could be used to paint a startlingly accurate picture of you as a person. For example, a recent study showed that a computer algorithm could predict someone’s key personality traits better than friends, family, and coworkers.
If we take this study a step further, as education becomes more data-driven and people give out information about themselves via the Internet at a younger age, young children and adolescents might be defined by their online footprint before they have even fully formed preferences. On the one hand, it may be possible to use the social data that kids give up as an early indicator for conditions like bipolar disorder, depression, and schizophrenia. On the other hand, it’s within reason to see a future where algorithms applied to a child’s browsing habits are used to predict future success, and ultimately used for things like school acceptances and career prospects. Will we tell a child, based on collected data and behavioral analysis, that their chances of becoming a doctor or a musician are slim to none, and bar them from signing up for certain classes? Will we get to a point where a bank won’t give students a loan based on correlations between their high school internet browsing habits and repayment rates? Finally, what important innovations will we miss out on if predictive analytics replaces the process of trial and error?
Over-reliance on predictive analytics could lead to a world where guaranteed success is overvalued at the expense of the lessons that come from trial and error. As we shift to using big data and predictive analytics to guide our most important decisions, it is important to consider the value trial and error has had throughout history. Critical discoveries from penicillin to the microwave happened as a result of accidents during the scientific discovery process. Predictive analytics gives us an opportunity to know more about likely outcomes, and thus we run the risk of prioritizing the safe over the unknown, forgetting about the discoveries we may miss along the way. This is especially true at the nexus of entertainment and data. For example, Netflix bid a tremendous amount of money for the show House of Cards because data showed viewers would like it; but in that world, would a critically acclaimed show like Lost ever get green-lit?
Big data has furthered thinking in a variety of fields, from increasing the efficiency of basketball teams to advancing biomedical research. However, as Nassim Taleb said in a recent article for Wired, “Big data may mean more information, but it also means more false information.” As we rely more on big data to solve critical problems, we need to be deliberate about what information we store and hold ourselves accountable for understanding its potential impact on individual lives.