The data-driven revolution is premised on the idea that data and algorithms can lead companies away from biased human judgement toward pristine mathematical precision that captures the world as it is, rather than the world as biased humans would like it to be. Unfortunately, as data-driven decision-making has spread throughout the corporate and governmental worlds, organizations have begun to recognize that the world captured in their data is often an extraordinarily biased one that does not always align with their corporate values or that presents challenges for algorithmic training. In turn, this has increasingly led companies to manually alter their algorithms and overrule their findings, replacing the “truth” told by their data with the preordained outcome they desire.
Much as a photograph constructs one possible reality from the infinite realities presented by a scene, so too does data, no matter how large, present one small slice of the reality it attempts to capture.
In turn, much as the imperfections of a camera’s hardware define the way in which it reproduces the experience of a place, so too does the design of an algorithm impact the “truth” it derives from data.
There is no such thing as perfect data or perfect algorithms. All datasets and the tools used to examine them represent tradeoffs.
The precise impact of these tradeoffs is often unknown. Scraping free imagery from the Web to train a deep learning algorithm represents a tradeoff of cost and collection speed balanced against representative diversity. Such data is well understood to be biased, but the form those biases take will likely differ from dataset to dataset, making them difficult to document and therefore to mitigate.
Each dataset represents a constructed reality of the phenomena it is intended to measure. In turn, the algorithms used to analyze it construct yet more realities.
Few datasets natively capture a balanced representation of all inputs an algorithm will need to handle. For example, a training set of dog photographs might be primarily captured outdoors in grassy fields, with few images taken at night.
Data scientists therefore intervene, artificially adjusting the composition of their data in order to capture a more balanced and “fairer” view of all possible inputs.
Even with perfectly balanced training datasets, artifacts of today’s correlative deep learning systems may mean that certain categories are simply harder for the algorithm to encode.
Practitioners typically address this by manually increasing the number of training examples for those categories, altering their datasets from a balanced composition to a highly imbalanced one that reflects the needs of their algorithm rather than the reality of the real world.
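This kind of deliberate imbalance can be sketched as a simple oversampling routine. The dataset, category names, and boosting factor below are all hypothetical, chosen only to illustrate how a “hard” category can be pushed far past its natural share of the data:

```python
import random

# Hypothetical labeled dataset: (image_id, category) pairs.
# 90% easy daytime shots, 10% hard nighttime shots.
examples = [(f"img_{i}", "outdoor_day") for i in range(90)] + \
           [(f"img_{i}", "night") for i in range(90, 100)]

def oversample(dataset, category, factor, seed=0):
    """Duplicate examples of a hard-to-learn category so it is
    over-represented relative to its natural frequency."""
    rng = random.Random(seed)
    extra = [ex for ex in dataset if ex[1] == category] * (factor - 1)
    boosted = dataset + extra
    rng.shuffle(boosted)
    return boosted

boosted = oversample(examples, "night", factor=5)
# The "night" category now accounts for 50 of 140 examples (~36%),
# far above its 10% share of the raw data.
```

The resulting training set no longer mirrors how often night scenes occur in the world; it mirrors how much help the algorithm needs to learn them.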
To the press, policymakers and general public, the data-driven revolution is presented as the search for “truth” through data and algorithms.
The reality is that this “truth” is constructed.
Data scientists collect the data that is most readily at hand and feed it through algorithms they can most readily understand and use.
The choice of inputs and the composition of training and testing data all influence what model results.
Most importantly, the choice of what output to build the model upon, whether to focus a management algorithm on maximizing employee happiness or on maximizing corporate profit at all costs, constructs the reality within which the algorithm exists.
Indeed, the same dataset fed into the same algorithm can yield polar opposite results depending on the data filters and algorithmic settings chosen.
In short, a data scientist can arrive at any desired conclusion simply by selecting the dataset, algorithm, filters and settings to match.
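A toy sketch of this, with entirely made-up survey numbers: the same dataset and the same averaging code yield opposite conclusions depending solely on which filter is applied first:

```python
# Hypothetical employee satisfaction surveys (all values invented).
surveys = [
    {"site": "HQ",     "tenure": 1, "satisfaction": 8},
    {"site": "HQ",     "tenure": 9, "satisfaction": 3},
    {"site": "remote", "tenure": 1, "satisfaction": 7},
    {"site": "remote", "tenure": 9, "satisfaction": 2},
]

def mean_satisfaction(rows, **filters):
    """Average satisfaction over the rows matching the given filters."""
    kept = [r for r in rows
            if all(r[k] == v for k, v in filters.items())]
    return sum(r["satisfaction"] for r in kept) / len(kept)

# Conclusion A, "employees are happy": filter to new hires only.
happy = mean_satisfaction(surveys, tenure=1)    # 7.5 out of 10
# Conclusion B, "employees are unhappy": filter to veterans only.
unhappy = mean_satisfaction(surveys, tenure=9)  # 2.5 out of 10
```

Neither number is false; each is a faithful summary of a filtered slice, and the analyst’s choice of slice determines the headline.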
In many ways, the basic premise of the data-driven revolution, that it brings quantitative certainty to decision-making, is a false narrative.
There is no single “truth” to be obtained through mathematical precision.
There is merely an infinite universe of possible outcomes, from which the data scientist selects the one they desire.
Putting this all together, as more and more decisions are placed in the hands of data-driven analyses and algorithms, it is more imperative than ever that society recognize that data does not equate to truth.
