Data-driven science is a failure of imagination

Professor Hans Rosling is certainly a remarkable figure and I recommend watching his performances; the BBC's "Joy of Stats" is exemplary. Rosling sells passion for data, visual clarity and a great deal of comedy. He represents the data-driven paradigm in science. What is it? And is it as exciting and promising as the documentary suggests?

Data-driven scientists (data miners) such as Rosling believe that data can tell a story, that observation equals information, and that the best way towards scientific progress is to collect data, visualize them and analyze them (data miners are not specific about what "analyze" means exactly). When you listen to Rosling carefully, he sometimes equates data with statistics: a scientist collects statistics. He also claims that "if we can uncover the patterns in the data then we can understand". I know this attitude: there are massive initiatives to mobilize and integrate data, there are methods for data assimilation and data mining, and there is an enormous field of scientific data visualization. Data-driven scientists sometimes call themselves informaticians or data scientists. And they are all excited about big data: the larger the number of observations (N), the better.

Rosling is right that data are important and that science uses statistics to deal with them. But he completely ignores the second component of statistics: the hypothesis (here equivalent to model or theory). There are two ways to define statistics and both require data as well as hypotheses: (1) Frequentist statistics makes probabilistic statements about the data, given the hypothesis. (2) Bayesian statistics works the other way round: it makes probabilistic statements about the hypothesis, given the data. Frequentist statistics prevailed as the major discourse because it used to be computationally simpler. However, it is also less consistent with the way we think - we are nearly always ultimately curious about the Bayesian probability of the hypothesis (i.e. "how probable it is that things work a certain way, given what we see") rather than the frequentist probability of the data (i.e. "how likely it is that we would see this if we repeated the experiment again and again and again").
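
To make the distinction concrete, here is a minimal R sketch (my own toy example with made-up numbers, not anything Rosling presents): both questions use the same data, but they are probabilistic statements about different things.

# Toy data: 7 heads in 10 coin flips. Is the coin biased towards heads?
# Frequentist question: how likely are data this extreme IF the coin is fair?
p_value <- binom.test(7, 10, p = 0.5)$p.value         # P(data | hypothesis)

# Bayesian question: how probable is "the coin favours heads", GIVEN the data?
# Grid approximation with a flat prior over theta = P(heads).
theta      <- seq(0, 1, length.out = 1001)
prior      <- rep(1, length(theta))                   # flat prior
likelihood <- dbinom(7, size = 10, prob = theta)      # P(data | theta)
posterior  <- likelihood * prior / sum(likelihood * prior)
p_biased   <- sum(posterior[theta > 0.5])             # P(theta > 0.5 | data)

c(frequentist_p = p_value, bayesian_posterior = p_biased)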

In any case, data and hypothesis are two fundamental parts of both Bayesian and frequentist statistics. Emphasizing data at the expense of hypotheses means that we ignore the actual thinking, and we end up with trivial or arbitrary statements, spurious relationships emerging by chance, maybe even with plenty of publications, but with no real understanding. This is the ultimate and unfortunate fate of all data miners. I should note that the opposite is similarly dangerous: putting emphasis on hypotheses (the extreme case of hypothesis-driven science) can lead to lunatic abstractions disconnected from what we observe. Good science keeps in mind both the empirical observations (data) and the theory (hypotheses, models).

Is it any good to have large data (high N)? In other words, does a high number of observations lead to better science? It doesn't. Data have value only when confronted with a useful theory. Theories can get strong and robust support even from relatively small data (Fig. 1a, b). Hypotheses and relationships that need very large data to be demonstrated (Fig. 1c, d) are weak hypotheses and weak relationships. Testing simple theories is more of a hassle with very large data than with small data, especially in the computationally intensive Bayesian framework. Finally, collection, storage and handling of very large data cost a lot of effort, time and money.


Figure 1 Strong effects (the slope of the linear model y=f(x)) can get strong support even from small data (a). Collecting more data does not increase the support very much (b) and is just a waste of time, effort, storage space and money. Weak effects find no support in small data (c) and are supported only by very large datasets (d). In the case of (d) there is such a large amount of unexplained variability and the effect is so weak that the hypothesis that y=f(x) does not seem very interesting - there is probably some not yet imagined cause of the variability. Note that as a Bayesian I can afford to speak about direct support for hypotheses (unlike frequentists, who can only reject them).
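
The pattern in Figure 1 is easy to reproduce in a few lines of R (a simulation sketch with arbitrary slopes and noise levels, not the data behind the figure):

# Strong effects get support from small data; weak effects need huge data.
set.seed(1)
fit_slope <- function(n, slope, noise_sd) {
  x <- rnorm(n)
  y <- slope * x + rnorm(n, sd = noise_sd)
  summary(lm(y ~ x))$coefficients["x", ]    # slope estimate, SE, t, p
}
fit_slope(n = 20,   slope = 1,    noise_sd = 1)   # (a) strong effect, small N: clear support
fit_slope(n = 2000, slope = 1,    noise_sd = 1)   # (b) strong effect, huge N: little is gained
fit_slope(n = 20,   slope = 0.05, noise_sd = 1)   # (c) weak effect, small N: no support
fit_slope(n = 2000, slope = 0.05, noise_sd = 1)   # (d) weak effect: support, if any, only at huge N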

My final argument is that data are not always an accurate representation of what they try to measure. Especially in the life sciences and social sciences (the "messy" fields), data are regularly contaminated by measurement errors, subjective biases, incomplete coverage, non-independence, detectability problems, aggregation problems, poor metadata, nomenclature problems and so on. Collecting more data may amplify such problems and can lead to spurious patterns. On the other hand, if the theory-driven approach is adopted, these biases can be made an integral part of the model, fitted to the data, and accounted for. What is then visualized are not the raw biased data but the (hopefully) unbiased model predictions of the real process of interest.
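
Here is a minimal sketch of what I mean (an invented example where the bias is a systematic observer offset; detectability, coverage or nomenclature problems could be treated in the same spirit):

# Observed data are contaminated by a systematic observer bias.
set.seed(2)
n        <- 200
x        <- runif(n)                                   # the process of interest
observer <- factor(sample(c("A", "B"), n, replace = TRUE))
bias     <- ifelse(observer == "B", 0.8, 0)            # observer B systematically over-records
y        <- 2 * x + bias + rnorm(n, sd = 0.3)          # the raw, biased observations

naive  <- lm(y ~ x)             # ignores the bias: it inflates the apparent noise
honest <- lm(y ~ x + observer)  # the bias is an explicit part of the model and gets estimated

# What we then plot are not the raw data but predictions of the real process,
# standardised to a single observer:
newdat <- data.frame(x = seq(0, 1, by = 0.1), observer = "A")
predict(honest, newdata = newdat)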

So why do many scientists find data-driven research and large data exciting? It has nothing to do with science. The desire to have datasets as large as possible and to create giant data mines is driven by our instinctive craving for plenty (richness), and by the boyish tendency to have a "bigger" toy (car, gun, house, pirate ship, database) than anyone else. And whoever guards the vaults of data holds power over all the other scientists who crave the data.

But most importantly, data-driven science is less intellectually demanding than hypothesis-driven science. Data mining is sweet, anyone can do it. Plotting multivariate data, maps, "relationships" and colorful visualizations is hip and catchy, everybody can understand it. By contrast, thinking about theory can be painful and it requires a rare commodity: imagination.

25 thoughts on “Data-driven science is a failure of imagination”

  1. I agree with everything up to this:
    "Hypotheses and relationships that need very large data to be demonstrated (Fig. 1c, d) are weak hypotheses and weak relationships."

    That is demonstrably not true. In fact the very largest single data set in the world is strongly constrained by models - the LHC data. They needed trillions and trillions of observations to get to confidence in detection because the models were actually that good. The hypothesis (Standard Model) is as strong as you can get and the relationship is binary - something is there or not, so that isn't weak. The - events - are very rare, so they needed lots of data to tease them out.

    You need whatever size dataset that will give you enough confidence in tests of your model to go forward.

    I absolutely agree that trying to form models out of gobs of data - which is the ultimate goal of a bunch of these big data programs - doesn't seem very fruitful.
    Yes, it can sometimes turn things up, but to me, putting more resources into coming up with a good framework to test multiplies the power of the collected data so much that 'big data' analysis is only worthwhile if it is extremely low cost in researcher time. Now, that may well be true in many cases.

    • I will second this comment.
      Since the LHC experiment you mentioned is after things that occur very rarely, you don't have a choice but to accumulate an enormous dataset in order to find enough of these rare occurrences. That being said, in the end you reduce it down by many orders of magnitude while isolating the interesting behavior or process you are after, and I would argue that the final dataset you end up with is a very reduced subset of the big data you started with :) which is then used to test whatever hypothesis or model is used to describe it. My 2 cents.

    • I take exception with " The - events - are very rare, so they needed lots of data to tease them out." That is a misleading characterization of LHC run post-analysis. In fact, what happens in any experiment is that, yes, the events of interest, the interesting particles emanating from hadron collisions, are rare, but the problem is there are many, many other events of less rarity that need to be sifted through. Part of the genius of these detectors is that they have special hardware which screens through detections, and drops so many on the floor. Post analysis is similar ... Depending upon the experiment, often trying to bound the mass of ejecta from a collision, only a few events are fully analyzed. Indeed, through 2012, only 300 events of interest were generated, despite there being 300 trillion proton-proton collisions.

      Also, few instruments can be used without model-based calibration. The most ingenious instruments are very simple and depend entirely upon the calibration to do what they do. Sampling calorimeters used in particle detectors are such instruments, which have alternate layers of scintillating material yielding signals proportional to the energy of the particle going through them.

      For more on LHC and other detectors for particle experiments, see the review article P. Grannis, P. Jenni, "The evolution of hadron-collider experiments", Physics Today, June 2013, 38-44.

  2. @petrkeil: Indeed, several scientists often perceive data-driven results as more ‘reliable’ than hypothesis-driven ones; there are more Professor Roslings even in the not so “messy” fields.

    But I think the data-driven approach can be used to stimulate the imagination; sometimes strange results (relations) appear. So what about those who use 'data-driven' exploration to torture themselves (feeding their imagination) in order to improve their hypotheses?

  3. That's a great post. I definitely agree with you. I believe that without a hypothesis "data science" stops being "a science" and becomes programming/engineering (which, obviously, is also challenging).

    To form good hypotheses and be able to test them you need to be an expert in the field in which you are working or you need to be working with experts. Markk's example above is a case in point. If you do not have training in physics and more specifically in elementary particle physics/standard model, your data mining proficiency will be almost useless at LHC (or CERN or anything).

    regards
    Jarek

  4. thanks for the post. Nice overview.

    However, in conclusion #2 you made the (somewhat common) error of equating quality with the number of citations. This is just not true. E.g. highly controversial opinions might be disregarded and only much later be shown to be correct. In the time in between, these papers do not get cited. Actually, you made the exact same error that you (correctly) blamed the "data scientists" for: drawing immediate and simple conclusions from too much data (citations). One just needs to look behind the curtain: who is the biggest promoter of the ideology of citation-based rankings (vulgo 'scientometrics')? Thomson Scientific. Coincidentally also the provider of the data needed for this more or less useless bean-counting.

    In addition, there are at least two factors that need to be included in your hypothesis: 1. a language bias in favor of native speakers, and 2. a bias due to editorship (many journals are run by US-based researchers), and this is then a chicken-and-egg problem at best.

  5. You seem to be disappointed by the data mining "promise". Mainstream data miners certainly produce mindless, spurious or arbitrary relations, as you mention. Try to focus on excellent researchers and modellers and learn from them.

  6. This device executes a huge number of times every second. Every executed instruction has a left side that grabs the physical device and a right side that grabs the physical data. This relationship is echoed in the software paradigm which is a program/data symmetry.

    Data is not a peripheral concept fed to this device. Data is this implemented energy that drives this device. The data scientist's primary role should be to facilitate the efficient transference of this energy in conformity with the First Law of Thermodynamics. It is from this point we can begin to efficiently code in intellectual prejudice.
    @thepoettrap

  7. Good insight always comes from a blend of induction and deduction. Someone purely using deductive hypothesis-testing would fall victim to Anscombe's quartet; a pure inductivist might run afoul of some of the implications of DeMoivre's equation. I don't really see the point of championing one approach over the other. And although I agree that much of what passes for data visualization today lacks nuance and insight, I don't think that's a necessary characteristic of inductive approaches—wade through John Tukey's "Exploratory Data Analysis" sometime for some very thoughtful approaches.
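
    Anscombe's quartet actually ships with base R, so the point is easy to check for yourself (a quick sketch using only the built-in anscombe data): the four data sets get essentially identical regression fits while looking nothing alike.

    # Four data sets, (near-)identical fits, very different shapes.
    data(anscombe)
    fits <- lapply(1:4, function(i) lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe))
    sapply(fits, coef)                       # intercepts ~3, slopes ~0.5 in all four
    par(mfrow = c(2, 2))
    for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
                        xlab = paste0("x", i), ylab = paste0("y", i))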

  8. I like how you make a lot of assertions (larger data sets do not produce better results, for example), never back them up, and then try to say that people aren't using scientific reasons to justify the size of their data set. Good thing we can all go home because you just solved science for us. All the other scientists don't use the right kind of science, which I am proposing! Also my science is harder, the other one anyone can do, and mine produces better results. Where is my evidence? Don't need any.

    My personal favorite though, is the specious argument that data might be wrong, so why even collect it? So educated, yet so fucking stupid it hurts.

  9. Chill out Lewis.

    Rather than pick apart the verbatim like an endlessly useful grammar nazi, I'll say I appreciate the sentiment of your post. I particularly like the second to last paragraph, which is often the elephant in the room. I've sat in endless meetings where 'getting more sites onboard' fast becomes the topic of conversation (dick swinging contest) as opposed to the hypothesis and the appropriateness of the data to address the hypothesis.

    Seems at times like academia is no different to the private sector.

  10. Hi
    I agree with some parts of your post, but the problem is that you mixed a huge variety of things together and called them data science.
    Data or information is not a new thing; even the Sophists had this instrument (data) for their argumentation, which was their language. But in the last century, what we have more of than before is Digital Data, which makes empirical research easier and more attractive. Of course, Rosling needs to go for hypothesis testing or theory building in the next step, and what he shows is just a possible projection of abundant data. In other words, these people are celebrating this new capability (Data Abundance), which was not easy 50 years ago.
    But although the notion of Big Data is ambiguous, I think the only interesting point about it is that with Big Data one can overcome the limits of Rationalism, which I think is the thing you mentioned as the real science, deeper than easy data mining or visualization.
    In fact, Rational models (e.g. one curve fitted to a data set) always come with the notion of Error (I would say, based on a point of view), and then for complex (not complicated!) systems we can no longer find any rationale (a specific curve), otherwise they would not be complex.
    Then we need one more level of abstraction in our scientific journey. Just imagine we could "compute" with the whole available data set (which looks like a cloud), without a need to go for hypothesis testing. Because normally, after these kinds of hypothesis tests, one will use the results (e.g. linear correlation analysis) to fit a curve or a regression model to give some advice to decision and policy makers, which can be dangerous and mainly limited.
    These kinds of Data-Driven Computation, which do not depend on passing through a Modeling phase, are emerging, and in fact they act like a container of Any Possible Specific (Rational) Model. They live in a "Pre-Specific" space; for sure they need some kind of intellect and cannot be just a result of the data deluge.
    And I think, in the end, if someone wants to do hypothesis testing it is not contradictory with Big Data. And with Data-Driven Computation we can reduce to Specific Modeling as well. It can be considered like the relation between Natural numbers, Rational numbers, Irrational/Real numbers and finally Complex numbers.

    Best
    Vahid

  11. "Hypotheses and relationships that need very large data to be demonstrated (Fig. 1c, d) are weak hypotheses and weak relationships."
    A couple of comments.
    1) This is the classical statistician's point of view, in that it is innocent of a geometric perspective. Many of the most interesting modern statistical problems are high-dimensional, and these always need large data sets for statistical strength, especially when looking for whatever geometric structure may be in the data.
    2) (this may be rephrasing @Adama) Where do hypotheses come from? There are theoretical hypotheses, e.g. from particle physics experts - but these are really the result of extrapolating patterns learned from a lot of very sophisticated geometrical thinking about patterns observed in many years of experimental atomic, nuclear and particle physics, and then subjected to exhaustive further experimental validation. I guess that hypotheses really ultimately come from "mining" the data, grabbing at what patterns seem to appear, and calling that a hypothesis.

  12. Long before big data or data mining existed as terms, John Tukey was championing exploratory data analysis (EDA), the original data driven science. The attraction of big data is in fact the challenge of imagination. The hypothesis is that there are interesting but hidden phenomena that explain the nonrandom features of the data. The challenge is to find them, and to convince yourself that there are not other less interesting explanations.

  13. This opinion post has the flavor of an us/them argument. Things are rarely so simple. I find it handy to ask two questions when determining if something is "good" or "bad" to make it less simple, but more useful as a start to inquiry:
    1) for whom?
    2) under what conditions?

    So big data may be "bad" for those who fail to appreciate hypotheses (and ignored by those who worship hypotheses like others worship data). But big data is "good" for people studying events that are sufficiently rare that large numbers of observations are necessary (although one could argue that the small number of events are the "data" for your model, and that those thousands or millions of ignored, yet recorded, events are just not "data" for the conditions you study). One could argue that animal behavior scientists "see" innumerable "events" but only choose to collect and analyze select sightings as "data." Big data proponents would then be like those who want to consider every pixel of a video recording of a thrush singing as potential data, whereas the behavior scientist might ignore all but the tilt of the bird's head, the color of the plumage, and the notes in the song as "data."

    Perhaps it's not the size of the data, but whether the person has reason to define something as data? Maybe the problem really is "big information" that some people treat as if it were big data. It reminds me of numerologists who see numbers in everything and see every number as potentially important, whereas mathematicians know under what conditions a number is important, and under what conditions it is not.

    Maybe big data is not the problem. Big data, like most anything, is powerful in the hands of those who are able to use it, and dangerous in the hands of those who might not know how to use it, or try to use it for the wrong reasons. Of course, not all those who collected specimens of animals and plants in the age of Western exploration were able to distinguish between the useful and the uninteresting. But all that "big data" became useful grist for the mental mills of natural philosophers, and later to the enzymes, X-rays, and chromatographs of geneticists and other scientists.

    It's not about size, it's about how you use it.

  14. *sigh* Sorry, but this sounds very, very arrogant to me:

    Plotting multivariate data, maps, “relationships” and colorful visualizations is hip and catchy, everybody can understand it.

    Furthermore, it's not true. Multivariate "plotting" (under which, I presume, you summarize all forms of ordination) is far from hip. It's not even catchy. And I know first hand that not everyone can understand it, including some of the authors publishing MV analyses. Hip are GLMMs and GAMs, hip is Path Analysis, hip might also be PLSR. But MV "plotting"? I've been trying to publish my MV analysis for two years now, and I keep getting really annoying reviews from people who obviously think that the approaches are too correlative. (But then, what is modelling but correlative?)

    The trouble is: reviewers keep asking for analyses which are not statistically suitable for your dataset, but which would look good in the current mainstream. And they call you a "data miner" to reject your paper.

    You are widening the gap when posting stuff like the quote. I don't appreciate this.

  15. There's a huge hunk of science that has been neglected thus far: model selection. Specifically, how do we use data to identify the best representation of reality? If reality is complex, then we'll need more data in order to identify this complexity.

    Consider this linear model y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + error

    If you try to "identify" the best model using AIC and you have just 15 observations of this process, you're going to tend, on average, to pick a model with just one variable. However, as sample size (N) increases, your ability to identify the model with both variables (and their interaction) also increases.

    See for yourself. After you run this R code, reset N to 150 and repeat.

    N <- 15                          # sample size; rerun with N <- 150 and compare
    n_sims <- 100                    # number of simulated data sets
    out <- rep(NA, n_sims)           # which model AIC picks in each simulation
    for (i in 1:n_sims) {
      x1 <- rnorm(N)
      x2 <- rnorm(N)
      y  <- 2 + 0.2*x1 - 0.2*x2 + 0.2*x1*x2 + rnorm(N, 0, 1)  # true process: both terms + interaction
      mod1 <- lm(y ~ x1)
      mod2 <- lm(y ~ x1 + x2)
      mod3 <- lm(y ~ x1 + x2 + I(x1*x2))
      out[i] <- which.min(AIC(mod1, mod2, mod3)$AIC)
    }
    hist(out)                        # 1 = x1 only, 2 = additive model, 3 = full model with interaction

  16. Matz, that is a wonderfully stated demonstration of the confrontation between data and hypotheses. However, please allow me to play Devil's advocate and point out the tradeoff between sample and effect sizes. That is, in your model, you assume that the effect of your variables (your coefficients) is quite small, thus detection and statistical power would be weak.

    In the context of this post, I don't think one should think about large data sets or small data sets. I propose the not-so-novel idea of APPROPRIATE sample sizes to detect the signal one is after. In order to get even a glimpse of the magnitude of a signal, however, one needs an understanding of the process. My point is of course one of support for the spirit of this post: that the understanding of processes (whether physical, biological, anthropological, etc.), and the confrontation of ideas with observation, IS science. On the other hand, data mining without a hypothesis in mind, while useful to science, of great importance to developing scientific ideas, and a heck of a lot of fun, is not science, by definition.

  17. While editing, I forgot my ending point: BIG DATA does not necessarily = BAD! ...when appropriate, for example when effect sizes are as small as in Mattz Falconi's model. However, huge datasets WILL provide spurious signal, and small datasets WILL miss it, when either is inappropriate.

    Moral: data set size should not be the currency with which scientists deal. Rather, the appropriateness of the sample size relative to some idea of the process and its EFFECT SIZE should be.
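
    For a feel of how "appropriate" scales with effect size, base R's power.t.test is enough (illustrative numbers only):

    # N per group needed for 80% power at alpha = 0.05, two-sample t-test:
    power.t.test(delta = 1.0, sd = 1, power = 0.8)   # strong effect: roughly 17 per group
    power.t.test(delta = 0.1, sd = 1, power = 0.8)   # weak effect: well over 1500 per group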

  18. EOC - Regardless of effect size, the fact remains that as N (replication) goes up, so too does your ability to identify more complex models as the best representation of reality (assuming a complex process actually generated the data). My point is that there is value in large data sets (lots of replication and lots of independent variables) because such data sets enable scientists to entertain more complex hypotheses. I have a lot of competing elaborate hypotheses about my process (salmon abundance) that I'd like to evaluate, but I can't do this because I don't have enough data. Thus, I take exception to Petrkeil's claim that a high number of observations doesn't lead to better science. It is ironic, because Petrkeil's claim is only true for hypotheses generated under a failure of imagination.

  19. I believe the outcome of the current Big Data hype will be the removal of much of the cost and complexity you mention in dealing with big data sets. These will continue to be sliced and diced as appropriate for the requirement, but the technology of Big Data is making this trivial and largely irrelevant for the experimenter. Source data available on tap that is easily and quickly transformable is, I think, universally appreciated. So is the ability to effectively share data sets or draw from existing sources, making access to relevant data less dependent on your own ability to collect it (accessibility, budget, connections).

    If the data doesn't directly define the hypothesis, the ability to visualize and “explore” these large and broad data sets still seems to offer benefit to the formulation of theory.

  20. Pingback: Links 1/24/13 | Mike the Mad Biologist

  21. Pingback: Data-driven science is a failure of imagination « Working Scientist

