There’s a movement underway – call it the “reproducibility movement” – which questions how much we can trust scientific research. The movement often emerges into public view as bad news. “Only 50% of psychology studies could be reproduced!” “Influential deworming study questioned!” “Famous economists’ spreadsheet error led to ill-formed conclusion.”
The overall sense that one could easily get from these news stories is that individual scientists are falling short. It’s relatively easy to explain things going wrong in specific cases, and to talk about who was right or wrong in their conclusions.
Over the past couple of years of working on these issues, I’ve realized how hard it is to explain the “endemic problems” affecting research in a gripping way, as opposed to pointing out specific cases. As soon as I say “endemic,” eyes glaze over. For example, I was starting to explain p-values and publication bias to a relative recently, and I realized that I could repeat my spiel five times without her getting it. She is a smart person, but my explanation was just too dry. So I tried this: “What selfies do you share?” –“The good ones.” “Research is like that.” The metaphor clicked right away.
So what’s the metaphor for the reproducibility movement? Think about restaurants. When we go to a restaurant, we trust that the ingredients are as described and that the food isn’t likely to make us sick. Why do we trust that the food is fine? If we’ve been there before, we may rely on our personal experience. But we’re also happy to try new restaurants, because there’s a fairly robust system in place to verify that the food is fine – even that it’s great. We trust that there are regulations and inspections, and we can see the posted results of those inspections. On top of that, in many cities we can read plenty of restaurant reviews and look at ratings on Yelp before we pick a place. So we have a system that supplies fairly good evidence of safety and good taste.
Think of researchers as restaurateurs. How do we know we can trust their results? Peer review – the process of peers assessing results before they are published – is supposed to be that system. But reviewers tend to see and accept the articles saying “we found something!” That would be as if Yelp posted only positive reviews. And what about the restaurant inspections? Reviewers rarely even see the materials used to generate the results (e.g. the data and code) – the kitchen, so to speak. How can they know what’s gone on in there without seeing and checking what’s in the prep area?
There are steps in the direction of regulations and inspections, along with reviews: funder and journal data-sharing policies, for example, and some experiments with open peer review and re-analysis. But compliance with those policies is still poor, and there are very few cases of open “inspections.” There’s also the question of what those regulations and inspections should be. Maybe funders and journals could get researchers to share their data. But what does anyone do with that data? Other restaurateurs aren’t well motivated to go out and post a review saying that a competitor’s food isn’t good. They might be seen as having an agenda (since they’re competing for customers), and they might invite counter-attack. And there’s a deeper problem: on Yelp, we trust that other humans – at least enough of them – have tastes similar to ours, so we can rely on their judgment. But in many fields there’s no well-established agreement about how to cook, that is, how to analyze the data. Different researchers may very well come to different results looking at the same data. So the question is how we can create a recipe (so to speak) that will produce results we can trust. That’s where a lot of uncertainty remains in some fields – in development economics, for example, there’s controversy about whether researchers should be required, as a matter of good practice, to register a protocol before running an experiment.
So, moving past the metaphor, what should we conclude? Here’s my conclusion: I don’t have a high level of confidence in any individual piece of research. That’s not because I mistrust researchers – I certainly don’t expect them to commit outright fraud such as data fabrication (except on rare and strange occasions). It’s because I see that there’s no system in place which engenders trust. Quite the opposite: we have a system which rewards people for saying something exciting, and which makes it hard to publish null results. We also lack reporting norms that would make it possible to see what was tested but went unmentioned.
What would a better system look like? That’s not easy to say. I’ve been grappling with this question for a couple of years now, long enough to see that there are no easy answers. We’re in an uncharted place, and we’re surrounded by “ifs.” If tenure committees rewarded data-sharing, or funding agencies checked for compliance, then researchers would share data more. If researchers shared data, then someone could check their results. If someone checked their results, then the record would self-correct. Beyond the “ifs,” there are hard puzzles lurking: the details of what to share, what to reward, and what “checking results” really means when we don’t have an easy way to adjudicate the disputes that deeper re-analysis would inevitably raise.
Above the cloud of “ifs” and devilish details, I sometimes catch a fleeting vision of a better system. In that system, researchers are motivated only to do the best research they can, not by “publish or perish.” Their work is fully transparent: they report, for example, every statistical test they ran and why they think some are preferable, and replications are funded regularly before we raise our level of confidence in a result. Back in the actual world, many billions are spent each year on research we can’t be very confident in, because of the “endemic problems.” There are no reliable inspections, and there’s no Yelp equivalent. But we need to be able to trust research in order to accomplish most of our deeply held goals of helping ourselves and others. And that’s why it’s so important for us to figure out how to develop a better system.