Wednesday, November 11, 2015

A metaphor for the reproducibility movement

There’s a movement underway – call it the “reproducibility movement” – which questions how much we can trust scientific research. The movement often emerges into public view as bad news. “Only 50% of psychology studies could be reproduced!” “Influential deworming study questioned!” “Famous economists’ spreadsheet error led to flawed conclusions.”[1]

The overall sense that one could easily get from these news stories is that individual scientists are falling short. It’s relatively easy to explain things going wrong in specific cases, and to talk about who was right or wrong in their conclusions.

Over the past couple of years, working on these issues, I’ve realized how hard it is to explain the “endemic problems” affecting research in a gripping way, as opposed to pointing out specific cases. As soon as I say “endemic,” eyes glaze over. For example, I recently started explaining p-values and publication bias to a relative, and I realized that I could have repeated my spiel five times without her getting it. She is a smart person, but my explanation was just too dry. So I tried this: “What selfies do you share?” – “The good ones.” – “Research is like that.” The metaphor clicked right away.
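
For readers who do want the dry version, here is a toy simulation of the publication-bias part of that spiel. It’s my own illustration with made-up numbers, not drawn from any real study: if many small studies of a true null effect are run and only the “significant,” exciting-sounding ones get published, the published record looks like evidence for an effect that isn’t there.

```python
# Toy simulation of publication bias: many small studies of a true null effect,
# but only "the treatment works!" results (p < 0.05, positive direction) get
# written up and published. All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
published_effects = []

for _ in range(1000):                      # 1,000 small studies, true effect = 0
    treatment = rng.normal(0.0, 1.0, 30)
    control = rng.normal(0.0, 1.0, 30)
    t, p = stats.ttest_ind(treatment, control)
    if p < 0.05 and t > 0:                 # only exciting positive findings get shared
        published_effects.append(treatment.mean() - control.mean())

print(f"studies published: {len(published_effects)} out of 1000")
print(f"average published effect: {np.mean(published_effects):.2f} (true effect: 0.0)")
```

In this setup the average published effect comes out well above zero, even though nothing is going on – the research-world version of sharing only the good selfies.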

So what’s the metaphor for the reproducibility movement? Think about restaurants. When we go to a restaurant, we trust that the ingredients are as they describe and that they aren’t likely to make us sick. Why do we trust that their food is fine? If we’ve been there before, we may rely on our personal experience. But we’re also happy to try new restaurants, because there’s a fairly robust system in place to verify that the food is fine, and even that it’s great. We trust that there are regulations and inspections in place, and we can see posted results of the inspections. And on top of that, in many cities, we can read plenty of restaurant reviews and look at ratings on Yelp before we pick a place. So we have a system of fairly good evidence for safety and good taste.

Think of researchers as restaurateurs. How do we know we can trust their results? Peer review – the process of peers assessing the results before they are published – is supposed to be that system. But reviewers tend to see and accept the articles saying “we found something!” That would be as if Yelp posted only the positive reviews. And what about the restaurant inspections? Reviewers rarely even see the materials used to generate the results (e.g., data/code) – the kitchen, so to speak. How can they know what’s gone on in there without seeing and checking what’s in the prep area?

There are steps in the direction of regulations and inspections, along with reviews: funder and journal data-sharing policies, and some instances of open peer review and re-analysis. But compliance with the policies is still poor, and open "inspections" are very rare. And there’s also the question of what those regulations and inspections should be. Maybe funders and journals could get researchers to share their data. But what does anyone do with that data? Other restaurateurs aren’t well motivated to go out and post a review saying that their competitor’s food isn’t good. They might be seen as having an agenda (since they’re competing for customers), and they invite counter-attack.

And there’s another problem. On Yelp, we tend to trust that other humans – at least enough of them – have similar taste to ours, and that we can trust their judgment. But in many cases, there’s no well-established agreement about how to cook – that is, how to examine the data. Different researchers may very well come to different results looking at the same data. So the question is how we can create a recipe (so to speak) that will produce results we can trust. That’s where there’s still a lot of uncertainty in some fields – e.g., in development economics, there’s controversy about whether researchers should be required, as a matter of good practice, to register a protocol before running an experiment.

So, moving past the metaphor now, what should we conclude? Here’s my conclusion: I don’t have a high level of confidence in any individual piece of research. That’s not because I mistrust researchers – I certainly don’t expect them to commit outright fraud such as data fabrication (except in rare and strange cases). It’s because I see that there’s no system in place which engenders trust. Quite the opposite – we have a system which rewards people for saying something exciting and which makes it hard to publish null results. We also lack reporting norms that would make it possible to see what was tested but went unmentioned.

What would a better system look like? That’s not easy to say. I’ve been grappling with this question for a couple of years now, long enough to see that there are no easy answers. We’re in an uncharted place, and we’re surrounded by “ifs.” If tenure committees rewarded data-sharing, or funding agencies checked for compliance, then researchers would share data more. If researchers shared data, then someone could check their results. If someone checked their results, then the record would self-correct. Beyond the “ifs,” hard puzzles lurk: the details of what to share, what to reward, and what “checking results” really means when we have no easy way to adjudicate the disputes that would inevitably arise from deeper re-analysis.

Above the cloud of “ifs” and devilish details, I sometimes catch a fleeting vision of a better system. In this system, researchers are rewarded only for doing the best research they can, not for “publish or perish.” Their work is fully transparent – e.g., they report all the statistical tests they ran and why they think some are preferable – and experiments are routinely repeated before we raise our level of confidence in a result. Back in the actual world, many billions are spent each year on research that we can’t be very confident in because of the “endemic problems.” There are no reliable inspections, and there’s no Yelp equivalent. But we need to be able to trust research in order to accomplish most of our deeply held goals of helping ourselves and others. And that's why it's so important to figure out how to develop a better system.

Monday, August 11, 2014

Data-sharing and reproducibility

As data-sharing becomes more prevalent, so do discussions of the important topics surrounding it. Data citation and linking data to papers, metadata standards, infrastructure for data-sharing, the legal aspects of re-using data, and so on are all topics I have seen discussed quite frequently at venues like the annual IASSIST conference and more broadly in the data curation and data-sharing community.

However, one topic that I haven’t seen discussed much is something I wonder about a lot myself: what do we mean by “reproducible research” and “replication,” and how do they interface with data-sharing?

What do we mean by “replication”?

One of the main rationales for requiring and/or encouraging researchers to share data is that doing so will make it possible to replicate their research. 

Let’s pause, since this can get confusing. The words “replication” and “reproducible” tend to be used in different ways, varying by field or sometimes even by researcher. I see two main categories of activities, both of which are sometimes called “replication”: 
  • Re-analysis/robustness checks of the original study, using original data/code.
  • Conducting a new study with new data collection, similar in some or (almost) all ways to the original.
We could get more fine-grained, but these seem to be the basic categories. 

Because of the proliferation of naming schemas (I’ve seen about a dozen papers or blog posts with suggestions), I’m a little wary of adding my own here. But because it’s much easier to use a single word, I’m going to refer to these two basic kinds of replication as “re-analyzing” and “reproducing” a study, respectively.

Data-sharing and re-analysis: 

Data-sharing is often connected explicitly to reproducibility of some kind. It’s one of the main justifications many journals give for requiring that researchers share the data/code underlying published results: so that the work can be replicated. (Note: the point is not so that peer reviewers can see the data/code as they judge its reliability - at least not in the social sciences, which I’m more familiar with. Even when these materials are required at publication, it seems uncommon for them to be used at all beforehand.)

The basic idea, of course, is that anyone who is interested can go ahead and check your published results using the raw materials used to create them. As an added benefit, you’re more likely to be careful in checking your work if you have to share data/code, in addition to just your summary of end results.

Data-sharing and checking the reliability of the analysis:

Here’s the question that I wonder about: to what extent does what is (normally) shared allow someone to actually check the reliability of the analysis? 

When researchers share data and code to meet journal requirements, what they share is often:

  1. A subset of the data that was collected - e.g., only what was used in the published results.
  2. A subset of the code used to produce the final results - e.g., the analysis code used to produce the tables. There is plenty of code that precedes this final-stage code: all the code used to clean the data, merge datasets, transform the collected data into the new variables used in the analysis, and so on.
So, what does it mean to check the analysis using these materials? It means running the final-stage code, essentially to see that it runs without errors and that the numbers it produces match the tables reported in the paper. It might also mean reading through the code to see that the final-stage analysis does what the paper describes (e.g., a regression controlling for XYZ).
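
To make concrete how limited this check is, here is a minimal sketch in Python of what it typically amounts to. This is my own illustration, not any journal’s actual workflow; the file name, variable names, and the “reported” coefficient are all hypothetical.

```python
# Sketch of a "partial" reproducibility check: re-run the shared final-stage
# analysis and confirm it matches the published table. All names and numbers
# below are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# The already-cleaned analysis subset that the authors shared.
df = pd.read_csv("shared_analysis_data.csv")

# Re-run the regression described in the paper (outcome on treatment,
# controlling for age and income).
model = smf.ols("outcome ~ treatment + age + income", data=df).fit()

reported_coef = 0.42  # hypothetical value copied from the paper's Table 2
rerun_coef = model.params["treatment"]

print(f"re-run estimate: {rerun_coef:.3f} (paper reports {reported_coef})")
# A match here shows only that the last step reproduces. It says nothing about
# how this shared subset was cleaned, constructed, or selected in the first place.
```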

But what these materials don’t let you check are the “deeper,” potentially more problematic things, such as:

  • What decisions were made during data-cleaning and variable construction? For example: were outliers excluded from the originally collected data (and if so, why)? When variables were constructed from the raw data, which choices were made and why? Were datasets merged properly? (The sketch after this list illustrates how such choices can disappear from view.)
  • Were there choices about what to report, and how to report it? If only a subset of the data is shared, checking for selective reporting of outcomes isn’t possible. It’s not even possible to know what was collected, unless the researcher mentions it in the paper. Were only certain age groups or other subgroups reported? Only a few of the many outcomes surveyed? Would controlling for other variables change the reported results? That’s hard to tell if the full range of variables you could control for isn’t included in the shared data.
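
Here is a small, hypothetical illustration of the first point. The numbers and the exclusion rule are invented; the point is that someone who receives only the cleaned subset cannot see that the rule existed.

```python
# Hypothetical example: the same raw data with and without an undocumented
# "outlier" exclusion rule. Only the cleaned subset is what typically gets shared.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.DataFrame({"outcome": rng.normal(0, 1, 200)})
raw.loc[:4, "outcome"] += 8            # a handful of extreme values in the raw data

cleaned = raw[raw["outcome"] < 3]      # an unreported cleaning decision: drop "outliers"

print("mean of raw data:    ", round(raw["outcome"].mean(), 2))
print("mean of cleaned data:", round(cleaned["outcome"].mean(), 2))
# A re-analysis of `cleaned` can confirm the published numbers, but it cannot
# reveal this exclusion - or let anyone ask whether it was justified.
```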

The question: partial vs. start-to-finish reproducibility?

By “start-to-finish reproducibility,” I mean sharing data and code such that someone could track what you did from data collection all the way to the published results. Currently, it seems to me that shared materials much more often allow only “partial reproducibility.”

So the question is: should researchers share materials that allow for start-to-finish reproducibility? Is that the ideal we should be aiming for?

Some further reflections: 

  • Start-to-finish reproducibility can be very difficult if you don’t set out to achieve it from the beginning. For one thing, keeping track of the code (cleaning, transforming variables, and so on) from start to finish is hard, particularly when multiple research assistants help out over the life of the data (collection, cleaning, analysis).
  • There is also the question of how to achieve full reproducibility when the full process, run start to finish, would involve PII (personally identifiable information), which obviously can never be shared publicly.
  • Aiming to get data and code into comprehensible, well-organized shape early on makes it much easier. My sense is that what we need are good guidelines (and implementation of those guidelines) for structuring files, writing code, and managing data (e.g., labeling variables) throughout the study. With some effort from the outset, start-to-finish reproducibility is likely to be much more feasible. Some groups in social science, such as the Berkeley Initiative for Transparency in the Social Sciences, are making progress on creating such guidelines. (A minimal sketch of what a start-to-finish pipeline could look like follows this list.)
  • It would be great to see more public discussion of how valuable it is to aim for, and achieve, start-to-finish reproducibility. My impression is that reproducible research is often invoked as a goal without much discussion of what it means (e.g., partial vs. start-to-finish reproducibility). Connecting conversations about reproducible research to what it means in practice -- and, importantly, to best practices for doing it -- seems essential for moving forward.
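
As a concrete (and deliberately simplified) illustration, here is one way a start-to-finish pipeline might be organized. The directory layout and script names are hypothetical, and this is a sketch of the general idea rather than a recommendation of any particular tool.

```python
# Hypothetical project layout:
#
#   data/raw/            collected data, never edited by hand (public copy excludes PII)
#   data/clean/          produced only by the cleaning script below
#   code/01_clean.py     documents every exclusion and recode
#   code/02_merge.py     documents how datasets are joined
#   code/03_analyze.py   produces the tables reported in the paper
#   output/
#
# A single "run everything" script makes the whole chain checkable, in order.
import subprocess

PIPELINE = ["code/01_clean.py", "code/02_merge.py", "code/03_analyze.py"]

for script in PIPELINE:
    print(f"running {script} ...")
    subprocess.run(["python", script], check=True)  # stop immediately if any step fails
```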

Thursday, December 26, 2013

Great discussion of framing effects replication

Joseph Simmons and Leif Nelson recently wrote up the results of their attempt to replicate a framing-effects experiment. The experiment was done by David Mandel, and it included an attempted replication of Tversky and Kahneman's "Asian Disease Problem" framing-effect experiment. Mandel changed the wording of the original slightly to test its robustness: he added the word "exactly" to rule out misinterpretation of the original wording, which could potentially be read as meaning "at least." Simmons and Nelson attempted to replicate Mandel's results and found a notably different outcome.

What I want to focus on is not the details of the discussion, as interesting as they are. For the details, I'd recommend reading Mandel's original paper, the Simmons/Nelson replication and Mandel's reply. Instead, I want to comment on some great features of the exchange.

First, the discussion was remarkably rapid. Mandel's replication study was published in August 2013. Simmons and Nelson responded with their own replication, posted on their blog, in December. Mandel responded in detail to their replication on his own blog within a few days of receiving the drafted post from Simmons/Nelson. It's great to be able to read not just what Simmons/Nelson found, but also Mandel's take on it, without long delays.

Second, not only was the discussion rapid, it was also high quality. Both Simmons/Nelson and Mandel engaged closely with what their experimental results showed, and they made the discussion accessible to readers.

Third, the tone is great. Simmons/Nelson point out why Mandel's study is important and worth replicating:
The original finding is foundational and the criticism is both novel and fundamental. We read Mandel’s paper because we care about the topic, and we replicated it because we cared about the outcome.
Then, Mandel makes it clear right away that he is taking their replication in a collegial way. He titles it "AT LEAST I REPLIED: Reply to Simmons & Nelson's Data Colada blog post." Making a joke in his title is a nice signal of the spirit in which he takes their replication; he then adds:
First, let me say that when Joe contacted me, he noted that his investigation with Leif was conducted entirely in the scientific spirit and not meant to prove me wrong. He said they planned to publish their results regardless of the outcome. I accept that that’s so, and my own comments are likewise intended in the spirit of scientific debate.
A barrier to doing and discussing replications is that current academic incentives can make the practice awkward and professionally unrewarding. Researchers who replicate might not be able to publish their work, since it's often not seen as original enough, and the authors whose work is replicated may not welcome the effort. This exchange doesn't bear on the former issue, since the posts appeared on blogs, but it is clearly relevant to the latter.

Having examples of thoughtful exchanges like this one is a nice demonstration of what replication can be. At its best, it is detailed and thoughtful work that helps us sort through which effects we can more confidently rely on.

Thursday, December 5, 2013

Altmetrics and Cochrane reviews

"Altmetrics" is a term coined by Jason Priem, referring to a new, more comprehensive way of measuring the impact of scholarship. Whereas the usual ways of assessing an article's importance are where it is published (i.e., the journal, as ranked by impact factor) and its citation count, altmetrics aim to include measures how often it is discussed and mentioned in social media. This allows for a broader take on impact, as well as allowing impact measurement of a wider range of research outputs, such as datasets.

Altmetrics - in addition to being a new term - is also an organization. Its product is an embeddable icon that links to a scoring system. The system crawls social media sites (Facebook, Twitter, blogs, Reddit, etc.) for mentions of a particular paper, and then displays the number of mentions along with links to them.
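
As a rough sketch of the kind of per-paper lookup this enables - not the organization's documented API; the endpoint, placeholder DOI, and response fields below are assumptions made purely for illustration - a lookup by DOI might go something like this:

```python
# Hypothetical altmetrics-style lookup by DOI. The endpoint and the shape of
# the JSON response are invented for illustration, not a real documented API.
import requests

doi = "10.1000/example-doi"  # placeholder: substitute the paper's actual DOI
resp = requests.get(f"https://api.example-altmetrics.org/v1/doi/{doi}", timeout=10)
resp.raise_for_status()
record = resp.json()

# Print whatever per-source mention counts the service returns
# (the keys will vary by provider).
for source, count in record.get("mention_counts", {}).items():
    print(f"{source}: {count}")
```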

While I'd heard of altmetrics (both the term and the organization) some time ago and in general appreciated the development, I hadn't seen the product in action until recently. The Cochrane Collaboration, which I've written about before, has now embedded altmetrics in its abstract pages.

What this means is that when you search Cochrane summaries and then look at a particular review's abstract - take, for example, "Antioxidant supplements for prevention of mortality in healthy participants and patients with various diseases" - you can click through to an altmetrics page for that review.

What I really liked while clicking through a few reviews and their associated altmetrics pages is that I could easily see (1) the extent to which a review has been discussed and, to some extent, (2) who had discussed it. What strikes me as great here is that it makes the small corner of the internet you might be obsessing over interconnected in a way that it wasn't before. Of course, we can always google to try to sleuth out who is saying what about a particular paper or subject, but that can be very time-consuming. When the sources are linked together through the altmetrics page, it's much quicker to find others thinking and writing about the same paper.

For instance, I'm curious to discover others who are writing about Cochrane reviews, especially ones that I've taken an interest in (such as the one above). Altmetrics gives me an easier way to find them.

My main suggestion for change, though, is that the altmetrics page displays only a "subset" of the relevant mentions, while I'd like to be able to scroll through all of them (or at least a larger subset!). The current design seems much more focused on quantifying the discussion than on what I find most valuable: easily finding others who are interested in the same topics.

Sunday, October 20, 2013

Research transparency landscape on Figshare

As part of my research consulting work this past summer, I wrote a landscape review of funder data-access policies and other resources.

The write-up was originally shared informally with an email list of funders and interested researchers. But then a researcher requested that I put it on Figshare so that it could be cited in a paper she's writing. It occurred to me that this landscape might be useful to a wider audience interested in research transparency/data-sharing. So here it is:

This was my first upload to Figshare, and it made me even more aware than before that Figshare is a great site! The interface is easy and pleasant to use, and Creative Commons licenses (CC-BY, and CC0 for data) accompany all uploads to the site. I recommend it for sharing papers and data.

Careers, caring and the unexpected (Part III)

Deciding between careers is a tough process. There are a lot of factors. Location, job security, salary, enjoyment, value to others/the world, replaceability (how easily your position could be filled by someone who would do it as well as you) - just to name a few. I spent quite a bit of time mulling over the decision of whether to continue as a professor or not.

Leaving academia was a particularly difficult decision. For one thing, the academic job market is so competitive - hundreds of eager applicants for each tenure-track job - that it's hard to go back once you leave.

I also had to get beyond the feeling that being an academic was the only way to really be an intellectual. Looking back now that I no longer have this feeling, it seems ridiculous (of course you can be an intellectual without being a professor! why not?). But there's a sense of this within academia. It's not made explicit exactly, but I believe it's quite pervasive, at least in the humanities. I'll always remember a fellow grad student who, deciding not to go on the academic job market, had printed and posted this article on the office door.

GiveWell offered me a full-time position after the summer trial period. In the end, after weighing everything up, the factor that really clinched my decision was that I wanted to be excited about my job. I knew what it was to have a really nice job, one that I was lucky to have. But I wanted to give and get more from my work. I gave notice at the college and, in January 2012, moved to NYC from Boston to work for GiveWell.

Right away, I loved being in New York. I had a great community of friends from college and elsewhere, and living in Brooklyn fit exactly what I was looking for. The job with GiveWell gave me the chance to work on a lot of interesting topics. I also really enjoyed always having people to talk to who had similar interests.

One of my favorite research topics quickly became "meta-research," GiveWell's term for initiatives aimed at improving research. This can involve a lot of things, but early on, the focus of GiveWell's work in this area was looking into the Cochrane Collaboration. As I've posted about, Cochrane does great systematic reviews of health interventions. I had a really interesting experience talking to a large number of people who work with Cochrane. On the basis of this research, GiveWell directed a grant to the US Cochrane Center (via Good Ventures).

The work on Cochrane became a gateway for me to other areas of meta-research. This work really fit into a main theme that had originally drawn me to GiveWell. I wanted better evidence for guiding decisions. As I began to learn more, it started to sink in that there are issues which affect not just philanthropic research but all research. Lack of transparency makes reported results less reliable, because we can't check them. Publication models which encourage and reward "interesting" results lead to a system where we can't trust that positive findings reflect how things really are (rather than what's likely to get published).

OK, so what's happened in the past year? (I'm going to speed through a bit, since it's harder to take a bird's-eye view of things that have happened in the past year as opposed to say, 5 years ago.)

First, GiveWell moved to San Francisco and I stayed in NYC. There were many reasons for this, some of them personal, but a big one for me was that I love New York and feel at home here and a part of a community. I've remained a big fan of GiveWell after moving on from being a researcher with the group. After some time considering my next step, I became a research consultant with another philanthropic advisor in NYC, which allowed me to follow up further on my interest in improving research.

Careers, caring and the unexpected (Part II)

I emailed GiveWell immediately after finding the site, saying something effusive like "What you're working on is really great. Can I help in some way?" Elie (a co-founder of GiveWell) soon wrote back, and I began to do things like check footnotes and sources on GiveWell pages as a volunteer. It was summertime between semesters teaching, and I spent quite a lot of time on this. It might not sound super-exciting to check footnotes for errors, but my excitement about GiveWell carried over to the task.

In the fall, GiveWell asked if I'd be interested in part-time research consulting work. My initial research focused on the "Malthus" question of whether aid in developing countries might, through saving lives, lead to overpopulation and increased scarcity of resources. This is a big question, and I focused on the sub-question of the relation between child mortality and fertility (i.e., average children per woman). Some researchers - Jeffrey Sachs and Hans Rosling, for example - argue that there's a causal relation between the two. That is, if you save children's lives, birthrates will fall as parents decide to have fewer children. Of course, as with many questions in development that involve lots of correlated variables, it's notoriously difficult to make well-supported causal inferences. In my research, I came across 50+ papers offering conflicting views on this question.

Through the GiveWell work, I learned that I really enjoyed research on empirical questions. Prior to this, I'd always thought of myself as a "humanities person." In college, I took a lot of classes in literature, history and philosophy. In grad school, I focused on theoretical problems in epistemology and metaphysics. I'd missed out on a side of myself - a side that really enjoys puzzling over applied questions, e.g., comparing the effectiveness of programs aimed at helping people, learning about statistics and empirical methods, and so on.

I kept doing GiveWell work during my second year as an assistant professor. The more of this kind of work I did, the more excited about it I felt. I started thinking about whether I wanted to stay in philosophy.

On the one hand, teaching philosophy was a really nice job. It involved having interesting conversations on a beautiful campus about topics I generally found enjoyable. I didn't anticipate a torturous route to tenure, because publishing requirements weren't sky-high (I'd have to work for it, but I thought it would be manageable). I liked my colleagues and students. On the other hand, I didn't feel my heart beat faster at the thought of teaching philosophy for the next 30 years. I wasn't getting up in the morning eager to expand my understanding of topics I thought were important. But I did feel that way about the work I did for GiveWell.

It was a hard decision to make. I asked if I could work with GiveWell full-time over the following summer as a trial period. I spent much of the time researching the evidence on the effectiveness of cash transfers, and two months went by very quickly.

(To be continued in the next post...)