A Critical Review of Reproducibility and Replicability from the NAS in 2019

Data integrity is key to the advancement of science and to everyday commercial transactions. Recently, the reproducibility and replicability of measurements, including scientific experiments, have become a concern. In May 2019, the National Academies of Sciences, Engineering, and Medicine published Reproducibility and Replicability in Science to examine the topic and make recommendations on remedial actions.1 This blog post is a review of the first 30 pages of the 219-page book. I plan to cover subsequent pages in later blog posts.

The acknowledgments list the scientists who contributed five background papers used in drafting the NAS report:

  • Rosemary Bush, “Perspectives on Reproducibility and Replication of Results in Climate Science”
  • Emily Howell, “Public Perceptions of Scientific Uncertainty and Media Reporting of Reproducibility and Replication in Science”
  • Xihong Lin, “Reproducibility and Replicability in Large Scale Genetic Studies”
  • Anne Plant and Robert Hanisch, “Reproducibility and Replicability in Science, a Metrology Perspective”
  • Lars Vilhuber, “Reproducibility and Replicability in Economics”

These confidential backgrounders illustrate the broad scope of the consensus study. However, readers should pay careful attention to the particular definitions of reproducibility and replicability used in the report.

Definitions: The report uses unusual definitions for reproducibility and replicability compared to common usage. Readers need to keep these definitions in mind in order to understand the focus of the recommendations.

Reproducibility in the report means “computational reproducibility,” i.e., obtaining consistent computational results using the same input data; computational steps, methods, and code; and conditions of analysis (page 24).

Reproducibility is strongly associated with transparency. A study’s data and code have to be available in order for others to reproduce and confirm results. In the extreme, one may need to run the programs on the exact same model of computer with the same number of cores.

I’ve heard anecdotal reports that processing large data sets from DNA sequencers can give different results on successive runs starting from the same data set. With large data sets, successive runs are seldom compared.
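
One simple check I would try, sketched below in Python (my own sketch, not from the report; the file names are hypothetical), is to checksum the outputs of two runs on the same input and compare:

    # compare_runs.py -- my sketch: checksum the outputs of two pipeline runs on
    # the same input data to detect run-to-run differences (paths are hypothetical)
    import hashlib

    def file_sha256(path):
        """Return the SHA-256 digest of a file, read in 1-MB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    run1 = file_sha256("run1/variants.vcf")  # hypothetical output of the first run
    run2 = file_sha256("run2/variants.vcf")  # hypothetical output of the repeat run
    print("outputs identical" if run1 == run2 else "runs differ -- investigate")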

Later in the report, the authors suggest that journal editors consider ways to ensure reproducibility for publications that make claims based on computations, to the extent ethically and legally possible (page 24). How would I start to do this? I’m clueless.

Replicability in the NAS report also has an unusual meaning: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

Thus, for the NAS report, reproducibility measures the spread obtained from repeated runs with a single data set and code. Replicability is the spread in results when new data are generated using methods similar to those of previous or current studies.

My understanding of these differences can be summarized by the equation relating the overall spread of data points to the variances of the individual contributing causes:

σ_measured = √(σ₁² + σ₂² + … + σₙ²)

That is, the measured standard deviation is the square root of the sum of the squared spreads from the individual contributing causes.

The data spread measured by reproducibility includes only the variance of the computational measurement. It excludes the variance contributed by sample prep and the analytical system, such as an HPLC with a detector, which the report assigns to “replicability.” Replicability includes the spread attributable to all other factors: the operator, reagents (including water), time-dependent variations across different days, instrument-to-instrument variance, location, and more. All contribute to the spread in results. Chemists involved in transferring an analytical method from one lab to another uncover interesting cause-and-effect relationships.
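
A hypothetical worked example (the numbers are mine, invented for illustration): if sample prep contributes a spread of 2.0%, the HPLC system 1.0%, and the computation 0.1%, then σ_measured = √(2.0² + 1.0² + 0.1²) = √5.01 ≈ 2.24%. The computational term, the only part covered by “reproducibility” in the report’s sense, is essentially negligible next to the experimental contributions.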

Using their definitions, let’s look at their recommendations:

Recommendation 4-1: To help ensure the reproducibility of computational results, researchers should convey clear, specific, and complete information about any computational methods and data products that support their published results in order to enable other researchers to repeat the analysis, unless such information is restricted by non-public data policies. That information should include the data, study methods, and computational environment:

  • the input data used in the study either in extension (e.g., a text file or a binary) or in intension (e.g., a script to generate the data), as well as intermediate results and output data for steps that are nondeterministic and cannot be reproduced in principle;
  • a detailed description of the study methods (ideally in executable form) together with its computational steps and associated parameters; and
  • information about the computational environment where the study was originally executed, such as operating system, hardware architecture, and library dependencies (which are relationships described in and managed by a software dependency manager tool to mitigate problems that occur when installed software packages have dependencies on specific versions of other software packages) (page 27).
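
The third bullet is easier to picture with a concrete sketch. Here is a minimal Python example of my own (not from the report; the output file name environment_snapshot.json is just a placeholder) that records the operating system, hardware architecture, and installed library versions alongside an analysis:

    # record_environment.py -- my minimal sketch of capturing the "computational
    # environment" named in recommendation 4-1; not an official tool
    import json
    import platform
    import sys
    from importlib import metadata

    env = {
        "python_version": sys.version,
        "operating_system": platform.platform(),
        "hardware_architecture": platform.machine(),
        # every installed package and its exact version
        "library_dependencies": {
            dist.metadata["Name"]: dist.version for dist in metadata.distributions()
        },
    }

    # write the snapshot next to the analysis outputs so others can compare
    with open("environment_snapshot.json", "w") as f:
        json.dump(env, f, indent=2, sort_keys=True)

Even this much would let a reviewer see at a glance whether a rerun used the same library versions as the original analysis.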

At first glance, recommendation 4-1 seems to promote transparency, but is it practical? Software is subject to updates and automated upgrades, and such changes may or may not alter results. Many of us lack the ability to qualify the impact of these changes. Enterprise-wide and cloud computing remove the data-processing part of an experiment from the scientist’s control. Asking a scientist to critically review and compare versions of an operating system or software program seems excessive and probably beyond the skills of most non-computer jocks. Plus, software vendors are not transparent about such changes. Another problem: what to do when the particular software is no longer available or supported? The lifecycle of storage devices is only 20 years. I stored reports on tapes, floppies (large and small), and CDs, none of which are readable by my laptop today.

The report recognizes other challenges to 4-1, including artificial intelligence, high-performance computing, and deep learning, and recommends a solution in 4-2.

Recommendation 4-2: The National Science Foundation should consider investing in research that explores the limits of computational reproducibility in instances in which bitwise reproducibility is not reasonable in order to ensure that the meaning of consistent computational results remains in step with the development of new computational hardware, tools, and methods.
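
To see why bitwise reproducibility can be an unreasonable target, consider a small example of my own (not from the report): floating-point addition is not associative, so merely changing the order of a summation, which parallel hardware and multi-core scheduling do routinely, changes the exact bits of the result.

    # float_order.py -- my illustration: floating-point addition is not associative,
    # so summation order (e.g., from parallel scheduling) changes the exact bits
    a, b, c = 0.1, 0.2, 0.3

    left_to_right = (a + b) + c   # 0.6000000000000001
    right_to_left = a + (b + c)   # 0.6

    print(left_to_right == right_to_left)            # False
    print(left_to_right.hex(), right_to_left.hex())  # two different bit patterns

Both answers are “correct” to roughly one part in 10^16, yet they are not bitwise identical; that is the gray zone recommendation 4-2 asks the NSF to explore.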

Exact computational reproducibility does not guarantee the correctness of the computation. For example, if a digital artifact (an error) in the code goes undetected, rerunning the code reproduces the same erroneous result (page 28). The NAS report notes that more than half of the attempts to address the problems raised in 4-1 have failed for lack of sufficient experimental detail in the computational workflow.

Intermediate summary: The book starts off with a nearly universal title, “Reproducibility,” followed by an unusual and very limiting definition relating to the data processing system, including transmission and storage. Data spread due to experimental details such as sample prep, processing, workflow, and analytics is usually, though not always, the larger contributor to the overall spread.

The next blog posts will pick up from here.

With a smile,

Bob

Reference

1. Reproducibility and Replicability in Science; The National Academies Press: Washington, DC, 2019; https://doi.org/10.17226/25303.

Robert L. Stevenson, Ph.D., is Editor Emeritus, American Laboratory/Labcompare; e-mail: [email protected]
