If your P value looks too good to be true, it probably is: Communicating reproducibility and variability in cell biology
| Content Provider | Semantic Scholar |
|---|---|
| Author | Lord, Samuel J.; Velle, Katrina B.; Mullins, R. Dyche; Fritz-Laylin, Lillian K. |
| Copyright Year | 2019 |
| Abstract | Cell biology is littered with erroneously tiny P values, often the result of evaluating individual cells as independent samples. Because readers expect low P values and small error bars to imply that an observed difference would persist if the experiment were to be duplicated, the sample size (N) used for statistical tests should actually be the number of times an experiment was repeated. P values calculated using the number of cells do not reflect the reproducibility of the result and are thus highly misleading. To help authors avoid this mistake, we provide examples and practical tutorials to create figures that communicate both the cell-level variability and the experimental reproducibility. SUMMARY SENTENCE: The sample size for a typical cell biology experiment is the number of times the experiment was repeated, not the number of cells measured. |
| File Format | PDF, HTM/HTML |
| Alternate Webpage(s) | https://arxiv.org/pdf/1911.03509v2.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |

**P = 0.00000000000000001?! Think again**

Error bars and P values are often used to assure readers of a real and persistent difference between populations or treatments. P values are based on the difference between population means (or another summary metric) as well as the number of measurements used to determine that difference. In general, increasing the number of measurements decreases the resulting P value. To convey anything about experimental reproducibility, P values and the standard error of the mean should be calculated using independent measurements of the population of interest, typically observations from separate experiments (Lazic, 2010; Aarts et al., 2014; Naegle et al., 2015; Lazic et al., 2018). Limited time and resources usually constrain cell biologists to repeat any particular experiment only a handful of times, so a typical sample size N is something like 4. Too often, however, authors¹ mistakenly assign N as the number of cells, or even the number of observed subcellular structures. This number may be on the order of hundreds or thousands, resulting in vanishingly small P values and tiny error bars that are not useful for determining the reproducibility of an experiment.

¹ We freely admit that our past selves are not innocent of the mistakes described in this manuscript.

Cell biology is a noisy discipline, and our measurements have many sources of variance. For example, quantitative experiments are difficult to replicate exactly from day to day and lab to lab. Well-designed studies embrace both cell-to-cell and sample-to-sample variation (Altman and Krzywinski, 2015). Unlike measuring the gravitational constant or the speed of light, repeatedly quantifying a biological parameter rarely converges on a single "true" value. Calculating standard error from thousands of cells conceals this expected variability.

We have written this tutorial to help cell biologists calculate meaningful P values and plot data in ways that highlight both experimental robustness and cell-to-cell variability. Specifically, we propose the use of distribution–reproducibility "SuperPlots" that display the distribution of the entire dataset and report statistics (such as means, error bars, and P values) that address the reproducibility of the findings. This paper is not an overview of statistical analyses; many have described the troubles with P values (Gardner and Altman, 1986; Sullivan and Feinn, 2012; Greenland et al., 2016), and there are several excellent practical guides to statistics for cell biologists (Lamb et al., 2008; Pollard et al., 2019). In this paper we specifically address ways to communicate reproducibility when performing statistical tests and plotting data.
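To see concretely how the choice of N changes a P value, here is a minimal simulation in the spirit of the argument above. All numbers, noise levels, and the use of NumPy/SciPy are our own illustrative assumptions, not anything from the paper: cells within a replicate share a day-to-day offset, so pooling every cell as an independent sample yields a spuriously tiny P value, while a test on the four per-replicate means does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_replicates, n_cells = 4, 250   # 4 independent experiments, 250 cells each

def simulate(shift=0.0):
    """Simulate cell speeds in which replicate-to-replicate offsets dominate."""
    rep_offsets = 10 + shift + rng.normal(0, 1.0, n_replicates)
    # one row per biological replicate, one column per cell
    return rep_offsets[:, None] + rng.normal(0, 2.0, (n_replicates, n_cells))

control, treated = simulate(0.0), simulate(0.5)

# Misleading: every cell counted as an independent sample (N = 1,000 per group)
_, p_cells = stats.ttest_ind(control.ravel(), treated.ravel())

# Meaningful: one summary value per biological replicate (N = 4 per group)
_, p_reps = stats.ttest_ind(control.mean(axis=1), treated.mean(axis=1))

print(f"P with cells as N:      {p_cells:.1e}")
print(f"P with replicates as N: {p_reps:.2f}")
```

Because the cell-level test ignores the shared replicate-to-replicate variation, it will typically report a minuscule P value for a difference that only four independent experiments cannot reliably distinguish from day-to-day noise.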
**Biological Replicates vs. Technical Replicates**

Before we dig into why the total number of observed cells is usually the wrong number to use as the sample size, we should clarify what we mean by experimental replication. A biological replicate is an independent, repeated experiment; technical replicates are repeated measurements of the same sample. For example, if I use a ruler to measure the length of 200 hairs on my head, I have many technical replicates and a precise estimate of my own hair length; measuring the hair of 200 different people would give me many biological replicates and a more accurate estimate of the average human hair length. In other words, biological replicates address the reproducibility of an experimental result, while technical replicates explore the distribution within each sample. An individual cell measurement is typically closer to a technical replicate (Lazic, 2010; Lazic et al., 2018).
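As a toy numerical version of the hair example (all numbers are hypothetical assumptions), the sketch below contrasts a standard error of the mean computed from technical replicates with one computed from biological replicates:

```python
import numpy as np

rng = np.random.default_rng(1)

# 200 technical replicates: one person's hairs, re-measured with a ruler
my_hairs = rng.normal(9.0, 0.5, 200)      # cm; assumed personal mean and spread

# 200 biological replicates: one hair each from 200 different people
population = rng.normal(12.0, 4.0, 200)   # cm; assumed population mean and spread

def sem(x):
    """Standard error of the mean."""
    return x.std(ddof=1) / np.sqrt(len(x))

print(f"technical:  mean = {my_hairs.mean():5.2f} cm, SEM = {sem(my_hairs):.3f} cm")
print(f"biological: mean = {population.mean():5.2f} cm, SEM = {sem(population):.3f} cm")
```

The technical SEM is tiny, but it only describes how precisely one head was measured; only the biological SEM speaks to the average human hair length.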
Deciding what makes for a good biological replicate can be challenging (Aarts et al., 2014; Blainey et al., 2014; Naegle et al., 2015). For example, is it acceptable to run multiple experiments from just one thawed aliquot of cells, or do I need to borrow an aliquot from another lab? Is it necessary to generate multiple knockout strains? Is it sufficient to test in one cell line, or do I need to use multiple cell types, or even cells from multiple species? Can I perform all experiments in one lab, or should I include results from a lab on the other side of the country (Lithgow et al., 2017)? There is no single right answer: each researcher must balance practicality with robust experimental design. At a minimum, researchers must perform an experiment multiple times if they want to know whether the results are robust.²

² This raises the question of how many cells one should examine in each sample. Is it better to look at many cells in a few biological replicates, or to spend less time measuring individual cells and redirect that effort to repeating the experiment additional times? Multiple analyses have found that increasing the number of biological replicates has a larger influence on statistical power than imaging many more cells in each sample (Aarts et al., 2014; Blainey et al., 2014).

**What Hypothesis Is Being Tested?**

Here is a simple question to help you clarify what your sample size N should be (Naegle et al., 2015; Lazic et al., 2018; Pollard et al., 2019): what population are you trying to sample? A P value is often used to indicate that a measured difference between populations (treatments, strains, knockdowns, etc.) is "significant," or unlikely to be due only to random chance.³ What those two populations are depends on how you sample them. For example, if you want to know whether a particular treatment changes the speed of crawling cells, you could split a flask of cells into two wells of a 96-well plate, dose one well with a drug of interest and one with a placebo, and then track individual cells in each of the two wells. If you use each cell as a sample, the two populations you end up comparing are the cells in those two particular wells. By repeating the experiment multiple times from new flasks, and using each experiment as a sample, you instead evaluate the effect of the treatment on any arbitrary flask of cells.

³ Despite warnings from statisticians (Greenland et al., 2016).

Multiple observations within one well increase the precision of the estimated mean for that one sample, but they do not reveal a truth about all cells in all wells (just as measuring many hairs on my own head gives no insight into the average haircut). If you only care about cell-to-cell variability within a particular sample, then maybe N really is the number of cells you observed. Making inferences beyond that sample, however, would be questionable, because the natural variability of individual cells can be overshadowed by systematic differences between biological replicates. Whether caused by passage number, confluency, or location in the incubator, cells often vary from flask to flask and day to day. Entire flasks of cells might even be described as "unhappy." Accordingly, cells from experimental and control samples (e.g., tubes, flasks, wells, or coverslips) may differ from each other regardless of the intended experimental treatment. When authors report the sample size as the number of cells, the resulting statistical analysis cannot help the reader evaluate whether any observed differences are due to the intended treatment or to simple sample-to-sample variability.

We are not prescribing any specific definition of N; we are simply encouraging researchers to consider what main source of variability they hope to overcome when designing experiments and statistical analyses (Altman and Krzywinski, 2015) (see Table 1). To test the hypothesis that two treatments or populations are different, the treatment must be applied, or the populations sampled, multiple times. In a drug trial for a new painkiller pill, one of my knees cannot be randomly assigned to the treatment group and the other to placebo, so researchers cannot count each of my knees as a separate N. (If the trial is testing steroid injections, then under certain statistical models each knee could be considered a separate sample (Aarts et al., 2014).) Similarly, neighboring cells within one flask or well treated with a drug are not separate tests of the hypothesis, because the treatment was only applied once. But if individual cells are microinjected with a drug or otherwise randomly assigned to different treatments, then each cell really can be a separate test of the hypothesis.

**How to Calculate P Values from Cell-Level Observations**
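Along the lines proposed above, here is a minimal, hypothetical SuperPlot sketch (the data, colors, and matplotlib styling are illustrative assumptions, not the paper's published recipe): every cell-level value is shown, colored by biological replicate; the per-replicate means are overlaid; and the P value is computed from those means, so N equals the number of replicates.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
conditions = {"Control": 10.0, "Drug": 12.0}   # hypothetical mean cell speeds
n_replicates, n_cells = 4, 50

fig, ax = plt.subplots()
replicate_means = {}
for i, (name, mu) in enumerate(conditions.items()):
    means = []
    for rep in range(n_replicates):
        # each biological replicate gets its own offset plus cell-level noise
        cells = rng.normal(mu + rng.normal(0, 1.0), 2.0, n_cells)
        jitter = i + rng.normal(0, 0.06, n_cells)   # spread cell-level points
        ax.scatter(jitter, cells, s=8, alpha=0.4, color=f"C{rep}")
        means.append(cells.mean())
    # overlay one large marker per biological replicate, colored to match
    ax.scatter([i] * n_replicates, means, s=120,
               c=[f"C{r}" for r in range(n_replicates)],
               edgecolors="black", zorder=3)
    replicate_means[name] = means

# the statistics use the replicate means, so N = 4 per condition
_, p = stats.ttest_ind(replicate_means["Control"], replicate_means["Drug"])
ax.set_xticks(range(len(conditions)))
ax.set_xticklabels(list(conditions))
ax.set_ylabel("Cell speed (a.u.)")
ax.set_title(f"P = {p:.2f} (N = {n_replicates} biological replicates per condition)")
plt.show()
```

When control and treated samples are matched within each experiment (e.g., run on the same day), a paired test on the replicate means may be more appropriate than the unpaired test used in this sketch.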