How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies
| Content Provider | Semantic Scholar |
|---|---|
| Author | Schimmack, Ulrich |
| Copyright Year | 2016 |
| Abstract | In the past five years, the replicability of original findings published in psychology journals has been questioned. We show that replicability can be estimated by computing the average power of studies. We then present four methods that can be used to estimate average power for a set of studies that were selected for significance: p-curve, p-uniform, maximum likelihood, and z-curve. We present the results of large-scale simulation studies with both homogeneous and heterogeneous effect sizes. All methods work well with homogeneous effect sizes, but only maximum likelihood and z-curve produce accurate estimates with heterogeneous effect sizes. All methods overestimate replicability using the Open Science Collaboration reproducibility project, and we discuss possible reasons for this. Based on the simulation studies, we recommend z-curve as a valid method to estimate replicability. We also validated a conservative bootstrap confidence interval that makes it possible to use z-curve with small sets of studies. |
| Keywords | Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, P-curve, P-uniform, Z-curve, Effect size, Replicability, Simulation |

Science is built on a mixture of trust and healthy skepticism. On the one hand, scientists who read and cite published work trust the authors, reviewers, and editors to ensure that most reported results provide sufficient credible evidence based on objective empirical studies. On the other hand, scientists also insist that studies be reported with sufficient detail to reproduce them and to see whether other researchers can replicate the results. Replication studies ensure that false positives are promptly discovered when replication studies fail to confirm the original results. Replicability is acknowledged to be a requirement of good science (Popper, 1934; Bunge, 1998). According to Fisher, replicability is also a characteristic of a good experiment: "A properly designed experiment rarely fails to give ... significance" (Fisher, 1926, p. 504).

In recent years, psychologists and other scientists have started to realize that published results are far less replicable than one would expect based on the high rate of significant results in published articles (Hirschhorn, Lohmueller, Byrne, & Hirschhorn, 2002; Ioannidis, 2008; Simmons, Nelson, & Simonsohn, 2011; Begley & Ellis, 2012; John, Loewenstein, & Prelec, 2012; Begley, 2013; Chang & Li, 2015; Baker, 2016). In psychology, the Open Science Collaboration (OSC) reproducibility project attempted to estimate the replicability of published results by replicating 100 primary findings from articles in three influential journals that publish results from social and cognitive psychology (OSC, 2015). Ninety-seven percent of the original studies reported a statistically significant result, but only 37% of the replication studies were able to replicate this outcome. This low success rate has created heated debates, especially in social psychology, where the success rate was only 25%.

The use of actual replication studies to estimate replicability has a number of limitations.
First, it is practically impossible to conduct actual replications on a large scale, especially for studies that require a long time (longitudinal studies), are very expensive (MRI studies), or raise ethical concerns (animal research). Second, actual replication studies may require expertise that only a few researchers in the world have. Third, there are many reasons why a particular replication study might fail, and a replication failure would call for additional efforts to seek reasons for the failure. For these reasons, it is desirable to have an alternative method of estimating replicability that does not require literal replication.

We see this method as complementary to actual replication studies. Actual replication studies are needed because they provide more information than just finding a significant result again. For example, they show that the results can be replicated over time and are not limited to a specific historical or cultural context. They also show that the description of the original study was sufficiently precise to reproduce the study in a way that successfully replicated the original result. At the same time, a statistical estimation method based on the results reported in original articles can provide information that replication studies do not provide. For example, it can show that it was highly probable or improbable that an exact replication study would be successful. This information can be helpful in the evaluation of failed replication studies. If the replicability estimate of the original study is low, it is not surprising that an actual replication study failed to produce a significant result. In contrast, if the estimated replicability was high, it suggests that the replication study was not exact or had some problems. Thus, statistical estimates of replicability and the outcomes of replication studies can be seen as two independent methods that are expected to produce convergent evidence of replicability.

Our approach to the estimation of replicability based on evidence from original studies rests on the concept of statistical power. Power analysis was introduced by Neyman and Pearson (1933) as a formalization of Fisher's criterion of a good experiment. According to Fisher (1926, p. 504), a good experiment should rarely produce a non-significant result when the null hypothesis is false. Most psychologists are familiar with Cohen's (1988) suggestion that good experiments should have 80% power; that is, 4 out of 5 replications should produce a significant result and only 1 out of 5 studies would fail to reject the false null hypothesis, thereby making a type-II error. However, in actual practice, psychologists have ignored a priori power analysis and typically conduct studies with less power (Schimmack, 2012). A common estimate is that average power is about 50% (Cohen, 1962; Sedlmeier & Gigerenzer, 1989), which means that about half of the studies in psychology have less than 50% power. Power has direct consequences for replicability because power is the long-run probability of obtaining a statistically significant result. Thus, even if a study with 30% power produced a significant result, the chance of obtaining the same result again in an exact replication study remains 30%.
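
To make the link between power and replicability concrete, here is a small simulation sketch. It is not taken from the paper; the effect size d = 0.41 and per-group sample size n = 26 are hypothetical values chosen so that a two-sided two-sample t-test has roughly 30% power, in which case the long-run rate of significant results also approximates the expected success rate of exact replications.

```python
# Illustrative sketch (not from the paper): under exact replication, the
# probability of obtaining p < .05 again equals the study's true power.
# d and n are hypothetical values chosen to give roughly 30% power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2016)
d, n, alpha, reps = 0.41, 26, 0.05, 20_000  # standardized effect, per-group n

significant = 0
for _ in range(reps):
    control = rng.normal(0.0, 1.0, n)    # group with no effect
    treatment = rng.normal(d, 1.0, n)    # group shifted by the true effect d
    _, p = stats.ttest_ind(treatment, control)
    significant += p < alpha

# Long-run rate of significant results = power = expected replication rate
print(f"Simulated power / replicability: {significant / reps:.2f}")  # ~0.30
```
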
Methodologists have long wondered why researchers ignore power if power is essential for producing significant results in original studies and in replication studies (Schimmack, 2012; Sedlmeier & Gigerenzer, 1989), without finding a satisfactory answer. We believe one possible explanation is that researchers confuse the rate of significant results in published articles with the replicability of published findings. As power is the long-run probability of obtaining a significant result, the rate of obtained significant results can be used to estimate observed power, and the success rate in original articles is over 90% (Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). This may create the illusion that studies have high power and nearly always produce significant results, and the expectation that replication studies will be equally successful. However, Sterling et al. (1995) pointed out that the observed success rate in journals provides an inflated estimate of power, and therefore of replicability, because journals are more likely to publish significant results than non-significant results. It is well known that non-significant results often end up in Rosenthal's (1979) proverbial file drawer, but it is not known how many non-significant results remain unpublished. As a result, neither the true success rate of original studies nor the replicability of these studies is currently known.

In this article, we present four methods that can be used to estimate the replicability of published studies even if published studies are selected for significance. These methods can be used to estimate the replicability of psychological research in general or the replicability of results in specific journals. We define replicability as the probability of obtaining the same result in an exact replication study with the same procedure and sample sizes. As most studies focus on rejecting the null hypothesis, we define "obtaining the same result" as obtaining a significant result again. This definition ignores the sign of an effect or the pattern of a complex interaction effect, which leads to a slight overestimation of replicability compared with a definition that also takes the sign of an effect into account. However, this bias is small because it is very rare that a replication study produces a significant result in the opposite direction (OSC, 2015).

All four methods of estimating replicability use the statistical evidence against the null hypothesis in original studies to estimate average power. In a simple scenario, where all studies have the same power (the homogeneous case), replicability is power. However, in the more realistic and complex scenario, where studies have different power (the heterogeneous case), replicability corresponds to mean power. In our technical description of the statistical methods we focus on the statistically well-defined concept of power. In the end, we use this approach to make predictions about replicability in the OSC reproducibility project.

Introduction of Statistical Methods for Power Estimation

Consider a population of significance tests in which every test has its own probability of being significant; that is, there is a population of power values.
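
The definitions above can be written compactly; the notation here is chosen for illustration and is not taken from the paper.

```latex
% Illustrative notation (not the paper's own symbols).
% Power of study i: probability that an exact replication with the same
% procedure and sample size yields a significant result.
\[
  \pi_i \;=\; \Pr\bigl(p_i < \alpha \mid \text{exact replication of study } i\bigr)
\]
% Replicability of a set of k published studies: the mean (average) power.
\[
  R \;=\; \bar{\pi} \;=\; \frac{1}{k} \sum_{i=1}^{k} \pi_i
\]
% Homogeneous case: \pi_1 = \dots = \pi_k = \pi, so R = \pi.
% Heterogeneous case: the \pi_i differ and R is their mean; because journals
% select for significance, the estimation methods must recover \bar{\pi} from
% test statistics of studies that passed the p < \alpha filter.
```

In this notation, the simulation sketch above corresponds to the homogeneous case with a single power value of roughly 0.30.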
| File Format | PDF, HTM / HTML |
| Alternate Webpage(s) | http://www.utstat.toronto.edu/~brunner/zcurve2016/HowReplicable.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |