Saturday, 31 August 2019

publications - Studies on how noisy the accept/reject decision for submissions is



The NIPS 2014 conference ran an interesting experiment: the conference chairs duplicated 10% of the submissions (170 papers) and sent them to two different groups of reviewers. The result: the two groups disagreed on 25.9% of them.


This means that roughly one out of every four papers was accepted by one group of experts and rejected by the other, which shows how noisy the reviewing process is. I was wondering whether similar experiments exist for other fields, and what the disagreement percentage was in each case (regardless of the venue: journal or conference).
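To put the 25.9% figure in perspective, here is a minimal back-of-the-envelope sketch (not part of the original experiment): two committees that each accepted papers independently at random at a fixed rate p would disagree on a fraction 2p(1-p) of papers. The acceptance rate used below is an assumed illustrative value, not a number taken from the experiment.

    # Sketch: compare the observed NIPS 2014 disagreement with the level
    # expected if two committees decided independently at random.
    # The acceptance rate is an assumed illustrative value, not a figure
    # reported in the experiment.
    p_accept = 0.225                    # assumed acceptance rate (illustration only)
    observed_disagreement = 0.259       # reported by the NIPS 2014 experiment

    chance_disagreement = 2 * p_accept * (1 - p_accept)   # ~0.349 for p = 0.225
    print(f"expected by chance: {chance_disagreement:.1%}")
    print(f"observed:           {observed_disagreement:.1%}")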



Answer



There have been many studies on this. Here is a recent meta-analysis of 48(!) of them:



Bornmann, Lutz, Rüdiger Mutz, and Hans-Dieter Daniel. "A reliability-generalization study of journal peer reviews: a multilevel meta-analysis of inter-rater reliability and its determinants." PLOS ONE 5.12 (2010): e14331.



Here's the abstract:



Background



This paper presents the first meta-analysis for the inter-rater reliability (IRR) of journal peer reviews. IRR is defined as the extent to which two or more independent reviews of the same scientific document agree.


Methodology/Principal Findings


Altogether, 70 reliability coefficients (Cohen's Kappa, intra-class correlation [ICC], and Pearson product-moment correlation [r]) from 48 studies were taken into account in the meta-analysis. The studies were based on a total of 19,443 manuscripts; on average, each study had a sample size of 311 manuscripts (minimum: 28, maximum: 1983). The results of the meta-analysis confirmed the findings of the narrative literature reviews published to date: The level of IRR (mean ICC/r² = .34, mean Cohen's Kappa = .17) was low. To explain the study-to-study variation of the IRR coefficients, meta-regression analyses were calculated using seven covariates. Two covariates that emerged in the meta-regression analyses as statistically significant to gain an approximate homogeneity of the intra-class correlations indicated that, firstly, the more manuscripts that a study is based on, the smaller the reported IRR coefficients are. Secondly, if the information of the rating system for reviewers was reported in a study, then this was associated with a smaller IRR coefficient than if the information was not conveyed.


Conclusions/Significance


Studies that report a high level of IRR are to be considered less credible than those with a low level of IRR. According to our meta-analysis the IRR of peer assessments is quite limited and needs improvement (e.g., reader system).



This meta-analysis includes studies of peer review agreement in economics/law, natural sciences, medical sciences, and social sciences.
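For readers unfamiliar with the coefficients the meta-analysis aggregates, here is a small sketch of how Cohen's kappa (chance-corrected agreement) is computed from two reviewers' accept/reject decisions. The decisions below are made-up toy data, not taken from any of the cited studies.

    # Toy illustration of Cohen's kappa: chance-corrected agreement between
    # two raters on categorical labels. The decisions below are invented.
    from collections import Counter

    def cohen_kappa(a, b):
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n           # raw agreement
        ca, cb = Counter(a), Counter(b)
        expected = sum(ca[c] * cb[c] for c in set(a) | set(b)) / n ** 2
        return (observed - expected) / (1 - expected)

    rev1 = ["accept", "reject", "reject", "accept", "reject", "reject", "accept", "reject"]
    rev2 = ["accept", "reject", "accept", "reject", "reject", "reject", "accept", "reject"]
    print(f"raw agreement = {sum(x == y for x, y in zip(rev1, rev2)) / len(rev1):.2f}")
    print(f"Cohen's kappa = {cohen_kappa(rev1, rev2):.2f}")

On this toy data the raw agreement of 0.75 shrinks to a kappa of roughly 0.47; the meta-analysis above reports a mean kappa of only .17 across studies.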


Here is another paper with a section on the reliability of peer review (i.e., agreement between reviewers) that summarizes a number of other studies:



Bornmann, Lutz. "Scientific peer review." Annual Review of Information Science and Technology 45.1 (2011): 197-245.




Specifically in computer science, there's this:



Ragone, Azzurra, et al. "On peer review in computer science: analysis of its effectiveness and suggestions for improvement." Scientometrics 97.2 (2013): 317-356.



They measured inter-reviewer agreement in



a large reviews data set from ten different conferences in computer science for a total of ca. 9,000 reviews on ca. 2,800 submitted contributions.



and found




in our case we have six conferences with ICC > 0.6, i.e. with significant correlation, 3 conferences with a fair correlation (0.4 < ICC < 0.59) and one conference with poor correlation among raters (ICC < 0.4).
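As a rough illustration of the quantity behind these thresholds, here is a sketch of one common variant, the one-way ICC(1,1), computed from a submissions-by-reviewers matrix of scores. The scores are toy numbers, and this variant may differ from the exact ICC form Ragone et al. report.

    # Sketch: one-way random-effects ICC(1,1) for a matrix of review scores,
    # one row per submission, one column per reviewer. Toy numbers only.
    import numpy as np

    def icc_1_1(scores):
        scores = np.asarray(scores, dtype=float)
        n, k = scores.shape
        grand_mean = scores.mean()
        target_means = scores.mean(axis=1)
        ms_between = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)
        ms_within = ((scores - target_means[:, None]) ** 2).sum() / (n * (k - 1))
        return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    # Hypothetical 1-5 review scores for six submissions, two reviewers each.
    scores = [[5, 4], [2, 1], [4, 4], [1, 2], [3, 5], [2, 2]]
    print(f"ICC(1,1) = {icc_1_1(scores):.2f}")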



They also found that agreement on "strong reject" papers was much higher than agreement on other papers. More precisely,



A more detailed analysis shows that if somebody gives a mark from the "strong reject" band, this increases the probability of giving marks not only from strong and weak reject bands (by 14 and 63% correspondingly) but also from borderline band (by 11%). In the "strong accept" set the probability of others giving a "weak accept" mark is 20% higher than the overall probability, but the probability of giving marks from other bands are almost the same as the overall probabilities.


Therefore, we can say that we have marks skewed towards the "weak accept" and reviewers still agree on very bad contributions while disagree on very good.


