Performance of Semi-Automated Screening Using Rayyan and ASReview: A Retrospective Analysis of Potential Work Reduction and Different Stopping Rules
Author(s) / Creator(s)
Scherhag, Julian
Burgard, Tanja
Abstract / Description
Background: Although systematic reviews are a pillar of modern science, guiding policy making and evidence-based practice (Borenstein et al., 2009; Cooper et al., 2009), conducting them is becoming increasingly difficult because the ever-increasing pool of scientific publications (Bornmann et al., 2021) exceeds human information-processing (Robledo et al., 2021) and labor capacities (Borah et al., 2017). In recent years, a variety of semi-automation tools have been developed to address these issues via machine learning. Their application holds great promise, halving the screening workload while achieving high recall levels of 95% and above (Burgard & Bittermann, in press). To date, few studies have evaluated the performance of Rayyan or ASReview.
Objectives / Research Questions: There is no consensus on when to stop semi-automated screening (Bannach-Brown et al., 2019; Marshall & Wallace, 2019). We therefore investigated how the performance of Rayyan and ASReview differed under different stopping criteria.
Method: To assess the quality of the tools' relevance predictions and semi-automated screening decisions with regard to work saved over sampling (WSS) and achieved recall, we used the Systematic Drug Class Review Gold Standard Data (Cohen et al., 2006). For each of the fifteen reviews, the abstract triage and article triage status were available. The bibliographic information was uploaded to the tools, and the actual full-text screening decisions were then given to the tools successively. The predictive model was retrained with subsequent user decisions, and the remaining studies were re-ranked accordingly. This process terminated when all articles had been reviewed. We analyzed retrospectively how many screenings were necessary to achieve certain recall rates (e.g., 90%, 95%) and how the tools performed at the different stopping points. The tested stopping criteria were 50 or 100 consecutive irrelevant articles, stopping after each quarter screened (25%, 50%, 75%), stopping at Rayyan's relevance threshold, and stopping after achieving 95% of the estimated recall based on the random training sets. Furthermore, the robustness of these stopping rules was assessed with ASReview's simulation feature.
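To make the two outcome measures concrete, the following is a minimal Python sketch of how recall and WSS could be computed at a stopping point, together with one of the tested criteria (a run of consecutive irrelevant articles). It is an illustration under stated assumptions, not the authors' analysis code: the function names and the example label sequence are hypothetical, labels are assumed to be listed in the order the tool presented the records (1 = relevant, 0 = irrelevant), and WSS is taken as the unscreened share of records minus the share of relevant records missed, following the definition in Cohen et al. (2006).

```python
from typing import List, Tuple


def stop_after_consecutive_irrelevant(labels: List[int], n: int) -> int:
    """Return how many records are screened before the rule fires.

    Hypothetical helper: stops once n irrelevant records (label 0) have
    been seen in a row; if that never happens, all records are screened.
    """
    run = 0
    for screened, label in enumerate(labels, start=1):
        run = 0 if label == 1 else run + 1
        if run >= n:
            return screened
    return len(labels)


def recall_and_wss(labels: List[int], screened: int) -> Tuple[float, float]:
    """Recall and work saved over sampling after screening `screened` records.

    Assumes WSS = (unscreened / total) - (1 - recall).
    """
    total = len(labels)
    relevant = sum(labels)
    found = sum(labels[:screened])
    recall = found / relevant if relevant else 1.0
    wss = (total - screened) / total - (1.0 - recall)
    return recall, wss


if __name__ == "__main__":
    # Made-up screening order for ten records, for illustration only.
    ranked_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
    stop = stop_after_consecutive_irrelevant(ranked_labels, n=3)
    recall, wss = recall_and_wss(ranked_labels, stop)
    print(f"stopped after {stop} records: recall={recall:.2f}, WSS={wss:.2f}")
```

In this toy sequence the rule fires before the last relevant record is reached, which illustrates why a single heuristic can stop too early and why recall and WSS need to be reported together.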
Results / Findings: Both tools reduced the workload considerably (20-54%). However, ASReview performed better than Rayyan, finding more relevant studies with less screening. Overall, a trade-off between high recall and moderate WSS was observed. The quarter-based stopping criteria showed the most reliable performance. The other criteria achieved a median recall over 80% but varied considerably with regard to recall and WSS.
Conclusions and Implications: Applying the tools saves considerable screening time and identifies most relevant articles. ASReview is preferable to Rayyan when time is of key importance. Users should not rely on a single stopping criterion but rather combine several to guard against any one criterion failing.
Persistent Identifier
Date of first publication
2023-05-03
Is part of
Big Data & Research Syntheses 2023, Frankfurt, Germany
Publisher
ZPID (Leibniz Institute for Psychology)
Citation
File: Performance of Semi-Automated Screening Using Rayyan and ASReview (Scherhag & Burgard, 2023).pdf (Adobe PDF, 517.51 KB, MD5: 152611c51b83f5fc41c8a455a821d681)
PsychArchives acquisition timestamp: 2023-05-03T12:59:25Z
Made available on: 2023-05-03T12:59:25Z
Publication status: unknown
Review status: unknown
Persistent Identifier: https://hdl.handle.net/20.500.12034/8364
Persistent Identifier: https://doi.org/10.23668/psycharchives.12843
Language of content: eng
Dewey Decimal Classification number(s): 150
DRO type: conferenceObject
Leibniz institute name(s) / abbreviation(s): ZPID
Visible tag(s): ZPID Conferences and Workshops