Performance of Semi-Automated Screening Using Rayyan and ASReview: A Retrospective Analysis of Potential Work Reduction and Different Stopping Rules

Scherhag, Julian; Burgard, Tanja

Conference Object

Author(s) / Creator(s)

Scherhag, Julian

Burgard, Tanja

Abstract / Description

Background: Systematic reviews are a pillar of modern science, guiding policy making and evidence-based practice (Borenstein et al., 2009; Cooper et al., 2009). Conducting them is becoming increasingly difficult, however, because the ever-growing pool of scientific publications (Bornmann et al., 2021) exceeds human information-processing (Robledo et al., 2021) and labor capacities (Borah et al., 2017). In recent years, a variety of semi-automation tools have been developed to address these issues via machine learning. Their application holds great promise, halving the screening workload while achieving high recall levels of 95% and above (Burgard & Bittermann, in press). To date, few studies have evaluated the performance of Rayyan or ASReview.

Objectives / Research Questions: There is no consensus on when to stop semi-automated screening (Bannach-Brown et al., 2019; Marshall & Wallace, 2019). We therefore investigated how the performance of Rayyan and ASReview differed under different stopping criteria.

Method: To assess the quality of the tools' relevance predictions and semi-automated screening decisions with regard to work saved over sampling (WSS) and achieved recall, we used the Systematic Drug Class Review Gold Standard Data (Cohen et al., 2006). For each of the fifteen reviews, the abstract triage and article triage status were given. The bibliographic information was uploaded to the tools, and the actual full-text screening decisions were then given to the tools successively. The predictive model was retrained with subsequent user decisions and the remaining studies were re-ranked accordingly. This process terminated when all articles had been reviewed. We analyzed retrospectively how many screenings were necessary to achieve certain recall rates (e.g., 90%, 95%) and how the tools performed at the different stopping points. The tested stopping criteria were 50 or 100 consecutive irrelevant articles, stopping after each quarter of the articles had been screened (25%, 50%, 75%), stopping at Rayyan's relevance threshold, and stopping after achieving 95% of the estimated recall based on the random training sets. Furthermore, the robustness of these stopping rules was assessed with ASReview's simulation feature.

Results / Findings: Both tools reduced the workload considerably (20-54%). However, ASReview performed better than Rayyan, finding more relevant studies with less screening. Overall, a trade-off between high recall and moderate WSS was observed. The quarter-based stopping criteria showed the most reliable performance. The other criteria achieved a median recall above 80% but varied considerably with regard to recall and WSS.

Conclusions and Implications: Applying the tools saves considerable screening time and identifies most relevant articles. ASReview is preferable to Rayyan when time is of key importance. Users should not rely on a single stopping criterion but should combine several to guard against any one criterion failing.
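The evaluation logic described in the Method section can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes that the screening order is available as a list of 0/1 relevance labels and that WSS at a stopping point is computed as the share of records left unscreened minus the share of relevant records missed, following the definition commonly attributed to Cohen et al. (2006); the abstract itself does not state the formula. The function names and the toy labels are hypothetical.

```python
from typing import List, Tuple


def stop_after_consecutive_irrelevant(labels: List[int], k: int) -> int:
    """Return how many records are screened before the rule fires.

    labels: relevance decisions in screening order (1 = relevant, 0 = irrelevant).
    k: number of consecutive irrelevant records that triggers the stop
       (the abstract tests 50 and 100; the toy example below uses 3).
    If the rule never fires, every record is screened.
    """
    run = 0
    for position, label in enumerate(labels, start=1):
        run = 0 if label == 1 else run + 1
        if run >= k:
            return position
    return len(labels)


def recall_and_wss(labels: List[int], n_screened: int) -> Tuple[float, float]:
    """Recall and work saved over sampling (WSS) after n_screened records.

    Assumed definition: WSS = (N - n_screened) / N - (1 - recall).
    """
    n_total = len(labels)
    n_relevant = sum(labels)
    if n_total == 0 or n_relevant == 0:
        raise ValueError("need at least one record and one relevant record")
    found = sum(labels[:n_screened])
    recall = found / n_relevant
    wss = (n_total - n_screened) / n_total - (1.0 - recall)
    return recall, wss


# Hypothetical toy ranking: 10 records, 4 of them relevant.
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
stop_at = stop_after_consecutive_irrelevant(toy_labels, k=3)
recall, wss = recall_and_wss(toy_labels, stop_at)
print(f"stopped after {stop_at} records, recall={recall:.2f}, WSS={wss:.2f}")
```

In this toy ranking the rule fires before the last relevant record is reached, which is the kind of failure that motivates combining several stopping criteria rather than relying on one.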

Persistent Identifier

https://doi.org/10.23668/psycharchives.12843

Date of first publication

2023-05-03

Is part of

Big Data & Research Syntheses 2023, Frankfurt, Germany

Publisher

ZPID (Leibniz Institute for Psychology)

Performance of Semi-Automated Screening Using Rayyan and ASReview (Scherhag & Burgard, 2023).pdf

Adobe PDF - 517.51KB

MD5 : 152611c51b83f5fc41c8a455a821d681

Sharing Level 0 (Public Use) CC-BY-SA 4.0

PsychArchives acquisition timestamp

2023-05-03T12:59:25Z
Made available on

2023-05-03T12:59:25Z
Publication status

unknown

Review status

unknown

Persistent Identifier

https://hdl.handle.net/20.500.12034/8364
Language of content

eng
Dewey Decimal Classification number(s)

150
DRO type

conferenceObject

Leibniz institute name(s) / abbreviation(s)

ZPID

Visible tag(s)

ZPID Conferences and Workshops