Conference Object

Performance of Semi-Automated Screening Using Rayyan and ASReview: A Retrospective Analysis of Potential Work Reduction and Different Stopping Rules

Author(s) / Creator(s)

Scherhag, Julian
Burgard, Tanja

Abstract / Description

Background: Although systematic reviews are a pillar of modern science, guiding policy making and evidence-based practice (Borenstein et al., 2009; Cooper et al., 2009), conducting them is becoming increasingly difficult because the ever-increasing pool of scientific publications (Bornmann et al., 2021) exceeds human information-processing (Robledo et al., 2021) and labor capacities (Borah et al., 2017). In recent years, a variety of semi-automation tools have been developed to address these issues via machine learning. Their application holds great promise, halving the screening workload while achieving high recall levels of 95% and above (Burgard & Bittermann, in press). To date, few studies have evaluated the performance of Rayyan or ASReview.

Objectives / Research Questions: There is no consensus on when to stop semi-automated screening (Bannach-Brown et al., 2019; Marshall & Wallace, 2019). We therefore investigated how the performance of Rayyan and ASReview differed under different stopping criteria.

Method: To assess the quality of the tools' relevance predictions and semi-automated screening decisions with regard to work saved over sampling (WSS) and achieved recall, the Systematic Drug Class Review Gold Standard Data (Cohen et al., 2006) was used. For each of the fifteen reviews, the abstract triage and article triage status were given. The bibliographic information was uploaded to the tools, and the actual full-text screening decisions were then given to the tools successively. The predictive model was retrained with each subsequent user decision and the remaining studies were re-ranked accordingly. This process terminated when all articles had been reviewed. We analyzed retrospectively how many screening decisions were necessary to achieve certain recall rates (e.g., 90%, 95%) and how the tools performed at the different stopping points. The tested stopping criteria were stopping after 50 or 100 consecutive irrelevant articles, stopping after each quarter screened (25%, 50%, 75%), stopping at Rayyan's relevance threshold, and stopping after achieving 95% of the estimated recall based on the random training sets. Furthermore, the robustness of these stopping rules was assessed with ASReview's simulation feature.

Results / Findings: Both tools reduced the workload considerably (20-54%). However, ASReview performed better than Rayyan, finding more relevant studies with less screening. Overall, a trade-off between high recall and moderate WSS was observed. Stopping after a fixed quarter of the records showed the most reliable performance. The other criteria achieved a median recall above 80% but varied considerably with regard to recall and WSS.

Conclusions and Implications: Applying the tools saves considerable screening time and identifies most relevant articles. ASReview is preferable to Rayyan when time is of key importance. Users should not rely on a single criterion but rather combine several to guard against any one criterion failing.
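The evaluation described in the Method can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes that the screening order produced by a tool is available as a list of 0/1 relevance labels, and it uses the common definition of WSS from Cohen et al. (2006), i.e. the share of records left unscreened minus (1 - recall). All function names are hypothetical and chosen for illustration only. Only the two simplest stopping rules (a run of consecutive irrelevant records and a fixed share of records screened) are shown.

```python
import math
from typing import List, Tuple


def recall_and_wss(labels: List[int], stop: int) -> Tuple[float, float]:
    """Return (recall, WSS) when screening stops after `stop` records.

    `labels` holds the true relevance (1 = relevant, 0 = irrelevant)
    in the order in which the records were screened.
    """
    total = len(labels)
    relevant_total = sum(labels)
    if relevant_total == 0:
        raise ValueError("recall is undefined without relevant records")
    found = sum(labels[:stop])
    recall = found / relevant_total
    # Assumed WSS definition: unscreened share minus the recall shortfall.
    wss = (total - stop) / total - (1.0 - recall)
    return recall, wss


def stop_after_consecutive_irrelevant(labels: List[int], k: int) -> int:
    """Number of records screened when k irrelevant records occur in a row."""
    run = 0
    for position, label in enumerate(labels, start=1):
        run = 0 if label else run + 1
        if run >= k:
            return position
    return len(labels)  # the rule never triggered, so everything is screened


def stop_after_fraction(labels: List[int], fraction: float) -> int:
    """Number of records screened when stopping at a fixed share (e.g. 0.25)."""
    return math.ceil(fraction * len(labels))


if __name__ == "__main__":
    # Toy screening order, invented for illustration (not study data).
    order = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
    for name, stop in [
        ("3 consecutive irrelevant", stop_after_consecutive_irrelevant(order, 3)),
        ("50% screened", stop_after_fraction(order, 0.5)),
    ]:
        recall, wss = recall_and_wss(order, stop)
        print(f"{name}: stop={stop}, recall={recall:.2f}, WSS={wss:.2f}")
```

In the toy example, both rules miss the relevant record in eighth position, which illustrates why a single stopping criterion can fail and why the trade-off between recall and WSS has to be checked per rule.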

Persistent Identifier

https://hdl.handle.net/20.500.12034/8364
https://doi.org/10.23668/psycharchives.12843

Date of first publication

2023-05-03

Is part of

Big Data & Research Syntheses 2023, Frankfurt, Germany

Publisher

ZPID (Leibniz Institute for Psychology)

Citation

  • Author(s) / Creator(s)
    Scherhag, Julian
  • Author(s) / Creator(s)
    Burgard, Tanja
  • PsychArchives acquisition timestamp
    2023-05-03T12:59:25Z
  • Made available on
    2023-05-03T12:59:25Z
  • Date of first publication
    2023-05-03
  • Publication status
    unknown
    en
  • Review status
    unknown
    en
  • Persistent Identifier
    https://hdl.handle.net/20.500.12034/8364
  • Persistent Identifier
    https://doi.org/10.23668/psycharchives.12843
  • Language of content
    eng
  • Publisher
    ZPID (Leibniz Institute for Psychology)
    en
  • Is part of
    Big Data & Research Syntheses 2023, Frankfurt, Germany
    en
  • Dewey Decimal Classification number(s)
    150
  • Title
    Performance of Semi-Automated Screening Using Rayyan and ASReview: A Retrospective Analysis of Potential Work Reduction and Different Stopping Rules
    en
  • DRO type
    conferenceObject
    en
  • Leibniz institute name(s) / abbreviation(s)
    ZPID
    de_DE
  • Visible tag(s)
    ZPID Conferences and Workshops