• Systematic review
  • Open access
  • Published: 19 June 2020

Quantitative measures of health policy implementation determinants and outcomes: a systematic review

  • Peg Allen   ORCID: orcid.org/0000-0001-7000-796X 1 ,
  • Meagan Pilar 1 ,
  • Callie Walsh-Bailey 1 ,
  • Cole Hooley 2 ,
  • Stephanie Mazzucca 1 ,
  • Cara C. Lewis 3 ,
  • Kayne D. Mettert 3 ,
  • Caitlin N. Dorsey 3 ,
  • Jonathan Purtle 4 ,
  • Maura M. Kepper 1 ,
  • Ana A. Baumann 5 &
  • Ross C. Brownson 1 , 6  

Implementation Science volume 15, Article number: 47 (2020)


Abstract

Background

Public policy has tremendous impacts on population health. While policy development has been extensively studied, policy implementation research is newer and relies largely on qualitative methods. Quantitative measures are needed to disentangle differential impacts of policy implementation determinants (i.e., barriers and facilitators) and outcomes to ensure intended benefits are realized. Implementation outcomes include acceptability, adoption, appropriateness, compliance/fidelity, feasibility, penetration, sustainability, and costs. This systematic review identified quantitative measures that are used to assess health policy implementation determinants and outcomes and evaluated the quality of these measures.

Methods

Three frameworks guided the review: Implementation Outcomes Framework (Proctor et al.), Consolidated Framework for Implementation Research (Damschroder et al.), and Policy Implementation Determinants Framework (Bullock et al.). Six databases were searched: Medline, CINAHL Plus, PsycInfo, PAIS, ERIC, and Worldwide Political. Searches were limited to English language, peer-reviewed journal articles published January 1995 to April 2019. Search terms addressed four levels: health, public policy, implementation, and measurement. Empirical studies of public policies addressing physical or behavioral health with quantitative self-report or archival measures of policy implementation with at least two items assessing implementation outcomes or determinants were included. Consensus scoring of the Psychometric and Pragmatic Evidence Rating Scale assessed the quality of measures.

Results

Database searches yielded 8417 non-duplicate studies, with 870 (10.3%) undergoing full-text screening, yielding 66 studies. From the included studies, 70 unique measures were identified to quantitatively assess implementation outcomes and/or determinants. Acceptability, feasibility, appropriateness, and compliance were the most commonly measured implementation outcomes. Common determinants in the identified measures were organizational culture, implementation climate, and readiness for implementation, each aspects of the internal setting. Pragmatic quality ranged from adequate to good, with most measures freely available, brief, and at high school reading level. Few psychometric properties were reported.

Conclusions

Well-tested quantitative measures of implementation internal settings were under-utilized in policy studies. Further development and testing of external context measures are warranted. This review is intended to stimulate measure development and high-quality assessment of health policy implementation outcomes and determinants to help practitioners and researchers spread evidence-informed policies to improve population health.

Registration

Not registered


Contributions to the literature

This systematic review identified 70 quantitative measures of implementation outcomes or determinants in health policy studies.

Readiness to implement and organizational climate and culture were commonly assessed determinants, but fewer studies assessed policy actor relationships or implementation outcomes of acceptability, fidelity/compliance, appropriateness, feasibility, or implementation costs.

Study team members rated most identified measures’ pragmatic properties as good, meaning they are straightforward to use, but few studies documented pilot or psychometric testing of measures.

Further development and dissemination of valid and reliable measures of policy implementation outcomes and determinants can facilitate identification, use, and spread of effective policy implementation strategies.

Background

Despite major impacts of policy on population health [ 1 , 2 , 3 , 4 , 5 , 6 , 7 ], there have been relatively few policy studies in dissemination and implementation (D&I) science to inform implementation strategies and evaluate implementation efforts [ 8 ]. While health outcomes of policies are commonly studied, fewer policy studies assess implementation processes and outcomes. Of 146 D&I studies funded by the National Institutes of Health (NIH) through D&I funding announcements from 2007 to 2014, 12 (8.2%) were policy studies that assessed policy content, policy development processes, or health outcomes of policies, representing 10.5% of NIH D&I funding [ 8 ]. Eight of the 12 studies (66.7%) assessed health outcomes, while only five (41.6%) assessed implementation [ 8 ].

Our ability to explore the differential impact of policy implementation determinants and outcomes and disentangle these from health benefits and other societal outcomes requires high quality quantitative measures [ 9 ]. While systematic reviews of measures of implementation of evidence-based interventions (in clinical and community settings) have been conducted in recent years [ 10 , 11 , 12 , 13 ], to our knowledge, no reviews have explored the quality of quantitative measures of determinants and outcomes of policy implementation.

Policy implementation research in political science and the social sciences has been active since at least the 1970s and has much to contribute to the newer field of D&I research [ 1 , 14 ]. Historically, theoretical frameworks and policy research largely emphasized policy development or analysis of the content of policy documents themselves [ 15 ]. For example, Kingdon’s Multiple Streams Framework and its expansions have been widely used in political science and the social sciences more broadly to describe how factors related to sociopolitical climate, attributes of a proposed policy, and policy actors (e.g., organizations, sectors, individuals) contribute to policy change [ 16 , 17 , 18 ]. Policy frameworks can also inform implementation planning and evaluation in D&I research. Although authors have named policy stages since the 1950s [ 19 , 20 ], Sabatier and Mazmanian’s Policy Implementation Process Framework was one of the first such frameworks to gain widespread use in policy implementation research [ 21 ] and later in health promotion [ 22 ]. Yet, available implementation frameworks are not often used to guide implementation strategies or to explain why a policy worked in one setting but not another [ 23 ]. Without an explicit focus on implementation, the intended benefits of health policies may go unrealized, and the opportunity to move the field forward in understanding policy implementation may be lost (i.e., our collective knowledge building is dampened) [ 24 ].

Differences in perspectives and terminology between D&I research and policy research in political science are important for interpreting the present review. For example, Proctor et al. use the term implementation outcomes for what policy researchers call policy outputs [ 14 , 20 , 25 ]. To non-D&I policy researchers, policy implementation outcomes refer to the health outcomes in the target population [ 20 ]. D&I science uses the term fidelity [ 26 ]; policy researchers write about compliance [ 20 ]. While D&I science uses the terms outer setting, outer context, or external context to point to influences outside the implementing organization [ 26 , 27 , 28 ], non-D&I policy research refers to policy fields [ 24 ], which are networks of agencies that carry out policies and programs.

Valid and reliable quantitative measures of health policy implementation processes are needed to advance from classifying constructs to understanding causality in policy implementation research [ 29 ]. Given limited resources, policy implementers also need to know which aspects of implementation are key to improving policy acceptance, compliance, and sustainability to reap the intended health benefits [ 30 ]. Both pragmatic and psychometrically sound measures are needed to accomplish these objectives [ 10 , 11 , 31 , 32 ], so the field can explore the influence of nuanced determinants and generate reliable and valid findings.

To fill this void in the literature, this systematic review of health policy implementation measures aimed to (1) identify quantitative measures used to assess health policy implementation outcomes (IOF outcomes commonly called policy outputs in policy research) and inner and outer setting determinants, (2) describe and assess pragmatic quality of policy implementation measures, (3) describe and assess the quality of psychometric properties of identified instruments, and (4) elucidate health policy implementation measurement gaps.

Methods

The study team used systematic review procedures developed by Lewis and colleagues for reviews of D&I research measures and received detailed guidance from the Lewis team coauthors for each step [ 10 , 11 ]. We followed the PRISMA reporting guidelines as shown in the checklist (Supplemental Table 1 ). We have also provided a publicly available website of measures identified in this review ( https://www.health-policy-measures.org/ ).

For the purposes of this review, policy and policy implementation are defined as follows. We deemed public policy to include legislation at the federal, state/province/regional unit, or local levels; and governmental regulations, whether mandated by national, state/province, or local level governmental agencies or boards of elected officials (e.g., state boards of education in the USA) [ 4 , 20 ]. Here, public policy implementation is defined as the carrying out of a governmental mandate by public or private organizations and groups of organizations [ 20 ].

Two widely used frameworks from the D&I field guided the present review, along with a third, recently developed framework that bridges policy and D&I research. In the Implementation Outcomes Framework (IOF), Proctor and colleagues identify and define eight implementation outcomes that are differentiated from health outcomes: acceptability, adoption, appropriateness, cost, feasibility, fidelity, penetration, and sustainability [ 25 ]. In the Consolidated Framework for Implementation Research (CFIR), Damschroder and colleagues articulate determinants of implementation, including the domains of intervention characteristics, outer setting, inner setting of an organization, characteristics of individuals within organizations, and process [ 33 ]. Finally, Bullock developed the Policy Implementation Determinants Framework, a balanced framework that emphasizes both internal setting constructs and external setting constructs, including policy actor relationships and networks, political will for implementation, and visibility of policy actors [ 34 ]. The constructs identified in these frameworks were used to guide our list of implementation determinants and outcomes.

Through EBSCO, we searched MEDLINE, PsycInfo, and CINAHL Plus. Through ProQuest, we searched PAIS, Worldwide Political, and ERIC. Due to limited time and staff in the 12-month study, we did not search the grey literature. We used multiple search terms in each of four required levels: health, public policy, implementation, and measurement; Table 1 shows the search terms for each string. Supplemental Tables 2 and 3 show the final search syntax applied in EBSCO and ProQuest.

The authors developed the search strings and terms based on policy implementation framework reviews [ 34 , 35 ], additional policy implementation frameworks [ 21 , 22 ], labels and definitions of the eight implementation outcomes identified by Proctor et al. [ 25 ], CFIR construct labels and definitions [ 9 , 33 ], and additional D&I research and search term sources [ 28 , 36 , 37 , 38 ] (Table 1 ). The full study team provided three rounds of feedback on draft terms, and a library scientist provided additional synonyms and search terms. For each test search, we calculated the percentage of the 18 benchmark articles the search captured; we determined a priori that capturing at least 80% of them was acceptable.
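To make the benchmark check concrete, here is a minimal Python sketch, assuming hypothetical article IDs and a hypothetical search result; it simply computes the fraction of the 18 benchmark articles a candidate search retrieves and compares it with the a priori 80% threshold. It illustrates the logic only and is not the authors' code.

```python
# Minimal sketch (not the authors' code): what fraction of the 18 benchmark
# articles does a candidate search string retrieve?
def benchmark_capture_rate(benchmark_ids, retrieved_ids):
    """Return the proportion of benchmark articles captured by a search."""
    benchmark = set(benchmark_ids)
    return len(benchmark & set(retrieved_ids)) / len(benchmark)

# Hypothetical data: 15 of the 18 benchmark articles are captured.
benchmark = [f"PMID{i:02d}" for i in range(18)]
retrieved = [f"PMID{i:02d}" for i in range(15)] + ["PMID99"]
rate = benchmark_capture_rate(benchmark, retrieved)
print(f"capture rate: {rate:.0%}")                    # capture rate: 83%
print("acceptable" if rate >= 0.80 else "revise search terms")
```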

Inclusion and exclusion criteria

This review addressed only measures of implementation by organizations mandated to act by governmental units or legislation. Measures of behavior change among individuals in target populations as a result of legislation or governmental regulations, and measures of health status change, were outside the scope of this review.

There were several inclusion criteria: (1) empirical studies of the implementation of public policies already passed or approved that addressed physical or behavioral health, (2) quantitative self-report or archival measurement methods utilized, (3) published in peer-reviewed journals from January 1995 through April 2019, (4) published in the English language, (5) public policy implementation studies from any continent or international governing body, and (6) at least two transferable quantitative self-report or archival items that assessed implementation determinants [ 33 , 34 ] and/or IOF implementation outcomes [ 25 ]. This study sought to identify transferable measures that could be used to assess multiple policies and contexts. Here, a transferable item is defined as one that needed no wording changes or only a change in the referent (e.g., policy title or topic such as tobacco or malaria) to make the item applicable to other policies or settings [ 11 ]. The year 1995 was chosen as a starting year because that is about when web-based quantitative surveying began [ 39 ]. Table 2 provides definitions of the IOF implementation outcomes and the selected determinants of implementation. Broader constructs, such as readiness for implementation, contained multiple categories.

Exclusion criteria in the searches included (1) non-empirical health policy journal articles (e.g., conceptual articles, editorials); (2) narrative and systematic reviews; (3) studies with only qualitative assessment of health policy implementation; (4) empirical studies reported in theses and books; (5) health policy studies that only assessed health outcomes (i.e., target population changes in health behavior or status); (6) bill analyses, stakeholder perceptions assessed to inform policy development, and policy content analyses without implementation assessment; (7) studies of changes made in a private business not encouraged by public policy; and (8) studies from countries with authoritarian regimes. We electronically programmed the searches to exclude policy implementation studies from countries that are not democratically governed, due to vast differences in policy environments and implementation factors.

Screening procedures

Citations were downloaded into EndNote version 7.8 and de-duplicated electronically. We conducted dual independent screening of titles and abstracts after two group pilot screening sessions in which we clarified inclusion and exclusion criteria and screening procedures. Abstract screeners used Covidence systematic review software [ 40 ] to code inclusion as yes or no. Articles were included in full-text review if at least one screener coded them as meeting the inclusion criteria. Full-text screening was also conducted by dual independent screening and coded in Covidence [ 40 ], with weekly meetings to reach consensus on inclusion/exclusion discrepancies. Screeners also coded one of the pre-identified reasons for exclusion.
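A minimal sketch of the screening decision rules just described, assuming simple boolean votes from two screeners; this illustrates the logic only and is not the authors' actual Covidence workflow.

```python
# Minimal sketch (not the authors' workflow) of the two-stage screening rules:
# an abstract advances to full text if at least one of two independent
# screeners votes to include; full-text disagreements go to a consensus meeting.
def advance_to_full_text(vote_a, vote_b):
    """Title/abstract stage: include if either screener says yes."""
    return vote_a or vote_b

def full_text_decision(vote_a, vote_b):
    """Full-text stage: agreement stands; disagreements are flagged for consensus."""
    if vote_a == vote_b:
        return "include" if vote_a else "exclude"
    return "discuss at consensus meeting"

print(advance_to_full_text(True, False))   # True
print(full_text_decision(True, False))     # discuss at consensus meeting
```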

Data extraction strategy

Extraction elements included information about (1) measure meta-data (e.g., measure name, total number of items, number of transferable items) and studies (e.g., policy topic, country, setting), (2) development and testing of the measure, (3) implementation outcomes and determinants assessed (Table 2 ), (4) pragmatic characteristics, and (5) psychometric properties. Where needed, authors were emailed to obtain the full measure and measure development information. Two coauthors (MP, CWB) reached consensus on extraction elements. For each included measure, a primary extractor conducted initial entries and coding. Due to time and staff limitations in the 12-month study, we did not search for each empirical use of the measure. A secondary extractor checked the entries, noting any discrepancies for discussion in consensus meetings. Multiple measures in a study were extracted separately.

Quality assessment of measures

To assess the quality of measures, we applied the Psychometric and Pragmatic Evidence Rating Scales (PAPERS) developed by Lewis et al. [ 10 , 11 , 41 , 42 ]. PAPERS includes assessment of five pragmatic instrument characteristics that affect how easy or difficult an instrument is to use: brevity (number of items), simplicity of language (readability level), cost (whether it is freely available), training burden (extent of data collection training needed), and analysis burden (ease or difficulty of interpretation of scoring and results). Lewis and colleagues developed the pragmatic domains and rating scales with input from stakeholders and D&I researchers [ 11 , 41 , 42 ] and developed the psychometric rating scales in collaboration with D&I researchers [ 10 , 11 , 43 ]. The psychometric rating scale has nine properties (Table 3 ): internal consistency; norms; responsiveness; convergent, discriminant, and known-groups construct validity; predictive and concurrent criterion validity; and structural validity. In both the pragmatic and psychometric scales, reported evidence for each domain is scored as poor (− 1), none/not reported (0), minimal/emerging (1), adequate (2), good (3), or excellent (4). Higher values indicate more desirable pragmatic characteristics (e.g., fewer items, freely available, scoring instructions and interpretations provided) and stronger evidence of psychometric properties (e.g., adequate to excellent reliability and validity) (Supplemental Tables 4 and 5 ).

Data synthesis and presentation

This section describes the synthesis of measure transferability, study settings and policy topics of empirical uses, and PAPERS scoring. Two coauthors (MP, CWB) consensus coded measures into three categories of item transferability based on quartile item transferability percentages: mostly transferable (≥ 75% of items deemed transferable), partially transferable (25–74% of items deemed transferable), and setting-specific (< 25% of items deemed transferable). Items were deemed transferable if no wording changes or only a change in the referent (e.g., policy title or topic) was needed to make the item applicable to the implementation of other policies or in other settings. Abstractors coded study settings into one of six categories: hospital or outpatient clinics; mental or behavioral health facilities; healthcare cost, access, or quality; schools; community; and multiple. Abstractors also coded policy topics to healthcare cost, access, or quality; mental or behavioral health; infectious or chronic diseases; and other, while retaining documentation of subtopics such as tobacco, physical activity, and nutrition. Pragmatic scores were totaled for the five properties, with possible total scores of − 5 to 20; higher values indicate an instrument is easier to use. Psychometric property total scores for the nine properties were also calculated, with possible scores of − 9 to 36; higher values indicate evidence of multiple types of validity.
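The sketch below illustrates the synthesis rules described in this section: classifying a measure's transferability from the percentage of transferable items and totaling per-domain PAPERS ratings. It is an illustration under the stated cut points, not the study's code, and the example measure and its ratings are hypothetical.

```python
# Illustrative sketch (not the study's code) of the synthesis rules above.
def transferability_category(pct_transferable):
    """Classify a measure by the percentage of its items deemed transferable."""
    if pct_transferable >= 75:
        return "mostly transferable"
    if pct_transferable >= 25:
        return "partially transferable"
    return "setting-specific"

# PAPERS domain ratings run from -1 (poor) to 4 (excellent); 0 = not reported.
def papers_total(ratings):
    """Sum per-domain PAPERS ratings into a total score."""
    return sum(ratings.values())

# Hypothetical example: 80% of items transferable, mixed ratings across the
# five pragmatic domains (possible totals -5 to 20).
example = {"brevity": 3, "language simplicity": 3, "cost": 4,
           "training burden": 0, "analysis burden": 1}
print(transferability_category(80))   # mostly transferable
print(papers_total(example))          # 11
```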

Results

The database searches yielded 11,684 articles, of which 3267 were duplicates (Fig. 1 ). Titles and abstracts of the 8417 articles were independently screened by two team members; 870 (10.3%) were selected for full-text screening by at least one screener. Of the 870 studies, 804 were excluded at full-text screening or during extraction attempts with the consensus of two coauthors; 66 studies were included. Two coauthors (MP, CWB) reached consensus on extraction and coding of information on 70 unique quantitative eligible measures identified in the 66 included studies plus measure development articles where obtained. Nine measures were used in more than one included study. Detailed information on identified measures is publicly available at https://www.health-policy-measures.org/ .

Fig. 1 PRISMA flow diagram

The most common exclusion reason was lack of transferable items in quantitative measures of policy implementation ( n = 597) (Fig. 1 ). While this review focused on transferable measures across any health issue or setting, researchers addressing specific health policies or settings may find the excluded studies of interest. The frequencies of the remaining exclusion reasons are listed in Fig. 1 .

A variety of health policy topics and settings from over two dozen countries were found in the database searches. For example, the searches identified quantitative and mixed methods implementation studies of legislation (such as tobacco smoking bans), regulations (such as food/menu labeling requirements), governmental policies that mandated specific clinical practices (such as vaccination or access to HIV antiretroviral treatment), school-based interventions (such as government-mandated nutritional content and physical activity), and other public policies.

Among the 70 unique quantitative implementation measures, 15 measures were deemed mostly transferable (at least 75% transferable, Table 4 ). Twenty-three measures were categorized as partially transferable (25 to 74% of items deemed transferable, Table 5 ); 32 measures were setting-specific (< 25% of items deemed transferable, data not shown).

Implementation outcomes

Among the 70 measures, the most commonly assessed implementation outcomes were fidelity/compliance of the policy implementation to the government mandate (26%), acceptability of the policy to implementers (24%), perceived appropriateness of the policy (17%), and feasibility of implementation (17%) (Table 2 ). Fidelity/compliance was sometimes assessed by asking implementers the extent to which they had modified a mandated practice [ 45 ]. Sometimes, detailed checklists were used to assess the extent of compliance with the many mandated policy components, such as school nutrition policies [ 83 ]. Acceptability was assessed by asking staff or healthcare providers in implementing agencies their level of agreement with provided statements about the policy mandate, scored on Likert scales. Only eight (11%) of the included measures used multiple transferable items to assess adoption, and only eight (11%) assessed penetration.

Twenty-six measures of implementation costs were found during full-text screening (10 in included studies and 14 in excluded studies, data not shown). The cost time horizon varied from 12 months to 21 years, with most cost measures assessed at multiple time points. Ten of the 26 measures addressed direct implementation costs. Nine studies reported cost modeling findings. The extensive implementation cost survey developed by Vogler et al. [ 53 ] asked implementing organizations to note policy impacts on medication pricing, margins, reimbursement rates, and insurance co-pays.

Determinants of implementation

Within the 70 included measures, the most commonly assessed implementation determinants were readiness for implementation (61% assessed any readiness component) and the general organizational culture and climate (39%), followed by the specific policy implementation climate within the implementation organization/s (23%), actor relationships and networks (17%), political will for policy implementation (11%), and visibility of the policy role and policy actors (10%) (Table 2 ). Each component of readiness for implementation was commonly assessed: communication of the policy (31%, 22 of 70 measures), policy awareness and knowledge (26%), resources for policy implementation (non-training resources 27%, training 20%), and leadership commitment to implement the policy (19%).

Only two studies assessed organizational structure as a determinant of health policy implementation. Lavinghouze and colleagues assessed organizational stability, defined as whether re-organization happens often, within a set of 9-point Likert items on multiple implementation determinants designed for use with state-level public health practitioners; they also assessed whether public health departments were stand-alone agencies or embedded within agencies addressing additional services, such as social services [ 69 ]. Schneider and colleagues assessed coalition structure as an implementation determinant, including items on the number of organizations and individuals on the coalition roster, the number that regularly attend coalition meetings, and so forth [ 72 ].

Tables of measures

Tables 4 and 5 present the 38 measures of implementation outcomes and/or determinants identified out of the 70 included measures with at least 25% of items transferable (useable in other studies without wording changes or by changing only the policy name or other referent). Table 4 shows 15 mostly transferable measures (at least 75% transferable). Table 5 shows 23 partially transferable measures (25–74% of items deemed transferable). Separate measure development articles were found for 20 of the 38 measures; the remaining measures seemed to be developed for one-time, study-specific use by the empirical study authors cited in the tables. Studies listed in Tables 4 and 5 were conducted most commonly in the USA ( n = 19) or Europe ( n = 11). A few measures were used elsewhere: Africa ( n = 3), Australia ( n = 1), Canada ( n = 1), Middle East ( n = 1), Southeast Asia ( n = 1), or across multiple continents ( n = 1).

Quality of identified measures

Figure 2 shows the median pragmatic quality ratings across the 38 measures with at least 25% transferable items shown in Tables 4 and 5 . Higher scores are desirable and indicate the measures are easier to use (Table 3 ). Overall, the measures were freely available in the public domain (median score = 4), brief with a median of 11–50 items (median score = 3), and had good readability, with a median reading level between 8th and 12th grade (median score = 3). However, instructions on how to score and interpret item scores were lacking, with a median score of 1, indicating the measures did not include suggestions for interpreting score ranges, clear cutoff scores, or instructions for handling missing data. In general, information on training requirements or availability of self-training manuals on how to use the measures was not reported in the included study or measure development article/s (median score = 0, not reported). Total pragmatic rating scores among the 38 measures with at least 25% of items transferable ranged from 7 to 17 (Tables 4 and 5 ), with a median total score of 12 out of a possible total score of 20. Median scores for each pragmatic characteristic were the same across all 70 measures as across the 38 mostly or partially transferable measures, although the median total score across all 70 measures was 11.

Fig. 2 Pragmatic rating scale results across identified measures. Footnote: Pragmatic criteria scores are from the Psychometric and Pragmatic Evidence Rating Scale (PAPERS) (Lewis et al. [ 11 ], Stanick et al. [ 42 ]). Total possible score = 20; total median score across 38 measures = 11. Scores ranged from 0 to 18. Rating scales for each domain are provided in Supplemental Table 4.

Few psychometric properties were reported. The study team also found few reports of pilot testing and measure refinement. Among the 38 measures with at least 25% transferable items, PAPERS psychometric total scores ranged from − 1 to 17 (Tables 4 and 5 ), with a median total score of 5 out of a possible total score of 36. Higher scores indicate more types of validity and reliability were reported with high quality. The 32 measures with calculable norms had a median norms PAPERS score of 3 (good), indicating appropriate sample size and distribution. The nine measures with reported internal consistency mostly showed Cronbach’s alphas in the adequate (0.70 to 0.79) to excellent (≥ 0.90) range, with a median of 0.78 (PAPERS score of 2, adequate), indicating adequate internal consistency. The five measures with reported structural validity had a median PAPERS score of 2, adequate (range 1 to 3, minimal to good), indicating the sample size was sufficient and the factor analysis goodness of fit was reasonable. Among the 38 measures, no reports were found for responsiveness, convergent validity, discriminant validity, known-groups construct validity, or predictive or concurrent criterion validity.
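As a reading aid for the internal-consistency results, the sketch below maps a reported Cronbach's alpha onto a PAPERS-style rating. Only the adequate (0.70 to 0.79) and excellent (≥ 0.90) bands are stated in the text; the other cut points, and the function itself, are illustrative assumptions rather than the published PAPERS criteria.

```python
# Illustrative mapping from a reported Cronbach's alpha to a PAPERS-style
# internal-consistency rating. Bands marked "assumed" are not quoted from the
# PAPERS scale; only the adequate and excellent bands appear in the text.
def alpha_to_papers_rating(alpha):
    if alpha is None:
        return 0      # none / not reported
    if alpha >= 0.90:
        return 4      # excellent (per text)
    if alpha >= 0.80:
        return 3      # good (assumed band)
    if alpha >= 0.70:
        return 2      # adequate (per text)
    if alpha >= 0.60:
        return 1      # minimal/emerging (assumed band)
    return -1         # poor (assumed band)

print(alpha_to_papers_rating(0.78))   # 2 -> adequate, matching the reported median alpha
```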

Discussion

In this systematic review, we sought to identify quantitative measures used to assess health policy implementation outcomes and determinants, rate the pragmatic and psychometric quality of identified measures, and point to future directions to address measurement gaps. In general, the identified measures are easy to use and freely available, but we found little data on validity and reliability. We found more quantitative measures of intra-organizational determinants of policy implementation than measures of the relationships and interactions between organizations that influence policy implementation. We found a limited number of measures that had been developed for or used to assess one of the eight IOF policy implementation outcomes that can be applied to other policies or settings, which may speak more to differences in terms used by policy researchers and D&I researchers than to differences in conceptualizations of policy implementation. Authors used a variety of terms and rarely provided definitions of the constructs the items assessed. Input from experts in policy implementation is needed to better understand and define policy implementation constructs for use across multiple fields involved in policy-related research.

We found that several researchers had used well-tested measures of implementation determinants from D&I research or from the organizational behavior and management literature (Tables 4 and 5 ). For the internal setting of implementing organizations, whether implementation is mandated through public policy or not, well-developed and tested measures are available. However, a number of authors crafted their own items, with or without pilot testing, and used a variety of terms to describe what the items assessed. Further dissemination of well-tested measures to policy researchers is warranted [ 9 , 13 ].

What appears to be a larger gap involves the availability of well-developed and tested quantitative measures of the external context affecting policy implementation that can be used across multiple policy settings and topics [ 9 ]. Lack of attention to how a policy initiative fits with the external implementation context during policymaking and lack of policymaker commitment of adequate resources for implementation contribute to this gap [ 23 , 93 ]. Recent calls and initiatives to integrate health policies during policymaking and implementation planning will bring more attention to external contexts affecting not only policy development but implementation as well [ 93 , 94 , 95 , 96 , 97 , 98 , 99 ]. At the present time, it is not well-known which internal and external determinants are most essential to guide and achieve sustainable policy implementation [ 100 ]. Identification and dissemination of measures that assess factors that facilitate the spread of evidence-informed policy implementation (e.g., relative advantage, flexibility) will also help move policy implementation research forward [ 1 , 9 ].

Given the high potential population health impact of evidence-informed policies, much more attention to policy implementation is needed in D&I research. Few studies from non-D&I researchers reported policy implementation measure development procedures, pilot testing, scoring procedures and interpretation, training of data collectors, or data analysis procedures. Policy implementation research could benefit from the rigor of D&I quantitative research methods. And D&I researchers have much to learn about the contexts and practical aspects of policy implementation and can look to the rich depth of information in qualitative and mixed methods studies from other fields to inform quantitative measure development and testing [ 101 , 102 , 103 ].

Limitations

This systematic review has several limitations. First, the four levels of the search string and multiple search terms in each level were applied only to the title, abstract, and subject headings, due to limitations of the search engines, so we likely missed pertinent studies. Second, a systematic approach with stakeholder input is needed to expand the definitions of IOF implementation outcomes for policy implementation. Third, although the authors value intra-organizational policymaking and implementation, the study team restricted the search to governmental policies due to limited time and staffing in the 12-month study. Fourth, by excluding tools with only policy-specific implementation measures, we excluded some well-developed and tested instruments in abstract and full-text screening. Since only 12 measures had 100% transferable items, researchers may need to pilot test wording modifications of other items. And finally, due to limited time and staffing, we only searched online for measures and measures development articles and may have missed separately developed pragmatic information, such as training and scoring materials not reported in a manuscript.

Despite the limitations, several recommendations for measure development follow from the findings and related literature [ 1 , 11 , 20 , 35 , 41 , 104 ], including the need to (1) conduct systematic, mixed-methods procedures (concept mapping, expert panels) to refine policy implementation outcomes, (2) expand and more fully specify external context domains for policy implementation research and evaluation, (3) identify and disseminate well-developed measures for specific policy topics and settings, (4) ensure that policy implementation improves equity rather than exacerbating disparities [ 105 ], and (5) develop evidence-informed policy implementation guidelines.

Conclusion

Easy-to-use, reliable, and valid quantitative measures of policy implementation can further our understanding of policy implementation processes, determinants, and outcomes. Due to the wide array of health policy topics and implementation settings, sound quantitative measures that can be applied across topics and settings will help speed learnings from individual studies and aid in the transfer from research to practice. Quantitative measures can inform the implementation of evidence-informed policies to further the spread and effective implementation of policies to ultimately reap greater population health benefit. This systematic review of measures is intended to stimulate measure development and high-quality assessment of health policy implementation outcomes and predictors to help practitioners and researchers spread evidence-informed policies to improve population health and reduce inequities.

Availability of data and materials

A compendium of identified measures is available for dissemination at https://www.health-policy-measures.org/ . A link will be provided on the website of the Prevention Research Center, Brown School, Washington University in St. Louis, at https://prcstl.wustl.edu/ . The authors invite interested organizations to provide a link to the compendium. Citations and abstracts of excluded policy-specific measures are available on request.

Abbreviations

CFIR: Consolidated Framework for Implementation Research

CINAHL: Cumulative Index of Nursing and Allied Health Literature

D&I: Dissemination and implementation science

EBSCO: Elton B. Stephens Company

ERIC: Education Resources Information Center

IOF: Implementation Outcomes Framework

PAPERS: Psychometric and Pragmatic Evidence Rating Scale

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

Purtle J, Dodson EA, Brownson RC. Policy dissemination research. In: Brownson RC, Colditz GA, Proctor EK, editors. Dissemination and Implementation Research in Health: Translating Science to Practice, Second Edition. New York: Oxford University Press; 2018.


Brownson RC, Baker EA, Deshpande AD, Gillespie KN. Evidence-based public health. Third ed. New York, NY: Oxford University Press; 2018.

Guide to Community Preventive Services. About the Community Guide. Community Preventive Services Task Force; 2020 [updated October 03, 2019; cited 2020]. Available from: https://www.thecommunityguide.org/.

Eyler AA, Chriqui JF, Moreland-Russell S, Brownson RC, editors. Prevention, policy, and public health, first edition. New York, NY: Oxford University Press; 2016.

Andre FE, Booy R, Bock HL, Clemens J, Datta SK, John TJ, et al. Vaccination greatly reduces disease, disability, death, and inequity worldwide. Geneva, Switzerland: World Health Organization; February 2008. Contract No.: 07-040089.

Cheng JJ, Schuster-Wallace CJ, Watt S, Newbold BK, Mente A. An ecological quantification of the relationships between water, sanitation and infant, child, and maternal mortality. Environ Health. 2012;11:4.


Levy DT, Li Y, Yuan Z. Impact of nations meeting the MPOWER targets between 2014 and 2016: an update. Tob Control. 2019.

Purtle J, Peters R, Brownson RC. A review of policy dissemination and implementation research funded by the National Institutes of Health, 2007-2014. Implement Sci. 2016;11:1.

Lewis CC, Proctor EK, Brownson RC. Measurement issues in dissemination and implementation research. In: Brownson RC, Colditz GA, Proctor EK, editors. Dissemination and Implementation Research in Health: Translating Science to Practice, Second Edition. New York: Oxford University Press; 2018.

Lewis CC, Fischer S, Weiner BJ, Stanick C, Kim M, Martinez RG. Outcomes for implementation science: an enhanced systematic review of instruments using evidence-based rating criteria. Implement Sci. 2015;10:155.

Lewis CC, Mettert KD, Dorsey CN, Martinez RG, Weiner BJ, Nolen E, et al. An updated protocol for a systematic review of implementation-related measures. Syst Rev. 2018;7(1):66.

Chaudoir SR, Dugan AG, Barr CH. Measuring factors affecting implementation of health innovations: a systematic review of structural, organizational, provider, patient, and innovation level measures. Implement Sci. 2013;8:22.

Rabin BA, Lewis CC, Norton WE, Neta G, Chambers D, Tobin JN, et al. Measurement resources for dissemination and implementation research in health. Implement Sci. 2016;11:42.

Nilsen P, Stahl C, Roback K, Cairney P. Never the twain shall meet?--a comparison of implementation science and policy implementation research. Implement Sci. 2013;8:63.

Sabatier PA, editor. Theories of the Policy Process. New York, NY: Routledge; 2019.

Kingdon J. Agendas, alternatives, and public policies. 2nd ed. New York: Longman; 1995.

Jones MD, Peterson HL, Pierce JJ, Herweg N, Bernal A, Lamberta Raney H, et al. A river runs through it: a multiple streams meta-review. Policy Stud J. 2016;44(1):13–36.

Fowler L. Using the multiple streams framework to connect policy adoption to implementation. Policy Studies Journal. 2020 (11 Feb).

Howlett M, Mukherjee I, Woo JJ. From tools to toolkits in policy design studies: the new design orientation towards policy formulation research. Policy Polit. 2015;43(2):291–311.

Natesan SD, Marathe RR. Literature review of public policy implementation. Int J Public Policy. 2015;11(4):219–38.

Sabatier PA, Mazmanian DA. Implementation of public policy: a framework of analysis. Policy Studies Journal. 1980 (January).

Sabatier PA. Theories of the Policy Process. Westview; 2007.

Tomm-Bonde L, Schreiber RS, Allan DE, MacDonald M, Pauly B, Hancock T, et al. Fading vision: knowledge translation in the implementation of a public health policy intervention. Implement Sci. 2013;8:59.

Roll S, Moulton S, Sandfort J. A comparative analysis of two streams of implementation research. Journal of Public and Nonprofit Affairs. 2017;3(1):3–22.

Proctor E, Silmere H, Raghavan R, Hovmand P, Aarons G, Bunger A, et al. Outcomes for implementation research: conceptual distinctions, measurement challenges, and research agenda. Admin Pol Ment Health. 2011;38(2):65–76.

Brownson RC, Colditz GA, Proctor EK, editors. Dissemination and implementation research in health: translating science to practice, second edition. New York: Oxford University Press; 2018.

Tabak RG, Khoong EC, Chambers DA, Brownson RC. Bridging research and practice: models for dissemination and implementation research. Am J Prev Med. 2012;43(3):337–50.

Rabin BA, Brownson RC, Haire-Joshu D, Kreuter MW, Weaver NL. A glossary for dissemination and implementation research in health. J Public Health Manag Pract. 2008;14(2):117–23.


Lewis CC, Klasnja P, Powell BJ, Lyon AR, Tuzzio L, Jones S, et al. From classification to causality: advancing understanding of mechanisms of change in implementation science. Front Public Health. 2018;6:136.

Boyd MR, Powell BJ, Endicott D, Lewis CC. A method for tracking implementation strategies: an exemplar implementing measurement-based care in community behavioral health clinics. Behav Ther. 2018;49(4):525–37.

Glasgow RE. What does it mean to be pragmatic? Pragmatic methods, measures, and models to facilitate research translation. Health Educ Behav. 2013;40(3):257–65.

Glasgow RE, Riley WT. Pragmatic measures: what they are and why we need them. Am J Prev Med. 2013;45(2):237–43.

Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implement Sci. 2009;4:50.

Bullock HL. Understanding the implementation of evidence-informed policies and practices from a policy perspective: a critical interpretive synthesis in: How do systems achieve their goals? the role of implementation in mental health systems improvement [Dissertation]. Hamilton, Ontario: McMaster University; 2019.

Watson DP, Adams EL, Shue S, Coates H, McGuire A, Chesher J, et al. Defining the external implementation context: an integrative systematic literature review. BMC Health Serv Res. 2018;18(1):209.

McKibbon KA, Lokker C, Wilczynski NL, Ciliska D, Dobbins M, Davis DA, et al. A cross-sectional study of the number and frequency of terms used to refer to knowledge translation in a body of health literature in 2006: a Tower of Babel? Implement Sci. 2010;5:16.

Terwee CB, Jansma EP, Riphagen II, de Vet HC. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18(8):1115–23.

Egan M, Maclean A, Sweeting H, Hunt K. Comparing the effectiveness of using generic and specific search terms in electronic databases to identify health outcomes for a systematic review: a prospective comparative study of literature search method. BMJ Open. 2012;2:3.

Dillman DA, Smyth JD, Christian LM. Internet, mail, and mixed-mode surveys: the tailored design method. Hoboken, NJ: John Wiley & Sons; 2009.

Covidence systematic review software. Melbourne, Australia: Veritas Health Innovation. https://www.covidence.org . Accessed Mar 2019.

Powell BJ, Stanick CF, Halko HM, Dorsey CN, Weiner BJ, Barwick MA, et al. Toward criteria for pragmatic measurement in implementation research and practice: a stakeholder-driven approach using concept mapping. Implement Sci. 2017;12(1):118.

Stanick CF, Halko HM, Nolen EA, Powell BJ, Dorsey CN, Mettert KD, et al. Pragmatic measures for implementation research: development of the Psychometric and Pragmatic Evidence Rating Scale (PAPERS). Transl Behav Med. 2019.

Henrikson NB, Blasi PR, Dorsey CN, Mettert KD, Nguyen MB, Walsh-Bailey C, et al. Psychometric and pragmatic properties of social risk screening tools: a systematic review. Am J Prev Med. 2019;57(6S1):S13–24.

Stirman SW, Miller CJ, Toder K, Calloway A. Development of a framework and coding system for modifications and adaptations of evidence-based interventions. Implement Sci. 2013;8:65.

Lau AS, Brookman-Frazee L. The 4KEEPS study: identifying predictors of sustainment of multiple practices fiscally mandated in children’s mental health services. Implement Sci. 2016;11:1–8.

Ekvall G. Organizational climate for creativity and innovation. European J Work Organizational Psychology. 1996;5(1):105–23.

Lövgren G, Eriksson S, Sandman PO. Effects of an implemented care policy on patient and personnel experiences of care. Scand J Caring Sci. 2002;16(1):3–11.

Dwyer DJ, Ganster DC. The effects of job demands and control on employee attendance and satisfaction. J Organ Behav. 1991;12:595–608.

Condon-Paoloni D, Yeatman HR, Grigonis-Deane E. Health-related claims on food labels in Australia: understanding environmental health officers’ roles and implications for policy. Public Health Nutr. 2015;18(1):81–8.

Patterson MG, West MA, Shackleton VJ, Dawson JF, Lawthom R, Maitlis S, et al. Validating the organizational climate measure: links to managerial practices, productivity and innovation. J Organ Behav. 2005;26:379–408.

Glisson C, Green P, Williams NJ. Assessing the Organizational Social Context (OSC) of child welfare systems: implications for research and practice. Child Abuse Negl. 2012;36(9):621–32.

Beidas RS, Aarons G, Barg F, Evans A, Hadley T, Hoagwood K, et al. Policy to implementation: evidence-based practice in community mental health--study protocol. Implement Sci. 2013;8(1):38.

Eisenberger R, Cummings J, Armeli S, Lynch P. Perceived organizational support, discretionary treatment, and job satisfaction. J Appl Psychol. 1997;82:812–20.


Eby L, George K, Brown BL. Going tobacco-free: predictors of clinician reactions and outcomes of the NY state office of alcoholism and substance abuse services tobacco-free regulation. J Subst Abus Treat. 2013;44(3):280–7.

Vogler S, Zimmermann N, de Joncheere K. Policy interventions related to medicines: survey of measures taken in European countries during 2010-2015. Health Policy. 2016;120(12):1363–77.

Wanberg CR, Banas JT. Predictors and outcomes of openness to change in a reorganizing workplace. J Appl Psychol. 2000;85:132–42.


Hardy LJ, Wertheim P, Bohan K, Quezada JC, Henley E. A model for evaluating the activities of a coalition-based policy action group: the case of Hermosa Vida. Health Promot Pract. 2013;14(4):514–23.

Gavriilidis G, Östergren P-O. Evaluating a traditional medicine policy in South Africa: phase 1 development of a policy assessment tool. Glob Health Action. 2012;5:17271.

Hongoro C, Rutebemberwa E, Twalo T, Mwendera C, Douglas M, Mukuru M, et al. Analysis of selected policies towards universal health coverage in Uganda: the policy implementation barometer protocol. Archives Public Health. 2018;76:12.

Roeseler A, Solomon M, Beatty C, Sipler AM. The tobacco control network’s policy readiness and stage of change assessment: what the results suggest for moving tobacco control efforts forward at the state and territorial levels. J Public Health Manag Pract. 2016;22(1):9–19.

Brämberg EB, Klinga C, Jensen I, Busch H, Bergström G, Brommels M, et al. Implementation of evidence-based rehabilitation for non-specific back pain and common mental health problems: a process evaluation of a nationwide initiative. BMC Health Serv Res. 2015;15(1):79.

Rütten A, Lüschen G, von Lengerke T, Abel T, Kannas L, Rodríguez Diaz JA, et al. Determinants of health policy impact: comparative results of a European policymaker study. Sozial-Und Praventivmedizin. 2003;48(6):379–91.

Smith SN, Lai Z, Almirall D, Goodrich DE, Abraham KM, Nord KM, et al. Implementing effective policy in a national mental health reengagement program for veterans. J Nerv Ment Dis. 2017;205(2):161–70.

Carasso BS, Lagarde M, Cheelo C, Chansa C, Palmer N. Health worker perspectives on user fee removal in Zambia. Hum Resour Health. 2012;10:40.

Goldsmith RE, Hofacker CF. Measuring consumer innovativeness. J Acad Mark Sci. 1991;19(3):209–21.

Webster CA, Caputi P, Perreault M, Doan R, Doutis P, Weaver RG. Elementary classroom teachers’ adoption of physical activity promotion in the context of a statewide policy: an innovation diffusion and socio-ecologic perspective. J Teach Phys Educ. 2013;32(4):419–40.

Aarons GA, Glisson C, Hoagwood K, Kelleher K, Landsverk J, Cafri G. Psychometric properties and U.S. National norms of the Evidence-Based Practice Attitude Scale (EBPAS). Psychol Assess. 2010;22(2):356–65.

Gill KJ, Campbell E, Gauthier G, Xenocostas S, Charney D, Macaulay AC. From policy to practice: implementing frontline community health services for substance dependence--study protocol. Implement Sci. 2014;9:108.

Lavinghouze SR, Price AW, Parsons B. The environmental assessment instrument: harnessing the environment for programmatic success. Health Promot Pract. 2009;10(2):176–85.

Bull FC, Milton K, Kahlmeier S. National policy on physical activity: the development of a policy audit tool. J Phys Act Health. 2014;11(2):233–40.

Bull F, Milton K, Kahlmeier S, Arlotti A, Juričan AB, Belander O, et al. Turning the tide: national policy approaches to increasing physical activity in seven European countries. British J Sports Med. 2015;49(11):749–56.

Schneider EC, Smith ML, Ory MG, Altpeter M, Beattie BL, Scheirer MA, et al. State fall prevention coalitions as systems change agents: an emphasis on policy. Health Promot Pract. 2016;17(2):244–53.

Helfrich CD, Savitz LA, Swiger KD, Weiner BJ. Adoption and implementation of mandated diabetes registries by community health centers. Am J Prev Med. 2007;33(1,Suppl):S50-S65.

Donchin M, Shemesh AA, Horowitz P, Daoud N. Implementation of the Healthy Cities’ principles and strategies: an evaluation of the Israel Healthy Cities network. Health Promot Int. 2006;21(4):266–73.

Were MC, Emenyonu N, Achieng M, Shen C, Ssali J, Masaba JP, et al. Evaluating a scalable model for implementing electronic health records in resource-limited settings. J Am Med Inform Assoc. 2010;17(3):237–44.

Konduri N, Sawyer K, Nizova N. User experience analysis of e-TB Manager, a nationwide electronic tuberculosis recording and reporting system in Ukraine. ERJ Open Research. 2017;3:2.

McDonnell E, Probart C. School wellness policies: employee participation in the development process and perceptions of the policies. J Child Nutr Manag. 2008;32:1.

Mersini E, Hyska J, Burazeri G. Evaluation of national food and nutrition policy in Albania. Zdravstveno Varstvo. 2017;56(2):115–23.

Cavagnero E, Daelmans B, Gupta N, Scherpbier R, Shankar A. Assessment of the health system and policy environment as a critical complement to tracking intervention coverage for maternal, newborn, and child health. Lancet. 2008;371(9620):1284–93.

Lehman WE, Greener JM, Simpson DD. Assessing organizational readiness for change. J Subst Abus Treat. 2002;22(4):197–209.

Pankratz M, Hallfors D, Cho H. Measuring perceptions of innovation adoption: the diffusion of a federal drug prevention policy. Health Educ Res. 2002;17(3):315–26.

Cook JM, Thompson R, Schnurr PP. Perceived characteristics of intervention scale: development and psychometric properties. Assessment. 2015;22(6):704–14.

Probart C, McDonnell ET, Jomaa L, Fekete V. Lessons from Pennsylvania’s mixed response to federal school wellness law. Health Aff. 2010;29(3):447–53.

Probart C, McDonnell E, Weirich JE, Schilling L, Fekete V. Statewide assessment of local wellness policies in Pennsylvania public school districts. J Am Diet Assoc. 2008;108(9):1497–502.

Rakic S, Novakovic B, Stevic S, Niskanovic J. Introduction of safety and quality standards for private health care providers: a case-study from the Republic of Srpska, Bosnia and Herzegovina. Int J Equity Health. 2018;17(1):92.

Rozema AD, Mathijssen JJP, Jansen MWJ, van Oers JAM. Sustainability of outdoor school ground smoking bans at secondary schools: a mixed-method study. Eur J Pub Health. 2018;28(1):43–9.

Barbero C, Moreland-Russell S, Bach LE, Cyr J. An evaluation of public school district tobacco policies in St. Louis County, Missouri. J Sch Health. 2013;83(8):525–32.

Williams KM, Kirsh S, Aron D, Au D, Helfrich C, Lambert-Kerzner A, et al. Evaluation of the Veterans Health Administration’s specialty care transformational initiatives to promote patient-centered delivery of specialty care: a mixed-methods approach. Telemed J E-Health. 2017;23(7):577–89.

Spencer E, Walshe K. National quality improvement policies and strategies in European healthcare systems. Quality Safety Health Care. 2009;18(Suppl 1):i22–i7.

Assunta M, Dorotheo EU. SEATCA Tobacco Industry Interference Index: a tool for measuring implementation of WHO Framework Convention on Tobacco Control Article 5.3. Tob Control. 2016;25(3):313–8.

Tummers L. Policy alienation of public professionals: the construct and its measurement. Public Adm Rev. 2012;72(4):516–25.

Tummers L, Bekkers V. Policy implementation, street-level bureaucracy, and the importance of discretion. Public Manag Rev. 2014;16(4):527–47.

Raghavan R, Bright CL, Shadoin AL. Toward a policy ecology of implementation of evidence-based practices in public mental health settings. Implement Sci. 2008;3:26.

Peters D, Harting J, van Oers H, Schuit J, de Vries N, Stronks K. Manifestations of integrated public health policy in Dutch municipalities. Health Promot Int. 2016;31(2):290–302.

Tosun J, Lang A. Policy integration: mapping the different concepts. Policy Studies. 2017;38(6):553–70.

Tubbing L, Harting J, Stronks K. Unravelling the concept of integrated public health policy: concept mapping with Dutch experts from science, policy, and practice. Health Policy. 2015;119(6):749–59.

Donkin A, Goldblatt P, Allen J, Nathanson V, Marmot M. Global action on the social determinants of health. BMJ Glob Health. 2017;3(Suppl 1):e000603.

Baum F, Friel S. Politics, policies and processes: a multidisciplinary and multimethods research programme on policies on the social determinants of health inequity in Australia. BMJ Open. 2017;7(12):e017772.

Delany T, Lawless A, Baum F, Popay J, Jones L, McDermott D, et al. Health in All Policies in South Australia: what has supported early implementation? Health Promot Int. 2016;31(4):888–98.

Valaitis R, MacDonald M, Kothari A, O'Mara L, Regan S, Garcia J, et al. Moving towards a new vision: implementation of a public health policy intervention. BMC Public Health. 2016;16:412.

Bennett LM, Gadlin H, Marchand C. Collaboration and team science: a field guide. Bethesda, MD: National Cancer Institute, National Institutes of Health; 2018. Contract No.: NIH Publication No. 18-7660.

Mazumdar M, Messinger S, Finkelstein DM, Goldberg JD, Lindsell CJ, Morton SC, et al. Evaluating academic scientists collaborating in team-based research: a proposed framework. Acad Med. 2015;90(10):1302–8.

Brownson RC, Fielding JE, Green LW. Building capacity for evidence-based public health: reconciling the pulls of practice and the push of research. Annu Rev Public Health. 2018;39:27–53.

Brownson RC, Colditz GA, Proctor EK. Future issues in dissemination and implementation research. In: Brownson RC, Colditz GA, Proctor EK, editors. Dissemination and Implementation Research in Health: Translating Science to Practice, Second Edition. New York: Oxford University Press; 2018.

Thomson K, Hillier-Brown F, Todd A, McNamara C, Huijts T, Bambra C. The effects of public health policies on health inequalities in high-income countries: an umbrella review. BMC Public Health. 2018;18(1):869.


Acknowledgements

The authors are grateful for the policy expertise and guidance of Alexandra Morshed and the administrative support of Mary Adams, Linda Dix, and Cheryl Valko at the Prevention Research Center, Brown School, Washington University in St. Louis. We thank Lori Siegel, librarian, Brown School, Washington University in St. Louis, for assistance with search terms and procedures. We appreciate the D&I contributions of Enola Proctor and Byron Powell at the Brown School, Washington University in St. Louis, that informed this review. We thank Russell Glasgow, University of Colorado Denver, for guidance on the overall review and pragmatic measure criteria.

This project was funded March 2019 through February 2020 by the Foundation for Barnes-Jewish Hospital, with support from the Washington University in St. Louis Institute of Clinical and Translational Science Pilot Program, NIH/National Center for Advancing Translational Sciences (NCATS) grant UL1 TR002345. The project was also supported by the National Cancer Institute P50CA244431, Cooperative Agreement number U48DP006395-01-00 from the Centers for Disease Control and Prevention, R01MH106510 from the National Institute of Mental Health, and the National Institute of Diabetes and Digestive and Kidney Diseases award number P30DK020579. The findings and conclusions in this paper are those of the authors and do not necessarily represent the official positions of the Foundation for Barnes-Jewish Hospital, Washington University in St. Louis Institute of Clinical and Translational Science, National Institutes of Health, or the Centers for Disease Control and Prevention.

Author information

Authors and Affiliations

Prevention Research Center, Brown School, Washington University in St. Louis, One Brookings Drive, Campus Box 1196, St. Louis, MO, 63130, USA

Peg Allen, Meagan Pilar, Callie Walsh-Bailey, Stephanie Mazzucca, Maura M. Kepper & Ross C. Brownson

School of Social Work, Brigham Young University, 2190 FJSB, Provo, UT, 84602, USA

Cole Hooley

Kaiser Permanente Washington Health Research Institute, 1730 Minor Ave, Seattle, WA, 98101, USA

Cara C. Lewis, Kayne D. Mettert & Caitlin N. Dorsey

Department of Health Management & Policy, Drexel University Dornsife School of Public Health, Nesbitt Hall, 3215 Market St, Philadelphia, PA, 19104, USA

Jonathan Purtle

Brown School, Washington University in St. Louis, One Brookings Drive, Campus Box 1196, St. Louis, MO, 63130, USA

Ana A. Baumann

Department of Surgery (Division of Public Health Sciences) and Alvin J. Siteman Cancer Center, Washington University School of Medicine, 4921 Parkview Place, Saint Louis, MO, 63110, USA

Ross C. Brownson

Contributions

Review methodology and quality assessment scale: CCL, KDM, CND. Eligibility criteria: PA, RCB, CND, KDM, SM, MP, JP. Search strings and terms: CH, PA, MP with review by AB, RCB, CND, CCL, MMK, SM, KDM. Framework selection: PA, AB, CH, MP. Abstract screening: PA, CH, MMK, SM, MP. Full-text screening: PA, CH, MP. Pilot extraction: PA, DNC, CH, KDM, SM, MP. Data extraction: MP, CWB. Data aggregation: MP, CWB. Writing: PA, RCB, JP. Editing: RCB, JP, SM, AB, CD, CH, MMK, CCL, KM, MP, CWB. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Peg Allen .

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1: Table S1. PRISMA checklist. Table S2. Electronic search terms for databases searched through EBSCO. Table S3. Electronic search terms for searches conducted through PROQUEST. Table S4. PAPERS pragmatic rating scales. Table S5. PAPERS psychometric rating scales.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Allen, P., Pilar, M., Walsh-Bailey, C. et al. Quantitative measures of health policy implementation determinants and outcomes: a systematic review. Implementation Sci 15 , 47 (2020). https://doi.org/10.1186/s13012-020-01007-w

Received : 24 March 2020

Accepted : 05 June 2020

Published : 19 June 2020

DOI : https://doi.org/10.1186/s13012-020-01007-w

Keywords

  • Implementation science
  • Health policy
  • Policy implementation
  • Implementation
  • Public policy
  • Psychometric

  • Open access
  • Published: 11 December 2020

Quantifying and addressing the prevalence and bias of study designs in the environmental and social sciences

  • Alec P. Christie   ORCID: orcid.org/0000-0002-8465-8410 1 ,
  • David Abecasis   ORCID: orcid.org/0000-0002-9802-8153 2 ,
  • Mehdi Adjeroud 3 ,
  • Juan C. Alonso   ORCID: orcid.org/0000-0003-0450-7434 4 ,
  • Tatsuya Amano   ORCID: orcid.org/0000-0001-6576-3410 5 ,
  • Alvaro Anton   ORCID: orcid.org/0000-0003-4108-6122 6 ,
  • Barry P. Baldigo   ORCID: orcid.org/0000-0002-9862-9119 7 ,
  • Rafael Barrientos   ORCID: orcid.org/0000-0002-1677-3214 8 ,
  • Jake E. Bicknell   ORCID: orcid.org/0000-0001-6831-627X 9 ,
  • Deborah A. Buhl 10 ,
  • Just Cebrian   ORCID: orcid.org/0000-0002-9916-8430 11 ,
  • Ricardo S. Ceia   ORCID: orcid.org/0000-0001-7078-0178 12 , 13 ,
  • Luciana Cibils-Martina   ORCID: orcid.org/0000-0002-2101-4095 14 , 15 ,
  • Sarah Clarke 16 ,
  • Joachim Claudet   ORCID: orcid.org/0000-0001-6295-1061 17 ,
  • Michael D. Craig 18 , 19 ,
  • Dominique Davoult 20 ,
  • Annelies De Backer   ORCID: orcid.org/0000-0001-9129-9009 21 ,
  • Mary K. Donovan   ORCID: orcid.org/0000-0001-6855-0197 22 , 23 ,
  • Tyler D. Eddy 24 , 25 , 26 ,
  • Filipe M. França   ORCID: orcid.org/0000-0003-3827-1917 27 ,
  • Jonathan P. A. Gardner   ORCID: orcid.org/0000-0002-6943-2413 26 ,
  • Bradley P. Harris 28 ,
  • Ari Huusko 29 ,
  • Ian L. Jones 30 ,
  • Brendan P. Kelaher 31 ,
  • Janne S. Kotiaho   ORCID: orcid.org/0000-0002-4732-784X 32 , 33 ,
  • Adrià López-Baucells   ORCID: orcid.org/0000-0001-8446-0108 34 , 35 , 36 ,
  • Heather L. Major   ORCID: orcid.org/0000-0002-7265-1289 37 ,
  • Aki Mäki-Petäys 38 , 39 ,
  • Beatriz Martín 40 , 41 ,
  • Carlos A. Martín 8 ,
  • Philip A. Martin 1 , 42 ,
  • Daniel Mateos-Molina   ORCID: orcid.org/0000-0002-9383-0593 43 ,
  • Robert A. McConnaughey   ORCID: orcid.org/0000-0002-8537-3695 44 ,
  • Michele Meroni 45 ,
  • Christoph F. J. Meyer   ORCID: orcid.org/0000-0001-9958-8913 34 , 35 , 46 ,
  • Kade Mills 47 ,
  • Monica Montefalcone 48 ,
  • Norbertas Noreika   ORCID: orcid.org/0000-0002-3853-7677 49 , 50 ,
  • Carlos Palacín 4 ,
  • Anjali Pande 26 , 51 , 52 ,
  • C. Roland Pitcher   ORCID: orcid.org/0000-0003-2075-4347 53 ,
  • Carlos Ponce 54 ,
  • Matt Rinella 55 ,
  • Ricardo Rocha   ORCID: orcid.org/0000-0003-2757-7347 34 , 35 , 56 ,
  • María C. Ruiz-Delgado 57 ,
  • Juan J. Schmitter-Soto   ORCID: orcid.org/0000-0003-4736-8382 58 ,
  • Jill A. Shaffer   ORCID: orcid.org/0000-0003-3172-0708 10 ,
  • Shailesh Sharma   ORCID: orcid.org/0000-0002-7918-4070 59 ,
  • Anna A. Sher   ORCID: orcid.org/0000-0002-6433-9746 60 ,
  • Doriane Stagnol 20 ,
  • Thomas R. Stanley 61 ,
  • Kevin D. E. Stokesbury 62 ,
  • Aurora Torres 63 , 64 ,
  • Oliver Tully 16 ,
  • Teppo Vehanen   ORCID: orcid.org/0000-0003-3441-6787 65 ,
  • Corinne Watts 66 ,
  • Qingyuan Zhao 67 &
  • William J. Sutherland 1 , 42  

Nature Communications volume  11 , Article number:  6377 ( 2020 ) Cite this article

14k Accesses

47 Citations

68 Altmetric

Metrics details

  • Environmental impact
  • Scientific community
  • Social sciences

Building trust in science and evidence-based decision-making depends heavily on the credibility of studies and their findings. Researchers employ many different study designs that vary in their risk of bias to evaluate the true effect of interventions or impacts. Here, we empirically quantify, on a large scale, the prevalence of different study designs and the magnitude of bias in their estimates. Randomised designs and controlled observational designs with pre-intervention sampling were used by just 23% of intervention studies in biodiversity conservation, and 36% of intervention studies in social science. We demonstrate, through pairwise within-study comparisons across 49 environmental datasets, that these types of designs usually give less biased estimates than simpler observational designs. We propose a model-based approach to combine study estimates that may suffer from different levels of study design bias, discuss the implications for evidence synthesis, and how to facilitate the use of more credible study designs.

Introduction

The ability of science to reliably guide evidence-based decision-making hinges on the accuracy and credibility of studies and their results 1 , 2 . Well-designed, randomised experiments are widely accepted to yield more credible results than non-randomised, ‘observational studies’ that attempt to approximate and mimic randomised experiments 3 . Randomisation is a key element of study design that is widely used across many disciplines because of its ability to remove confounding biases (through random assignment of the treatment or impact of interest 4 , 5 ). However, ethical, logistical, and economic constraints often prevent the implementation of randomised experiments, whereas non-randomised observational studies have become popular as they take advantage of historical data for new research questions, larger sample sizes, less costly implementation, and more relevant and representative study systems or populations 6 , 7 , 8 , 9 . Observational studies nevertheless face the challenge of accounting for confounding biases without randomisation, which has led to innovations in study design.

We define ‘study design’ as an organised way of collecting data. Importantly, we distinguish between data collection and statistical analysis (as opposed to other authors 10 ) because of the belief that bias introduced by a flawed design is often much more important than bias introduced by statistical analyses. This was emphasised by Light, Singer & Willet 11 (p. 5): “You can’t fix by analysis what you bungled by design…”; and Rubin 3 : “Design trumps analysis.” Nevertheless, the importance of study design has often been overlooked in debates over the inability of researchers to reproduce the original results of published studies (so-called ‘reproducibility crises’ 12 , 13 ) in favour of other issues (e.g., p-hacking 14 and Hypothesizing After Results are Known or ‘HARKing’ 15 ).

To demonstrate the importance of study designs, we can use the following decomposition of estimation error 16:

\(\text{Estimation error} \;=\; \hat{\beta} - \beta \;=\; \text{design bias} + \text{modelling bias} + \text{statistical noise}. \qquad (1)\)

This demonstrates that even if we improve the quality of modelling and analysis (to reduce modelling bias through a better bias-variance trade-off 17 ) or increase sample size (to reduce statistical noise), we cannot remove the intrinsic bias introduced by the choice of study design (design bias) unless we collect the data in a different way. The importance of study design in determining the levels of bias in study results therefore cannot be overstated.

For the purposes of this study we consider six commonly used study designs; differences and connections can be visualised in Fig.  1 . There are three major components that allow us to define these designs: randomisation, sampling before and after the impact of interest occurs, and the use of a control group.

Fig. 1: A hypothetical study set-up is shown where the abundance of birds in three impact and control replicates (e.g., fields represented by blocks in a row) is monitored before and after an impact (e.g., ploughing) that occurs in year zero. Different colours represent each study design and illustrate how replicates are sampled. Approaches for calculating an estimate of the true effect of the impact for each design are also shown, along with synonyms from different disciplines.

Of the non-randomised observational designs, the Before-After Control-Impact (BACI) design uses a control group and samples before and after the impact occurs (i.e., in the ‘before-period’ and the ‘after-period’). Its rationale is to explicitly account for pre-existing differences between the impact group (exposed to the impact) and control group in the before-period, which might otherwise bias the estimate of the impact’s true effect 6 , 18 , 19 .

The BACI design improves upon several other commonly used observational study designs, of which there are two uncontrolled designs: After, and Before-After (BA). An After design monitors an impact group in the after-period, while a BA design compares the state of the impact group between the before- and after-periods. Both designs can be expected to yield poor estimates of the impact’s true effect (large design bias; Equation (1)) because changes in the response variable could have occurred without the impact (e.g., due to natural seasonal changes; Fig.  1 ).

The other observational design is Control-Impact (CI), which compares the impact group and control group in the after-period (Fig.  1 ). This design may suffer from design bias introduced by pre-existing differences between the impact group and control group in the before-period; bias that the BACI design was developed to account for 20 , 21 . These differences have many possible sources, including experimenter bias, logistical and environmental constraints, and various confounding factors (variables that change the propensity of receiving the impact), but can be adjusted for through certain data pre-processing techniques such as matching and stratification 22 .

Among the randomised designs, the most commonly used are counterparts to the observational CI and BACI designs: Randomised Control-Impact (R-CI) and Randomised Before-After Control-Impact (R-BACI) designs. The R-CI design, often termed ‘Randomised Controlled Trials’ (RCTs) in medicine and hailed as the ‘gold standard’ 23 , 24 , removes any pre-impact differences in a stochastic sense, resulting in zero design bias (Equation ( 1 )). Similarly, the R-BACI design should also have zero design bias, and the impact group measurements in the before-period could be used to improve the efficiency of the statistical estimator. No randomised equivalents exist of After or BA designs as they are uncontrolled.

It is important to briefly note that there is debate over two major statistical methods that can be used to analyse data collected using BACI and R-BACI designs, and which is superior at reducing modelling bias 25 (Equation (1)). These statistical methods are: (i) Differences in Differences (DiD) estimator; and (ii) covariance adjustment using the before-period response, which is an extension of Analysis of Covariance (ANCOVA) for generalised linear models — herein termed ‘covariance adjustment’ (Fig.  1 ). These estimators rely on different assumptions to obtain unbiased estimates of the impact’s true effect. The DiD estimator assumes that the control group response accurately represents the impact group response had it not been exposed to the impact (‘parallel trends’ 18 , 26 ) whereas covariance adjustment assumes there are no unmeasured confounders and linear model assumptions hold 6 , 27 .
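
In schematic terms (a simplified illustration, not notation taken from this article), writing \(\bar{Y}\) for a group mean indexed by group (Impact, Control) and period (Before, After), the two estimators differ as follows. The DiD estimator contrasts the before-to-after change in the impact group with that in the control group,

\(\hat{\tau}_{\mathrm{DiD}} = (\bar{Y}_{\mathrm{Impact,After}} - \bar{Y}_{\mathrm{Impact,Before}}) - (\bar{Y}_{\mathrm{Control,After}} - \bar{Y}_{\mathrm{Control,Before}}),\)

whereas covariance adjustment regresses the after-period response on the treatment indicator with the before-period response as a covariate,

\(Y_{\mathrm{After}} = \alpha + \tau\,\mathbb{1}[\mathrm{Impact}] + \theta\, Y_{\mathrm{Before}} + \varepsilon,\)

and takes \(\hat{\tau}\) as the effect estimate. In the analyses reported here, these contrasts are fitted on the link (log) scale of a GLM (see Methods).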

From both theory and Equation (1), with similar sample sizes, randomised designs (R-BACI and R-CI) are expected to be less biased than controlled, observational designs with sampling in the before-period (BACI), which in turn should be superior to observational designs without sampling in the before-period (CI) or without a control group (BA and After designs 7 , 28 ). Between randomised designs, we might expect that an R-BACI design performs better than a R-CI design because utilising extra data before the impact may improve the efficiency of the statistical estimator by explicitly characterising pre-existing differences between the impact group and control group.

Given the likely differences in bias associated with different study designs, concerns have been raised over the use of poorly designed studies in several scientific disciplines 7 , 29 , 30 , 31 , 32 , 33 , 34 , 35 . Some disciplines, such as the social and medical sciences, commonly undertake direct comparisons of results obtained by randomised and non-randomised designs within a single study 36 , 37 , 38 or between multiple studies (between-study comparisons 39 , 40 , 41 ) to specifically understand the influence of study designs on research findings. However, within-study comparisons are limited in their scope (e.g., a single study 42 , 43 ) and between-study comparisons can be confounded by variability in context or study populations 44 . Overall, we lack quantitative estimates of the prevalence of different study designs and the levels of bias associated with their results.

In this work, we aim to first quantify the prevalence of different study designs in the social and environmental sciences. To fill this knowledge gap, we take advantage of summaries for several thousand biodiversity conservation intervention studies in the Conservation Evidence database 45 ( www.conservationevidence.com ) and social intervention studies in systematic reviews by the Campbell Collaboration ( www.campbellcollaboration.org ). We then quantify the levels of bias in estimates obtained by different study designs (R-BACI, R-CI, BACI, BA, and CI) by applying a hierarchical model to approximately 1000 within-study comparisons across 49 raw environmental datasets from a range of fields. We show that R-BACI, R-CI and BACI designs are poorly represented in studies testing biodiversity conservation and social interventions, and that these types of designs tend to give less biased estimates than simpler observational designs. We propose a model-based approach to combine study estimates that may suffer from different levels of study design bias, discuss the implications for evidence synthesis, and how to facilitate the use of more credible study designs.

Results

Prevalence of study designs

We found that the biodiversity-conservation (conservation evidence) and social-science (Campbell collaboration) literature had similarly high proportions of intervention studies that used CI designs and After designs, but low proportions that used R-BACI, BACI, or BA designs (Fig.  2 ). There were slightly higher proportions of R-CI designs used by intervention studies in social-science systematic reviews than in the biodiversity-conservation literature (Fig.  2 ). The R-BACI, R-CI, and BACI designs made up 23% of intervention studies for biodiversity conservation, and 36% of intervention studies for social science.

Fig. 2: Intervention studies from the biodiversity-conservation literature were screened from the Conservation Evidence database (n = 4260 studies) and studies from the social-science literature were screened from 32 Campbell Collaboration systematic reviews (n = 1009 studies; studies excluded by these reviews on the basis of their study design were still counted). Percentages for the social-science literature were calculated for each systematic review (blue data points) and then averaged across all 32 systematic reviews (blue bars and black vertical lines represent means and 95% confidence intervals, respectively). Percentages for the biodiversity-conservation literature are absolute values (green bars) calculated from the entire Conservation Evidence database (after excluding any reviews). Source data are provided as a Source Data file. BA before-after, CI control-impact, BACI before-after-control-impact, R-BACI randomised BACI, R-CI randomised CI.

Influence of different study designs on study results

In non-randomised datasets, we found that estimates of BACI (with covariance adjustment) and CI designs were very similar, while the point estimates for most other designs often differed substantially in their magnitude and sign. We found similar results in randomised datasets for R-BACI (with covariance adjustment) and R-CI designs. For ~30% of responses, in both non-randomised and randomised datasets, study design estimates differed in their statistical significance (i.e., p < 0.05 versus p ≥ 0.05), except for estimates of (R-)BACI (with covariance adjustment) and (R-)CI designs (Table 1; Fig. 3). It was rare for the 95% confidence intervals of different designs’ estimates to not overlap – except when comparing estimates of BA designs to (R-)BACI (with covariance adjustment) and (R-)CI designs (Table 1). It was even rarer for estimates of different designs to have significantly different signs (i.e., one estimate with entirely negative confidence intervals versus one with entirely positive confidence intervals; Table 1, Fig. 3). Overall, point estimates often differed greatly in their magnitude and, to a lesser extent, in their sign between study designs, but did not differ as greatly when accounting for the uncertainty around point estimates – except in terms of their statistical significance.

Fig. 3: t-statistics were obtained from two-sided t-tests of estimates obtained by each design for different responses in each dataset using Generalised Linear Models (see Methods). For randomised datasets, BACI and CI axis labels refer to R-BACI and R-CI designs (denoted by ‘R-’). DiD Difference in Differences; CA covariance adjustment. Lines at t-statistic values of 1.96 denote boundaries between cells and colours of points indicate differences in direction and statistical significance (p < 0.05; grey = same sign and significance, orange = same sign but difference in significance, red = different sign and significance). Numbers refer to the number of responses in each cell. Source data are provided as a Source Data file. BA Before-After, CI Control-Impact, BACI Before-After-Control-Impact.

Levels of bias in estimates of different study designs

We modelled study design bias using a random effect across datasets in a hierarchical Bayesian model; σ is the standard deviation of the bias term, and assuming bias is randomly distributed across datasets and is on average zero, larger values of σ will indicate a greater magnitude of bias (see Methods). We found that, for randomised datasets, estimates of both R-BACI (using covariance adjustment; CA) and R-CI designs were affected by negligible amounts of bias (very small values of σ; Table  2 ). When the R-BACI design used the DiD estimator, it suffered from slightly more bias (slightly larger values of σ), whereas the BA design had very high bias when applied to randomised datasets (very large values of σ; Table  2 ). There was a highly positive correlation between the estimates of R-BACI (using covariance adjustment) and R-CI designs (Ω[R-BACI CA, R-CI] was close to 1; Table  2 ). Estimates of R-BACI using the DiD estimator were also positively correlated with estimates of R-BACI using covariance adjustment and R-CI designs (moderate positive mean values of Ω[R-BACI CA, R-BACI DiD] and Ω[R-BACI DiD, R-CI]; Table  2 ).

For non-randomised datasets, controlled designs (BACI and CI) were substantially less biased (far smaller values of σ) than the uncontrolled BA design (Table  2 ). A BACI design using the DiD estimator was slightly less biased than the BACI design using covariance adjustment, which was, in turn, slightly less biased than the CI design (Table  2 ).

Standard errors estimated by the hierarchical Bayesian model were reasonably accurate for the randomised datasets (see λ in Methods and Table  2 ), whereas there was some underestimation of standard errors and lack-of-fit for non-randomised datasets.

Discussion

Our approach provides a principled way to quantify the levels of bias associated with different study designs. We found that randomised study designs (R-BACI and R-CI) and observational BACI designs are poorly represented in the environmental and social sciences; collectively, descriptive case studies (the After design), the uncontrolled, observational BA design, and the controlled, observational CI design made up a substantially greater proportion of intervention studies (Fig. 2). And yet R-BACI, R-CI and BACI designs were found to be quantifiably less biased than other observational designs.

As expected the R-CI and R-BACI designs (using a covariance adjustment estimator) performed well; the R-BACI design using a DiD estimator performed slightly less well, probably because the differencing of pre-impact data by this estimator may introduce additional statistical noise compared to covariance adjustment, which controls for these data using a lagged regression variable. Of the observational designs, the BA design performed very poorly (both when analysing randomised and non-randomised data) as expected, being uncontrolled and therefore prone to severe design bias 7 , 28 . The CI design also tended to be more biased than the BACI design (using a DiD estimator) due to pre-existing differences between the impact and control groups. For BACI designs, we recommend that the underlying assumptions of DiD and CA estimators are carefully considered before choosing to apply them to data collected for a specific research question 6 , 27 . Their levels of bias were negligibly different and their known bracketing relationship suggests they will typically give estimates with the same sign, although their tendency to over- or underestimate the true effect will depend on how well the underlying assumptions of each are met (most notably, parallel trends for DiD and no unmeasured confounders for CA; see Introduction) 6 , 27 . Overall, these findings demonstrate the power of large within-study comparisons to directly quantify differences in the levels of bias associated with different designs.

We must acknowledge that the assumptions of our hierarchical model (that the bias for each design (j) is on average zero and normally distributed) cannot be verified without gold standard randomised experiments and that, for observational designs, the model was overdispersed (potentially due to underestimation of statistical error by GLM(M)s or positively correlated design biases). The exact values of our hierarchical model should therefore be treated with appropriate caution, and future research is needed to refine and improve our approach to quantify these biases more precisely. Responses within datasets may also not be independent as multiple species could interact; therefore, the estimates analysed by our hierarchical model are statistically dependent on each other, and although we tried to account for this using a correlation matrix (see Methods, Eq. ( 3 )), this is a limitation of our model. We must also recognise that we collated datasets using non-systematic searches 46 , 47 and therefore our analysis potentially exaggerates the intrinsic biases of observational designs (i.e., our data may disproportionately reflect situations where the BACI design was chosen to account for confounding factors). We nevertheless show that researchers were wise to use the BACI design because it was less biased than CI and BA designs across a wide range of datasets from various environmental systems and locations. Without undertaking costly and time-consuming pre-impact sampling and pilot studies, researchers are also unlikely to know the levels of bias that could affect their results. Finally, we did not consider sample size, but it is likely that researchers might use larger sample sizes for CI and BA designs than BACI designs. This is, however, unlikely to affect our main conclusions because larger sample sizes could increase type I errors (false positive rate) by yielding more precise, but biased estimates of the true effect 28 .

Our analyses provide several empirically supported recommendations for researchers designing future studies to assess an impact of interest. First, using a controlled and/or randomised design (if possible) was shown to strongly reduce the level of bias in study estimates. Second, when observational designs must be used (as randomisation is not feasible or too costly), we urge researchers to choose the BACI design over other observational designs—and when that is not possible, to choose the CI design over the uncontrolled BA design. We acknowledge that limited resources, short funding timescales, and ethical or logistical constraints 48 may force researchers to use the CI design (if randomisation and pre-impact sampling are impossible) or the BA design (if appropriate controls cannot be found 28 ). To facilitate the usage of less biased designs, longer-term investments in research effort and funding are required 43 . Far greater emphasis on study designs in statistical education 49 and better training and collaboration between researchers, practitioners and methodologists, is needed to improve the design of future studies; for example, potentially improving the CI design by pairing or matching the impact group and control group 22 , or improving the BA design using regression discontinuity methods 48 , 50 . Where the choice of study design is limited, researchers must transparently communicate the limitations and uncertainty associated with their results.

Our findings also have wider implications for evidence synthesis, specifically the exclusion of certain observational study designs from syntheses (the ‘rubbish in, rubbish out’ concept 51, 52). We believe that observational designs should be included in systematic reviews and meta-analyses, but that careful adjustments are needed to account for their potential biases. Exclusion of observational studies often results from subjective, checklist-based ‘Risk of Bias’ or quality assessments of studies (e.g., AMSTAR 2 53, ROBINS-I 54, or GRADE 55) that are not data-driven and often neglect to identify the actual direction, or quantify the magnitude, of possible bias introduced by observational studies when rating the quality of a review’s recommendations. We also found that there was a small proportion of studies that used randomised designs (R-CI or R-BACI) or observational BACI designs (Fig. 2), suggesting that systematic reviews and meta-analyses risk excluding a substantial proportion of the literature and limiting the scope of their recommendations if such exclusion criteria are used 32, 56, 57. This problem is compounded by the fact that, at least in conservation science, studies using randomised or BACI designs are strongly concentrated in Europe, Australasia, and North America 31. Systematic reviews that rely on these few types of study designs are therefore likely to fail to provide decision makers outside of these regions with locally relevant recommendations that they prefer 58. The Covid-19 pandemic has highlighted the difficulties in making locally relevant evidence-based decisions using studies conducted in different countries with different demographics and cultures, and on patients of different ages, ethnicities, genetics, and underlying health issues 59. This problem is also acute for decision-makers working on biodiversity conservation in the tropical regions, where the need for conservation is arguably the greatest (i.e., where most of Earth’s biodiversity exists 60) but they either have to rely on very few well-designed studies that are not locally relevant (i.e., have low generalisability), or more studies that are locally relevant but less well-designed 31, 32. Either option could lead decision-makers to take ineffective or inefficient decisions. In the long-term, improving the quality and coverage of scientific evidence and evidence syntheses across the world will help solve these issues, but shorter-term solutions to synthesising patchy evidence bases are required.

Our work furthers sorely needed research on how to combine evidence from studies that vary greatly in their design. Our approach is an alternative to conventional meta-analyses which tend to only weight studies by their sample size or the inverse of their variance 61 ; when studies vary greatly in their study design, simply weighting by inverse variance or sample size is unlikely to account for different levels of bias introduced by different study designs (see Equation (1)). For example, a BA study could receive a larger weight if it had lower variance than a BACI study, despite our results suggesting a BA study usually suffers from greater design bias. Our model provides a principled way to weight studies by both their variance and the likely amount of bias introduced by their study design; it is therefore a form of ‘bias-adjusted meta-analysis’ 62 , 63 , 64 , 65 , 66 . However, instead of relying on elicitation of subjective expert opinions on the bias of each study, we provide a data-driven, empirical quantification of study biases – an important step that was called for to improve such meta-analytic approaches 65 , 66 .

Future research is needed to refine our methodology, but our empirically grounded form of bias-adjusted meta-analysis could be implemented as follows: 1.) collate studies for the same true effect, their effect size estimates, standard errors, and the type of study design; 2.) enter these data into our hierarchical model, where effect size estimates share the same intercept (the true causal effect), a random effect term due to design bias (whose variance is estimated by the method we used), and a random effect term for statistical noise (whose variance is estimated by the reported standard error of studies); 3.) fit this model and estimate the shared intercept/true effect. Heuristically, this can be thought of as weighting studies by both their design bias and their sampling variance and could be implemented on a dynamic meta-analysis platform (such as metadataset.com 67 ). This approach has substantial potential to develop evidence synthesis in fields (such as biodiversity conservation 31 , 32 ) with patchy evidence bases, where reliably synthesising findings from studies that vary greatly in their design is a fundamental and unavoidable challenge.
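
As a concrete illustration of step 1, the inputs could be assembled as simply as the following sketch in R (study names, effect sizes, and standard errors are hypothetical placeholders, not data from this article):

# One row per study estimate: the effect size, its standard error,
# and the study design that produced it (hypothetical values).
studies <- data.frame(
  study  = c("study1", "study2", "study3", "study4"),
  design = c("BA", "CI", "BACI", "R-CI"),
  est    = c(0.42, 0.18, 0.25, 0.21),   # e.g., log response ratios
  se     = c(0.10, 0.08, 0.12, 0.09)    # reported standard errors
)
# Steps 2-3: pass these to the hierarchical model (see Methods), in which each
# estimate = shared true effect + design-specific bias term + sampling noise,
# and report the posterior distribution of the shared true effect.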

Our study has highlighted an often overlooked aspect of debates over scientific reproducibility: that the credibility of studies is fundamentally determined by study design. Testing the effectiveness of conservation and social interventions is undoubtedly of great importance given the current challenges facing biodiversity and society in general and the serious need for more evidence-based decision-making 1 , 68 . And yet our findings suggest that quantifiably less biased study designs are poorly represented in the environmental and social sciences. Greater methodological training of researchers and funding for intervention studies, as well as stronger collaborations between methodologists and practitioners is needed to facilitate the use of less biased study designs. Better communication and reporting of the uncertainty associated with different study designs is also needed, as well as more meta-research (the study of research itself) to improve standards of study design 69 . Our hierarchical model provides a principled way to combine studies using a variety of study designs that vary greatly in their risk of bias, enabling us to make more efficient use of patchy evidence bases. Ultimately, we hope that researchers and practitioners testing interventions will think carefully about the types of study designs they use, and we encourage the evidence synthesis community to embrace alternative methods for combining evidence from heterogeneous sets of studies to improve our ability to inform evidence-based decision-making in all disciplines.

Methods

Quantifying the use of different designs

We compared the use of different study designs in the literature that quantitatively tested interventions between the fields of biodiversity conservation (4,260 studies collated by Conservation Evidence 45 ) and social science (1,009 studies found by 32 systematic reviews produced by the Campbell Collaboration: www.campbellcollaboration.org ).

Conservation Evidence is a database of intervention studies, each of which has quantitatively tested a conservation intervention (e.g., sowing strips of wildflower seeds on farmland to benefit birds), that is continuously being updated through comprehensive, manual searches of conservation journals for a wide range of fields in biodiversity conservation (e.g., amphibian, bird, peatland, and farmland conservation 45 ). To obtain the proportion of studies that used each design from Conservation Evidence, we simply extracted the type of study design from each study in the database in 2019 – the study design was determined using a standardised set of criteria; reviews were not included (Table  3 ). We checked if the designs reported in the database accurately reflected the designs in the original publication and found that for a random subset of 356 studies, 95.1% were accurately described.

Each systematic review produced by the Campbell Collaboration collates and analyses studies that test a specific social intervention; we collated systematic reviews that tested a variety of social interventions across several fields in the social sciences, including education, crime and justice, international development and social welfare (Supplementary Data  1 ). We retrieved systematic reviews produced by the Campbell Collaboration by searching their website ( www.campbellcollaboration.org ) for reviews published between 2013‒2019 (as of 8th September 2019) — we limited the date range as we could not go through every review. As we were interested in the use of study designs in the wider social-science literature, we only considered reviews (32 in total) that contained sufficient information on the number of included and excluded studies that used different study designs. Studies may be excluded from systematic reviews for several reasons, such as their relevance to the scope of the review (e.g., testing a relevant intervention) and their study design. We only considered studies if the sole reason for their exclusion from the systematic review was their study design – i.e., reviews clearly reported that the study was excluded because it used a particular study design, and not because of any other reason, such as its relevance to the review’s research questions. We calculated the proportion of studies that used each design in each systematic review (using the same criteria as for the biodiversity-conservation literature – see Table  3 ) and then averaged these proportions across all systematic reviews.
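
For instance, the within-review proportions and their average across reviews (as plotted in Fig. 2) could be computed along these lines in base R, using hypothetical counts rather than the actual screening data:

# 'counts' holds the number of screened studies per design within each review (hypothetical).
counts <- data.frame(
  review = rep(c("review1", "review2"), each = 3),
  design = rep(c("After", "CI", "R-CI"), times = 2),
  n      = c(10, 25, 5, 8, 30, 12)
)
counts$prop <- counts$n / ave(counts$n, counts$review, FUN = sum)  # proportion within each review
aggregate(prop ~ design, data = counts, FUN = mean)                # mean proportion across reviews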

Within-study comparisons of different study designs

We wanted to make direct within-study comparisons between the estimates obtained by different study designs (e.g., see 38 , 70 , 71 for single within-study comparisons) for many different studies. If a dataset contains data collected using a BACI design, subsets of these data can be used to mimic the use of other study designs (a BA design using only data for the impact group, and a CI design using only data collected after the impact occurred). Similarly, if data were collected using a R-BACI design, subsets of these data can be used to mimic the use of a BA design and a R-CI design. Collecting BACI and R-BACI datasets would therefore allow us to make direct within-study comparisons of the estimates obtained by these designs.
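
In practice, this subsetting is straightforward; a minimal sketch (with hypothetical column names) for a BACI data frame might be:

# 'baci' has columns: y (response), treatment ("impact"/"control"),
# period ("before"/"after"), and site (hypothetical names).
ba_data <- subset(baci, treatment == "impact")  # mimics a BA design: impact group only
ci_data <- subset(baci, period == "after")      # mimics a (R-)CI design: after-period only
# The full 'baci' data frame is used for the (R-)BACI estimators.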

We collated BACI and R-BACI datasets by searching the Web of Science Core Collection 72 which included the following citation indexes: Science Citation Index Expanded (SCI-EXPANDED) 1900-present; Social Sciences Citation Index (SSCI) 1900-present Arts & Humanities Citation Index (A&HCI) 1975-present; Conference Proceedings Citation Index - Science (CPCI-S) 1990-present; Conference Proceedings Citation Index - Social Science & Humanities (CPCI-SSH) 1990-present; Book Citation Index - Science (BKCI-S) 2008-present; Book Citation Index - Social Sciences & Humanities (BKCI-SSH) 2008-present; Emerging Sources Citation Index (ESCI) 2015-present; Current Chemical Reactions (CCR-EXPANDED) 1985-present (Includes Institut National de la Propriete Industrielle structure data back to 1840); Index Chemicus (IC) 1993-present. The following search terms were used: [‘BACI’] OR [‘Before-After Control-Impact’] and the search was conducted on the 18th December 2017. Our search returned 674 results, which we then refined by selecting only ‘Article’ as the document type and using only the following Web of Science Categories: ‘Ecology’, ‘Marine Freshwater Biology’, ‘Biodiversity Conservation’, ‘Fisheries’, ‘Oceanography’, ‘Forestry’, ‘Zoology’, Ornithology’, ‘Biology’, ‘Plant Sciences’, ‘Entomology’, ‘Remote Sensing’, ‘Toxicology’ and ‘Soil Science’. This left 579 results, which we then restricted to articles published since 2002 (15 years prior to search) to give us a realistic opportunity to obtain the raw datasets, thus reducing this number to 542. We were able to access the abstracts of 521 studies and excluded any that did not test the effect of an environmental intervention or threat using an R-BACI or BACI design with response measures related to the abundance (e.g., density, counts, biomass, cover), reproduction (reproductive success) or size (body length, body mass) of animals or plants. Many studies did not test a relevant metric (e.g., they measured species richness), did not use a BACI or R-BACI design, or did not test the effect of an intervention or threat — this left 96 studies for which we contacted all corresponding authors to ask for the raw dataset. We were able to fully access 54 raw datasets, but upon closer inspection we found that three of these datasets either: did not use a BACI design; did not use the metrics we specified; or did not provide sufficient data for our analyses. This left 51 datasets in total that we used in our preliminary analyses (Supplementary Data  2 ).

All the datasets were originally collected to evaluate the effect of an environmental intervention or impact. Most of them contained multiple response variables (e.g., different measures for different species, such as abundance or density for species A, B, and C). Within a dataset, we use the term “response” to refer to the estimation of the true effect of an impact on one response variable. There were 1,968 responses in total across 51 datasets. We then excluded 932 responses (resulting in the exclusion of one dataset) where one or more of the four time-period and treatment subsets (Before Control, Before Impact, After Control, and After Impact data) consisted of entirely zero measurements, or two or more of these subsets had more than 90% zero measurements. We also excluded one further dataset as it was the only one to not contain repeated measurements at sites in both the before- and after-periods. This was necessary to generate reliable standard errors when modelling these data. We modelled the remaining 1,036 responses from across 49 datasets (Supplementary Table  1 ).

We applied each study design to the appropriate components of each dataset using Generalised Linear Models (GLMs 73 , 74 ) because of their generality and ability to implement the statistical estimators of many different study designs. The model structure of GLMs was adjusted for each response in each dataset based on the study design specified, response measure and dataset structure (Supplementary Table  2 ). We quantified the effect of the time period for the BA design (After vs Before the impact) and the effect of the treatment type for the CI and R-CI designs (Impact vs Control) on the response variable (Supplementary Table  2 ). For BACI and R-BACI designs, we implemented two statistical estimators: 1.) a DiD estimator that estimated the true effect using an interaction term between time and treatment type; and 2.) a covariance adjustment estimator that estimated the true effect using a term for the treatment type with a lagged variable (Supplementary Table  2 ).

As there were large numbers of responses, we used general a priori rules to specify models for each response; this may have led to some model misspecification, but was unlikely to have substantially affected our pairwise comparison of estimates obtained by different designs. The error family of each GLM was specified based on the nature of the measure used and preliminary data exploration: count measures (e.g., abundance) = poisson; density measures (e.g., biomass or abundance per unit area) = quasipoisson, as data for these measures tended to be overdispersed; percentage measures (e.g., percentage cover) = quasibinomial; and size measures (e.g., body length) = gaussian.
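
Putting these rules together, the design-specific models could be specified roughly as follows (a sketch with hypothetical variable names; the exact model structures used are given in Supplementary Table 2). Here 'y_before' denotes the lagged before-period response used for covariance adjustment, which assumes the data have been reshaped so that each after-period row carries its matching before-period value:

fam <- quasipoisson(link = "log")  # e.g., for density measures; poisson, quasibinomial,
                                   # or gaussian were used for other measure types

m_ba       <- glm(y ~ period,             family = fam, data = ba_data)  # BA: after vs before
m_ci       <- glm(y ~ treatment,          family = fam, data = ci_data)  # (R-)CI: impact vs control
m_baci_did <- glm(y ~ treatment * period, family = fam, data = baci)     # (R-)BACI, DiD (interaction term)
m_baci_ca  <- glm(y ~ treatment + y_before, family = fam,
                  data = subset(baci, period == "after"))                # (R-)BACI, covariance adjustment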

We treated each year or season in which data were collected as independent observations because the implementation of a seasonal term in models is likely to vary on a case-by-case basis; this will depend on the research questions posed by each study and was not feasible for us to consider given the large number of responses we were modelling. The log link function was used for all models to generate a standardised log response ratio as an estimate of the true effect for each response; a fixed effect coefficient (a variable named treatment status; Supplementary Table  2 ) was used to estimate the log response ratio 61 . If the response had at least ten ‘sites’ (independent sampling units) and two measurements per site on average, we used the random effects of subsample (replicates within a site) nested within site to capture the dependence within a site and subsample (i.e., a Generalised Linear Mixed Model or GLMM 73 , 74 was implemented instead of a GLM); otherwise we fitted a GLM with only the fixed effects (Supplementary Table  2 ).
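
Where that site and measurement threshold was met, the same fixed-effect structure could be fitted as a GLMM with lme4, for example (again with hypothetical variable names):

library(lme4)
# Random intercepts for subsample nested within site capture within-site dependence.
m_baci_did_mixed <- glmer(y ~ treatment * period + (1 | site/subsample),
                          family = poisson(link = "log"), data = baci)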

We fitted all models using R version 3.5.1 75 , and packages lme4 76 and MASS 77 . Code to replicate all analyses is available (see Data and Code Availability). We compared the estimates obtained using each study design (both in terms of point estimates and estimates with associated standard error) by their magnitude and sign.

A model-based quantification of the bias in study design estimates

We used a hierarchical Bayesian model motivated by the decomposition in Equation (1) to quantify the bias in different study design estimates. This model takes the estimated effects of impacts and their standard errors as inputs. Let \(\hat{\beta}_{ij}\) be the estimator of the true effect in response \(i\) using design \(j\) and \(\hat{\sigma}_{ij}\) be its estimated standard error from the corresponding GLM or GLMM. Our hierarchical model assumes:

\(\hat{\beta}_{ij} = \beta_i + \gamma_{ij} + \varepsilon_{ij}, \qquad \gamma_{ij} \sim \mathcal{N}(0, \sigma_j^2), \qquad (2)\)

where \(\beta_i\) is the true effect for response \(i\), \(\gamma_{ij}\) is the bias of design \(j\) in response \(i\), and \(\varepsilon_{ij}\) is the sampling noise of the statistical estimator. Although \(\gamma_{ij}\) technically incorporates both the design bias and any misspecification (modelling) bias due to using GLMs or GLMMs (Equation (1)), we expect the modelling bias to be much smaller than the design bias 3, 11. We assume the statistical errors \(\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{iJ})^{\top}\) within a response are related to the estimated standard errors through the following joint distribution:

\(\varepsilon_i \sim \mathcal{N}\!\left(0, \; \lambda^2 \, \mathrm{diag}(\hat{\sigma}_i) \, \Omega \, \mathrm{diag}(\hat{\sigma}_i)\right), \qquad (3)\)

where \({\Omega}\) is the correlation matrix for the different estimators in the same response and λ is a scaling factor to account for possible over/under-estimation of the standard errors.

This model effectively quantifies the bias of design \(j\) using the value of \(\sigma _j\) (larger values = more bias) by accounting for within-response correlations using the correlation matrix \({\Omega}\) and for possible under-estimation of the standard error using \(\lambda\) . We ensured that the prior distributions we used had very large variances so they would have a very small effect on the posterior distribution — accordingly we placed the following disperse priors on the variance parameters:

We fitted the hierarchical Bayesian model in R version 3.5.1 using the Bayesian inference package rstan 78 .
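
For illustration, a minimal version of this model could be written for rstan as in the sketch below. This is not the authors' published code (that is available from Zenodo; see Code availability): the priors shown are placeholders, and for simplicity the sketch treats the estimates within a response as uncorrelated, i.e., it omits the correlation matrix \(\Omega\) from Equation (3).

library(rstan)

stan_code <- "
data {
  int<lower=1> N;                      // number of design-specific effect estimates
  int<lower=1> R;                      // number of responses
  int<lower=1> J;                      // number of study designs
  int<lower=1, upper=R> response[N];   // response id for each estimate
  int<lower=1, upper=J> design[N];     // design id for each estimate
  vector[N] beta_hat;                  // estimated effects (log response ratios)
  vector<lower=0>[N] se_hat;           // their estimated standard errors
}
parameters {
  vector[R] beta;                      // true effect for each response
  matrix[R, J] gamma;                  // realised design bias in each response
  vector<lower=0>[J] sigma;            // design-level bias SD (larger = more bias)
  real<lower=0> lambda;                // scaling of the reported standard errors
}
model {
  beta ~ normal(0, 10);                // dispersed placeholder priors
  sigma ~ cauchy(0, 5);
  lambda ~ cauchy(0, 5);
  for (r in 1:R)
    to_vector(gamma[r]) ~ normal(0, sigma);
  for (n in 1:N)
    beta_hat[n] ~ normal(beta[response[n]] + gamma[response[n], design[n]],
                         lambda * se_hat[n]);
}
"

# 'stan_data' would hold the estimates, standard errors, and response/design ids
# extracted from the GLM and GLMM fits described above.
# fit <- stan(model_code = stan_code, data = stan_data)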

Data availability

All data analysed in the current study are available from Zenodo, https://doi.org/10.5281/zenodo.3560856 .  Source data are provided with this paper.

Code availability

All code used in the current study is available from Zenodo, https://doi.org/10.5281/zenodo.3560856 .

Donnelly, C. A. et al. Four principles to make evidence synthesis more useful for policy. Nature 558 , 361–364 (2018).

McKinnon, M. C., Cheng, S. H., Garside, R., Masuda, Y. J. & Miller, D. C. Sustainability: map the evidence. Nature 528 , 185–187 (2015).

Rubin, D. B. For objective causal inference, design trumps analysis. Ann. Appl. Stat. 2 , 808–840 (2008).

Peirce, C. S. & Jastrow, J. On small differences in sensation. Mem. Natl Acad. Sci. 3 , 73–83 (1884).

Fisher, R. A. Statistical methods for research workers . (Oliver and Boyd, 1925).

Angrist, J. D. & Pischke, J.-S. Mostly harmless econometrics: an empiricist’s companion . (Princeton University Press, 2008).

de Palma, A. et al . Challenges with inferring how land-use affects terrestrial biodiversity: study design, time, space and synthesis. in Next Generation Biomonitoring: Part 1 163–199 (Elsevier Ltd., 2018).

Sagarin, R. & Pauchard, A. Observational approaches in ecology open new ground in a changing world. Front. Ecol. Environ. 8 , 379–386 (2010).

Shadish, W. R., Cook, T. D. & Campbell, D. T. Experimental and quasi-experimental designs for generalized causal inference . (Houghton Mifflin, 2002).

Rosenbaum, P. R. Design of observational studies . vol. 10 (Springer, 2010).

Light, R. J., Singer, J. D. & Willett, J. B. By design: Planning research on higher education. By design: Planning research on higher education . (Harvard University Press, 1990).

Ioannidis, J. P. A. Why most published research findings are false. PLOS Med. 2 , e124 (2005).

Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349 , aac4716 (2015).

John, L. K., Loewenstein, G. & Prelec, D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23 , 524–532 (2012).

Kerr, N. L. HARKing: hypothesizing after the results are known. Personal. Soc. Psychol. Rev. 2 , 196–217 (1998).

Zhao, Q., Keele, L. J. & Small, D. S. Comment: will competition-winning methods for causal inference also succeed in practice? Stat. Sci. 34 , 72–76 (2019).

Friedman, J., Hastie, T. & Tibshirani, R. The Elements of Statistical Learning . vol. 1 (Springer series in statistics, 2001).

Underwood, A. J. Beyond BACI: experimental designs for detecting human environmental impacts on temporal variations in natural populations. Mar. Freshw. Res. 42 , 569–587 (1991).

Stewart-Oaten, A. & Bence, J. R. Temporal and spatial variation in environmental impact assessment. Ecol. Monogr. 71 , 305–339 (2001).

Eddy, T. D., Pande, A. & Gardner, J. P. A. Massive differential site-specific and species-specific responses of temperate reef fishes to marine reserve protection. Glob. Ecol. Conserv. 1 , 13–26 (2014).

Sher, A. A. et al. Native species recovery after reduction of an invasive tree by biological control with and without active removal. Ecol. Eng. 111 , 167–175 (2018).

Imbens, G. W. & Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences . (Cambridge University Press, 2015).

Greenhalgh, T. How to read a paper: the basics of Evidence Based Medicine . (John Wiley & Sons, Ltd, 2019).

Salmond, S. S. Randomized Controlled Trials: Methodological Concepts and Critique. Orthopaedic Nursing 27 , (2008).

Geijzendorffer, I. R. et al. How can global conventions for biodiversity and ecosystem services guide local conservation actions? Curr. Opin. Environ. Sustainability 29 , 145–150 (2017).

Dimick, J. B. & Ryan, A. M. Methods for evaluating changes in health care policy. JAMA 312 , 2401 (2014).

Ding, P. & Li, F. A bracketing relationship between difference-in-differences and lagged-dependent-variable adjustment. Political Anal. 27 , 605–615 (2019).

Christie, A. P. et al. Simple study designs in ecology produce inaccurate estimates of biodiversity responses. J. Appl. Ecol. 56 , 2742–2754 (2019).

Watson, M. et al. An analysis of the quality of experimental design and reliability of results in tribology research. Wear 426–427 , 1712–1718 (2019).

Kilkenny, C. et al. Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS ONE 4 , e7824 (2009).

Christie, A. P. et al. The challenge of biased evidence in conservation. Conserv, Biol . 13577, https://doi.org/10.1111/cobi.13577 (2020).

Christie, A. P. et al. Poor availability of context-specific evidence hampers decision-making in conservation. Biol. Conserv. 248 , 108666 (2020).

Moscoe, E., Bor, J. & Bärnighausen, T. Regression discontinuity designs are underutilized in medicine, epidemiology, and public health: a review of current and best practice. J. Clin. Epidemiol. 68 , 132–143 (2015).

Goldenhar, L. M. & Schulte, P. A. Intervention research in occupational health and safety. J. Occup. Med. 36 , 763–778 (1994).

Junker, J. et al. A severe lack of evidence limits effective conservation of the World’s primates. BioScience https://doi.org/10.1093/biosci/biaa082 (2020).

Altindag, O., Joyce, T. J. & Reeder, J. A. Can Nonexperimental Methods Provide Unbiased Estimates of a Breastfeeding Intervention? A Within-Study Comparison of Peer Counseling in Oregon. Evaluation Rev. 43 , 152–188 (2019).

Chaplin, D. D. et al. The Internal And External Validity Of The Regression Discontinuity Design: A Meta-Analysis Of 15 Within-Study Comparisons. J. Policy Anal. Manag. 37 , 403–429 (2018).

Cook, T. D., Shadish, W. R. & Wong, V. C. Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. J. Policy Anal. Manag. 27 , 724–750 (2008).

Ioannidis, J. P. A. et al. Comparison of evidence of treatment effects in randomized and nonrandomized studies. J. Am. Med. Assoc. 286 , 821–830 (2001).

dos Santos Ribas, L. G., Pressey, R. L., Loyola, R. & Bini, L. M. A global comparative analysis of impact evaluation methods in estimating the effectiveness of protected areas. Biol. Conserv. 246 , 108595 (2020).

Benson, K. & Hartz, A. J. A Comparison of Observational Studies and Randomized, Controlled Trials. N. Engl. J. Med. 342 , 1878–1886 (2000).

Smokorowski, K. E. et al. Cautions on using the Before-After-Control-Impact design in environmental effects monitoring programs. Facets 2 , 212–232 (2017).

França, F. et al. Do space-for-time assessments underestimate the impacts of logging on tropical biodiversity? An Amazonian case study using dung beetles. J. Appl. Ecol. 53 , 1098–1105 (2016).

Duvendack, M., Hombrados, J. G., Palmer-Jones, R. & Waddington, H. Assessing ‘what works’ in international development: meta-analysis for sophisticated dummies. J. Dev. Effectiveness 4 , 456–471 (2012).

Sutherland, W. J. et al. Building a tool to overcome barriers in research-implementation spaces: The Conservation Evidence database. Biol. Conserv. 238 , 108199 (2019).

Gusenbauer, M. & Haddaway, N. R. Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 11 , 181–217 (2020).

Konno, K. & Pullin, A. S. Assessing the risk of bias in choice of search sources for environmental meta‐analyses. Res. Synth. Methods 11 , 698–713 (2020).

Butsic, V., Lewis, D. J., Radeloff, V. C., Baumann, M. & Kuemmerle, T. Quasi-experimental methods enable stronger inferences from observational data in ecology. Basic Appl. Ecol. 19 , 1–10 (2017).

Brownstein, N. C., Louis, T. A., O’Hagan, A. & Pendergast, J. The role of expert judgment in statistical inference and evidence-based decision-making. Am. Statistician 73 , 56–68 (2019).

Article   MathSciNet   Google Scholar  

Hahn, J., Todd, P. & Klaauw, W. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica 69 , 201–209 (2001).

Slavin, R. E. Best evidence synthesis: an intelligent alternative to meta-analysis. J. Clin. Epidemiol. 48 , 9–18 (1995).

Slavin, R. E. Best-evidence synthesis: an alternative to meta-analytic and traditional reviews. Educ. Researcher 15 , 5–11 (1986).

Shea, B. J. et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ (Online) 358 , 1–8 (2017).

Google Scholar  

Sterne, J. A. C. et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ 355 , i4919 (2016).

Guyatt, G. et al. GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes. J. Clin. Epidemiol. 66 , 151–157 (2013).

Davies, G. M. & Gray, A. Don’t let spurious accusations of pseudoreplication limit our ability to learn from natural experiments (and other messy kinds of ecological monitoring). Ecol. Evolution 5 , 5295–5304 (2015).

Lortie, C. J., Stewart, G., Rothstein, H. & Lau, J. How to critically read ecological meta-analyses. Res. Synth. Methods 6 , 124–133 (2015).

Gutzat, F. & Dormann, C. F. Exploration of concerns about the evidence-based guideline approach in conservation management: hints from medical practice. Environ. Manag. 66 , 435–449 (2020).

Greenhalgh, T. Will COVID-19 be evidence-based medicine’s nemesis? PLOS Med. 17 , e1003266 (2020).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Barlow, J. et al. The future of hyperdiverse tropical ecosystems. Nature 559 , 517–526 (2018).

Gurevitch, J. & Hedges, L. V. Statistical issues in ecological meta‐analyses. Ecology 80 , 1142–1149 (1999).

Stone, J. C., Glass, K., Munn, Z., Tugwell, P. & Doi, S. A. R. Comparison of bias adjustment methods in meta-analysis suggests that quality effects modeling may have less limitations than other approaches. J. Clin. Epidemiol. 117 , 36–45 (2020).

Rhodes, K. M. et al. Adjusting trial results for biases in meta-analysis: combining data-based evidence on bias with detailed trial assessment. J. R. Stat. Soc.: Ser. A (Stat. Soc.) 183 , 193–209 (2020).

Article   MathSciNet   CAS   Google Scholar  

Efthimiou, O. et al. Combining randomized and non-randomized evidence in network meta-analysis. Stat. Med. 36 , 1210–1226 (2017).

Article   MathSciNet   PubMed   Google Scholar  

Welton, N. J., Ades, A. E., Carlin, J. B., Altman, D. G. & Sterne, J. A. C. Models for potentially biased evidence in meta-analysis using empirically based priors. J. R. Stat. Soc. Ser. A (Stat. Soc.) 172 , 119–136 (2009).

Turner, R. M., Spiegelhalter, D. J., Smith, G. C. S. & Thompson, S. G. Bias modelling in evidence synthesis. J. R. Stat. Soc.: Ser. A (Stat. Soc.) 172 , 21–47 (2009).

Shackelford, G. E. et al. Dynamic meta-analysis: a method of using global evidence for local decision making. bioRxiv 2020.05.18.078840, https://doi.org/10.1101/2020.05.18.078840 (2020).

Sutherland, W. J., Pullin, A. S., Dolman, P. M. & Knight, T. M. The need for evidence-based conservation. Trends Ecol. evolution 19 , 305–308 (2004).

Ioannidis, J. P. A. Meta-research: Why research on research matters. PLOS Biol. 16 , e2005468 (2018).

Article   PubMed   PubMed Central   CAS   Google Scholar  

LaLonde, R. J. Evaluating the econometric evaluations of training programs with experimental data. Am. Econ. Rev. 76 , 604–620 (1986).

Long, Q., Little, R. J. & Lin, X. Causal inference in hybrid intervention trials involving treatment choice. J. Am. Stat. Assoc. 103 , 474–484 (2008).

Article   MathSciNet   CAS   MATH   Google Scholar  

Thomson Reuters. ISI Web of Knowledge. http://www.isiwebofknowledge.com (2019).

Stroup, W. W. Generalized linear mixed models: modern concepts, methods and applications . (CRC press, 2012).

Bolker, B. M. et al. Generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol. Evolution 24 , 127–135 (2009).

R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing (2019).

Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67 , 1–48 (2015).

Venables, W. N. & Ripley, B. D. Modern Applied Statistics with S . (Springer, 2002).

Stan Development Team. RStan: the R interface to Stan. R package version 2.19.3 (2020).

Download references


Quantitative Research

  • Reference work entry
  • First Online: 13 January 2019
  • Cite this reference work entry

  • Leigh A. Wilson 2 , 3  

4455 Accesses

4 Citations

Quantitative research methods are concerned with the planning, design, and implementation of strategies to collect and analyze data. Descartes, the seventeenth-century philosopher, suggested that how the results are achieved is often more important than the results themselves, as the journey taken along the research path is a journey of discovery. High-quality quantitative research is characterized by the attention given to the methods and the reliability of the tools used to collect the data. The ability to critique research in a systematic way is an essential component of a health professional’s role in order to deliver high quality, evidence-based healthcare. This chapter is intended to provide a simple overview of the way new researchers and health practitioners can understand and employ quantitative methods. The chapter offers practical, realistic guidance in a learner-friendly way and uses a logical sequence to understand the process of hypothesis development, study design, data collection and handling, and finally data analysis and interpretation.


Babbie ER. The practice of social research. 14th ed. Belmont: Wadsworth Cengage; 2016.

Descartes. Cited in Halverston, W. (1976). In: A concise introduction to philosophy, 3rd ed. New York: Random House; 1637.

Doll R, Hill AB. The mortality of doctors in relation to their smoking habits. BMJ. 1954;328(7455):1529–33. https://doi.org/10.1136/bmj.328.7455.1529 .

Liamputtong P. Research methods in health: foundations for evidence-based practice. 3rd ed. Melbourne: Oxford University Press; 2017.

McNabb DE. Research methods in public administration and nonprofit management: quantitative and qualitative approaches. 2nd ed. New York: Armonk; 2007.

Merriam-Webster. Dictionary. http://www.merriam-webster.com . Accessed 20th December 2017.

Olesen Larsen P, von Ins M. The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics. 2010;84(3):575–603.

Pannucci CJ, Wilkins EG. Identifying and avoiding bias in research. Plast Reconstr Surg. 2010;126(2):619–25. https://doi.org/10.1097/PRS.0b013e3181de24bc .

Petrie A, Sabin C. Medical statistics at a glance. 2nd ed. London: Blackwell Publishing; 2005.

Portney LG, Watkins MP. Foundations of clinical research: applications to practice. 3rd ed. New Jersey: Pearson Publishing; 2009.

Sheehan J. Aspects of research methodology. Nurse Educ Today. 1986;6:193–203.

Wilson LA, Black DA. Health, science research and research methods. Sydney: McGraw Hill; 2013.

Author information

Authors and affiliations.

School of Science and Health, Western Sydney University, Penrith, NSW, Australia

Leigh A. Wilson

Faculty of Health Science, Discipline of Behavioural and Social Sciences in Health, University of Sydney, Lidcombe, NSW, Australia

Corresponding author

Correspondence to Leigh A. Wilson .

Editor information

Editors and affiliations.

Pranee Liamputtong

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this entry

Cite this entry.

Wilson, L.A. (2019). Quantitative Research. In: Liamputtong, P. (eds) Handbook of Research Methods in Health Social Sciences. Springer, Singapore. https://doi.org/10.1007/978-981-10-5251-4_54

DOI : https://doi.org/10.1007/978-981-10-5251-4_54

Published : 13 January 2019

Publisher Name : Springer, Singapore

Print ISBN : 978-981-10-5250-7

Online ISBN : 978-981-10-5251-4


How to appraise quantitative research

Evidence-Based Nursing, Volume 21, Issue 4

  • Xabi Cathala 1 ,
  • Calvin Moorley 2
  • 1 Institute of Vocational Learning , School of Health and Social Care, London South Bank University , London , UK
  • 2 Nursing Research and Diversity in Care , School of Health and Social Care, London South Bank University , London , UK
  • Correspondence to Mr Xabi Cathala, Institute of Vocational Learning, School of Health and Social Care, London South Bank University London UK ; cathalax{at}lsbu.ac.uk and Dr Calvin Moorley, Nursing Research and Diversity in Care, School of Health and Social Care, London South Bank University, London SE1 0AA, UK; Moorleyc{at}lsbu.ac.uk

https://doi.org/10.1136/eb-2018-102996


Introduction

Some nurses feel that they lack the necessary skills to read a research paper and to then decide if they should implement the findings into their practice. This is particularly the case when considering the results of quantitative research, which often contain the results of statistical testing. However, nurses have a professional responsibility to critique research to improve their practice, care and patient safety. 1  This article provides a step-by-step guide on how to critically appraise a quantitative paper.

Title, keywords and the authors

The authors’ names may not mean much, but knowing the following will be helpful:

Their position, for example, academic, researcher or healthcare practitioner.

Their qualifications, both professional (eg, nurse or physiotherapist) and academic (eg, degree, masters, doctorate).

This can indicate how the research has been conducted and the authors’ competence on the subject. Basically, do you want to read a paper on quantum physics written by a plumber?

The abstract is a summary of the article and should contain:

Introduction.

Research question/hypothesis.

Methods including sample design, tests used and the statistical analysis (of course! Remember we love numbers).

Main findings.

Conclusion.

The subheadings in the abstract will vary depending on the journal. An abstract should not usually be more than 300 words but this varies depending on specific journal requirements. If the above information is contained in the abstract, it can give you an idea about whether the study is relevant to your area of practice. However, before deciding if the results of a research paper are relevant to your practice, it is important to review the overall quality of the article. This can only be done by reading and critically appraising the entire article.

The introduction

The introduction should set out the research question and, in a quantitative study, the hypothesis and null hypothesis to be tested. Example: the effect of paracetamol on levels of pain.

My hypothesis is that A has an effect on B, for example, paracetamol has an effect on levels of pain.

My null hypothesis is that A has no effect on B, for example, paracetamol has no effect on pain.

My study will test the null hypothesis. If the null hypothesis cannot be rejected, the study provides no support for the hypothesis (A has no demonstrable effect on B); in the example, there is no evidence that paracetamol affects the level of pain. If the null hypothesis is rejected, the data support the hypothesis (A has an effect on B); that is, paracetamol does affect the level of pain.
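
To make this concrete, here is a minimal sketch (not taken from the article) of how the paracetamol example could be tested; the pain scores are simulated, and the group sizes, means and the choice of an independent-samples t-test are illustrative assumptions only.

```python
# Minimal sketch with simulated data: testing the null hypothesis that
# paracetamol has no effect on pain, using invented 0-10 pain scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo = rng.normal(loc=6.0, scale=1.5, size=40)      # simulated scores, untreated group
paracetamol = rng.normal(loc=5.0, scale=1.5, size=40)  # simulated scores, treated group

t_stat, p_value = stats.ttest_ind(paracetamol, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Reject the null hypothesis: the data suggest an effect on pain.")
else:
    print("Fail to reject the null hypothesis: no evidence of an effect on pain.")
```

In a real study the scores would come from patients rather than a random number generator, and the test would be chosen to match the study design, but the logic of the decision rule is the same.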

Background/literature review

The literature review should include reference to recent and relevant research in the area. It should summarise what is already known about the topic and why the research study is needed and state what the study will contribute to new knowledge. 5 The literature review should be up to date, usually 5–8 years, but it will depend on the topic and sometimes it is acceptable to include older (seminal) studies.

Methodology

In quantitative studies, the data analysis varies depending on the type of design used; descriptive, correlational and experimental studies all differ. A descriptive study describes the pattern of a topic in relation to one or more variables. 6 A correlational study examines the link (correlation) between two variables 7  and focuses on how one variable changes in relation to another. In experimental studies, the researchers manipulate variables and examine the outcomes, 8  and the sample is commonly assigned to different groups at random (randomisation) to determine the causal effect of a condition (the independent variable) on a certain outcome. This is a common method used in clinical trials.
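
As a small illustration of the correlational design described above, the sketch below computes a Pearson correlation between two invented variables; the variable names and values are hypothetical and are not drawn from the article.

```python
# Illustrative sketch only: a correlational analysis asks how strongly two
# variables move together. All values below are invented for demonstration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
hours_exercise = rng.uniform(0, 10, size=50)
# Invent an outcome loosely (negatively) related to exercise, plus noise.
pain_score = 8 - 0.4 * hours_exercise + rng.normal(0, 1, size=50)

r, p = stats.pearsonr(hours_exercise, pain_score)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # r near -1 or +1 = strong link; near 0 = none
```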

There should be sufficient detail provided in the methods section for you to replicate the study (should you want to). To enable you to do this, the following sections are normally included:

Overview and rationale for the methodology.

Participants or sample.

Data collection tools.

Methods of data analysis.

Ethical issues.

Data collection should be clearly explained and the article should discuss how this process was undertaken. Data collection should be systematic, objective, precise, repeatable, valid and reliable. Any tool (eg, a questionnaire) used for data collection should have been piloted (or pretested and/or adjusted) to ensure the quality, validity and reliability of the tool. 9 The participants (the sample) and any randomisation technique used should be identified. The sample size is central in quantitative research, as the findings should be generalisable to the wider population. 10 The data analysis can be done manually, or more complex analyses can be performed using computer software, sometimes with the advice of a statistician. From this analysis, results such as the mode, mean, median, p value and CI are presented in numerical format.
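
For readers unfamiliar with these summary statistics, the short sketch below (using made-up questionnaire scores rather than data from any study) shows the kind of numerical output referred to above.

```python
# Sketch of the descriptive output a results section might report;
# the questionnaire scores below are invented for illustration.
import statistics

scores = [3, 4, 4, 5, 5, 5, 6, 7, 7, 8]  # e.g. responses on a 0-10 scale

print("mean  :", statistics.mean(scores))
print("median:", statistics.median(scores))
print("mode  :", statistics.mode(scores))
print("sd    :", round(statistics.stdev(scores), 2))
```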

The author(s) should present the results clearly. These may be presented in graphs, charts or tables alongside some text. You should perform your own critique of the data analysis process; just because a paper has been published, it does not mean it is perfect. Your conclusions may differ from the authors’. Through critical analysis the reader may find an error in the study process that the authors have not seen or highlighted. Such errors can change the study’s results, or turn a study you thought was strong into a weak one. To help you critique a quantitative research paper, some guidance on understanding statistical terminology is provided in table 1 .

Table 1: Some basic guidance for understanding statistics

Quantitative studies examine the relationship between variables, and the p value expresses this objectively.  11  If the p value is less than 0.05, the null hypothesis is rejected, the hypothesis is supported and the study will report a significant difference. If the p value is 0.05 or more, the null hypothesis cannot be rejected and the study will report no significant difference. As a general rule, then, a p value below 0.05 supports the hypothesis, and a p value above 0.05 does not.

A CI is a range of values around the study estimate, usually reported at a 95% level of confidence, within which the true population value is likely to lie. 12  The confidence level corresponds to 1 minus the significance threshold (alpha), not 1 minus the p value; with the conventional threshold of 0.05, the level is 1−0.05=0.95=95%. If a 95% CI excludes the “no effect” value (for example, 0 for a difference or 1 for a ratio), the result is statistically significant at p < 0.05; if it includes that value, the result is not statistically significant. Together, the p value and the CI indicate the confidence and robustness of a result.
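
To connect the p value and the CI, the sketch below computes a 95% CI for a difference in means using the same kind of simulated pain-score data as in the earlier example; the simple pooled degrees-of-freedom calculation is an illustrative simplification, not a prescribed method.

```python
# Minimal sketch: a 95% confidence interval for a difference in means,
# using simulated data (not from any study).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
placebo = rng.normal(6.0, 1.5, 40)
treated = rng.normal(5.0, 1.5, 40)

diff = treated.mean() - placebo.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + placebo.var(ddof=1) / len(placebo))
dof = len(treated) + len(placebo) - 2            # simple pooled approximation
t_crit = stats.t.ppf(0.975, dof)                 # critical value for a two-sided 95% CI

ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"difference = {diff:.2f}, 95% CI ({ci_low:.2f} to {ci_high:.2f})")
# If the interval excludes 0, the difference is significant at p < 0.05.
```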

Discussion, recommendations and conclusion

The final section of the paper is where the authors discuss their results and link them to other literature in the area (some of which may have been included in the literature review at the start of the paper). This reminds the reader of what is already known, what the study has found and what new information it adds. The discussion should demonstrate how the authors interpreted their results and how they contribute to new knowledge in the area. Implications for practice and future research should also be highlighted in this section of the paper.

A few other areas you may find helpful are:

Limitations of the study.

Conflicts of interest.

Table 2 provides a useful tool to help you apply the learning in this paper to the critiquing of quantitative research papers.

Table 2: Quantitative paper appraisal checklist

  • 1. Nursing and Midwifery Council, 2015. The code: standard of conduct, performance and ethics for nurses and midwives. https://www.nmc.org.uk/globalassets/sitedocuments/nmc-publications/nmc-code.pdf (accessed 21.8.18).
  • Gerrish K, Moorley C, Tunariu A, et al
  • Shorten A,

Competing interests None declared.

Patient consent Not required.

Provenance and peer review Commissioned; internally peer reviewed.

Correction notice This article has been updated since its original publication to update p values from 0.5 to 0.05 throughout.

Linked Articles

  • Correction: How to appraise quantitative research. Evidence-Based Nursing 2019;22:62. Published Online First: 31 Jan 2019. doi: 10.1136/eb-2018-102996corr1

  • Open access
  • Published: 30 May 2024

Differential attainment in assessment of postgraduate surgical trainees: a scoping review

  • Rebecca L. Jones 1 , 2 ,
  • Suwimol Prusmetikul 1 , 3 &
  • Sarah Whitehorn 1  

BMC Medical Education volume  24 , Article number:  597 ( 2024 ) Cite this article

25 Accesses

Metrics details

Introduction

Solving disparities in assessments is crucial to a successful surgical training programme. The first step in levelling these inequalities is recognising in what contexts they occur, and what protected characteristics are potentially implicated.

Methods

This scoping review was based on Arksey & O’Malley’s guiding principles. Ovid and Embase were used to identify articles, which were then screened by three reviewers.

Results

From an initial 358 articles, 53 reported on the presence of differential attainment in postgraduate surgical assessments. The majority were quantitative studies (77.4%) using retrospective designs; 11.3% were qualitative. Differential attainment affects a varied range of protected characteristics. The characteristics most likely to be investigated were gender (85%), ethnicity (37%) and socioeconomic background (7.5%). Evidence of inequalities is present in many types of assessment, including academic achievements, assessments of progression in training, workplace-based assessments, logs of surgical experience and tests of technical skills.

Conclusions

Attainment gaps have been demonstrated in many types of assessment, including supposedly “objective” written assessments and at revalidation. Further research is necessary to delineate the most effective methods to eliminate bias in higher surgical training. Surgical curriculum providers should be informed by the available literature on inequalities in surgical training, as well as in neighbouring specialties such as medicine or general practice, when designing assessments and considering how to mitigate potential causes of differential attainment.

Peer Review reports

Diversity in the surgical workforce has been a hot topic for the last 10 years, gaining traction following the Black Lives Matter movement in 2016 [ 1 ]. In the UK this culminated in publication of the Kennedy report in 2021 [ 2 ]. Before this, the focus was principally on gender imbalance in surgery: the 2010 Surgical Workforce report only reported gender percentages by speciality, with no comment on racial profile, sexuality distribution, disability occurrence, or socioeconomic background [ 3 ].

Gender is not the only protected characteristic deserving of equity in surgery; many groups find themselves at a disadvantage in postgraduate surgical examinations [ 4 ] and at revalidation [ 5 ]. This phenomenon is termed ‘differential attainment’ (DA): disparities in educational outcomes, progression rates, or achievements between groups defined by protected characteristics [ 4 ]. It may be due to assessors’ subconscious bias, or to deficits in training and education before assessment.

One of the four pillars of medical ethics is “justice”, emphasising that healthcare should be provided in a fair, equitable, and ethical manner, benefiting all individuals and promoting the well-being of society as a whole. This applies not only to our patients but also to our colleagues; training should be provided in a fair, equitable, and ethical manner, benefiting all. By applying the principle of justice to surgical trainees, we can create an environment that is supportive, inclusive, and conducive to professional growth and well-being.

A diverse consultant body is crucial for providing high-quality healthcare to a diverse patient population. It has been shown that patients are happier when cared for by a doctor with the same ethnic background [ 6 ]. Takeshita et al. [ 6 ] proposed this is due to a greater likelihood of mutual understanding of cultural values, beliefs, and preferences and is therefore more likely to cultivate a trusting relationship, leading to accurate diagnosis, treatment adherence and improved patient understanding. As such, ensuring that all trainees are justly educated and assessed throughout their training may contribute to improving patient care by diversifying the consultant body.

Surgery is well known to have its own specific culture, language, and social rules which are unique even within the world of medicine [ 7 , 8 ]. Through training, graduates develop into surgeons, distinct from other physicians and practitioners [ 9 ]. As such, research conducted in other medical domains is not automatically applicable to surgery, and behavioural interventions focused on reducing or eliminating bias in training need to be tailored specifically to surgical settings.

Consequently, it’s important that the surgical community asks the questions:

Does DA exist in postgraduate surgical training, and to what extent?

Why does DA occur?

What groups or assessments are under-researched?

How can we apply this knowledge, or acquire new knowledge, to provide equity for trainees?

The following scoping review aims to provide the surgical community with robust answers to inform the future of surgical training.

Aims and research question

The aim of this scoping review is to understand the breadth of research about the presence of DA in postgraduate surgical education and to determine themes pertaining to causes of inequalities. A scoping review was chosen to provide a means to map the available literature, including published peer-reviewed primary research and grey literature.

Following the methodological framework set out by Arksey and O’Malley [ 10 ], our research was intended to characterise the literature addressing DA in higher surgical training (HST), including Ophthalmology and Obstetrics & Gynaecology (O&G). We included literature from English-speaking countries, including the UK and USA.

Search strategy

We used search terms tailored to our target population characteristics (e.g., gender, ethnicity), concept (i.e., DA) and context (i.e., assessment in postgraduate surgical education). Medline and Embase were searched with the assistance of a research librarian, with the addition of synonyms. The search was conducted in May 2023, and the results were exported to Microsoft Excel for further review. The reference lists of included articles were also searched to find any relevant data sources that had yet to be considered. In addition, to identify grey literature, a search was performed for the terms “differential attainment” and “disparity” on the relevant stakeholders’ websites (see supplemental Table 1 for the full listing). Stakeholders were included on the basis of their involvement in the governance or training of surgical trainees.
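
Purely to illustrate the population, concept and context structure described above (this is a hypothetical example, not the authors' actual search string), a Boolean query for a database such as Medline or Embase might look like the following:

```
("differential attainment" OR disparit* OR inequalit*)
AND (gender OR ethnic* OR socioeconomic OR disabilit* OR "international medical graduate*")
AND (surg* OR orthopaedic* OR ophthalmolog* OR obstetric*)
AND (assess* OR exam* OR ARCP OR "workplace-based assessment")
```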

Study selection

To start, we excluded conference abstracts that were subsequently published as full papers, to avoid duplication ( n  = 337). After an initial screen by title to exclude obviously irrelevant articles, the remainder were filtered against our inclusion and exclusion criteria (Table  1 ). The remaining articles ( n  = 47) were then reviewed in their entirety, with the addition of five reports found in grey literature. Following the screening process, 45 studies were included in the scoping review (Fig.  1 ).

Charting the data

The extracted data included literature title, authors, year of publication, country of study, study design, population characteristic, case number, context, type of assessment, research question and main findings (Appendix 1). Extraction was performed initially by a single author and then subsequently by a second author to ensure thorough review. Group discussion was conducted in case of any disagreements. As charting occurred, papers were discovered within reference lists of included studies which were eligible for inclusion; these were assimilated into the data charting table and included in the data extraction ( n  = 8).

Collating, summarizing and reporting the results

The included studies were not formally assessed in their quality or risk of bias, consistent with a scoping review approach [ 10 ]. However, group discussion was conducted during charting to aid argumentation and identify themes and trends.

We conducted a descriptive numerical summary to describe the characteristics of included studies. Then thematic analysis was implemented to examine key details and organise the attainment quality and population characteristics based on their description. The coding of themes was an iterative process and involved discussion between authors, to identify and refine codes to group into themes.

We categorised the main themes as gender, ethnicity, country of graduation, individual and family background in education, socioeconomic background, age, and disability. The number of articles in each theme is demonstrated in Table  2 . Data was reviewed and organised into subtopics based on assessment types included: academic achievement (e.g., MRCS, FRCS), assessments for progression (e.g., ARCP), workplace-based assessment (e.g., EPA, feedback), surgical experience (e.g., case volume), and technical skills (e.g., visuo-spatial tasks).

Figure 1: PRISMA flow diagram

Forty-four articles defined the number of included participants (89,399 participants in total; range across individual studies 16–34,755). Two articles reported the number of included studies for their meta-analyses (18 and 63 included articles, respectively). Two reports from the grey literature did not define the number of participants included in their analysis. The characteristics of the included articles are displayed in Table  2 .

Figure 2: Growth in published literature on differential attainment over the past 40 years

Gender

Academic achievement

In the American Board of Surgery Certifying Exam (ABSCE), Maker [ 11 ] found there to be no significant differences in terms of gender when comparing those who passed on their first attempt and those who did not in general surgery training, a finding supported by Ong et al. [ 12 ]. Pico et al. [ 13 ] reported that in Orthopaedic training, Orthopaedic In-Training Examination (OITE) and American Board of Orthopaedic Surgery (ABOS) Part 1 scores were similar between genders, but that female trainees took more attempts in order to pass. In the UK, two studies reported significantly lower Membership of the Royal College of Surgeons (MRCS) pass rates for female trainees compared to males [ 4 , 14 ]. However, Robinson et al. [ 15 ] presented no significant gender differences in MRCS success rates. A study assessing Fellowship of the Royal College of Surgeons (FRCS) examination results found no significant gender disparities in pass rates [ 16 ]. In MRCOG examination, no significant gender differences were found in Part 1 scores, but women had higher pass rates and scores in Part 2 [ 17 ].

Assessment for Progression

ARCP is the annual process of revalidation that UK doctors must perform to progress through training. A satisfactory progress outcome (“outcome 1”) allows trainees to advance through to the next training year, whereas non-satisfactory outcomes (“2–5”) suggest inadequate progress and recommends solutions, such as further time in training or being released from the training programme. Two studies reported that women received 60% more non-satisfactory outcomes than men [ 16 , 18 ]. In contrast, in O&G men had higher non-satisfactory ARCP outcomes without explicit reasons for this given [ 19 ].

Regarding Milestone evaluations based from the US Accreditation Council for Graduate Medical Education (ACGME), Anderson et al. [ 20 ] reported men had higher ratings of knowledge of diseases at postgraduate year 5 (PGY-5), while women had lower mean score achievements. This was similar to another study finding that men and women had similar competencies at PGY-1 to 3, and that it was only at PGY-5 that women were evaluated lower than men [ 21 ]. However, Kwasny et al. [ 22 ] found no difference in trainers’ ratings between genders, but women self-rated themselves lower. Salles et al. [ 23 ] demonstrated significant improvement in scoring in women following a value-affirmation intervention, while this intervention did not affect men.

Workplace-based Assessment

Galvin et al. [ 24 ] reported better evaluation scores from nurses for PGY-2 male trainees, while females received fewer positive and more negative comments. Gerull et al. [ 25 ] demonstrated men received compliments with superlatives or standout words, whereas women were more likely to receive compliments with mitigating phrases (e.g., excellent vs. quite competent).

Hayward et al. [ 26 ] investigated assessment of attributes of clinical performance (ethics, judgement, technical skills, knowledge and interpersonal skills) and found similar scoring between genders.

Several authors have studied autonomy given to trainees in theatre [ 27 , 28 , 29 , 30 , 31 ]. Two groups found no difference in level of granted autonomy between genders but that women rated lower perceived autonomy on self-evaluation [ 27 , 28 ]. Other studies found that assessors consistently gave female trainees lower autonomy ratings, but only in one paper was this replicated in lower performance scores [ 29 , 30 , 31 ].

Padilla et al. [ 32 ] reported no difference in entrustable professional activity assessment (EPA) levels between genders, yet women rated themselves much lower, which they regarded as evidence of imposter syndrome amongst female trainees. Cooney et al. [ 33 ] found that male trainers scored EPAs for women significantly lower than men, while female trainers rated both genders similarly. Conversely, Roshan et al. [ 34 ] found that male assessors were more positive in feedback comments to female trainees than male trainees, whereas they also found that comments from female assessors were comparable for each gender.

Surgical Experience

Gong et al. [ 35 ] found significantly fewer cataract operations were performed by women in ophthalmology residency programmes, which they suggested could be due to trainers being more likely to give cases to male trainees. Female trainees also participated in fewer robotic colorectal procedures, with less operative time on the robotic console afforded [ 36 ]. Similarly, a systematic review highlighted female trainees in various specialties performed fewer cases per week and potentially had limited access to training facilities [ 37 ]. Eruchalu et al. [ 38 ] found that female trainees performed fewer cases, that is, until gender parity was reached, after which case logs were equivalent.

Technical skills

Antonoff et al. [ 39 ] found higher scores for men in coronary anastomosis skills, with women receiving more “fail” assessments. Dill-Macky et al. [ 40 ] analysed laparoscopic skill assessment using blinded videos of trainees and unblinded assessments. While there was no difference in blinded scores between genders, when comparing blinded and unblinded scores individually, assessors were less likely to agree on the scores of women compared to men. However, another study about laparoscopic skills by Skjold-Ødegaard et al. [ 41 ] reported higher performance scores in female residents, particularly when rated by women. The lowest score was shown in male trainees rated by men. While some studies showed disparities in assessment, several studies reported no difference in technical skill assessments (arthroscopic, knot tying, and suturing skills) between genders [ 42 , 43 , 44 , 45 , 46 ].

Several studies investigated trainees’ abilities to complete isolated tasks associated with surgical skills. In laparoscopic tasks, men were initially more skilful in peg transfer and intracorporeal knot tying than women. Following training, the performance was not different between genders [ 47 ]. A study on microsurgical skills reported better initial visual-spatial and perceptual ability in men, while women had better fine motor psychomotor ability. However, these differences were not significant, and all trainees improved significantly after training [ 48 ]. A study by Milam et al. [ 49 ] revealed men performed better in mental rotation tasks and women outperformed in working memory. They hypothesised that female trainees would experience stereotype threat, fear of being reduced to a stereotype, which would impair their performance. They found no evidence of stereotype threat influencing female performance, disproving their hypothesis, a finding supported by Myers et al. [ 50 ].

Ethnicity and country of graduation

Most papers reported ethnicity and country of graduation concurrently, for example grouping trainees as White UK graduates (WUKG), Black and minority ethnicity UK graduates (BME UKG), and international medical graduates (IMG). Therefore, these areas will be addressed together in the following section.

When assessing the likelihood of passing American Board of Surgery (ABS) examinations on first attempt, Yeo et al. [ 51 ] found that White trainees were more likely than non-White. They found that the influence of ethnicity was more significant in the end-of-training certifying exam than in the start-of-training qualifying exam. This finding was corroborated in a study of both the OITE and ABOS certifying exam, suggesting widening inequalities during training [ 52 ].

Two UK-based studies reported significantly higher MRCS pass rates in White trainees compared to BMEs [ 4 , 14 ]. BMEs were less likely to pass MRCS Part A and B, though this was not true for Part A when variations in socioeconomic background were corrected for [ 14 ]. However, Robinson et al. [ 53 ] found no difference in MRCS pass rates based on ethnicity. Another study by Robinson et al. [ 15 ] demonstrated similar pass rates between WUKGs and BME UKGs, but IMGs had significantly lower pass rates than all UKGs. The FRCS pass rates of WUKGs, BME UKGs and IMGs were 76.9%, 52.9%, and 53.9%, respectively, though these percentages were not statistically significantly different [ 16 ].

There was no difference in MRCOG results based on ethnicity, but higher success rates were found in UKGs [ 19 ]. In FRCOphth, WUKGs had a pass rate of 70%, higher than other groups of trainees, with a pass rate of only 45% for White IMGs [ 52 ].

By gathering data from training programmes reporting little to no DA due to ethnicity, Roe et al. [ 54 ] were able to provide a list of factors they felt were protective against DA, such as having supportive supervisors and developing peer networks.

Assessment for progression

RCOphth [ 55 ] found higher rates of satisfactory ARCP outcomes for WUKGs compared to BME UKGs, followed by IMGs. RCOG [ 19 ] discovered higher rates of non-satisfactory ARCP outcomes from non-UK graduates, particularly amongst BMEs and those from the European Economic Area (EEA). Tiffin et al. [ 56 ] considered the difference in experience between UK graduates and UK nationals whose primary medical qualification was gained outside of the UK, and found that the latter were more likely to receive a non-satisfactory ARCP outcome, even when compared to non-UK nationals.

Woolf et al. [ 57 ] explored reasons behind DA by conducting interview studies with trainees. They investigated trainees’ perceptions of fairness in evaluation and found that trainees felt relationships developed with colleagues who gave feedback could affect ARCP results, and might be challenging for BME UKGs and IMGs who have less in common with their trainers.

Workplace-based assessment

Brooks et al. [ 58 ] surveyed the prevalence of microaggressions against Black orthopaedic surgeons during assessment and found that 87% of participants experienced some level of racial discrimination during workplace-based performance feedback. Black women reported receiving more racially focused and devaluing statements from their seniors than men.

Surgical experience

Eruchalu et al. [ 38 ] found that White trainees performed more major surgical cases and more cases as a supervisor than did their BME counterparts.

Dill-Macky et al. [ 40 ] reported no significant difference in laparoscopic surgery assessments between ethnicities.

Individual and family background in education

Two studies [ 4 , 16 ] concentrated on educational background, considering factors such as parental occupation and attendance at a fee-paying school. The MRCS Part A pass rate was significantly higher for trainees for whom Medicine was their first degree, those with university-educated parents, those from a higher POLAR (Participation of Local Areas) quintile, and those from fee-paying schools. A higher Part B pass rate was associated with graduating from non-Graduate Entry Medicine programmes and with parents in managerial or professional occupations [ 4 ]. Trainees with higher degrees had an almost fivefold increase in FRCS success and seven times more scientific publications than their counterparts [ 16 ].

Socioeconomic background

Two studies graded socioeconomic level using the Index of Multiple Deprivation quintile, the official measure of relative deprivation for small geographical areas in England. The area was defined at the time of medical school application. Deprivation quintiles (DQ) were calculated, ranging from DQ1 (most deprived) to DQ5 (least deprived) [ 4 , 14 ].
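The quintile grouping described above is simple to reproduce. The following is a minimal sketch, not the authors' code, showing how an English IMD rank (1 = most deprived area) can be banded into the deprivation quintiles DQ1–DQ5 used in these studies; the area counts and ranks are illustrative only.

```python
# Hypothetical sketch: band an IMD rank (1 = most deprived) into DQ1 (most deprived)
# to DQ5 (least deprived). The area count below is illustrative, not the real figure.
def imd_rank_to_quintile(rank: int, n_areas: int) -> str:
    if not 1 <= rank <= n_areas:
        raise ValueError("rank must be between 1 and n_areas")
    quintile = (rank - 1) * 5 // n_areas + 1  # equal-sized fifths of the ranked areas
    return f"DQ{quintile}"

print(imd_rank_to_quintile(3_000, 32_000))   # DQ1: falls in the most deprived fifth
print(imd_rank_to_quintile(30_000, 32_000))  # DQ5: falls in the least deprived fifth
```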

A history of less deprivation was associated with a higher MRCS Part A pass rate. Greater success in Part B was associated with no history of requiring income support and with living in less deprived areas [ 4 ]. Trainees from DQ1 and DQ2 had lower pass rates and required more attempts to pass [ 14 ]. A general trend of better examination outcomes was found among O&G trainees from less deprived quintiles [ 19 ].

Trainees from DQ1 and DQ2 received significantly more non-satisfactory ARCP outcomes (24.4%) than DQ4 and DQ5 (14.2%) [ 14 ].

Age

Trainees who graduated at an age of less than 29 years were more likely to pass MRCS than their older counterparts [ 4 ].

Authors of two studies [ 18 , 56 ] found that older trainees received more non-satisfactory ARCP outcomes. Likewise, there was a higher percentage of non-satisfactory ARCP outcomes in O&G trainees aged over 45 compared with those aged 25–29, regardless of gender [ 19 ].

Disability

Trainees with a disability had significantly lower pass rates in MRCS Part A compared to candidates without a disability. However, the difference was not significant for Part B [ 59 ].

What have we learnt from the literature?

It is heartening to note the recent increase in interest in DA (27 studies in the last 4 years, compared to 26 in the preceding 40) (Fig.  2 ). The vast majority (77%) of studies are quantitative, based in the US or UK (89%), focus on gender (85%) and relate to clinical assessments (51%) rather than examination results. Therefore, the surgical community has invested primarily in researching the experience of women in the USA and UK.

Interestingly, a report by RCOG [ 19 ] showed that men were more likely to receive non-satisfactory ARCP outcomes than women, and a study by Rushd et al. [ 17 ] found that women were more likely to pass part 2 of MRCOG than men. This may be because within O&G men are the “out-group” (a social group or category characterised by marginalisation or exclusion by the dominant cultural group) as 75% of O&G trainees are female [ 60 ].

This contrasts with other specialities in which men are the in-group and women are seen to underperform. Outside of O&G, women are less likely than men to pass MRCS [ 4 , 14 ], to receive a satisfactory ARCP outcome [ 16 , 18 ], or to receive positive feedback [ 24 ], and they perform fewer procedures [ 34 , 35 ]. This often leads to poor self-confidence in women [ 32 ], which can then worsen performance [ 21 ].

It proves difficult to comment on DA for many groups due to a lack of evidence. The current research suggests that being older, having a disability, graduate entry to medicine, low parental education, and living in a lower socioeconomic area at the time of entering medical school are all associated with lower MRCS pass rates. Being older and having a lower socioeconomic background are also associated with non-satisfactory ARCP outcomes, slowing progression through training.

These characteristics may provide a compounding negative effect – for example having a previous degree will automatically make a trainee older, and living in a lower socioeconomic area makes it more likely their parents will have a non-professional job and not hold a higher degree. When multiple protected characteristics interact to produce a compounded negative effect for a person, it is often referred to as “intersectional discrimination” or “intersectionality” [ 61 ]. This is a concept which remains underrepresented in the current literature.

The literature is not yet in agreement over the presence of DA due to ethnicity. Many studies report perceived discrimination; however, the data on exam and clinical assessment outcomes are equivocal. This may be due to the fluctuating nature of in-groups and out-groups, and multiple intersecting characteristics. Despite this, the lived experience of BME surgeons should not be ignored and requires further investigation.

What are the gaps in the literature?

The overwhelming majority of literature exploring DA addresses issues of gender, ethnicity or country of medical qualification. Whilst bias related to these characteristics is crucial to recognise, studies into other protected characteristics are few and far between. The only paper on disability reported striking differences in attainment between disabled and non-disabled registrars [ 59 ]. There has also been increased awareness of neurodiversity amongst doctors, yet an exploration of the experiences of neurodiverse surgeons and their progress through training has yet to be published [ 62 ].

The implications of being LGBTQ + in surgical training have not been recognised nor formally addressed in the literature. Promisingly, the experiences of LGBTQ + medical students have been recognised at an undergraduate level, so one can hope that this will be translated into postgraduate education [ 63 , 64 ]. While this is deeply entwined with experiences of gender discrimination, it is an important characteristic that the surgical community would benefit from addressing, along with disability. To a lesser extent, the effect of socioeconomic background and age have also been overlooked.

Characterising trainees for the purpose of research

Ethnicity is deeply personal, self-defined, and may change over time as personal identity evolves; arbitrarily grouping diverse ethnic backgrounds is therefore unlikely to capture an accurate representation of experiences. There are levels of discrimination even within minority groups; colourism in India means dark-skinned Indians will experience more discrimination than light-skinned Indians, even from those within their own ethnic group [ 65 ]. Therefore, although the studies included in the scoping review accepted self-definitions of ethnicity, this is likely not enough to fully capture the nuances of bias and discrimination present in society. For example, Ellis et al. [ 4 ] grouped participants as “White”, “Mixed”, “Asian”, “Black” and “Other”; however, they could also have assigned a skin tone value such as the NIS Skin Colour Scale [ 66 ], thus providing more detail.

Ethnicity is more than genetic heritage; it is also cultural expression. The experience of an IMG in UK postgraduate training will differ from that of a UKG, an Indian UKG who grew up in India, and an Indian UKG who grew up in the UK. These are important distinctions which are noted in the literature (e.g. by Woolf et al., 2016 [ 57 ]); however, some studies do not distinguish between ethnicity and graduate status [ 15 ], and none delve into an individual’s cultural expression (e.g., clothing choice) and how this affects the perception of their assessors.

Reasons for DA

Despite the recognition of inequalities in all specialties of surgery, there is a paucity of data explicitly addressing why DA occurs. Reasons behind the phenomenon must be explored to enable change and eliminate biases. Qualitative research is better attuned to capturing the complexities of DA through observation or interview-based studies. Currently, most published data are quantitative and rely on performance metrics to demonstrate the presence of DA while leaving its causes unexamined. Promisingly, there are a gradually increasing number of qualitative, predominantly interview-based, studies (Fig.  2 ).

To create a map of DA in all its guises, an analysis of the themes reported to be contributory to its development is helpful. In our review of the literature, four themes have been identified:

Training culture

In higher surgical training, for there to be equality in outcomes, there needs to be equity in opportunities. Ellis et al. [ 4 ] recognised that variation in training experiences, such as accessibility of supportive peers and senior role models, can have implications on attainment. Trainees would benefit from targeted support at times of transition, such as induction or at examinations, and it may be that currently the needs of certain groups are being met before others, reinforcing differential attainment [ 4 ].

Experience of assessment

Most literature on DA relates to the presence (or absence) of an attainment gap in assessments, such as ARCP or MRCS. It is assumed that these assessments of trainee development are objective and free of bias, and indeed several authors have described a lack of bias in these high-stakes examinations (e.g., Ong et al., 2019 [ 12 ]; Robinson et al., 2019 [ 53 ]). However, in some populations, such as disabled trainees, there are differences in attainment [ 59 ]. This is demonstrated despite legislation requiring professional bodies to make reasonable adjustments to examinations for disabled candidates, such as additional time, text formatting amendments, or wheelchair-accessible venues [ 67 ]. It would therefore be beneficial to investigate the implementation of these adjustments across higher surgical examinations and identify any deficits.

Social networks

Relationships between colleagues may influence DA in multiple ways. Several studies identified that a lack of a relatable and inspiring mentor may explain why female or BME doctors fail to excel in surgery [ 4 , 55 ]. Certain groups may receive preferential treatment due to their perceived familiarity to seniors [ 35 ]. Robinson et al. [ 15 ] recognised that peer-to-peer relationships were also implicated in professional development, and the lack thereof could lead to poor learning outcomes. Therefore, a non-discriminatory culture and inclusion of trainees within the social network of training is posited as beneficial.

Personal characteristics

Finally, personal factors directly related to protected characteristics have been suggested as a cause of DA. For example, IMGs may perform worse in examinations due to language barriers, and those from disadvantaged backgrounds may have less opportunity to attend expensive courses [ 14 , 16 ]. Although it is impossible to remove these disadvantages from training entirely, we may mitigate their influence by recognising their presence and providing solutions.

The causes of DA may also be grouped into three levels, as described by Regan de Bere et al. [ 68 ]: macro (the implications of high-level policy), meso (focusing on institutional or working environments) and micro (the influence of individual factors). This can intersect with the four themes identified above, as training culture can be enshrined at both an institutional and individual level, influencing decisions that relate to opportunities for trainees, or at a macro level, such as in the decisions made on nationwide recruitment processes. These three levels can be used to more deeply explore each of the four themes to enrich the discovery of causes of DA.

Discussions outside of surgery

Authors in General Practice (e.g., Unwin et al., 2019 [ 69 ]; Pattinson et al., 2019 [ 70 ]), postgraduate medical training (e.g., Andrews, Chartash, and Hay, 2021 [ 71 ]), and undergraduate medical education (e.g., Yeates et al., 2017 [ 72 ]; Woolf et al., 2013 [ 73 ]) have published more extensively on the aetiology of DA. A study by Hope et al. [ 74 ] evaluating bias in MRCP exams used differential item functioning to identify individual questions which demonstrated an attainment gap between male and female trainees and between Caucasian and non-Caucasian trainees. Conclusions drawn about MRCP Part 1 examinations may be generalisable to MRCS Part A or FRCOphth Part 1: they are all multiple-choice examinations testing applied basic science and usually taken within the first few years of postgraduate training. Differential item functioning analysis should therefore also be applied to these examinations. However, it is possible that findings in some subspecialities may not be generalisable to others, as training environments can vary profoundly. The RCOphth [ 55 ] reported that in 2021, 53% of ophthalmic trainees identified as male, whereas in Orthopaedics 85% identified as male, suggesting different training environments [ 5 ]. It is useful to identify commonalities of DA between surgical specialties and in the wider scope of medical training.
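To make the suggested analysis concrete, the sketch below illustrates one common form of differential item functioning screening, uniform DIF via logistic regression, applied to simulated exam data. It is not the method or data used by Hope et al.; the group labels, significance threshold and simulated responses are purely hypothetical assumptions for illustration.

```python
# Illustrative sketch only: each item's correctness is regressed on total score
# (a proxy for ability) plus group membership. A significant group coefficient,
# after conditioning on total score, suggests the item behaves differently for
# matched candidates from the two groups. All data here are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_candidates, n_items = 500, 20
responses = rng.integers(0, 2, size=(n_candidates, n_items))  # 1 = item answered correctly
group = rng.integers(0, 2, size=n_candidates)                  # e.g. 0 = reference, 1 = focal group
total_score = responses.sum(axis=1)

def uniform_dif(item_correct, total_score, group):
    """Fit correctness ~ total score + group; return the group coefficient and p-value."""
    X = sm.add_constant(np.column_stack([total_score, group]))
    fit = sm.Logit(item_correct, X).fit(disp=0)
    return fit.params[2], fit.pvalues[2]

alpha = 0.05 / n_items  # simple Bonferroni correction across items
for item in range(n_items):
    coef, p = uniform_dif(responses[:, item], total_score, group)
    flag = "possible DIF" if p < alpha else "no evidence of DIF"
    print(f"Item {item + 1:2d}: group coef {coef:+.2f}, p = {p:.3f} -> {flag}")
```

With purely random data, as here, few or no items should be flagged; applied to real item-level responses, flagged items would warrant review of their content and wording.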

Limitations of our paper

Firstly, although we aimed to provide a review focussed on the experience of surgical trainees, four papers contained data about either non-surgical trainees or medical students. It is difficult to separate the surgeons within these data, so there may be issues with generalisability. Furthermore, we did not consider the background of each paper’s authors, as their own lived experience of the attainment gap could form the lens through which they commented on surgical education, colouring their interpretation. Despite intending to include as many protected characteristics as possible, inevitably some lived experiences will have been missed. Lastly, the experiences of surgical trainees outside the English-speaking world were omitted. No studies were found that originated outside Europe or North America, and therefore the presence or characteristics of DA outside this area cannot be assumed.

Conclusion

Experiences of inequality in surgical assessment are prevalent across all surgical subspecialties. To further investigate DA, researchers should ensure that all protected characteristics, and how they interact, are considered in order to gain insight into intersectionality. Given the paucity of current evidence, particular focus should be given to the implications of disability, and specifically neurodiversity, for progress through training, as these are yet to be explored in depth. In defining protected characteristics, future authors should be explicit and should avoid generalisation of cultural backgrounds to allow authentic appreciation of the attainment gap. Few authors have considered the driving forces behind bias in assessment and DA, and therefore qualitative studies should be prioritised to uncover causes of and protective factors against DA. Once these influences have been identified, educational designers can develop new assessment methods that ensure equity across surgical trainees.

Data availability

All data provided during this study are included in the supplementary information files.

Abbreviations

ACGME: Accreditation Council for Graduate Medical Education
ABOS: American Board of Orthopaedic Surgery
ABS: American Board of Surgery
ABSCE: American Board of Surgery Certifying Exam
ARCP: Annual Review of Competence Progression
BAME: Black, Asian, and Minority Ethnicity
CREOG: Council on Resident Education in Obstetrics and Gynecology
DA: Differential Attainment
DQ: Deprivation Quintile
EEA: European Economic Area
EPA: Entrustable Professional Activities
FRCOphth: Fellowship of The Royal College of Ophthalmologists
FRCS: Fellow of the Royal College of Surgeons
GMC: General Medical Council
HST: Higher Surgical Training
IMG: International Medical Graduate
ITER: In-Training Evaluation Report
MRCOG: Member of the Royal College of Obstetricians and Gynaecologists
MRCP: Member of the Royal College of Physicians
MRCS: Member of the Royal College of Surgeons
O&G: Obstetrics and Gynaecology
OITE: Orthopaedic In-Training Examination
POLAR: Participation of Local Areas
PGY: Postgraduate Year
RCOphth: The Royal College of Ophthalmologists
RCOG: The Royal College of Obstetricians and Gynaecologists
RCS England: The Royal College of Surgeons of England
UKG: United Kingdom Graduate
WUKG: White United Kingdom Graduate

Joseph JP, Joseph AO, Jayanthi NVG, et al. BAME Underrepresentation in Surgery Leadership in the UK and Ireland in 2020: An Uncomfortable Truth. The Bulletin of the Royal College of Surgeons of England. 2020; 102 (6): 232–33.

Royal College of Surgeons of England. The Royal College – Our Professional Home. An independent review on diversity and inclusion for the Royal College of Surgeons of England. Review conducted by Baroness Helena Kennedy QC. RCS England. 2021.

Sarafidou K, Greatorex R. Surgical workforce: planning today for the workforce of the future. Bull Royal Coll Surg Engl. 2011;93(2):48–9. https://doi.org/10.1308/147363511X552575 .


Ellis R, Brennan P, Lee AJ, et al. Differential attainment at MRCS according to gender, ethnicity, age and socioeconomic factors: a retrospective cohort study. J R Soc Med. 2022;115(7):257–72. https://doi.org/10.1177/01410768221079018 .

Hope C, Humes D, Griffiths G, et al. Personal Characteristics Associated with Progression in Trauma and Orthopaedic Specialty Training: A Longitudinal Cohort Study. Journal of Surgical Education. 2022;79(1):253–59. doi:10.1016/j.jsurg.2021.06.027.

Takeshita J, Wang S, Loren AW, et al. Association of Racial/Ethnic and Gender Concordance Between Patients and Physicians With Patient Experience Ratings. JAMA Network Open. 2022; 3(11). doi:10.1001/jamanetworkopen.2020.24583.

Katz, P. The Scalpel’s Edge: The Culture of Surgeons. Allyn and Bacon, 1999.

Tørring B, Gittell JH, Laursen M, et al. (2019) Communication and relationship dynamics in surgical teams in the operating room: an ethnographic study. BMC Health Services Research. 2019;19, 528. doi:10.1186/s12913-019-4362-0.

Veazey Brooks J & Bosk CL. (2012) Remaking surgical socialization: work hour restrictions, rites of passage, and occupational identity. Social Science & Medicine. 2012;75(9):1625-32. doi: 10.1016/j.socscimed.2012.07.007.

Arksey H & OʼMalley L. Scoping studies: Towards a methodological framework. International Journal of Social Research Methodology. 2005;8(1), 19–32.

Maker VK, Marco MZ, Dana V, et al. Can We Predict Which Residents Are Going to Pass/Fail the Oral Boards? Journal of Surgical Education. 2012;69 (6): 705–13.

Ong TQ, Kopp JP, Jones AT, et al. Is there gender Bias on the American Board of Surgery general surgery certifying examination? J Surg Res. 2019;237:131–5. https://doi.org/10.1016/j.jss.2018.06.014 .

Pico K, Gioe TJ, Vanheest A, et al. Do men outperform women during orthopaedic residency training? Clin Orthop Relat Res. 2010;468(7):1804–8. https://doi.org/10.1007/s11999-010-1318-4 .

Vinnicombe Z, Little M, Super J, et al. Differential attainment, socioeconomic factors and surgical training. Ann R Coll Surg Engl. 2022;104(8):577–82. https://doi.org/10.1308/rcsann.2021.0255 .

Robinson DBT, Hopkins L, James OP, et al. Egalitarianism in surgical training: let equity prevail. Postgraduate Medical Journal. 2020;96 (1141), 650–654. doi:10.1136/postgradmedj-2020-137563.

Luton OW, Mellor K, Robinson DBT, et al. Differential attainment in higher surgical training: scoping pan-specialty spectra. Postgraduate Medical Journal. 2022;99(1174),849–854. doi:10.1136/postgradmedj-2022-141638.

Rushd S, Landau AB, Khan JA, Allgar V & Lindow SW. An analysis of the performance of UK medical graduates in the MRCOG Part 1 and Part 2 written examinations. Postgraduate Medical Journal. 2012;88 (1039), 249–254. doi:10.1136/postgradmedj-2011-130479.

Hope C, Lund J, Griffiths G, et al. Differences in ARCP outcome by surgical specialty: a longitudinal cohort study. Br J Surg. 2021;108. https://doi.org/10.1093/bjs/znab282.051 .

Royal College of Obstetricians and Gynaecologists. Report Differential Attainment 2019. https://www.rcog.org.uk/media/jscgfgwr/differential-attainment-tef-report-2019.pdf [Last accessed 28/12/23].

Anderson JE, Zern NK, Calhoun KE, et al. Assessment of Potential Gender Bias in General Surgery Resident Milestone Evaluations. JAMA Surgery. 2022;157 (12), 1164–1166. doi:10.1001/jamasurg.2022.3929.

Landau SI, Syvyk S, Wirtalla C, et al. Trainee Sex and Accreditation Council for Graduate Medical Education Milestone Assessments during general surgery residency. JAMA Surg. 2021;156(10):925–31. https://doi.org/10.1001/jamasurg.2021.3005 .

Kwasny L, Shebrain S, Munene G, et al. Is there a gender bias in milestones evaluations in general surgery residency training? Am J Surg. 2021;221(3):505–8. https://doi.org/10.1016/j.amjsurg.2020.12.020 .

Salles A, Mueller CM & Cohen GL. A Values Affirmation Intervention to Improve Female Residents’ Surgical Performance. Journal of Graduate Medical Education. 2016;8 (3), 378–383. doi:10.4300/JGME-D-15-00214.1.

Galvin S, Parlier A, Martino E, et al. Gender Bias in nurse evaluations of residents in Obstetrics and Gynecology. Obstet Gynecol. 2015;126(7S–12S). https://doi.org/10.1097/AOG.0000000000001044 .

Gerull KM, Loe M, Seiler K, et al. Assessing gender bias in qualitative evaluations of surgical residents. Am J Surg. 2019;217(2):306–13. https://doi.org/10.1016/j.amjsurg.2018.09.029 .

Hayward CZ, Sachdeva A, Clarke JR. Is there gender bias in the evaluation of surgical residents? Surgery. 1987;102(2):297–9.


Cookenmaster C, Shebrain S, Vos D, et al. Gender perception bias of operative autonomy evaluations among residents and faculty in general surgery training. Am J Surg. 2021;221(3):515–20. https://doi.org/10.1016/j.amjsurg.2020.11.016 .

Olumolade OO, Rollins PD, Daignault-Newton S, et al. Closing the Gap: Evaluation of Gender Disparities in Urology Resident Operative Autonomy and Performance. Journal of Surgical Education. 2022;79(2):524–530. doi:10.1016/j.jsurg.2021.10.010.

Chen JX, Chang EH, Deng F, et al. Autonomy in the Operating Room: A Multicenter Study of Gender Disparities During Surgical Training. Journal of Graduate Medical Education. 2021;13(5), 666–672. doi: 10.4300/JGME-D-21-00217.1.

Meyerson SL, Sternbach JM, Zwischenberger JB, & Bender EM. The Effect of Gender on Resident Autonomy in the Operating room. Journal of Surgical Education. 2017. 74(6), e111–e118. doi.org/10.1016/j.jsurg.2017.06.014.

Hoops H, Heston A, Dewey E, et al. Resident autonomy in the operating room: Does gender matter? The American Journal of Surgery. 2019;217(2):301–305. doi:10.1016/j.amjsurg.2018.12.023.

Padilla EP, Stahl CC, Jung SA, et al. Gender Differences in Entrustable Professional Activity Evaluations of General Surgery Residents. Annals of Surgery. 2022;275 (2), 222–229. doi:10.1097/SLA.0000000000004905.

Cooney CM, Aravind P, Hultman CS, et al. An Analysis of Gender Bias in Plastic Surgery Resident Assessment. Journal of Graduate Medical Education. 2021;13 (4), 500–506. doi:10.4300/JGME-D-20-01394.1.

Roshan A, Farooq A, Acai A, et al. The effect of gender dyads on the quality of narrative assessments of general surgery trainees. The American Journal of Surgery. 2022; 224 (1A), 179–184. doi.org/10.1016/j.amjsurg.2021.12.001.

Gong D, Winn BJ, Beal CJ, et al. Gender Differences in Case Volume Among Ophthalmology Residents. Archives of Ophthalmology. 2019;137 (9), 1015–1020. doi:10.1001/jamaophthalmol.2019.2427.

Foley KE, Izquierdo KM, von Muchow MG, et al. Colon and Rectal Surgery Robotic Training Programs: An Evaluation of Gender Disparities. Diseases of the Colon and Rectum. 2020; 63(7), 974–979. doi.org/10.1097/DCR.0000000000001625.

Ali A, Subhi Y, Ringsted C et al. Gender differences in the acquisition of surgical skills: a systematic review. Surgical Endoscopy. 2015;29 (11), 3065–3073. doi:10.1007/s00464-015-4092-2.

Eruchalu CN, He K, Etheridge JC, et al. Gender and Racial/Ethnic Disparities in Operative Volumes of Graduating General Surgery Residents. The Journal of Surgical Research. 2022;279:104–112. doi:10.1016/j.jss.2022.05.020.

Antonoff MB, Feldman H, Luc JGY, et al. Gender Bias in the Evaluation of Surgical Performance: Results of a Prospective Randomized Trial. Annals of Surgery. 2023;277 (2), 206–213. doi:10.1097/SLA.0000000000005015.

Dill-Macky A, Hsu C, Neumayer LA, et al. The Role of Implicit Bias in Surgical Resident Evaluations. Journal of Surgical Education. 2022;79 (3), 761–768. doi:10.1016/j.jsurg.2021.12.003.

Skjold-Ødegaard B, Ersdal HL, Assmus J et al. Comparison of Performance Score for Female and Male Residents in General Surgery Doing Supervised Real-Life Laparoscopic Appendectomy: Is There a Norse Shield-Maiden Effect? World Journal of Surgery. 2021;45 (4), 997–1005. doi:10.1007/s00268-020-05921-4.

Leape CP, Hawken JB, Geng X, et al. An investigation into gender bias in the evaluation of orthopedic trainee arthroscopic skills. Journal of Shoulder and Elbow Surgery. 2022;31 (11), 2402–2409. doi:10.1016/j.jse.2022.05.024.

Vogt VY, Givens VM, Keathley CA, et al. Is a resident’s score on a videotaped objective structured assessment of technical skills affected by revealing the resident’s identity? American Journal of Obstetrics and Gynecology. 2003;189(3):688–691. doi:10.1067/S0002-9378(03)00887-1.

Fjørtoft K, Konge L, Christensen J et al. Overcoming Gender Bias in Assessment of Surgical Skills. Journal of Surgical Education. 2022;79 (3), 753–760. doi:10.1016/j.jsurg.2022.01.006.

Grantcharov TP, Bardram L, Funch-Jensen P, et al. Impact of Hand Dominance, Gender, and Experience with Computer Games on Performance in Virtual Reality Laparoscopy. Surgical Endoscopy 2003;17 (7): 1082–85.

Rosser Jr JC, Rosser LE & Savalgi RS. Objective Evaluation of a Laparoscopic Surgical Skill Program for Residents and Senior Surgeons. Archives of Surgery. 1998; 133 (6): 657–61.

White MT & Welch K. Does gender predict performance of novices undergoing Fundamentals of Laparoscopic Surgery (FLS) training? The American Journal of Surgery. 2012;203 (3), 397–400. doi:10.1016/j.amjsurg.2011.09.020.

Nugent E, Joyce C, Perez-Abadia G, et al. Factors influencing microsurgical skill acquisition during a dedicated training course. Microsurgery. 2012;32 (8), 649–656. doi:10.1002/micr.22047.

Milam LA, Cohen GL, Mueller C et al. Stereotype threat and working memory among surgical residents. The American Journal of Surgery. 2018;216 (4), 824–829. doi:10.1016/j.amjsurg.2018.07.064.

Myers SP, Dasari M, Brown JB, et al. Effects of Gender Bias and Stereotypes in Surgical Training: A Randomized Clinical Trial. JAMA Surgery. 2020; 155(7), 552–560. doi.org/10.1001/jamasurg.2020.1127.

Yeo HL, Patrick TD, Jialin M, et al. Association of Demographic and Program Factors With American Board of Surgery Qualifying and Certifying Examinations Pass Rates. JAMA Surgery. 2020;155(1):22–30. doi:10.1001/jamasurg.2019.4081.

Foster N, Meghan P, Bettger JP, et al. Objective Test Scores Throughout Orthopedic Surgery Residency Suggest Disparities in Training Experience. Journal of Surgical Education 2021;78 (5): 1400–1405. doi:10.1016/j.jsurg.2021.01.003.

Robinson DBT, Hopkins L, Brown C, et al. Prognostic Significance of Ethnicity on Differential Attainment in Core Surgical Training (CST). Journal of the American College of Surgeons. 2019;229 (4), e191. doi:10.1016/j.jamcollsurg.2019.08.1254.

Roe V, Patterson F, Kerrin M, et al. What supported your success in training? A qualitative exploration of the factors associated with an absence of an ethnic attainment gap in post-graduate specialty training. General Medical Council. 2019. https://www.gmc-uk.org/-/media/documents/gmc-da-final-report-success-factors-in-training-211119_pdf-80914221.pdf [Last accessed 28/12/23].

Royal College of Ophthalmologists. Data on Differential attainment in ophthalmology and monitoring equality, diversity, and inclusion: Recommendations to the RCOphth. London, Royal College of Ophthalmologists. 2022. https://www.rcophth.ac.uk/wp-content/uploads/2023/01/Differential-Attainment-Report-2022.pdf [Last accessed 28/12/23].

Tiffin PA, Orr J, Paton LW, et al. UK nationals who received their medical degrees abroad: selection into, and subsequent performance in postgraduate training: a national data linkage study. BMJ Open. 2018;8:e023060. doi: 10.1136/bmjopen-2018-023060.

Woolf K, Rich A, Viney R, et al. Perceived causes of differential attainment in UK postgraduate medical training: a national qualitative study. BMJ Open. 2016;6 (11), e013429. doi:10.1136/bmjopen-2016-013429.

Brooks JT, Porter SE, Middleton KK, et al. The Majority of Black Orthopaedic Surgeons Report Experiencing Racial Microaggressions During Their Residency Training. Clinical Orthopaedics and Related Research. 2023;481 (4), 675–686. doi:10.1097/CORR.0000000000002455.

Ellis R, Cleland J, Scrimgeour D, et al. The impact of disability on performance in a high-stakes postgraduate surgical examination: a retrospective cohort study. Journal of the Royal Society of Medicine. 2022;115 (2), 58–68. doi:10.1177/01410768211032573.

Royal College of Obstetricians & Gynaecologists. RCOG Workforce Report 2022. Available at: https://www.rcog.org.uk/media/fdtlufuh/workforce-report-july-2022-update.pdf [Last accessed 28/12/23].

Crenshaw KW. On Intersectionality: Essential Writings. Faculty Books. 2017; 255.

Brennan CM & Harrison W. The Dyslexic Surgeon. The Bulletin of the Royal College of Surgeons of England. 2020;102 (3): 72–75. doi:10.1308/rcsbull.2020.72.

Toman L. Navigating medical culture and LGBTQ identity. Clinical Teacher. 2019;16: 335–338. doi:10.1111/tct.13078.

Torales J, Castaldelli-Maia JM & Ventriglio A. LGBT + medical students and disclosure of their sexual orientation: more than in and out of the closet. International Review of Psychiatry. 2022;34:3–4, 402–406. doi:10.1080/09540261.2022.2101881.

Guda VA & Kundu RV. India’s Fair Skin Phenomena. SKINmed. 2021;19(3), 177–178.

Massey D & Martin JA. The NIS skin color scale. Princeton University Press. 2003.

Intercollegiate Committee for Basic Surgical Examinations. Access Arrangements and Reasonable Adjustments Policy for Candidates with a Disability or Specific Learning Difficulty. 2020. https://www.intercollegiatemrcsexams.org.uk/-/media/files/imrcs/mrcs/mrcs-regulations/access-arrangements-and-reasonable-adjustments-january-2020.pdf [Last accessed 28/12/23].

Regan de Bere S, Nunn S & Nasser M. Understanding differential attainment across medical training pathways: A rapid review of the literature. General Medical Council. 2015. https://www.gmc-uk.org/-/media/documents/gmc-understanding-differential-attainment_pdf-63533431.pdf [Last accessed 28/12/23].

Unwin E, Woolf K, Dacre J, et al. Sex Differences in Fitness to Practise Test Scores: A Cohort Study of GPs. The British Journal of General Practice: The Journal of the Royal College of General Practitioners. 2019; 69 (681): e287–93. doi:10.3399/bjgp19X701789.

Pattinson J, Blow C, Sinha B et al. Exploring Reasons for Differences in Performance between UK and International Medical Graduates in the Membership of the Royal College of General Practitioners Applied Knowledge Test: A Cognitive Interview Study. BMJ Open. 2019;9 (5): e030341. doi:10.1136/bmjopen-2019-030341.

Andrews J, Chartash D & Hay S. Gender Bias in Resident Evaluations: Natural Language Processing and Competency Evaluation. Medical Education. 2021;55 (12): 1383–87. doi:10.1111/medu.14593.

Yeates P, Woolf K, Benbow E, et al. A Randomised Trial of the Influence of Racial Stereotype Bias on Examiners’ Scores, Feedback and Recollections in Undergraduate Clinical Exams. BMC Medicine 2017;15 (1): 179. doi:10.1186/s12916-017-0943-0.

Woolf K, McManus IC, Potts HWW et al. The Mediators of Minority Ethnic Underperformance in Final Medical School Examinations. British Journal of Educational Psychology. 2013; 83 (1): 135–59. doi:10.1111/j.2044-8279.2011.02060.x.

Hope D, Adamson K, McManus IC, et al. Using Differential Item Functioning to Evaluate Potential Bias in a High Stakes Postgraduate Knowledge Based Assessment. BMC Medical Education. 2018;18 (1): 64. doi:10.1186/s12909-018-1143-0.

Funding

No sources of funding to be declared.

Author information

Authors and affiliations.

Department of Surgery and Cancer, Imperial College London, London, UK

Rebecca L. Jones, Suwimol Prusmetikul & Sarah Whitehorn

Department of Ophthalmology, Cheltenham General Hospital, Gloucestershire Hospitals NHS Foundation Trust, Alexandra House, Sandford Road, Cheltenham, GL53 7AN, UK

Rebecca L. Jones

Department of Orthopaedics, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok, Thailand

Suwimol Prusmetikul


Contributions

RJ, SP and SW conceived the study. RJ carried out the search. RJ, SP and SW reviewed and appraised articles. RJ, SP and SW extracted data and synthesized results from articles. RJ, SP and SW prepared the original draft of the manuscript. RJ and SP prepared Figs. 1 and 2. All authors reviewed and edited the manuscript and agreed to the final version.

Corresponding author

Correspondence to Rebecca L. Jones .

Ethics declarations

Ethics approval and consent to participate.

Not required for this scoping review.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Jones, R.L., Prusmetikul, S. & Whitehorn, S. Differential attainment in assessment of postgraduate surgical trainees: a scoping review. BMC Med Educ 24 , 597 (2024). https://doi.org/10.1186/s12909-024-05580-2


Received : 27 February 2024

Accepted : 20 May 2024

Published : 30 May 2024

DOI : https://doi.org/10.1186/s12909-024-05580-2


Keywords

  • Differential attainment
  • Postgraduate



Open Access

Peer-reviewed

Research Article

Mechanisms of school-based peer education interventions to improve young people’s health literacy or health behaviours: A realist-informed systematic review


  • Emily Widnall, 
  • Steven Dodd, 
  • Abigail Emma Russell, 
  • Esther Curtin, 
  • Ruth Simmonds, 
  • Mark Limmer, 
  • Judi Kidger


  • Published: May 31, 2024
  • https://doi.org/10.1371/journal.pone.0302431


Introduction

Peer education interventions are widely used in secondary schools with an aim to improve students’ health literacy and/or health behaviours. Although peer education is a popular intervention technique with some evidence of effectiveness, we know relatively little about the key components that lead to health improvements among young people, or components that may be less helpful. This review aims to identify the main mechanisms involved in school-based peer education health interventions for 11–18-year-olds.

Five electronic databases were searched for eligible studies during October 2020; an updated search was then conducted in January 2023 to incorporate any new studies published between November 2020 and January 2023. To be included in the review, studies must have evaluated a school-based peer education intervention designed to address aspects of the health of students aged 11–18 years and contain data relevant to the mechanisms of effect of these interventions. No restrictions were placed on publication date or country, but only manuscripts available in English were included.

Forty papers were identified for inclusion, with a total of 116 references to intervention mechanisms that were subsequently grouped thematically into 10 key mechanisms. The four most common mechanisms discussed were: 1) peerness (being similar, relatable and credible); 2) a balance between autonomy and support; 3) school values and broader change in school culture; and 4) informal, innovative and personalised delivery methods. Mechanisms were identified in quantitative, qualitative and mixed methods intervention evaluations.

This study highlights a number of key mechanisms that can be used to inform the development of future school-based peer education health interventions to maximise effectiveness. Future studies should aim to create theories of change or logic models and then test the key mechanisms, rather than relying on untested theoretical assumptions. Future work should also examine whether particular mechanisms may lead to harm, whether certain mechanisms are more or less important for addressing different health issues, and whether a set of generic mechanisms always needs to be activated for success.

Citation: Widnall E, Dodd S, Russell AE, Curtin E, Simmonds R, Limmer M, et al. (2024) Mechanisms of school-based peer education interventions to improve young people’s health literacy or health behaviours: A realist-informed systematic review. PLoS ONE 19(5): e0302431. https://doi.org/10.1371/journal.pone.0302431

Editor: Sogo France Matlala, Sefako Makgatho Health Sciences University, SOUTH AFRICA

Received: June 14, 2023; Accepted: April 4, 2024; Published: May 31, 2024

Copyright: © 2024 Widnall et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This study was funded by the National Institute for Health and Care Research (NIHR) School for Public Health Research (project number SPHR PHPES025). The views presented in this study are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

School-based health interventions offer a valuable opportunity for prevention and early intervention to ensure good health and well-being in school-aged children. One key area of intervention is improving health literacy and health behaviours among adolescents, an important public health topic given the strong links between health literacy and adult health outcomes [ 1 ], as well as the promotion of health across the life course [ 2 ]. One popular approach is the use of peer-to-peer teaching methods, with evidence of a global rise in peer education interventions over the past few decades [ 3 , 4 ]. The literature offers several reasons as to why peer education interventions are popular and likely to be an effective method for improving health literacy and health behaviours. The importance of peer influence, particularly in adolescence, is well-documented [ 5 , 6 ], and there is evidence that young people are more likely to seek help for health and well-being from informal sources of support, such as friends, than from a school teacher or other trusted adults [ 7 ]. Peers also play an important role within their school community and are likely to be seen as more relatable than teachers, with less of an imbalance of authority [ 8 ]. Previous research has also found that younger students see peers as role models [ 9 ].

Many assumptions about the effectiveness of peer education for health improvement centre around adolescence. For example, adolescence is discussed as a key period for establishing health-related beliefs, opinions and decision-making [ 10 ], and as a period of increased risk-taking behaviour [ 11 ]. Many also argue that peer education is based on the rationale that peers have a stronger influence on individual behaviour because of the familiarity, trust and comfort they are able to provide [ 12 ].

Existing peer education interventions cover a wide range of health areas, including mental health, physical health, sexual health, and health promotion and behaviour (e.g., healthy eating and smoking prevention [ 13 – 16 ]). This review focuses specifically on peer education, typically involving the selection and training of ‘peer educators’, who subsequently teach or support younger or similar aged students in their school, known as ‘peer learners’. Although there are variations in peer-led interventions including 1:1 peer mentoring and counselling [ 17 – 20 ], these are beyond the scope of this review.

A number of key theories have been suggested as underpinning peer education. The most widely cited key theoretical underpinnings within the peer education literature include Bandura’s Social Learning/Social Cognitive Theory [ 21 , 22 ]; the Diffusion of Innovation Theory [ 23 , 24 ] and within the peer education health literature, the Health Belief Model [ 25 ].

Although key theories are often stated within peer education papers, studies often fail to show clearly how the outcomes are derived from theoretical constructs [ 26 ]. Despite the range of theoretical approaches discussed within the peer education literature, there remains a lack of evidence for the specific mechanisms at play which lead to improved health outcomes in peer-led interventions. There has also been a call for the development of logic models to identify the change mechanisms that lead to changes in health literacy and/or behaviours [ 27 ].

Traditional systematic review approaches have often been criticised for being overly specific and rigid [ 28 – 30 ]. Although a number of reviews exist to assess the effectiveness of peer education interventions, they typically lack in explaining why these interventions may or may not work, in what contexts, and under what circumstances. Realist reviews have emerged as a strategy for synthesising evidence for complex social interventions, such as for peer education [ 31 – 33 ] and other complex health-related interventions such as housing and mental health programs [ 34 – 36 ]. Realist reviews provide an explanatory focus to understand and unpack the mechanisms by which an intervention works, for whom, and in what circumstances [ 29 ].

The aim of this review is to address the limitations in the existing literature, in particular the lack of current exploration of mechanisms of change, by identifying the key mechanisms involved in school-based peer education interventions that aim to improve health literacy and health behaviours in young people.

Methods

The PICO (Population, Intervention, Comparator and Outcome) format was followed when developing our research questions. This review was completed in accordance with the 2009 PRISMA statement [ 37 ] and pre-registered on PROSPERO (CRD42021229192). A PRISMA checklist can be found in S1 Table .

Realist approach to systematic review

Rather than focusing on meta-analysis and pooling intervention effect sizes, realist reviews seek to generate learning and insights into why interventions work or do not work and what explains these effects [ 29 ]. Additionally, a realist review uses the contextual characteristics of programs to help explain program successes and/or failures. This approach to evaluating existing evidence is explanatory because it combines both theoretical thinking and empirical evidence about how and why interventions work (or do not work).

Search strategy and selection criteria

The search process was part of a wider review of effectiveness of school-based peer education interventions to improve young people’s health [ 38 ]. Although the initial search identified a larger number of studies, this review only reports on 40 papers that directly comment on mechanisms of change involved in the peer education interventions. Initial searches as part of the wider review of effectiveness were conducted during October 2020. An updated search was then conducted in January 2023 to incorporate any new studies published between November 2020 and January 2023 that contained data on mechanisms of effect.

Five electronic databases were searched for eligible studies: CINAHL, Embase, ERIC, MEDLINE and PsycInfo. Search terms were developed by looking at key texts and in discussion with the study team and involved pilot searches and subsequent refinements. An example of the search terms can be found in the S2 Table . Given the lack of literature on peer education mechanisms and the exploratory nature of this review, no restrictions were placed on publication date, country or language.

The inclusion and exclusion criteria for the review are included in Table 1 .


Primary outcome(s)

1) Evidence of mechanisms or factors associated with peer education interventions that explain why improvements were or were not seen in participant health.

Data extraction, selection and coding.

Two authors (SD and EW) independently screened papers according to the inclusion criteria above using the Rayyan online review platform ( https://www.rayyan.ai/ ). Any cases of uncertainty or disagreement were discussed and agreed among the wider research team.

Two authors (SD and EW) independently extracted the data, discussing and resolving any discrepancies that arose. Typically, disagreements arose during the initial coding process, when we were detailing the types of mechanisms appearing in each paper, before we had consolidated the themes and confirmed the names of the mechanisms. These discrepancies were resolved through iterative discussions between EW and SD as well as with the wider team.

Data extraction included author, year of publication, location, study aim, design and sample size, description of the intervention, outcome measures and findings relating to mechanisms of peer education interventions. All data relating to how and why the peer education intervention was thought to be effective (or ineffective) were extracted onto a spreadsheet.

In deciding whether data or themes were pertinent to the synthesis, reviewers considered if the identified data offered an explanatory account of what was going on between the intervention(s) and its outcomes. For qualitative studies, data were derived from themes, participant quotes and lesson observations. For quantitative studies, data were derived from author reflections on intervention content and reflections on numerical data as insight into underlying mechanisms.

EW and SD conducted a thematic analysis of data about reasons for effectiveness to identify overarching themes and create a table of ‘key mechanisms’ ( Table 3 ). The analysis followed some aspects of the framework approach [ 39 ], primarily by creating an analytical framework to code all extracted mechanisms. After reading the included papers and extracting any relevant mechanism data into the extraction table, EW and SD made initial notes and started a set of preliminary codes. EW and SD agreed the list of codes which closely described all the mechanisms that were discussed in the included papers. This process went through several iterations through discussions between EW and SD as well as the wider author team. Once the coding framework (set of identified mechanisms) was agreed upon, EW and SD used the final framework to code all relevant mechanisms data within the extraction table. We went through an iterative process of comparing and consolidating mechanisms between papers and reduced an initial larger number of mechanisms to reach the final 10 included.
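As an illustration of the final tallying step described above (not the authors' actual extraction table or codes), the snippet below shows how coded mechanism references could be counted per key mechanism and per paper once the coding framework is agreed; the paper identifiers and code names are hypothetical.

```python
# Hedged sketch: tally framework-coded mechanism references per key mechanism and per paper.
from collections import Counter, defaultdict

# Hypothetical extraction rows: (paper_id, key_mechanism_code)
extracted = [
    ("Study_A", "peerness"), ("Study_A", "autonomy_support"),
    ("Study_B", "peerness"), ("Study_B", "school_culture"),
    ("Study_C", "informal_delivery"), ("Study_C", "peerness"),
]

per_mechanism = Counter(code for _, code in extracted)   # how often each key mechanism is referenced
per_paper = defaultdict(set)
for paper, code in extracted:
    per_paper[paper].add(code)                           # distinct mechanisms discussed per paper

print(per_mechanism.most_common())
print({paper: len(codes) for paper, codes in per_paper.items()})
```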

Quality appraisal

The Mixed Methods Appraisal Tool (MMAT) was used to assess the quality of reporting procedures. This tool consists of five quality rating items specific to each study design (qualitative, quantitative randomized, quantitative non-randomized, quantitative descriptive and mixed methods). Each paper was given a score from 0 to 5 using the relevant items for its design. The following bands were used to summarise study quality: 0–1 indicating poor quality, 2–3 indicating average quality and 4–5 indicating high quality. Quality ratings of all included papers can be found in S3 Table .
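The banding rule described above is straightforward to apply; the following is a small sketch of that scoring rule as used in this review (it is not the MMAT instrument itself, and the example score is hypothetical).

```python
# Hedged sketch of the review's quality banding: MMAT criteria met (0-5) -> quality rating.
def mmat_rating(items_met: int) -> str:
    if not 0 <= items_met <= 5:
        raise ValueError("MMAT score must be between 0 and 5")
    if items_met <= 1:
        return "poor quality"
    if items_met <= 3:
        return "average quality"
    return "high quality"

print(mmat_rating(4))  # high quality
```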

Results

A total of 2,474 studies were identified after the searches and 40 studies were eligible for inclusion. Fig 1 illustrates a flow diagram of the search. A summary of all included studies can be found in S3 Table , which includes study author, location and year, health area under investigation, sample size, study design, summary of key mechanisms, how the mechanisms were identified, as well as the quality score and rating. The table also records whether the intervention under evaluation used a logic model and whether the paper specifically referred to mechanisms of change.


Of the 40 papers included in this review, only two papers referenced a logic model [ 40 , 41 ] and only 9 papers specifically referred to the word ‘mechanism(s)’ within their write-up, and this tended to be in the context of calling for future research to focus on intervention mechanisms that lead to effectiveness.

Of the 40 included studies, 20 were mixed methods evaluations, 15 were quantitative studies and 5 were qualitative studies. Typically, mechanisms discussed in relation to quantitative outcomes were in line with key theoretical underpinnings and were identified through authors’ reflections on outcome data, whereas in mixed methods and qualitative papers, mechanisms were identified directly from qualitative themes or quotes. The findings in these studies were derived from the views of young people and teachers in interview/focus group data, classroom observations, and pre- and post-intervention self-report surveys.

Table 2 details all health areas and study designs of included studies. The specific health area of interest in each study is also detailed in S3 Table .


Data extraction identified 116 mechanisms referenced within the 40 papers, which were categorised thematically into 10 ‘key mechanism’ groupings (see Table 3 ). Studies typically discussed 2-3 mechanisms each, but papers ranged from discussing one to six individual mechanisms.


Study quality varied: 15 studies (37.5%) were rated as high quality, 20 (50%) as medium quality and five (12.5%) as low quality. Several studies lacked a detailed description of methodology and many had incomplete outcome data.

Mechanisms discussed in relation to quantitative study findings often centred around existing theoretical assumptions. Table 4 lists the theories cited within the 35 included papers. Theories typically centred around interpersonal influences, social reinforcement and peers acting as role models. The most widely cited theories were Bandura’s Social Learning/Cognitive Theory [ 21 , 22 ] and Diffusion of Innovation Theory [ 23 ].


Description of key mechanisms

Peerness: similarity, relatability and credibility

Nineteen studies discussed the similarity and relatability of the peer educators in relation to the success of peer education interventions, often defined as the concept of ‘peerness’ [ 41 , 44 – 46 , 48 , 52 , 57 , 59 , 69 – 78 ].

Peers are viewed as sharing similar concerns and/or pressures, and often have similar experiences and insights related to health; peer learners therefore feel better able to communicate or mutually disclose health concerns to a peer than to a teacher or other adult.

Young people discussed feeling ‘less awkward’ getting information from someone closer to their own age when reflecting on a sex education intervention [ 46 ]. However, one study also recommended at least a two-year age gap between peer educators and peer learners due to trust issues with peers of the same age [ 74 ].

The extent to which the peer educators were demographically diverse, and similar to the peer learners they engaged with, was a key feature to ensure ‘peerness’, particularly with regards to age, gender, and lived experience of health condition/unhealthy behaviour. There was some disagreement across studies over whether peers should have experience of undesirable health behaviours. Specifically, the ASSIST Trial included peer educators who already smoked. The authors stated they could not be sure that regular smokers would effectively discourage non-smokers from smoking, but reflected that including smokers in the group of peer educators would help to engage ‘smoking cliques’ in the informal diffusion of knowledge [ 52 ].

Related to ‘peerness’ was the idea of peers being ‘credible’ sources of knowledge and young people being more likely to rely on peers for information. One study described young people seeing older peers as ‘in the know’ and more likely to be credible when providing advice and imparting knowledge [ 79 ]. Linking to the demographic diversity aspect, peer educators who are not representative of the diversity of peer learner groups were also seen to lack credibility.

A balance between autonomy and support.

Another common mechanism, also referred to in nineteen studies, related to the positive impact of peer educator autonomy and the need to balance this with ongoing support from teachers throughout the intervention [ 40 , 43 , 48 , 50 , 53 , 57 , 69 – 71 , 74 , 75 , 77 , 80 – 85 ]. Support included both teacher presence within the session and peer educators being able to seek additional support outside of the sessions.

Autonomy referred to peers being given leeway in how they chose to carry out their role, but also to involving peer educators in designing the intervention. Greater autonomy for the peer educators typically meant adults ceding control, which in turn increased educators’ feelings of investment, belief in the content and confidence in their abilities (see Weichold & Silbereisen, 2012). Peer educator autonomy also included peer educators identifying peers in need and using their own social channels to diffuse health information.

Several studies acknowledged that not all ‘problems’ or health discussions could be solved or managed by peer educators, and that additional support from teachers was sometimes needed to help find solutions or explanations [ 71 ]. One sexual health study found that although peer educators could share HIV information with their peers, they did not consider themselves competent to deal with issues such as rape and trauma, which they felt required teacher involvement, additional training, and a reliable referral system. This finding was echoed by another paper in which teachers suggested that peer education is likely to achieve better results for ‘general health education knowledge’, whereas specialised health knowledge should be taught by teachers [ 83 ]. Another study discussed the potential problem of placing too much responsibility on the peer educators [ 74 ].

A further sexual health study described peer educators having difficulty gaining trust as a ‘teacher’, a role their peers took some time to get used to. Many peer educators described feeling anxious or unconfident at first, which sometimes hindered their teaching abilities [ 77 ].

“When I taught one person, she said, ‘Don’t confuse me! Why do you need to teach me?’”

Studies also highlighted the importance of autonomy-supportive language to motivate behaviour change in peers by being encouraging and empathic without dictating what their friends should or should not do [ 40 ].

Some studies also discussed involving peer educators in designing or adapting content so that it was more relevant to their peers [ 82 ].

Autonomy was also discussed in relation to the training of peer educators. In one physical activity study, the benefits of trainers providing choice, valuing peer educator input and using ‘autonomy-supportive language’ were discussed [ 57 ].

“In (school) classes we just get a teacher telling us things, we were a lot more involved in what was going to happen and things like that.” (Peer Educator)

The importance of peer educators being provided with adequate support in both training and delivery was also discussed in relation to peers feeling overburdened, a potential negative consequence of peer education. Peer educators in one sex education study discussed the difficulty of having added responsibility and preparation, and of missing classes; this sometimes led to missing planning meetings and not being fully devoted to lesson planning [ 85 ].

School values and broader change in school culture.

Nine studies discussed the opportunity for peer education interventions to create broader system change within the school or changes in school culture [ 40 , 42 , 49 , 51 , 56 , 58 , 59 , 65 , 86 ] and six studies discussed the benefit of health messages being aligned with existing school values/norms [ 47 , 48 , 50 , 55 , 80 , 84 ].

Studies described a culture of ‘connectedness’ and ‘belonging’ resulting from peer education interventions, carried forward through subsequent years of school, which in turn had a wider impact on school culture. Studies also discussed how including more peer educators and peer learners produced broader shifts in school culture and built social cohesion across year groups, which sometimes led to higher ‘reach’ and increased health-related help-seeking amongst students. Some studies had a more specific focus on school-level changes; for example, one suicide prevention intervention (Sources of Strength) focussed on changing norms across the full student population through 3 months of school-wide messaging [ 58 ]. This study found that trained peer leaders reported much more positive expectations that adults at school would help suicidal students, as well as increased norms for help-seeking from adults at school, perhaps shifting help-seeking attitudes across the school more broadly.

Interventions were also found to be more effective if messages were consistent with existing school practice and policies. For example, one study described how, to achieve effective promotion of health messages by peer educators, the intervention needs to fit with the views of the teachers and the existing school culture, promoting the agency of staff and students in pursuing better health [ 84 ]. This also included involving teachers in the programme and having school-based policies that supported the intervention messages.

One evaluation of a smoking intervention also suggested that peer interventions can counter existing peer culture within a school, particularly given that the adoption of a smoking habit is inherently social [ 49 ].

Informal, innovative and personalised delivery methods.

Thirteen studies discussed the importance of peer educators delivering the intervention in a personalised, informal and dynamic style [ 27 , 43 , 45 , 46 , 48 , 50 , 52 , 60 , 78 , 83 , 84 , 87 , 88 ]. Linking to the importance of similarity of experiences and views in the ‘peerness’ mechanism, studies discussed the benefit of peers exchanging personal stories and experiences, which were thought to be more salient for the peer learners [ 48 ]. Equally, this mechanism relates to peer educators being given the autonomy and freedom to be able to deliver lessons in this manner.

Peer interventions often involve multiple learning modalities, which were often more creative and engaging than traditionally structured teacher-student lessons. For example, in an intervention that used short comedy sketches to convey sexual health messages [ 46 ], the peer educators described the least successful workshop as the ‘least interactive, most lecture-like – like feeding them information’. Similarly, a paper on sun safety discussed learning through interesting and interactive activities as being more memorable [ 83 ]. One study also described how the novel content of the lessons prompted curiosity among peer learners [ 88 ]. Sharing personal stories and experiences is also a distinctive learning modality compared with typical lessons.

One study evaluating the prevention of foetal alcohol syndrome reflected that the weakened effects of the intervention were due to its ‘highly didactic instructional approach’: the intervention was largely a biomedical presentation and lacked the personal, dynamic and interactive experiences shown to be effective when teaching adolescents the skills they need to avoid unhealthy risk behaviours [ 27 ].

Studies discussed a general preference for the informal delivery of peer educator sessions, which left students feeling more relaxed than they would with teachers and able to be more open. One study described this as a shift from ‘talking at’ to ‘chatting to’. However, this sometimes created a tension between maintaining informality and exerting authority/control over the class [ 50 ]. Students often identified that successful activities tended to be practical, involving moving around and having ‘fun’.

The removal of the typical teacher-student power dynamic/authoritarian relationship was also discussed in terms of how peers are seen as social equals. When asked how peer educators interacted with peer learners in a sexual health intervention study, one peer educator answered: “…just how we usually speak…we didn’t speak to them like year 9s, we spoke to them like sort of equals”.

One paper discussed how, when peer educators used more disciplinary approaches (for example, asking why students were not engaging in the session), this led to disengagement from peer learners [ 48 ], perhaps because it reflected a more typical teacher-student relationship. Similarly, in a sex education intervention, peer educators noted that the more they channelled the role of a traditional health teacher, the less effective they felt they were as workshop facilitators [ 46 ].

Friendship groups and closeness to a peer educator.

Thirteen studies discussed the role of friendship and closeness between peer educators and peer learners [ 44 , 52 – 54 , 56 , 57 , 59 , 60 , 65 , 76 , 77 , 86 , 89 ]. One study (mental health; mindfulness) discussed how familiarity with their peers could boost motivation to engage with the intervention, help students feel more relaxed, and therefore help them learn more quickly [ 90 ].

A sexual health study described how friendships between peer educators and peer learners led to increased motivation, and how close relationships can aid in teaching and learning [ 77 ].

“When they first came to teach us, we felt good because they are our friends, they are always with us, they listen to us, and they understand us. Therefore, we understood what they were teaching us, and we tried to follow what they taught us.”

A smoking intervention discussed how peer education interventions are likely to be best introduced into close-knit groups characterised by intra-group interaction, endurance of peer-peer relationships, and likelihood for peers to stay in touch once the intervention has ended, which in turn enhanced the sustainability of the intervention effects [ 44 , 53 ]. Other studies discussed being motivated to engage in activities through familiarity and friendship and how receiving support from close friends was more likely to be accepted and met with a positive attitude [ 57 ].

One study (mental health; suicide prevention) also demonstrated the importance of personal affiliations to peer leaders and natural friendship networks as a medium for promoting peer-led prevention efforts, finding that having a friend who was a peer leader led to higher rates of intervention exposure [ 56 ].

As well as making use of existing social networks, interventions also provided opportunity for engaging with peers outside friendship cliques in order to spread health messages further. One evaluation of a smoking intervention demonstrated the opportunity for peers to make new friends as a result of the intervention, describing getting to know people they ‘wouldn’t usually mix with’, and offering the opportunity for peers to relay health messages to the wider school group [ 59 ].

One physical activity study also discussed peer learners being more open with peer educators if it were a conversation between friends [ 57 ].

“When it’s coming from a friend or someone that they’re close with then they’re more sort of open about being active and what they would like to do, maybe rather than … So I feel that it’s sort of better coming from a friend.”

Another study suggested that pairing environmental changes with education and awareness raising among adolescents is more likely to lead to a change in behaviour [ 55 ].

Many intervention descriptions also discussed the peer education intervention influencing students at both the classroom and school level [ 47 ].

Student-nominated peers and peers as role models.

Peers acting as role models, and the importance of selecting peer educators looked up to by other students was discussed in 11 studies [ 41 , 42 , 45 , 46 , 53 , 57 , 60 , 70 , 72 , 75 , 83 ].

" Peer educators are nice people who we can look up to , so the material sank in better " [ 46 ].

Teachers from a study evaluating a sun safety intervention discussed how peer education can encourage students to set an example for others, which peer learners are then more likely to accept as normal/desirable [ 83 ].

Peer educators seen as role models by their peers can model ‘appropriate’ health behaviours. Staff reflected on the importance of peer nomination (also discussed as a separate mechanism) in terms of selecting role models for peers. One study discussed how peers nominated ‘influential individuals who have an effect on other people, and other people look up to and see as leaders, or people they aspire to’ [ 60 ].

Many studies adopted a peer-nomination approach, whilst others relied on staff to nominate students. One study that used staff nomination asked staff members to nominate up to six students whose ‘voices are heard’ by other students [ 56 ]. The prestige of peer nomination also led to peer educators being viewed as more credible [ 53 ].

Some studies also discussed staff and students’ views of the peer nomination process. Teachers in some studies acknowledged that the peer nomination approach resulted in a diverse group of students (acknowledged as important within the ‘peerness’ mechanism), some of whom were unlikely to have been selected by school staff [ 60 ]. Students sometimes expressed reservations, often because some peer educators lacked confidence or did not take the sessions seriously. As well as modelling appropriate health behaviours, student-nominated peers also raised some concerns about role modelling inappropriate health behaviours. For example, in a smoking intervention study, peer educators who smoked were thought to be an asset by some students but viewed as hypocritical by others [ 60 ].

Safe and non-judgemental space to share experiences.

Ten studies discussed peers and/or external trainers creating a safe space to share their thoughts and feelings relating to health [ 57 , 65 , 70 , 71 , 77 , 78 , 84 , 86 – 88 ]. Students described having the opportunity to share information that they would be uncomfortable sharing with adults, without fear of judgement.

Students in one sexual health intervention study described teachers being reserved and less explicit in comparison to the openness of peer educators [ 50 ].

“I think teachers are quite reserved in what they would say and how explicit they would go. They [Peer Educators] are not really worried about what they say to us…it was quite open.”

Non-judgemental spaces also related to the facilitators who trained the peer educators. For example, teachers in one study reflected that students were likely to feel judged if they raised a point about themselves or their peers smoking when a teacher led the training; this was reduced by involving an external organisation with health promotion trainers and youth workers, with teachers asked to take a more passive role [ 70 ]. External trainers in another study also discussed how peer educators felt they could trust them [ 57 ].

“I think it was nice that they could probably talk to us and they know that we, you know, they could trust us. We wouldn’t go in and talk about them behind their back and I think, and I could trust them as well.” (Trainer)

However, one sexual health intervention found that, because sexual health matters were not talked about openly in schools, peer learners were initially very shy about sharing their opinions. This particular study had to run separate sessions for boys and girls to facilitate discussion [ 45 ].

Peer learners in another sexual health intervention discussed the ability to talk openly on sensitive issues which they felt unable to do in typical school lessons with teachers [ 88 ].

“You could speak like normally like you would with your friends about stuff, you weren’t frightened that you’d use a bad word.” Another said, “I don’t think you get the chance to talk a lot in other classes. I don’t know, it’s more difficult to just speak out on like sensitive things.”

Providing a safe, non-judgemental ear for listening was also discussed in the Hope Squad suicide prevention program, on the basis that “kids tell kids” when they are suicidal and peer support could help ameliorate social disconnectedness [ 65 ].

Frequency of interaction and informal diffusion beyond the classroom.

Nine studies discussed the frequency of interaction between peer educators and peer learners [ 27 , 41 , 42 , 52 , 53 , 56 , 57 , 69 , 78 ]. This included the benefit of multiple sessions and repetitive delivery of health messages but also the opportunity peer education provides to continue relaying positive health messages beyond the classroom. Studies discussed how peers can continue the conversation with each other outside of lessons, which demonstrated an advantage of peer diffusion of health-related knowledge in comparison to a traditional teacher-student classroom context which is constrained to one single lesson.

Peer educators in one sexual health study discussed being available after the lesson to support students with specific needs, for example specific questions about sexual health concerns or how to access contraception. Peer educators therefore provided direct health-related help for peer learners outside of lessons [ 78 ].

“Like a lot of them [participants] are wanting condoms, and so [The Health Program] provided the condoms. Even when we ran out of condoms and they asked for some, we referred them to the closest health department in our area.”

This mechanism is related both to the peer educator autonomy mechanism and to the mechanism of friendship groups and closeness to peer educators, particularly for interventions that rely on information diffusion outside of the classroom: such diffusion involves peer educators acting autonomously in deciding who they target, what information they impart and how, while existing relationships outside of the classroom make these interactions more likely.

Some peer learners in a physical activity intervention discussed not feeling they had received any ‘support’ from peers; peer educators believed this may have been due to the informal ways they gave support: “They didn’t really know I was peer supporting because with some friends we did it quite subtly.” (Peer Educator) [ 57 ].

Ratio of peer educators to peer learners.

Four studies discussed the importance of the number of peer leaders being trained, particularly in interventions relying on information diffusion across the school [ 41 , 53 , 56 , 58 ]. One such study focussed on suicide prevention and found training more peer leaders increased school-wide exposure [ 56 ]. This study also found that training up to 15% of the student population as peer leaders increased intervention exposure but after this, the effect appeared to level off.

Similarly, a smoking prevention trial (ASSIST) also discussed training 15% of the target group to maintain a so-called ‘critical mass’ of peer educators, a proportion previously discussed in the HIV prevention literature. This critical mass was also discussed in a sexual health study involving informal diffusion; given the sensitive topic, that study increased the proportion of peer-nominated students to 25% of the year group during recruitment [ 41 ].
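
To make these proportions concrete, the following is a minimal sketch (illustrative only, not drawn from any included study; the function name and the year-group size are hypothetical) of how a target ‘critical mass’ proportion translates into a number of students to recruit as peer educators.

```python
import math

def peer_educators_needed(year_group_size: int, proportion: float = 0.15) -> int:
    """Return how many students to train as peer educators for a year group,
    given a target 'critical mass' proportion (15% by default; one sexual
    health study relying on informal diffusion used 25%)."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be between 0 and 1")
    # Round up so that smaller year groups still reach the target proportion.
    return math.ceil(year_group_size * proportion)

# Example: a (hypothetical) year group of 200 students.
print(peer_educators_needed(200))        # 30 peer educators at 15%
print(peer_educators_needed(200, 0.25))  # 50 peer educators at 25%
```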

Another suicide prevention intervention (Sources of Strength) discussed how the ratio of peer educators to peer learners should be addressed in future studies, as the optimal proportion of students to train as peer educators remains unanswered. This was discussed particularly in reference to the impact the ratio would have in smaller schools, which may have different social norms that could dissuade disclosure of suicidal behaviour (i.e. if there were too many peer educators encouraging peers to seek support from adults, this may dissuade help-seeking behaviours altogether) [ 58 ]. As a core focus of the Sources of Strength intervention is to engage ‘trusted adults’ to help distressed and suicidal peers, the ratio of peer educators to peer learners, as well as the balance between peer educator autonomy and staff support, is likely to be particularly relevant to this intervention.

Simplicity of health messages.

Two studies reflected on the need for health messages to be simple, particularly for interventions relying on informal diffusion of information [ 74 , 80 ]. One study modelled on informal diffusion of knowledge had a dual focus on physical activity and healthy eating; however, the authors deemed this dual focus ‘too complex for information diffusion through adolescent peer networks’. The study concluded that health messages need to be simple for trainers to teach and for students to pass on. However, a tension was also highlighted between a desire not to oversimplify or isolate health behaviours and the need to present clear, succinct health promotion messages.

In contrast, students in one study evaluating a sexual health intervention discussed the need for depth in health promotion messages. One young person, reflecting on sex education in schools, criticised the lack of depth in lessons delivered by teachers, with the main emphasis of the lesson being “don’t do it, and that was basically it” [ 46 ]. In another study evaluating a sexual health intervention, feedback from students in a control school also criticised teachers for repeating the same content: “We do the same subject every time, all about puberty and development. I don’t think they can think of anything else to teach us.” [ 88 ].

Given this contrast, it may be that there are differing views between teaching staff and students on how complex and in-depth health messaging should be. It appears that students feel that more in-depth discussion is more likely to lead to a change in behaviour in comparison to simple and repetitive health messaging.

Mechanisms by health area

We identified the most common mechanisms per health area. For alcohol, smoking and substance use, ‘peerness’ (n = 7) was the most cited. For healthy lifestyles, ‘peerness’ (n = 4), peer educator autonomy vs. additional support from teachers (n = 4) and school values/broader change in school culture (n = 4) were equally cited. For sex education, a balance between peer educator autonomy and teacher support (n = 8) and informal/personalised delivery methods (n = 8) were the two most cited mechanisms, followed by ‘peerness’ (n = 7). ‘Broader system/cultural change’ (n = 4) was the most cited mechanism for mental health interventions, followed by friendship groups/closeness to a peer educator (n = 3).

Discussion

This review aimed to identify the key mechanisms in health-based peer education interventions in school settings. We identified 10 key mechanisms from the literature across four health areas. The health areas covered in this review demonstrate that mechanisms of school-based peer-led interventions have been explored predominantly within alcohol, smoking and substance use, healthy lifestyle and sexual health interventions, which follows the pattern of our effectiveness review [ 38 ].

With regard to providing support for peer educators, one study discussed the importance of close adult supervision, but also the need for resources and a reliable help-seeking/referral pathway to be in place to support the intervention so peer educators do not become overburdened or feel they are letting their peers down [ 84 ]. Close adult supervision and established help-seeking pathways are likely to be important requirements for future peer education interventions.

Several studies discussed the need for peer educators to be sufficiently similar to their classmates to enhance credibility. However, as identified in existing studies [ 81 ], future research needs to develop more detailed definitions of ‘peerness’ and to understand in what ways peer educators need to be similar to peer learners, for example by defining specific characteristics, which could in turn refine and improve the selection of peer educators to give interventions maximum impact. Furthermore, the ideal age gap between peer educators and peer learners requires further research: studies differed on how similar in age peers should be, and it is possible that some degree of social distinction (e.g. age) is important for credibility.

The role of the friendship groups and closeness to peer educators mechanism is consistent with previous findings from peer-led programs and theoretical models [ 56 , 91 ], as is the need for informal and interactive experiences as a means to teach adolescents the skills they need to avoid unhealthy risk behaviours [ 92 ]. It is likely that there are important links between some of the key mechanisms identified; for example, if peer educators are granted autonomy, this may naturally lead to a more informal, interactive and personalised delivery of lessons and avoid the didactic approaches that were found to be unhelpful [ 27 ].

While multiple studies suggested mechanisms that led to improved health literacy and/or behaviour, many failed to demonstrate this explicitly in their findings, and authors often interpreted quantitative findings using existing theoretical assumptions. Addressing this gap requires more studies that carry out robust process evaluations or use realist approaches explicitly focused on mechanisms of change, so that the pathways to impact on different health outcomes can be better understood.

A key implication of the paper is the need for logic models and theories of change to underpin peer education interventions for health. We found a tendency for papers to discuss peer education mechanisms generically, rather than as they relate to specific health outcomes. For example, a number of key mechanisms were discussed in relation to why peer education works as an approach, without linking this to why it works for health outcomes. Further research is required to understand how applicable and important these more generic peer education mechanisms are when applied to specific health interventions or outcomes. Creating detailed logic models would help address this missing link by more clearly mapping out how specific intervention components lead to particular health outcomes. The importance of logic models and theories of change, and of detailing how interventions are expected to work, has been highlighted in the new Medical Research Council guidance on health interventions [ 93 ]. It is likely that different mechanisms are more or less important for different health outcomes, leading to different peer education approaches (e.g. informal diffusion of knowledge versus formal classroom/taught approaches).

Another interesting finding was that at least one mechanism (simple health messages) appeared to make implementation of the intervention easier but perhaps reduced intervention effectiveness. This example again highlights the need for future studies to more clearly theorise and test the mechanisms being activated and how these relate to effectiveness. One opportunity realist evaluation provides is to identify potential problems or components of interventions that may be unhelpful.

One potential harm raised in this review was the pressure and responsibility placed on peer educators, highlighted particularly through the mechanism of finding a balance between peer educator autonomy and teacher support. Giving peer educators full control may risk inappropriate or potentially harmful content being passed on to peer learners, but it also risks peer educators feeling burdened by peer disclosures or feeling unable to talk about their own health concerns. Pressure on peer educators has been discussed in the wider literature in relation to a number of issues, including peer educators having to deal with personal questions about their own experiences, potential hostility from members of their peer group, reduced confidence when unable to manage difficult situations, frustration when peer expectations are not met, and feeling unable to address their own problems or seek help [ 94 – 97 ]. A further implication of this review is therefore the need to unpick the potential harms of peer education and minimise them through well-thought-out logic models: for example, how can we minimise pressure on peer educators and fully support them to teach their peers, whilst allowing them the level of autonomy and freedom needed to maintain the benefits observed from these mechanisms?

One study included in this review concluded that peer education ‘would seem to be a method in search of theory, rather than the application of theory to practice’ [ 83 ]. Despite a number of papers being identified that evaluated peer education interventions to improve young people’s health [ 38 ], relatively few explicitly focused on mechanisms relating to intervention effectiveness. There is a clear need for future research and intervention evaluations to focus on change mechanisms, for example through the development of logic models that link program activities to anticipated results, a priority mentioned in one included study within this review [ 27 ]. Logic models will in turn provide an increased focus on pathways to outcomes, to better understand how and why peer education interventions can improve adolescent health outcomes. It remains unclear whether there are distinct mechanisms at work in each health area, or whether generic peer education mechanisms apply across all health areas.

Limitations

It is likely that some relevant studies were not picked up by our search terms due to the wide range of definitions of ‘peer education’ and the multiple variations of this term. We did not search grey literature as part of this review, which may also have yielded additional relevant studies. This review only focused on universal peer education targeting whole classes or year groups; there may be further literature on key mechanisms involved in more targeted peer education interventions that focus on specific at-risk groups. Studies that did not meet our inclusion criteria (e.g. wider age ranges, out-of-school settings) may have included information about mechanisms that have not been identified within this review.

A number of studies excluded from this current review focused on the impacts of participating in peer education on peer educators themselves. These papers were excluded as they were focussed on non-health outcomes such as increased empathy, confidence and self-esteem in peer educators, rather than why peer education interventions are effective for improving health. Another implication of this review is the need for a future focus on the impact of being a peer educator on a number of social outcomes, perhaps taking a similar realist approach to better understand this impact, which may also include any risk of harm and the testing of dark logic models [ 98 ].

This review identified 10 key mechanisms of peer education interventions across four key health areas (alcohol, smoking and substance use, sex education, healthy lifestyles and mental health). This review is the first realist-informed study to synthesize key mechanisms of school-based peer education seeking to improve health literacy and health behaviours across all health areas.

To further our understanding of the active components of school-based peer education interventions for health improvement, more process or realist evaluations that focus on mechanisms of change are required. Although several mechanisms were identified in this review, many were drawn from theoretical assumptions that were never tested or evaluated, and no papers were found to hypothesise mechanisms through logic models or context-mechanism-outcome configurations and then test these mechanisms within their evaluations. Future studies of peer education interventions should focus not only on the mechanisms at play, but how these relate to specific health outcomes, as well as the contextual factors that may constrain or help the mechanisms to be activated, and which ultimately impact on the effectiveness of an intervention.

Pre-registration

This review was pre-registered on PROSPERO: CRD42021229192 (accessible here: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=229192 ). One deviation was made from the original protocol: the use of a different quality appraisal tool. Initially we had planned to use the Canadian Effective Public Health Practice Project (EPHPP) Quality Assessment Tool for Quantitative Studies and the Critical Appraisal Skills Programme (CASP) checklist for qualitative studies. The authors instead used a combined mixed methods tool (the Mixed Methods Appraisal Tool; MMAT) for both quantitative and qualitative studies. This was due to the large volume and variation of studies, which meant there were benefits to using a single brief quality check tool across all included studies, allowing us to standardise scores across study types.

Supporting information

S1 Table. PRISMA checklist.

https://doi.org/10.1371/journal.pone.0302431.s001

S2 Table. Full search strategy.

https://doi.org/10.1371/journal.pone.0302431.s002

S3 Table. Overview of included studies.

https://doi.org/10.1371/journal.pone.0302431.s003


  26. The Methodological Underdog: A Review of Quantitative Research in the

    Differences in methodological strengths and weaknesses between quantitative and qualitative research are discussed, followed by a data mining exercise on 1,089 journal articles published in Adult Education Quarterly, Studies in Continuing Education, and International Journal of Lifelong Learning. A categorization of quantitative adult education ...

  27. Mechanisms of school-based peer education interventions to improve

    Introduction Peer education interventions are widely used in secondary schools with an aim to improve students' health literacy and/or health behaviours. Although peer education is a popular intervention technique with some evidence of effectiveness, we know relatively little about the key components that lead to health improvements among young people, or components that may be less helpful ...

  28. Understanding Anxiety Prevalence and Risk Factors Among College

    The International Journal of Indian Psychȯlogy(ISSN 2348-5396) is an interdisciplinary, peer-reviewed, academic journal that examines the intersection of Psychology, Social sciences, Education, and Home science with IJIP. IJIP is an international electronic journal published in quarterly. All peer-reviewed articles must meet rigorous standards and can represent a broad range of substantive ...

  29. Understanding and Evaluating Survey Research

    Survey research is defined as "the collection of information from a sample of individuals through their responses to questions" ( Check & Schutt, 2012, p. 160 ). This type of research allows for a variety of methods to recruit participants, collect data, and utilize various methods of instrumentation. Survey research can use quantitative ...

  30. Distinguishing Between Quantitative and Qualitative Research: A

    American Sociological Review, 21, 683-690. Crossref. ISI. Google Scholar. ... Living within blurry boundaries: The value of distinguishing between qualitative and quantitative research. Journal of Mixed Methods Research, 12(3), 268-279. Crossref. ISI. Google Scholar. Morgan D. L. (2018b). Rebuttal. Journal of ... Social Media Peer Communication ...