

Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021. Revised on October 26, 2023.

A researcher must test the collected data before drawing any conclusions. Every research design needs to address reliability and validity, the two key measures of research quality.

What is Reliability?

Reliability refers to the consistency of a measurement: it shows how trustworthy the scores of a test are. If the collected data show the same results after being tested using various methods and sample groups, the information is reliable. Note, however, that reliability alone does not make results valid; it is a necessary but not sufficient condition for validity.

Example: If you weigh yourself on a weighing scale several times throughout the day, you’ll get the same reading each time. These consistent, repeated measurements are considered reliable.

Example: A teacher gives her students a maths test and repeats it the next week with the same questions. If the students obtain the same scores, the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. Validity shows how a specific test is suitable for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid. 

If the method of measuring is accurate, then it will produce accurate results. A method that is not reliable cannot be valid; however, a reliable method is not automatically valid, since it may be consistently measuring the wrong thing.

Example: Your weighing scale shows a different result each time you weigh yourself within a day, even though you handle it carefully and weigh yourself under the same conditions. Your weighing machine might be malfunctioning. This means your method has low reliability, so you are getting inconsistent results that cannot be valid.

Example: Suppose a questionnaire about the quality of a skincare product is distributed to one group of people and then repeated with several other groups. If the responses are consistent across participants, the questionnaire has high reliability; whether it is also valid depends on whether the questions genuinely capture product quality.

Most of the time, validity is difficult to establish even when the measurement process is reliable, because consistent results alone do not show that they reflect the real situation.

Example: If the weighing scale shows the same result, let’s say 70 kg, each time even though your actual weight is 55 kg, then the weighing scale is malfunctioning. Because it shows consistent results, the scale is reliable; but since those results are inaccurate, the measurement is not valid.

Internal Vs. External Validity

One of the key features of randomised designs is that they typically achieve high internal validity and, when the sample is representative, high external validity as well.

Internal validity  is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and any external factor should not influence the  variables .

Examples of such variables include age, ability level, height, and grade.

External validity  is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.



How to Assess Reliability and Validity

Reliability can be measured by comparing the consistency of a procedure and its results across repeated applications. There are various methods to measure validity and reliability. Reliability can be measured through various statistical methods depending on the type of reliability involved, as explained below:

Types of Reliability

Reliability is commonly assessed in four ways: test-retest, inter-rater, parallel-forms, and internal-consistency reliability, each of which is explained in detail later in this guide.

Types of Validity

As we discussed above, the reliability of a measurement alone cannot determine its validity, and validity is difficult to measure even if the method is reliable. Several types of tests are conducted to measure validity, including face, content, criterion, and construct validity.


How to Increase Reliability?

  • Use an appropriate questionnaire to measure the competency level.
  • Ensure a consistent environment for participants.
  • Make the participants familiar with the criteria of assessment.
  • Train the participants appropriately.
  • Review the research items regularly to identify and replace poorly performing ones.

How to Increase Validity?

Ensuring validity is also not an easy job. The following measures help to ensure validity:

  • Reactivity (participants changing their behaviour because they know they are being studied) should be minimised as a first concern.
  • The Hawthorne effect should be reduced.
  • The respondents should be motivated.
  • The intervals between the pre-test and post-test should not be lengthy.
  • Participant dropout should be minimised.
  • Inter-rater reliability should be ensured.
  • Control and experimental groups should be matched with each other.

How to Implement Reliability and Validity in your Thesis?

Experts recommend explicitly addressing reliability and validity in your work, and these concepts are applied most heavily in theses and dissertations. In practice, this means describing in your methodology chapter how you ensured both, and reporting the relevant reliability and validity checks alongside your results.

Frequently Asked Questions

What is reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test-retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors.
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.



The 4 Types of Reliability in Research | Definitions & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 26 August 2022.

Reliability tells you how consistently a method measures something. When you apply the same method to the same sample under the same conditions, you should get the same results. If not, the method of measurement may be unreliable.

There are four main types of reliability. Each can be estimated by comparing different sets of results produced by the same method.

Table of contents

  • Test-retest reliability
  • Inter-rater reliability
  • Parallel forms reliability
  • Internal consistency
  • Which type of reliability applies to my research?

Test-retest reliability

Test-retest reliability measures the consistency of results when you repeat the same test on the same sample at a different point in time. You use it when you are measuring something that you expect to stay constant in your sample.

Why test-retest reliability is important

Many factors can influence your results at different points in time: for example, respondents might experience different moods, or external conditions might affect their ability to respond accurately.

Test-retest reliability can be used to assess how well a method resists these factors over time. The smaller the difference between the two sets of results, the higher the test-retest reliability.

How to measure test-retest reliability

To measure test-retest reliability, you conduct the same test on the same group of people at two different points in time. Then you calculate the correlation between the two sets of results.
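As a rough illustration, here is a minimal Python sketch of that calculation using SciPy's Pearson correlation; the scores are invented for the example:

```python
# Minimal sketch: test-retest reliability as the Pearson correlation between
# two administrations of the same test to the same participants.
from scipy.stats import pearsonr

time_1 = [24, 31, 18, 27, 35, 22, 29, 33]  # scores at the first administration
time_2 = [26, 30, 17, 29, 34, 21, 28, 35]  # same participants, some weeks later

r, p_value = pearsonr(time_1, time_2)
print(f"Test-retest reliability (Pearson's r): {r:.2f}")  # close to 1 = stable
```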

Improving test-retest reliability

  • When designing tests or questionnaires , try to formulate questions, statements, and tasks in a way that won’t be influenced by the mood or concentration of participants.
  • When planning your methods of data collection , try to minimise the influence of external factors, and make sure all samples are tested under the same conditions.
  • Remember that changes can be expected to occur in the participants over time, and take these into account.


Inter-rater reliability

Inter-rater reliability (also called inter-observer reliability) measures the degree of agreement between different people observing or assessing the same thing. You use it when data is collected by researchers assigning ratings, scores or categories to one or more variables.

Why inter-rater reliability is important

People are subjective, so different observers’ perceptions of situations and phenomena naturally differ. Reliable research aims to minimise subjectivity as much as possible so that a different researcher could replicate the same results.

When designing the scale and criteria for data collection, it’s important to make sure that different people will rate the same variable consistently with minimal bias. This is especially important when there are multiple researchers involved in data collection or analysis.

How to measure inter-rater reliability

To measure inter-rater reliability, different researchers conduct the same measurement or observation on the same sample. Then you calculate the correlation between their different sets of results. If all the researchers give similar ratings, the test has high inter-rater reliability.

Improving inter-rater reliability

  • Clearly define your variables and the methods that will be used to measure them.
  • Develop detailed, objective criteria for how the variables will be rated, counted, or categorised.
  • If multiple researchers are involved, ensure that they all have exactly the same information and training.

Parallel forms reliability

Parallel forms reliability measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.

Why parallel forms reliability is important

If you want to use multiple different versions of a test (for example, to avoid respondents repeating the same answers from memory), you first need to make sure that all the sets of questions or measurements give reliable results.

How to measure parallel forms reliability

The most common way to measure parallel forms reliability is to produce a large set of questions to evaluate the same thing, then divide these randomly into two question sets.

The same group of respondents answers both sets, and you calculate the correlation between the results. High correlation between the two indicates high parallel forms reliability.
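As a sketch of that procedure, assuming a pool of dichotomously scored (0/1) questions and invented response data:

```python
# Minimal sketch: parallel forms reliability from a randomly split question pool.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative item-response matrix: 10 respondents x 20 pool questions (0/1).
responses = rng.integers(0, 2, size=(10, 20))

# Randomly divide the pool into two parallel forms of 10 items each.
item_order = rng.permutation(20)
form_a, form_b = item_order[:10], item_order[10:]

# Total score per respondent on each form, then the correlation between forms.
scores_a = responses[:, form_a].sum(axis=1)
scores_b = responses[:, form_b].sum(axis=1)
r = np.corrcoef(scores_a, scores_b)[0, 1]
print(f"Parallel forms reliability: {r:.2f}")
```

With real items that all measure the same construct, this correlation should be high; the purely random data here will produce a correlation near zero.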

Improving parallel forms reliability

  • Ensure that all questions or test items are based on the same theory and formulated to measure the same thing.

Internal consistency

Internal consistency assesses the correlation between multiple items in a test that are intended to measure the same construct.

You can calculate internal consistency without repeating the test or involving other researchers, so it’s a good way of assessing reliability when you only have one dataset.

Why internal consistency is important

When you devise a set of questions or ratings that will be combined into an overall score, you have to make sure that all of the items really do reflect the same thing. If responses to different items contradict one another, the test might be unreliable.

How to measure internal consistency

Two common methods are used to measure internal consistency.

  • Average inter-item correlation : For a set of measures designed to assess the same construct, you calculate the correlation between the results of all possible pairs of items and then calculate the average.
  • Split-half reliability : You randomly split a set of measures into two sets. After testing the entire set on the respondents, you calculate the correlation between the two sets of responses.
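For instance, the average inter-item correlation can be computed directly from a respondents-by-items score matrix; the sketch below uses invented ratings:

```python
# Minimal sketch: average inter-item correlation as a measure of internal
# consistency. Rows are respondents, columns are items on the same construct.
import numpy as np

item_scores = np.array([
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [1, 2, 1, 1],
    [4, 4, 3, 4],
])

corr = np.corrcoef(item_scores, rowvar=False)            # 4x4 item correlation matrix
off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]  # drop self-correlations
print(f"Average inter-item correlation: {off_diagonal.mean():.2f}")
```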

Improving internal consistency

  • Take care when devising questions or measures: those intended to reflect the same concept should be based on the same theory and carefully formulated.

Which type of reliability applies to my research?

It’s important to consider reliability when planning your research design, collecting and analysing your data, and writing up your research. The type of reliability you should calculate depends on the type of research and your methodology.

If possible and relevant, you should statistically calculate reliability and state this alongside your results.


Reliability In Psychology Research: Definitions & Examples

By Saul Mcleod, PhD (Editor-in-Chief, Simply Psychology), and Olivia Guy-Evans, MSc (Associate Editor)

Reliability in psychology research refers to the reproducibility or consistency of measurements. Specifically, it is the degree to which a measurement instrument or procedure yields the same results on repeated trials. A measure is considered reliable if it produces consistent scores across different instances when the underlying thing being measured has not changed.

Reliability ensures that responses are consistent across times and occasions for instruments like questionnaires . Multiple forms of reliability exist, including test-retest, inter-rater, and internal consistency.

For example, if people weigh themselves during the day, they would expect to see a similar reading. Scales that measured weight differently each time would be of little use.

The same analogy could be applied to a tape measure that measures inches differently each time it is used. It would not be considered reliable.

If findings from research are replicated consistently, they are reliable. A correlation coefficient can be used to assess the degree of reliability. If a test is reliable, it should show a high positive correlation.

Of course, it is unlikely the same results will be obtained each time as participants and situations vary. Still, a strong positive correlation between the same test results indicates reliability.

Reliability is important because unreliable measures introduce random error that attenuates correlations and makes it harder to detect real relationships.

Ensuring high reliability for key measures in psychology research helps boost the sensitivity, validity, and replicability of studies. Estimating and reporting reliable evidence is considered an important methodological practice.

There are two types of reliability: internal and external.
  • Internal reliability refers to how consistently different items within a single test measure the same concept or construct. It ensures that a test is stable across its components.
  • External reliability measures how consistently a test produces similar results over repeated administrations or under different conditions. It ensures that a test is stable over time and situations.
Some key aspects of reliability in psychology research include:
  • Test-retest reliability : The consistency of scores for the same person across two or more separate administrations of the same measurement procedure over time. High test-retest reliability suggests the measure provides a stable, reproducible score.
  • Interrater reliability : The level of agreement in scores on a measure between different raters or observers rating the same target. High interrater reliability suggests the ratings are objective and not overly influenced by rater subjectivity or bias.
  • Internal consistency reliability : The degree to which different test items or parts of an instrument that measure the same construct yield similar results. It is analysed statistically using Cronbach’s alpha; a high value suggests the items measure the same underlying concept.

Test-Retest Reliability

The test-retest method assesses the external consistency of a test. Examples of appropriate tests include questionnaires and psychometric tests. It measures the stability of a test over time.

A typical assessment would involve giving participants the same test on two separate occasions. If the same or similar results are obtained, then external reliability is established.

Here’s how it works:

  • A test or measurement is administered to participants at one point in time.
  • After a certain period, the same test is administered again to the same participants without any intervention or treatment in between.
  • The scores from the two administrations are then correlated using a statistical method, often Pearson’s correlation.
  • A high correlation between the scores from the two test administrations indicates good test-retest reliability, suggesting the test yields consistent results over time.

This method is especially useful for tests that measure stable traits or characteristics that aren’t expected to change over short periods.

The disadvantage of the test-retest method is that it takes a long time for results to be obtained. The reliability can be influenced by the time interval between tests and any events that might affect participants’ responses during this interval.

Beck et al. (1996) studied the responses of 26 outpatients across two separate therapy sessions one week apart and found a correlation of .93, demonstrating high test-retest reliability of the depression inventory.

This illustrates why reliability in psychological research is necessary: if such tests were unreliable, some individuals might not be successfully diagnosed with disorders such as depression and consequently would not be given appropriate therapy.

The timing of the test is important; if the duration is too brief, then participants may recall information from the first test, which could bias the results.

Alternatively, if the duration is too long, it is feasible that the participants could have changed in some important way which could also bias the results.

While the test-retest method assesses the external consistency of a test, inter-rater reliability refers to the degree to which different raters give consistent estimates of the same behavior. Inter-rater reliability can be used for interviews.

Inter-Rater Reliability

Inter-rater reliability, often termed inter-observer reliability, refers to the extent to which different raters or evaluators agree in assessing a particular phenomenon, behavior, or characteristic. It’s a measure of consistency and agreement between individuals scoring or evaluating the same items or behaviors.

High inter-rater reliability indicates that the findings or measurements are consistent across different raters, suggesting the results are not due to random chance or subjective biases of individual raters.

Statistical measures, such as Cohen’s Kappa or the Intraclass Correlation Coefficient (ICC), are often employed to quantify the level of agreement between raters, helping to ensure that findings are objective and reproducible.
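As a small illustration, scikit-learn ships an implementation of Cohen's kappa; the two raters' category labels below are invented:

```python
# Minimal sketch: Cohen's kappa for agreement between two raters, corrected
# for the agreement expected by chance alone.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["aggressive", "calm", "calm", "aggressive", "calm", "aggressive"]
rater_2 = ["aggressive", "calm", "aggressive", "aggressive", "calm", "aggressive"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement beyond chance
```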

Ensuring high inter-rater reliability is essential, especially in studies involving subjective judgment or observations, as it provides confidence that the findings are replicable and not heavily influenced by individual rater biases.

Note it can also be called inter-observer reliability when referring to observational research. Here, researchers observe the same behavior independently (to avoid bias) and compare their data. If the data is similar, then it is reliable.

Where observer scores do not significantly correlate, then reliability can be improved by:

  • Training observers in the observation techniques and ensuring everyone agrees with them.
  • Ensuring behavior categories have been operationalized, meaning that they have been objectively defined.
For example, if two researchers are observing ‘aggressive behavior’ of children at nursery they would both have their own subjective opinion regarding what aggression comprises.

In this scenario, they would be unlikely to record aggressive behavior the same, and the data would be unreliable.

However, if they were to operationalize the behavior category of aggression, this would be more objective and make it easier to identify when a specific behavior occurs.

For example, while “aggressive behavior” is subjective and not operationalized, “pushing” is objective and operationalized. Thus, researchers could count how many times children push each other over a certain duration of time.

Internal Consistency Reliability

Internal consistency reliability refers to how well different items on a test or survey that are intended to measure the same construct produce similar scores.

For example, a questionnaire measuring depression may have multiple questions tapping issues like sadness, changes in sleep and appetite, fatigue, and loss of interest. The assumption is that people’s responses across these different symptom items should be fairly consistent.

Cronbach’s alpha is a common statistic used to quantify internal consistency reliability. It calculates the average inter-item correlations among the test items. Values range from 0 to 1, with higher values indicating greater internal consistency. A good rule of thumb is that alpha should generally be above .70 to suggest adequate reliability.
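A minimal implementation of that statistic, written as a small helper (here called cronbach_alpha, an invented name) and applied to invented questionnaire data:

```python
# Minimal sketch of Cronbach's alpha:
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data: 5 respondents answering 4 symptom items on a 1-5 scale.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 1, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 1, 2, 1],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")  # above .70 = adequate
```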

An alpha of .90 for a depression questionnaire, for example, means there is a high average correlation between respondents’ scores on the different symptom items.

This suggests all the items are measuring the same underlying construct (depression) in a consistent manner. It taps the unidimensionality of the scale – evidence it is measuring one thing.

If some items were unrelated to others, the average inter-item correlations would be lower, resulting in a lower alpha. This would indicate the presence of multiple dimensions in the scale, rather than a unified single concept.

So, in summary, high internal consistency reliability evidenced through high Cronbach’s alpha provides support for the fact that various test items successfully tap into the same latent variable the researcher intends to measure. It suggests the items meaningfully cohere together to reliably measure that construct.

Split-Half Method

The split-half method assesses the internal consistency of a test, such as psychometric tests and questionnaires.

It measures the extent to which all parts of the test contribute equally to what is being measured.

The split-half approach provides another method of quantifying internal consistency by taking advantage of the natural variation when a single test is divided in half.

It’s somewhat cumbersome to implement but avoids limitations associated with Cronbach’s alpha. However, alpha remains much more widely used in practice due to its relative ease of calculation.

  • A test or questionnaire is split into two halves, typically by separating even-numbered items from odd-numbered items, or first-half items vs. second-half.
  • Each half is scored separately, and the scores are correlated using a statistical method, often Pearson’s correlation.
  • The correlation between the two halves gives an indication of the test’s reliability. A higher correlation suggests better reliability.
  • To adjust for the test’s shortened length (because we’ve split it in half), the Spearman-Brown prophecy formula is often applied to estimate the reliability of the full test based on the split-half reliability.
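Putting those steps together, a short Python sketch with simulated item responses (invented for illustration) might look like this:

```python
# Minimal sketch: split-half reliability with the Spearman-Brown correction.
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 8 respondents x 10 items (0/1), driven by a shared "ability"
# factor so that the two halves correlate.
ability = rng.normal(size=(8, 1))
responses = (rng.normal(size=(8, 10)) + ability > 0).astype(int)

# Split into odd- and even-numbered items and score each half.
odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)

# Correlate the halves, then apply the Spearman-Brown prophecy formula
# r_full = 2r / (1 + r) to estimate the reliability of the full-length test.
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)
print(f"Split-half r: {r_half:.2f}, Spearman-Brown corrected: {r_full:.2f}")
```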

The split-half method can also be used to improve a test: for example, any items on separate halves of a test with a low correlation (e.g., r = .25) should either be removed or rewritten.

The split-half method is a quick and easy way to establish reliability. However, it can only be effective with large questionnaires in which all questions measure the same construct. This means it would not be appropriate for tests that measure different constructs.

For example, the Minnesota Multiphasic Personality Inventory has subscales measuring different constructs, such as depression, schizophrenia, and social introversion. Therefore, the split-half method would not be an appropriate way to assess the reliability of this personality test.

Validity vs. Reliability In Psychology

In psychology, validity and reliability are fundamental concepts that assess the quality of measurements.

  • Validity refers to the degree to which a measure accurately assesses the specific concept, trait, or construct that it claims to be assessing. It refers to the truthfulness of the measure.
  • Reliability refers to the overall consistency, stability, and repeatability of a measurement. It is concerned with how much random error might be distorting scores or introducing unwanted “noise” into the data.

A key difference is that validity refers to what’s being measured, while reliability refers to how consistently it’s being measured.

An unreliable measure cannot be truly valid because if a measure gives inconsistent, unpredictable scores, it clearly isn’t measuring the trait or quality it aims to measure in a truthful, systematic manner. Establishing reliability provides the foundation for determining the measure’s validity.

A pivotal understanding is that reliability is a necessary but not sufficient condition for validity.

It means a test can be reliable, consistently producing the same results, without being valid, or accurately measuring the intended attribute.

However, a valid test, one that truly measures what it purports to, must be reliable. In the pursuit of rigorous psychological research, both validity and reliability are indispensable.

Ideally, researchers strive for high scores on both: validity, to make sure you’re measuring the correct construct, and reliability, to make sure you’re measuring it consistently and precisely. The two qualities are independent but both crucial elements of strong measurement procedures.


References

Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for the Beck Depression Inventory. San Antonio, TX: The Psychological Corporation.

Clifton, J. D. W. (2020). Managing validity versus reliability trade-offs in scale-building decisions. Psychological Methods, 25(3), 259–270. https://doi.org/10.1037/met0000236

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Jannarone, R. J., Macera, C. A., & Garrison, C. Z. (1987). Evaluating interrater agreement through “case-control” sampling. Biometrics, 43(2), 433–437. https://doi.org/10.2307/2531825

LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815–852. https://doi.org/10.1177/1094428106296642

Watkins, M. W., & Pacheco, M. (2000). Interobserver agreement in behavioral research: Importance and calculation. Journal of Behavioral Education, 10, 205–212.


Reliability – Types, Examples and Guide


Definition:

Reliability refers to the consistency, dependability, and trustworthiness of a system, process, or measurement to perform its intended function or produce consistent results over time. It is a desirable characteristic in various domains, including engineering, manufacturing, software development, and data analysis.

Reliability In Engineering

In engineering and manufacturing, reliability refers to the ability of a product, equipment, or system to function without failure or breakdown under normal operating conditions for a specified period. A reliable system consistently performs its intended functions, meets performance requirements, and withstands various environmental factors, stress, or wear and tear.

Reliability In Software Development

In software development, reliability relates to the stability and consistency of software applications or systems. A reliable software program operates consistently without crashing, produces accurate results, and handles errors or exceptions gracefully. Reliability is often measured by metrics such as mean time between failures (MTBF) and mean time to repair (MTTR).
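As a back-of-the-envelope illustration of those metrics (the uptime and repair figures are invented; availability = MTBF / (MTBF + MTTR) is the standard steady-state formula):

```python
# Minimal sketch: MTBF, MTTR, and steady-state availability.
uptime_between_failures_h = [120.0, 200.0, 150.0, 180.0]  # hours of operation
repair_times_h = [2.0, 1.5, 3.0, 2.5]                     # hours to repair

mtbf = sum(uptime_between_failures_h) / len(uptime_between_failures_h)
mttr = sum(repair_times_h) / len(repair_times_h)

availability = mtbf / (mtbf + mttr)  # fraction of time the system is up
print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.2f} h, availability: {availability:.3f}")
```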

Reliability In Data Analysis and Statistics

In data analysis and statistics, reliability refers to the consistency and repeatability of measurements or assessments. For example, if a measurement instrument consistently produces similar results when measuring the same quantity or if multiple raters consistently agree on the same assessment, it is considered reliable. Reliability is often assessed using statistical measures such as test-retest reliability, inter-rater reliability, or internal consistency.

Research Reliability

Research reliability refers to the consistency, stability, and repeatability of research findings . It indicates the extent to which a research study produces consistent and dependable results when conducted under similar conditions. In other words, research reliability assesses whether the same results would be obtained if the study were replicated with the same methodology, sample, and context.

What Affects Reliability in Research

Several factors can affect the reliability of research measurements and assessments. Here are some common factors that can impact reliability:

Measurement Error

Measurement error refers to the variability or inconsistency in the measurements that is not due to the construct being measured. It can arise from various sources, such as the limitations of the measurement instrument, environmental factors, or the characteristics of the participants. Measurement error reduces the reliability of the measure by introducing random variability into the data.

Rater/Observer Bias

In studies involving subjective assessments or ratings, the biases or subjective judgments of the raters or observers can affect reliability. If different raters interpret and evaluate the same phenomenon differently, it can lead to inconsistencies in the ratings, resulting in lower inter-rater reliability.

Participant Factors

Characteristics or factors related to the participants themselves can influence reliability. For example, factors such as fatigue, motivation, attention, or mood can introduce variability in responses, affecting the reliability of self-report measures or performance assessments.

Instrumentation

The quality and characteristics of the measurement instrument can impact reliability. If the instrument lacks clarity, has ambiguous items or instructions, or is prone to measurement errors, it can decrease the reliability of the measure. Poorly designed or unreliable instruments can introduce measurement error and decrease the consistency of the measurements.

Sample Size

Sample size can affect reliability, especially in studies where the reliability coefficient is based on correlations or variability within the sample. A larger sample size generally provides more stable estimates of reliability, while smaller samples can yield less precise estimates.

Time Interval

The time interval between test administrations can impact test-retest reliability. If the time interval is too short, participants may recall their previous responses and answer in a similar manner, artificially inflating the reliability coefficient. On the other hand, if the time interval is too long, true changes in the construct being measured may occur, leading to lower test-retest reliability.

Content Sampling

The specific items or questions included in a measure can affect reliability. If the measure does not adequately sample the full range of the construct being measured or if the items are too similar or redundant, it can result in lower internal consistency reliability.

Scoring and Data Handling

Errors in scoring, data entry, or data handling can introduce variability and impact reliability. Inaccurate or inconsistent scoring procedures, data entry mistakes, or mishandling of missing data can affect the reliability of the measurements.

Context and Environment

The context and environment in which measurements are obtained can influence reliability. Factors such as noise, distractions, lighting conditions, or the presence of others can introduce variability and affect the consistency of the measurements.

Types of Reliability

There are several types of reliability that are commonly discussed in research and measurement contexts. Here are some of the main types of reliability:

Test-Retest Reliability

This type of reliability assesses the consistency of a measure over time. It involves administering the same test or measure to the same group of individuals on two separate occasions and then comparing the results. If the scores are similar or highly correlated across the two testing points, it indicates good test-retest reliability.

Inter-Rater Reliability

Inter-rater reliability examines the degree of agreement or consistency between different raters or observers who are assessing the same phenomenon. It is commonly used in subjective evaluations or assessments where judgments are made by multiple individuals. High inter-rater reliability suggests that different observers are likely to reach the same conclusions or make consistent assessments.

Internal Consistency Reliability

Internal consistency reliability assesses the extent to which the items or questions within a measure are consistent with each other. It is commonly measured using techniques such as Cronbach’s alpha. High internal consistency reliability indicates that the items within a measure are measuring the same construct or concept consistently.

Parallel Forms Reliability

Parallel forms reliability assesses the consistency of different versions or forms of a test that are intended to measure the same construct. Two equivalent versions of a test are administered to the same group of individuals, and the scores are compared to determine the level of agreement between the forms.

Split-Half Reliability

Split-half reliability involves splitting a measure into two halves and examining the consistency between the two halves. It can be done by dividing the items into odd-even pairs or by randomly splitting the items. The scores from the two halves are then compared to assess the degree of consistency.

Alternate Forms Reliability

Alternate forms reliability is similar to parallel forms reliability, but it involves administering two different versions of a test to the same group of individuals. The two forms should be equivalent and measure the same construct. The scores from the two forms are then compared to assess the level of agreement.

Applications of Reliability

Reliability has several important applications across various fields and disciplines. Here are some common applications of reliability:

Psychological and Educational Testing

Reliability is crucial in psychological and educational testing to ensure that the scores obtained from assessments are consistent and dependable. It helps to determine the accuracy and stability of measures such as intelligence tests, personality assessments, academic exams, and aptitude tests.

Market Research

In market research, reliability is important for ensuring consistent and dependable data collection. Surveys, questionnaires, and other data collection instruments need to have high reliability to obtain accurate and consistent responses from participants. Reliability analysis helps researchers identify and address any issues that may affect the consistency of the data.

Health and Medical Research

Reliability is essential in health and medical research to ensure that measurements and assessments used in studies are consistent and trustworthy. This includes the reliability of diagnostic tests, patient-reported outcome measures, observational measures, and psychometric scales. High reliability is crucial for making valid inferences and drawing reliable conclusions from research findings.

Quality Control and Manufacturing

Reliability analysis is widely used in industries such as manufacturing and quality control to assess the reliability of products and processes. It helps to identify and address sources of variation and inconsistency, ensuring that products meet the required standards and specifications consistently.

Social Science Research

Reliability plays a vital role in social science research, including fields such as sociology, anthropology, and political science. It is used to assess the consistency of measurement tools, such as surveys or observational protocols, to ensure that the data collected is reliable and can be trusted for analysis and interpretation.

Performance Evaluation

Reliability is important in performance evaluation systems used in organizations and workplaces. Whether it’s assessing employee performance, evaluating the reliability of scoring rubrics, or measuring the consistency of ratings by supervisors, reliability analysis helps ensure fairness and consistency in the evaluation process.

Psychometrics and Scale Development

Reliability analysis is a fundamental step in psychometrics, which involves developing and validating measurement scales. Researchers assess the reliability of items and subscales to ensure that the scale measures the intended construct consistently and accurately.

Examples of Reliability

Here are some examples of reliability in different contexts:

Test-Retest Reliability Example: A researcher administers a personality questionnaire to a group of participants and then administers the same questionnaire to the same participants after a certain period, such as two weeks. The scores obtained from the two administrations are highly correlated, indicating good test-retest reliability.

Inter-Rater Reliability Example: Multiple teachers assess the essays of a group of students using a standardized grading rubric. The ratings assigned by the teachers show a high level of agreement or correlation, indicating good inter-rater reliability.

Internal Consistency Reliability Example: A researcher develops a questionnaire to measure job satisfaction. The researcher administers the questionnaire to a group of employees and calculates Cronbach’s alpha to assess internal consistency. The calculated value of Cronbach’s alpha is high (e.g., above 0.8), indicating good internal consistency reliability.

Parallel Forms Reliability Example: Two versions of a mathematics exam are created, which are designed to measure the same mathematical skills. Both versions of the exam are administered to the same group of students, and the scores from the two versions are highly correlated, indicating good parallel forms reliability.

Split-Half Reliability Example: A researcher develops a survey to measure self-esteem. The survey consists of 20 items, and the researcher randomly divides the items into two halves. The scores obtained from each half of the survey show a high level of agreement or correlation, indicating good split-half reliability.

Alternate Forms Reliability Example: A researcher develops two versions of a language proficiency test, which are designed to measure the same language skills. Both versions of the test are administered to the same group of participants, and the scores from the two versions are highly correlated, indicating good alternate forms reliability.

Where to Write About Reliability in A Thesis

When writing about reliability in a thesis, there are several sections where you can address this topic. Here are some common sections in a thesis where you can discuss reliability:

Introduction :

In the introduction section of your thesis, you can provide an overview of the study and briefly introduce the concept of reliability. Explain why reliability is important in your research field and how it relates to your study objectives.

Theoretical Framework:

If your thesis includes a theoretical framework or a literature review, this is a suitable section to discuss reliability. Provide an overview of the relevant theories, models, or concepts related to reliability in your field. Discuss how other researchers have measured and assessed reliability in similar studies.

Methodology:

The methodology section is crucial for addressing reliability. Describe the research design, data collection methods, and measurement instruments used in your study. Explain how you ensured the reliability of your measurements or data collection procedures. This may involve discussing pilot studies, inter-rater reliability, test-retest reliability, or other techniques used to assess and improve reliability.

Data Analysis:

In the data analysis section, you can discuss the statistical techniques employed to assess the reliability of your data. This might include measures such as Cronbach’s alpha, Cohen’s kappa, or intraclass correlation coefficients (ICC), depending on the nature of your data and research design. Present the results of reliability analyses and interpret their implications for your study.
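For example, one option for the ICC is the pingouin library's intraclass_corr function; the long-format data below (one row per subject-rater pair) is invented for illustration:

```python
# Minimal sketch: intraclass correlation coefficients with pingouin.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4],
    "rater":   ["A", "B"] * 4,
    "score":   [7, 8, 5, 5, 9, 8, 4, 5],
})

icc = pg.intraclass_corr(data=ratings, targets="subject",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])  # several ICC variants; pick the one for your design
```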

Discussion:

In the discussion section, analyze and interpret the reliability results in relation to your research findings and objectives. Discuss any limitations or challenges encountered in establishing or maintaining reliability in your study. Consider the implications of reliability for the validity and generalizability of your results.

Conclusion:

In the conclusion section, summarize the main points discussed in your thesis regarding reliability. Emphasize the importance of reliability in research and highlight any recommendations or suggestions for future studies to enhance reliability.

Importance of Reliability

Reliability is of utmost importance in research, measurement, and various practical applications. Here are some key reasons why reliability is important:

  • Consistency : Reliability ensures consistency in measurements and assessments. Consistent results indicate that the measure or instrument is stable and produces similar outcomes when applied repeatedly. This consistency allows researchers and practitioners to have confidence in the reliability of the data collected and the conclusions drawn from it.
  • Accuracy : Reliability is closely linked to accuracy. A reliable measure produces results that are close to the true value or state of the phenomenon being measured. When a measure is unreliable, it introduces error and uncertainty into the data, which can lead to incorrect interpretations and flawed decision-making.
  • Trustworthiness : Reliability enhances the trustworthiness of measurements and assessments. When a measure is reliable, it indicates that it is dependable and can be trusted to provide consistent and accurate results. This is particularly important in fields where decisions and actions are based on the data collected, such as education, healthcare, and market research.
  • Comparability : Reliability enables meaningful comparisons between different groups, individuals, or time points. When measures are reliable, differences or changes observed can be attributed to true differences in the underlying construct, rather than measurement error. This allows for valid comparisons and evaluations, both within a study and across different studies.
  • Validity : Reliability is a prerequisite for validity. Validity refers to the extent to which a measure or assessment accurately captures the construct it is intended to measure. If a measure is unreliable, it cannot be valid, as it does not consistently reflect the construct of interest. Establishing reliability is an important step in establishing the validity of a measure.
  • Decision-making : Reliability is crucial for making informed decisions based on data. Whether it’s evaluating employee performance, diagnosing medical conditions, or conducting research studies, reliable measurements and assessments provide a solid foundation for decision-making processes. They help to reduce uncertainty and increase confidence in the conclusions drawn from the data.
  • Quality Assurance : Reliability is essential for maintaining quality assurance in various fields. It allows organizations to assess and monitor the consistency and dependability of their processes, products, and services. By ensuring reliability, organizations can identify areas of improvement, address sources of variation, and deliver consistent and high-quality outcomes.

Limitations of Reliability

Here are some limitations of reliability:

  • Limited to consistency: Reliability primarily focuses on the consistency of measurements and findings. However, it does not guarantee the accuracy or validity of the measurements. A measurement can be consistent but still systematically biased or flawed, leading to inaccurate results. Reliability alone cannot address validity concerns.
  • Context-dependent: Reliability can be influenced by the specific context, conditions, or population under study. A measurement or instrument that demonstrates high reliability in one context may not necessarily exhibit the same level of reliability in a different context. Researchers need to consider the specific characteristics and limitations of their study context when interpreting reliability.
  • Inadequate for complex constructs: Reliability is often based on the assumption of unidimensionality, which means that a measurement instrument is designed to capture a single construct. However, many real-world phenomena are complex and multidimensional, making it challenging to assess reliability accurately. Reliability measures may not adequately capture the full complexity of such constructs.
  • Susceptible to systematic errors: Reliability focuses on minimizing random errors, but it may not detect or address systematic errors or biases in measurements. Systematic errors can arise from flaws in the measurement instrument, data collection procedures, or sample selection. Reliability assessments may not fully capture or address these systematic errors, leading to biased or inaccurate results.
  • Relies on assumptions: Reliability assessments often rely on certain assumptions, such as the assumption of measurement invariance or the assumption of stable conditions over time. These assumptions may not always hold true in real-world research settings, particularly when studying dynamic or evolving phenomena. Failure to meet these assumptions can compromise the reliability of the research.
  • Limited to quantitative measures: Reliability is typically applied to quantitative measures and instruments, which can be problematic when studying qualitative or subjective phenomena. Reliability measures may not fully capture the richness and complexity of qualitative data, limiting their applicability in certain research domains.




Validity & Reliability In Research

A Plain-Language Explanation (With Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Kerryn Warren (PhD) | September 2023

Validity and reliability are two related but distinctly different concepts within research. Understanding what they are and how to achieve them is critically important to any research project. In this post, we’ll unpack these two concepts as simply as possible.

This post is based on our popular online course, Research Methodology Bootcamp, in which we unpack the basics of methodology using straightforward language and loads of examples.

Overview: Validity & Reliability

  • The big picture
  • Validity 101
  • Reliability 101 
  • Key takeaways

First, The Basics…

First, let’s start with a big-picture view and then we can zoom in to the finer details.

Validity and reliability are two incredibly important concepts in research, especially within the social sciences. Both validity and reliability have to do with the measurement of variables and/or constructs – for example, job satisfaction, intelligence, productivity, etc. When undertaking research, you’ll often want to measure these types of constructs and variables and, at the simplest level, validity and reliability are about ensuring the quality and accuracy of those measurements .

As you can probably imagine, if your measurements aren’t accurate or there are quality issues at play when you’re collecting your data, your entire study will be at risk. Therefore, validity and reliability are very important concepts to understand (and to get right). So, let’s unpack each of them.


What Is Validity?

In simple terms, validity (also called “construct validity”) is all about whether a research instrument accurately measures what it’s supposed to measure .

For example, let’s say you have a set of Likert scales that are supposed to quantify someone’s level of overall job satisfaction. If this set of scales focused purely on only one dimension of job satisfaction, say pay satisfaction, this would not be a valid measurement, as it only captures one aspect of the multidimensional construct. In other words, pay satisfaction alone is only one contributing factor toward overall job satisfaction, and therefore it’s not a valid way to measure someone’s job satisfaction.


Oftentimes in quantitative studies, the way in which the researcher or survey designer interprets a question or statement can differ from how the study participants interpret it. Given that respondents don’t have the opportunity to ask clarifying questions when taking a survey, it’s easy for these sorts of misunderstandings to crop up. Naturally, if the respondents are interpreting the question in the wrong way, the data they provide will be pretty useless. Therefore, ensuring that a study’s measurement instruments are valid – in other words, that they are measuring what they intend to measure – is incredibly important.

There are various types of validity and we’re not going to go down that rabbit hole in this post, but it’s worth quickly highlighting the importance of making sure that your research instrument is tightly aligned with the theoretical construct you’re trying to measure. In other words, you need to pay careful attention to how the key theories within your study define the thing you’re trying to measure – and then make sure that your survey presents it in the same way.

For example, sticking with the “job satisfaction” construct we looked at earlier, you’d need to clearly define what you mean by job satisfaction within your study (and this definition would of course need to be underpinned by the relevant theory). You’d then need to make sure that your chosen definition is reflected in the types of questions or scales you’re using in your survey. Simply put, you need to make sure that your survey respondents are perceiving your key constructs in the same way you are. Or, even if they’re not, that your measurement instrument is capturing the necessary information that reflects your definition of the construct at hand.



What Is Reliability?

As with validity, reliability is an attribute of a measurement instrument – for example, a survey, a weight scale or even a blood pressure monitor. But while validity is concerned with whether the instrument is measuring the “thing” it’s supposed to be measuring, reliability is concerned with consistency and stability. In other words, reliability reflects the degree to which a measurement instrument produces consistent results when applied repeatedly to the same phenomenon, under the same conditions.

As you can probably imagine, a measurement instrument that achieves a high level of consistency is naturally more dependable (or reliable) than one that doesn’t – in other words, it can be trusted to provide consistent measurements. And that, of course, is what you want when undertaking empirical research. If you think about it within a more domestic context, just imagine if you found that your bathroom scale gave you a different number every time you hopped on and off of it – you wouldn’t feel too confident in its ability to measure the variable that is your body weight 🙂

It’s worth mentioning that reliability also extends to the person using the measurement instrument . For example, if two researchers use the same instrument (let’s say a measuring tape) and they get different measurements, there’s likely an issue in terms of how one (or both) of them are using the measuring tape. So, when you think about reliability, consider both the instrument and the researcher as part of the equation.

As with validity, there are various types of reliability and various tests that can be used to assess the reliability of an instrument. A popular one that you’ll likely come across for survey instruments is Cronbach’s alpha, which is a statistical measure that quantifies the degree to which items within an instrument (for example, a set of Likert scales) measure the same underlying construct. In other words, Cronbach’s alpha indicates how closely related the items are and whether they consistently capture the same concept.
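To make this a little more concrete, here’s a minimal Python sketch (ours, not from the original post, with invented Likert responses) of how Cronbach’s alpha is commonly computed from a respondents-by-items score matrix:

    import numpy as np

    def cronbach_alpha(item_scores):
        # item_scores: one row per respondent, one column per Likert item
        x = np.asarray(item_scores, dtype=float)
        k = x.shape[1]                              # number of items
        item_variances = x.var(axis=0, ddof=1)      # variance of each item
        total_variance = x.sum(axis=1).var(ddof=1)  # variance of summed scores
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Invented data: five respondents answering three 5-point Likert items
    scores = [[4, 5, 4],
              [2, 2, 3],
              [5, 4, 5],
              [3, 3, 3],
              [4, 4, 5]]
    print(round(cronbach_alpha(scores), 2))  # values closer to 1 indicate more
                                             # internally consistent items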


Recap: Key Takeaways

Alright, let’s quickly recap to cement your understanding of validity and reliability:

  • Validity is concerned with whether an instrument (e.g., a set of Likert scales) is measuring what it’s supposed to measure
  • Reliability is concerned with whether that measurement is consistent and stable when measuring the same phenomenon under the same conditions.

In short, validity and reliability are both essential to ensuring that your data collection efforts deliver high-quality, accurate data that help you answer your research questions. So, be sure to always pay careful attention to the validity and reliability of your measurement instruments when collecting and analysing data. As the adage goes, “rubbish in, rubbish out” – make sure that your data inputs are rock-solid.


Chapter 5: Psychological Measurement

Reliability and Validity of Measurement

Learning Objectives

  • Define reliability, including the different types and how they are assessed.
  • Define validity, including the different types and how they are assessed.
  • Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply  assume  that their measures work. Instead, they collect data to demonstrate  that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability  refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time.  Test-retest reliability  is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r. Figure 5.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

[Figure 5.2: Scatterplot with score at time 1 on the x-axis and score at time 2 on the y-axis, showing fairly consistent scores.]
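As an illustration of the procedure (this sketch is ours, not the chapter’s, and the scores are invented), the test-retest correlation can be computed in a few lines of Python:

    import numpy as np

    # Invented self-esteem totals for the same ten people, a week apart
    time1 = np.array([22, 25, 18, 30, 27, 19, 24, 28, 21, 26])
    time2 = np.array([23, 24, 17, 29, 28, 20, 23, 27, 22, 25])

    r = np.corrcoef(time1, time2)[0, 1]  # Pearson's r between the two occasions
    print(round(r, 2))  # +.80 or greater is usually read as good reliability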

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.3 shows the split-half correlation between several university students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s r for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.

[Figure 5.3: Scatterplot with score on even-numbered items on the x-axis and score on odd-numbered items on the y-axis, showing fairly consistent scores.]
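The split-half computation can be sketched the same way (again ours, using simulated data in which all ten items reflect a single underlying construct):

    import numpy as np

    rng = np.random.default_rng(0)
    trait = rng.normal(size=50)  # each person's underlying level of the construct
    items = trait[:, None] + rng.normal(scale=0.8, size=(50, 10))  # 10 noisy items

    odd_total = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ... (odd-numbered)
    even_total = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ... (even-numbered)

    r = np.corrcoef(odd_total, even_total)[0, 1]
    print(round(r, 2))  # +.80 or greater is generally considered good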

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called  Cronbach’s α  (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
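That interpretation can be checked numerically. The sketch below (ours, not the textbook’s) computes α from its usual variance formula and then averages the 252 split-half coefficients; using the Flanagan-Rulon split-half formula, the two quantities agree exactly (each unordered split is simply counted twice among the 252 combinations, which leaves the mean unchanged):

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    trait = rng.normal(size=100)
    items = trait[:, None] + rng.normal(size=(100, 10))  # 100 people, 10 items

    k = items.shape[1]
    total_var = items.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum() / total_var)

    # Flanagan-Rulon coefficient for each of the 252 ways to pick one half
    split_coefs = []
    for half in itertools.combinations(range(k), k // 2):
        other = [i for i in range(k) if i not in half]
        a = items[:, list(half)].sum(axis=1)
        b = items[:, other].sum(axis=1)
        split_coefs.append(2 * (1 - (a.var(ddof=1) + b.var(ddof=1)) / total_var))

    print(round(alpha, 4), round(float(np.mean(split_coefs)), 4))  # identical
                                                                   # (up to rounding)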

Interrater Reliability

Many behavioural measures involve significant judgment on the part of an observer or a rater.  Inter-rater reliability  is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Interrater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
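For categorical judgments, Cohen’s κ can be sketched directly from its definition (our example, with invented codings from two observers):

    import numpy as np

    # Two observers' categorical codings of the same ten video clips (invented)
    rater1 = np.array(["aggressive", "calm", "calm", "aggressive", "calm",
                       "aggressive", "calm", "calm", "aggressive", "calm"])
    rater2 = np.array(["aggressive", "calm", "aggressive", "aggressive", "calm",
                       "aggressive", "calm", "calm", "calm", "calm"])

    p_observed = np.mean(rater1 == rater2)  # raw proportion of agreement
    labels = np.union1d(rater1, rater2)
    p_expected = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in labels)

    kappa = (p_observed - p_expected) / (1 - p_expected)  # Cohen's kappa
    print(round(kappa, 2))  # 1 = perfect agreement, 0 = chance-level agreement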

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimetre longer than another’s would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider three basic kinds: face validity, content validity, and criterion validity.

Face Validity

Face validity  is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.

Content Validity

Content validity  is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
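A small simulated sketch (ours, not the chapter’s) shows what criterion-validity evidence looks like in practice: scores on a hypothetical new test-anxiety measure are correlated with a concurrent criterion (exam performance) and a future criterion (later course grades):

    import numpy as np

    rng = np.random.default_rng(2)
    anxiety = rng.normal(50, 10, size=200)  # hypothetical new test-anxiety scores

    # Simulated criteria: higher anxiety pulls both outcomes down
    exam = 80 - 0.4 * anxiety + rng.normal(0, 6, size=200)        # concurrent
    grades = 3.5 - 0.02 * anxiety + rng.normal(0, 0.3, size=200)  # predictive

    print(round(np.corrcoef(anxiety, exam)[0, 1], 2))    # expected: clearly negative
    print(round(np.corrcoef(anxiety, grades)[0, 1], 2))  # expected: clearly negative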

Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. This is known as convergent validity.

Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982) [1]. In a series of studies, they showed that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009) [2].

Discriminant Validity

Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.
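Convergent and discriminant validity can be sketched together (our simulation, not the chapter’s data): a new self-esteem measure should correlate strongly with an existing measure of the same construct and only weakly with a measure of mood:

    import numpy as np

    rng = np.random.default_rng(3)
    self_esteem = rng.normal(size=150)  # each person's true self-esteem

    new_measure = self_esteem + rng.normal(scale=0.5, size=150)
    old_measure = self_esteem + rng.normal(scale=0.5, size=150)  # same construct
    mood = rng.normal(size=150)                                  # distinct construct

    print(round(np.corrcoef(new_measure, old_measure)[0, 1], 2))  # high: convergent
    print(round(np.corrcoef(new_measure, mood)[0, 1], 2))         # near 0: discriminant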

Key Takeaways

  • Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
Exercises

  • Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s r too if you know how.
  • Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?

References

  • Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116–131.
  • Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behaviour (pp. 318–329). New York, NY: Guilford Press.

Glossary

  • Reliability: The consistency of a measure.
  • Test-retest reliability: The consistency of a measure over time.
  • Test-retest correlation: The consistency of a measure on the same group of people at different times.
  • Internal consistency: The consistency of people’s responses across the items on a multiple-item measure.
  • Split-half correlation: A method of assessing internal consistency by splitting the items into two sets and examining the relationship between them.
  • Cronbach’s α: A statistic in which α is the mean of all possible split-half correlations for a set of items.
  • Inter-rater reliability: The extent to which different observers are consistent in their judgments.
  • Validity: The extent to which the scores from a measure represent the variable they are intended to.
  • Face validity: The extent to which a measurement method appears to measure the construct of interest.
  • Content validity: The extent to which a measure “covers” the construct of interest.
  • Criterion validity: The extent to which people’s scores on a measure are correlated with other variables that one would expect them to be correlated with.
  • Criteria: In reference to criterion validity, the variables one would expect to be correlated with the measure.
  • Concurrent validity: When the criterion is measured at the same time as the construct.
  • Predictive validity: When the criterion is measured at some point in the future (after the construct has been measured).
  • Convergent validity: When new measures positively correlate with existing measures of the same constructs.
  • Discriminant validity: The extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Understanding Reliability and Validity

These related research issues ask us to consider whether we are studying what we think we are studying and whether the measures we use are consistent.

Reliability

Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials. Without the agreement of independent observers able to replicate research procedures, or the ability to use research tools and procedures that yield consistent measurements, researchers would be unable to satisfactorily draw conclusions, formulate theories, or make claims about the generalizability of their research. In addition to its important role in research, reliability is critical for many parts of our lives, including manufacturing, medicine, and sports.

Reliability is such an important concept that it has been defined in terms of its application to a wide range of activities. For researchers, four key types of reliability are:

Equivalency Reliability

Equivalency reliability is the extent to which two items measure identical concepts at an identical level of difficulty. Equivalency reliability is determined by relating two sets of test scores to one another to highlight the degree of relationship or association. In quantitative studies and particularly in experimental studies, a correlation coefficient, statistically referred to as r , is used to show the strength of the correlation between a dependent variable (the subject under study), and one or more independent variables , which are manipulated to determine effects on the dependent variable. An important consideration is that equivalency reliability is concerned with correlational, not causal, relationships.

For example, a researcher studying university English students happened to notice that when some students were studying for finals, their holiday shopping began. Intrigued by this, the researcher attempted to observe how often, or to what degree, these two behaviors co-occurred throughout the academic year. The researcher used the results of the observations to assess the correlation between studying throughout the academic year and shopping for gifts. The researcher concluded there was poor equivalency reliability between the two actions. In other words, studying was not a reliable predictor of shopping for gifts.

Stability Reliability

Stability reliability (sometimes called test-retest reliability) is the agreement of measuring instruments over time. To determine stability, a measure or test is repeated on the same subjects at a future date. Results are compared and correlated with the initial test to give a measure of stability.

An example of stability reliability would be the method of maintaining weights used by the U.S. Bureau of Standards. Platinum objects of fixed weight (one kilogram, one pound, etc.) are kept locked away. Once a year they are taken out and weighed, allowing scales to be reset so they are "weighing" accurately. Keeping track of how much the scales are off from year to year establishes a stability reliability for these instruments. In this instance, the platinum weights themselves are assumed to be perfectly stable.

Internal Consistency

Internal consistency is the extent to which tests or procedures assess the same characteristic, skill or quality. It is a measure of the precision between the observers or of the measuring instruments used in a study. This type of reliability often helps researchers interpret data and predict the value of scores and the limits of the relationship among variables.

For example, a researcher designs a questionnaire to find out about college students' dissatisfaction with a particular textbook. Analyzing the internal consistency of the survey items dealing with dissatisfaction will reveal the extent to which items on the questionnaire focus on the notion of dissatisfaction.

Interrater Reliability

Interrater reliability is the extent to which two or more individuals (coders or raters) agree. Interrater reliability addresses the consistency of the implementation of a rating system.

A test of interrater reliability would be the following scenario: Two or more researchers are observing a high school classroom. The class is discussing a movie that they have just viewed as a group. The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the students' oral responses. Interrater reliability assesses the consistency of how the rating system is implemented. For example, if one researcher gives a "1" to a student response while another researcher gives a "5," the interrater reliability would obviously be inconsistent. Interrater reliability is dependent upon the ability of two or more individuals to be consistent. Training, education and monitoring skills can enhance interrater reliability.

Related Information: Reliability Example

An example of the importance of reliability is the use of measuring devices in Olympic track and field events. For the vast majority of people, ordinary measuring rulers and their degree of accuracy are reliable enough. However, for an Olympic event, such as the discus throw, the slightest variation in a measuring device -- whether it is a tape, clock, or other device -- could mean the difference between the gold and silver medals. Additionally, it could mean the difference between a new world record and outright failure to qualify for an event. Olympic measuring devices, then, must be reliable from one throw or race to another and from one competition to another. They must also be reliable when used in different parts of the world, as temperature, air pressure, humidity, interpretation, or other variables might affect their readings.

Validity

Validity refers to the degree to which a study accurately reflects or assesses the specific concept that the researcher is attempting to measure. While reliability is concerned with the consistency of the actual measuring instrument or procedure, validity is concerned with the study's success at measuring what the researchers set out to measure.

Researchers should be concerned with both external and internal validity. External validity refers to the extent to which the results of a study are generalizable or transferable. (Most discussions of external validity focus solely on generalizability; see Campbell and Stanley, 1966. We include a reference here to transferability because many qualitative research studies are not designed to be generalized.)

Internal validity refers to (1) the rigor with which the study was conducted (e.g., the study's design, the care taken to conduct measurements, and decisions concerning what was and wasn't measured) and (2) the extent to which the designers of a study have taken into account alternative explanations for any causal relationships they explore (Huitt, 1998). In studies that do not explore causal relationships, only the first of these definitions should be considered when assessing internal validity.

Scholars discuss several types of internal validity; brief discussions of several of them follow:

Face Validity

Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably? Unlike content validity, face validity does not depend on established theories for support (Fink, 1995).

Criterion Related Validity

Criterion related validity, also referred to as instrumental validity, is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure which has been demonstrated to be valid.

For example, imagine a hands-on driving test has been shown to be an accurate test of driving skills. By comparing scores on a written driving test with scores from the hands-on test, researchers can validate the written test using a criterion-related strategy, with the proven hands-on test serving as the criterion.

Construct Validity

Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to "define" intelligence in order to reach an acceptable level of construct validity.

Construct validity can be broken down into two sub-categories: convergent validity and discriminant validity. Convergent validity is the general agreement among ratings, gathered independently of one another, where the measures should be theoretically related. Discriminant validity is the lack of a relationship among measures which theoretically should not be related.

To understand whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested (Carmines & Zeller, 1991, p. 23).

Content Validity

Content Validity is based on the extent to which a measurement reflects the specific intended domain of content (Carmines & Zeller, 1991, p.20).

Content validity is illustrated using the following examples: Researchers aim to study mathematical learning and create a survey to test for mathematical skill. If these researchers only tested for multiplication and then drew conclusions from that survey, their study would not show content validity because it excludes other mathematical functions. Although the establishment of content validity for placement-type exams seems relatively straightforward, the process becomes more complex as it moves into the more abstract domain of socio-cultural studies. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.

Related Information: Validity Example

Many recreational activities of high school students involve driving cars. A researcher, wanting to measure whether recreational activities have a negative effect on grade point average in high school students, might conduct a survey asking how many students drive to school and then attempt to find a correlation between these two factors. Because many students might use their cars for purposes other than or in addition to recreation (e.g., driving to work after school, driving to school rather than walking or taking a bus), this research study might prove invalid. Even if a strong correlation was found between driving and grade point average, driving to school in and of itself would seem to be an invalid measure of recreational activity.

The challenges of achieving reliability and validity are among the most difficult faced by researchers. In this section, we offer commentaries on these challenges.

Difficulties of Achieving Reliability

It is important to understand some of the problems concerning reliability which might arise. It would be ideal to reliably measure, every time, exactly those things which we intend to measure. However, researchers can go to great lengths and make every attempt to ensure accuracy in their studies, and still deal with the inherent difficulties of measuring particular events or behaviors. Sometimes, and particularly in studies of natural settings, the only measuring device available is the researcher's own observations of human interaction or human reaction to varying stimuli. As these methods are ultimately subjective in nature, results may be unreliable and multiple interpretations are possible. Three of these inherent difficulties are quixotic reliability, diachronic reliability and synchronic reliability.

Quixotic reliability refers to the situation where a single manner of observation consistently, yet erroneously, yields the same result. It is often a problem when research appears to be going well. This consistency might seem to suggest that the experiment was demonstrating perfect stability reliability. This, however, would not be the case.

For example, if a measuring device used in an Olympic competition always read 100 meters for every discus throw, this would be an example of an instrument consistently, yet erroneously, yielding the same result. However, quixotic reliability is often more subtle in its occurrences than this. For example, suppose a group of German researchers doing an ethnographic study of American attitudes ask questions and record responses. Parts of their study might produce responses which seem reliable, yet turn out to measure felicitous verbal embellishments required for "correct" social behavior. Asking Americans, "How are you?" for example, would in most cases, elicit the token, "Fine, thanks." However, this response would not accurately represent the mental or physical state of the respondents.

Diachronic reliability refers to the stability of observations over time. It is similar to stability reliability in that it deals with time. While this type of reliability is appropriate to assess features that remain relatively unchanged over time, such as landscape benchmarks or buildings, the same level of reliability is more difficult to achieve with socio-cultural phenomena.

For example, in a follow-up study one year later of reading comprehension in a specific group of school children, diachronic reliability would be hard to achieve. If the test were given to the same subjects a year later, many confounding variables would have impacted the researchers' ability to reproduce the same circumstances present at the first test. The final results would almost assuredly not reflect the degree of stability sought by the researchers.

Synchronic reliability refers to the similarity of observations within the same time frame; it is not about the similarity of things observed. Synchronic reliability, unlike diachronic reliability, rarely involves observations of identical things. Rather, it concerns itself with particularities of interest to the research.

For example, a researcher studies the actions of a duck's wing in flight and the actions of a hummingbird's wing in flight. Despite the fact that the researcher is studying two distinctly different kinds of wings, the action of the wings and the phenomenon produced are the same.

Comments on a Flawed, Yet Influential Study

An example of the dangers of generalizing from research that is inconsistent, invalid, unreliable, and incomplete is found in the Time magazine article, "On A Screen Near You: Cyberporn" (De Witt, 1995). This article relies on a study done at Carnegie Mellon University to determine the extent and implications of online pornography. Inherent to the study are methodological problems of unqualified hypotheses and conclusions, unsupported generalizations and a lack of peer review.

Ignoring the functional problems that manifest themselves later in the study, it seems that there are a number of ethical problems within the article. The article claims to be an exhaustive study of pornography on the Internet; in reality, it was anything but exhaustive and resembles a case study more than anything else. Marty Rimm, author of the undergraduate paper that Time used as a basis for the article, claims the paper was an "exhaustive study" of online pornography when, in fact, the study based most of its conclusions about pornography on the Internet on the "descriptions of slightly more than 4,000 images" (Meeks, 1995, p. 1). Some USENET groups see hundreds of postings in a day.

Considering the thousands of USENET groups, 4,000 images no longer carries the authoritative weight that its author intended. The real problem is that the study (an undergraduate paper similar to a second-semester composition assignment) was based not on pornographic images themselves, but on the descriptions of those images. This kind of reduction detracts significantly from the integrity of the final claims made by the author. In fact, this kind of research is comparable to studying the content of pornographic movies based on the titles of the movies, then making sociological generalizations based on what those titles indicate. (This is obviously a problem with a number of types of validity, because Rimm is not studying what he thinks he is studying, but instead something quite different.)

The author of the Time article, Philip Elmer De Witt, writes, "The research team at CMU has undertaken the first systematic study of pornography on the Information Superhighway" (Godwin, 1995, p. 1). His statement is problematic in at least three ways. First, the research team actually consisted of a few of Rimm's undergraduate friends with no methodological training whatsoever. Additionally, no mention of the degree of interrater reliability is made. Second, this systematic study is actually merely a "non-randomly selected subset of commercial bulletin-board systems that focus on selling porn" (Godwin, p. 6). As pornography vending is just a small part of the use of pornography on the Internet as a whole, the entire premise of this study's content validity is firmly called into question. Finally, the use of the term "Information Superhighway" is a false assessment of what in actuality is only a few USENET groups and BBSs (Bulletin Board Systems), which make up only a small fraction of the entire "Information Superhighway" traffic. Essentially, what we have here is yet another violation of content validity.

De Witt is quoted as saying: "In an 18-month study, the team surveyed 917,410 sexually-explicit pictures, descriptions, short-stories and film clips. On those USENET newsgroups where digitized images are stored, 83.5 percent of the pictures were pornographic" (De Witt, p. 40).

Statistically, some interesting contradictions arise. The figure 917,410 was taken from adult-oriented BBSs--none came from actual USENET groups or the Internet itself. This is a glaring discrepancy. Out of the 917,410 files, 212,114 are only descriptions (Hoffman & Novak, 1995, p.2). The question is, how many actual images did the "researchers" see?

"Between April and July 1994, the research team downloaded all available images (3,254)...the team encountered technical difficulties with 13 percent of these images...This left a total of 2,830 images for analysis" (p. 2). This means that out of 917,410 files discussed in this study, 914,580 of them were not even pictures! As for the 83.5 percent figure, this is actually based on "17 alt.binaries groups that Rimm considered pornographic" (p. 2).

In real terms, 17 USENET groups is a fraction of a percent of all USENET groups available. Worse yet, Time claimed that "...only about 3 percent of all messages on the USENET [represent pornographic material], while the USENET itself represents 11.5 percent of the traffic on the Internet" (De Witt, p. 40).

Time neglected to carry the interpretation of this data out to its logical conclusion, which is that less than half of 1 percent (3 percent of 11 percent) of the images on the Internet are associated with newsgroups that contain pornographic imagery. Furthermore, of this half percent, an unknown but even smaller percentage of the messages in newsgroups that are 'associated with pornographic imagery', actually contained pornographic material (Hoffman & Novak, p. 3).

Another blunder can be seen in the avoidance of peer review, which suggests that political interests were being served in having the study become a Time cover story. Marty Rimm contracted with the Georgetown Law Review and Time in an agreement to publish his study as long as they kept it under lock and key. During the months before publication, many interested scholars and professionals tried in vain to obtain a copy of the study in order to check it for flaws. De Witt justified the lack of peer review, and the reliability and validity of the study, on the grounds that the Georgetown Law Review had accepted it and that it therefore needed no further scrutiny. What he didn't know was that law reviews are not edited by professionals, but by "third year law students" (Godwin, p. 4).

There are many consequences of the failure to subject such a study to the scrutiny of peer review. If it was Rimm's desire to publish an article about online pornography in a manner that legitimized it while escaping the kind of critical review the piece would have undergone in a scholarly journal of computer science, engineering, marketing, psychology, or communications, what better venue than a law journal? A law journal article would have the added advantage of being taken seriously by law professors, lawyers, and legally trained policymakers. By virtue of where it appeared, it would automatically be catapulted into the center of the policy debate surrounding online censorship and freedom of speech (Godwin).

Herein lies the dangerous implication of such a study: Because the questions surrounding pornography are of such immediate political concern, the study was placed in the forefront of the U.S. domestic policy debate over censorship on the Internet, (an integral aspect of current anti-First Amendment legislation) with little regard for its validity or reliability.

On June 26, the day the article came out, Senator Grassley (co-sponsor of the anti-porn bill, along with Senator Dole) began drafting a speech that was to be delivered that very day in the Senate, using the study as evidence. At the very same time, Mike Godwin posted on the WELL (Whole Earth 'Lectronic Link, a forum for professionals on the Internet) what turned out to be the overstatement of the year: "Philip's story is an utter disaster, and it will damage the debate about this issue because we will have to spend lots of time correcting misunderstandings that are directly attributable to the story" (Meeks, p. 7).

As Godwin was writing this, Senator Grassley was speaking to the Senate: "Mr. President, I want to repeat that: 83.5 percent of the 900,000 images reviewed--these are all on the Internet--are pornographic, according to the Carnegie-Mellon study" (p. 7). Several days later, Senator Dole was waving the magazine in front of the Senate like a battle flag.

Donna Hoffman, professor at Vanderbilt University, summed up the dangerous political implications by saying, "The critically important national debate over First Amendment rights and restrictions of information on the Internet and other emerging media requires facts and informed opinion, not hysteria" (p.1).

In addition to the hysteria, Hoffman sees a plethora of other problems with the study. "Because the content analysis and classification scheme are 'black boxes,'" Hoffman said, "because no reliability and validity results are presented, because no statistical testing of the differences both within and among categories for different types of listings has been performed, and because not a single hypothesis has been tested, formally or otherwise, no conclusions should be drawn until the issues raised in this critique are resolved" (p. 4).

However, the damage has already been done. This questionable research by an undergraduate engineering major has been generalized to such an extent that even the U.S. Senate, and in particular Senators Grassley and Dole, have been duped, albeit through the strength of their own desires to see only what they wanted to see.

Annotated Bibliography

American Psychological Association. (1985). Standards for educational and psychological testing. Washington, DC: Author.

This work focuses on reliability, validity, and the standards that testers need to achieve in order to ensure accuracy.

Babbie, E. R. & Huitt, R. E. (1979). The practice of social research (2nd ed.). Belmont, CA: Wadsworth Publishing.

An overview of social research and its applications.

Beauchamp, T. L., Faden, R. R., Wallace, Jr., R. J. & Walters, L. (1982). Ethical issues in social science research. Baltimore and London: The Johns Hopkins University Press.

A systematic overview of ethical issues in Social Science Research written by researchers with firsthand familiarity with the situations and problems researchers face in their work. This book raises several questions of how reliability and validity can be affected by ethics.

Borman, K. M., et al. (1986). Ethnographic and qualitative research design and why it doesn't work. American Behavioral Scientist, 30, 42–57.

The authors pose questions concerning threats to qualitative research and suggest solutions.

Bowen, K. A. (1996, Oct. 12). The sin of omission -punishable by death to internal validity: An argument for integration of quantitative research methods to strengthen internal validity. Available: http://trochim.human.cornell.edu/gallery/bowen/hss691.htm

An entire Web site that examines the merits of integrating qualitative and quantitative research methodologies through triangulation. The author argues that improving the internal validity of social science will be the result of such a union.

Brinberg, D. & McGrath, J.E. (1985). Validity and the research process . Beverly Hills: Sage Publications.

The authors investigate validity as value and propose the Validity Network Schema, a process by which researchers can infuse validity into their research.

Bussières, J-F. (1996, Oct.12). Reliability and validity of information provided by museum Web sites. Available: http://www.oise.on.ca/~jfbussieres/issue.html

This Web page examines the validity of museum Web sites and, by extension, questions the validity of Web-based resources in general. It argues that all Web sites should be examined with skepticism about the validity of the information contained within them.

Campbell, D. T. & Stanley, J.C. (1963). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.

An overview of experimental research that includes pre-experimental designs, controls for internal validity, and tables listing sources of invalidity in quasi-experimental designs. Reference list and examples.

Carmines, E. G. & Zeller, R.A. (1991). Reliability and validity assessment . Newbury Park: Sage Publications.

An introduction to research methodology that includes classical test theory, validity, and methods of assessing reliability.

Carroll, K. M. (1995). Methodological issues and problems in the assessment of substance use. Psychological Assessment, 7(3), 349–358.

Discusses methodological issues in research involving the assessment of substance abuse. Introduces strategies for avoiding problems with the reliability and validity of methods.

Connelly, F. M. & Clandinin, D. J. (1990). Stories of experience and narrative inquiry. Educational Researcher, 19(5), 2–12.

A survey of narrative inquiry that outlines criteria, methods, and writing forms. It includes a discussion of risks and dangers in narrative studies, as well as a research agenda for curricula and classroom studies.

De Witt, P.E. (1995, July 3). On a screen near you: Cyberporn. Time, 38-45.

The Time cover story reporting the Carnegie Mellon study of online pornography by Marty Rimm, an undergraduate electrical engineering student; its claims are critiqued above.

Fink, A., ed. (1995). The survey handbook (Vol. 1). Thousand Oaks, CA: Sage.

A guide to surveys; the first in a series referred to as the "survey kit". It includes bibliographical references and addresses survey design, analysis, reporting, and how to measure the validity and reliability of surveys.

Fink, A., ed. (1995). How to measure survey reliability and validity (Vol. 7). Thousand Oaks, CA: Sage.

This volume shows how to select and apply reliability and validity criteria; the fundamental principles of scaling and scoring are also considered.

Godwin, M. (1995, July). JournoPorn, dissection of the Time article. Available: http://www.hotwired.com

A detailed critique of Time magazine's Cyberporn , outlining flaws of methodology as well as exploring the underlying assumptions of the article.

Hambleton, R.K. & Zaal, J.N., eds. (1991). Advances in educational and psychological testing . Boston: Kluwer Academic.

Information on the concepts of reliability and validity in psychology and education.

Harnish, D.L. (1992). Human judgment and the logic of evidence: A critical examination of research methods in special education transition literature . In D.L. Harnish et al. eds., Selected readings in transition.

This article investigates threats to validity in special education research.

Haynes, N. M. (1995). How skewed is 'the bell curve'? Book Product Reviews . 1-24.

This paper claims that R.J. Herrnstein and C. Murray's The Bell Curve: Intelligence and Class Structure in American Life does not have scientific merit and claims that the bell curve is an unreliable measure of intelligence.

Healey, J. F. (1993). Statistics: A tool for social research, 3rd ed . Belmont: Wadsworth Publishing.

Inferential statistics, measures of association, and multivariate techniques in statistical analysis for social scientists are addressed.

Helberg, C. (1996, Oct.12). Pitfalls of data analysis (or how to avoid lies and damned lies). Available: http//maddog/fammed.wisc.edu/pitfalls/

A discussion of things researchers often overlook in their data analysis and of how statistics are often used to skew reliability and validity for the researcher's purposes.

Hoffman, D. L. and Novak, T.P. (1995, July). A detailed critique of the Time article: Cyberporn. Available: http://www.hotwired.com

A methodological critique of the Time article that uncovers some of the fundamental flaws in the statistics and the conclusions made by De Witt.

Huitt, W. G. (1998). Internal and external validity. Available: http://www.valdosta.peachnet.edu/~whuitt/psy702/intro/valdgn.html

A Web document addressing key issues of external and internal validity.

Jones, J. E. & Bearley, W.L. (1996, Oct 12). Reliability and validity of training instruments. Organizational Universe Systems. Available: http://ous.usa.net/relval.htm

The authors discuss the reliability and validity of training design in a business setting. Basic terms are defined and examples provided.

Cultural Anthropology Methods Journal. (1996, Oct. 12). Available: http://www.lawrence.edu/~bradleyc/cam.html

An online journal containing articles on the practical application of research methods when conducting qualitative and quantitative research. Reliability and validity are addressed throughout.

Kirk, J. & Miller, M. M. (1986). Reliability and validity in qualitative research. Beverly Hills: Sage Publications.

This text describes objectivity in qualitative research by focusing on the issues of validity and reliability in terms of their limitations and applicability in the social and natural sciences.

Krakower, J. & Niwa, S. (1985). An assessment of validity and reliability of the institutional performance survey. Boulder, CO: National Center for Higher Education Management Systems.

Addresses educational surveys, higher education research, and organizational effectiveness.

Lauer, J. M. & Asher, J.W. (1988). Composition Research. New York: Oxford University Press.

A discussion of empirical designs in the context of composition research as a whole.

Laurent, J. et al. (1992, Mar.). Review of validity research on the Stanford-Binet Intelligence Scale: Fourth Edition. Psychological Assessment, 102–112.

This paper looks at the results of construct and criterion-related validity studies to determine if the SB:FE is a valid measure of intelligence.

LeCompte, M. D., Millroy, W.L., & Preissle, J. eds. (1992). The handbook of qualitative research in education. San Diego: Academic Press.

A compilation of the range of methodological and theoretical qualitative inquiry in the human sciences and education research. Numerous contributing authors apply their expertise to discussing a wide variety of issues pertaining to educational and humanities research as well as suggestions about how to deal with problems when conducting research.

McDowell, I. & Newell, C. (1987). Measuring health: A guide to rating scales and questionnaires . New York: Oxford University Press.

This gives a variety of examples of health measurement techniques and scales and discusses the validity and reliability of important health measures.

Meeks, B. (1995, July). Muckraker: How Time failed. Available: http://www.hotwired.com

A step-by-step outline of the events which took place during the researching, writing, and negotiating of the Time article of 3 July, 1995 titled: On A Screen Near You: Cyberporn .

Merriam, S. B. (1995). What can you tell from an N of 1?: Issues of validity and reliability in qualitative research. Journal of Lifelong Learning v4 , 51-60.

Addresses issues of validity and reliability in qualitative research for education. Discusses philosophical assumptions underlying the concepts of internal validity, reliability, and external validity or generalizability. Presents strategies for ensuring rigor and trustworthiness when conducting qualitative research.

Morris, L.L, Fitzgibbon, C.T., & Lindheim, E. (1987). How to measure performance and use tests. In J.L. Herman (Ed.), Program evaluation kit (2nd ed.). Newbury Park, CA: Sage.

Discussion of reliability and validity as it pertains to measuring students' performance.

Murray, S., et al. (1979, April). Technical issues as threats to internal validity of experimental and quasi-experimental designs. San Francisco: University of California. 8-12.

(From Yang et al. bibliography--unavailable as of this writing.)

Russ-Eft, D. F. (1980). Validity and reliability in survey research. American Institutes for Research in the Behavioral Sciences, August, 227 151.

An investigation of validity and reliability in survey research with an overview of the concepts of reliability and validity. Specific procedures for measuring sources of error are suggested, as well as general suggestions for improving the reliability and validity of survey data. An extensive annotated bibliography is provided.

Ryser, G. R. (1994). Developing reliable and valid authentic assessments for the classroom: Is it possible? Journal of Secondary Gifted Education Fall, v6 n1 , 62-66.

Defines the meanings of reliability and validity as they apply to standardized measures of classroom assessment. This article defines reliability as scorability and stability, while validity is seen as students' ability to use knowledge authentically in the field.

Schmidt, W., et al. (1982). Validity as a variable: Can the same certification test be valid for all students? Institute for Research on Teaching, July, ED 227 151.

A technical report that presents specific criteria for judging content, instructional and curricular validity as related to certification tests in education.

Scholfield, P. (1995). Quantifying language. A researcher's and teacher's guide to gathering language data and reducing it to figures . Bristol: Multilingual Matters.

A guide to categorizing, measuring, testing, and assessing aspects of language. A source for language-related practitioners and researchers in conjunction with other resources on research methods and statistics. Questions of reliability and validity are also explored.

Scriven, M. (1993). Hard-Won Lessons in Program Evaluation . San Francisco: Jossey-Bass Publishers.

A common sense approach for evaluating the validity of various educational programs and how to address specific issues facing evaluators.

Shou, P. (1993, Jan.). The Singer-Loomis Inventory of Personality: A review and critique. [Paper presented at the Annual Meeting of the Southwest Educational Research Association.]

Evidence for reliability and validity are reviewed. A summary evaluation suggests that SLIP (developed by two Jungian analysts to allow examination of personality from the perspective of Jung's typology) appears to be a useful tool for educators and counselors.

Sutton, L.R. (1992). Community college teacher evaluation instrument: A reliability and validity study . Diss. Colorado State University.

Studies of reliability and validity in occupational and educational research.

Thompson, B. & Daniel, L.G. (1996, Oct.). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and Psychological Measurement, v. 56, 741-745.

Editorial board members of Educational and Psychological Measurement generated a bibliography of definitive publications of measurement research. Many articles are directly related to reliability and validity.

Thompson, E. Y., et al. (1995). Overview of qualitative research . Diss. Colorado State University.

A discussion of strengths and weaknesses of qualitative research and its evolution and adaptation. Appendices and annotated bibliography.

Traver, C. et al. (1995). Case Study . Diss. Colorado State University.

This presentation gives an overview of case study research, providing definitions and a brief history and explanation of how to design research.

Trochim, William M. K. (1996). External validity. Available: http://trochim.human.cornell.edu/kb/EXTERVAL.htm

A comprehensive treatment of external validity found in William Trochim's online text about research methods and issues.

Trochim, William M. K. (1996). Introduction to validity. Available: http://trochim.human.cornell.edu/kb/INTROVAL.htm

An introduction to validity found in William Trochim's online text about research methods and issues.

Trochim, William M. K. (1996). Reliability. Available: http://trochim.human.cornell.edu/kb/reltypes.htm

A comprehensive treatment of reliability found in William Trochim's online text about research methods and issues.

Validity. (1996, Oct. 12). Available: http://vislab-www.nps.navy.mil/~haga/validity.html

A source for definitions of various forms and types of reliability and validity.

Vinsonhaler, J. F., et al. (1983, July). Improving diagnostic reliability in reading through training. Institute for Research on Teaching ED 237 934.

This technical report investigates the practical application of a program intended to improve the diagnoses of reading deficient students. Here, reliability is assumed and a pragmatic answer to a specific educational problem is suggested as a result.

Wentland, E. J. & Smith, K.W. (1993). Survey responses: An evaluation of their validity . San Diego: Academic Press.

This book looks at the factors affecting response validity (or the accuracy of self-reports in surveys) and provides several examples with varying accuracy levels.

Wiget, A. (1996). Father Juan Greyrobe: Reconstructing tradition histories, and the reliability and validity of uncorroborated oral tradition. Ethnohistory, 43:3, 459-482.

This paper presents a convincing argument for the validity of oral histories in ethnographic research where at least some of the evidence can be corroborated through written records.

Yang, G. H., et al. (1995). Experimental and quasi-experimental educational research . Diss. Colorado State University.

This discussion defines experimentation and considers the rhetorical issues and advantages and disadvantages of experimental research. Annotated bibliography.

Yarroch, W. L. (1991, Sept.). The implications of content versus item validity on science tests. Journal of Research in Science Teaching, 619-629.

The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed to look at qualitative comparisons between different factors.

Yin, R. K. (1989). Case study research: Design and methods . London: Sage Publications.

This book discusses the design process of case study research, including collection of evidence, composing the case study report, and designing single and multiple case studies.

Related Links

Internal Validity Tutorial. An interactive tutorial on internal validity.

http://server.bmod.athabascau.ca/html/Validity/index.shtml

Howell, Jonathan, Paul Miller, Hyun Hee Park, Deborah Sattler, Todd Schack, Eric Spery, Shelley Widhalm, & Mike Palmquist. (2005). Reliability and Validity. Writing@CSU . Colorado State University. https://writing.colostate.edu/guides/guide.cfm?guideid=66

Validity and reliability in quantitative studies

Roberta Heale, School of Nursing, Laurentian University, Sudbury, Ontario, Canada; Alison Twycross, Faculty of Health and Social Care, London South Bank University, London, UK

https://doi.org/10.1136/eb-2015-102129

Evidence-based practice includes, in part, implementation of the findings of well-conducted quality research studies. So being able to critique quantitative research is an important skill for nurses. Consideration must be given not only to the results of the study but also the rigour of the research. Rigour refers to the extent to which the researchers worked to enhance the quality of the studies. In quantitative research, this is achieved through measurement of the validity and reliability. 1


Types of validity

The first category is content validity . This category looks at whether the instrument adequately covers all the content that it should with respect to the variable. In other words, does the instrument cover the entire domain related to the variable, or construct it was designed to measure? In an undergraduate nursing course with instruction about public health, an examination with content validity would cover all the content in the course with greater emphasis on the topics that had received greater coverage or more depth. A subset of content validity is face validity , where experts are asked their opinion about whether an instrument measures the concept intended.

Construct validity refers to whether you can draw inferences about test scores related to the concept being studied. For example, if a person has a high score on a survey that measures anxiety, does this person truly have a high degree of anxiety? In another example, a test of knowledge of medications that requires dosage calculations may instead be testing maths knowledge.

There are three types of evidence that can be used to demonstrate a research instrument has construct validity:

Homogeneity—meaning that the instrument measures one construct.

Convergence—this occurs when the instrument measures concepts similar to those measured by other instruments. If no similar instruments are available, this evidence cannot be obtained.

Theory evidence—this is evident when behaviour is similar to theoretical propositions of the construct measured in the instrument. For example, when an instrument measures anxiety, one would expect to see that participants who score high on the instrument for anxiety also demonstrate symptoms of anxiety in their day-to-day lives. 2

The final measure of validity is criterion validity . A criterion is any other instrument that measures the same variable. Correlations can be conducted to determine the extent to which the different instruments measure the same variable. Criterion validity is measured in three ways:

Convergent validity—shows that an instrument is highly correlated with instruments measuring similar variables.

Divergent validity—shows that an instrument is poorly correlated to instruments that measure different variables. In this case, for example, there should be a low correlation between an instrument that measures motivation and one that measures self-efficacy.

Predictive validity—means that the instrument should have high correlations with future criteria. 2 For example, a high score on self-efficacy related to performing a task should predict the likelihood of a participant completing the task.
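To make the criterion-validity checks above concrete, here is a minimal Python sketch that correlates scores from three hypothetical instruments; the instruments, participants and numbers are all invented for illustration, not taken from the article:

```python
import numpy as np

# Invented scores for six participants on three hypothetical instruments.
anxiety_a = np.array([12, 18, 9, 22, 15, 7])     # instrument being validated
anxiety_b = np.array([14, 17, 10, 21, 16, 8])    # established anxiety instrument
motivation = np.array([18, 20, 15, 19, 30, 22])  # instrument for a different variable

# Convergent validity: expect a high correlation with the similar instrument.
r_convergent = np.corrcoef(anxiety_a, anxiety_b)[0, 1]

# Divergent validity: expect a low correlation with the dissimilar instrument.
r_divergent = np.corrcoef(anxiety_a, motivation)[0, 1]

print(f"convergent r = {r_convergent:.2f}, divergent r = {r_divergent:.2f}")
```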

Reliability

Reliability relates to the consistency of a measure. A participant completing an instrument meant to measure motivation should have approximately the same responses each time the test is completed. Although it is not possible to give an exact calculation of reliability, an estimate of reliability can be achieved through different measures. The three attributes of reliability are outlined in table 2. How each attribute is tested for is described below.

Table 2. Attributes of reliability: homogeneity (internal consistency), stability, and equivalence.

Homogeneity (internal consistency) is assessed using item-to-total correlation, split-half reliability, the Kuder-Richardson coefficient and Cronbach's α. In split-half reliability, the results of a test, or instrument, are divided in half. Correlations are calculated comparing both halves. Strong correlations indicate high reliability, while weak correlations indicate the instrument may not be reliable. The Kuder-Richardson test is a more complicated version of the split-half test. In this process the average of all possible split-half combinations is determined and a correlation between 0 and 1 is generated. This test is more accurate than the split-half test, but can only be completed on questions with two answers (eg, yes or no, 0 or 1). 3

Cronbach's α is the most commonly used test to determine the internal consistency of an instrument. In this test, the average of all correlations in every combination of split-halves is determined. Instruments with questions that have more than two responses can be used in this test. The Cronbach's α result is a number between 0 and 1. An acceptable reliability score is one that is 0.7 and higher. 1 , 3
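As an illustration, Cronbach's α can be computed directly from its standard formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). A minimal Python sketch with invented questionnaire responses (not data from the article):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha; rows are respondents, columns are items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented responses: 5 respondents x 4 items on a 1-5 scale.
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")  # 0.7 or higher is acceptable
```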

Stability is tested using test–retest and parallel or alternate-form reliability testing. Test–retest reliability is assessed when an instrument is given to the same participants more than once under similar circumstances. A statistical comparison is made between participants' test scores for each of the times they have completed it. This provides an indication of the reliability of the instrument. Parallel-form reliability (or alternate-form reliability) is similar to test–retest reliability except that a different form of the original instrument is given to participants in subsequent tests. The domain, or concepts, being tested are the same in both versions of the instrument, but the wording of items is different. 2 For an instrument to demonstrate stability there should be a high correlation between the scores each time a participant completes the test. Generally speaking, a correlation coefficient of less than 0.3 signifies a weak correlation, 0.3–0.5 is moderate and greater than 0.5 is strong. 4
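A test–retest check reduces to a correlation between two administrations of the same instrument. The sketch below uses the Pearson correlation on invented scores and the thresholds quoted above:

```python
import numpy as np

# Invented motivation scores for the same eight participants at two times.
time1 = np.array([34, 28, 41, 25, 37, 30, 44, 27])
time2 = np.array([33, 30, 40, 26, 35, 31, 45, 29])

r = np.corrcoef(time1, time2)[0, 1]
# Interpretation per the text: < 0.3 weak, 0.3-0.5 moderate, > 0.5 strong.
print(f"test-retest r = {r:.2f}")
```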

Equivalence is assessed through inter-rater reliability. This test includes a process for qualitatively determining the level of agreement between two or more observers. A good example of the process used in assessing inter-rater reliability is the scores of judges for a skating competition. The level of consistency across all judges in the scores given to skating participants is the measure of inter-rater reliability. An example in research is when researchers are asked to give a score for the relevancy of each item on an instrument. Consistency in their scores relates to the level of inter-rater reliability of the instrument.
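The article does not name a specific statistic for inter-rater agreement; simple percent agreement and Cohen's kappa (which corrects agreement for chance) are two common choices. A sketch with invented ratings:

```python
import numpy as np

# Invented relevancy ratings (1-3) given by two raters to ten instrument items.
rater1 = np.array([3, 2, 3, 1, 2, 3, 3, 1, 2, 2])
rater2 = np.array([3, 2, 3, 2, 2, 3, 1, 1, 2, 2])

observed = np.mean(rater1 == rater2)  # simple percent agreement

# Cohen's kappa: correct observed agreement for agreement expected by chance.
categories = np.union1d(rater1, rater2)
p1 = np.array([np.mean(rater1 == c) for c in categories])
p2 = np.array([np.mean(rater2 == c) for c in categories])
expected = np.sum(p1 * p2)
kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```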

Determining how rigorously the issues of reliability and validity have been addressed in a study is an essential component in the critique of research, as well as in the decision about whether to implement the study findings in nursing practice. In quantitative studies, rigour is determined through an evaluation of the validity and reliability of the tools or instruments utilised in the study. A good quality research study will provide evidence of how all these factors have been addressed. This will help you to assess the validity and reliability of the research and help you decide whether or not you should apply the findings in your area of clinical practice.

References

  • Lobiondo-Wood G
  • Shuttleworth M
  • Laerd Statistics. Determining the correlation coefficient. 2013. https://statistics.laerd.com/premium/pc/pearson-correlation-in-spss-8.php


Reliability and Validity

Yori Gidron, Encyclopedia of Behavioral Medicine (Springer, New York, 2013), pp. 1643–1644.

These two concepts are the basis for assessment in most scientific work in medical and social sciences. Reliability refers to the degree of consistency in measurement and to the lack of error. There are several types of indices of reliability. Internal reliability (measured by Cronbach’s alpha) is a measure of repeatability of a measure. In psychometrics, a questionnaire of, for example, 10 items, is said to be reliable if its internal reliability coefficient is at least 0.70. This reflects approximately the mean correlation between each score on each item, with all remaining item scores, repeated across all items. Methodologically, this reflects a measure of repeatability, a basic premise of science. Another type of reliability is inter-rater reliability, which refers to the degree of agreement between two or more observers, evaluating a patient’s behavior, for example. Thus, in the original type A behavior interview, which currently places more emphasis on hostility,...

The Meaning of Reliability in Sociology

Four Procedures for Assessing Reliability


Reliability is the degree to which a measurement instrument gives the same results each time that it is used, assuming that the underlying thing being measured does not change.

Key Takeaways: Reliability

  • If a measurement instrument provides similar results each time it is used (assuming that whatever is being measured stays the same over time), it is said to have high reliability.
  • Good measurement instruments should have both high reliability and high accuracy.
  • Four methods sociologists can use to assess reliability are the test-retest procedure, the alternate forms procedure, the split-halves procedure, and the internal consistency procedure.

Imagine that you’re trying to assess the reliability of a thermometer in your home. If the temperature in a room stays the same, a reliable thermometer will always give the same reading. A thermometer that lacks reliability would change even when the temperature does not. Note, however, that the thermometer does not have to be accurate in order to be reliable. It might always register three degrees too high, for example. Its degree of reliability has to do instead with the predictability of its relationship with whatever is being tested.

Methods to Assess Reliability

In order to assess reliability, the thing being measured must be measured more than once. For example, if you wanted to measure the length of a sofa to make sure it would fit through a door, you might measure it twice. If you get an identical measurement twice, you can be confident you measured reliably.

There are four procedures for assessing the reliability of a test. (Here, the term "test" refers to a group of statements on a questionnaire, an observer's quantitative or qualitative  evaluation, or a combination of the two.)

The Test-Retest Procedure

Here, the same test is given two or more times. For example, you might create a questionnaire with a set of ten statements to assess confidence. These ten statements are then given to a subject twice at two different times. If the respondent gives similar answers both times, you can assume the test measured the subject's confidence reliably.

One advantage of this method is that only one test needs to be developed for this procedure. However, there are a few downsides of the test-retest procedure. Events might occur between testing times that affect the respondents' answers; answers might change over time simply because people change and grow over time; and the subject might adjust to the test the second time around, think more deeply about the questions, and reevaluate their answers. For instance, in the example above, some respondents might have become more confident between the first and second testing session, which would make it more difficult to interpret the results of the test-retest procedure.

The Alternate Forms Procedure

In the alternate forms procedure (also called parallel forms reliability ), two tests are given. For example, you might create two sets of five statements measuring confidence. Subjects would be asked to take each of the five-statement questionnaires. If the person gives similar answers for both tests, you can assume you measured the concept reliably. One advantage is that cueing will be less of a factor because the two tests are different. However, it's important to ensure that both alternate versions of the test are indeed measuring the same thing.

The Split-Halves Procedure

In this procedure, a single test is given once. A grade is assigned to each half separately and grades are compared from each half. For example, you might have one set of ten statements on a questionnaire to assess confidence. Respondents take the test and the questions are then split into two sub-tests of five items each. If the score on the first half mirrors the score on the second half, you can presume that the test measured the concept reliably. On the plus side, history, maturation, and cueing aren't at play. However, scores can vary greatly depending on the way in which the test is divided into halves.
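As an illustration of the split-halves procedure, the sketch below scores two halves of an invented ten-statement questionnaire and correlates them; it also applies the Spearman-Brown correction, a standard adjustment (not mentioned in the text) that estimates full-test reliability from a half-test correlation:

```python
import numpy as np

# Invented data: 6 respondents x 10 confidence statements scored 1-5.
scores = np.array([
    [4, 5, 4, 4, 5, 4, 4, 5, 4, 4],
    [2, 2, 3, 2, 2, 3, 2, 2, 2, 3],
    [5, 4, 5, 5, 4, 5, 5, 5, 4, 5],
    [3, 3, 2, 3, 3, 3, 2, 3, 3, 3],
    [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    [4, 4, 4, 3, 4, 4, 3, 4, 4, 4],
])

half_a = scores[:, :5].sum(axis=1)  # sub-test of the first five statements
half_b = scores[:, 5:].sum(axis=1)  # sub-test of the last five statements
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction: estimated reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```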

The Internal Consistency Procedure

Here, the same test is administered once, and the score is based upon average similarity of responses. For example, in a ten-statement questionnaire to measure confidence, each response can be seen as a one-statement sub-test. The similarity in responses to each of the ten statements is used to assess reliability. If the respondent doesn't answer all ten statements in a similar way, then one can assume that the test is not reliable. One way that researchers can assess internal consistency is by using statistical software to calculate Cronbach’s alpha .

With the internal consistency procedure, history, maturation, and cueing aren't a consideration. However, the number of statements in the test can affect the assessment of reliability when assessing it internally.


Validity, reliability, and generalizability in qualitative research

Lawrence Leung (J Family Med Prim Care, v.4(3), Jul-Sep 2015)

1 Department of Family Medicine, Queen's University, Kingston, Ontario, Canada

2 Centre of Studies in Primary Care, Queen's University, Kingston, Ontario, Canada

In general practice, qualitative research contributes as significantly as quantitative research, in particular regarding psycho-social aspects of patient care, health services provision, policy setting, and health administration. In contrast to quantitative research, qualitative research as a whole has been constantly critiqued, if not disparaged, for the lack of consensus on assessing its quality and robustness. This article illustrates with five published studies how qualitative research can impact and reshape the discipline of primary care, spiraling out from clinic-based health screening to community-based disease monitoring, evaluation of out-of-hours triage services, a provincial psychiatric care-pathways model and, finally, national legislation of core measures for children's healthcare insurance. Fundamental concepts of validity, reliability, and generalizability as applicable to qualitative research are then addressed with an update on the current views and controversies.

Nature of Qualitative Research versus Quantitative Research

The essence of qualitative research is to make sense of and recognize patterns among words in order to build up a meaningful picture without compromising its richness and dimensionality. Like quantitative research, qualitative research aims to seek answers to questions of "how, where, when, who and why" with a perspective to build a theory or refute an existing theory. Unlike quantitative research, which deals primarily with numerical data and their statistical interpretations under a reductionist, logical and strictly objective paradigm, qualitative research handles nonnumerical information and its phenomenological interpretation, which inextricably ties in with human senses and subjectivity. While human emotions and perspectives from both subjects and researchers are considered undesirable biases confounding results in quantitative research, the same elements are considered essential and inevitable, if not treasurable, in qualitative research as they invariably add extra dimensions and colors to enrich the corpus of findings. However, the issue of subjectivity and contextual ramifications has fueled incessant controversies regarding yardsticks for the quality and trustworthiness of qualitative research results for healthcare.

Impact of Qualitative Research upon Primary Care

In many ways, qualitative research contributes significantly, if not more so than quantitative research, to the field of primary care at various levels. Five qualitative studies are chosen to illustrate how various methodologies of qualitative research helped in advancing primary healthcare, from novel monitoring of chronic obstructive pulmonary disease (COPD) via mobile-health technology,[1] informed decision for colorectal cancer screening,[2] triaging out-of-hours GP services,[3] evaluating care pathways for community psychiatry[4] and finally prioritization of healthcare initiatives for legislation purposes at national levels.[5] With recent advances in information technology and mobile connected devices, self-monitoring and management of chronic diseases via tele-health technology may seem beneficial to both the patient and healthcare provider. Recruiting COPD patients who were given tele-health devices that monitored lung functions, Williams et al.[1] conducted phone interviews, analyzed the transcripts via a grounded theory approach, and identified themes which enabled them to conclude that such a mobile-health setup helped to engage patients, with better adherence to treatment and overall improvement in mood. Such positive findings were in contrast to previous studies, which opined that elderly patients were often challenged by operating computer tablets[6] or conversing with the tele-health software.[7] To explore the content of recommendations for colorectal cancer screening given out by family physicians, Wackerbarth et al.[2] conducted semi-structured interviews with subsequent content analysis and found that most physicians delivered information to enrich patient knowledge with little regard to patients' true understanding, ideas, and preferences in the matter. These findings suggested room for improvement for family physicians to better engage their patients in recommending preventative care. Faced with various models of out-of-hours triage services for GP consultations, Egbunike et al.[3] conducted thematic analysis on semi-structured telephone interviews with patients and doctors in various urban, rural and mixed settings. They found that the efficiency of triage services remained a prime concern for both users and providers, among issues of access to doctors and unfulfilled/mismatched expectations from users, which could arouse dissatisfaction and legal implications. In the UK, a care pathways model for community psychiatry had been introduced but its benefits were unclear. Khandaker et al.[4] hence conducted a qualitative study using semi-structured interviews with medical staff and other stakeholders; adopting a grounded-theory approach, major themes emerged which included improved equality of access, more focused logistics, increased work throughput and better accountability for community psychiatry provided under the care pathway model. Finally, at the US national level, Mangione-Smith et al.[5] employed a modified Delphi method to gather consensus from a panel of nominators who were recognized experts and stakeholders in their disciplines, and identified a core set of quality measures for children's healthcare under the Medicaid and Children's Health Insurance Program. These core measures were made transparent for public opinion and later passed on for full legislation, hence illustrating the impact of qualitative research upon social welfare and policy improvement.

Overall Criteria for Quality in Qualitative Research

Given the diverse genera and forms of qualitative research, there is no consensus for assessing any piece of qualitative research work. Various approaches have been suggested, the two leading schools of thought being that of Dixon-Woods et al.,[8] which emphasizes methodology, and that of Lincoln et al.,[9] which stresses the rigor of interpretation of results. By identifying commonalities of qualitative research, Dixon-Woods produced a checklist of questions for assessing the clarity and appropriateness of the research question; the description of and appropriateness of sampling, data collection and data analysis; levels of support and evidence for claims; coherence between data, interpretation and conclusions; and finally the level of contribution of the paper. These criteria informed the 10 questions of the Critical Appraisal Skills Programme checklist for qualitative studies.[10] However, these methodology-weighted criteria may not do justice to qualitative studies that differ in epistemological and philosophical paradigms,[11, 12] one classic example being positivistic versus interpretivistic.[13] Equally, the rigorous interpretation of results advocated by Lincoln et al.[9] will not hold without a robust methodological layout. Meyrick[14] argued from a different angle and proposed fulfillment of the dual core criteria of "transparency" and "systematicity" for good quality qualitative research. In brief, every step of the research logistics (from theory formation, design of study, sampling, data acquisition and analysis to results and conclusions) has to be validated as being transparent or systematic enough. In this manner, both the research process and results can be assured of high rigor and robustness.[14] Finally, Kitto et al.[15] epitomized six criteria for assessing the overall quality of qualitative research: (i) clarification and justification, (ii) procedural rigor, (iii) sample representativeness, (iv) interpretative rigor, (v) reflexive and evaluative rigor and (vi) transferability/generalizability, which also double as evaluative landmarks for manuscript review for the Medical Journal of Australia. As with quantitative research, quality for qualitative research can be assessed in terms of validity, reliability, and generalizability.

Validity

Validity in qualitative research means "appropriateness" of the tools, processes, and data: whether the research question is valid for the desired outcome, the choice of methodology is appropriate for answering the research question, the design is valid for the methodology, the sampling and data analysis are appropriate, and finally the results and conclusions are valid for the sample and context. In assessing the validity of qualitative research, the challenge can start from the ontology and epistemology of the issue being studied, e.g. the concept of "individual" is seen differently by humanistic and positive psychologists due to differing philosophical perspectives:[16] where humanistic psychologists believe the "individual" is a product of existential awareness and social interaction, positive psychologists think the "individual" exists side-by-side with the formation of any human being. Set off in different pathways, qualitative research regarding the individual's wellbeing will be concluded with varying validity. The choice of methodology must enable detection of findings/phenomena in the appropriate context for it to be valid, with due regard to cultural and contextual variables. For sampling, procedures and methods must be appropriate for the research paradigm and be distinctive between systematic,[17] purposeful[18] or theoretical (adaptive) sampling,[19, 20] where systematic sampling has no a priori theory, purposeful sampling often has a certain aim or framework and theoretical sampling is molded by the ongoing process of data collection and theory in evolution. For data extraction and analysis, several methods can be adopted to enhance validity, including first-tier triangulation (of researchers) and second-tier triangulation (of resources and theories),[17, 21] a well-documented audit trail of materials and processes,[22, 23, 24] multidimensional analysis as concept- or case-orientated[25, 26] and respondent verification.[21, 27]

Reliability

In quantitative research, reliability refers to exact replicability of the processes and the results. In qualitative research, with its diverse paradigms, such a definition of reliability is challenging and epistemologically counter-intuitive. Hence, the essence of reliability for qualitative research lies with consistency.[24, 28] A margin of variability for results is tolerated in qualitative research provided the methodology and epistemological logistics consistently yield data that are ontologically similar but may differ in richness and ambience within similar dimensions. Silverman[29] proposed five approaches to enhancing the reliability of process and results: refutational analysis, constant data comparison, comprehensive data use, inclusion of the deviant case and use of tables. As data are extracted from the original sources, researchers must verify their accuracy in terms of form and context with constant comparison,[27] either alone or with peers (a form of triangulation).[30] The scope and analysis of data included should be as comprehensive and inclusive as possible, with reference to quantitative aspects where applicable.[30] Adopting the Popperian dictum of falsifiability as the essence of truth and science, attempts to refute the qualitative data and analyses should be made to assess reliability.[31]

Generalizability

Most qualitative research studies, if not all, are meant to study a specific issue or phenomenon in a certain population or ethnic group, of a focused locality in a particular context, hence generalizability of qualitative research findings is usually not an expected attribute. However, with the rising trend of knowledge synthesis from qualitative research via meta-synthesis, meta-narrative or meta-ethnography, evaluation of generalizability becomes pertinent. A pragmatic approach to assessing generalizability for qualitative studies is to adopt the same criteria as for validity: that is, use of systematic sampling, triangulation and constant comparison, proper audit and documentation, and multi-dimensional theory.[17] However, some researchers espouse the approach of analytical generalization,[32] where one judges the extent to which the findings in one study can be generalized to another under a similar theoretical framework, and the proximal similarity model, where the generalizability of one study to another is judged by similarities between the time, place, people and other social contexts.[33] That said, Zimmer[34] questioned the suitability of meta-synthesis in view of the basic tenets of grounded theory,[35] phenomenology[36] and ethnography.[37] He concluded that any valid meta-synthesis must retain the other two goals of theory development and higher-level abstraction while in search of generalizability, and must be executed as a third-level interpretation using Gadamer's concepts of the hermeneutic circle,[38, 39] dialogic process[38] and fusion of horizons.[39] Finally, Toye et al.[40] reported the practicality of using "conceptual clarity" and "interpretative rigor" as intuitive criteria for assessing quality in meta-ethnography, which somehow echoed Rolfe's controversial aesthetic theory of research reports.[41]

Food for Thought

Despite various measures to enhance or ensure the quality of qualitative studies, some researchers have opined from a purist ontological and epistemological angle that qualitative research is not a unified but ipso facto diverse field,[8] hence any attempt to synthesize or appraise different studies under one system is impossible and conceptually wrong. Barbour argued from a philosophical angle that these special measures or "technical fixes" (like purposive sampling, multiple coding, triangulation, and respondent validation) can never confer the rigor as conceived.[11] In extremis, Rolfe et al. opined, from the field of nursing research, that any set of formal criteria used to judge the quality of qualitative research is futile and without validity, and suggested that any qualitative report should be judged by the form in which it is written (aesthetic) and not by its contents (epistemic).[41] Rolfe's novel view is rebutted by Porter,[42] who argued via logical premises that two of Rolfe's fundamental statements were flawed: (i) "the content of research reports is determined by their forms" may not be a fact, and (ii) research appraisal being "subject to individual judgment based on insight and experience" would mean that those without sufficient experience of performing research would be unable to judge adequately – hence an elitist's principle. From a realism standpoint, Porter then proposes multiple and open approaches for validity in qualitative research that incorporate parallel perspectives[43, 44] and diversification of meanings.[44] Any work of qualitative research, when read by readers, is always a two-way interactive process, such that validity and quality have to be judged by the receiving end too, and not by the researcher end alone.

In summary, the three gold criteria of validity, reliability and generalizability apply in principle to assess quality for both quantitative and qualitative research, what differs will be the nature and type of processes that ontologically and epistemologically distinguish between the two.


Research-Methodology

Research Reliability

Reliability refers to whether or not you get the same answer by using an instrument to measure something more than once. In simple terms, research reliability is the degree to which a research method produces stable and consistent results.

A specific measure is considered to be reliable if its application on the same object of measurement a number of times produces the same results.

Research reliability can be divided into four categories:

1. Test-retest reliability relates to the measure of reliability that has been obtained by conducting the same test more than once over a period of time with the participation of the same sample group.

Example: Employees of ABC Company may be asked to complete the same questionnaire about employee job satisfaction two times with an interval of one week, so that test results can be compared to assess the stability of scores.


2. Parallel forms reliability relates to a measure that is obtained by conducting assessment of the same phenomenon with the participation of the same sample group via more than one assessment method.

Example: The levels of employee satisfaction of ABC Company may be assessed with questionnaires, in-depth interviews and focus groups and results can be compared.


3. Inter-rater reliability, as the name indicates, relates to the measure of sets of results obtained by different assessors using the same methods. The benefits and importance of assessing inter-rater reliability can be explained by referring to the subjectivity of assessments.

Example: Levels of employee motivation at ABC Company can be assessed using observation method by two different assessors, and inter-rater reliability relates to the extent of difference between the two assessments.


4. Internal consistency reliability is applied to assess the extent to which test items that explore the same construct produce similar results. It can be represented in two main formats:

a) average inter-item correlation is a specific form of internal consistency that is obtained by correlating each item of the test with every other item measuring the same construct and averaging these correlations;

b) split-half reliability, another type of internal consistency reliability, involves all items of a test being ‘split in half’.
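As a rough sketch of format (a), the code below computes the average inter-item correlation for an invented set of responses (the data and scale are made up for illustration):

```python
import numpy as np

# Invented responses: 5 respondents x 4 items intended to measure one construct.
items = np.array([
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
])

corr = np.corrcoef(items, rowvar=False)        # 4x4 item-by-item correlation matrix
upper = corr[np.triu_indices_from(corr, k=1)]  # unique item pairs only
print(f"average inter-item correlation = {upper.mean():.2f}")
```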


Reliability and validity: Importance in Medical Research

Affiliations.

  • 1 Al-Nafees Medical College,Isra University, Islamabad, Pakistan.
  • 2 Fauji Foundation Hospital, Foundation University Medical College, Islamabad, Pakistan.
  • PMID: 34974579
  • DOI: 10.47391/JPMA.06-861

Reliability and validity are among the most important and fundamental domains in the assessment of any measuring methodology for data collection in good research. Validity is about what an instrument measures and how well it does so, whereas reliability concerns the truthfulness of the data obtained and the degree to which any measuring tool controls random error. The current narrative review was planned to discuss the importance of reliability and validity of data-collection or measurement techniques used in research. It describes and explores comprehensively the reliability and validity of research instruments and also discusses different forms of reliability and validity with concise examples. An attempt has been made to give a brief literature review regarding the significance of reliability and validity in medical sciences.

Keywords: Validity, Reliability, Medical research, Methodology, Assessment, Research tools.


Validity, Accuracy and Reliability Explained with Examples

This is part of the Working Scientifically skills in the NSW HSC science curriculum.

Part 1 – Validity

Part 2 – Accuracy

Part 3 – Reliability

Science experiments are an essential part of high school education, helping students understand key concepts and develop critical thinking skills. However, the value of an experiment lies in its validity, accuracy, and reliability. Let's break down these terms and explore how they can be improved and reduced, using simple experiments as examples.

Target Analogy to Understand Accuracy and Reliability

The target analogy is a classic way to understand the concepts of accuracy and reliability in scientific measurements and experiments. 


Accuracy refers to how close a measurement is to the true or accepted value. In the analogy, it's how close the arrows come to hitting the bullseye (represents the true or accepted value).

Reliability  refers to the consistency of a set of measurements. Reliable data can be reproduced under the same conditions. In the analogy, it's represented by how tightly the arrows are grouped together, regardless of whether they hit the bullseye. Therefore, we can have scientific results that are reliable but inaccurate.

  • Validity  refers to how well an experiment investigates the aim or tests the underlying hypothesis. While validity is not represented in the target analogy, the validity of an experiment can sometimes be assessed by using the accuracy of results as a proxy. Experiments that produce accurate results are likely to be valid, as invalid experiments usually do not yield accurate results.

Validity refers to how well an experiment measures what it is supposed to measure and investigates the aim.

Ask yourself the questions:

  • "Is my experimental method and design suitable?"
  • "Is my experiment testing or investigating what it's suppose to?"


For example, if you're investigating the effect of the volume of water (independent variable) on plant growth, your experiment would be valid if you measure growth factors like height or leaf size (these would be your dependent variables).

However, validity entails more than just what's being measured. When assessing validity, you should also examine how well the experimental methodology investigates the aim of the experiment.

Assessing Validity

An experiment’s procedure, the subsequent methods of analysis of the data, the data itself, and the conclusion you draw from the data all have their own associated validities. It is important to understand this division because there are different factors to consider when assessing the validity of any single one of them. The validity of an experiment as a whole depends on the individual validities of these components.

When assessing the validity of the procedure , consider the following:

  • Does the procedure control all necessary variables except for the dependent and independent variables? That is, have you isolated the effect of the independent variable on the dependent variable?
  • Does this effect you have isolated actually address the aim and/or hypothesis?
  • Does your method include enough repetitions for a reliable result? (Read more about reliability below)

When assessing the validity of the method of analysis of the data , consider the following:

  • Does the analysis extrapolate or interpolate the experimental data? Generally, interpolation is valid, but extrapolation is invalid. This is because by extrapolating, you are ‘peering out into the darkness’ – just because your data showed a certain trend for a certain range does not mean that the trend will hold beyond that range.
  • Does the analysis use accepted laws and mathematical relationships? That is, do the equations used for analysis have a scientific or mathematical basis? For example, `F = ma` is an accepted law in physics, but if in the analysis you made up a relationship like `F = ma^2` that has no scientific or mathematical backing, the method of analysis is invalid.
  • Is the most appropriate method of analysis used? Consider the differences between using a table and a graph. In a graph, you can use the gradient to minimise the effects of systematic errors and can also reduce the effect of random errors. The visual nature of a graph also allows you to easily identify outliers and potentially exclude them from analysis. This is why graphical analysis is generally more valid than using values from tables.

When assessing the validity of your results , consider the following: 

  • Is your primary data (data you collected from your own experiment) BOTH accurate and reliable? If not, it is invalid.
  • Are the secondary sources you may have used BOTH reliable and accurate?

When assessing the validity of your conclusion , consider the following:

  • Does your conclusion relate directly to the aim or the hypothesis?

How to Improve Validity

Ways of improving validity will differ across experiments. You must first identify what area(s) of the experiment’s validity is lacking (is it the procedure, analysis, results, or conclusion?). Then, you must come up with ways of overcoming the particular weakness. 

Below are some examples of this.

Example – Validity in Chemistry Experiment 

Let's say we want to measure the mass of carbon dioxide in a can of soft drink.

Heating a can of soft drink

The following steps are followed:

  • Weigh an unopened can of soft drink on an electronic balance.
  • Open the can.
  • Place the can on a hot plate until it begins to boil.
  • When cool, re-weigh the can to determine the mass loss.

To ensure this experiment is valid, we must establish controlled variables:

  • type of soft drink used
  • temperature at which this experiment is conducted
  • period of time before soft drink is re-weighed

Despite these controlled variables, this experiment is invalid because it actually doesn't help us measure the mass of carbon dioxide in the soft drink. This is because by heating the soft drink until it boils, we are also losing water due to evaporation. As a result, the mass loss measured is not only due to the loss of carbon dioxide, but also water. A simple way to improve the validity of this experiment is to not heat it; by simply opening the can of soft drink, carbon dioxide in the can will escape without loss of water.

Example – Validity in Physics Experiment

Let's say we want to measure the value of gravitational acceleration `g` using a simple pendulum system, and the following equation:

$$T = 2\pi \sqrt{\frac{l}{g}}$$

  • `T` is the period of oscillation
  • `l` is the length of string attached to the mass
  • `g` is the acceleration due to gravity

Pendulum practical

  • Cut a piece of a string or dental floss so that it is 1.0 m long.
  • Attach a 500.0 g mass of high density to the end of the string.
  • Attach the other end of the string to the retort stand using a clamp.
  • Starting at an angle of less than 10º, allow the pendulum to swing and measure the pendulum’s period for 10 oscillations using a stopwatch.
  • Repeat the experiment with 1.2 m, 1.5 m and 1.8 m strings.

The controlled variables we must establish in this experiment include:

  • mass used in the pendulum
  • location at which the experiment is conducted

The validity of this experiment depends on the starting angle of oscillation. The above equation (method of analysis) is only true for small angles (`\theta < 15^{\circ}`), such that `\sin \theta \approx \theta`. We also want to make sure the pendulum system has a small enough surface area to minimise the effect of air resistance on its oscillation.


In this instance, it would be invalid to use a pair of values (length and period) to calculate the value of gravitational acceleration. A more appropriate method of analysis would be to plot the length and period squared to obtain a linear relationship, then use the value of the gradient of the line of best fit to determine the value of `g`. 
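A minimal sketch of that suggested analysis, using hypothetical averaged measurements: since `T^2 = (4\pi^2/g)l`, the gradient of `T^2` plotted against `l` equals `4\pi^2/g`.

```python
import numpy as np

# Hypothetical averaged data from the pendulum practical (invented values).
lengths = np.array([1.0, 1.2, 1.5, 1.8])      # string length l (m)
periods = np.array([2.01, 2.20, 2.45, 2.69])  # measured period T (s)

# T^2 = (4 pi^2 / g) * l, so the gradient of T^2 vs l is 4 pi^2 / g.
gradient, intercept = np.polyfit(lengths, periods**2, 1)
g = 4 * np.pi**2 / gradient
print(f"g = {g:.2f} m/s^2")  # about 9.9 m/s^2 for these invented numbers
```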

Accuracy refers to how close the experimental measurements are to the true value.

Accuracy depends on

  • the validity of the experiment
  • the degree of error:
  • systematic errors are those that are systemic in your experiment. That is, they affect every single one of your data points consistently, meaning that the cause of the error is always present. For example, it could be a badly calibrated temperature gauge that reports every reading 5 °C above the true value.
  • random errors are errors that occur inconsistently. For example, the temperature gauge readings might be affected by random fluctuations in room temperature. Some readings might be above the true value, some might then be below the true value.
  • sensitivity of equipment used.

Assessing Accuracy 

The effect of errors and insensitive equipment can both be captured by calculating the percentage error:

$$\text{\% error} = \frac{|\text{experimental value} - \text{true value}|}{\text{true value}} \times 100\%$$

Generally, measurements are considered accurate when the percentage error is less than 5%. You should always take the context of the experiment into account when assessing accuracy.
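A one-line helper for the percentage-error calculation above; the measured and accepted values here are hypothetical:

```python
def percent_error(measured: float, accepted: float) -> float:
    """Percentage error of a measurement relative to the accepted value."""
    return abs(measured - accepted) / accepted * 100

# Hypothetical pendulum period: measured 2.05 s against an accepted 2.01 s.
print(f"{percent_error(2.05, 2.01):.1f}%")  # ~2.0%, within the 5% guideline
```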

While accuracy and validity have different definitions, the two are closely related. Accurate results often suggest that the underlying experiment is valid, as invalid experiments are unlikely to produce accurate results.

In a simple pendulum experiment, if your measurements of the pendulum's period are close to the calculated value, your experiment is accurate. A table showing sample experimental measurements vs accepted values from using the equation above is shown below. 

(Table: measured pendulum periods for each string length compared with theoretical values calculated from the equation above.)

All experimental values in the table above are within 5% of accepted (theoretical) values; they are therefore considered accurate.

How to Improve Accuracy

  • Remove systematic errors: for example, if the experiment’s measuring instruments are poorly calibrated, then you should correctly calibrate them before doing the experiment again.
  • Reduce the influence of random errors: this can be done by having more repetitions in the experiment and reporting the average values. If you have enough of these random errors – some above the true value and some below the true value – then averaging them will make them cancel each other out. This brings your average value closer and closer to the true value (see the sketch after this list).
  • Use More Sensitive Equipments: For example, use a recording to measure time by analysing motion of an object frame by frame, instead of using a stopwatch. The sensitivity of an equipment can be measured by the limit of reading . For example, stopwatches may only measure to the nearest millisecond – that is their limit of reading. But recordings can be analysed to the frame. And, depending on the frame rate of the camera, this could mean measuring to the nearest microsecond.
  • Obtain More Measurements and Over a Wider Range:  In some cases, the relationship between two variables can be more accurately determined by testing over a wider range. For example, in the pendulum experiment, periods when strings of various lengths are used can be measured. In this instance, repeating the experiment does not relate to reliability because we have changed the value of the independent variable tested.
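
As a rough illustration of the averaging point above, the following sketch simulates hypothetical stopwatch readings scattered around an assumed true period by random error:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

true_period = 1.90  # assumed true pendulum period (s), for illustration only

# Simulate 20 stopwatch readings affected by random error (spread of 0.05 s).
readings = true_period + rng.normal(loc=0.0, scale=0.05, size=20)

# Random errors fall above and below the true value, so averaging many
# readings typically lands much closer to the true value than any one reading.
print(f"first reading:       {readings[0]:.3f} s")
print(f"mean of 20 readings: {readings.mean():.3f} s (true value {true_period} s)")
```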

Reliability

Reliability involves the consistency of your results over multiple trials.

Assessing Reliability

The reliability of an experiment can be broken down into the reliability of the procedure and the reliability of the final results.

The reliability of the procedure refers to how consistently the steps of your experiment produce similar results. For example, if an experiment produces the same values every time it is repeated, it is highly reliable. This can be assessed quantitatively by looking at the spread of measurements, using statistical measures such as the greatest deviation from the mean, the standard deviation, or z-scores.
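
For example, the spread statistics mentioned above can be computed directly; the measurements below are hypothetical:

```python
import numpy as np

# Hypothetical repeated measurements of mass loss (g) from three trials.
trials = np.array([1.21, 1.18, 1.25])

mean = trials.mean()
greatest_dev = np.abs(trials - mean).max()  # greatest deviation from the mean
std = trials.std(ddof=1)                    # sample standard deviation
z_scores = (trials - mean) / std            # z-score of each trial

print(f"mean = {mean:.3f} g, greatest deviation = {greatest_dev:.3f} g, std = {std:.3f} g")
print("z-scores:", np.round(z_scores, 2))
```

A small greatest deviation and standard deviation relative to the mean indicate a consistent, and therefore reliable, procedure.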

Ask yourself: "Is my result reproducible?"

The reliability of results cannot be assessed if only one data point or measurement is obtained in the experiment; there must be at least three. When you're repeating the experiment to assess the reliability of its results, you must follow the  same steps  and use the  same value  for the independent variable. Results obtained from methods with different steps cannot be assessed for their reliability.

Obtaining only one measurement in an experiment is not enough, because it could be affected by errors or produced by pure chance. Repeating the experiment and obtaining the same or similar results will increase your confidence that the results are reproducible (and therefore reliable).

In the soft drink experiment, reliability can be assessed by repeating the steps at least three times:

[Table: mass loss measured in three repeated trials of the soft drink experiment]

The mass losses measured in all three trials are fairly consistent, suggesting that the reliability of the underlying method is high.

The reliability of the final results refers to how consistently your final data points (e.g. the average values of repeated trials) point towards the same trend. That is, how close are they all to the trend line? This can be assessed quantitatively using the `R^2` value, which ranges between 0 and 1: a value of 0 suggests no correlation between data points, while a value of 1 suggests a perfect correlation with no variance from the trend line.

In the pendulum experiment, we can calculate the `R^2` value (done in Excel) by using the final average period values measured for each pendulum length.

[Graph: average period measurements for each pendulum length with linear trend line, R² = 0.9758]

Here, an `R^2` value of 0.9758 suggests the four average values lie close to the overall linear trend line (low variance from the trend line). Thus, the results are fairly reliable.
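
If you prefer to compute `R^2` outside Excel, a minimal sketch (with hypothetical average values standing in for real measurements) is:

```python
import numpy as np

lengths = np.array([0.9, 1.2, 1.5, 1.8])            # pendulum lengths (m)
avg_period_sq = np.array([3.61, 4.87, 6.12, 7.18])  # hypothetical mean T^2 (s^2)

# Fit the linear trend line, then compute R^2 = 1 - SS_residual / SS_total.
gradient, intercept = np.polyfit(lengths, avg_period_sq, 1)
predicted = gradient * lengths + intercept

ss_res = np.sum((avg_period_sq - predicted) ** 2)
ss_tot = np.sum((avg_period_sq - avg_period_sq.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")
```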

How to Improve Reliability

A common misconception is that increasing the number of trials increases the reliability of the procedure. This is not true. The only way to increase the reliability of the procedure is to revise it. This could mean using instruments that are less susceptible to random errors, which cause measurements to be more variable.

Increasing the number of trials actually increases the reliability of the final results. This is because having more repetitions reduces the influence of random errors and brings the average values closer to the true values. Generally, the closer experimental values are to true values, the closer they are to the true trend. That is, accurate data points are generally reliable and all point towards the same trend.

Reliable but Inaccurate / Invalid

It is important to understand that results from an experiment can be reliable (consistent), but inaccurate (deviate greatly from theoretical values) and/or invalid. In this case, your procedure is reliable, but your final results likely are not.

Examples of Reliability

Using the soft drink example again, if the mass losses measured for three soft drinks (same brand and type of drink) are consistent, then it's reliable. 

Using the pendulum example again, if you get similar period measurements every time you repeat the experiment, it’s reliable.  

However, in both cases, if the underlying methods are invalid, the consistent results would be invalid and inaccurate (despite being reliable).

Do you have trouble understanding validity, accuracy or reliability in your science experiment or depth study?

Consider getting personalised help from our 1-on-1 mentoring program!

Reliability in Research: Definitions and Types

Reliability in research refers to the consistency of a measure. It demonstrates whether the same results would be obtained if the study were repeated. If a test or tool is reliable, it gives consistent results across different situations or over time. A study with high reliability can be trusted because its outcomes are dependable and reproducible. Unreliable research, by contrast, can lead to misleading or incorrect conclusions, so you should make sure your study results can be trusted.

When you’ve collected your data and need to measure your research results, it’s time to consider how reliable your methods and tools are. Calculation methods often produce errors, particularly when the initial assumptions are wrong. To avoid drawing incorrect conclusions, it is worth investing some time in checking whether your methods are reliable. Today we’ll talk about the reliability of research approaches: what it means and how to check it properly. The main verification methods, such as split-half, inter-item, and inter-rater, are examined and explained below. Let’s find out how to use them with our PhD dissertation writing services!

What Is Reliability in Research: Definition

First, let’s define reliability. It is highly important to ensure your data analysis methods are reliable, meaning they are likely to produce stable and consistent results whenever you apply them to different datasets. A special parameter named ‘reliability’ has been introduced to evaluate this consistency. High reliability means that the method or tool you are evaluating will repeatedly produce the same or similar results as long as conditions remain stable. This parameter has the following key components:

  • probability
  • availability
  • dependability.

Follow our thesis writing services to find out what the main types of this parameter are and how they can be used.

Main Types of Reliability

There are four main types of reliability. Each of them shows the consistency of a different approach to data collection and analysis. These types correspond to different ways of conducting research; however, all of them serve as quality measures for the tools and methods they describe. We’ll examine each of these four types below, discussing their differences, purposes, and areas of use. Let’s take a closer look!

Test Retest Reliability: Definition

The first type is called ‘test-retest’ reliability. You can use it when you need to analyze methods that are applied to the same group of individuals many times. When running the same test on the same subjects over and over again, it is important to know whether it produces reliable results. If the results don’t change significantly over a period of time, we can assume that the method shows a high level of consistency and will be useful for your research.

Test Retest Reliability: Examples

Let’s review an example of test-retest reliability, which might give a student preparing their own research more clarity about this parameter. Suppose a group of a local mall’s customers has been monitored by a research team for several years. The shopping habits and preferences of each person in the group were examined, particularly through surveys. If their responses did not change significantly over those years, the current research approach can be considered reliable from the test-retest perspective. Otherwise, some of the methods used to collect this data need to be reviewed and updated to avoid introducing errors into the research.
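
In practice, test-retest reliability is usually quantified as the correlation between the two administrations. Here is a minimal Python sketch with hypothetical survey scores (a real study would use the actual responses):

```python
import numpy as np

# Hypothetical satisfaction scores from the same eight shoppers, a year apart.
scores_t1 = np.array([7, 5, 8, 6, 9, 4, 7, 6])
scores_t2 = np.array([7, 6, 8, 5, 9, 4, 6, 6])

# A correlation close to 1 indicates that responses stayed stable over time.
r = np.corrcoef(scores_t1, scores_t2)[0, 1]
print(f"test-retest correlation r = {r:.2f}")
```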

Parallel Forms Reliability: Definition

Another type is parallel forms reliability. It applies when different versions of an assessment tool are used to examine the same group of respondents. If the results obtained with all these versions correlate with each other, the approach can be considered reliable. However, an analyst needs to ensure that all the versions contain the same elements before assessing their consistency. For example, if two versions examine different qualities of the target group, it wouldn’t make much sense to compare one version with the other.

Parallel Forms Reliability: Examples

A real-life parallel forms reliability example will help illustrate the definition provided above. Take the previous example, where a focus group of consumers is examined to analyze trends in a local mall’s goods consumption. Suppose the data about their shopping preferences is obtained by surveying them, once or several times. At the next stage, the same data is collected by analyzing the mall’s sales records. In both cases the assessment tool refers to the same characteristics (e.g., preferred shopping hours). If the results from the two sources correlate, the approach is consistent.
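
The same correlation logic applies here, except the paired values come from two different instruments. A sketch with hypothetical data:

```python
import numpy as np

# Hypothetical preferred shopping hours for the same eight shoppers, measured
# by two "parallel forms": a survey and the mall's sales records.
survey_hours = np.array([18, 12, 19, 17, 11, 20, 18, 13])
sales_hours = np.array([18, 13, 19, 16, 11, 20, 17, 13])

# A strong correlation suggests the two versions measure the same thing.
r = np.corrcoef(survey_hours, sales_hours)[0, 1]
print(f"parallel-forms correlation r = {r:.2f}")
```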

Inter Rater Reliability: Definition

The next type is called inter-rater reliability. This measure does not involve different tools; instead, it requires the collective effort of several researchers, or raters, who examine the target population independently of each other. Once they are done, their assessment results are compared with each other. A strong correlation between all these results would mean that the methods used are consistent. If some of the observers disagree with the others, the assessment approach needs to be reviewed and most probably corrected.

Inter Rater Reliability: Examples

Let’s review an inter-rater reliability example: another case to help you visualize this parameter and the ways to use it in your own research. Suppose the consumer focus group from the previous examples is independently tested by three researchers who use the same set of methods:

  • conducting surveys.
  • interviewing respondents about their preferred items (e.g. bakery or housing supplies) or preferred shopping hours.
  • analyzing sales statistics collected by the mall.

If each of these researchers obtains the same or very similar results, leading to similar conclusions, we can assume that the research approach used in this project is consistent.
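
One simple way to quantify this is to correlate every pair of raters; the ratings below are hypothetical:

```python
import numpy as np

# Hypothetical ratings of the same six shoppers by three independent raters.
ratings = np.array([
    [4, 2, 5, 3, 4, 1],  # rater A
    [4, 3, 5, 3, 4, 2],  # rater B
    [5, 2, 5, 3, 3, 1],  # rater C
])

# np.corrcoef treats each row as one rater; consistently strong pairwise
# correlations suggest the assessment method is applied uniformly.
corr = np.corrcoef(ratings)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"raters {'ABC'[i]} vs {'ABC'[j]}: r = {corr[i, j]:.2f}")
```

(More formal statistics, such as Cohen’s kappa or the intraclass correlation coefficient, exist for this purpose; the pairwise correlation above is just the simplest sketch.)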

What Is Internal Consistency Reliability: Definition

The final type is called internal consistency reliability. This measure evaluates the degree to which different tools, or different parts of a test, produce similar results when probing the same area or object. The idea is to calculate or analyze the same value in several different ways. If the same results are obtained each time, we can assume that the measurement method itself is consistent. Depending on how precise the calculations need to be, small deviations between these results may or may not be acceptable.

Internal Consistency Reliability: Examples

To conclude this review of reliability types, let’s check out an internal consistency reliability example. Take the same situation as described in the previous examples: a focus consumer group whose shopping preferences are analyzed with the help of several different methods. To test the consistency of these methods, a researcher can randomly split the focus group in half and analyze each half independently. If done properly, random splitting should produce two subgroups with nearly identical qualities, so they can be viewed as the same construct. If the analytic measures produce strongly correlated results for both groups, the research approach is consistent.
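
Note that the textbook split-half calculation splits the items of a test (rather than the participants) and correlates the two half-scores across respondents, often applying the Spearman-Brown correction. A minimal sketch on randomly generated stand-in data:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical answers of 12 respondents to a 10-item questionnaire (1-5 scale).
answers = rng.integers(low=1, high=6, size=(12, 10))

# Score each respondent on the odd-numbered and even-numbered items.
half_a = answers[:, ::2].sum(axis=1)
half_b = answers[:, 1::2].sum(axis=1)

r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction estimates full-test reliability from the half-test r.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```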

Reliability Coefficient: What Is It

To evaluate how well a test measures a selected object, a special parameter called the reliability coefficient has been introduced. Its definition is fully explained by its name: it shows whether a test is repeatable, or reliable. The coefficient is a number between 0 and 1.00, where 0 indicates no reliability and 1.00 indicates perfect reliability. The following formula is used to calculate this coefficient, R:

R = (N / (N − 1)) × ((total variance − sum of individual variances) / total variance),

where N is the number of times the test has been run (each run being treated as a separate item). A real test can hardly have perfect reliability. Typically, a coefficient of 0.8 or higher means the test can be considered reliable enough.
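
The formula above is the Cronbach’s-alpha form of the coefficient, and it is straightforward to implement. In this sketch, rows are respondents and columns are the repeated runs (or items) of the test; the scores are hypothetical:

```python
import numpy as np

def reliability_coefficient(scores: np.ndarray) -> float:
    """R = (N/(N-1)) * (total variance - sum of run variances) / total variance."""
    n = scores.shape[1]                              # N: number of runs/items
    run_variances = scores.var(axis=0, ddof=1)       # variance of each run
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (n / (n - 1)) * (total_variance - run_variances.sum()) / total_variance

# Hypothetical scores: 5 people, each measured on 4 runs of the same test.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
])
print(f"R = {reliability_coefficient(scores):.2f}")  # 0.8 or higher is usually enough
```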

Reliability: The Same as Quality?

It is important to understand the difference between quality and reliability. These concepts are related, but they have different practical meanings. We use quality to indicate that an object or solution performs its functions well and allows its users to achieve the intended purpose. Reliability indicates how well that object or solution maintains its quality level as time passes or conditions change. Reliability can thus be viewed as a subset of quality, used to evaluate the consistency of an object or solution in a dynamic environment. Because of its nature, reliability is a probabilistic value. We also have a reliability vs validity blog; understanding the difference between the two is crucial for your research.

Reliability: Key Takeaways

In this article we have reviewed the concept of reliability in research. Its main types and their use in real-life research cases have been examined, and ways of measuring this value, particularly the reliability coefficient, have been explained.

If you are having trouble using this concept in your own work, or just need help writing a high-quality paper and earning a high score, feel free to check out our writing services! A team of skilled writers with rich experience in various academic areas is ready to help you upon a ‘ write a paper for me ’ request.

Reliability: Frequently Asked Questions

1. How do you determine reliability in research?

You can determine reliability in research using a simple correlation between two scores from the same person. It is quite easy to make a rough estimate of the reliability coefficient for these two items using the formula provided above. To make a more precise estimate, you’ll need to obtain more scores and use them in the calculation. The more test runs you make, the more precise your coefficient becomes.

2. Why is reliability important in research?

Reliability refers to the consistency of results in research. This makes reliability important for nearly any kind of research: psychological, economic, industrial, social, etc. A project that may affect the lives of many people needs to be conducted carefully, and its results need to be double-checked. If the methods used are unreliable, the results may contain errors and cause negative effects.

3. What is the reliability of a test?

The reliability of a test refers to the extent to which the test can be run without errors. The higher the reliability, the more usable your tests are and the lower the probability of errors in your research. Tests might be constructed incorrectly because of wrong assumptions or incorrect information received from a source. Measuring reliability helps to counter this and to find ways to improve the quality of tests.

4. How does reliability affect research?

Reliability affects any project that uses complex analysis methods. It is important to know the degree to which your research method produces stable and consistent results. If the consistency is low, your work might be useless because of incorrect assumptions. If you don’t want your project to fail, you have to assess the consistency of your methods.

Joe Eckel is an expert on dissertation writing. He makes sure that each student gets valuable insights on composing A-grade academic writing.

ORIGINAL RESEARCH article

Reliability and Validity of a Novel Attention Assessment Scale (Broken Ring enVision Search Test) in the Chinese Population

Yue Shi

  • Department of Rehabilitation Medicine, Third Affiliated Hospital of Soochow University, Changzhou, China

Background: The correct assessment of attentional function is the key to cognitive research. A new attention assessment scale, the Broken Ring enVision Search Test (BReViS), has not been validated in China. The purpose of this study was to assess the reliability and validity of the BReViS in the Chinese population.

Methods: From July to October 2023, 100 healthy residents of Changzhou were selected and subjected to the BReViS, the Digit Cancellation Test (D-CAT), the Symbol Digit Modalities Test (SDMT), and the Digit Span Test (DST). Thirty individuals were randomly chosen to undergo the BReViS twice for test–retest reliability assessment. Correlation analysis was conducted between age, education level, gender, and the BReViS sub-tests: Selective Attention (SA), Orientation of Attention (OA), Focal Attention (FA), and Total Errors (Err). Intergroup comparisons and multiple linear regression analyses were performed. Additionally, correlations among the BReViS sub-tests and with other attention tests were analyzed.

Results: The correlation coefficients of the BReViS sub-tests (except for FA) between the two tests were greater than 0.600 ( p  < 0.001), indicating good test–retest reliability. The Cronbach’s alpha coefficient was 0.874, suggesting high internal consistency reliability. SA showed a significant negative correlation with the net score of D-CAT ( r  = −0.405, p  < 0.001), and a significant positive correlation with the error rate of D-CAT ( r  = 0.401, p  < 0.001), demonstrating good criterion-related validity. The correlation analysis among the results of each sub-test showed that the correlation coefficient between SA and Err was 0.532 ( p  < 0.001), and between OA and Err was −0.229 ( p  < 0.05), whereas there was no significant correlation between SA, OA, and FA, which indicated that the scale had good information content validity and structural validity. Both SA and Err were significantly correlated with age and years of education, while gender was significantly correlated with OA and Err. Multiple linear regression suggested that Err was mainly affected by age and gender. There were significant differences in the above indices among different age, education level, and gender groups. Correlation analysis with other attention tests revealed that SA negatively correlated with DST forward and backward scores and SDMT scores. Err positively correlated with D-CAT net scores and negatively with D-CAT error rate, DST forward and backward scores, and SDMT scores. OA and FA showed no significant correlation with other attention tests.

Conclusion: The BReViS test demonstrates good reliability and validity, assessing not only selective attention but also gauging capacities in immediate memory, information processing speed, visual scanning, and hand-eye coordination. The results are susceptible to demographic variables such as age, gender, and education level.

1 Introduction

Attention is the foundation of all cognitive functions, the prerequisite for continuous information processing, and a gateway for the flow of information to enter the brain and undergo selection ( Petersen and Posner, 2012 ). Precise and accurate assessment of attentional functions is key in cognitive research and a precondition for the rehabilitation of cognitive disorders. In clinical neuropsychology, visual search tasks (VSTs) are frequently used to evaluate selective visual attention deficits in patients with neurological conditions ( Eglin et al., 1989 ; Luck et al., 1989 ; Utz et al., 2013 ). These typically include paper-and-pencil target cancellation tasks such as the Attention Matrix ( Della Sala et al., 1992 ), the Ruff 2&7 Selective Attention Test ( Marioni et al., 2012 ), the Letter Cancellation Test ( Uttl and Pilkenton-Taylor, 2001 ), and the Visual Spatial Attention subtest in the Oxford Cognitive Screen ( Demeyere et al., 2015 ), which are effective tools for detecting attention deficits post-stroke. However, existing VSTs do not take into account the potential impact of stimulus layout and crowding on participants’ test results. Facchin et al. developed a novel attention assessment scale, the Broken Ring enVision Search Test (BReViS), to evaluate attentional functions ( Facchin et al., 2023 ). It assesses different components of attention, including selective attention, the visual–spatial orientation of attention, and focal attention involving crowding phenomena, and is a novel open-ended paper-and-pencil assessment tool.

While studies have shown the effectiveness and applicability of the BReViS test in the Italian population and provided specific Italian normative data, its suitability for the Mainland Chinese population is yet to be concluded. Therefore, this study aims to examine the reliability and validity of the BReViS test in the healthy Chinese population and to analyze the characteristics of its preliminary application, in the hope of finding a simple and feasible tool for the clinical environment to assess neuropsychological patients’ attention deficits and provide a basis for the assessment and rehabilitation treatment of attentional disorders.

2 Sample and methods

2.1 Study procedure

General information: From July to October 2023, a total of 100 healthy residents, including staff and accompanying personnel from the First People’s Hospital of Changzhou and residents of the Tianning and Xinbei districts of Changzhou, were selected. The cohort comprised 47 males and 53 females; ages ranged from 19 to 84 years, with an average of 52.35 ± 22.01 years; years of education ranged from 2 to 20 years, with an average of 12.39 ± 3.86 years. Only one participant had as few as 2 years of education.

Inclusion criteria: Age 19–84 years; Right-handed; Normal or corrected-to-normal vision.

Exclusion criteria: Auditory, visual, or speech impairments; Past history of neurological or psychiatric diseases (including brain injury, stroke, clinically diagnosed dementia, depression, etc.); History of addiction to tobacco, alcohol, or addictive drugs.

Grouping method: To enable between-group comparisons across ages, education levels, and genders, the subjects were divided into four age groups for the statistical analyses: those aged 18–34 years formed the youth group, those aged 35–49 years the young-adult group, those aged 50–65 years the middle-aged group, and those older than 65 years the senior group. Similarly, they were divided into four groups by education level: those with 1–6 years of education formed the elementary group, those with 7–9 years the middle school group, those with 10–12 years the high school/vocational group, and those with more than 12 years the college/university and above group. Subjects were also divided into male and female groups by gender. Demographic characteristics of the groups are reported in Table 1 . Thirty subjects were randomly selected as the retesting group and took the BReViS test again after 2 weeks; of these, 14 were male and 16 female, their ages ranged from 19 to 72 years (mean 44.07 ± 15.67), and their years of education ranged from 6 to 19 (mean 13.86 ± 2.81).

Table 1 . Demographic characteristics of the patients’ sample.

2.2 Measurements and applied questionnaires

2.2.1 The BReViS test

It was developed by Facchin et al. (2023) , and we have obtained authorization from the original authors to use it. The test consists of four cancellation cards, each comprising five rows of circles with notches in different orientations, arranged in different layouts and degrees of crowding, with 25 targets per card at randomly defined locations. Subjects were asked to identify and cross out all the targets on each card that had the same notch orientation as the circle shown at the top of the card; the execution time, number of omissions, self-corrections, and erroneous crossings were recorded for each of the four test cards. The performance time for each card was calculated from its execution time and omissions. The calculation formula is as follows:

By combining the execution times of the four test cards, the following four indices are calculated: Selective Attention (SA), Orientation of Attention (OA), Focal Attention (FA), and Total Errors (Err).

SA represents the capacity to suppress irrelevant stimuli (distractors) and solely select relevant stimuli (targets) under the simplest conditions. It directly corresponds to the performance time of the first card (linear layout, low crowding), which is less affected by random arrays and crowded displays. SA = Performance time for the first card. Higher SA index values suggest lower efficiency of selective attention.

OA refers to the strategic direction of visual attention, which is the capacity to guide selective visual attention with effective endogenous strategies throughout the visual scene ( Connor et al., 2004 ), one of the two components of visual–spatial attention measured by BReViS. High OA index values indicate an inability to follow effective endogenous strategies during the visual search process, necessitating exogenous cues to perform the task correctly. It is calculated with the following formula using the performance time of each card:

FA can be interpreted as the ability to adjust the focus of attention based on the position of stimuli within the array, another component of visual–spatial attention ( Castiello and Umilta, 1990 ). It corresponds to the comparison between two levels of crowding: high and low. High FA index values suggest a higher sensitivity to crowding. It is calculated with the following formula using the performance time of each card:

The Err index represents the overall errors made across all sub-tests. Err = Total number of errors across all four test cards.

2.2.2 Other attention tests

The Digit Cancellation Test (D-CAT) is used to measure selective attention ( Hatta et al., 2004 ). Participants were required to locate and strike through the number preceding the number 3 from a random sequence of numbers 1–9, with the time taken to complete the test recorded. Net scores and error rates are calculated based on the number of correct cancelations, omissions, and mistakes. Higher net scores and lower error rates indicate better selective attention.

The Symbol Digit Modalities Test (SDMT) was published by Aaron Smith in 1973 and revised in 1982 to assess speed of information processing, visual scanning ability, and hand-eye coordination ( Strober et al., 2019 ). This test involves an encoding key of 9 different abstract symbols, each associated with a number. Participants must write the number corresponding to each symbol as quickly as possible within 90 s. Scoring is based on the number of correct symbols and reversed symbols. Higher scores indicate better speed of information processing, visual scanning ability, and hand-eye coordination.

The Digit Span Test (DST) is a commonly used psychological assessment tool that measures short-term memory and attention span ( Park and Lee, 2019 ). In its traditional form, the Digit Span Test consists of two parts: forward digit span and backward digit span. This test evaluates the participant’s ability to recall a sequence of numbers in the correct order both forwards and backwards after the tester reads them out. Participants repeat a series of random numbers at a rate of one number per second, starting with a sequence of 3 numbers and increasing in length up to 12 numbers or until two consecutive errors are made. One point is scored for each correctly recalled sequence. The higher the scores on forward and backward digit span, the greater the capacity of immediate memory.

2.2.3 Sample size calculation

This study mainly used correlation analysis and multiple linear regression analysis, so the sample size was calculated using G*Power 3.1 ( Faul et al., 2009 ). For correlation analysis, with a target effect size of 0.3, a type I error of 5% (α = 0.05), and a power of 80% (β = 0.20), the required sample size was 82 participants. For multiple linear regression with three predictors (U = 3), an effect size of 0.15 (f² = 0.15), a type I error of 5% (α = 0.05), and a power of 80% (β = 0.20), the required sample size was 77 participants. The final sample size was 100 participants, allowing for a 20% dropout rate.

2.2.4 Experimental procedure

Participants filled out informed consent forms and then completed the BReViS test and the other attention tests. Among them, 30 were randomly selected to retake the BReViS test after two weeks. All tests were administered by the same physician.

2.3 Statistical analysis

SPSS 17.0 software was used for statistical analysis. Spearman’s correlation analysis was employed to assess the correlation between the BReVis test and other attention tests, as well as the correlation between each sub-test of the BReViS and age, educational level, and gender. Kruskal-Wallis test was used to compare the differences in the BReViS sub-test scores among different age and educational level groups, while Mann–Whitney U test was utilized to compare the differences between gender groups. Multiple linear regression analysis was conducted to investigate the influence of demographic characteristics on scale evaluation results, with statistical significance set at p  < 0.05. Pearson correlation coefficient was employed to analyze the test–retest reliability of the BReViS; Cronbach’s α coefficient was used to indicate internal consistency, with a coefficient above 0.80 considered excellent, between 0.70 and 0.80 acceptable, and below 0.7 indicating poor reliability. The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett’s test of sphericity were employed to analyze the appropriateness of factor analysis, to validate the structural validity of the BReViS. Finally, correlation analyses between the results of the BReViS subtests were conducted using Spearman’s correlation analysis to test the content and structural validity of the scale.
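
The authors ran these analyses in SPSS 17.0. For readers who want to reproduce similar steps with open-source tools, the sketch below shows two of them, Spearman’s correlation and Cronbach’s alpha, in Python on simulated stand-in data (not the study’s data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Simulated stand-ins for two paired measures (e.g., BReViS SA vs D-CAT net score).
sa = rng.normal(60, 10, size=100)
dcat_net = -0.5 * sa + rng.normal(0, 8, size=100)

rho, p = stats.spearmanr(sa, dcat_net)  # Spearman's correlation, as in the paper
print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")

# Cronbach's alpha across four sub-test scores (rows: subjects, cols: sub-tests).
subtests = rng.normal(0, 1, size=(100, 4)) + rng.normal(0, 1, size=(100, 1))
k = subtests.shape[1]
alpha = (k / (k - 1)) * (1 - subtests.var(axis=0, ddof=1).sum()
                         / subtests.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.3f}")
```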

3.1 Descriptive results

The descriptive mean results for the four BReViS sub-test scores are reported in Tables 2 – 4 .

Table 2 . Mean performance time (and SD) for each sub-test, divided by age group.

Table 3 . Mean performance time (and SD) for each sub-test, divided by education level group.

Table 4 . Mean performance time (and SD) for each sub-test, divided by gender group.

3.2 Correlation analysis of age with the BReViS sub-tests

Age showed a positive correlation with both SA ( r  = 0.776, p  < 0.001) and Err ( r  = 0.607, p  < 0.001), with no significant correlation with the other sub-tests.

3.3 Comparison of different age groups

As shown in Table 5 , analyses of multiple between-group comparisons across age groups showed significant differences in sub-test scores for SA and Err ( p  < 0.001). Detailed two-by-two intergroup comparisons highlighted significant differences in SA scores between the youth and middle-aged groups (adjusted p  = 0.006), as well as between the youth and senior groups (adjusted p  = 0.000). Similarly, Err scores differed significantly between the youth and middle-aged groups (adjusted p  = 0.005) and between the youth and senior groups (adjusted p  = 0.000). Additionally, a distinct difference was observed in SA scores between the young-adult and senior groups (adjusted p  = 0.017), as shown in Table 6 .

Table 5 . Analysis of variance between different age groups (Mean Rank).

Table 6 . Two-by-two comparison of SA and Err between different age groups.

3.4 Correlation analysis of education level with the BReViS sub-tests

Years of education were negatively correlated with both SA ( r  = −0.715, p  < 0.001) and Err ( r  = −0.502, p  < 0.001), with no significant correlation with the remaining sub-tests.

3.5 Comparison of different education level groups

As shown in Table 7 , analyses of multiple between-group comparisons across education level groups unveiled significant disparities ( p  < 0.001) in the scores for sub-tests SA and Err, while OA and FA did not exhibit such differences. Detailed two-by-two intergroup comparisons highlighted significant differences in SA scores: the college/university and above group demonstrated significant disparities when compared with the elementary, middle school, and high school/vocational groups (adjusted p  = 0.000 for all comparisons). Similarly, Err scores differed significantly between the college/university and above group and the elementary group (adjusted p  = 0.000), as well as between the college/university and above group and both the middle school (adjusted p  = 0.027) and high school/vocational groups (adjusted p  = 0.006), as detailed in Table 8 .

Table 7 . Analysis of variance between different education level groups (mean rank).

Table 8 . Two-by-two comparison of SA and Err between different education level groups.

3.6 Correlation analysis of gender with the BReViS sub-tests

Gender showed a negative correlation with OA ( r  = −0.251, p  = 0.012) and a positive correlation with Err ( r  = 0.215, p  = 0.032), with no significant correlation with SA and FA.

3.7 Comparison of the two gender groups

The comparison results between the two gender groups showed a significant difference in OA and Err ( p  < 0.05), while no significant difference was observed in SA and FA, as detailed in Table 9 . Combined with the results from Table 4 , it is evident that males scored higher on the OA test and lower on the Err test compared to females.

Table 9 . Comparison of the two gender groups (Mean Rank).

3.8 Impact of demographic variables

Multiple linear regression analysis suggested that when the demographic variables age, education level, and gender were introduced into the linear regression models of SA and Err, SA was affected by education level and age, while Err was influenced by age and gender ( Table 10 ).

Table 10 . Impact of demographic variables.

3.9 Relevance to other attention tests

SA was negatively correlated with the net score of D-CAT and positively correlated with the error rate of D-CAT. It was also negatively correlated with DST forward and backward scores and SDMT scores. Err showed a positive correlation with the net score of D-CAT and a negative correlation with the error rate of D-CAT, DST forward and backward scores, and SDMT scores. OA and FA did not show significant correlation with other attention tests ( Table 11 ).

Table 11 . Relevance to other attention tests.

3.10 Reliability testing

3.10.1 Re-testability of the BReViS test: Results showed that the correlation coefficients for SA, OA, and Err were all greater than 0.600, p  < 0.001. Only the correlation coefficient for FA was below 0.6, p  > 0.05, which was not statistically significant ( Table 12 ).

Table 12 . Re-testability of the BReViS test.

3.10.2 Internal Consistency Reliability: Cronbach’s alpha coefficient was 0.874, indicating high internal consistency reliability for the BReViS test.

3.11 Validity testing

3.11.1 Construct Validity: The Kaiser-Meyer-Olkin (KMO) measure and Bartlett’s test of sphericity results were 0.763 and 252.601 ( p  < 0.001), respectively, indicating the scale was not very suitable for factor analysis.

3.11.2 Criterion Validity: In this study, the D-CAT was used as a criterion, and Spearman’s correlation analysis was used to calculate the correlation between BReViS’s SA and the net scores and error rates of D-CAT to evaluate the degree of criterion-related validity. The results showed that SA was significantly negatively correlated with the net score of D-CAT ( r  = −0.405, p < 0.001) and significantly positively correlated with the error rate of D-CAT ( r  = 0.401, p < 0.001), indicating the questionnaire has good criterion-related validity, as seen in Table 11 .

3.12 Correlation between sub-tests

The correlation analysis of the results among the various sub-tests of the BReViS test indicated that the correlation coefficient between SA and Err was 0.532, and between OA and Err was −0.229, with p  < 0.05, suggesting a certain degree of consistency between them, which contributes to ensuring the reliability of the scale. Meanwhile, the correlation between SA, OA, and FA was not high, indicating that the scale has excellent information content and structural validity, as seen in Table 13 .

Table 13 . Correlation between sub-tests.

4 Discussion

Attention is a fundamental psychological concept, deeply embedded in cognitive processing, defined by the deliberate focusing on particular stimuli ( van Es et al., 2018 ). This focusing elevates the level of awareness about these stimuli, epitomizing attention’s selective nature. Solso, MacLin M.K., and MacLin O.H. (2005) highlight that “the essence of attention lies in the concentration and focus of consciousness,” underlining attention’s critical role in selecting an item from an array of simultaneous stimuli or thought sequences ( Baddeley, 1988 ). Selective attention, therefore, is the capacity to direct an individual’s finite processing resources toward a particular environmental aspect. This complex concept encompasses a range of processes, including spatial attention with its directional and focal elements ( Carrasco, 2011 ). Such capability allows for the filtration of extensive information from the surroundings, facilitating the efficient usage of scarce cognitive resources.

Historically, attention has been a central theme in psychological studies, resulting in a plethora of theoretical frameworks and experimental methodologies. One of the most significant paradigms for investigating selective visual attention’s traits is visual search ( Bacon and Egeth, 1997 ; Verghese, 2001 ; Wolfe, 2003 ). Everyday life is replete with visual search scenarios, whether it’s choosing products on supermarket shelves, animals searching for food amidst leaves, locating a friend in a large gathering, or playing visual search games ( Wolfe, 2020 ). Clinical neuropsychology frequently employs visual search tasks (VST) to evaluate selective visual attention deficits in patients with neurological conditions ( Senger et al., 2017 ). Standard VST protocols involve participants identifying a target among numerous stimuli, like figures or letters, assessing performance based on response accuracy and time ( Wolfe et al., 2002 ).

Studies suggest that visual task outcomes are influenced not just by attention toward the target’s location (the spatial component) but also by adjusting the attention window according to the task requirements (the focal component) ( Albonico et al., 2016 ), with each component operating independently ( Castiello and Umilta, 1990 ; Carrasco and Yeshurun, 2009 ). Traditional VSTs, however, tend to neglect the influence of distractor arrangement and density on performance, thus failing to adequately capture the nuances of spatial attention ( Weintraub and Mesulam, 1988 ; Mesulam, 2000 ). The BReViS assessment offers a refreshing alternative to conventional paper-and-pencil visual search tests by modifying the stimulus arrangement within the visual field, allowing for a comprehensive evaluation of selective visual attention and its distinct facets. Although the test has previously been utilized within the Italian demographic, it has not undergone thorough reliability and validity verification elsewhere; this study introduces the BReViS test to the Mainland Chinese audience, undertaking a comprehensive examination of its reliability and validity among individuals aged 19 to 84.

4.1 Reliability testing

When a test has good reliability, it will yield almost the same scores for the same group of people at different times. The quality of reliability is also a prerequisite for validity testing. In this study, the test–retest reliability of the BReViS showed high correlation coefficients for three of the four sub-tests—SA, OA, and Err—on reassessment after two weeks. The test–retest results indicate that the BReViS test has good retest reliability, suggesting good temporal stability. The lack of statistical significance for FA in the correlation analysis may be due to the longer duration of this test, which may lead to fatigue in older participants resulting in unstable scores. Additionally, a higher Cronbach’s alpha coefficient indicates stronger internal consistency of the scale. It is generally considered that a Cronbach’s alpha coefficient greater than 0.7 indicates good consistency among items ( Tavakol and Dennick, 2011 ). The results of this study show a total Cronbach’s alpha coefficient of 0.874 for the BReViS test, indicating high internal consistency reliability. It’s interesting to note that the average score for FA increased from −1.57 in the first test to 0.67 in the second, indicating a higher sensitivity to crowding in the latter. Research has shown that sensitivity to visual crowding is influenced by various factors that can affect an individual’s ability to distinguish objects in cluttered environments. These factors include contrast, eccentricity, visual acuity and age, spatial frequency, attention and perceptual learning, as well as stimulus similarity ( Coates et al., 2013 ; Veríssimo et al., 2022 ). Therefore, factors such as the brightness of the room, the depth of color of the test figures, the position of the test paper in the field of vision, whether the participant is focused, has undergone perceptual learning, and the objects surrounding the test paper can all affect sensitivity to crowding. The variability in the results of the two tests in this study reminds us that these influences need to be more tightly controlled in future studies.

4.2 Validity testing

The Kaiser-Meyer-Olkin (KMO) measure and Bartlett’s test suggested that the structure of the BReViS test might not be well suited for factor analysis, but that there was some correlation between the BReViS measures. The correlation analysis among the results of each sub-test of the BReViS showed a correlation coefficient of 0.532 between SA and Err, and −0.229 between OA and Err, with p  < 0.05, indicating a certain level of consistency between them, which contributes to ensuring the reliability of the scale. However, the correlations among SA, OA, and FA were not high, suggesting that the scale has excellent information content and structural validity. Given that BReViS was developed to assess SA, this study employed the D-CAT as a criterion measure and found a significant correlation between SA and the D-CAT results, indicating good criterion-related validity.

4.3 The influence of age on BReViS

This study showed that age was significantly positively correlated with the sub-tests SA and Err. Multiple linear regression analysis suggested that SA is greatly influenced by age and education level, while Err is more influenced by age and gender. Therefore, age is a major factor influencing BReViS test results, which is consistent with the findings of the scale developers in the Italian population and with previous research. The rank-sum test analysis across different age groups reveals that young adults significantly outperform both the middle-aged and senior groups in selective attention tasks, making fewer errors. Additionally, the young-adult group demonstrates superior selective attention capabilities compared to the senior group. This pattern supports the notion that selective attention abilities undergo pronounced growth during adolescence, followed by a discernible decline as individuals age ( Moore and Zirnsak, 2017 ). Neurophysiological alterations, observable through changes in the amplitude and latency of event-related potential (ERP) components, accompany this evolution in attention processing ( Madden et al., 2007 ). Complementing these findings, functional MRI studies have identified diminished activation in critical regions associated with visual attention control, namely the bilateral fusiform gyrus, the right lingual gyrus, and the right precuneus, in elderly individuals when compared to their younger counterparts ( Lyketsos et al., 1999 ; Lee et al., 2003 ).

4.4 The influence of education level on BReViS

This study found that years of education were negatively correlated with both SA and Err, and significant differences in SA and Err scores were also observed across different education level groups. Analysis using rank-sum tests across different educational attainment groups indicates that individuals with tertiary education (the college/university and above group) perform significantly better in selective attention tasks than those from the elementary ( Mueller et al., 2008 ; Yehezkel et al., 2015 ), middle school, and high school/vocational groups. They made fewer errors, suggesting a correlation between higher education levels and improved selective attention abilities. Studies have shown that individuals with higher levels of education often perform better on various cognitive tests ( Lindenberger and Baltes, 1997 ; Hultsch et al., 1999 ), likely due to the enhanced cognitive strategies, problem-solving skills, and knowledge base provided by formal education. Additionally, higher education may mitigate the impact of aging on cognitive performance ( Lee et al., 2003 ; Jones et al., 2006 ; Tun and Lachman, 2008 ; Marioni et al., 2012 ). Research by Stern et al. (2005) and others indicates that higher educational attainment can moderate the decline in reaction and attention abilities due to aging and lower the risk of dementia ( Bell et al., 2006 ), partly because the accumulation of cognitive reserve improves brain network efficiency ( Rubia et al., 2010 ). These findings highlight the importance of considering educational background when interpreting cognitive assessment results.

4.5 The influence of gender on BReViS

In this study, the SA index was influenced by age and educational level, but no significant gender differences were observed. Gender was positively correlated with the Err index and negatively correlated with the OA index, with significant differences between genders, indicating that females committed more total errors than males. Males had higher OA scores than females, suggesting that males in the visual search process rely on exogenous cues to perform tasks correctly and are less likely to follow effective endogenous strategies. This is consistent with the observations made by the authors in a normal Italian population. The differences in OA scores between males and females may be related to the activation of different brain regions during the execution of spatial selective attention tasks. Males show increased activation in the left hemisphere’s inferior parietal lobule, while females show significant activation in the right hemisphere’s inferior frontal gyrus, insula, caudate, and temporal areas ( de Fockert et al., 2001 ; Boi et al., 2011 ), which may be related to the modulation by estrogen and testosterone ( Oberauer, 2019 ). Additionally, FA was not observed to be affected by gender, age and years of education in this study, which is in line with the results of the most recent application of the scale, i.e., crowding did not worsen with age ( Pegoraro et al., 2024 ), and these findings are consistent with previous studies ( Malavita et al., 2017 ; Shamsi et al., 2022 ).

4.6 The correlation between BReViS and other attention scales

SA was significantly positively correlated with the cancellation time and error rate in the D-CAT and significantly negatively correlated with the net score of cancellation. Err was negatively correlated with the net score of cancellation and positively correlated with the cancellation error rate. These results indicate that BReViS’s SA and Err have good consistency with the D-CAT in assessing selective attention in the normal population.

Research demonstrates that enhancing selective attention significantly improves test outcomes in immediate memory capabilities ( Plebanek and Sloutsky, 2019 ). For instance, within the context of the DST, superior selective attention enables individuals to recall and reproduce digit sequences with greater accuracy, thus exhibiting an increased memory capacity. This study reveals a negative correlation between SA and Err with the scores of forward and backward span in the DST, offering a crucial insight: higher scores of SA and Err indicate weaker selective attention, an increased error rate, and a noticeable decline in the subjects’ immediate memory capacity. This finding highlights the close interrelation among immediate memory, selective attention, and cognitive efficiency, suggesting that individuals with a larger immediate memory capacity can more effectively resist distractions, thereby reducing error rates ( Posner and Petersen, 1990 ; Rayner, 1998 ; Ku, 2018 ). In clinical practice, this correlation is important to identify and assess deficits in attention, working memory, or other cognitive functions.

The negative correlation between SA and Err with scores on the SDMT unveils a significant cognitive phenomenon: there is a direct correlation between elevated selective attention and increased efficiency of visual scanning, speed of information processing, and hand-eye coordination. Selective attention, a critical dimension of attention management, involves filtering task-relevant information from the environment while disregarding irrelevant distractions ( De la Torre et al., 2015 ). The efficacy of selective attention depends to a large extent on the efficiency of visual scanning, a crucial aspect because it requires the individual to quickly localize and identify key targets among numerous visual stimuli ( Reigal et al., 2019 ). Furthermore, the acceleration of information processing speed is a key factor in enhancing the efficiency of selective attention, allowing individuals to recognize important information within shorter durations and respond accordingly ( Posner, 1980 ). In tasks requiring rapid identification of visual information followed by corresponding physical actions, exceptional hand-eye coordination markedly improves the precision and efficiency of task execution ( Castiello and Umilta, 1990 ). Thus, the effective concentration of selective attention on specific stimuli or tasks is supported by an individual’s performance in terms of a combination of speed of information processing, visual scanning ability, and hand-eye coordination. The improvement of these cognitive abilities not only further enhances the performance of selective attention but also, reciprocally, enhances the operational efficacy of these cognitive functions, thereby creating a positive feedback loop. This phenomenon offers profound insights into how individuals process information efficiently in complex environments within the domain of cognitive science.

The allocation of attentional resources in space involves two distinct processes. The orienting process selectively concentrates on specific aspects of the environment while ignoring others; the OA index reflects this orienting ability, which is influenced by factors like stimulus salience, personal interests or goals, and the presence of attention-directing cues ( Chun et al., 2011 ). The focusing process narrows attention to a specific area or object, acting like a magnifying glass and allowing selective concentration on a limited spatial area ( Turatto et al., 2000 ; Chun et al., 2011 ); the FA index reflects this focusing ability. Some studies suggest that focusing and orienting may vary based on visual conditions ( Turatto et al., 2000 ). This research found no significant correlation between OA and FA and the DST and SDMT, suggesting that orienting and focusing abilities might not be affected by immediate memory capacity, information processing speed, visual scanning ability, or hand-eye coordination skills.

5 Conclusion

The BReViS test, demonstrating good reliability and validity, is well suited for application across a broad age range (19 to 84 years) within the general population, assessing not only selective attention but also gauging capacities in immediate memory, information processing speed, visual scanning, and hand-eye coordination. The influence of demographic variables such as age, gender, and education level on test outcomes underscores the necessity for nuanced interpretation of results in research and clinical settings.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Ethics statement

The studies involving humans were approved by the Ethics Committee of the Third Affiliated Hospital of Soochow University. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

YS: Writing – original draft. YZ: Writing – review & editing.

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Albonico, A., Malaspina, M., Bricolo, E., Martelli, M., and Daini, R. (2016). Temporal dissociation between the focal and orientation components of spatial attention in central and peripheral vision. Acta Psychologica 171, 85–92. doi: 10.1016/j.actpsy.2016.10.003

Bacon, W. J., and Egeth, H. E. (1997). Goal-directed guidance of attention: evidence from conjunctive visual search. J. Exp. Psychol. Hum. Percept. Perform. 23, 948–961. doi: 10.1037/0096-1523.23.4.948


Baddeley, A. (1988). Cognitive psychology and human memory. Trends Neurosci. 11, 176–181. doi: 10.1016/0166-2236(88)90145-2

Bell, E. C., Willson, M. C., Wilman, A. H., Dave, S., and Silverstone, P. H. (2006). Males and females differ in brain activation during cognitive tasks. NeuroImage 30, 529–538. doi: 10.1016/j.neuroimage.2005.09.049

Boi, M., Vergeer, M., Ogmen, H., and Herzog, M. H. (2011). Nonretinotopic exogenous attention. Curr. Biol. 21, 1732–1737. doi: 10.1016/j.cub.2011.08.059

Carrasco, M. (2011). Visual attention: the past 25 years. Vis. Res. 51, 1484–1525. doi: 10.1016/j.visres.2011.04.012

Carrasco, M., and Yeshurun, Y. (2009). Covert attention effects on spatial resolution. Prog. Brain Res. 176, 65–86. doi: 10.1016/S0079-6123(09)17605-7

Castiello, U., and Umilta, C. (1990). Size of the attentional focus and efficiency of processing. Acta Psychol. 73, 195–209. doi: 10.1016/0001-6918(90)90022-8

Chun, M. M., Golomb, J. D., and Turk-Browne, N. B. (2011). A taxonomy of external and internal attention. Annu. Rev. Psychol. 62, 73–101. doi: 10.1146/annurev.psych.093008.100427

Coates, D. R., Chin, J. M., and Chung, S. T. (2013). Factors affecting crowded acuity: eccentricity and contrast. Optom. Vis. Sci. 90, 628–638. doi: 10.1097/OPX.0b013e31829908a4

Connor, C. E., Egeth, H. E., and Yantis, S. (2004). Visual attention: bottom-up versus top-down. Curr. Biol. 14, R850–R852. doi: 10.1016/j.cub.2004.09.041

de Fockert, J. W., Rees, G., Frith, C. D., and Lavie, N. (2001). The role of working memory in visual selective attention. Science 291, 1803–1806.


De la Torre, G. G., Barroso, J. M., León-Carrión, J., Mestre, J. M., and Bozal, R. G. (2015). Reaction time and attention: toward a new standard in the assessment of ADHD? A pilot study. J. Atten. Disord. 19, 1074–1082. doi: 10.1177/1087054712466440

Della Sala, S., Laiacona, M., Spinnler, H., and Ubezio, C. (1992). A cancellation test: its reliability in assessing attentional deficits in Alzheimer's disease. Psychol. Med. 22, 885–901. doi: 10.1017/S0033291700038460

Demeyere, N., Riddoch, M. J., Slavkova, E. D., Bickerton, W. L., and Humphreys, G. W. (2015). The Oxford cognitive screen (OCS): validation of a stroke-specific short cognitive screening tool. Psychol. Assess. 27, 883–894. doi: 10.1037/pas0000082

Eglin, M., Robertson, L. C., and Knight, R. T. (1989). Visual search performance in the neglect syndrome. J. Cogn. Neurosci. 1, 372–385. doi: 10.1162/jocn.1989.1.4.372

Facchin, A., Simioni, M., Maffioletti, S., and Daini, R. (2023). Broken ring enVision search (BReViS): a new clinical test of attention to assess the effect of layout and crowding on visual search. Brain Sci. 13:494. doi: 10.3390/brainsci13030494

Faul, F., Erdfelder, E., Buchner, A., and Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses. Behav. Res. Methods 41, 1149–1160. doi: 10.3758/BRM.41.4.1149

Hatta, T., Masui, T., Ito, Y., Ito, E., Hasegawa, Y., and Matsuyama, Y. (2004). Relation between the prefrontal cortex and cerebro-cerebellar functions: evidence from the results of stabilometrical indexes. Appl. Neuropsychol. 11, 153–160. doi: 10.1207/s15324826an1103_3

Hultsch, D. F., Hertzog, C., Small, B. J., and Dixon, R. A. (1999). Use it or lose it: engaged lifestyle as a buffer of cognitive decline in aging? Psychol. Aging 14, 245–263. doi: 10.1037/0882-7974.14.2.245

Jones, R. N., Yang, F. M., Zhang, Y., Kiely, D. K., Marcantonio, E. R., and Inouye, S. K. (2006). Does educational attainment contribute to risk for delirium? A potential role for cognitive reserve. J. Gerontol. A Biol. Sci. Med. Sci. 61, 1307–1311. doi: 10.1093/gerona/61.12.1307

Ku, Y. (2018). Selective attention on representations in working memory: cognitive and neural mechanisms. PeerJ 6:e4585. doi: 10.7717/peerj.4585

Lee, S., Kawachi, I., Berkman, L. F., and Grodstein, F. (2003). Education, other socioeconomic indicators, and cognitive function. Am. J. Epidemiol. 157, 712–720. doi: 10.1093/aje/kwg042

Lindenberger, U., and Baltes, P. B. (1997). Intellectual functioning in old and very old age: cross-sectional results from the Berlin aging study. Psychol. Aging 12, 410–432. doi: 10.1037/0882-7974.12.3.410

Luck, S. J., Hillyard, S. A., Mangun, G. R., and Gazzaniga, M. S. (1989). Independent hemispheric attentional systems mediate visual search in split-brain patients. Nature 342, 543–545. doi: 10.1038/342543a0

Lyketsos, C. G., Chen, L., and Anthony, J. C. (1999). Cognitive decline in adulthood: an 11.5-year follow-up of the Baltimore Epidemiologic Catchment Area study. Am. J. Psychiatry 156, 58–65. doi: 10.1176/ajp.156.1.58

Madden, D. J., Spaniol, J., Whiting, W. L., Bucur, B., Provenzale, J. M., Cabeza, R., et al. (2007). Adult age differences in the functional neuroanatomy of visual attention: a combined fMRI and DTI study. Neurobiol. Aging 28, 459–476. doi: 10.1016/j.neurobiolaging.2006.01.005

Malavita, M. S., Vidyasagar, T. R., and McKendrick, A. M. (2017). The effect of aging and attention on visual crowding and surround suppression of perceived contrast threshold. Invest. Ophthalmol. Vis. Sci. 58, 860–867. doi: 10.1167/iovs.16-20632

Marioni, R. E., van den Hout, A., Valenzuela, M. J., Brayne, C., Matthews, F. E., and the MRC Cognitive Function and Ageing Study (2012). Active cognitive lifestyle associates with cognitive recovery and a reduced risk of cognitive decline. J. Alzheimers Dis. 28, 223–230. doi: 10.3233/JAD-2011-110377

Mesulam, M.-M. (2000). Principles of behavioral and cognitive neurology. Oxford, UK: Oxford University Press.

Moore, T., and Zirnsak, M. (2017). Neural mechanisms of selective visual attention. Annu. Rev. Psychol. 68, 47–72. doi: 10.1146/annurev-psych-122414-033400

Mueller, V., Brehmer, Y., von Oertzen, T., Li, S. C., and Lindenberger, U. (2008). Electrophysiological correlates of selective attention: a lifespan comparison. BMC Neurosci. 9:18. doi: 10.1186/1471-2202-9-18

Oberauer, K. (2019). Working memory and attention - a conceptual analysis and review. J. Cogn. 2:36. doi: 10.5334/joc.58

Park, M. O., and Lee, S. H. (2019). Effect of a dual-task program with different cognitive tasks applied to stroke patients: a pilot randomized controlled trial. NeuroRehabilitation 44, 239–249. doi: 10.3233/NRE-182563

Pegoraro, S., Facchin, A., Luchesa, F., Rolandi, E., Guaita, A., Arduino, L. S., et al. (2024). The complexity of reading revealed by a study with healthy older adults. Brain Sci. 14:230. doi: 10.3390/brainsci14030230

Petersen, S. E., and Posner, M. I. (2012). The attention system of the human brain: 20 years after. Annu. Rev. Neurosci. 35, 73–89. doi: 10.1146/annurev-neuro-062111-150525

Plebanek, D. J., and Sloutsky, V. M. (2019). Selective attention, filtering, and the development of working memory. Dev. Sci. 22:e12727. doi: 10.1111/desc.12727

Posner, M. I. (1980). Orienting of attention. Q. J. Exp. Psychol. 32, 3–25. doi: 10.1080/00335558008248231

Posner, M. I., and Petersen, S. E. (1990). The attention system of the human brain. Annu. Rev. Neurosci. 13, 25–42. doi: 10.1146/annurev.ne.13.030190.000325

Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 124, 372–422. doi: 10.1037/0033-2909.124.3.372

Reigal, R. E., Barrero, S., Martín, I., Morales-Sánchez, V., Juárez-Ruiz de Mier, R., and Hernández-Mendo, A. (2019). Relationships between reaction time, selective attention, physical activity, and physical fitness in children. Front. Psychol. 10:2278. doi: 10.3389/fpsyg.2019.02278

Rubia, K., Hyde, Z., Halari, R., Giampietro, V., and Smith, A. (2010). Effects of age and sex on developmental neural networks of visual-spatial attention allocation. NeuroImage 51, 817–827. doi: 10.1016/j.neuroimage.2010.02.058

Senger, C., Margarido, M. R. R. A., De Moraes, C. G., De Fendi, L. I., Messias, A., and Paula, J. S. (2017). Visual search performance in patients with vision impairment: a systematic review. Curr. Eye Res. 42, 1561–1571. doi: 10.1080/02713683.2017.1338348

Shamsi, F., Liu, R., and Kwon, M. (2022). Foveal crowding appears to be robust to normal aging and glaucoma unlike parafoveal and peripheral crowding. J. Vis. 22:10. doi: 10.1167/jov.22.8.10

Stern, Y., Habeck, C., Moeller, J., Scarmeas, N., Anderson, K. E., Hilton, H. J., et al. (2005). Brain networks associated with cognitive reserve in healthy young and old adults. Cereb. Cortex 15, 394–402. doi: 10.1093/cercor/bhh142

Strober, L., DeLuca, J., Benedict, R. H., Jacobs, A., Cohen, J. A., Chiaravalloti, N., et al. (2019). Symbol digit modalities test: a valid clinical trial endpoint for measuring cognition in multiple sclerosis. Mult. Scler. 25, 1781–1790. doi: 10.1177/1352458518808204

Tavakol, M., and Dennick, R. (2011). Making sense of Cronbach's alpha. Int. J. Med. Educ. 2, 53–55. doi: 10.5116/ijme.4dfb.8dfd

Tun, P. A., and Lachman, M. E. (2008). Age differences in reaction time and attention in a national telephone sample of adults: education, sex, and task complexity matter. Dev. Psychol. 44, 1421–1429. doi: 10.1037/a0012845

Turatto, M., Benso, F., Facoetti, A., Galfano, G., Mascetti, G. G., and Umiltà, C. (2000). Automatic and voluntary focusing of attention. Percept. Psychophys. 62, 935–952. doi: 10.3758/BF03212079

Uttl, B., and Pilkenton-Taylor, C. (2001). Letter cancellation performance across the adult life span. Clin. Neuropsychol. 15, 521–530. doi: 10.1076/clin.15.4.521.1881

Utz, K. S., Hankeln, T. M., Jung, L., Lammer, A., Waschbisch, A., Lee, D. H., et al. (2013). Visual search as a tool for a quick and reliable assessment of cognitive functions in patients with multiple sclerosis. PLoS One 8:e81531. doi: 10.1371/journal.pone.0081531

van Es, D. M., Theeuwes, J., and Knapen, T. (2018). Spatial sampling in human visual cortex is modulated by both spatial and feature-based attention. eLife 7:e36928. doi: 10.7554/eLife.36928

Verghese, P. (2001). Visual search and attention: a signal detection theory approach. Neuron 31, 523–535. doi: 10.1016/S0896-6273(01)00392-0

Veríssimo, J., Verhaeghen, P., Goldman, N., Weinstein, M., and Ullman, M. T. (2022). Evidence that ageing yields improvements as well as declines across attention and executive functions. Nat. Hum. Behav. 6, 97–110. doi: 10.1038/s41562-021-01169-7

Weintraub, S., and Mesulam, M. M. (1988). Visual Hemispatial inattention: stimulus parameters and exploratory strategies. J. Neurol. Neurosurg. Psychiatry 51, 1481–1488. doi: 10.1136/jnnp.51.12.1481

Wolfe, J. M. (2003). Moving towards solutions to some enduring controversies in visual search. Trends Cogn. Sci. 7, 70–76. doi: 10.1016/S1364-6613(02)00024-4

Wolfe, J. M. (2020). Visual search: how do we find what we are looking for? Annu. Rev. Vis. Sci. 6, 539–562. doi: 10.1146/annurev-vision-091718-015048

Wolfe, J. M., Oliva, A., Horowitz, T. S., Butcher, S. J., and Bompas, A. (2002). Segmentation of objects from backgrounds in visual search tasks. Vis. Res. 42, 2985–3004. doi: 10.1016/S0042-6989(02)00388-7

Yehezkel, O., Sterkin, A., Lev, M., and Polat, U. (2015). Crowding is proportional to visual acuity in young and aging eyes. J. Vis. 15:23. doi: 10.1167/15.8.23

Keywords: attention, attention assessment, broken ring enVision search test, reliability, validity, age, education level, gender

Citation: Shi Y and Zhang Y (2024) Reliability and validity of a novel attention assessment scale (broken ring enVision search test) in the Chinese population. Front. Psychol. 15:1375326. doi: 10.3389/fpsyg.2024.1375326

Received: 23 January 2024; Accepted: 25 April 2024; Published: 09 May 2024.

Copyright © 2024 Shi and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yi Zhang, [email protected]



Using ideas from game theory to improve the reliability of language models


A new “consensus game,” developed by MIT CSAIL researchers, elevates AI’s text comprehension and generation skills.

Imagine you and a friend are playing a game where your goal is to communicate secret messages to each other using only cryptic sentences. Your friend’s job is to guess the secret message behind your sentences. Sometimes, you give clues directly, and other times, your friend has to guess the message by asking yes-or-no questions about the clues you’ve given. The challenge is that both of you want to make sure you’re understanding each other correctly and agreeing on the secret message.

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have created a similar “game” to help improve how AI understands and generates text. It is known as a “consensus game” and it involves two parts of an AI system — one part tries to generate sentences (like giving clues), and the other part tries to understand and evaluate those sentences (like guessing the secret message).

The researchers discovered that by treating this interaction as a game, where both parts of the AI work together under specific rules to agree on the right message, they could significantly improve the AI’s ability to give correct and coherent answers to questions. They tested this new game-like approach on a variety of tasks, such as reading comprehension, solving math problems, and carrying on conversations, and found that it helped the AI perform better across the board.

Traditionally, large language models answer in one of two ways: generating answers directly from the model (generative querying) or using the model to score a set of predefined answers (discriminative querying); the two can produce differing and sometimes incompatible results. With the generative approach, “Who is the president of the United States?” might yield a straightforward answer like “Joe Biden.” However, a discriminative query could incorrectly dispute this fact when scoring candidate answers, favoring, say, “Barack Obama.”
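To make the two querying modes concrete, here is a small, hedged sketch using the Hugging Face transformers library; the gpt2 checkpoint, the prompt format, and the candidate answers are illustrative assumptions, not anything from the MIT work. It generates an answer directly, then scores fixed candidates by log-likelihood, which is exactly where the two modes can disagree.

```python
# Sketch of generative vs. discriminative querying (illustrative model/prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

question = "Q: Who is the president of the United States?\nA:"

# Generative querying: decode an answer directly from the model.
inputs = tok(question, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))

# Discriminative querying: score a fixed set of candidate answers by their
# log-likelihood under the same model.
def answer_logprob(question: str, answer: str) -> float:
    # Assumes the question's tokenization is a prefix of the combined string's.
    ids = tok(question + " " + answer, return_tensors="pt")["input_ids"]
    q_len = tok(question, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its prefix, summed over the answer span.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, q_len - 1:].sum().item()

for cand in ["Joe Biden", "Barack Obama"]:
    print(cand, answer_logprob(question, cand))
```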

So, how do we reconcile mutually incompatible scoring procedures to achieve coherent, efficient predictions?

“Imagine a new way to help language models understand and generate text, like a game. We’ve developed a training-free, game-theoretic method that treats the whole process as a complex game of clues and signals, where a generator tries to send the right message to a discriminator using natural language. Instead of chess pieces, they’re using words and sentences,” says Athul Jacob, an MIT PhD student in electrical engineering and computer science and CSAIL affiliate. “Our way to navigate this game is finding the ‘approximate equilibria,’ leading to a new decoding algorithm called ‘equilibrium ranking.’ It’s a pretty exciting demonstration of how bringing game-theoretic strategies into the mix can tackle some big challenges in making language models more reliable and consistent.”

When tested across many tasks, like reading comprehension, commonsense reasoning, math problem-solving, and dialogue, the team’s algorithm consistently improved how well these models performed. Using the equilibrium-ranking (ER) algorithm with the LLaMA-7B model even outshone the results from much larger models. “Given that these models are already competitive, and that people have been working on them for a while, the level of improvement we saw, being able to outperform a model that’s 10 times the size, was a pleasant surprise,” says Jacob.

“Diplomacy,” a strategic board game set in pre-World War I Europe, where players negotiate alliances, betray friends, and conquer territories without the use of dice — relying purely on skill, strategy, and interpersonal manipulation — recently had a second coming. In November 2022, computer scientists, including Jacob, developed “Cicero,” an AI agent that achieves human-level capabilities in the mixed-motive seven-player game, which requires the same aforementioned skills, but with natural language. The math behind this partially inspired the Consensus Game.

While the history of AI agents long predates OpenAI’s software entering the chat in November 2022, it’s well documented that such models can still cosplay as a well-meaning yet pathologically unreliable friend.

The consensus game system reaches equilibrium as an agreement, ensuring accuracy and fidelity to the model’s original insights. To achieve this, the method iteratively adjusts the interactions between the generative and discriminative components until they reach a consensus on an answer that accurately reflects reality and aligns with their initial beliefs. This approach effectively bridges the gap between the two querying methods.
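The iterative adjustment described here can be pictured with a toy consensus dynamic. The sketch below is not the paper's equilibrium-ranking algorithm (which runs regularized no-regret updates over a real language model); it is a deliberately simplified stand-in with made-up numbers, showing two distributions over candidate answers being pulled toward agreement while each stays anchored to its initial beliefs.

```python
# Toy consensus dynamic (illustrative only, not the paper's exact updates):
# generator and discriminator distributions over candidate answers move toward
# each other while staying anchored to their initial beliefs.
import numpy as np

def normalize(p):
    p = np.asarray(p, dtype=float)
    return p / p.sum()

gen0 = normalize([0.7, 0.2, 0.1])    # generator's initial beliefs (hypothetical)
disc0 = normalize([0.3, 0.6, 0.1])   # discriminator's initial beliefs (hypothetical)

gen, disc = gen0.copy(), disc0.copy()
lam = 0.5  # weight on initial beliefs (fidelity) vs. the other player (consensus)

for step in range(100):
    new_gen = normalize(gen0**lam * disc**(1 - lam))    # geometric mixing
    new_disc = normalize(disc0**lam * gen**(1 - lam))
    if np.allclose(new_gen, gen) and np.allclose(new_disc, disc):
        break  # the two sides have reached a stable agreement
    gen, disc = new_gen, new_disc

# Rank candidates by the agreed-upon joint score.
print("consensus distribution (generator):", np.round(gen, 3))
print("ranking:", np.argsort(-(np.log(gen) + np.log(disc))))
```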

In practice, implementing the consensus game approach to language model querying, especially for question-answering tasks, does involve significant computational challenges. For example, when using datasets like MMLU, which have thousands of questions and multiple-choice answers, the model must apply the mechanism to each query. Then, it must reach a consensus between the generative and discriminative components for every question and its possible answers.

The system did struggle with a grade-school rite of passage: math word problems. It couldn’t generate wrong answers, which is a critical component of understanding the process of coming up with the right one.

“The last few years have seen really impressive progress in both strategic decision-making and language generation from AI systems, but we’re just starting to figure out how to put the two together. Equilibrium ranking is a first step in this direction, but I think there’s a lot we’ll be able to do to scale this up to more complex problems,” says Jacob.

An avenue of future work involves enhancing the base model by integrating the outputs of the current method. This is particularly promising since it can yield more factual and consistent answers across various tasks, including factuality and open-ended generation. The potential for such a method to significantly improve the base model’s performance is high, which could result in more reliable and factual outputs from ChatGPT and similar language models that people use daily.

“Even though modern language models, such as ChatGPT and Gemini, have led to solving various tasks through chat interfaces, the statistical decoding process that generates a response from such models has remained unchanged for decades,” says Google Research Scientist Ahmad Beirami, who was not involved in the work. “The proposal by the MIT researchers is an innovative game-theoretic framework for decoding from language models through solving the equilibrium of a consensus game. The significant performance gains reported in the research paper are promising, opening the door to a potential paradigm shift in language model decoding that may fuel a flurry of new applications.”

Jacob wrote the paper with MIT-IBM Watson Lab researcher Yikang Shen and MIT Department of Electrical Engineering and Computer Science assistant professors Gabriele Farina and Jacob Andreas, who is also a CSAIL member. They presented their work at the International Conference on Learning Representations (ICLR) earlier this month, where it was highlighted as a “spotlight paper.” The research also received a “best paper award” at the NeurIPS R0-FoMo Workshop in December 2023.



A Quantitative Assessment of Visual Function for Young and Medically Complex Children with Cerebral Visual Impairment: Development and Inter-Rater Reliability

Correspondence: Kathleen M. Weden, [email protected]

Background Cerebral Visual Impairment (CVI) is the most common cause of low vision in children. Standardized, quantifiable measures of visual function are needed.

Objective This study developed and evaluated a new method for quantifying visual function in young and medically complex children with CVI using remote videoconferencing.

Methods Children diagnosed with CVI who had been unable to complete clinic-based recognition acuity tests were recruited from a low-vision rehabilitation clinic (n = 22). A video-based Visual Function Assessment (VFA) was implemented using videoconference technology, and three low-vision rehabilitation clinicians independently scored recordings of each child’s VFA. Inter-rater reliability was analyzed using intraclass correlations (ICC), and correlations were estimated between the video-based VFA scores and both clinically obtained acuity measures and children’s cognitive age equivalence.

Results ICCs showed good agreement on VFA scores across raters (ICC = 0.835, 95% CI 0.701–0.916), comparable to previous, similar studies. VFA scores correlated strongly with clinically obtained acuity measures (r = −0.706, p = 0.002) and moderately with cognitive age equivalence (r = 0.518, p = 0.005), with notable variation in VFA scores for participants below a ten-month cognitive age equivalence. This variability among children with the lowest cognitive age equivalence may be an artifact of the study’s scoring method, or may reflect genuine variability in visual function at the lowest cognitive age equivalence.
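For readers who want to see what an inter-rater ICC involves computationally, here is a minimal sketch; the 22-by-3 ratings matrix is simulated (standing in for three clinicians scoring 22 children's recordings), and the two-way random-effects ICC(2,1) shown is only one of several ICC variants the authors could have used.

```python
# Minimal sketch (not the study's code): two-way random-effects ICC(2,1)
# for absolute agreement, computed from scratch with NumPy on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n_targets, n_raters = 22, 3
true_score = rng.normal(50, 10, (n_targets, 1))                 # shared signal
ratings = true_score + rng.normal(0, 4, (n_targets, n_raters))  # plus rater noise

def icc2_1(Y):
    n, k = Y.shape
    grand = Y.mean()
    ms_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # targets
    ms_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    ss_err = ((Y - Y.mean(axis=1, keepdims=True)
                 - Y.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Shrout & Fleiss ICC(2,1): single-rater agreement, raters as random effects.
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                                 + k * (ms_cols - ms_err) / n)

print(f"ICC(2,1) = {icc2_1(ratings):.3f}")
```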

Conclusions Our new VFA is a reliable, quantitative measure of visual function for young and medically complex children with CVI. Future study of the VFA’s intra-rater reliability and validity is warranted.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was supported by the EyeSight Foundation of Alabama, Alie B. Gorrie Low Vision Research Fund and Research to Prevent Blindness. Additional support came from the National Institutes of Health [UL1 TR003096 to R.O.] and Grant T32 HS013852.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

IRB of the University of Alabama at Birmingham gave ethical approval for this work

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Data Availability

All data produced in the present study are available upon reasonable request to the authors.


Subject Area: Rehabilitation Medicine and Physical Therapy
