
Univariate Analysis: basic theory and example


Univariate analysis: this article explains univariate analysis in a practical way. The article begins with a general explanation and an explanation of the reasons for applying this method in research, followed by the definition of the term and a graphical representation of the different ways of representing univariate statistics. Enjoy the read!

Introduction

Research is a dynamic process that carefully uses different techniques and methods to gain insights, validate hypotheses and make informed decisions.

Using a variety of analytical methods, researchers can gain a thorough understanding of their data, revealing patterns, trends, and relationships.


One of the fundamental methods in research is univariate analysis, which provides valuable insights into individual variables and their characteristics.

In this article, we dive into the world of univariate analysis, its definition, importance and applications in research.

Techniques and methods in research

Research methodologies encompass a wide variety of techniques and methods that help researchers extract meaningful information from their data. Some common approaches are:

Descriptive statistics

Summarizing data using measures such as mean, median, mode, variance, and standard deviation.

Inferential statistics

Drawing conclusions about a broader population based on a sample. Methods such as hypothesis testing and confidence intervals are used for this.

Multivariate analysis

Exploring relationships between multiple variables simultaneously, allowing researchers to examine complex interactions and dependencies. When the relationship between exactly two variables is explored, this is called bivariate analysis.

Qualitative analysis

Discovering insights in and seeking to understand subjective data, such as interviews, observations and case studies.

Quantitative analysis

Analyzing numerical data using statistical methods to reveal patterns and trends.

What is univariate analysis?

Univariate analysis focuses on the study and interpretation of only one variable on its own, without considering possible relationships with other variables.

The method aims to understand the characteristics and behavior of that specific variable. Univariate analysis is the simplest form of analyzing data.

Definition of univariate

The term univariate consists of two elements: uni, which means one, and variate, which refers to a statistical variable. Therefore, univariate analysis focuses on exploring and summarizing the properties of one variable independently.

Importance of univariate analysis

Univariate analysis serves as an important first step in many research projects, as it provides essential insights and lays a foundation for further research. It offers researchers the following benefits:

Data exploration

Univariate analysis allows researchers to understand the distribution, central tendency, and variability of a variable.

Identification of outliers

By detecting anomalous values, univariate analysis helps identify outliers that require further investigation or treatment during the data analysis phase.

Data cleaning

Univariate analysis helps identify missing data, inconsistencies or errors within a variable, allowing researchers to refine and optimize their data set before moving on to more complex analyses.

Variable selection

Researchers can use the univariate analysis to determine which variables are most promising for further research. This enables efficient allocation of resources and hypothesis testing.

Reporting and visualization

Summarizing and visualizing univariate statistics facilitates clear and concise reporting of research results. This makes complex data more accessible to a wider audience.


Applications of univariate analysis

Univariate analysis is used in various research areas and disciplines. It is often used in:

  • Epidemiological studies to analyze risk factors
  • Social science research to investigate attitudes, behaviors or socio-economic variables
  • Market research to understand consumer preferences, buying patterns or market trends
  • Environmental studies to investigate pollution, climate data or species distributions

By using univariate analysis, researchers can uncover valuable insights, detect trends, and lay the groundwork for more comprehensive statistical analysis.

Types of univariate analyses

The most common method of performing univariate analysis is summary statistics. The appropriate statistics are determined by the level of measurement, or the nature of the information in the variables. The following are the most common types of summary statistics:

  • Measures of dispersion: these numbers describe how evenly the values are distributed in a dataset. The range, standard deviation, interquartile range, and variance are some examples:
      • Range: the difference between the highest and lowest value in a data set.
      • Standard deviation: an average measure of the spread.
      • Interquartile range: the spread of the middle 50% of the values.
  • Measures of central tendency: these numbers describe the location of the center point of a data set or the middle value of the data set. The mean, median and mode are the three main measures of central tendency.
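As a minimal sketch, both kinds of summary statistics can be computed with Python's standard library; the ages below are invented example data, not taken from the article:

```python
# Summary statistics for a single variable using Python's standard library.
# The sample values are invented for illustration.
import statistics

ages = [23, 25, 25, 29, 31, 34, 35, 35, 35, 41]

central = {
    "mean": statistics.mean(ages),      # arithmetic average
    "median": statistics.median(ages),  # middle value
    "mode": statistics.mode(ages),      # most frequent value
}

dispersion = {
    "range": max(ages) - min(ages),        # highest minus lowest
    "variance": statistics.variance(ages), # sample variance
    "stdev": statistics.stdev(ages),       # sample standard deviation
}

print(central)
print(dispersion)
```

Each statistic on its own says something different about the variable, which is why reporting both a measure of center and a measure of spread is standard practice.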


Figure 1. Univariate Analysis – Types

Frequency table

Frequency indicates how often something occurs: the frequency of an observation is the number of times that value appears in the data.

The frequency distribution table can display qualitative and numerical or quantitative variables. The distribution provides an overview of the data and allows you to spot patterns.
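A frequency table for a categorical variable can be built with `collections.Counter` from Python's standard library; the blood-group responses below are invented for illustration:

```python
# A simple frequency distribution table for categorical data.
# The observations are invented example data.
from collections import Counter

blood_groups = ["A", "O", "B", "O", "A", "O", "AB", "A", "O", "B"]

freq = Counter(blood_groups)  # absolute frequency of each category
total = sum(freq.values())

# Print absolute and relative frequencies, most common first.
for category, count in freq.most_common():
    print(f"{category:>2}  n={count}  {100 * count / total:.0f}%")
```

Reporting both absolute counts and percentages, as this loop does, is exactly the overview that a frequency distribution table is meant to give.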

Bar chart

The bar chart displays data in the form of rectangular bars and is used to compare different categories. The bars can be plotted vertically or horizontally; in most cases they are plotted vertically.

The horizontal or x-axis represents the category and the vertical y-axis represents the value of the category.

This diagram can be used, for example, to see which part of a budget is the largest.

Histogram

A histogram is a graph that shows how often certain values occur in a data set. It consists of bars whose height indicates how often a certain value occurs.

Frequency polygon

The frequency polygon is very similar to the histogram. It is used to compare data sets or to display the cumulative frequency distribution.

The frequency polygon is displayed as a line graph.

Pie chart

The pie chart displays the data in a circular format. The diagram is divided into slices, where each slice is proportional to its share of the whole. Each “pie slice” in the pie chart is thus a portion of the total, and the slices should always sum to 100%.
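The slice sizes of a pie chart are just category proportions. A minimal sketch of that computation, using an invented budget as the data:

```python
# Computing pie-chart slice sizes from category totals.
# The budget categories and amounts are invented for illustration.
budget = {"Rent": 1200, "Food": 600, "Transport": 300, "Other": 300}

total = sum(budget.values())
slices = {name: 100 * amount / total for name, amount in budget.items()}

for category, pct in slices.items():
    print(f"{category}: {pct:.1f}%")

# The slices are proportions of the whole, so they must sum to 100.
assert abs(sum(slices.values()) - 100) < 1e-9
```

This also shows how such a chart answers the budget question above: the largest slice is the largest part of the budget.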

Example situation of a Univariate Analysis

An example of univariate analysis might be examining the age of employees in a company.

Data is collected on the age of all employees and then a univariate analysis is performed to understand the characteristics and distribution of this single variable.

We can calculate summary statistics, such as the mean, median, and standard deviation, to get an idea of the central tendency and range of ages.

Histograms can also be used to visualize the frequency of different age groups and to identify any patterns or outliers.



How to cite this article: Janse, B. (2024). Univariate Analysis . Retrieved [insert date] from Toolshero: https://www.toolshero.com/research/univariate-analysis/

Original publication date: 03/22/2024 | Last update: 03/22/2024



Ben Janse

Ben Janse is a young professional working at ToolsHero as Content Manager. He is also an International Business student at Rotterdam Business School, where he focuses on analyzing and developing management models. Thanks to his theoretical and practical knowledge, he knows how to distinguish main issues from side issues and to make the essence of each article clearly visible.



What is Univariate Analysis? (Definition & Example)

The term  univariate analysis refers to the analysis of one variable. You can remember this because the prefix “uni” means “one.”

The purpose of univariate analysis is to understand the distribution of values for a single variable. You can contrast this type of analysis with the following:

  • Bivariate Analysis : The analysis of two variables.
  • Multivariate Analysis:  The analysis of two or more variables.

For example, suppose we have the following dataset:

[Figure: example dataset with several variables]

We could choose to perform univariate analysis on any of the individual variables in the dataset to gain a better understanding of its distribution of values.

For example, we may choose to perform univariate analysis on the variable  Household Size :

[Figure: univariate analysis of the Household Size variable]

There are three common ways to perform univariate analysis:

1. Summary Statistics

The most common way to perform univariate analysis is to describe a variable using summary statistics .

There are two popular types of summary statistics:

  • Measures of central tendency :  these numbers describe where the center of a dataset is located. Examples include the  mean  and the  median .
  • Measures of dispersion :  these numbers describe how spread out the values are in the dataset. Examples include the  range ,  interquartile range ,  standard deviation , and  variance .

2. Frequency Distributions

Another way to perform univariate analysis is to create a frequency distribution , which describes how often different values occur in a dataset.

3. Charts

Yet another way to perform univariate analysis is to create charts to visualize the distribution of values for a certain variable.

Common examples include:

  • Boxplots
  • Histograms
  • Density Curves
  • Pie Charts

The following examples show how to perform each type of univariate analysis using the Household Size variable from our dataset mentioned earlier:

Summary Statistics

We can calculate the following measures of central tendency for Household Size:

  • Mean (the average value): 3.8
  • Median (the middle value): 4

These values give us an idea of where the “center” value is located.

We can also calculate the following  measures of dispersion:

  • Range (the difference between the max and min): 6
  • Interquartile Range (the spread of the middle 50% of values): 2.5
  • Standard Deviation (an average measure of spread): 1.87

These values give us an idea of how spread out the values are for this variable.
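These statistics can be reproduced for any household-size variable with the standard library's `statistics` module. The values below are invented example data, not the article's original dataset, so the results will differ from the numbers quoted above:

```python
# Mean, median, range, IQR and standard deviation for a single variable.
# The household sizes are invented example data.
import statistics

household_sizes = [2, 3, 3, 4, 4, 4, 5, 6]

q1, q2, q3 = statistics.quantiles(household_sizes, n=4)  # quartiles

stats = {
    "mean": statistics.mean(household_sizes),
    "median": statistics.median(household_sizes),
    "range": max(household_sizes) - min(household_sizes),
    "IQR": q3 - q1,  # spread of the middle 50% of values
    "stdev": round(statistics.stdev(household_sizes), 2),
}
print(stats)
```

Note that `statistics.quantiles` defaults to the "exclusive" method, so its quartiles can differ slightly from those of other software.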

Frequency Distributions

We can also create the following frequency distribution table to summarize how often different values occur:

[Table: frequency distribution of Household Size]

This allows us to quickly see that the most frequent household size is  4 .

Resource: You can use this Frequency Calculator to automatically produce a frequency distribution for any variable.

Charts

We can create the following charts to help us visualize the distribution of values for Household Size:

1. Boxplot

A boxplot is a plot that shows the five-number summary of a dataset.

The five-number summary includes:

  • The minimum value
  • The first quartile
  • The median value
  • The third quartile
  • The maximum value

Here’s what a boxplot would look like for the variable Household Size:

[Figure: boxplot of Household Size]

Resource: You can use this Boxplot Generator to automatically produce a boxplot for any variable.
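The five-number summary behind a boxplot can be computed directly with the standard library. The household sizes below are invented, not the article's original data:

```python
# The five-number summary that a boxplot displays:
# minimum, Q1, median, Q3, maximum. Data are invented for illustration.
import statistics

household_sizes = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7]

q1, q2, q3 = statistics.quantiles(household_sizes, n=4)
five_number = {
    "min": min(household_sizes),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(household_sizes),
}
print(five_number)
```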

2. Histogram

A histogram is a type of chart that uses vertical bars to display frequencies. This type of chart is a useful way to visualize the distribution of values in a dataset.

Here’s what a histogram would look like for the variable Household Size:

[Figure: histogram of Household Size]

3. Density Curve

A density curve is a curve on a graph that represents the distribution of values in a dataset.

It’s particularly useful for visualizing  the “shape” of a distribution, including whether or not a distribution has one or more “peaks” of frequently occurring values and whether or not the distribution is skewed to the left or the right .

Here’s what a density curve would look like for the variable Household Size:

[Figure: density curve of Household Size]

4. Pie Chart

A pie chart is a type of chart that is shaped like a circle and uses slices to represent proportions of a whole.

Here’s what a pie chart would look like for the variable Household Size:

[Figure: pie chart of Household Size]

Depending on the type of data, one of these charts may be more useful for visualizing the distribution of values than the others.


J Thorac Dis. 2017 Jun;9(6)

How to describe univariate data

Stefania Canova

1 Department of Medical Oncology, San Gerardo Hospital Monza, Monza 20900, Italy;

Diego Luigi Cortinovis

Federico Ambrogi

2 Department of Clinical Sciences and Community Health, Medical Statistics, Biometry and Bioinformatics, University of Milan, Milan 20133, Italy

Univariate analysis has the purpose to describe a single variable distribution in one sample. It is the first important step of every clinical trial. In this short review, we focus on this analysis, the methods that authors should use to report this type of data, information that they should not miss and mistakes that they must avoid.

Introduction

A variable is any characteristic that can be observed or measured on a subject. In clinical studies a sample of subjects is collected and some variables of interest are considered. Univariate descriptive analysis of a single variable has the purpose to describe the variable distribution in one sample and it is the first important step of every clinical study.

Authors should identify the type and number of examined variables, as well as missing data for each variable.

Variables can be categorical or numerical.

Categorical or qualitative data can be binary, nominal or ordinal. Binary variables are characterized by only two possible categories, for example male/female, dead/alive.

When there are more than two categories/classes, it is important to distinguish between nominal variables, such as blood group, and ordinal variables, such as disease stage.

Categorical data should be presented not only giving percentages for each class, but also absolute frequencies.

Numerical or quantitative data can be broadly divided into discrete or continuous. Discrete variables arise mainly from counts, such as the number of words in a sentence or the number of members of a family, while continuous variables arise mainly from measurements, such as height, blood pressure or tumour size. Such variables are continuous as, in principle, any value (in the admissible range of measurement) can be taken, while discrete variables can take only certain numerical values. For continuous variables, the only limitation arises from the accuracy of the instrument of measurement. Discrete variables are sometimes treated as continuous, when the number of possible values is very large. Numerical variables can be transformed into categorical ones by grouping values into two or more categories to simplify the comprehension of results (but not, in general, the analysis). Categorization of numerical variables results in loss of information, especially with two groups, and should be done with caution.

Authors should always specify how categorization was obtained, in particular how the choice of cut-points was made, if on the basis of previous analyses or arbitrarily by the authors (using median and quartiles for example). In absence of previous analyses, theoretical or clinical arguments should justify categorization to avoid biases and to obtain reliable results ( 1 ).

Researchers should avoid arbitrary cut-points and should prefer categorization into at least three groups avoiding dichotomization.

Frequency distribution and central tendency

A variable can be described by its frequency distribution that reports the absolute (or relative to the total) number of times a specific value/class of a variable is observed in the sample. Continuous variables should be divided in classes for this purpose. For ordered nominal variables and for numerical variables, the cumulative frequencies can also be computed. Instead of tables, graphs can be used to describe the distributions. Pie charts, where each slice represents the proportion of observations of each category, are useful for nominal data (without ordering), while bar charts can be used for ordinal categorical data or for discrete data. Histograms must be used for continuous data.

Another useful possibility is a box-whisker plot, which is composed of a box representing the upper and lower quartiles and a central line indicating the median, while the whiskers represent extreme centiles, with extreme values shown above and below the whiskers.

Due to space limitations, tables reporting summary values for each distribution are often used to describe the variables considered in the study. Before summarizing the distribution with few numbers, it is always necessary to look at the whole distribution. If the shape of the distribution is approximately symmetric (like for the Gaussian distribution), the mean and the standard deviation (SD) can be used, reporting the results as mean (SD), and avoiding the ±. If the shape of the distribution is skewed, it is better to use the median and the quartiles. A general recommendation could be to report, in every case, mean, median, SD and the quartiles. Mean, median and mode are very similar in case of symmetric distributions. In case of skewed distributions, median is less influenced by extreme observations.

Another summary measure is the mode, the most frequent observation. This is rarely useful for numerical variables, whereas it is the only measure that can be used with categorical variables. When describing categorical variables in tables, not only percentages for each class but also their absolute frequencies should always be reported.

SD should not be confused with standard error (SE).

SE is a measure of the dispersion of the sample means around the population mean and is used for inferential (not descriptive) purposes. SE is the ratio between the SD and the square root of the sample size (n) ( 2 ).
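The SD/SE distinction above reduces to one formula, SE = SD / √n. A minimal sketch with invented sample values:

```python
# SD describes the spread of the data; SE describes the precision of the
# sample mean: SE = SD / sqrt(n). The sample values are invented.
import math
import statistics

sample = [4.1, 5.0, 5.2, 4.8, 5.6, 4.3, 5.1, 4.9]

sd = statistics.stdev(sample)      # descriptive: spread of observations
se = sd / math.sqrt(len(sample))   # inferential: spread of sample means
print(round(sd, 3), round(se, 3))
```

Because SE divides SD by √n, it is always smaller than SD and shrinks as the sample grows, which is why confusing the two makes data look less variable than they are.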

SD is especially useful when the distribution is approximately Gaussian, as in the Gaussian case about 95% of observations are included within two SD of the mean ( 3 ).

Rounding numbers

The general rule is to present summary statistics at no more than one decimal place than the raw data ( 4 ). In the case of percentages, it is often enough to approximate at one decimal place. Rounding should be done only in the final report, not during analysis, to maintain precision and not to lose information.

According to one commonly used rule, excess digits are dropped if the first excess digit is less than five; if the first excess digit is five or more, the last retained digit is increased by one. Be aware that computer output often contains spurious digits, which should be rounded according to the original accuracy of the measurements.
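The rule described above is "round half up". As a cautionary sketch: Python's built-in `round()` does not follow it (it rounds halves to the nearest even digit), so `decimal` with `ROUND_HALF_UP` is needed to match the text:

```python
# "Round half up", as described in the text, versus Python's built-in
# round(), which uses banker's rounding (round half to even).
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(x: float, places: int) -> float:
    """Round x to `places` decimals, halves always rounding up."""
    q = Decimal("1").scaleb(-places)  # e.g. places=1 -> Decimal('0.1')
    return float(Decimal(str(x)).quantize(q, rounding=ROUND_HALF_UP))

print(round_half_up(2.25, 1))  # first excess digit is 5 -> 2.3
print(round(2.25, 1))          # banker's rounding -> 2.2
```

Passing the value through `str()` before building the `Decimal` avoids picking up binary floating-point noise from the raw float.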

Time to event data

In many clinical studies, the time to onset of an event is of interest. Censored data refer to subjects included in the analysis but for whom the event of interest has not yet been observed when the study is closed ( 3 ). For example, in survival studies censored data include both patients still alive at the end of follow-up and patients lost during follow-up.

When reporting the number of events, it is advisable to avoid calculating the percentage with respect to the total number of subjects unless all subjects were followed-up for the same amount of time.

The completeness of follow-up is an indicator of study quality. Therefore, researchers should report the number of subjects lost to follow-up in addition to the follow-up range (minimum and maximum). The Kaplan-Meier method is suitable to describe the distribution of such a variable taking correctly into consideration the follow-up time and censored observations ( 5 ).

Authors should graphically report the number of subjects at risk. Moreover, they should indicate censoring times and confidence intervals, as well as which software was used to perform analyses.
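To make the Kaplan-Meier idea concrete, here is a minimal, self-contained sketch of the product-limit estimator; the subjects and times are invented, and real analyses would use a dedicated package rather than this toy function:

```python
# Toy Kaplan-Meier (product-limit) estimator. Each subject is a pair
# (time, observed): observed=False means the subject was censored at
# that time. The data are invented for illustration.
def kaplan_meier(subjects):
    """Return [(time, survival probability)] at each observed event time."""
    at_risk = len(subjects)
    surv = 1.0
    curve = []
    for t in sorted({time for time, _ in subjects}):
        events = sum(1 for time, obs in subjects if time == t and obs)
        if events:
            surv *= 1 - events / at_risk  # product-limit step
            curve.append((t, surv))
        # Everyone with this time (event or censored) leaves the risk set.
        at_risk -= sum(1 for time, _ in subjects if time == t)
    return curve

data = [(2, True), (3, False), (5, True), (5, True), (8, False), (10, True)]
print(kaplan_meier(data))
```

Note how the censored subjects at times 3 and 8 do not drop the survival curve, but do shrink the risk set for later event times; this is exactly how the method "takes correctly into consideration the follow-up time and censored observations".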

Conclusions

When reporting study results, authors should keep in mind the advice of the International Committee of Medical Journal Editors (1991): “Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results.”

Acknowledgements

Conflicts of Interest: The authors have no conflicts of interest to declare.


8.1 - The Univariate Approach: Analysis of Variance (ANOVA)

In the univariate case, the data can often be arranged in a table as follows:

The columns correspond to the responses to g different treatments or from g different populations. And, the rows correspond to the subjects in each of these treatments or populations.

  • \(Y_{ij}\) = Observation from subject j in group i
  • \(n_{i}\) = Number of subjects in group i
  • \(N = n_{1} + n_{2} + \dots + n_{g}\) = Total sample size.

Assumptions for the Analysis of Variance are the same as for a two-sample t -test except that there are more than two groups:

  • The data from group i has common mean = \(\mu_{i}\); i.e., \(E\left(Y_{ij}\right) = \mu_{i}\) . This means that there are no sub-populations with different means.
  • Homoskedasticity : The data from all groups have common variance \(\sigma^2\); i.e., \(var(Y_{ij}) = \sigma^{2}\). That is, the variability in the data does not depend on group membership.
  • Independence: The subjects are independently sampled.
  • Normality : The data are normally distributed.

The hypothesis of interest is that all of the means are equal. Mathematically we write this as:

\(H_0\colon \mu_1 = \mu_2 = \dots = \mu_g\)

The alternative is expressed as:

\(H_a\colon \mu_i \ne \mu_j \) for at least one \(i \ne j\).

i.e., there is a difference between at least one pair of group population means. The following notation should be considered:

  • \(\bar{y}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\) = Sample mean for group i. This involves taking the average of all the observations for j = 1 to \(n_{i}\) belonging to the i th group; the dot in the second subscript means that the average involves summing over the second subscript of y.
  • \(\bar{y}_{..} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_i}Y_{ij}\) = Grand mean. This involves taking the average of all the observations within each group and over the groups and dividing by the total sample size; the double dots indicate that we are summing over both subscripts of y.

Here we are looking at the sum of squared differences between each observation and the grand mean. Note that if the observations tend to be far away from the Grand Mean then this will take a large value. Conversely, if all of the observations tend to be close to the Grand mean, this will take a small value. Thus, the total sum of squares measures the variation of the data about the Grand mean.

An Analysis of Variance (ANOVA) is a partitioning of the total sum of squares. In the second line of the expression below, we are adding and subtracting the sample mean for the i th group. In the third line, we can divide this out into two terms, the first term involves the differences between the observations and the group means, \(\bar{y}_i\), while the second term involves the differences between the group means and the grand mean.

\(\begin{array}{lll} SS_{total} & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{y}_{..}\right)^2 \\ & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left((Y_{ij}-\bar{y}_{i.})+(\bar{y}_{i.}-\bar{y}_{..})\right)^2 \\ & = &\underset{SS_{error}}{\underbrace{\sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{i.})^2}}+\underset{SS_{treat}}{\underbrace{\sum_{i=1}^{g}n_i(\bar{y}_{i.}-\bar{y}_{..})^2}} \end{array}\)

The first term is called the error sum of squares and measures the variation in the data about their group means.

Note that if the observations tend to be close to their group means, this value will tend to be small; if they tend to be far away from their group means, it will be larger. The second term is called the treatment sum of squares and involves the differences between the group means and the Grand mean: if the group means are close to the Grand mean, this value will be small, while if they tend to be far away from the Grand mean, it will take a large value. The treatment sum of squares thus measures the variation of the group means about the Grand mean.
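The partition \(SS_{total} = SS_{error} + SS_{treat}\) can be verified numerically in a few lines; the three groups below are invented example data:

```python
# Numeric check of the ANOVA sum-of-squares partition:
# SS_total = SS_error + SS_treat. The groups are invented example data.
groups = [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [4.0, 5.0, 9.0]]

all_obs = [y for grp in groups for y in grp]
grand = sum(all_obs) / len(all_obs)  # grand mean, y-bar-dot-dot

ss_total = sum((y - grand) ** 2 for y in all_obs)
# Variation of observations about their own group means:
ss_error = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)
# Variation of group means about the grand mean, weighted by group size:
ss_treat = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)

print(ss_total, ss_error + ss_treat)
assert abs(ss_total - (ss_error + ss_treat)) < 1e-9
```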

The Analysis of Variance results are summarized in the analysis of variance table below:


The ANOVA table contains columns for Source, Degrees of Freedom, Sum of Squares, Mean Square and F . Sources include Treatment and Error which together add up to the Total.

The degrees of freedom for treatment in the first row of the table are calculated by taking the number of groups or treatments minus 1. The total degrees of freedom are the total sample size minus 1. The Error degrees of freedom are obtained by subtracting the treatment degrees of freedom from the total degrees of freedom to obtain N - g.

The formulae for the Sum of Squares are given in the SS column. The Mean Square terms are obtained by taking the Sums of Squares terms and dividing them by the corresponding degrees of freedom.

The final column contains the F statistic which is obtained by taking the MS for treatment and dividing it by the MS for Error.

Under the null hypothesis that the treatment effect is equal across group means, that is \(H_{0} \colon \mu_{1} = \mu_{2} = \dots = \mu_{g} \), this F statistic is F -distributed with g - 1 and N - g degrees of freedom:

\(F \sim F_{g-1, N-g}\)

The numerator degrees of freedom g - 1 comes from the degrees of freedom for treatments in the ANOVA table. This is referred to as the numerator degrees of freedom since the formula for the F -statistic involves the Mean Square for Treatment in the numerator. The denominator degrees of freedom N - g is equal to the degrees of freedom for error in the ANOVA table. This is referred to as the denominator degrees of freedom because the formula for the F -statistic involves the Mean Square Error in the denominator.

We reject \(H_{0}\) at level \(\alpha\) if the F statistic is greater than the critical value of the F -table, with g - 1 and N - g degrees of freedom, and evaluated at level \(\alpha\).

\(F > F_{g-1, N-g, \alpha}\)
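Putting the table together, the F statistic is just MS for treatment divided by MS for error. A minimal sketch with invented groups (in practice one would then compare F against the critical value \(F_{g-1, N-g, \alpha}\) from an F table or a statistics package):

```python
# One-way ANOVA F statistic computed from the sums of squares:
# F = MS_treat / MS_error, with g-1 and N-g degrees of freedom.
# The groups are invented example data.
groups = [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [4.0, 5.0, 9.0]]

g = len(groups)
N = sum(len(grp) for grp in groups)
grand = sum(y for grp in groups for y in grp) / N

ss_treat = sum(len(grp) * (sum(grp) / len(grp) - grand) ** 2 for grp in groups)
ss_error = sum(sum((y - sum(grp) / len(grp)) ** 2 for y in grp) for grp in groups)

ms_treat = ss_treat / (g - 1)  # numerator df: g - 1
ms_error = ss_error / (N - g)  # denominator df: N - g
F = ms_treat / ms_error
print(g - 1, N - g, F)
```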


Univariate Analysis


Understanding Univariate Analysis

Univariate analysis is one of the simplest forms of statistical analysis, where the data being analyzed contains only one variable. Since it's a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it. You can think of univariate analysis as a way to summarize and find patterns in data that can be represented in a single variable.

Types of Univariate Analysis

There are two types of univariate analysis based on the type of data it handles:

This is used when the data is numerical. It helps to understand the distribution of numerical values through measures of central tendency (mean, median, and mode) and measures of dispersion (range, variance , standard deviation , and interquartile range).

  • Qualitative Analysis: This is used for categorical data, which represents characteristics such as a person's gender, marital status, hometown, etc. It summarizes data by counting the number of observations in each category. Bar charts and pie charts are often used to visualize the distribution of categorical data.
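As a minimal sketch, counting category frequencies for a single categorical variable takes only a few lines of Python; the responses below are invented for illustration:

```python
from collections import Counter

# Hypothetical responses to a single categorical survey question
marital_status = ["single", "married", "single", "divorced",
                  "married", "single", "married", "married"]

counts = Counter(marital_status)                      # absolute frequencies
n = len(marital_status)
percentages = {k: round(100 * v / n, 1) for k, v in counts.items()}

print(counts.most_common())  # [('married', 4), ('single', 3), ('divorced', 1)]
print(percentages)           # {'single': 37.5, 'married': 50.0, 'divorced': 12.5}
```

These counts and percentages map directly onto the bar or pie charts mentioned above.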

Measures in Univariate Analysis

When performing univariate analysis, several statistical metrics are commonly used:

  • Frequency Distribution: This is a summary of how often each value appears in the dataset. For categorical data, it may be a count or percentage of individuals in each category.
  • Central Tendency: This includes the mean, median, and mode of the data, which represent the center point of the dataset.
  • Variability: This includes the range, variance, and standard deviation, which provide insights into the spread of the data.
  • Skewness and Kurtosis: These provide information about the asymmetry and peakedness of the data distribution, respectively.
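These metrics can be sketched with Python's standard statistics module; the scores are invented, and skewness is computed by hand since the module does not provide it:

```python
import statistics as st

# Hypothetical exam scores for a single numerical variable
scores = [55, 60, 62, 65, 70, 71, 74, 80, 85, 98]

mean = st.mean(scores)                   # 72
median = st.median(scores)               # 70.5
variance = st.variance(scores)           # sample variance (n - 1 denominator)
stdev = st.stdev(scores)
data_range = max(scores) - min(scores)   # 43

# Mode needs repeated values, shown on a separate toy list
mode = st.mode([1, 2, 2, 3, 3, 3])       # 3

# Moment-based skewness: third central moment / stdev**3 (population version)
n = len(scores)
m2 = sum((x - mean) ** 2 for x in scores) / n
m3 = sum((x - mean) ** 3 for x in scores) / n
skewness = m3 / m2 ** 1.5                # > 0 here: the distribution leans right

print(mean, median, data_range, round(skewness, 3))
```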

Graphical Representation in Univariate Analysis

Graphical representations provide a visual summary of the data, which can be more intuitive and reveal trends, outliers, and patterns that may not be apparent from numerical statistics alone. Common graphs used in univariate analysis include:

  • Histograms: Used for numerical data to show the frequency distribution.
  • Bar Charts: Used for categorical data to show the frequency or proportion of each category.

  • Box Plots: Provide a visual summary of the minimum, first quartile, median, third quartile, and maximum in a dataset.

Applications of Univariate Analysis

Univariate analysis has a wide range of applications, including:

  • Describing trends and patterns in sales figures, test scores, or any other type of quantitative data.
  • Summarizing survey responses where the questions have categorical answers.
  • Identifying data entry errors or outliers that may need to be addressed before conducting further analysis.
  • Providing businesses with a better understanding of customer behavior by analyzing a single variable such as purchase frequency.

Limitations of Univariate Analysis

While univariate analysis is useful for understanding the distribution and central tendencies of a single variable, it has limitations:

  • It doesn’t provide any insight into relationships between variables.
  • It can’t identify causality or correlation.

  • It may not provide a complete understanding of complex data sets that require multivariate analysis.

Univariate analysis is a key step in any statistical data analysis. It provides a foundation for understanding the basic features of the data at hand. By summarizing and visualizing data distributions, univariate analysis helps analysts and researchers to make informed decisions about further analysis and potential actions based on the data. However, for a more comprehensive understanding of data, especially when dealing with multiple variables and their relationships, multivariate analysis is necessary.


Data Analysis in Research: Types & Methods


What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. The second is data reduction, in which summarization and categorization together help find patterns and themes in the data for easy identification and linking. The third is the analysis itself, which researchers carry out in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that “the data analysis and data interpretation is a process representing the application of deductive and inductive logic to the research and data analysis.”

Researchers rely heavily on data as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but an answer to that question. But, what if there is no question to ask? Well! It is possible to explore data even without a problem – we call it ‘Data Mining’, which often reveals some interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience's vision guide them to find the patterns that shape the story they want to tell. One of the essential things expected from researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, sometimes data analysis tells the most unforeseen yet exciting stories that were not expected when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Every kind of data has the quality of describing things once a specific value is assigned to it. For analysis, you need to organize these values, process them, and present them in a given context to make them useful. Data can take different forms; here are the primary data types.

  • Qualitative data: When the data presented consists of words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be distinguished into categories, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, etc. all come under this type of data. You can present such data in graphical formats or charts, or apply statistical analysis methods to it. The OMS (Outcomes Measurement Systems) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: It is data presented in groups. However, an item included in the categorical data cannot belong to more than one group. Example: A person responding to a survey by telling his living style, marital status, smoking habit, or drinking habit comes under the categorical data. A chi-square test is a standard method used to analyze this data.
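As an illustration of the chi-square test mentioned above, the goodness-of-fit statistic can be computed by hand; the observed counts are hypothetical, and the null hypothesis of equal category counts is chosen only for the example:

```python
# Hypothetical observed counts for a categorical variable (smoking habit)
observed = {"smoker": 30, "ex-smoker": 20, "never": 50}
total = sum(observed.values())
expected = total / len(observed)   # uniform null hypothesis: equal counts

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1             # degrees of freedom: categories - 1

print(round(chi_square, 2), df)    # 14.0 2
```

The statistic would then be compared to the chi-square critical value with 2 degrees of freedom (about 5.99 at the 5% level), so the uniform hypothesis would be rejected here.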


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from the analysis of numerical data, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complex information is a complicated process; hence it is typically used for exploratory research and data analysis.

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is manual. Here the researchers usually read the available data and find repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find  “food”  and  “hunger” are the most commonly used words and will highlight them for further analysis.


The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example , researchers conducting research and data analysis for studying the concept of ‘diabetes’ amongst respondents might analyze the context of when and how the respondent has used or referred to the word ‘diabetes.’

The scrutiny-based technique is also one of the highly recommended  text analysis  methods used to identify a quality data pattern. Compare and contrast is the widely used method under this technique to differentiate how a specific text is similar or different from each other. 

For example: To find out the “importance of a resident doctor in a company,” the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method to analyze polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable Partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations from enormous amounts of data.


There are several techniques to analyze the data in qualitative research, but here are some commonly used methods,

  • Content Analysis: It is widely accepted and the most frequently employed technique for data analysis in research methodology. It can be used to analyze documented information from text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions shared by people are focused on finding answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. Nevertheless, this particular method considers the social context in which the communication between the researcher and respondent takes place. In addition, discourse analysis also focuses on the lifestyle and day-to-day environment while deriving any conclusion.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, grounded theory is the best resort for analyzing qualitative data. Grounded theory is applied to study data about a host of similar cases occurring in different settings. When researchers use this method, they might alter explanations or produce new ones until they arrive at some conclusion.


Data analysis in quantitative research

The first stage in research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets the pre-set standards or is a biased sample. It is divided into four stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent answered all the questions in an online survey or, in an interview, that the interviewer asked every question devised in the questionnaire

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in some fields incorrectly or skip them accidentally. Data editing is a process wherein the researchers confirm that the provided data is free of such errors. They need to conduct necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to the survey responses. If a survey is completed with a sample size of 1000, the researcher might create age brackets to distinguish respondents based on their age. It then becomes easier to analyze small data buckets than to deal with the massive data pile.
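The age-bracket coding described above might be sketched as follows; the bracket boundaries are illustrative assumptions, not a standard:

```python
# Data coding sketch: map each respondent's age to a bracket label
def age_bracket(age):
    if age < 18:
        return "under 18"
    elif age < 30:
        return "18-29"
    elif age < 45:
        return "30-44"
    elif age < 60:
        return "45-59"
    return "60+"

ages = [22, 35, 17, 61, 45, 29]
coded = [age_bracket(a) for a in ages]
print(coded)  # ['18-29', '30-44', 'under 18', '60+', '45-59', '18-29']
```

Once coded this way, the bracket labels can be tabulated and analyzed like any other categorical variable.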


After the data is prepared for analysis, researchers are open to using different research and data analysis methods to derive meaningful insights. Statistical analysis plans are by far the most favored for analyzing numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. The method is again classified into two groups: first, descriptive statistics, used to describe data; and second, inferential statistics, which help in comparing data.

Descriptive statistics

This method is used to describe the basic features of versatile types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not support generalizing beyond the data at hand; its conclusions are again based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to summarize a distribution by its central points.
  • Researchers use this method when they want to showcase the most commonly or averagely indicated response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range is the difference between the highest and lowest scores.
  • Variance and standard deviation capture the average difference between each observed score and the mean.
  • It is used to identify the spread of scores by stating intervals.
  • Researchers use this method to showcase how spread out the data is and the extent to which that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores helping researchers to identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.
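A percentile rank can be computed directly; the sketch below uses one common definition (the share of scores below a value, plus half of any ties) with invented scores:

```python
# Percentile rank of a value within a list of scores
def percentile_rank(scores, value):
    below = sum(1 for s in scores if s < value)
    equal = sum(1 for s in scores if s == value)
    return 100 * (below + 0.5 * equal) / len(scores)

scores = [40, 55, 60, 60, 70, 75, 80, 90, 95, 100]
print(percentile_rank(scores, 60))  # 30.0 (2 below + half of 2 ties, out of 10)
```

Other definitions exist (for example, counting all ties as below), so reported percentile ranks can differ slightly between tools.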

For quantitative research, the use of descriptive analysis often gives absolute numbers, but such analysis alone is never sufficient to demonstrate the rationale behind those numbers. Nevertheless, it is necessary to think of the best method for research and data analysis suiting your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students' average scores in schools. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided sample without generalizing it. For example, when you want to compare the average voting done in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample representing that population. For example, you can ask some 100-odd audience members at a movie theater if they like the movie they are watching. Researchers then use inferential statistics on the collected sample to reason that about 80–90% of people like the movie.
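The movie-theater example can be made concrete with a normal-approximation confidence interval for a proportion; the sample numbers below are hypothetical, and the approximation assumes a reasonably large sample:

```python
import math

n = 100          # hypothetical moviegoers sampled
liked = 85       # how many said they liked the movie
p_hat = liked / n

z = 1.96         # z value for an approximate 95% confidence level
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - margin, p_hat + margin

# Sample proportion 0.85, interval roughly (0.78, 0.92)
print(f"{p_hat:.2f} ({low:.2f}, {high:.2f})")
```

The interval quantifies how far the population proportion might plausibly be from the sample estimate, which is exactly the generalization step that inferential statistics supports.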

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and demonstrates something about the population parameter.
  • Hypothesis test: It's about sampling research data to answer the survey research questions. For example, researchers might be interested to understand whether a newly launched shade of lipstick is good or not, or whether multivitamin capsules help children perform better at games.

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. It is often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but are interested in understanding the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables,  cross-tabulation  is used to analyze the relationship between multiple variables.  Suppose provided data has age and gender categories presented in rows and columns. A two-dimensional cross-tabulation helps for seamless data analysis and research by showing the number of males and females in each age category.
  • Regression analysis: For understanding the strength of the relationship between two variables, researchers rarely look beyond the primary and commonly used regression analysis method, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable, along with one or more independent variables. You undertake efforts to find out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to have been ascertained in an error-free random manner.
  • Frequency tables: A frequency table summarizes how often each value or category occurs in the data, giving researchers a quick view of the distribution of responses before applying further tests.
  • Analysis of variance: The statistical procedure is used for testing the degree to which two or more vary or differ in an experiment. A considerable degree of variation means research findings were significant. In many contexts, ANOVA testing and variance analysis are similar.
  • Researchers must have the necessary research skills to analyze and manipulate the data, and they should be trained to demonstrate a high standard of research practice. Ideally, researchers should possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Usually, research and data analytics projects differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps design a survey questionnaire, select data collection methods , and choose samples.
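To illustrate the regression method listed above with one independent variable, ordinary least squares can be fitted by hand; the data points are invented:

```python
# Simple linear regression y = a + b*x via ordinary least squares
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b * mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(round(a, 3), round(b, 3))  # intercept ~0.09, slope ~1.99
```

The fitted slope estimates how much the dependent variable changes per unit of the independent variable, which is the impact that regression analysis is meant to quantify.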


  • The primary aim of data research and analysis is to derive ultimate insights that are unbiased. Any mistake in collecting data, or a biased mind while collecting data, selecting an analysis method, or choosing an audience sample, is likely to draw a biased inference.
  • No amount of sophistication in research and data analysis can rectify poorly defined objectives or outcome measurements. Whether the design is at fault or the intentions are not clear, a lack of clarity might mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find a way to deal with everyday challenges like outliers, missing data, data altering, data mining , or developing graphical representation.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018 alone, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in the hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them a medium to collect data by creating appealing surveys.


Univariate Analysis: Variance, Variables, Data, and Measurement

  • Open Access
  • First Online: 04 October 2022


  • Mark Tessler

Part of the book series: SpringerBriefs in Sociology ((BRIEFSSOCY))


Why are some people in the Arab region more likely than others to vote in elections? Why do some countries but not others have higher levels of satisfaction with their healthcare systems? Why is domestic violence more prevalent in some communities or neighborhoods than others? What causes some people but not others to become more politically engaged, or less politically engaged, over time? Every day, we come across various phenomena that make us question how, why, and with what implications do they vary across people, countries, communities, and/or time. These phenomena—e.g. voting, satisfaction with health care, domestic violence, and political engagement—are variables, and the variance they express is the foundation and the point of departure for positivist social science research. The present chapter considers variables one at a time and focuses on descriptions of variance.


2.1 Thinking About Research

Why are some people in the Arab region more likely than others to vote in elections? Why do some countries but not others have higher levels of satisfaction with their healthcare systems? Why is domestic violence more prevalent in some communities than others? What causes some people but not others to become more politically engaged, or less politically engaged, over time? Every day, we come across various phenomena that make us question how, why, and with what implications do they vary across people, countries, communities, and/or time. These phenomena—e.g. voting, satisfaction with health care, domestic violence, and political engagement—are variables, and the variance they express is the foundation and the point of departure for positivist social science research. Accordingly, they are “variables of interest” in research projects that are motivated by this variance and that seek to answer questions of the kind listed above.

Most research projects begin by describing the variance referenced by the variable or variables of interest. This is description, which is the focus of the present chapter. The research project usually then goes on to propose and evaluate hypotheses about factors that account for some of the variance on the variable of interest. This is explanation, which will be the focus of Chaps.  3 and 4 .

In many research projects, the concern for explanation, as expressed in hypotheses, offers a causal story about some of the determinants of the variance on the variable of interest. It offers a cause and effect story in which the variable of interest is the effect. Alternatively, the goal of a research project motivated by the variance on the variable of interest may not be why it varies as it does, but rather, what difference does it make. This is also a cause and effect story, but this time the variable of interest is the cause.

For example, if aggregate voter turnout by country is the variable of interest, an investigator might ask and try to answer the question of why citizens of some countries are more likely to vote than are citizens of other countries. One possibility she might consider, offered only as an illustration, is that higher levels of corruption incentivize voting, such that voter turnout will be higher in countries with more corruption. An investigator might also, or instead, be interested in whether voter turnout is itself a determinant of the variance on a particular and presumably important issue. She might advance for testing the proposition that greater voter turnout helps to explain why some countries have more developed and better-quality infrastructure than other countries.

Before an investigator decides on a variable (or variables) of interest and begins a research project, she must consider why the topic is important, and not only to her but also to the broader society and global community. She must also decide on the causal story she will investigate. Will she seek to identify important determinants of the variance on the variable of interest, and then explicate the ways, or mechanisms, by which these determinants exert influences? Or will she choose to consider the variable of interest as a determinant, and then investigate whether, and also how, the variable of interest exerts influence on the variance of other variables?

As stated, the variable of interest has been chosen because the investigator considers it important, and because identifying its relationships with other variables will contribute to a better understanding of political, social, or economic life. The investigator will also want to consider what has been said about the topic in the scholarly literature, and what her investigations have the potential to add to this literature. She will ask whether the topic has for the most part been overlooked, and if not, whether findings from previous research are persuasive or appear to be flawed, or whether there are knowledge gaps that her research can help to fill. By choosing important topics and investigating how and why certain phenomena vary, social science research can make valuable contributions and enrich our knowledge and understanding of societal dynamics.

The variables and variable relationships mentioned above are fictitious, provided only to illustrate that positivist social science research usually begins with the designation of a variable of interest and a description of the way it varies, then proceeds to investigate the nature and direction of its relationships, and very often its causal relationships, with other variables. Of course, designing a research project involves much more than selecting a variable of interest and specifying its relationships to other variables, and this will be the focus of Chaps. 3 and 4 . Readers should keep a concern for explanation in mind as they engage with this chapter’s emphasis on description.

2.2 Variance and Variables

2.2.1 The Concept of Variance

Once you have decided on a research topic, or while you are deciding whether a particular topic is of interest, the first objective of every researcher should be to understand how a variable varies. Thus, a central preoccupation of this chapter is with the concept of variance, with discovering and then presenting information about the way that the subject or subjects of interest vary. The chapter focuses, therefore, on univariate analysis, that is to say, variables taken one at a time.

The concept of variance is a foundational building block of a positivist approach to social and political inquiry, an approach that refers to investigations that rely on empirical evidence, or factual knowledge, acquired either through direct observation or measurement based on observable indicators. Positivist social science research does not always limit itself to considering variables one at a time, of course. As discussed briefly in the introductory chapter and as a central preoccupation of Chaps. 3 and 4 , discerning relationships and patterns of interaction that connect two or more variables is often the objective of inquiry that begins with taking variables one at a time. Often described as theory-driven evidence-based inquiry, positivist social science research that begins with separate descriptions of variance on relevant phenomena, that is to say on the variables of interest to an investigator, frequently does so in order to establish a base for moving from description to explanation, from discerning and describing how something varies to shedding light on why and/or with what implications it varies.

Although discerning and describing variance may not in these instances be the end product of a social scientific investigation, being rather the point of departure for more complex bivariate and multivariate analyses concerned with determinants, causal stories, and conditionalities, it remains important to be familiar with working with variables one at a time. Relevant to this topic are: sources and methods of collecting data on variables of interest; the development of measures that are valid and reliable and capture the variance of interest to an investigator, sometimes requiring the use of indicators to measure abstract concepts that cannot be directly observed; and the use of statistics and graphs to summarize and display variance in order to parsimoniously communicate findings. These are among the topics to which the present chapter devotes attention.

It must also be added that descriptive analysis, that is to say measuring and reporting on the variance associated with particular concepts or variables, need not always be the first stage in an investigation with multivariate objectives. It can be, and often is, an end in and of itself. When the phenomenon being investigated is important, and when the structure and/or extent of its variance are not already known, either in general or in a particular social or political environment, describing that variance need not be merely the first step on a multi-step investigatory journey; it can be, in and of itself, the goal of a research project with its own inherent significance.

Finally, while the principal preoccupation of this chapter and the remaining chapters is with what should be done once an investigator has decided on a research question or variables of interest, the first half of this chapter may also be helpful in choosing a research topic. The following sections discuss how to think about and describe variance, not only on the variable(s) of interest but also on other variables that will be included in the research project. Attentiveness to variance, along with the considerations discussed earlier, will help an investigator to think about a research topic and the design of her research project.

2.2.2 Units of Analysis and Variance

Positivist social science research can be conducted with both quantitative and qualitative data and methods, each of which has strengths and limitations. Some topics and questions are best addressed using quantitative data and methods, while others are better suited to qualitative research. Still other researchers utilize both quantitative and qualitative data and methods, often using insights derived from qualitative research to better understand patterns and variable associations that result from the analysis of quantitative data.

This chapter, as well as those that follow, places emphasis on social science research that works with quantitative data. In part, this is because of the volume’s connection to the Arab Barometer and the ready availability of the Barometer’s seven waves of survey data. Nevertheless, the concept of variance also occupies a foundational position in positivist social science research that works with qualitative data. For this reason, we briefly discuss qualitative research later in the chapter and illustrate its value with several examples. Footnote 1 We also discuss what to consider when choosing between different styles of research and types of data.

In this section, we present a few examples from Arab Barometer surveys and other data sources to highlight the importance of describing and understanding the variance of certain phenomena. These examples, which use quantitative data, will also reintroduce the notion of a unit of analysis , which is the entity being studied whose status with respect to the variance is being measured. The unit of analysis in studies based on Arab Barometer data is usually, although not always, the individual. These studies investigate how (and very often also why, as discussed in Chaps. 3 and 4 ) individuals give different responses to the same survey question. An investigator would in effect be asking, what is the range of ways that individuals, that is to say, respondents, answered a question; and how many, or what proportion, of these respondents answered the question in each of the different ways that it could be answered.

Although still somewhat limited in comparison to most other world regions, the number of systematic and high-quality surveys in Arab countries is growing, as is the number of published studies based on these surveys. For example, a study of “Gender Ideals in Turbulent Times,” published in Comparative Sociology in 2017, used Arab Barometer data to describe the gender-related attitudes of men in Algeria, Egypt, Tunisia, and Yemen. After describing the variance on the attitudes of men in each country, the authors considered the impact of religiosity on this variable of interest and found that the impact of religiosity on attitudes about women varies in instructive ways across the four countries. Footnote 2

Another example, and one with objectives that clearly involve description, uses data from earlier surveys in Kuwait in 1988, 1994, and 1996 to map continuity and change in Kuwaiti social and political attitudes. Led by a team of Kuwaiti, Egyptian, and American scholars, and published in International Sociology in 2007, the study found, among other things, that support for democracy increased over time but attitudes pertaining to the status of women did not change. Consistent with their focus on description, the authors note that their study serves as a “baseline” for later research seeking to take account of differences in Kuwait and other nations. Footnote 3

Country is another commonly used unit of analysis in social science research, including quantitative work. Studies in which country is the unit of analysis might compute a country-level measure by aggregating data on the behavior or attitudes of the individuals who live in that country. Footnote 4 For example, respondents in a nationally representative survey of citizens of voting age might be asked if they voted in a given election, and responses might then be aggregated to develop a country-level measure of voter turnout. In comparing countries for descriptive and/or explanatory purposes, the country-level measure might be a single value based on an average, such as the percent who voted. Or it might involve the comparison of response distributions across the countries included in the study.
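The aggregation described above, from individual-level responses to a country-level turnout measure, can be sketched in a few lines of Python. The individual-level records here are invented for illustration only; in practice they would come from a survey dataset such as the Arab Barometer.

```python
from collections import defaultdict

# Hypothetical individual-level records, invented for illustration:
# (country, voted) pairs, where voted is 1 if the respondent says she voted.
responses = [
    ("Jordan", 1), ("Jordan", 0), ("Jordan", 1),
    ("Tunisia", 0), ("Tunisia", 0), ("Tunisia", 1),
]

# Aggregate individual responses into a country-level turnout measure.
totals = defaultdict(lambda: [0, 0])  # country -> [votes, respondents]
for country, voted in responses:
    totals[country][0] += voted
    totals[country][1] += 1

turnout = {c: 100 * votes / n for c, (votes, n) in totals.items()}
# turnout["Jordan"] is about 66.7 percent; turnout["Tunisia"] about 33.3 percent
```

The same pattern extends to any country-level measure built by averaging or totaling individual responses.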

Measures of both kinds have been used, for example, in the reports of Arab Barometer findings that have been published in Journal of Democracy after each wave of surveys. In 2012, for instance, JoD published “New Findings on Arabs and Democracy.” The article presented and compared findings about attitudes and understandings related to democracy and about Islam’s role in political affairs in countries included in the first and second wave of Arab Barometer surveys. It reported, for instance, that the percentage of ordinary citizens agreeing that “it would be better if more religious people held public office” varied from a low of 17.6 percent in Lebanon to a high of 61.0 percent in Palestine in the first wave of surveys, and in the second wave, from a low of 14.3 percent in Lebanon to a high of 61.1 percent in Yemen. Footnote 5 Thus, the individual-level data from the Arab Barometer survey was aggregated by country to create statistics at the country-level.

Of course, measuring variance across countries, making country the unit of analysis, in other words, does not involve only the aggregation of individual-level data about the country’s citizens. Numerous commonly used country-level measures are produced by important international institutions, such as the United Nations, the World Bank, the Arab League, and many others. The measures themselves are numerous and very diverse, ranging, for example, from Gross Domestic Product to the UN’s Human Development Index, which is based on the proportion of school aged children actually in school, the unemployment rate, and other quality of life indicators.

Without attempting to be comprehensive, and for the broader purpose of insisting on the need to be self-aware and designate the unit of analysis in systematic social science research, it may be useful to present a small number of additional examples. On the one hand, there are quasi-academic institutions that present and regularly update ratings of countries on important concepts and variables. One example among many is Freedom House, which rates countries each year with respect to political rights and civil liberties. It awards 0 to 4 points for each of 10 political rights indicators and 15 civil liberties indicators, giving a total score from 0 to 100. In 2019, it awarded a total of 34 points to Algeria, 37 points to Morocco, and 70 points to Tunisia.

Country-level measures are also produced by individual scholars and scholarly teams for use in data-based research on particular topics or issues. A good example is the scholarly literature on what has been called the “Resource Curse,” which considers the proposition that oil and mineral wealth impedes democracy. There are active debates about both theory and method in this field of research, and one result has been the development of country-level measures of key concepts and variables. Among the measures developed by an important early study, for example, is an index of oil-reliance. Scores for the 25 most oil-reliant countries at the time of the study ranged from 47.58 (Brunei) to 3.13 (Colombia). Among Arab countries, Kuwait was judged to be the most oil-reliant and given a score of 46.14. Other Arab countries judged sufficiently oil-reliant to be rated include Bahrain (45.60), Yemen (38.58), Oman (38.43), Saudi Arabia (33.85), Qatar (33.85), Libya (29.74), Iraq (23.48), Algeria (21.44), and Syria (15.00). Footnote 6 Although subsequently used in multivariate analysis to test resource curse hypotheses, this country-level index, like numerous others in which country is the unit of analysis, offers valuable descriptive information.

Individual and country are not the only units of analysis, of course. Among the many others are community and group, with numerous possibilities for describing the attributes with respect to which these units vary. Size, location, administrative structure, ethnicity and/or religion, and economic well-being are only a few of the possibilities. Each attribute is a concept and variable with respect to which the units—communities, groups—differ, and descriptions of this variance in the form of univariate distributions can be very useful. An innovative example comes from a study of Lebanese communities that sought to assign to each community a measure pertaining to public goods provision and also to governance structure. In advance of exploring the connection between these two variables, the investigator needed to develop measures of each and present these in univariate descriptions. Interestingly, an attribute related to community governance that turned out to be particularly important was whether the community was dominated by a single faction or whether there was competition for community leadership. Footnote 7 The cross-community variance associated with this concept—community governance—was mapped in the initial, descriptive portion of the project, which involved univariate analysis and variables being considered one at a time.

2.2.3 Univariate Distributions

Before continuing the discussion of variance, and also taking a brief detour into working with qualitative data, the nature and value of univariate analysis and the presentation of descriptive information can be further illustrated by presenting univariate distributions of answers to four questions asked in Arab Barometer surveys. First, we’ll look at responses to two questions about the Islamic State; and second, we’ll consider responses to two questions about sexual harassment and domestic violence. These issues are obviously important, and in all four cases, variance in the experience, behavior, and attitudes of ordinary citizens is at best imperfectly known. Accordingly, particularly since samples in Arab Barometer surveys are probability-based and nationally representative, there can be little doubt that univariate distributions of responses given by the individuals interviewed by the Arab Barometer team are valuable and instructive with regard to Arab society at large.

The first example, which was explored in the fourth wave of Arab Barometer surveys, in 2016–2017, concerns Arab attitudes toward the Islamic State. Findings, presented in the top half of Table 2.1 , are based on surveys in Jordan, Lebanon, Palestine, Tunisia, Algeria, and Morocco, taken together. The table shows that the overwhelming majority of those interviewed have very negative attitudes toward the Islamic State. At the same time, small minorities agree with the goals of the Islamic State and believe its actions to be compatible with the teachings of Islam.

The second example, which was explored during the fifth and sixth waves of Arab Barometer surveys, in 2018–2019 and 2020–2021, deals with sexual harassment and domestic violence, very important issues about which the variance within and across countries in the Arab world (and elsewhere) is not well-known. Accordingly, once again, discerning and then describing the variance with respect to relevant experiences or behaviors make a very valuable social scientific contribution, and this is quite apart from whatever, if anything, might be learned through bivariate and multivariate analyses in subsequent phases of the research. The lower half of Table 2.1 , based on all of the respondents in the 12 countries surveyed in Wave V of the Arab Barometer, shows how people answered questions about unwanted sexual advances and in-household physical abuse. It shows that substantial majorities have not experienced physical sexual harassment and do not reside in a household in which there has been domestic violence.

Respondent answers to survey questions can be and frequently are aggregated by country for research projects in which country is the unit of analysis. Table 2.2 shows the distribution by country of one of the Wave IV questions about the Islamic State and one of the Wave V questions about sexual harassment and domestic violence. The construction of univariate distributions in which country is the unit of analysis is simple and straightforward. In most cases, it involves simply totaling the responses of everyone in the country who was surveyed and then calculating the distribution of percentages. Less straightforward, in some instances, is deciding which unit of analysis is most appropriate for the description (and explanation) of the variance that an investigator seeks to discern.

A third example from Arab Barometer data further illustrates the choice among units of analysis that a researcher may have to make. The variance in this case refers to voting, and specifically to whether or not the respondent voted in the last parliamentary election in her country. Based on responses to the question about voting in Wave V surveys, and again aggregating data from the surveys in Jordan, Lebanon, Palestine, Tunisia, Algeria, and Morocco, 45.5 percent of the individuals interviewed say they voted and 54.5 percent say they did not. This may be exactly what an investigator wishes to know, and it may be a point of departure for a study that asks about the attributes of individuals who are more likely or less likely to vote.

Alternatively, an investigator may not be very interested in how often individuals vote but rather in the variance in voting across a sample of countries. In this case, the variable of interest to a researcher is voter turnout; and as seen in Table 2.2, turnout ranges across the six countries from a low of 20.8 percent in Algeria to a high of 63.8 percent in Lebanon. Whether it is individual-level voting or country-level turnout that references the variance of interest to an investigation, and whether, therefore, the relevant unit of analysis is the individual or the country, depends, of course, on the goals of the researcher. Either one may be most relevant, and it is also possible that both will be relevant in some studies.

Readers are encouraged to access the Arab Barometer website, arabbarometer.org, and take a closer look at the data. The website’s online analysis tool permits replication of the response distributions shown in Tables 2.1 and 2.2, as well as access to responses to topically related questions not shown in these tables. The tool also permits simple mapping operations that involve disaggregating the data and examining the variance that characterizes specific subsets of the population, such as women, older individuals, or less religious individuals.

2.2.4 Qualitative Research

While this volume places emphasis on social science research that works with quantitative data, the concepts of variance and unit of analysis also occupy a foundational position in positivist social science research that works with qualitative data. In positivist qualitative social science research, as in quantitative research, the initial objective is to discern the various empirical manifestations of each concept of interest to an investigator, and then to assign each unit on which the investigator has data to one of the empirical manifestations of each concept. The resulting frequency or percentage distributions provide potentially valuable descriptive information, as they do in quantitative research.

As in quantitative research, the objectives of a qualitative study may be descriptive, in which case no more than univariate distributions are needed. Alternatively, again as in quantitative research, these distributions on variables of interest to the investigator may be the beginning stage of research projects that aspire to explanation as well as description, and that, for this reason, anticipate bivariate and/or multivariate analysis. In any of these instances, the point that deserves emphasis is that the notion of variance is central to most positivist inquiry, be it quantitative or qualitative.

A small number of examples involving qualitative research may be mentioned very briefly to illustrate this point. Among these are two studies based on fieldwork in Lebanon, one by Daniel Corstange of Columbia University and one by Melani Cammett of Harvard University. Both projects included the collection and analysis of qualitative data.

The unit of analysis in the Corstange study is the community, some of which were villages and some of which were neighborhoods in larger agglomerations. The variables with respect to which these communities were classified—hence, the variance that Corstange sought to capture—included the inter-religious confessional composition of the community and whether its leadership structure involved competition or was dominated by one group. These qualitative distinctions with respect to which each community was classified were part of Corstange’s larger goal of explaining why some communities fared better than others in obtaining needed public goods, such as electricity and water. Footnote 8

The Cammett study involved the construction of a typology based on two variables taken together. Typologies almost always involve qualitative distinctions, even if one or both of the variables used in their construction are themselves quantitative. Typologies can be particularly useful in conceptualizing and measuring variance that involves more than a single dimension.

In the Cammett study, the unit of analysis was the welfare association, more formally defined as a domestic non-state welfare provider, and each was classified according to the presence or absence of a linkage to a political organization and also to an identity-based community. The concatenation of these two dichotomous distributions yielded four categories, each representing a particular “type” of welfare society based on its political and confessional connections taken together. Cammett’s distinctions with respect to type, as is usually the case with typologies, reference qualitative variance. Among the larger goals of the Cammett study, based on the proposition that the motivations of Lebanese welfare societies are not entirely charitable, was to discern whether and how welfare society type was related to the characteristics of those that the association tended to serve and how it made decisions. Footnote 9

The unit of analysis in another study, conducted in Palestine by Wendy Pearlman of Northwestern University, was the Palestinian national movement. The project focused on the movement’s resistance to the Zionist project prior to Israeli independence and to Israel’s occupation of the West Bank and Gaza following the war of June 1967. The variable of interest to Pearlman was whether resistance activities were essentially non-violent or included significant violence, often directed at Israeli citizens who did not live in the West Bank or Gaza. Pearlman gathered information about resistance activities over time, beginning with the post-World War I period, and then classified each instance of resistance according to whether the national movement used non-violent or violent methods in pursuit of its goals. The larger goal of Pearlman’s research project was not only to describe the variance in national movement resistance activities but also to test hypotheses about determinants of this variance. Footnote 10

A final example is provided by an older but very important study by Tunisian sociologist Elbaki Hermassi. Hermassi’s project focuses on Tunisia, Algeria, and Morocco, and country is the unit of analysis. An important qualitative variable in Hermassi’s study is the character of the governing regime at independence, 1956 for Tunisia and Morocco and 1962 for Algeria. Tunisia at independence was governed by Western-educated leaders who were supported by a mass-membership political party; in Algeria, the country was led by a military-civilian coalition backed by the military and without a popular grass-roots institutional base; and in Morocco, the king sat at the top of a political system that included a parliament in which the largest party had Islamist origins. Footnote 11

Hermassi uses this country-level variation in political regime to address and answer an important question: Why did the three countries arrive at independence with such differing political systems? After all, each was part of the Arab west and each was colonized by the French. To answer this question, Hermassi takes his readers on a sophisticated and insightful historical journey that can only be hinted at here. He identifies and describes differences among the three countries—making distinctions that are also qualitative—at critical historical periods and junctures. These include differences in pre-colonial society, differences in the character of French colonialism, and differences in the origins and leadership of the nationalist movement and the struggle for independence. The classification of the three countries during each of these time periods is anchored in thick description and extensive historical detail.

These qualitative differences between the three North African countries define a multi-stage temporal sequence through which Hermassi and his readers travel using a method known as process tracing. Country-level qualitative differences during one time period help to explain country-level qualitative differences during the time period that followed, leading in the end to an explanation of the reasons that the countries began their respective political lives at independence with very different governing regimes.

Although brief, and beyond the central, quantitative, focus of this research guide, this overview of qualitative social science research suggests several take-aways, all of which apply to quantitative social science research as well. One is that the concept of variance is no less relevant to qualitative social science investigations than it is to quantitative social science research. A second is that measuring and describing qualitative variance still involves specifying the unit of analysis. A third is that typologies, which make qualitative distinctions among units of analysis, are a useful technique for capturing the variance on concepts defined by more than one attribute or experience. A fourth is that the variance being measured may be among different entities at the same point in time, among the same entity at different points in time, or among different entities at different points in time. And finally, these examples illustrate the importance of fieldwork and deep knowledge of the circumstances of the unit of analysis and variables on which the research project will focus.

2.2.5 Descriptive Statistics

While we often want to know only whether a variable has very much or very little variance across our units of analysis, it can also be useful to understand how to calculate variance mathematically. In addition, we may also want to describe a variable’s distribution of values in other ways, such as giving the average value of a variable or identifying the value in a distribution that occurs most frequently (the mode). Two ways of describing a variable’s distribution are its central tendency and its dispersion, and there are descriptive statistics for both that can be calculated mathematically.

Measures of central tendency are the mean, the median, and the mode. The mean, or average, is the sum of all observations for a variable divided by the total number of observations. The median is the “middle” value in a variable’s distribution of values; it is the value that separates the higher half from the lower half of the values in a distribution. The mode is the value in a distribution that appears most often.

It is important to understand that measures of central tendency—the mean, median, and mode—do not tell us how spread out a distribution of values is. The values might be clustered in the middle, spread out evenly, or clustered at the extremes, with each of the distributions having the same mean. Measures of dispersion can be calculated to determine the degree to which the values of a variable differ from the mean, or how spread out the distribution is. Two of the most important measures of dispersion are the variance and standard deviation . The standard deviation, which is the square root of the variance, is one of the most frequently used ways to determine and show the dispersion of a distribution. The interquartile range is another measure of dispersion. It shows how spread out the middle 50 percent of the distribution is by subtracting the value of the 25th percentile from the value of the 75th percentile.

Calculating the Variance

The variance expresses the degree to which the values in a variable’s distribution of values differ from one another. As a descriptive statistic, it is a measure of how much the values differ from the mean. For example, if satisfaction with a country’s healthcare system is measured on a scale of 1 to 4, with 4 indicating a high level of satisfaction, and if every individual in a study chooses 4 to express her opinion, there is no variance. The mean will be 4 and none of the ratings given by these individuals differs from the mean. Alternatively, if some participants in the study choose 2, others choose 3, and still others choose 4 to express their opinion, there is variance. We calculate the degree of variance, or dispersion, by squaring the difference between the value and the mean for each participant in the study, then summing the squared deviations for all participants, and then dividing this sum by the number of participants minus 1. Footnote 12 These calculations are expressed by the following formula, and its application is shown in an exercise below.

\( s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \)

where \( s^2 \) is the variance, \( x_i \) is the value for each individual, \( \bar{x} \) is the mean, and \( n \) is the number of individuals.

In the example above, the unit of analysis is the individual. How much variance do you think there is in satisfaction with the healthcare system at the country level in the MENA region? Do you think it is about the same in every country, or do you think it is much higher in some countries and much lower in others? Offer only your “best guess.”

You can use the online analysis tool on the Arab Barometer’s website, following the steps shown below, to evaluate the accuracy of your best guess.

Click on the following link: https://www.arabbarometer.org/survey-data/data-analysis-tool/ .

Click “AB Wave V—2018,” click “Select all,” and click “See results.”

Click “Evaluation of public institutions and political attitudes.”

Click “Satisfaction with the healthcare system in your country.”

You can see that there is quite a bit of variance. Which two countries have the most different levels of satisfaction with their healthcare system? Which countries are most similar to each other? Do you have any ideas about why they are similar?

Do you think the variance across the countries would be greater, about the same, or lesser if the cross-country variance was calculated using only male respondents? You can evaluate your answer by adding a filter:

Click “Add filter”

Click “Gender”

Click “Apply”

An example based on the tables below, Tables 2.3 and 2.4 , further illustrates computation of the variance. In one case, the unit of analysis is the individual. In the other, country is the unit of analysis. The variable of interest in each case is satisfaction with the country’s healthcare system, which is measured on a 1 to 4 scale with 4 being very satisfied and 1 being very dissatisfied.

First are some questions based on Table 2.3 , in which the unit of analysis is the individual. You should be able to answer the questions without doing any major calculations.

What is the mean of the healthcare system ratings by the individuals in each of the three countries?

In which of the three countries is the variance of healthcare system ratings by individuals greatest? In which is the variance lowest?

We can now calculate the variance of ratings by the five individuals in Jordan. Thereafter, you may wish to calculate the variance of ratings by the five individuals in Iraq and in Tunisia. This will permit you to check the accuracy of your earlier estimates of rank with respect to magnitude of the variance among distributions for the three countries. To calculate the variance, we take the following steps:

Calculate the mean: Average satisfaction with the healthcare system in Jordan = sum of individual values/total number of observations = (4 + 4 + 3 + 1 + 2)/5 = 2.8

Calculate the sum of squared differences between each value and the mean: (4 - 2.8)² + (4 - 2.8)² + (3 - 2.8)² + (1 - 2.8)² + (2 - 2.8)² = 1.44 + 1.44 + 0.04 + 3.24 + 0.64 = 6.8

Divide the sum of the differences squared by the number of observations in the data set minus 1: 6.8/4 = 1.7

We have determined that the variance in satisfaction with the healthcare system in Jordan, with the individual as the unit of analysis, is 1.7.
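The three steps just worked through can be reproduced in a short Python sketch, using the five Jordanian ratings from Table 2.3:

```python
# Satisfaction ratings of the five Jordanian respondents (Table 2.3).
ratings = [4, 4, 3, 1, 2]

# Step 1: the mean, (4 + 4 + 3 + 1 + 2) / 5 = 2.8.
mean = sum(ratings) / len(ratings)

# Step 2: the sum of squared differences from the mean, 6.8.
squared_diffs = sum((x - mean) ** 2 for x in ratings)

# Step 3: divide by n - 1 to obtain the sample variance, 6.8 / 4 = 1.7.
variance = squared_diffs / (len(ratings) - 1)
```

Substituting the Iraqi or Tunisian ratings into the same sketch will check your earlier estimates of which country's distribution has the greatest and least variance.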

We turn now to variance at the country level based on Jordan, Iraq, Lebanon, and Palestine, as shown in Table 2.4 . This means we are interested in how much the average values of satisfaction with the healthcare system of these countries differ from each other. Before doing any calculations, do you think the variance will be high or low?

The first step in calculating variance at the country level is to calculate the mean of the “satisfaction with the healthcare system” country scores shown in Table 2.4 . Thereafter, following the procedures used when individual was the unit of analysis, the sum of the squared difference between the country-specific values and the mean of all countries together, divided by the number of units (countries) minus one gives the variance. These operations are shown below.

Mean = (3.3 + 2.6 + 3.5 + 2.6)/4 = 12/4 = 3

Variance = sum of squared differences between each value and the overall mean/number of observations minus 1 = ((3.3–3)² + (2.6–3)² + (3.5–3)² + (2.6–3)²)/3 = .66/3 = 0.22
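The country-level calculation can be checked the same way; the four values are the country means from Table 2.4:

```python
# The four country means for "satisfaction with the healthcare system".
country_means = [3.3, 2.6, 3.5, 2.6]

overall_mean = sum(country_means) / len(country_means)  # 12/4 = 3.0

# Sample variance of the country means: divide by n - 1.
variance = sum((m - overall_mean) ** 2 for m in country_means) / (len(country_means) - 1)

print(round(variance, 2))  # 0.22
```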

In this example, we see that there is variance in healthcare system satisfaction at the individual level and at the country level. However, variance at the individual level is much higher. Another way to think about this is to say that satisfaction with the healthcare system is more similar across countries than it is across individuals. What in your opinion is the significance of this unit of analysis difference in degree of variance? If you wanted to study healthcare satisfaction in the MENA region, how might this difference in degree of variance influence your research design?

Now, think about your own country or a country in the MENA region that you know well. What do you think is the average level of satisfaction with the healthcare system in this country? How much individual-level variance in healthcare system satisfaction do you think there is? What do you think causes, and thus helps to explain, this individual level variance?

2.2.6 Visual Descriptions

Investigators and analysts will often wish to see and show more about the distribution of a variable than is expressed by univariate measures of dispersion such as the variance or standard deviation, or more about a distribution than is expressed by a measure of dispersion and a measure of central tendency taken together. They may also wish to see and show exactly how the variance is distributed. Are the values on a variable clustered at the high and low extremes? Are they spread out evenly across the range of possible values? For these reasons, an analyst will often prepare a frequency distribution to show the variance.

Frequency distributions are a common way of looking at variance. A frequency distribution is a univariate table that shows the number of times each distinct value of a variable occurs. As shown for Jordan in Table 2.3, for example, the value “2” appears once and the value “4” appears two times. A percentage distribution shows the percentage of times a value appears in the data—the value “2” is 20 percent of all observations and the value “4” is 40 percent of all observations, as shown in Table 2.5.
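A frequency distribution and a percentage distribution can be built directly from the raw values; here is a short Python sketch using the five Jordan ratings:

```python
from collections import Counter

# The five Jordan ratings from Table 2.3.
ratings = [4, 4, 3, 1, 2]

# Frequency distribution: how often each distinct value occurs.
freq = Counter(ratings)

# Percentage distribution: each value's share of all observations.
pct = {value: 100 * count / len(ratings) for value, count in freq.items()}

print(freq[2], freq[4])  # 1 2
print(pct[2], pct[4])    # 20.0 40.0
```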

There are many other ways to visualize the data on a variable. A bar chart is a visualization of a frequency distribution, where the bars represent distinct (categorical) responses and the length or height of the bars represents the frequency of each. A histogram is similar to a bar chart, but it is used to display groupings of a variable’s values. A pie chart is similar, but the numerical values of a variable are represented by different “slices,” with each representing its proportion of the whole.

You will see examples of all of these visual descriptions in the exercise below. In the Arab Barometer’s online analysis tool, frequency distributions are on the left side of the page and charts and graphs are on the right side of the page. Above the charts and graphs on the right side of the page is a legend that permits selecting the particular type of chart or graph that is desired.

Exercise 2.1. Units of Analysis, Variance, and Descriptive Charts

Go to the data analysis tool on the Arab Barometer website using the following link: https://www.arabbarometer.org/survey-data/data-analysis-tool/

Choose a wave of the Arab Barometer. For this exercise, let’s click “AB Wave V—2018-2019.” Select the country or countries you want to examine. For this exercise, click “Select all.” Next, click on “See results.” This will bring you to a page that says “Select a Survey Topic.” Select “General topics.”

Let’s begin with looking at a variable at the individual-level of analysis: interpersonal trust. Respondents can choose among the following options: “Most people can be trusted,” “I must be very careful in dealing with people,” or “Don’t Know.”

Click on “Interpersonal Trust.” You should now see a table showing the frequency and percentage of responses. On the right side, you should also see a bar chart with the percentage of each response. Congratulations, you have just made two types of frequency distributions! We can see that there is variance in interpersonal trust in the MENA region: 15.7 percent of the individuals surveyed trust most people, while 83.1 percent do not trust others very much. What questions could we ask based on the fact that there is variance in interpersonal trust in the MENA region? We could consider other individual-level factors, such as “Does someone’s age affect how much they trust other people?” Or country-level factors, such as “Does the amount of corruption in the country in which people live affect how much they trust other people?”

We might also be interested in how interpersonal trust varies over time. To see this, click “Time series.” How does interpersonal trust vary between 2007 and 2019? Why do you think people started trusting others less after 2013?

Suppose you are considering doing research on the question: Does gender affect interpersonal trust in the MENA region? Click “Cross by” and select “Gender.” Describe the variance in how much men and how much women in the MENA region trust others.

Now, suppose you are interested in how much gender affects interpersonal trust in a certain country, in a certain age group, or in some other category. Click “Select countries,” then “Deselect all,” and then select Morocco. Describe the variance in interpersonal trust based on gender in Morocco.

What are the advantages or disadvantages of studying how gender affects interpersonal trust in the MENA region vs. in Morocco? Which study do you think would be more interesting, or more instructive? We see that the distributions for men and for women in the MENA region look almost identical: 15.6 percent of men trust most people, and 15.8 percent of women trust most people. On the other hand, in Morocco, 19 percent of men and 26 percent of women trust most people. You might conclude from this that you want to pursue this research question in Morocco, and not in the entire MENA region.

You may also be interested in how interpersonal trust varies at the country-level. To see this, click “Select countries,” “Select all,” and “Apply.” You should now see the average, or mean, of respondents who selected each response category in each country. There is quite a lot of variance in interpersonal trust at the country-level! Describe the distribution you observe. Are there clusters of countries with similar degrees of interpersonal trust? What might be the reasons these countries have similar degrees of interpersonal trust?

Repeat the steps above for two other variables: one that you would like to explore at the individual-level—either in the entire MENA region or in one or more specific countries in which you are particularly interested; and one that you are interested in exploring at the country-level. Describe the variance you observe in each case. Are there individual-level factors that you think might help to explain the variance you observe? Are there country-level factors that you think might help to explain the variance you see?

Arab Barometer data have been used to illustrate the points this chapter makes about variance, variables, univariate analysis, and unit of analysis. This has been done, in part, for convenience, but also because Arab Barometer data are readily available, which offers readers an opportunity to replicate, deepen, or expand on the examples used above. The points being illustrated with Arab Barometer data are, of course, of general significance. Different kinds of examples will be offered in the second half of this chapter.

2.3 Data Collection and Measurement

Remaining to be discussed are two essential and interrelated topics that must be addressed in the design of a research project and then implemented before any analysis can be undertaken. One of these involves data collection, which obviously must precede both the calculation of descriptive statistics and the preparation of graphs and charts. Since most of our examples use Arab Barometer data, our discussion will give special attention to the collection and use of survey data. A fuller overview of survey research is presented in Appendix 3. We will also, however, discuss other sources and methods of data collection and data generation. Even researchers who work with data that have already been collected and cleaned should have an understanding of the sources and processes associated with data collection.

The second essential topic concerns measurement, which merits special attention when the concepts and variables of interest to a researcher are to some degree abstract and cannot be directly observed. In this case, measurement involves the selection and use of indicators, phenomena that can be observed and will permit inferences about the more abstract concepts. In survey research, and equally in many other modes of data collection, the concepts and variables to be measured and the indicators to be used must be identified before data collection can begin.

2.3.1 Types of Data and Measurement Scales

Data can be categorized as categorical or numerical, and many research projects utilize both types of data. Categorical data are often the main type of data used in qualitative research, although numerical data may also be used.

Categorical data are data that are divided into groups based on certain characteristics, such as gender, nationality, or level of education. Categorical data can either be nominal or ordinal. Nominal data don’t have a logical order to the categories. There is no set order to male or female, for example. Ordinal data do have a logical order. There is an order to primary school education, secondary school education, and university degree, for example. Sometimes categorical data are represented with numbers in datasets to facilitate statistical analyses, such as assigning “female” the number 1, and “male” the number 2. When a researcher does this, they will generally provide a codebook to assist others in understanding what the numbers mean.
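As a minimal sketch of how a codebook works in practice—the codes, labels, and observations here are hypothetical—numeric codes stored in a dataset can be decoded back to their category labels:

```python
# A hypothetical codebook: maps each variable's numeric codes to labels.
codebook = {"gender": {1: "female", 2: "male"}}

# Raw numeric codes as they might appear in the dataset.
observations = [1, 2, 2, 1, 1]

# Decode the numbers back to meaningful category labels.
labels = [codebook["gender"][code] for code in observations]
print(labels)  # ['female', 'male', 'male', 'female', 'female']
```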

Numerical data are data that can be measured with numbers. Numerical data can either be discrete or continuous. Discrete data can only take on certain values, such as the number of protests that take place in a month—you can’t have 3.1 or 3.5 protests. Continuous data are data that can take any value within a certain range, such as the GDP of a country. It could be $20.05 billion USD, $40.26 billion USD, or any other number larger than zero.

2.3.2 Data Sources and Data Collection

There are many different types of data sources, and each of them is useful in different contexts. We will not discuss all of them in detail in this guide, but it may be useful to get an idea of some of the major sources of data.

Existing datasets can be extremely useful for a researcher. Many existing datasets are free and accessible online to everyone. The Arab Barometer, and other similar surveys, such as the Afrobarometer, the World Values Survey, and the European Social Survey, measure diverse attitudes, beliefs, and behaviors in various regions of the world. International organizations, such as the UN and the World Bank, publish data on socioeconomic indicators and other topics on their websites. Most of the datasets are aggregated at the country level, but some data come from surveys or administrative systems and are at the individual or sub-national level. Many countries or administrative units also publish datasets, such as crime statistics.

Many researchers also make the data that they collect available online without charge, such as through Harvard Dataverse or personal or university websites. For example, a recently published dataset accessible through Harvard Dataverse is the Global Parties Survey, which compares the values, issue positions, and rhetoric of over 1000 political parties. Researchers make datasets available not only to make future research easier, but also to increase the transparency and replicability of their own research. This is important, as transparency and replication increase our confidence in researchers’ findings and can make their propositions easier to test in other settings.

Archival research involves using documents, images, correspondence, reports, audio or audiovisual recordings, or other objects that already exist. Archival research is commonly used to answer historical research questions. Additional types of records one might access when conducting archival research include medical records, government papers, news articles, personal collections, or even tweets. Archival materials are generally accessed at museums, government offices or, of course, archives. In some cases, a researcher may need to get special permission or training before being allowed to access archival materials. What documents are used depends on the research question. Researchers sometimes use content analysis to categorize or quantify archival documents.

Content analysis is a related research technique that can generate both quantitative and qualitative data. The goal of content analysis is to characterize textual or audio data, such as news articles, speeches of officials, or even a set of tweets. Content analysis can generate quantitative data by counting the frequency, space, or time devoted to certain words, ideas, or themes in the documents being analyzed. Content analysis can generate qualitative data, and sometimes also quantitative data, through directed coding, such as, for example, categorizing certain speeches as either in favor or not in favor of a certain policy. Directed coding is usually done by multiple coders who are instructed to employ a set of coding guidelines, and confidence in the data produced usually requires agreement among the decisions and assignments of the different coders. Sentiment analysis is a type of directed coding in which texts are classified as containing certain emotions, such as positive, negative, sad, angry, or happy. Content analysis is most useful when there is a large amount of scattered text from which it is difficult to draw conclusions without analyzing it systematically. More recently, advances in the fields of computational linguistics and natural language processing have allowed researchers to conduct content analysis on much larger amounts of data and have reduced the need for human coders.
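A short Python sketch illustrates two of these ideas with made-up texts, a made-up theme-word list, and made-up coder decisions: counting theme words per document, and computing simple percent agreement between two coders:

```python
# Hypothetical documents and theme words for a toy content analysis.
texts = [
    "the protest demanded healthcare reform",
    "officials praised the new healthcare budget",
]
theme_words = {"healthcare", "protest"}

# Quantitative content analysis: count theme words in each document.
counts = [sum(word in theme_words for word in t.split()) for t in texts]
print(counts)  # [2, 1]

# Directed coding: two coders classify four hypothetical speeches,
# and we compute simple percent agreement between them.
coder_a = ["favor", "favor", "against", "favor"]
coder_b = ["favor", "against", "against", "favor"]
agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(agreement)  # 0.75
```

In practice, researchers use chance-corrected agreement statistics such as Cohen’s kappa rather than raw percent agreement, but the underlying comparison is the same.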

Observational research is exactly what it sounds like—observing behavior. Sometimes observational research occurs in a laboratory setting, where aspects of the environment are controlled to test how participants react, but often observational research occurs in public. A researcher might be interested in the gender dynamics of a protest, for example. The researcher might attend the protest and take notes or record who stands where in a crowd, what kinds of things women and their signs say versus what kinds of things men and their signs say. This is a very flexible method of data collection, but it can be difficult to draw conclusions with so many uncontrollable factors.

A focus group is a group discussion of a specific topic led by a moderator or interviewer. You have probably heard of focus groups in the context of consumer research. Focus groups are a good way to learn and understand what a target audience thinks about a specific topic. They are sometimes used in the early stages of survey research, before actually conducting the survey. In this connection, focus groups are used to gain ideas and insights that help in developing the survey instrument or in evaluating the clarity and efficacy of a survey instrument that an investigator is planning to use.

Interviews are a way of collecting information by talking with other people. They can be structured, unstructured, or semi-structured. You might conduct interviews of protesters as though you were having a conversation to get a sense of their motivations. This unstructured format might put respondents at ease, leading them to disclose valuable information you would never have thought to ask about. On the other hand, you might also conduct interviews in a structured way, by asking the protesters a predetermined set of questions. This allows you to more easily compare responses between respondents.

Survey research is another way to collect data by asking people questions. The answers people give are the data. You might conduct a survey through face-to-face interviews, as the Arab Barometer does. This means administering a questionnaire face-to-face, with the interviewer asking the questions and recording the responses. You might also conduct a survey using phone calls, text messages, or online messaging. Another method of conducting surveys is having people complete questionnaires in person, online, or through mail. In this case, the surveys are called self-administered rather than interview-administered. We have included more information about survey research in the Appendix of this research guide.

Sampling is necessary because it is usually impossible for a researcher to collect all of the data that are relevant for her research project. Sampling is very often a concern in survey research, but it may also be a concern when other data collection procedures are employed. There are some projects for which this is not the case, but these are the exceptions in social science research. In survey research, for example, an investigator may be interested in the political attitudes of all of the adult citizens of her country, but it is very unlikely, virtually impossible, actually, that she or her research team can survey all of these men and women. Or, a researcher may be interested in the gender-related behavior of students in college social science classes, but again, it is virtually impossible for the researcher and her team to observe all of the social science classes in all of the colleges in her country, let alone in other countries.

A sample refers to the units about whom or which the researcher will actually collect data, and these are the data she will analyze and with which she will carry out her investigation. Population refers to the units in which the researcher is interested, those about whom or which her study seeks to provide information. In the first example above, the population is all of the adult citizens of the researcher’s country; and in the second, it is college social science classes in general, meaning virtually all such classes.
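The population–sample relationship can be illustrated with simulated data; everything here (the satisfaction scores, the population and sample sizes) is made up for illustration:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# The population: the units the researcher is interested in.
# Here, 10,000 simulated satisfaction ratings on a 1-5 scale.
population = [random.randint(1, 5) for _ in range(10_000)]

# The sample: the units about which data are actually collected.
sample = random.sample(population, 500)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(pop_mean, sample_mean)  # the two means should be close
```

With a well-drawn random sample, the sample mean closely approximates the population mean; the conditions under which such inferences are justified are taken up in the next chapter.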

The distinction between population and sample raises two important questions that are discussed elsewhere in this guide.

The first question asks which are the units about whom or which the investigator will collect information. In other words, it asks which members of the population will be included in the sample, and how will those in the sample be selected. The answers lie in the design and construction of the investigator’s sample, a topic that is discussed in the appendix on survey research with examples from the Arab Barometer.

The second question concerns the relationship between the population and the sample. It asks whether, and if so, when, how, and with what degree of confidence, can findings based on analyses of the data in the sample provide information and insight that apply to the population. This important question is taken up in the next chapter.

2.3.3 Conceptual Definitions and Operational Definitions

The indicators and type of data that are best suited for measuring a certain concept depend on how the concept is defined. A conceptual definition should specify how we are thinking about variance related to the concept: what is the unit of analysis and what is the variance we want to capture? Take, for example, the concept of quality of healthcare services. If country is the unit of analysis, do we want to capture the amount of healthcare services that the government provides? If so, we might consider measuring the concept by using the percent of the national budget devoted to health and healthcare, or the number of physicians per 100,000 citizens. On the other hand, we may be interested in citizens’ perceptions of healthcare service provision. In this case, we may want to measure perceived quality of healthcare services by asking questions, most probably in a survey, such as “Do you find doctors helpful when you are sick?” or “When you have been sick, were you able to obtain the healthcare services you needed?” We will want to conceptualize “quality of healthcare services” differently depending on our research question, and then, accordingly, collect or use data at the appropriate level of analysis.

Once an investigator has formulated and determined her conceptual definitions, she is ready to think about and specify her operational definitions. An operational definition describes the data, indicators, and methods of analysis she will use to rate, classify or assign values to the entities in which she is interested. An operational definition, in other words, describes the procedures an investigator will use to measure each of the abstract concepts and variables in her study, concepts and variables that cannot be directly observed.

In formulating an operational definition, an investigator must decide what data and indicators best fit the variance specified in the conceptual definition of each concept and variable in her study that cannot be directly observed or directly measured. She asks, therefore, what data can be obtained or collected, do these data contain the indicators she needs, and of course, what will be the quality of the data.

Returning to the previous illustration, suppose you have decided to measure the satisfaction of individuals with the provision of healthcare services. You need to decide what type of data to use, and in this case, it makes sense to use public opinion data. Perhaps, however, survey data on this topic do not exist, or the survey data that do exist do not ask questions that you would consider good indicators of the concept you want to measure. You might consider administering your own survey, but if this is not feasible, you can consider other types of data and data collection and build your own new dataset, informed and guided by the conceptual definition you are seeking to operationalize. For example, you might collect tweets related to healthcare and use content analysis by coders to rate each tweet on a spectrum ranging from very negative to very positive.

Think of another concept in which you are interested and then ask yourself the following questions. Are you interested in variance at the individual level, country level, or a different level and unit of analysis? What is a conceptual definition that makes clear and gives meaning to the variance of the concept that you seek to measure? What type of data would best measure your variable? What elements that might be good indicators of the variable you seek to measure—questions in a survey, for example—should the dataset contain? How feasible do you think it is to obtain data that will contain these elements?

Researchers cannot always use the data and method of measuring their concepts and variables that are best suited to operationalizing their conceptual definitions. Collecting data takes time, resources, and certain skill sets. Also significant, certain types of data collection may pose risks to the researcher or the research subjects, and for this reason they must be avoided due to ethical concerns. For example, in some countries it can be dangerous to interview people about their participation in opposition political parties or movements. It is important to consider the trade-offs in using different types of data and, in some cases, different indicators. Which type of data collection is most feasible? What are good indicators of the variance you want to capture? We discuss these questions in the following section.

2.3.4 Measurement Quality

How do we decide whether a certain kind of data or particular survey questions are good indicators of the concept we seek to measure? We want data to be reliable and valid, which are the criteria by which the quality of a measure may be evaluated. We also want measures that capture all of the variance associated with the concepts and variables to be used in analyses. Attention to these criteria is particularly important if the concept to be measured is abstract and not directly observable. In this case, we will probably be measuring indicators of the concept, rather than the concept itself.

Reliability

Reliability refers to the absence of error variance, which means that the observed variance is based solely on the measure, as intended, and not on extraneous factors as well. In survey research, for example, a question will be a reliable measure if the response is based solely on the content of that question and not also on factors such as ambiguous wording, the appearance or behavior of the interviewer, comments by other persons who were present when the interview was conducted, or even the time of day of the interview.

Attention to reliability is no less important in other forms of research. For example, when coding events data from newspapers in order to classify countries or other units of analysis with respect to newsworthy events such as protests, instances of violence, violations of human rights, labor strikes, elections and electoral outcomes, or other attributes, error variance may result from unclear coding instructions, inconsistent newspaper selections, or changing standards about what constitutes an instance of violence or a violation of human rights.

Consistency over multiple trials offers evidence of reliability. In survey research, once again, this means that a question would be answered in the same way—perhaps a response of “somewhat satisfied” to a question about satisfaction with the country’s healthcare system—regardless of who was the interviewer or the time of day at which the interview was conducted. In natural science, especially laboratory science, this means that the result of a measuring operation is reproducible.

In social science, evidence of reliability is often provided by consistency among multiple indicators that purport to measure the same concept or variable. This applies not only to questions asked in a survey, but also to data collected or generated in other ways. A measure based on multiple indicators that agree with one another can also be described as a unidimensional measure, and unidimensionality across multiple indicators demonstrates that a measure is reliable. For this reason, researchers often seek to use multiple indicators to measure the same concept in order to increase the robustness of their results.

Note also that the values or ratings produced by different indicators need not be absolutely identical. To be consistent with one another, they need only to be strongly correlated. A number of statistical tests are available for determining the degree of inter-indicator agreement, or consistency. Cronbach’s alpha is probably the test most frequently used for this purpose. Factor analysis is also frequently used. We discuss some of these statistical techniques in Chaps. 3 and 4.
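For illustration, Cronbach’s alpha can be computed by hand from its standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The sketch below uses made-up responses to three related survey items:

```python
def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondent rows, one item per column."""
    k = len(rows[0])  # number of items

    def var(xs):  # sample variance (divide by n - 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([r[i] for r in rows]) for i in range(k)]
    total_var = var([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: five respondents, three items that mostly agree.
responses = [[4, 4, 5], [2, 2, 2], [5, 4, 5], [1, 2, 1], [3, 3, 3]]
print(round(cronbach_alpha(responses), 2))  # 0.96 -- highly consistent items
```

Values of alpha near 1 indicate strong inter-item consistency; a common rule of thumb treats alpha above roughly 0.7 as acceptable.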

Table 2.6, which presents hypothetical data on indicators purporting to measure a country’s level of development, provides a simplified illustration of three patterns of agreement among multiple indicators. Although the data are fictitious, the indicators might be thought of as Gross Domestic Product, Per Capita National Income, Percentage of the Population below the Poverty Line, the Level of Unemployment, or other potentially reliable indicators of national development. The table illustrates the following three patterns.

Pattern A indicates strong inter-indicator agreement and hence a high degree of reliability. Even though the ratings are not completely identical, the correlations among them are very strong. Each of the indicators can be used with confidence that it is a reliable measure. Its reliability has been demonstrated by its agreement with other indicators. The items can also be combined to form a scale or index, which, again, can be used with confidence that it is a reliable measure.

Pattern B indicates the absence of inter-indicator agreement and hence a low level of reliability. It is possible that one of the indicators is a reliable measure of national development, but there is no basis for determining which item is the reliable measure. It is also possible that all are reliable measures but of different dimensions of national development, meaning that the concept and the measure are not unidimensional and, hence, Pattern B does not provide evidence that any indicator or combination of indicators constitutes a reliable measure for the specific concept of concern.

Pattern C indicates strong inter-indicator agreement among three of the indicators (I-1, I-2, and I-4), and these three, but not the fourth (I-3), may be considered reliable measures and used in the ways described for Pattern A. In the absence of evidence that it is reliable, I-3 should not be used to measure the same concept, national development, that the other three indicators measure.
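A pattern like Pattern C can be detected by computing pairwise correlations among the indicators. The sketch below uses made-up country scores, not the Table 2.6 data, with I-3 deliberately out of step with the others:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five countries on four indicators.
indicators = {
    "I-1": [1, 2, 3, 4, 5],
    "I-2": [1, 2, 3, 5, 5],
    "I-3": [3, 1, 5, 2, 4],  # deliberately out of step with the others
    "I-4": [2, 2, 3, 4, 5],
}

# Correlate every indicator with I-1: I-2 and I-4 agree strongly,
# while I-3 shows only a weak correlation.
for name, scores in indicators.items():
    print(name, round(pearson(indicators["I-1"], scores), 2))
```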

Validity

Validity asks whether, or to what degree, a measure actually measures what it purports to measure. A concern for validity is important whenever the concept or variable to be measured cannot be directly observed, and so the investigator in this case must use an indicator, rather than the concept or variable itself, to capture the relevant variance.

It is useful to think of validity as expressing the congruence between the conceptual definition and the operational definition of a concept or variable. A conceptual definition makes explicit what an investigator understands the concept to mean, and it is important that she provide a conceptual definition when the concept cannot be directly observed, is abstract, and might also be multi-dimensional. By contrast, if the concept or variable is familiar and there is a widely shared understanding of what it means, the investigator may make this the basis of her conceptual definition.

The operational definition makes explicit the way that the concept or variable will be measured. What indicator or indicators should an investigator use, and how exactly should she use them? Suppose, for example, that an investigator is designing a study in which country is the unit of analysis and the goal is to measure the degree to which each country is democratic. Her operational definition will spell out how she will capture the cross-country variance in degree of democracy. If you, Dear Reader, were the investigator, what would be your operational definition? What indicator or indicators would you use, and how would you use them?

A concern for validity often emerges when using Arab Barometer survey data. Suppose you wanted to rate or classify individuals with respect to tolerance and interpersonal trust. After offering conceptual definitions of tolerance and interpersonal trust, what would be your operational definitions? What item or items would you feel comfortable treating as indicators of each concept and would you, therefore, include in your questionnaire or survey instrument?

It is important to make clear that validity is not about how well the variance is captured by an operational definition. That is an important concern and one by which the quality of a measure is judged, as discussed in the next section. But validity does not ask how much of the variance is captured but, rather, does the variance that is captured, however complete or incomplete that may be, actually pertain to the concept specified in your conceptual definition.

The standardized tests used to evaluate and classify students are often mentioned to illustrate this point. Do intelligence tests really measure intelligence, or do they rather measure something else—perhaps being the oldest child, perhaps income, perhaps something else? Do university entrance exams, the tawjihi, for example, really measure what they purport to measure: the likelihood of success at university? Or do they again measure something else—perhaps growing up in a middle class household? You might find it useful to construct your own operational definition of the concept “likelihood of success at university.” What indicators would you use, and how would you use them to construct a measure that would give a rating or score to each student?

An exercise with Arab Barometer survey data provides another illustration, and one in which the importance of the conceptual definition is also demonstrated. Suppose that the variable to be measured is satisfaction with the government, and the goal is to rate each respondent on a five-point scale ranging from 1 = no satisfaction at all to 5 = very high satisfaction. Do you think, for example, that a question about government corruption is a valid indicator—an indicator of the concept as it is defined in your conceptual definition? What about a question that asks, “Do you think government officials care more about being re-elected than solving important problems for the country’s citizens?” What question would you write to attempt to measure the concept of government satisfaction?

Once you have specified your measurement strategy, or operational definition, it may be necessary, or at least very advisable, to offer evidence that your measure is valid—that you can use it to measure a concept with confidence and that it does actually measure that concept. Unlike reliability, which can be demonstrated, validity must be inferred. An investigator will state why it is very likely that the measure does indeed measure the concept it purports to measure, and when appropriate, she will offer evidence or reasoning in support of this assertion.

Face Validity

Sometimes, asserting “face validity” is sufficient to establish validity and to persuade consumers of the findings produced by a research project that an operational definition does indeed measure what it purports to measure. This may be the case if there is an apparently very good fit between the conceptual definition and the operational definition of a particular concept or variable. In many cases, however, face validity may not be evident and the assertion of face validity by an investigator is unlikely to be persuasive. Below are brief descriptions of the ways an investigator can support her assertion that a measure is valid. Although different, each involves some sort of comparison.

Construct Validity

The measure may be considered valid if it is related to other phenomena in the same way that the concept being measured is known to be related to them. For example, if an investigator conducting a survey seeks to measure interpersonal trust, and if it is known that interpersonal trust is related to personal efficacy, construct validity can be demonstrated by a significant correlation between the investigator’s measure of interpersonal trust and a measure of personal efficacy.

Criterion Validity

Also sometimes known as predictive validity, the measure may be considered valid if there is a significant correlation between the results of an investigator’s operational definition and a distinct, established, and commonly used measure of the concept or variable the investigator seeks to measure. For example, an investigator using aggregated survey data to classify countries with respect to democracy might assess validity by comparing her country ratings with those provided by Freedom House.

Known Groups

The measure may be considered valid if it correctly differentiates between groups that are known to differ with respect to the concept or variable being measured. For example, evidence that the tawjihi examination is a valid measure of likelihood of success at university would be provided if university students currently doing well at university have higher exam scores than university students currently doing less well at university.
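The known-groups comparison can be sketched in a few lines of Python; the exam scores below are invented solely for illustration.

```python
from statistics import mean

# Hypothetical tawjihi-style exam scores for two known groups:
# university students currently doing well, and those doing less well.
doing_well = [88, 91, 85, 90, 87]
doing_less_well = [78, 74, 80, 76, 79]

# If the exam validly measures likelihood of success at university,
# the first group should average higher than the second.
gap = mean(doing_well) - mean(doing_less_well)
print(gap > 0)
```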

Inter-Indicator Agreement

Inter-indicator agreement builds on the discussion of reliability, particularly on the significance of unidimensionality and the patterns of inter-indicator agreement shown in Table 2.6. If each indicator in a battery of indicators has face validity, and if each one agrees with each of the others, it is very unlikely that they measure something other than the concept or variable they purport to measure.

As noted in the discussion of reliability, various statistical procedures, including Cronbach’s alpha and factor analysis, can be used to assess the degree to which indicators are inter-correlated and, therefore, taken together, constitute a unidimensional and reliable measure. And if different indicators possessing face validity all reliably measure the same concept or variable, it is reasonable to infer that this concept or variable is indeed the one the investigator seeks to measure and is, therefore, valid as well as reliable.
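For readers who compute such statistics themselves, Cronbach's alpha can be calculated directly from its standard formula. The helper function and the three response batteries below are invented for illustration.

```python
from statistics import pvariance

def cronbach_alpha(items):
    # items: one list of scores per indicator, all the same length
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total
    item_variances = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variances / pvariance(totals))

# Three hypothetical five-point items that track one another closely,
# as indicators of a single underlying concept should.
items = [
    [1, 2, 3, 4, 5, 4, 3, 2],
    [1, 2, 4, 4, 5, 5, 3, 2],
    [2, 2, 3, 5, 5, 4, 3, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))  # a value near 1 indicates a reliable, unidimensional battery
```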

Content Validity

Content validity refers to whether, or to what degree, an operational definition captures a fuller range of the concept’s meaning and variance. As discussed in the next section, using multiple indicators increases the likelihood that a measure will possess content validity.

Exercise 2.2. Inter-Indicator Agreement and Reliability and Validity

Which of the following items from Arab Barometer surveys do you think would be most likely to be reliable and valid indicators of support for gender equality? Briefly state the reasons you have selected these particular items, or all of the items, if that is what you chose. Then, referring to the patterns of inter-indicator agreement shown in Table 2.6, describe the pattern that you think the items you have selected would resemble. Finally, describe and explain the implications for reliability and validity of the pattern of inter-indicator agreement that you think the items you have selected would resemble.

Do you think it is important for girls to go to high school?

A married woman can work outside the home, if she wishes

It is acceptable for a woman to be a member of parliament

A university education is more important for a boy than a girl

Men and women should have equal job opportunities and wages

Women have the right to get divorced upon their request

A woman can be president or prime minister of a Muslim country

A woman should cease to work outside the home after marriage in order to devote full time to home and family

A woman can travel abroad by herself, if she wishes

On the whole, men make better political leaders than women

2.3.5 Capturing Variance

Although reliability and validity are recognized and widely-used criteria for assessing the quality of a measure in social science (and other) research, there is an additional criterion that is important as well. This is the degree to which a measure captures all, as opposed to only some, of the variance associated with the concept that an investigator seeks to measure. This can be described as the completeness of a measure.

If the variance of the concept to be measured is continuous, ranging from low to high or weak to strong, for example, a measure will be flawed if it does not capture the whole of the continuum, or at least the whole of that part of the continuum in which the investigator is interested. Such a measure may be reliable and valid, and in this sense of very good quality. But its utility may still be limited, or it may at least be less than maximally useful, if it captures only some of the variance.

A simplified and hypothetical example of a survey about religious involvement illustrates this point. If an investigator wishes to know how often a respondent attended Friday prayers at the mosque during the past year, her survey instrument should not ask a question like the following: “Over the last year, on average, did you pray at the mosque on Friday at least once a month?” A response of “No” will lump together respondents who never pray at the mosque on Friday and those who do so in two months out of three. A response of “Yes” will lump together those who attend Friday prayers once a month and those who do so every week.

In constructing the survey instrument, the investigator may have had good reason to ask a Yes-No question and make “once a month” the cutting point. But such a cutting point can be implemented during the data analysis phase, if needed, rather than asking the initial survey question in a manner that reduces much of the variance that characterizes the population being surveyed.
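The point about applying the cutting point at the analysis stage rather than in the question itself can be shown in a line of Python; the raw counts of Friday-prayer attendance below are invented.

```python
# Hypothetical raw data: number of Fridays (0-52) on which each respondent
# attended prayers at the mosque during the past year.
fridays_attended = [0, 5, 18, 30, 52, 8, 12]

# The "at least once a month" cutting point, applied after the fact;
# the full variance in the raw counts remains available for other analyses.
at_least_monthly = ["Yes" if n >= 12 else "No" for n in fridays_attended]
print(at_least_monthly)
```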

A more realistic example, perhaps, would be survey questions that ask about age or income. Ideally, the investigator should ask respondents for their exact age and their exact monthly or annual income. Sometimes, however, this is difficult or impossible; in the case of income, perhaps, because the matter is considered sensitive. If this is the case, the investigator may decide to use age and income categories, such as 18–25 years of age and 500–1000 dinars per month. Again, while the data obtained may be reliable and valid and also useful, not all of the variance that characterizes the population has been captured: individuals 19 years of age and individuals 24 years of age are treated as if they have the same age, the variance in their actual ages, therefore, not being captured. Even more variance would remain uncaptured by wider categories or categories with no lower or upper limit, such as 55 years of age or older and, say, 5000 or more dinars a month.
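A short sketch makes the loss of variance visible; the bracket boundaries below are assumptions for illustration, echoing the categories in the text rather than those of any actual survey.

```python
def age_category(age):
    # Invented brackets echoing the example in the text.
    if 18 <= age <= 25:
        return "18-25"
    if age <= 54:
        return "26-54"
    return "55 or older"

# A 19-year-old and a 24-year-old land in the same category,
# so the five-year difference between them is no longer captured.
print(age_category(19), age_category(24), age_category(60))
```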

The same concern arises when the data are categorical rather than continuous. Variance in this case refers to a range of types or kinds or categories. As is standard with categorizations, categories should be comprehensive and mutually exclusive, meaning that every member of a population can be assigned to one but only one category.

The challenge here is for the investigator to be knowledgeable about the array of one-and-only-one categories into which she wishes to sort the entities whose attributes she seeks to measure. And in principle, this means she must be knowledgeable about the actual, real-world variance, as well as about the categories relevant for her particular study. With respect to religious affiliation, for example, asking people in Lebanon whether they are Muslim or Christian would leave a great deal of variance uncaptured since there are important subdivisions within each category. Asking only about Muslims and Christians would therefore be appropriate only if the researcher is aware of the subdivisions within each category and has explicitly determined that her project does not require attention to these subdivisions.
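The requirement that categories be comprehensive and mutually exclusive can be enforced in code; the coding scheme below is invented and deliberately coarse.

```python
# Hypothetical one-and-only-one coding scheme for religious affiliation.
categories = {"Sunni Muslim", "Shia Muslim", "Maronite Christian",
              "Orthodox Christian", "Druze", "Other"}

def assign(response):
    # A residual "Other" category keeps the scheme comprehensive:
    # every respondent maps to exactly one category.
    return response if response in categories else "Other"

coded = [assign(r) for r in ["Sunni Muslim", "Maronite Christian", "Alawite"]]
print(coded)
```

Note that the residual category is precisely where variance goes uncaptured: the "Alawite" respondent represents real-world variance this scheme does not distinguish.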

There are numerous examples that involve a unit of analysis other than the individual. Consider, for example, a study in several Arab countries in which non-governmental organization, NGO, is the unit of analysis, and the variable of interest is NGO type. The investigator seeks, in other words, to prepare a distribution of NGO types, perhaps to see if the distribution differs from country to country. In this case, the investigator must decide on the categories of NGO type that she will use, and these categories, taken together, must be such that each NGO can be assigned to one and only one category.

To make it easier to assign each NGO to a category of NGO type, the investigator might be inclined to define NGO type very broadly, such as economic, sociocultural, and political NGOs. But this will, again, leave much of the variance uncaptured. Assigning NGOs to the “sociocultural” NGO type category, for example, will group together NGOs that may actually differ from one another with respect to objectives and strategies and perhaps in other ways as well. The investigator must be aware of these within-NGO type differences and make an informed decision about their relevance for her study.

Finally, there is an additional and somewhat different way in which an investigator needs to think about the variance that will and will not be captured, and this concerns the dimensional structure of the concept to be measured. For example, the United Nations has developed an index of human development and it annually gives each of the world’s countries a numerical HDI score. Investigators can use the index provided by the UN if it measures a concept that is relevant for their studies. But the HDI is based on a formula that combines a country’s situation with respect to health, education, and income, and countries with an identical HDI score are not necessarily the same with respect to the three elements. One country’s HDI may be driven up by the excellent quality of its educational system, whereas it might be the excellent quality of its health care system that is driving up the HDI score of another country.
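A simplified numerical illustration (not the UN's actual formula, which combines normalized component indices differently) shows how two different country profiles can produce an identical composite score:

```python
# Equal-weight average of three 0-1 component indices; purely illustrative.
def composite(health, education, income):
    return round((health + education + income) / 3, 3)

country_a = composite(0.90, 0.60, 0.75)  # score driven up by health
country_b = composite(0.60, 0.90, 0.75)  # score driven up by education
print(country_a, country_b)  # identical scores, different profiles
```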

Does this mean that the investigator should abandon the HDI and instead include separate measures of education, income, and health in their analyses? Of course, it depends on the goals of the study. But investigators seeking to measure concepts with multiple dimensions or multiple elements must be aware of these differences and then, in light of the goals of each specific research project, make informed decisions about whether the variance these differences represent does or does not need to be captured.

Other examples, ones in which the individual is the unit of analysis, remind us that attitudes and behavior may also have multiple dimensions or components, and that a researcher must again decide, therefore, whether her investigation will be best served by considering the dimensions separately or by constructing an index that combines them.

Attitudes about immigration, for example, probably have an economic, a cultural, and perhaps a political dimension. Similarly, the important concept of trust has multiple components, including general interpersonal trust, trust in important political institutions, trust in people who belong to a different religion, etc. In cases such as these, the researcher will likely want to ask about each of these dimensions or components. The use of multiple questions will enable the researcher to capture more of the variance associated with attitudes toward immigration or the concept of trust. It will remain, however, for the investigator to decide whether to consider the various elements separately or construct an index that considers them in combination with one another.

The concepts and procedures discussed in this chapter focus on description, on taking variables one at a time. But while the objectives of a positivist social science research project might be descriptive, and this might well produce valuable information and insight, familiarity with the topics discussed in the present chapter is necessary not only, or even primarily, for investigators with descriptive objectives. An understanding of many of these topics is essential for investigators who seek to explain as well as describe variance. This is not the case for every topic considered. Descriptive statistics and visual descriptions are, as their name indicates, for descriptions, for variables taken one at a time. But most of the other concepts and procedures are building blocks, or points of departure, for research endeavors that seek to explain and, toward this end, carry out bivariate and multivariate analyses. Accordingly, readers of Chaps. 3 and 4 will want to keep in mind, and may occasionally find it helpful to refer back to, the material covered in the present chapter.

Readers who wish to further explore research using qualitative data and methods may find the following source helpful: Gary King, Robert Keohane, and Sidney Verba, Designing Social Inquiry: Scientific Inference in Qualitative Research (Princeton University Press, 1994).

Jaime Kucinskas and Tamara van der Does, “Gender Ideals in Turbulent Times: An Examination of Insecurity, Islam, and Muslim Men’s Gender Attitudes during the Arab Spring.” Comparative Sociology 16 (2017): 340–368.

Katherine Meyer, Helen Rizzo, and Yousef Ali, “Changed Political Attitudes in the Middle East: The Case of Kuwait.” International Sociology 22, 7 (May 2007): 289–324.

Aggregating, here, refers to the construction or calculation of a measure pertaining to a larger unit based on the summing or averaging of smaller units inside the larger unit.

Mark Tessler, Amaney Jamal, and Michael Robbins. “New Findings on Arabs and Democracy.” Journal of Democracy 23, 4 (October 2012): 89–103.

See Michael Ross, “Does Oil Hinder Democracy?” World Politics 53 (April 2001): 325–61. Oil reliance is measured by the value of fuel-based exports divided by GDP.

Daniel Corstange, The Price of a Vote in the Middle East: Clientelism and Communal Politics in Lebanon and Yemen (Cambridge University Press, 2016).

Corstange op. cit.

Melani Cammett, Compassionate Communalism: Welfare and Sectarianism in Lebanon (Cornell University Press, 2014).

Wendy Pearlman, Violence, Non-Violence, and the Palestinian National Movement (Cambridge University Press, 2011).

Elbaki Hermassi, Leadership and National Development in North Africa (University of California Press, 1979).

The divisor is n-1 if the analysis is based on data from a sample, or subset, of a larger population. If the analysis is based on data from the entire population, the divisor is n.

Author information

Authors and affiliations.

Department of Political Science, University of Michigan, Ann Arbor, MI, USA

Mark Tessler

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2023 The Author(s)

About this chapter

Tessler, M. (2023). Univariate Analysis: Variance, Variables, Data, and Measurement. In: Social Science Research in the Arab World and Beyond. SpringerBriefs in Sociology. Springer, Cham. https://doi.org/10.1007/978-3-031-13838-6_2

DOI : https://doi.org/10.1007/978-3-031-13838-6_2

Published : 04 October 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-13837-9

Online ISBN : 978-3-031-13838-6


14. Univariate analysis

Chapter outline.

  • Where do I start with quantitative data analysis? (12 minute read time)
  • Measures of central tendency (17 minute read time, including 5-minute video)
  • Frequencies and variability (13 minute read time)

People often dread quantitative data analysis because – oh no – it’s math. And true, you’re going to have to work with numbers. For years, I thought I was terrible at math, and then I started working with data and statistics, and it turned out I had a real knack for it. (I have a statistician friend who claims statistics is not math, which is a math joke that’s way over my head, but there you go.) This chapter and the subsequent quantitative analysis chapters are going to focus on helping you understand descriptive statistics and a few statistical tests, NOT calculate them (with a couple of exceptions). Future research classes will focus on teaching you to calculate these tests for yourself. So take a deep breath and clear your mind of any doubts about your ability to understand and work with numerical data.

[Image: a white car with a bumper sticker that says "We all use math every day"]

In this chapter, we’re going to discuss the first step in analyzing your quantitative data: univariate data analysis. Univariate data analysis is a quantitative method in which a variable is examined individually to determine its distribution, or “the way the scores are distributed across the levels of that variable” (Price et al., Chapter 12.1, para. 2). When we talk about levels, what we are talking about are the possible values of the variable – like a participant’s age, income or gender. (Note that this is different than our earlier discussion in Chapter 10 of levels of measurement, but the level of measurement of your variables absolutely affects what kinds of analyses you can do with it.) Univariate analysis is non-relational, which just means that we’re not looking into how our variables relate to each other. Instead, we’re looking at variables in isolation to try to understand them better. For this reason, univariate analysis is best for descriptive research questions.

So when do you use univariate data analysis? Always! It should be the first thing you do with your quantitative data, whether you are planning to move on to more sophisticated statistical analyses or are conducting a study to describe a new phenomenon. You need to understand what the values of each variable look like – what if one of your variables has a lot of missing data because participants didn’t answer that question on your survey? What if there isn’t much variation in the gender of your sample? These are things you’ll learn through univariate analysis.

14.1 Where do I start with quantitative data analysis?

Learning objectives.

Learners will be able to…

  • Define and construct a data analysis plan
  • Define key data management terms – variable name, data dictionary, primary and secondary data, observations/cases

No matter how large or small your data set is, quantitative data can be intimidating. There are a few ways to make things manageable for yourself, including creating a data analysis plan and organizing your data in a useful way. We’ll discuss some of the keys to these tactics below.

The data analysis plan

As part of planning for your research, and to help keep you on track and make things more manageable, you should come up with a data analysis plan. You’ve basically been working on doing this in writing your research proposal so far. A data analysis plan is an ordered outline that includes your research question, a description of the data you are going to use to answer it, and the exact step-by-step analyses that you plan to run to answer your research question. This last part – which includes choosing your quantitative analyses – is the focus of this and the next two chapters of this book.

A basic data analysis plan might look something like what you see in Table 14.1. Don’t panic if you don’t yet understand some of the statistical terms in the plan; we’re going to delve into them throughout the next few chapters. Note here also that this is what operationalizing your variables and moving through your research with them looks like on a basic level.

An important point to remember is that you should never get stuck on using a particular statistical method because you or one of your co-researchers thinks it’s cool or it’s the hot thing in your field right now. You should certainly go into your data analysis plan with ideas, but in the end, you need to let your research question and the actual content of your data guide what statistical tests you use. Be prepared to be flexible if your plan doesn’t pan out because the data is behaving in unexpected ways.

Managing your data

Whether you’ve collected your own data or are using someone else’s data, you need to make sure it is well-organized in a database in a way that’s actually usable. “Database” can be kind of a scary word, but really, I just mean an Excel spreadsheet or a data file in whatever program you’re using to analyze your data (like SPSS, SAS, or R). (I would avoid Excel if you’ve got a very large data set – one with millions of records or hundreds of variables – because it gets very slow and can only handle a certain number of cases and variables, depending on your version. But if your data set is smaller and you plan to keep your analyses simple, you can definitely get away with Excel.) Your database or data set should be organized with variables as your columns and observations/cases as your rows. For example, let’s say we did a survey on ice cream preferences and collected the following information in Table 14.2:
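In case Table 14.2 doesn't render here, this is what the variables-as-columns, cases-as-rows layout might look like in plain Python; the values are invented to match the ice cream example.

```python
# Each dict is one observation/case (a row); each key is a variable (a column).
data = [
    {"id": 1, "age": 24, "favorite_flavor": "chocolate"},
    {"id": 2, "age": 31, "favorite_flavor": "vanilla"},
    {"id": 3, "age": 54, "favorite_flavor": "strawberry"},
]

# Pulling one variable out is just collecting one "column":
ages = [case["age"] for case in data]
print(ages)
```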

There are a few key data management terms to understand:

  • Variable name : Just what it sounds like – the name of your variable. Make sure this is something useful, short and, if you’re using something other than Excel, all one word. Most statistical programs will automatically rename variables for you if they aren’t one word, but the names are usually a little ridiculous and long.
  • Observations/cases : The rows in your data set. In social work, these are often your study participants (people), but can be anything from census tracts to black bears to trains. When we talk about sample size, we’re talking about the number of observations/cases. In our mini data set, each person is an observation/case.
  • Primary data : Data you have collected yourself.
  • Secondary data : Data someone else has collected that you have permission to use in your research. For example, for my  student research project in my MSW program, I used data from a local probation program to determine if a shoplifting prevention group was reducing the rate at which people were re-offending.  I had data on who participated in the program and then received their criminal history six months after the end of their probation period. This was secondary data I used to determine whether the shoplifting prevention group had any effect on an individual’s likelihood of re-offending.
  • Data dictionary (sometimes called a code book) : This is the document where you list your variable names, what the variables actually measure or represent, what each of the values of the variable mean if the meaning isn’t obvious (i.e., if there are numbers assigned to gender), the level of measurement and anything special to know about the variables (for instance, the source if you mashed two data sets together). If you’re using secondary data, the data dictionary should be available to you.

When considering what data you might want to collect as part of your project, there are two important considerations that can create dilemmas for researchers. You might only get one chance to interact with your participants, so you must think comprehensively in your planning phase about what information you need and collect as much relevant data as possible. At the same time, though, especially when collecting sensitive information, you need to consider how onerous the data collection is for participants and whether you really need them to share that information. Just because something is interesting to us doesn’t mean it’s related enough to our research question to chase it down. Work with your research team and/or faculty early in your project to talk through these issues before you get to this point. And if you’re using secondary data, make sure you have access to all the information you need in that data before you use it.

Let’s take that mini data set we’ve got up above and I’ll show you what your data dictionary might look like in Table 14.3.
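In case Table 14.3 doesn't come through, here's a sketch of the same idea as a Python dict; the variable names, codes, and descriptions are invented.

```python
# A minimal data dictionary (code book): one entry per variable, recording
# what it measures, its level of measurement, and what its values mean.
data_dictionary = {
    "age": {
        "description": "Respondent's age in years at the time of the survey",
        "level_of_measurement": "ratio",
        "values": "18-99",
    },
    "gender": {
        "description": "Self-reported gender",
        "level_of_measurement": "nominal",
        "values": {1: "woman", 2: "man", 3: "non-binary", 4: "prefer not to say"},
    },
}
print(data_dictionary["gender"]["values"][1])
```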

Key Takeaways

  • Getting organized at the beginning of your project with a data analysis plan will help keep you on track. Data analysis plans should include your research question, a description of your data, and a step-by-step outline of what you’re going to do with it.
  • Be flexible with your data analysis plan – sometimes data surprises us and we have to adjust the statistical tests we are using.
  • Always make a data dictionary or, if using secondary data, get a copy of the data dictionary so you (or someone else) can understand the basics of your data.
  • Make a data analysis plan for your project. Remember this should include your research question, a description of the data you will use, and a step-by-step outline of what you’re going to do with your data once you have it, including statistical tests (non-relational and relational) that you plan to use. You can do this exercise whether you’re using quantitative or qualitative data! The same principles apply.
  • Make a data dictionary for the data you are proposing to collect as part of your study. You can use the example above as a template.

14.2 Measures of central tendency

  • Explain measures of central tendency – mean, median and mode – and when to use them to describe your data
  • Explain the importance of examining the range of your data
  • Apply the appropriate measure of central tendency to a research problem or question

A measure of central tendency is one number that can give you an idea about the distribution of your data. The video below gives a more detailed introduction to central tendency. Then we’ll talk more specifically about our three measures of central tendency – mean, median and mode.

One quick note: the narrator in the video mentions skewness and kurtosis. Basically, these refer to a particular shape for a distribution when you graph it out. That gets into some more advanced analysis that we aren’t tackling in this book, so just file them away for a more advanced class, if you ever take one.

There are three key measures of central tendency, which we’ll go into now.

The mean , also called the average, is calculated by adding all your cases and dividing the sum by the number of cases. You’ve undoubtedly calculated a mean at some point in your life. The mean is the most widely used measure of central tendency because it’s easy to understand and calculate. It can only be used with interval/ratio variables, like age, test scores or years of post-high school education. (If you think about it, using it with a nominal or ordinal variable doesn’t make much sense – why do we care about the average of our numerical values we assigned to certain races?)

The biggest drawback of using the mean is that it’s extremely sensitive to outliers , or extreme values in your data. And the smaller your data set is, the more sensitive your mean is to these outliers. One thing to remember about outliers – they are not inherently bad, and can sometimes contain really important information. Don’t automatically discard them because they skew your data.
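You can see the mean's sensitivity to outliers in a few lines of Python; the ages below are invented, not the book's actual mini data set.

```python
from statistics import mean

ages = [21, 23, 24, 25, 26, 27, 54]  # one participant is much older

with_outlier = mean(ages)
without_outlier = mean(ages[:-1])
# The single large value pulls the mean up by several years.
print(round(with_outlier, 1), round(without_outlier, 1))
```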

Let’s take a minute to talk about how to locate outliers in your data. If your data set is very small, you can just take a look at it and spot any outliers. But in general, you’re probably going to be working with data sets that have at least a couple dozen cases, which makes just looking at your values to find outliers difficult. The best way to quickly look for outliers is probably to make a scatter plot with Excel or whatever database management program you’re using.
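Besides a scatter plot, a standard numeric way to flag outliers (not covered in this chapter, but common practice) is the 1.5 * IQR rule; here is a sketch with invented ages.

```python
from statistics import quantiles

# Hypothetical participant ages, one of them suspiciously large.
ages = [21, 22, 23, 24, 25, 25, 26, 27, 28, 54]

q1, _, q3 = quantiles(ages, n=4)  # quartiles (Python 3.8+)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences gets flagged for a closer look.
outliers = [a for a in ages if a < low or a > high]
print(outliers)
```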

Let’s take a very small data set as an example. Oh hey, we had one before! I’ve re-created it in Table 14.5. We’re going to add some more cases to it so it’s a little easier to illustrate what we’re doing.

Let’s say we’re interested in knowing more about the distribution of participant age. Let’s see a scatterplot of age (Figure 14.1). On our y-axis (the vertical one) is the value of age, and on our x-axis (the horizontal one) is the frequency of each age, or the number of times it appears in our data set.

Scatter plot of ages of respondents

Do you see any outliers in the scatter plot? There is one participant who is significantly older than the rest at age 54. Let’s think about what happens when we calculate our mean with and without that outlier. Complete the two exercises below by using the ages listed in our mini-data set in this section.

Next, let’s try it without the outlier.

With our outlier, the average age of our participants is 28, and without it, the average age is 25. That might not seem enormous, but it illustrates the effects of outliers on the mean.
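The same comparison is easy to run in code. The chapter’s actual data table isn’t reproduced here, so the ages below are hypothetical and the resulting means differ slightly from the 28 and 25 quoted above:

```python
# Mean with and without an outlier, using hypothetical ages
ages = [22, 24, 25, 26, 27, 28, 29, 54]   # 54 is the outlier ("Tom")

mean_with = sum(ages) / len(ages)
without = [a for a in ages if a != 54]
mean_without = sum(without) / len(without)

# The single extreme value pulls the mean of this small data set upward
print(round(mean_with, 1))
print(round(mean_without, 1))
```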

Just because Tom is an outlier at age 54 doesn’t mean you should exclude him. The most important thing about outliers is to think critically about them and how they could affect your analysis. Finding outliers should prompt a couple of questions. First, could the data have been entered incorrectly? Is Tom actually 24, and someone just hit the “5” instead of the “2” on the number pad? What might be special about Tom that he ended up in our group, given how different he is? Are there other relevant ways in which Tom differs from our group (is he an outlier in other ways)? Does it really matter that Tom is much older than our other participants? If we don’t think age is a relevant factor in ice cream preferences, then it probably doesn’t. If we do, then we probably should have made an effort to get a wider range of ages in our participants.

The median (also called the 50th percentile) is the middle value when all our values are placed in numerical order. If you have five values and you put them in numerical order, the third value will be the median. When you have an even number of values, you’ll have to take the average of the middle two values to get the median. So, if you have 6 values, the average of values 3 and 4 will be the median. Keep in mind that for large data sets, you’re going to want to use either Excel or a statistical program to calculate the median – otherwise, it’s nearly impossible logistically.

Like the mean, you can only calculate the median with interval/ratio variables, like age, test scores or years of post-high school education. The median is also a lot less sensitive to outliers than the mean. While it can be more time intensive to calculate, the median is preferable in most cases to the mean for this reason. It gives us a more accurate picture of where the middle of our distribution sits in most cases. In my work as a policy analyst and researcher, I rarely, if ever, use the mean as a measure of central tendency. Its main value for me is to compare it to the median for statistical purposes. So get used to the median, unless you’re specifically asked for the mean. (When we talk about t-tests in the next chapter, we’ll talk about when the mean can be useful.)

Let’s go back to our little data set and calculate the median age of our participants (Table 14.6).

Remember, to calculate the median, you put all the values in numerical order and take the number in the middle. When there’s an even number of values, take the average of the two middle values.

What happens if we remove Tom, the outlier?

With Tom in our group, the median age is 27.5, and without him, it’s 27. You can see that the median was far less sensitive to him being included in our data than the mean was.
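Python’s standard library handles the even/odd bookkeeping for you. Using the same hypothetical ages as before (so the medians come out as 26.5 and 26 rather than the chapter’s 27.5 and 27), the outlier’s small effect is easy to see:

```python
import statistics

# Hypothetical, already-sorted ages; the last value (54) is the outlier
ages = [22, 24, 25, 26, 27, 28, 29, 54]

median_with = statistics.median(ages)        # even n: average of the middle two
median_without = statistics.median(ages[:-1])  # drop the outlier (last value)

print(median_with, median_without)
```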

The mode of a variable is the most commonly occurring value. While you can calculate the mode for interval/ratio variables, it’s mostly useful when examining and describing nominal or ordinal variables. Think of it this way – do we really care that there are two people with an income of $38,000 per year, or do we care that these people fall into a certain category related to that value, like above or below the federal poverty level?

Let’s go back to our ice cream survey (Table 14.7).

We can use the mode for a few different variables here: gender, hometown and fav_ice_cream. The cool thing about the mode is that you can use it for both numeric/quantitative and text/qualitative variables.

So let’s find some modes. For hometown – or whether the participant’s hometown is the one in which the survey was administered or not – the mode is 0, or “no” because that’s the most common answer. For gender, the mode is 0, or “female.” And for fav_ice_cream, the mode is Chocolate, although there’s a lot of variation there. Sometimes, you may have more than one mode, which is still useful information.
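Since the mode works on numbers and text alike, it takes only a few lines of Python. The survey responses below are hypothetical stand-ins for Table 14.7, coded the same way (hometown 0 = “no”, mode Chocolate for flavor):

```python
from collections import Counter

# Hypothetical survey responses, coded like the chapter's example
hometown = [0, 0, 0, 1, 0, 1]  # 0 = "no", 1 = "yes"
fav_ice_cream = ["Chocolate", "Vanilla", "Chocolate", "Strawberry", "Chocolate"]

def mode(values):
    # most_common(1) returns the single most frequent (value, count) pair
    return Counter(values).most_common(1)[0][0]

print(mode(hometown))       # works for numeric codes...
print(mode(fav_ice_cream))  # ...and for text categories alike
```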

One final thing I want to note about these three measures of central tendency: if you’re using something like a ranking question or a Likert scale, depending on what you’re measuring, you might use a mean or median, even though these look like they will only spit out ordinal variables. For example, say you’re a car designer and want to understand what people are looking for in new cars. You conduct a survey asking participants to rank the characteristics of a new car in order of importance (an ordinal question). The most commonly occurring answer – the mode – really tells you the information you need to design a car that people will want to buy. On the flip side, if you have a scale of 1 through 5 measuring a person’s satisfaction with their most recent oil change, you may want to know the mean score because it will tell you, relative to most or least satisfied, where most people fall in your survey. To know what’s most helpful, think critically about the question you want to answer and about what the actual values of your variable can tell you.

  • The mean is the average value for a variable, calculated by adding all values and dividing the total by the number of cases. While the mean contains useful information about a variable’s distribution, it’s also susceptible to outliers, especially with small data sets.
  • In general, the mean is most useful with interval/ratio variables.
  • The median, or 50th percentile, is the exact middle of our distribution when the values of our variable are placed in numerical order. The median is usually a more accurate measurement of the middle of our distribution because outliers have a much smaller effect on it.
  • In general, the median is only useful with interval/ratio variables.
  • The mode is the most commonly occurring value of our variable. In general, it is only useful with nominal or ordinal variables.
  • Say you want to know the income of the typical participant in your study. Which measure of central tendency would you use? Why?
  • Find an interval/ratio variable and calculate the mean and median. Make a scatter plot and look for outliers.
  • Find a nominal variable and calculate the mode.

14.3 Frequencies and variability

  • Define descriptive statistics and understand when to use these methods.
  • Produce and describe visualizations to report quantitative data.

Descriptive statistics refer to a set of techniques for summarizing and displaying data. We’ve already been through the measures of central tendency (which are themselves descriptive statistics), which got their own section because they’re such a big topic. Now, we’re going to talk about other descriptive statistics and ways to visually represent data.

Frequency tables

One way to display the distribution of a variable is in a frequency table. Table 14.8, for example, is a frequency table showing a hypothetical distribution of scores on the Rosenberg Self-Esteem Scale for a sample of 40 college students. The first column lists the values of the variable—the possible scores on the Rosenberg scale—and the second column lists the frequency of each score. This table shows that there were three students who had self-esteem scores of 24, five who had self-esteem scores of 23, and so on. From a frequency table like this, one can quickly see several important aspects of a distribution, including the range of scores (from 15 to 24), the most and least common scores (22 and 17, respectively), and any extreme scores that stand out from the rest.

There are a few other points worth noting about frequency tables. First, the levels listed in the first column usually go from the highest at the top to the lowest at the bottom, and they usually do not extend beyond the highest and lowest scores in the data. For example, although scores on the Rosenberg scale can vary from a high of 30 to a low of 0, Table 14.8 only includes levels from 24 to 15 because that range includes all the scores in this particular data set. Second, when there are many different scores across a wide range of values, it is often better to create a grouped frequency table, in which the first column lists ranges of values and the second column lists the frequency of scores in each range. Table 14.9, for example, is a grouped frequency table showing a hypothetical distribution of simple reaction times for a sample of 20 participants. In a grouped frequency table, the ranges must all be of equal width, and there are usually between five and 15 of them. Finally, frequency tables can also be used for nominal or ordinal variables, in which case the levels are category labels. The order of the category labels is somewhat arbitrary, but they are often listed from the most frequent at the top to the least frequent at the bottom.
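Both kinds of table are straightforward to build in code. The scores below are hypothetical stand-ins (Table 14.8’s actual data aren’t reproduced here), and the grouping width of 5 is just an example:

```python
from collections import Counter

# Hypothetical self-esteem scores in the 15-24 range
scores = [24, 24, 23, 22, 22, 22, 21, 20, 19, 18, 16, 15]

# Simple frequency table: levels listed from highest to lowest
freq = Counter(scores)
for level in sorted(freq, reverse=True):
    print(level, freq[level])

# Grouped frequency table: ranges of equal width (here, width 5)
def grouped(values, width):
    groups = Counter((v // width) * width for v in values)
    return {f"{lo}-{lo + width - 1}": groups[lo] for lo in sorted(groups)}

print(grouped(scores, 5))
```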

A histogram is a graphical display of a distribution. It presents the same information as a frequency table but in a way that is grasped more quickly and easily. The histogram in Figure 14.2 presents the distribution of self-esteem scores in Table 14.8. The x-axis (the horizontal one) of the histogram represents the variable and the y-axis (the vertical one) represents frequency. Above each level of the variable on the x-axis is a vertical bar that represents the number of individuals with that score. When the variable is quantitative, as it is in this example, there is usually no gap between the bars. When the variable is nominal or ordinal, however, there is usually a small gap between them. (The gap at 17 in this histogram reflects the fact that there were no scores of 17 in this data set.)

Distribution shapes

When the distribution of a quantitative variable is displayed in a histogram, it has a shape. The shape of the distribution of self-esteem scores in Figure 14.2 is typical. There is a peak somewhere near the middle of the distribution and “tails” that taper in either direction from the peak. The distribution of Figure 14.2 is unimodal, meaning it has one distinct peak, but distributions can also be bimodal, as in Figure 14.3, meaning they have two distinct peaks. Figure 14.3, for example, shows a hypothetical bimodal distribution of scores on the Beck Depression Inventory. I know we talked about the mode mostly for nominal or ordinal variables, but you can actually use histograms to look at the distribution of interval/ratio variables, too, and still have a unimodal or bimodal distribution even if you aren’t calculating a mode. Distributions can also have more than two distinct peaks, but these are relatively rare in social work research.

Another characteristic of the shape of a distribution is whether it is symmetrical or skewed. The distribution in the center of Figure 14.4 is symmetrical: its left and right halves are mirror images of each other. The distribution on the left is negatively skewed, with its peak shifted toward the upper end of its range and a relatively long negative tail. The distribution on the right is positively skewed, with its peak toward the lower end of its range and a relatively long positive tail.

Range: A simple measure of variability

The variability of a distribution is the extent to which the scores vary around their central tendency. Consider the two distributions in Figure 14.5, both of which have the same central tendency: the mean, median, and mode of each distribution are 10. Notice, however, that the two distributions differ in terms of their variability. The top one has relatively low variability, with all the scores relatively close to the center. The bottom one has relatively high variability, with the scores spread across a much greater range.

One simple measure of variability is the range, which is simply the difference between the highest and lowest scores in the distribution. The range of the self-esteem scores in Table 14.8, for example, is the difference between the highest score (24) and the lowest score (15). That is, the range is 24 − 15 = 9. Although the range is easy to compute and understand, it can be misleading when there are outliers. Imagine, for example, an exam on which all the students scored between 90 and 100. It has a range of 10. But if there was a single student who scored 20, the range would increase to 80 – giving the impression that the scores were quite variable when in fact only one student differed substantially from the rest.
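The exam example above can be checked in a few lines of Python (the individual scores are hypothetical; only the 90–100 spread and the single score of 20 come from the text):

```python
# The range: highest score minus lowest score
def value_range(scores):
    return max(scores) - min(scores)

exam = [92, 95, 90, 98, 100, 94]       # everyone between 90 and 100
print(value_range(exam))                # a range of 10

exam_with_outlier = exam + [20]         # one student scored 20
print(value_range(exam_with_outlier))   # the range jumps to 80
```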

  • Descriptive statistics are a way to summarize and display data, and are essential to understand and report your data.
  • A frequency table is useful for nominal and ordinal variables and is needed to produce a histogram.
  • A histogram is a graphic representation of your data that shows how many cases fall into each level of your variable.
  • Variability is important to understand in analyzing your data because studying a phenomenon that does not vary for your population does not provide a lot of information.
  • Think about the dependent variable in your project. What would you do if you analyzed its variability for people of different genders, and there was very little variability?
  • What do you think it would mean if the distribution of the variable were bimodal?

Glossary

  • Univariate data analysis: a quantitative method in which a variable is examined individually to determine its distribution.
  • Distribution: the way the scores are distributed across the levels of a variable.
  • Levels: the possible values of the variable – like a participant’s age, income or gender.
  • Univariate: referring to data analysis that doesn’t examine how variables relate to each other.
  • Data analysis plan: an ordered outline that includes your research question, a description of the data you are going to use to answer it, and the exact analyses, step-by-step, that you plan to run to answer your research question.
  • Operationalization: the process of determining how to measure a construct that cannot be directly observed.
  • Multivariate analysis: a group of statistical techniques that examines the relationship between at least three variables.
  • Variable name: the name of your variable.
  • Cases: the rows in your data set. In social work, these are often your study participants (people), but can be anything from census tracts to black bears to trains.
  • Secondary data: data someone else has collected that you have permission to use in your research.
  • Data dictionary: the document where you list your variable names, what the variables actually measure or represent, and what each of the values of the variable means if the meaning isn’t obvious.
  • Measure of central tendency: one number that can give you an idea about the distribution of your data.
  • Mean: also called the average, calculated by adding all your cases and dividing the total by the number of cases.
  • Outliers: extreme values in your data.
  • Scatter plot: a graphical representation of data where the y-axis (the vertical one along the side) is your variable’s value and the x-axis (the horizontal one along the bottom) represents the individual instance in your data.
  • Median: the value in the middle when all our values are placed in numerical order; also called the 50th percentile.
  • Mode: the most commonly occurring value of a variable.
  • Descriptive statistics: a set of techniques for summarizing and presenting data.
  • Frequency table: a table that lays out how many cases fall into each level of a variable.
  • Histogram: a graphical display of a distribution.
  • Unimodal: a distribution with one distinct peak when represented on a histogram.
  • Bimodal: a distribution with two distinct peaks when represented on a histogram.
  • Symmetrical: a distribution with a roughly equal number of cases on either side of the median.
  • Skewed: a distribution where cases are clustered on one or the other side of the median.
  • Variability: the extent to which the levels of a variable vary around their central tendency (the mean, median, or mode).
  • Range: the difference between the highest and lowest scores in the distribution.

Graduate research methods in social work Copyright © 2020 by Matthew DeCarlo, Cory Cummings, Kate Agnelli is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Statology

How to Perform Univariate Analysis in R (With Examples)

The term univariate analysis refers to the analysis of one variable. You can remember this because the prefix “uni” means “one.”

There are three common ways to perform univariate analysis on one variable:

1. Summary statistics – Measures the center and spread of values.

2. Frequency table – Describes how often different values occur.

3. Charts – Used to visualize the distribution of values.

This tutorial provides an example of how to perform univariate analysis for the following variable:

Summary Statistics

We can use the following syntax to calculate various summary statistics for our variable:
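A Python analogue of R’s summary() and sd() for a single variable might look like the sketch below. The values themselves are hypothetical, chosen to be consistent with the frequency table reported further down:

```python
import statistics

# Hypothetical values for the variable under analysis
x = [1, 1, 2, 3.5]

# Center and spread, roughly what R's summary() and sd() report
summary = {
    "min": min(x),
    "median": statistics.median(x),
    "mean": statistics.mean(x),
    "max": max(x),
    "sd": statistics.stdev(x),  # sample standard deviation, like R's sd()
}
print(summary)
```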

Frequency Table

We can use the following syntax to produce a frequency table for our variable:
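R’s table() output can be mirrored with Python’s Counter; the values below are hypothetical, reconstructed to match the counts the tutorial reports next:

```python
from collections import Counter

# Hypothetical values consistent with the counts described below
x = [1, 1, 2, 3.5]

freq = Counter(x)  # analogous to R's table(x)
for value in sorted(freq):
    print(value, freq[value])
```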

This tells us that:

  • The value 1 occurs 2 times
  • The value 2 occurs 1 time
  • The value 3.5 occurs 1 time

We can produce a boxplot using the following syntax: 

We can produce a histogram using the following syntax: 

We can produce a density curve using the following syntax: 

Each of these charts gives us a unique way to visualize the distribution of values for our variable.

You can find more R tutorials on this page .

Hey there. My name is Zach Bobbitt. I have a Master of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike.  My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.


Univariate Analysis of Variance in SPSS

Discover Univariate Analysis of Variance in SPSS! Learn how to perform it, understand SPSS output, and report results in APA style. Check out this simple, easy-to-follow guide below for a quick read!

Introduction

Welcome to our exploration of Univariate Analysis of Variance, a statistical method that unlocks valuable insights when comparing means across multiple groups. Whether you’re a student engaged in a research project or a seasoned researcher investigating diverse populations, the One-Way ANOVA Test proves indispensable in discerning whether there are significant differences among group means. In this blog post, we’ll traverse the fundamentals of univariate analysis, from its definition to its practical application using SPSS. By the end, you’ll possess not only a solid theoretical understanding but also the practical skills to conduct and interpret this powerful statistical analysis.

What is the Univariate Analysis?

ANOVA stands for Analysis of Variance, and “One-Way” denotes a scenario where there is a single independent variable with more than two levels or groups. Essentially, this test assesses whether the means of these groups are significantly different from each other. It’s a robust method for scenarios like comparing the performance of students across multiple teaching methods or examining the impact of different treatments on a medical condition. The One-Way ANOVA Test yields valuable insights into group variations, providing researchers with a statistical lens to discern patterns and make informed decisions. Now, let’s delve deeper into the assumptions, hypotheses, and the step-by-step process of conducting the One-Way ANOVA Test in SPSS.

Assumptions of the One-Way ANOVA Test

Before delving into the intricacies of the One-Way ANOVA Test, let’s outline its critical assumptions:

  • Normality: The dependent variable should be approximately normally distributed within each group.
  • Homogeneity of Variances: The variances of the groups being compared should be approximately equal. This assumption is crucial for the validity of the test.
  • Independence: Observations within each group must be independent of each other.

Adhering to these assumptions ensures the reliability of the One-Way ANOVA Test results, providing a strong foundation for accurate statistical analysis.

Hypothesis of the Univariate Analysis of Variance (ANOVA) Test

Moving on to the formulation of hypotheses in the One-Way ANOVA Test:

  • The null hypothesis (H0): There is no significant difference in the means of the groups.
  • The alternative hypothesis (H1): There is a significant difference in the means of the groups.

Clear and specific hypotheses are crucial for the subsequent statistical analysis and interpretation.

Post-Hoc Tests for ANOVA

While the One-Way ANOVA is powerful in detecting overall group differences, it doesn’t provide specific information on which pairs of groups differ significantly. Post-hoc tests become essential in this context to conduct pairwise comparisons and identify the specific groups responsible for the observed overall difference. Without post-hoc tests, researchers might miss crucial nuances in the data, leading to incomplete or inaccurate interpretations.

Here are commonly used Post-hoc Tests for One-Way ANOVA:

  • Tukey’s Honestly Significant Difference (HSD): Ideal when there are equal sample sizes and variances across groups. It controls the familywise error rate, making it suitable for multiple comparisons.
  • Bonferroni Correction : Helpful when conducting numerous comparisons. It’s more conservative, adjusting the significance level to counteract the increased risk of Type I errors.
  • Scheffe Test : Useful for unequal sample sizes and variances. It’s more robust but might be conservative in some situations.
  • Dunnett’s Test : Designed for comparing each treatment group with a control group. It’s suitable for situations where there is a control group and multiple treatment groups.
  • Games-Howell Test: Useful when sample sizes and variances are unequal across groups. It’s a robust option for situations where assumptions of homogeneity are not met.

Choosing the appropriate post-hoc test depends on the characteristics of your data and the specific research context. Consider factors such as sample sizes, homogeneity of variances, and the number of planned comparisons when deciding on the most suitable post-hoc test for your One-Way ANOVA results.

Example of Univariate Analysis of Variance Analysis

To illustrate the practical application of the One-Way ANOVA Test, let’s consider a hypothetical scenario. Imagine you’re studying the effectiveness of different fertilizers on the growth of plants. You have three groups, each treated with a different fertilizer.

  • The null hypothesis: there’s no significant difference in the mean plant growth across the three fertilizers.
  • The alternative hypothesis: there is a significant difference in the mean plant growth across the three fertilizers.

By conducting the One-Way ANOVA Test, you can statistically evaluate whether the observed differences in plant growth are likely due to the different fertilizers’ effectiveness or if they could occur by random chance alone. This example demonstrates how the One-Way ANOVA Test can be a valuable tool in diverse fields, providing insights into the impact of various factors on the dependent variable.
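To make the arithmetic behind SPSS’s output concrete, the F-ratio for such a design can be computed by hand: between-groups mean square divided by within-groups mean square. The plant-growth measurements below are hypothetical:

```python
# One-way ANOVA "by hand" for the fertilizer example, with hypothetical
# plant-growth measurements (cm) for three fertilizer groups.
groups = {
    "A": [20.0, 22.0, 19.0, 21.0],
    "B": [25.0, 27.0, 26.0, 24.0],
    "C": [20.0, 21.0, 22.0, 19.0],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = sum(all_values) / len(all_values)

# Between-groups sum of squares: group sizes times squared mean deviations
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)
# Within-groups sum of squares: deviations from each group's own mean
ss_within = sum(
    (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
)

df_between = len(groups) - 1                 # k - 1 groups
df_within = len(all_values) - len(groups)    # N - k observations
f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(round(f_ratio, 2))
```

A large F-ratio like this one means the group means differ far more than the within-group scatter would predict by chance.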

How to Perform Univariate Analysis of Variance in SPSS

Step by Step: Running  ANOVA Test in SPSS Statistics

Let’s delve into the step-by-step process of conducting a Univariate Analysis of Variance in SPSS:

  • STEP: Load Data into SPSS

Commence by launching SPSS and loading your dataset, which should encompass the variables of interest – a continuous dependent variable and a categorical independent variable. If your data is not already in SPSS format, you can import it by navigating to File > Open > Data and selecting your data file.

  • STEP: Access the Analyze Menu

In the top menu, locate and click on “Analyze.” Within the “Analyze” menu, navigate to “General Linear Model” and choose “Univariate” (Analyze > General Linear Model > Univariate).

  • STEP: Specify Variables 

In the dialogue box, move the dependent variable to the “Dependent Variable” field. Move the variable representing the group or factor to the “Fixed Factor(s)” field. This is the independent variable with different levels or groups.

  • STEP: Plots and Post-Hoc Tests

Click the “Plots” button, move the factor into the “Horizontal Axis” box, and then click the “Add” button.

Click the “Post Hoc” button, check “Tukey”, and adjust as per your analysis requirements.

  • STEP: Options

Click the “Options” button and check “Descriptive statistics”, “Homogeneity tests” and “Estimates of effect size”.

  • STEP: Generate SPSS Output

Once you have specified your variables and chosen options, click the “OK” button to perform the analysis. SPSS will generate a comprehensive output, including descriptive statistics, the ANOVA table, post-hoc comparisons, and the requested plots.

Conducting a One-Way ANOVA test in SPSS provides a robust foundation for understanding the key features of your data. Always ensure that you consult the documentation corresponding to your SPSS version, as steps might slightly differ based on the software version in use. This guide is tailored for SPSS version 25 , and for any variations, it’s recommended to refer to the software’s documentation for accurate and updated instructions.

SPSS Output for One Way ANOVA

How to Interpret SPSS Output of Univariate Analysis

SPSS will generate output including descriptive statistics, the F value, degrees of freedom, the p-value, and post-hoc results.

Descriptives Table

  • Mean and Standard Deviation: Evaluate the means and standard deviations of each group. This provides an initial overview of the central tendency and variability within each group.
  • Sample Size (N): Confirm the number of observations in each group. Discrepancies in sample sizes could impact the interpretation.
  • 95% Confidence Interval (CI): Review the confidence interval for the mean difference.

Test of Homogeneity of Variances Table

  • Levene’s Test: In the Test of Homogeneity of Variances table, look at Levene’s Test statistic and associated p-value. This test assesses whether the variances across groups are roughly equal. A non-significant p-value suggests that the assumption of homogeneity of variances is met.

ANOVA Table

  • Between-Groups and Within-Groups Variability: Move on to the ANOVA table, which displays the Between-Groups and Within-Groups sums of squares, degrees of freedom, mean squares, the F-ratio, and the p-value.
  • F-Ratio: Focus on the F-ratio. A higher F-ratio indicates larger differences among group means relative to within-group variability.
  • Degrees of Freedom: Note the degrees of freedom for Between-Groups and Within-Groups. These values are essential for calculating the critical F-value.
  • P-Value: Examine the p-value associated with the F-ratio. If the p-value is below your chosen significance level (commonly 0.05), it suggests that at least one group’s mean is significantly different.

Post Hoc Tests Table

  • Specific Group Differences: If you conducted post-hoc tests, examine the results. Look for significant differences between specific pairs of groups. Pay attention to p-values and confidence intervals to identify which groups are significantly different from each other.

Effect Size Measures

  • Eta-squared: If available, consider effect size measures in the ANOVA table. Eta-squared indicates the proportion of variance in the dependent variable explained by the group differences.
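The calculation itself is simple: eta-squared is the between-groups sum of squares divided by the total sum of squares. A quick sketch with hypothetical sums of squares:

```python
# Eta-squared: the share of total variance explained by group membership.
# The sums of squares below are hypothetical ANOVA-table values.
ss_between = 200.0 / 3
ss_within = 15.0

eta_squared = ss_between / (ss_between + ss_within)  # SS_between / SS_total
print(round(eta_squared, 3))
```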

How to Report Results of One-Way ANOVA Test in APA

Reporting the results of a One-Way ANOVA Test in APA style ensures clarity and adherence to established guidelines. Begin with a concise description of the analysis conducted, including the test name, the dependent variable, and the independent variable representing the groups.

For instance, “A One-Way Analysis of Variance (ANOVA) was conducted to examine the differences in plant growth across different fertilizers.”

Present the key statistical findings from the ANOVA table, including the F-ratio, degrees of freedom, and p-value. For example, “The results revealed a significant difference in plant growth among the fertilizers, F(df_between, df_within) = [F-ratio], p = [p-value].”

If the p-value is significant, proceed with post-hoc tests (e.g., Tukey’s HSD) to pinpoint specific group differences. Additionally, report effect size measures to provide a comprehensive overview of the results.

Conclude the report by summarising the implications of the findings in relation to your research question or hypothesis. This structured approach to reporting One-Way ANOVA results in APA format ensures transparency and facilitates the understanding of your research outcomes.



Multivariate analysis: an overview

Posted on 9th September 2021 by Vighnesh D

""

Data analysis is one of the most useful tools when one tries to understand the vast amount of information presented to them and synthesise evidence from it. There are usually multiple factors influencing a phenomenon.

Of these, some can be observed, documented and interpreted thoroughly while others cannot. For example, in order to estimate the burden of a disease in society there may be a lot of factors which can be readily recorded, and a whole lot of others which are unreliable and, therefore, require proper scrutiny. Factors like incidence, age distribution, sex distribution and financial loss owing to the disease can be accounted for more easily when compared to contact tracing, prevalence and institutional support for the same. Therefore, it is of paramount importance that data collection and interpretation be done thoroughly in order to avoid common pitfalls.

[Image: xkcd comic. Two panels show the same scatter plot; the first has a nearly horizontal regression line labeled “R squared = 0.06”, the second connects the points into a picture of a person holding a dog, labeled “Rexthor, The Dog-Bearer”. Caption: “I don’t trust linear regressions when it’s harder to guess the direction of the correlation from the scatter plot than to find new constellations on it.”]

Image from: https://imgs.xkcd.com/comics/useful_geometry_formulas.png under Creative Commons License 2.5 Randall Munroe. xkcd.com.

Why does it sound so important?

Data collection and analysis are emphasised in academia because those very findings determine the policy of governing bodies; the implications that follow are therefore the direct product of the information fed into the system.

Introduction

In this blog, we will discuss types of data analysis in general and multivariate analysis in particular. The aim is to introduce the concept to investigators inclined towards this discipline by reducing the complexity around the subject.

Analysis of data based on the types of variables in consideration is broadly divided into three categories:

  • Univariate analysis: The simplest of all data analysis models, univariate analysis considers only one variable at a time. Although simple to apply, it is of limited use for analysing large data sets. E.g. the incidence of a disease.
  • Bivariate analysis: As the name suggests, bivariate analysis takes two variables into consideration. It has a slightly wider area of application but is nevertheless limited when it comes to large sets of data. E.g. the incidence of a disease and the season of the year.
  • Multivariate analysis: Multivariate analysis takes a whole host of variables into consideration, which makes it a complicated but essential tool. The great virtue of such a model is that it takes as many factors into account as possible, greatly reducing bias and giving a result closest to reality. For example, see the factors discussed in the overview section of this article.
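The three levels can be illustrated with a short pandas sketch. The disease-incidence figures and column names below are hypothetical, chosen only to mirror the examples above:

```python
import pandas as pd

# Hypothetical records of disease incidence (illustrative data only).
df = pd.DataFrame({
    "cases":  [12, 30, 45, 22, 15, 33, 48, 25],
    "season": ["winter", "spring", "summer", "autumn"] * 2,
    "age":    [30, 42, 55, 61, 28, 47, 52, 66],
})

# Univariate: one variable at a time, e.g. the distribution of case counts.
univariate = df["cases"].describe()

# Bivariate: two variables together, e.g. mean cases per season.
bivariate = df.groupby("season")["cases"].mean()

# Multivariate: several variables at once, e.g. the correlation matrix
# of all numeric columns.
multivariate = df[["cases", "age"]].corr()
```

Each step simply widens the lens: one column, a pair of columns, then every numeric column together.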

Multivariate analysis is defined as:

The statistical study of data where multiple measurements are made on each experimental unit and where the relationships among multivariate measurements and their structure are important

Multivariate statistical methods incorporate several techniques depending on the situation and the question in focus. Some of these methods are listed below:

  • Regression analysis: Used to determine the relationship between a dependent variable and one or more independent variables.
  • Analysis of Variance (ANOVA): Used to determine the relationship between collections of data by analysing the differences in their means.
  • Interdependence analysis: Used to determine the relationships within a set of variables among themselves.
  • Discriminant analysis: Used to classify observations into two or more distinct sets of categories.
  • Classification and cluster analysis: Used to find similarity within a group of observations.
  • Principal component analysis: Used to interpret data in its simplest form by introducing new, uncorrelated variables.
  • Factor analysis: Similar to principal component analysis, this too is used to reduce large data sets to a small, interpretable form.
  • Canonical correlation analysis: Perhaps the most complex model of all of the above, canonical correlation attempts to interpret data by analysing the relationships between cross-covariance matrices.
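To give a flavour of one of these techniques, principal component analysis can be sketched in a few lines of NumPy via the eigendecomposition of the covariance matrix; the synthetic data below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 observations of 3 variables, the first two correlated.
x = rng.normal(size=(100, 1))
X = np.hstack([x,
               0.8 * x + 0.2 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

Xc = X - X.mean(axis=0)                 # centre each variable
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = eigvals.argsort()[::-1]         # re-sort by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                   # the new, uncorrelated variables
explained = eigvals / eigvals.sum()     # variance share of each component
```

Most of the variance ends up in the first component, which is exactly the data reduction the method is used for.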

ANOVA remains one of the most widely used statistical models in academia. Of the several types of ANOVA models, one subtype is frequently used because of the factors involved in the relevant studies. Traditionally, it has found its application in behavioural research, i.e. psychology, psychiatry and allied disciplines. This model is called the Multivariate Analysis of Variance (MANOVA). It is widely described as the multivariate analogue of ANOVA, which itself is used for interpreting univariate data.
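The univariate building block itself, a one-way ANOVA comparing group means, can be sketched with SciPy; the three treatment groups below are hypothetical:

```python
from scipy import stats

# Hypothetical scores from three treatment groups (illustrative only).
group_a = [23, 25, 21, 24, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [22, 24, 23, 25, 21]

# One-way ANOVA: do the group means differ more than chance would allow?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
```

MANOVA generalises this comparison to several dependent variables analysed jointly.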

[Image: an xkcd comic in four panels. A stick figure at a desk tries a bell-curve cut-out labelled "Student's T Distribution" over a sheet of paper ("Hmm… nope"), carries it away, and replaces it with a many-peaked curve labelled "Teachers' T Distribution".]

Image from: https://imgs.xkcd.com/comics/t_distribution.png under Creative Commons License 2.5. Randall Munroe, xkcd.com.

Interpretation of results

Interpretation of results is probably the most difficult part of the technique. The relevant results are generally summarised in a table with accompanying text. Appropriate information must be highlighted regarding:

  • Multivariate test statistics used
  • Degrees of freedom
  • Appropriate test statistics used
  • Calculated p-value (p < x)

Reliability and validity of the test are the most important determining factors in such techniques.

Applications

Multivariate analysis is used in several disciplines. One of its most distinguishing features is that it can be used in parametric as well as non-parametric tests.

Quick question: What are parametric and non-parametric tests?

  • Parametric tests: Tests which make certain assumptions regarding the distribution of the data, i.e. that it falls within fixed parameters.
  • Non-parametric tests: Tests which make no assumptions with respect to distribution; the data are assumed to be distribution-free.
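The contrast can be seen directly in SciPy, where the two families sit side by side; the samples below are made-up illustrations:

```python
from scipy import stats

# Two small hypothetical samples (illustrative only).
sample_a = [4.1, 5.2, 4.8, 5.0, 4.6, 5.3, 4.9, 4.7, 5.1, 4.5]
sample_b = [6.0, 6.4, 5.9, 6.2, 6.1, 6.5, 5.8, 6.3, 6.0, 6.2]

# Parametric: Student's t-test assumes roughly normal data with equal variance.
t_stat, t_p = stats.ttest_ind(sample_a, sample_b)

# Non-parametric: the Mann-Whitney U test compares ranks instead of means
# and makes no distributional assumption.
u_stat, u_p = stats.mannwhitneyu(sample_a, sample_b)
```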

Parametric tests              | Non-parametric tests
----------------------------- | ------------------------------
Based on interval/ratio scale | Based on nominal/ordinal scale
Outliers absent               | Outliers present
Uniformly distributed data    | Non-uniform data
Equal variance                | Unequal variance
Sample size is usually large  | Sample size is usually small

Uses of multivariate analysis: Multivariate analyses are used principally for four reasons: to see patterns in data, to make clear comparisons, to discard unwanted information, and to study multiple factors at once. Applications of multivariate analysis are found in almost all the disciplines which make up the bulk of policy-making, e.g. economics, healthcare, pharmaceutical industries, applied sciences, sociology, and so on. Multivariate analysis has enjoyed a traditional stronghold in behavioural sciences like psychology, psychiatry and allied fields because of the complex nature of these disciplines.

Multivariate analysis is one of the most useful methods for determining relationships and analysing patterns among large sets of data. It is particularly effective in minimising bias if a structured study design is employed. However, the complexity of the technique makes it a less sought-after model for novice research enthusiasts. Therefore, although designing the study and interpreting the results are tedious, the techniques stand out in finding relationships in complex situations.

References (pdf)



I got good information on multivariate data analysis and the advantages and patterns of using multivariate analysis.


Great summary. I found this very useful for starters


Thank you so much for the discussion on multivariate design in research. However, I want to know more about multiple regression analysis. Hope to learn more from you.


Thank you for letting the author know this was useful, and I will see if there are any students wanting to blog about multiple regression analysis next!


When you want to know what contributed to an outcome, what study is done?


Dear Philip, Thank you for bringing this to our notice. Your input regarding the discussion is highly appreciated. However, since this particular blog was meant to be an overview, I consciously avoided the nuances to prevent complicated explanations at an early stage. I am planning to expand on the matter in subsequent blogs and will keep your suggestion in mind while drafting for the same. Many thanks, Vighnesh.


Sorry, I don’t want to be pedantic, but shouldn’t we differentiate between ‘multivariate’ and ‘multivariable’ regression? https://stats.stackexchange.com/questions/447455/multivariable-vs-multivariate-regression https://www.ajgponline.org/article/S1064-7481(18)30579-7/fulltext


What is Univariate, Bivariate & Multivariate Analysis in Data Visualisation?

Data visualisation is a graphical representation of information and data. By using different visual elements such as charts, graphs, and maps, data visualization tools provide us with an accessible way to find and understand hidden trends and patterns in data.

In this article, we are going to look at univariate, bivariate and multivariate analysis in data visualisation using Python.

Univariate Analysis

Univariate analysis is a type of data visualization where we visualize only a single variable at a time. It helps us analyze the distribution of the variable present in the data so that we can perform further analysis. You can find the link to the dataset here.


Here we’ll be performing univariate analysis on numerical variables using the histogram function.
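A minimal sketch of that step using matplotlib's histogram directly; the `Age` values below are a hypothetical stand-in for the article's dataset, whose column names are assumed:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the article's employee data (column name assumed).
df = pd.DataFrame({"Age": [22, 25, 28, 31, 34, 36, 39, 41, 45, 48, 52, 58]})

# Histogram of a single numerical variable: its distribution at a glance.
fig, ax = plt.subplots()
counts, bins, _ = ax.hist(df["Age"], bins=6, edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")
```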


For univariate analysis of categorical data, we’ll use the countplot function from the seaborn library.
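The same idea can be sketched without seaborn by counting categories with pandas and drawing the bars with matplotlib; seaborn's `sns.countplot(x=...)` produces the equivalent chart in one call. The `BusinessTravel` values below are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the article's BusinessTravel column.
travel = pd.Series(["Travel_Rarely"] * 7 + ["Travel_Frequently"] * 3
                   + ["Non-Travel"] * 2, name="BusinessTravel")

counts = travel.value_counts()  # one count per category

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xlabel("BusinessTravel")
ax.set_ylabel("Count")
```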


The bars in the chart represent the count of each category present in the business travel column.

A pie chart helps us to visualize the percentage of the data belonging to each category.
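A sketch of the pie chart, reusing the same hypothetical category counts; `autopct` prints each slice's percentage:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical category counts (illustrative only).
counts = pd.Series({"Travel_Rarely": 7, "Travel_Frequently": 3, "Non-Travel": 2})
shares = counts / counts.sum() * 100  # percentage per category

fig, ax = plt.subplots()
ax.pie(counts.values, labels=counts.index, autopct="%1.1f%%")
```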


Bivariate analysis

Bivariate analysis is the simultaneous analysis of two variables. It explores whether an association exists between the two variables and how strong it is, or whether there are differences between the two variables and how significant those differences are.

The three main types we will see here are:

  • Categorical vs Numerical
  • Numerical vs Numerical
  • Categorical vs Categorical

Categorical vs Numerical


Here the black horizontal line in each box (the median) indicates large differences in length of service among the different departments.
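A box plot per department makes that comparison; the department and length-of-service values below are hypothetical stand-ins for the article's columns:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical department / length-of-service records (column names assumed).
df = pd.DataFrame({
    "Department":    ["Sales"] * 4 + ["R&D"] * 4 + ["HR"] * 4,
    "LengthService": [2, 3, 4, 5, 8, 10, 12, 14, 1, 2, 2, 3],
})

# Categorical vs numerical: one box per department; the horizontal line in
# each box marks that department's median length of service.
groups = df.groupby("Department")["LengthService"]
medians = groups.median()

fig, ax = plt.subplots()
ax.boxplot([g.values for _, g in groups])
ax.set_ylabel("LengthService")
```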

Numerical vs Numerical


It displays the age and length of service of employees in the organization; we can see that younger employees tend to have less experience in terms of length of service.
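A scatter-plot sketch of that relationship, on hypothetical age and length-of-service pairs; the correlation coefficient quantifies the trend the plot shows:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical age / length-of-service pairs (illustrative only).
df = pd.DataFrame({
    "Age":           [22, 25, 28, 32, 36, 41, 45, 50, 55, 60],
    "LengthService": [1, 2, 3, 5, 7, 10, 12, 15, 18, 22],
})

# Numerical vs numerical: each point is one employee.
fig, ax = plt.subplots()
ax.scatter(df["Age"], df["LengthService"])
ax.set_xlabel("Age")
ax.set_ylabel("LengthService")

r = df["Age"].corr(df["LengthService"])  # strength of the linear association
```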

Categorical vs Categorical

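For two categorical variables, the usual starting point is a contingency table of counts; the department and attrition values below are hypothetical:

```python
import pandas as pd

# Hypothetical categorical columns (column names assumed for illustration).
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "R&D", "R&D", "HR", "Sales", "R&D", "HR"],
    "Attrition":  ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No"],
})

# Categorical vs categorical: counts for every combination of categories.
table = pd.crosstab(df["Department"], df["Attrition"])

# table.plot(kind="bar", stacked=True) turns this into the usual stacked bars.
```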

Multivariate Analysis

Multivariate analysis is an extension of bivariate analysis: it involves more than two variables at the same time, to find correlations between them. It is a set of statistical models that examine patterns in multidimensional data by considering several variables at once.


Here we are using a heat map to check the correlation between all the columns in the dataset. A heat map is a data visualisation technique that shows the magnitude of a phenomenon as colour in two dimensions. Correlation values range from -1 to +1, where -1 means a strong negative correlation and +1 means a strong positive correlation.
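A minimal heat-map sketch with pandas and matplotlib (seaborn's `sns.heatmap(corr)` is the one-line equivalent); the numeric columns below are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric columns (illustrative only).
df = pd.DataFrame({
    "Age":           [22, 25, 28, 32, 36, 41, 45, 50],
    "LengthService": [1, 2, 3, 5, 7, 10, 12, 15],
    "DailyRate":     [900, 300, 700, 1100, 500, 800, 400, 1000],
})

# Correlation matrix of every numeric column, drawn as a heat map:
# each cell's colour encodes a coefficient between -1 and +1.
corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
```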


ORIGINAL RESEARCH article

Overweight as a biomarker for concomitant thyroid cancer in patients with Graves’ disease

Joonseon Park

  • Department of Surgery, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea

The incidence of concomitant thyroid cancer in Graves’ disease varies, and Graves’ disease can make the diagnosis and management of thyroid nodules more challenging. Since the majority of Graves’ disease patients primarily receive non-surgical treatment, identifying biomarkers for concomitant thyroid cancer in patients with Graves’ disease may facilitate planning the surgery. The aim of this study is to identify the biomarkers for concurrent thyroid cancer in Graves’ disease patients and evaluate the impact of being overweight on cancer risk. This retrospective cohort study analyzed 122 patients with Graves’ disease who underwent thyroid surgery at Seoul St. Mary’s Hospital (Seoul, Korea) from May 2010 to December 2022. Body mass index (BMI), preoperative thyroid function test, and thyroid stimulating hormone receptor antibody (TR-Ab) were measured. Overweight was defined as a BMI of 25 kg/m² or higher according to the World Health Organization (WHO). Most patients (88.5%) underwent total or near-total thyroidectomy. Multivariate analysis revealed that patients who were overweight had a higher risk of malignancy (odds ratio, 3.108; 95% confidence interval, 1.196–8.831; p = 0.021). Lower gland weight and lower preoperative TR-Ab were also biomarkers for malignancy in Graves’ disease. Overweight patients with Graves’ disease had a higher risk of thyroid cancer than non-overweight patients. A comprehensive assessment of overweight patients with Graves’ disease is imperative for identifying concomitant thyroid cancer.

1 Introduction

Graves’ disease (GD) is an autoimmune disease that causes hyperthyroidism by stimulating the thyroid gland to produce excessive thyroid hormone due to the presence of thyroid stimulating hormone receptor antibody (TR-Ab) ( 1 – 4 ). Surgical intervention is required for the management of GD in cases of failed medical therapy, severe or rapidly progressing disease with compressive symptoms, concomitant thyroid cancer, worsening Graves’ ophthalmopathy, or based on patient’s preference ( 1 , 5 – 7 ).

The reported incidence of concomitant thyroid cancer in patients with GD varies, ranging from 1% to 22%, and some studies reported that the incidence of thyroid cancer is higher in patients with GD than the incidence in the general population ( 8 – 11 ). Although the relationship between GD and thyroid cancer is unclear, GD can make the diagnosis and management of thyroid nodules more challenging ( 12 – 16 ). In patients with GD and concomitant thyroid cancer, most surgeries are planned after nodules are diagnosed by ultrasound or fine-needle aspiration biopsy (FNAB). However, thyroid cancer is occasionally identified incidentally in the pathologic examination after surgery ( 17 – 19 ). These cases are indications that surgery was necessary, and cancer could have been missed if surgery had not been performed for other reasons. Therefore, identifying biomarkers for concomitant thyroid cancer in patients with GD may facilitate planning the surgery and more thorough screening, even if a nodule is not discovered before surgery.

Previous studies have identified risk factors for concomitant thyroid cancer in patients with GD, including TR-Ab, preoperative nodules, previous external radiation, and younger age ( 13 , 20 – 24 ). Regardless of the existence of GD, morbid obesity affects the incidence and aggressiveness of thyroid cancer in euthyroid patients ( 25 – 29 ). However, few studies have investigated the relationship between thyroid cancer in patients with GD and obesity. In a study of 216 GD patients, those with thyroid cancer had significantly higher body mass index (BMI) compared to those without thyroid cancer ( 30 ). Since weight loss is common in patients with GD ( 31 ), investigations into the relationship between being overweight or obese and GD are needed. The aim of this study was to identify biomarkers for concurrent thyroid cancer in patients with GD and identify the effects of being overweight on cancer risk.

2 Materials and methods

2.1 Patients

We retrospectively reviewed the medical charts and pathology reports of 132 patients with GD who underwent thyroid surgery from May 2010 to December 2022 at Seoul St. Mary’s Hospital (Seoul, Korea). Five patients with newly diagnosed GD after lobectomy, one patient with distant metastasis of thyroid cancer at initial diagnosis, one patient who underwent the initial operation at a different hospital, two patients with insufficient data, and one patient who was lost to follow-up were excluded from the study. Thus, 122 patients were included in the analysis ( Figure 1 ). The mean follow-up duration was 52.8 ± 39.6 months (range, 4.8–144.0 months).


Figure 1 Participant flow diagram of patient selection. GD, Graves’ disease.

Overweight was defined as a BMI of 25 kg/m² or higher according to the World Health Organization (WHO) and the International Association for the Study of Obesity (IASO) ( 32 ). WHO and IASO define obesity as a BMI of 30 or above ( 33 , 34 ). However, only 7 (5.7%) patients were obese in the present study, according to these criteria (BMI ≥ 30 kg/m²). Moreover, Asian countries have lower cut-off values due to a higher prevalence of obesity-related diseases at lower BMI levels ( 35 ). As this study included Korean individuals, the patients were divided by a BMI of 25, which is the standard for overweight defined by WHO and for obesity in Asia ( 36 ).

2.2 Preoperative management and follow-up assessment

Height and weight were assessed in all patients the day prior to surgery to mitigate potential measurement and temporal biases. BMI was calculated by dividing weight in kilograms by the square of height in meters (kg/m²). The duration of GD was defined as the number of years between the date of initial diagnosis and the date of surgery. Disease status was assessed using the serum thyroid function test (TFT), including thyroid stimulating hormone (TSH), triiodothyronine (T3), free thyroxine (T4), and TR-Ab levels before surgery, either as outpatients or after hospital admission. Pathology reports were used to review the final results after surgery.

Patients with GD received treatment based on the 2016 American Thyroid Association (ATA) guidelines for hyperthyroidism ( 1 ). Patients with concomitant thyroid cancer were managed according to the 2015 ATA management guidelines for differentiated thyroid cancer ( 37 ). After the thyroidectomy, all patients discontinued antithyroid drugs and started taking L-T4 at a daily dosage suitable for their body weight (1.6 μg/kg). Patients with concomitant thyroid cancer were closely monitored every 3–6 months during the first year and then annually thereafter. Thyroid ultrasonography was conducted annually for patients with cancer.

2.3 Primary endpoint

The primary endpoint was the rate of overweight in GD patients with and without concomitant thyroid cancer.

2.4 Statistical analysis

Continuous variables were reported as means with standard deviations, while categorical variables were presented as numbers with percentages. Continuous variables were compared with Student’s t-tests and Mann-Whitney test, and categorical characteristics were compared using Pearson’s chi-square tests or Fisher’s exact tests. Univariate Cox regression analyses were conducted to determine the biomarkers for postoperative hypoparathyroidism and malignancy in patients with GD. Statistically significant variables were included in the multivariate Cox proportional hazard model. Odds ratios (ORs) with 95% confidence intervals (CIs) were calculated. Statistical significance was defined as p-values < 0.05. The Statistical Package for the Social Sciences (version 24.0; IBM Corp., Armonk, NY, USA) was used for all statistical analyses.

3.1 Baseline clinicopathological characteristics of the study population

Table 1 presents the clinicopathological characteristics of the 122 patients in the study. The average age was 45.7 years (range, 15–77), and the average BMI was 23.4 kg/m2 (range, 17.2–37.0). Thirty-five patients (28.7%) were classified as overweight. The mean disease duration was 5.9 years, and the mean gland weight was 105.6 grams (range, 7.6–471.4). Most patients (110 patients, 90.2%) underwent total or near-total thyroidectomy; 11 (9.0%) patients underwent lobectomy, and one patient (0.8%) underwent total thyroidectomy with modified radical neck dissection (mRND). The 11 patients who underwent lobectomy exhibited proper regulation of thyroid function prior to surgery, and preoperative diagnosis confirmed the existence of unifocal cancer or follicular neoplasm smaller than 2 cm (range, 0.3–1.8 cm). The pathology was benign in 79 (64.8%) patients, while 43 (35.2%) patients exhibited malignant pathology. The preoperative TFT showed a mean TSH level of 1.7 ± 7.8 mIU/L (range, 0.0–77.9), a mean T3 level of 1.7 ± 0.7 ng/mL (range, 0.5–5.1), a mean free T4 level of 1.4 ± 0.7 ng/dL (range, 0.3–4.1), and a mean TR-Ab level of 26.4 ± 35.7 IU/L (range, 0.3–292.8). Forty-four patients (36.1%) underwent surgery due to refractory disease or medication complications, 31 (25.4%) patients due to large goiters with compressive symptoms, 10 patients (8.2%) due to ophthalmopathies, and 37 (30.3%) patients due to cancer or follicular neoplasm diagnoses before surgery. Postoperative complications are described in Supplementary Table 1. Unilateral vocal cord palsy (VCP) occurred in 3 (2.5%) patients, and no bilateral VCP occurred. Hypoparathyroidism was transient in 48 (39.3%) patients and permanent in 3 (2.5%) patients. No cases of hematoma or thyroid storm occurred.


Table 1 Baseline clinicopathological characteristics of the study population.

3.2 Clinicopathological characteristics of thyroid cancer in patients with Graves’ disease

Table 2 shows the clinicopathological characteristics of the 43 patients diagnosed with thyroid cancer. Forty-two (97.7%) patients were diagnosed with papillary thyroid cancer (PTC), while one (2.3%) patient had minimally invasive Hürthle cell carcinoma. Thirty-four (79.1%) patients were preoperatively diagnosed with PTC or Hürthle cell neoplasm, while cancers were discovered incidentally in 9 (20.9%) patients. Ten (23.3%) patients underwent lobectomy, 32 (74.4%) patients underwent total or near-total thyroidectomy, and one (2.3%) patient underwent total thyroidectomy with mRND. The most prevalent subtype of PTC was the classic type, accounting for 81.0% of PTC cases. Follicular, tall cell, and oncocytic variants comprised 7.1%, 4.8%, and 7.1% of PTC cases, respectively. The average tumor size was 0.9 cm (range, 0.1–3.4 cm). Multifocality was observed in 19 (44.2%) patients, and bilaterality was observed in 11 (25.6%) patients. Lymphatic invasion, vascular invasion, and perineural invasion were observed in 12 (27.9%), 1 (2.3%), and 2 (4.7%) patients, respectively.


Table 2 Clinicopathological characteristics of thyroid cancer in Graves’ disease.

As shown in Table 3 , the 34 patients who were preoperatively diagnosed with cancers were compared with the 9 patients with incidentally discovered cancers after surgery. No differences in BMI were detected between the two groups (23.3 ± 3.7 vs. 24.4 ± 4.1; p = 0.450). Gland weight was significantly lighter in patients with preoperatively diagnosed cancers compared with gland weights in the incidentally discovered group (35.3 ± 40.1 vs. 119.2 ± 62.9; p < 0.001). TR-Ab levels were significantly lower in the preoperatively diagnosed group compared with the levels in the incidentally discovered group (5.5 ± 5.3 vs. 31.4 ± 28.9; p = 0.005). Tumor size was significantly larger in the preoperatively diagnosed group compared with the size in the incidentally discovered group (1.0 ± 0.7 vs. 0.4 ± 0.2, p = 0.001). The causes of surgery were also significantly different between the two groups ( p < 0.001). In the incidentally discovered cancer group, 66.7% of the patients underwent surgery due to refractory disease or medication complications, 22.2% due to large goiters, and 11.1% due to nodules detected on preoperative ultrasound. In contrast, all surgeries were performed due to the preoperative detection of cancer in the group with preoperative diagnosis.


Table 3 Comparison of thyroid cancers in Graves’ disease with or without preoperative pathologic diagnosis.

3.3 Comparison of Graves’ disease subgroups with or without thyroid cancer

Patients with GD with or without thyroid cancer were compared, as shown in Table 4 . Patients with GD and thyroid cancer were significantly more likely to be overweight (BMI ≥ 25 kg/m2) than patients with GD without thyroid cancer (44.2% vs. 20.3%; p = 0.005). The duration of GD was longer in patients without cancer than the duration in patients with cancer (6.9 ± 7.1 vs. 3.9 ± 4.0 years; p = 0.003). Gland weights were significantly heavier in patients without cancer compared with patients with cancer (134.7 ± 88.9 vs. 52.9 ± 56.6 g; p < 0.001). Preoperative TR-Ab was significantly higher in patients without cancer compared with TR-Ab levels in patients with cancer (34.9 ± 40.1 vs. 10.9 ± 17.2 IU/L; p < 0.001).


Table 4 Comparison between sub-groups of Graves’ disease with or without thyroid cancer.

3.4 Univariate and multivariate analyses of biomarkers for malignancy in patients with Graves’ disease

Univariate analysis revealed that being overweight, the duration of GD, gland weight, and preoperative TR-Ab were significant biomarkers for malignancy in patients with GD ( Table 5 ). In the multivariate analysis, being overweight, lighter gland weight, and lower preoperative TR-Ab levels were confirmed as biomarkers for malignancy. Being overweight emerged as the most significant biomarker for malignancy (OR, 3.108; 95% CI, 1.196–8.831; p = 0.021).


Table 5 Univariate and multivariate analyses of biomarkers for malignancy in patients with Graves’ disease.

4 Discussion

The present study aimed to investigate the biomarkers for concomitant thyroid cancer in patients with GD and identify the effects of being overweight on cancer risk. Patients with GD and concomitant thyroid cancer were more likely to be overweight compared to patients with GD without cancer. In addition, overweight patients had a significantly increased risk of developing thyroid cancer compared to non-overweight patients.

In GD, TR-Ab stimulates the TSH receptor, leading to increased production and release of thyroid hormones. Excessive thyroid hormone affects entire body tissues, including thermogenesis and metabolic rate. GD symptoms vary by hyperthyroidism severity and duration ( 1 , 2 , 31 ).

The reported incidence of concomitant thyroid cancer in GD ranges from 1% to 22% ( 8 – 11 , 38 , 39 ). Since this study included GD patients who met the surgical indications, the cohort demonstrated a higher prevalence of thyroid cancer compared to the general GD population. The frequency of cancer in patients with GD is consistent with the frequency in the general population. All types of thyroid cancer can occur in GD patients; PTC is the most common cancer, followed by follicular thyroid cancer (FTC) ( 8 , 40 ). While surgery is not the primary treatment for GD, surgical intervention may be performed in cases that meet specific surgical indications ( 1 , 2 ). According to the 2016 ATA guidelines for hyperthyroidism, near-total or total thyroidectomy is recommended for surgical intervention of GD ( 1 ). However, 11 patients underwent lobectomy in our study; these patients maintained a euthyroid state with preoperatively detected nodules, and the decision to perform lobectomy was made based on the individual preferences of the patients and the multidisciplinary medical team. GD did not recur in any of the 11 patients who underwent lobectomies.

Numerous studies have demonstrated that thyroid cancer is more aggressive in obese and overweight patients, irrespective of coexisting GD ( 26 – 28 , 41 , 42 ). In a case-control study, Marcello et al. showed that being overweight (BMI ≥ 25 kg/m²) is associated with an increased risk of thyroid cancer (OR, 3.787; 95% CI, 1.110–6.814; p < 0.001) ( 27 ). GD is a hypermetabolic disease that usually causes weight loss, and obesity is uncommon in patients with GD ( 31 ). Weight gain is a useful indicator of initial treatment success for hyperthyroidism, but weight loss should be interpreted differently in obese patients. Hoogwerf et al. reported that despite greater weight loss at the initial diagnosis of GD, obese patients remained morbidly obese and had higher thyroid function values than non-obese patients ( 43 ). The diagnosis of hyperthyroidism may be delayed in these patients because weight loss is often perceived as a positive outcome. Our results agree with these earlier studies: the OR of 3.108 is similar to the OR of 3.787 reported by Marcello et al. ( 27 ).
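The overweight cutoff used above (BMI ≥ 25 kg/m²) is simple arithmetic on weight and height. A minimal sketch, with made-up patient measurements:

```python
def bmi(weight_kg, height_m):
    """Body mass index = weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

def is_overweight(weight_kg, height_m, cutoff=25.0):
    """Overweight per the BMI >= 25 kg/m^2 threshold cited in the text."""
    return bmi(weight_kg, height_m) >= cutoff

# Hypothetical patient: 75 kg at 1.70 m -> BMI ~ 26.0, classified overweight
overweight = is_overweight(75, 1.70)
```

Note that some guidelines apply lower cutoffs for Asian populations ( 35 ); the threshold is a parameter here for that reason.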

The mean tumor size in this study was 0.9 cm, similar to previous reports of thyroid cancer in patients with GD. In a study by Hales et al., the average size of thyroid cancer in patients with GD was 0.91 cm, significantly smaller than in the euthyroid group (0.91 vs. 2.33 cm) ( 44 ). However, previous studies demonstrated a more aggressive thyroid cancer phenotype in patients with GD ( 9 , 45 ). In addition, Marongiu et al. found a higher degree of aggressiveness in some patients with microcarcinoma and GD compared to controls, even when tumor characteristics were favorable, which conflicts with other studies ( 45 ). The presence of both thyroid cancer and GD is a surgical indication regardless of the size of the cancer; thus, microcarcinoma in GD should not be overlooked.

Lower preoperative TR-Ab levels were a biomarker for malignancy in patients with GD in this study. TR-Ab, which promotes hyperthyroidism by inducing the production and release of thyroid hormones, is a diagnostic biomarker for GD ( 13 , 20 ). Several studies have explored the link between TR-Ab and concurrent thyroid cancer in patients with GD and suggested that TR-Ab can trigger thyroid cancer by continuously stimulating thyroid cells ( 20 , 46 ). However, other studies did not detect an association between TR-Ab and concomitant thyroid cancer in patients with GD, which is consistent with our findings ( 16 , 40 , 47 ). Yano et al. demonstrated that elevated TR-Ab was significantly associated with smaller tumor size in patients with GD and had no significant impact on multifocality or lymph node metastasis ( 40 ). Similarly, Kim et al. concluded that the behavior of thyroid cancer is not affected by TR-Ab ( 16 ). We attribute these results to the fact that patients with GD and cancer often undergo surgery because nodules are detected in disease that has been relatively well controlled with medication for a long time, whereas in the GD without cancer group, surgery is often performed for hyperthyroidism that remains uncontrolled despite medication, so TR-Ab levels may be higher. Future research should examine the association between TR-Ab levels and thyroid cancer risk in larger studies to clarify these contradictory findings.

Lighter gland weight was a biomarker for concomitant thyroid cancer; however, measuring gland weight before surgery is not feasible in clinical practice. Nonetheless, ultrasound can estimate thyroid volume preoperatively using the ellipsoidal formula: Volume = (π/6) × Length × Width × Depth. The overall thyroid volume is obtained by summing the volume estimates for both lobes ( 48 ). Future studies will focus on applying this method clinically and investigating the link between preoperative thyroid dimensions and the prevalence of concomitant thyroid cancer.
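The ellipsoidal estimate above can be sketched directly. The lobe dimensions below are invented for illustration, not taken from the study:

```python
import math

def lobe_volume_ml(length_cm, width_cm, depth_cm):
    """Ellipsoidal formula from the text: V = (pi/6) * L * W * D.
    With dimensions in cm, the result is in cm^3 (~mL)."""
    return math.pi / 6 * length_cm * width_cm * depth_cm

def thyroid_volume_ml(right_lobe, left_lobe):
    """Total thyroid volume as the sum of both lobe estimates ( 48 )."""
    return lobe_volume_ml(*right_lobe) + lobe_volume_ml(*left_lobe)

# Hypothetical ultrasound measurements (cm): two 4 x 2 x 2 lobes
total = thyroid_volume_ml((4, 2, 2), (4, 2, 2))  # ~16.8 mL
```

Converting such a volume estimate into an expected gland weight would additionally require an assumed tissue density, which the formula itself does not supply.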

This study’s strengths include long follow-up duration with more than 100 patients, providing robust results. Additionally, the study included various demographic and clinical factors, providing a comprehensive evaluation of thyroid cancer biomarkers in patients with GD. Of note, this study focused on the effect of being overweight in patients with GD, rather than the general population. However, the relationship between GD, thyroid cancer, and overweight is complex and may involve a variety of factors, including genetics, hormonal imbalances, and lifestyle factors.

This study has several limitations. First, its retrospective design and relatively small sample size may have introduced selection and information bias. Second, the study was conducted in the Korean population, limiting generalizability to other populations. Lastly, BRAF and TERT assessments were conducted in a limited cohort, insufficient to represent the entire study population, and there is a paucity of data on the molecular characteristics and genetic information for thyroid cancer. Further research should investigate the effects of being overweight on thyroid cancer risk in a diverse population of patients with GD to determine whether the results are generalizable. In addition, more investigations into the long-term postoperative outcomes of patients with GD with and without concomitant thyroid cancer may provide a more comprehensive evaluation of surgical outcomes.

5 Conclusions

Overweight individuals with GD have a higher risk of developing concomitant thyroid cancer. This highlights the importance of thorough screening and comprehensive evaluations specifically tailored to overweight GD patients to detect and prevent thyroid cancer. Further research is needed to elucidate the underlying mechanisms and the effects of being overweight on thyroid cancer risk in GD patients in the general population.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by the Institutional Review Board of Seoul St. Mary’s Hospital, The Catholic University of Korea (IRB No: KC23RISI0054; date of approval: 2023.04.21). The studies were conducted in accordance with the local legislation and institutional requirements. The ethics committee/institutional review board waived the requirement of written informed consent for participation from the participants or the participants’ legal guardians/next of kin because of the retrospective nature of this study.

Author contributions

JP: Writing – review & editing, Writing – original draft, Visualization, Validation, Investigation, Formal analysis, Data curation, Conceptualization. SA: Writing – review & editing, Software, Data curation. JB: Writing – review & editing, Supervision, Software, Resources, Methodology. JK: Writing – review & editing, Supervision, Resources, Methodology. KK: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fendo.2024.1382124/full#supplementary-material

1. Ross DS, Burch HB, Cooper DS, Greenlee MC, Maia AL, Rivkees SA, et al. 2016 American Thyroid Association guidelines for diagnosis and management of hyperthyroidism and other causes of thyrotoxicosis. Thyroid. (2016). doi: 10.1089/thy.2016.0229

2. Smith TJ, Hegedüs L. Graves’ disease. New Engl J Med . (2016) 375:1552–65. doi: 10.1056/NEJMra1510030

3. Pearce EN, Farwell AP, Braverman LE. Thyroiditis. New Engl J Med . (2003) 348:2646–55. doi: 10.1056/NEJMra021194

4. Davies TF, Andersen S, Latif R, Nagayama Y, Barbesino G, Brito M, et al. Graves’ disease. Nat Rev Dis primers . (2020) 6:52. doi: 10.1038/s41572-020-0184-y

5. Bahn RS. Graves' ophthalmopathy. New Engl J Med . (2010) 362:726–38. doi: 10.1056/NEJMra0905750

6. Burch HB, Cooper DS. Management of Graves disease: a review. JAMA. (2015) 314:2544–54. doi: 10.1001/jama.2015.16535

7. Ginsberg J. Diagnosis and management of Graves' disease. CMAJ. (2003) 168:575–85.

8. Wahl RA, Goretzki P, Meybier H, Nitschke J, Linder M, Röher H-D. Coexistence of hyperthyroidism and thyroid cancer. World J Surgery . (1982) 6:385–9. doi: 10.1007/BF01657662

9. Belfiore A, Garofalo MR, Giuffrida D, Runello F, Filetti S, Fiumara A, et al. Increased aggressiveness of thyroid cancer in patients with Graves' disease. J Clin Endocrinol Metab . (1990) 70:830–5. doi: 10.1210/jcem-70-4-830

10. Kraimps J, Bouin-Pineau M, Mathonnet M, De Calan L, Ronceray J, Visset J, et al. Multicentre study of thyroid nodules in patients with Graves' disease. Br J Surg. (2000) 87:1111–3. doi: 10.1046/j.1365-2168.2000.01504.x

11. Pacini F, Elisei R, Di Coscio G, Anelli S, Macchia E, Concetti R, et al. Thyroid carcinoma in thyrotoxic patients treated by surgery. J endocrinological Invest . (1988) 11:107–12. doi: 10.1007/BF03350115

12. Durante C, Grani G, Lamartina L, Filetti S, Mandel SJ, Cooper DS. The diagnosis and management of thyroid nodules: a review. JAMA. (2018) 319:914–24. doi: 10.1001/jama.2018.0898

13. Belfiore A, Russo D, Vigneri R, Filetti S. Graves' disease, thyroid nodules and thyroid cancer. Clin endocrinology . (2001) 55:711–8. doi: 10.1046/j.1365-2265.2001.01415.x

14. Arslan H, Unal O, Algün E, Harman M, Sakarya ME. Power Doppler sonography in the diagnosis of Graves’ disease. Eur J Ultrasound . (2000) 11:117–22. doi: 10.1016/S0929-8266(99)00079-8

15. Vitti P, Rago T, Mazzeo S, Brogioni S, Lampis M, De Liperi A, et al. Thyroid blood flow evaluation by color-flow Doppler sonography distinguishes Graves’ disease from Hashimoto’s thyroiditis. J endocrinological Invest . (1995) 18:857–61. doi: 10.1007/BF03349833

16. Kim WB, Han SM, Kim TY, Nam-Goong IS, Gong G, Lee HK, et al. Ultrasonographic screening for detection of thyroid cancer in patients with Graves’ disease. Clin endocrinology . (2004) 60:719–25. doi: 10.1111/j.1365-2265.2004.02043.x

17. Phitayakorn R, McHenry CR. Incidental thyroid carcinoma in patients with Graves’ disease. Am J surgery . (2008) 195:292–7. doi: 10.1016/j.amjsurg.2007.12.006

18. Dănilă R, Karakas E, Osei-Agyemang T, Hassan I. Outcome of incidental thyroid carcinoma in patients undergoing surgery for Graves' disease. Rev Medico-chirurgicala Societatii Medici si Naturalisti din Iasi . (2008) 112:115–8.

19. Jia Q, Li X, Liu Y, Li L, Kwong JS, Ren K, et al. Incidental thyroid carcinoma in surgery-treated hyperthyroid patients with Graves’ disease: a systematic review and meta-analysis of cohort studies. Cancer Manage Res . (2018) 10:1201–7. doi: 10.2147/CMAR

20. Filetti S, Belfiore A, Amir SM, Daniels GH, Ippolito O, Vigneri R, et al. The role of thyroid-stimulating antibodies of Graves' disease in differentiated thyroid cancer. New Engl J Med . (1988) 318:753–9. doi: 10.1056/NEJM198803243181206

21. Potter E, Horn R, Scheumann G, Dralle H, Costagliola S, Ludgate M, et al. Western blot analysis of thyrotropin receptor expression in human thyroid tumors and correlation with TSH binding. Biochem Biophys Res Commun . (1994) 205:361–7. doi: 10.1006/bbrc.1994.2673

22. Papanastasiou A, Sapalidis K, Goulis DG, Michalopoulos N, Mareti E, Mantalovas S, et al. Thyroid nodules as a risk factor for thyroid cancer in patients with Graves’ disease: A systematic review and meta-analysis of observational studies in surgically treated patients. Clin Endocrinology . (2019) 91:571–7. doi: 10.1111/cen.14069

23. Behar R, Arganini M, Wu T-C, McCormick M, Straus F 2nd, DeGroot L, et al. Graves' disease and thyroid cancer. Surgery . (1986) 100:1121–7.

24. Ren M, Wu MC, Shang CZ, Wang XY, Zhang JL, Cheng H, et al. Predictive factors of thyroid cancer in patients with Graves’ disease. World J surgery . (2014) 38:80–7. doi: 10.1007/s00268-013-2287-z

25. Franchini F, Palatucci G, Colao A, Ungaro P, Macchia PE, Nettore IC. Obesity and thyroid cancer risk: an update. Int J Environ Res Public Health . (2022) 19:1116. doi: 10.3390/ijerph19031116

26. Kaliszewski K, Diakowska D, Rzeszutko M, Rudnicki J. Obesity and overweight are associated with minimal extrathyroidal extension, multifocality and bilaterality of papillary thyroid cancer. J Clin Med . (2021) 10:970. doi: 10.3390/jcm10050970

27. Marcello MA, Sampaio AC, Geloneze B, Vasques ACJ, Assumpção LVM, Ward LS. Obesity and excess protein and carbohydrate consumption are risk factors for thyroid cancer. Nutr cancer . (2012) 64:1190–5. doi: 10.1080/01635581.2012.721154

28. Matrone A, Ferrari F, Santini F, Elisei R. Obesity as a risk factor for thyroid cancer. Curr Opin Endocrinology Diabetes Obes . (2020) 27:358–63. doi: 10.1097/MED.0000000000000556

29. Xu L, Port M, Landi S, Gemignani F, Cipollini M, Elisei R, et al. Obesity and the risk of papillary thyroid cancer: a pooled analysis of three case–control studies. Thyroid . (2014) 24:966–74. doi: 10.1089/thy.2013.0566

30. Sun H, Tong H, Shen X, Gao H, Kuang J, Chen X, et al. Outcomes of surgical treatment for graves’ Disease: A single-center experience of 216 cases. J Clin Med . (2023) 12:1308. doi: 10.3390/jcm12041308

31. Brent GA. Graves' disease. New Engl J Med . (2008) 358:2594–605. doi: 10.1056/NEJMcp0801880

32. James PT. Obesity: the worldwide epidemic. Clinics Dermatol . (2004) 22:276–80. doi: 10.1016/j.clindermatol.2004.01.010

33. World Health Organization. Follow-up to the political declaration of the high-level meeting of the general assembly on the prevention and control of non-communicable diseases. Sixty-sixth World Health Assembly Agenda item. (2013) 13:43–4.

34. Deitel M. Overweight and obesity worldwide now estimated to involve 1.7 billion people. Obes surgery . (2003) 13:329–30. doi: 10.1381/096089203765887598

35. WHO Expert Consultation. Appropriate body-mass index for Asian populations and its implications for policy and intervention strategies. Lancet Lond Engl. (2004) 363:157–63. doi: 10.1016/S0140-6736(03)15268-3

36. Fan J-G, Kim S-U, Wong VW-S. New trends on obesity and NAFLD in Asia. J hepatology . (2017) 67:862–73. doi: 10.1016/j.jhep.2017.06.003

37. Haugen BR, Alexander EK, Bible KC, Doherty GM, Mandel SJ, Nikiforov YE, et al. 2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association guidelines task force on thyroid nodules and differentiated thyroid cancer. Thyroid . (2016) 26:1–133. doi: 10.1089/thy.2015.0020

38. Erbil Y, Barbaros U, Özbey N, Kapran Y, Tükenmez M, Bozbora A, et al. Graves' disease, with and without nodules, and the risk of thyroid carcinoma. J Laryngology Otology . (2008) 122:291–5. doi: 10.1017/S0022215107000448

39. Cantalamessa L, Baldini M, Orsatti A, Meroni L, Amodei V, Castagnone D. Thyroid nodules in Graves disease and the risk of thyroid carcinoma. Arch Internal Med . (1999) 159:1705–8. doi: 10.1001/archinte.159.15.1705

40. Yano Y, Shibuya H, Kitagawa W, Nagahama M, Sugino K, Ito K, et al. Recent outcome of Graves’ disease patients with papillary thyroid cancer. Eur J endocrinology . (2007) 157:325–9. doi: 10.1530/EJE-07-0136

41. Pappa T, Alevizaki M. Obesity and thyroid cancer: a clinical update. Thyroid . (2014) 24:190–9. doi: 10.1089/thy.2013.0232

42. Kitahara CM, Platz EA, Freeman LEB, Hsing AW, Linet MS, Park Y, et al. Obesity and thyroid cancer risk among US men and women: a pooled analysis of five prospective studies. Cancer epidemiology Biomarkers Prev . (2011) 20:464–72. doi: 10.1158/1055-9965.EPI-10-1220

43. Hoogwerf BJ, Nuttall FQ. Long-term weight regulation in treated hyperthyroid and hypothyroid subjects. Am J Med . (1984) 76:963–70. doi: 10.1016/0002-9343(84)90842-8

44. Hales I, McElduff A, Crummer P, Clifton-Bligh P, Delbridge L, Hoschl R, et al. Does Graves' disease or thyrotoxicosis affect the prognosis of thyroid cancer. J Clin Endocrinol Metab . (1992) 75:886–9. doi: 10.1210/jcem.75.3.1517381

45. Marongiu A, Nuvoli S, De Vito A, Rondini M, Spanu A, Madeddu G. A comparative follow-up study of patients with papillary thyroid carcinoma associated or not with graves’ Disease. Diagnostics . (2022) 12:2801. doi: 10.3390/diagnostics12112801

46. Katz S, Garcia A, Niepomniszcze H. Development of Graves' disease nine years after total thyroidectomy due to follicular carcinoma of the thyroid. Thyroid . (1997) 7:909–11. doi: 10.1089/thy.1997.7.909

47. Tanaka K, Inoue H, Miki H, Masuda E, Kitaichi M, Komaki K, et al. Relationship between prognostic score and thyrotropin receptor (TSH-R) in papillary thyroid carcinoma: immunohistochemical detection of TSH-R. Br J cancer . (1997) 76:594–9. doi: 10.1038/bjc.1997.431

48. Viduetsky A, Herrejon CL. Sonographic evaluation of thyroid size: a review of important measurement parameters. J Diagn Med Sonography . (2019) 35:206–10. doi: 10.1177/8756479318824290

Keywords: Graves’ disease, thyroid cancer, overweight, thyroid stimulating hormone receptor antibodies, BMI - body mass index

Citation: Park J, An S, Bae JS, Kim JS and Kim K (2024) Overweight as a biomarker for concomitant thyroid cancer in patients with Graves’ disease. Front. Endocrinol. 15:1382124. doi: 10.3389/fendo.2024.1382124

Received: 05 February 2024; Accepted: 03 April 2024; Published: 22 April 2024.

Copyright © 2024 Park, An, Bae, Kim and Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Kwangsoon Kim, [email protected]
