Data Analysis in Research: Types & Methods


Content Index

  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis

What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large body of data into smaller, coherent fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. The second is data reduction through summarization and categorization, which helps find patterns and themes in the data for easy identification and linking. The third is data analysis itself, which researchers perform in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that data analysis and data interpretation together represent the application of deductive and inductive logic to the research data.

Researchers rely heavily on data, as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but the answer to that question. But what if there is no question to ask? It is still possible to explore data without a problem; we call this 'Data Mining', and it often reveals interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience's vision guide them to find the patterns that shape the story they want to tell. One essential thing expected of researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, data analysis sometimes tells the most unforeseen yet exciting stories, ones that were not expected when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Every kind of data describes things by assigning specific values to them. For analysis, you need to organize these values and process and present them in a given context to make them useful. Data can take different forms; here are the primary data types.

  • Qualitative data: When the data presented consists of words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be categorized, grouped, measured, calculated, or ranked. Example: age, rank, cost, length, weight, scores, and the like all come under this type of data. You can present such data in graphical formats and charts or apply statistical analysis methods to it. The Outcomes Measurement Systems (OMS) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: This is data presented in groups; an item included in categorical data cannot belong to more than one group. Example: a person responding to a survey by describing their living style, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data.
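As a quick illustration of the chi-square test mentioned for categorical data, here is a minimal sketch in Python; the 2x2 table of survey counts is entirely hypothetical.

```python
# Hypothetical survey counts: smokers vs. non-smokers across two
# marital-status groups (all numbers are made up for illustration).
observed = [[30, 20],   # married:   smokers, non-smokers
            [10, 40]]   # unmarried: smokers, non-smokers

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected counts assume the two variables are independent.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (obs - expected) ** 2 / expected

print(round(chi_sq, 2))  # 16.67
```

A statistic this large, compared against the chi-square critical value for the table's degrees of freedom, would suggest the two categorical variables are related.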


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from numerical data, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Deriving insight from such complicated information is a challenging process; hence, qualitative data is typically used for exploratory research and data analysis.

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual: researchers usually read the available data and look for repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find  “food”  and  “hunger” are the most commonly used words and will highlight them for further analysis.
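A word-frequency pass like the one described above can be sketched in a few lines of Python; the survey responses and the stopword list here are made up for illustration.

```python
from collections import Counter

# Hypothetical open-ended survey responses (made-up text).
responses = [
    "Food prices keep rising and hunger is widespread",
    "Access to clean water and food remains the main problem",
    "Hunger and lack of food affect children the most",
]

# Tokenize, lowercase, and count word occurrences across all responses,
# skipping a tiny illustrative stopword list.
words = " ".join(responses).lower().split()
stopwords = {"the", "and", "is", "to", "of", "a"}
counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common(2))  # [('food', 3), ('hunger', 2)]
```

The highest-frequency words ("food", "hunger") are exactly the candidates a researcher would flag for further analysis.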


The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example , researchers conducting research and data analysis for studying the concept of ‘diabetes’ amongst respondents might analyze the context of when and how the respondent has used or referred to the word ‘diabetes.’

The scrutiny-based technique is also one of the highly recommended text analysis methods used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique, used to determine how one piece of text is similar to or different from another.

For example: to study the "importance of a resident doctor in a company," the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method for analyzing polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable Partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations from the enormous data.


There are several techniques to analyze the data in qualitative research, but here are some commonly used methods,

  • Content Analysis: It is widely accepted and the most frequently employed technique for data analysis in research methodology. It can be used to analyze documented information in the form of text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources, such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions shared by people are examined to find answers to the research questions.
  • Discourse Analysis:  Similar to narrative analysis, discourse analysis is used to analyze the interactions with people. Nevertheless, this particular method considers the social context under which or within which the communication between the researcher and respondent takes place. In addition to that, discourse analysis also focuses on the lifestyle and day-to-day environment while deriving any conclusion.
  • Grounded Theory:  When you want to explain why a particular phenomenon happened, then using grounded theory for analyzing quality data is the best resort. Grounded theory is applied to study data about the host of similar cases occurring in different settings. When researchers are using this method, they might alter explanations or produce new ones until they arrive at some conclusion.


Data analysis in quantitative research

The first stage in quantitative research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to check whether the collected data sample meets the pre-set standards or is a biased sample. It is divided into four stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent answered all the questions in an online survey or, in interviewer-led research, that the interviewer asked all the questions devised in the questionnaire.
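The completeness stage above can be sketched as a simple check in Python; the field names and survey records here are hypothetical.

```python
# Minimal sketch of a completeness check (one of the four validation
# stages): a response passes only if every required field is present
# and non-empty. Field names and records are made up for illustration.
required_fields = ["age", "gender", "rating"]

responses = [
    {"age": 34, "gender": "F", "rating": 4},
    {"age": 28, "gender": "M"},                 # missing "rating"
    {"age": None, "gender": "F", "rating": 5},  # blank answer
]

def is_complete(record):
    """Return True if every required field is present and non-empty."""
    return all(record.get(f) is not None for f in required_fields)

complete = [r for r in responses if is_complete(r)]
print(len(complete))  # 1 of 3 responses passes the completeness check
```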

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in fields incorrectly or skip them accidentally. Data editing is a process wherein the researchers confirm that the provided data is free of such errors. They need to conduct basic checks and outlier checks to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to the survey responses. For example, if a survey has a sample size of 1,000, the researcher might create age brackets to distinguish the respondents by age. It then becomes easier to analyze small data buckets than to deal with the massive data pile.


After the data is prepared for analysis, researchers can use different research and data analysis methods to derive meaningful insights. Statistical analysis plans are certainly the most favored for analyzing numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. The methods fall into two groups: 'Descriptive Statistics', used to describe data, and 'Inferential Statistics', which helps in comparing data.

Descriptive statistics

This method is used to describe the basic features of various types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. However, descriptive analysis does not go beyond the analyzed data to draw conclusions; any conclusions are based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to summarize a distribution by its central points.
  • Researchers use this method when they want to showcase the most commonly or averagely indicated response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range equals the difference between the high and low points.
  • Variance and standard deviation measure how far observed scores differ from the mean.
  • These measures are used to identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is and how strongly that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores, helping researchers identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.
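All of the descriptive measures listed above can be computed with Python's standard statistics module; the test scores below are made-up sample data.

```python
import statistics

# Made-up test scores run through the descriptive measures listed above.
scores = [55, 60, 60, 70, 75, 80, 90]

# Measures of central tendency
mean = statistics.mean(scores)      # 70
median = statistics.median(scores)  # 70
mode = statistics.mode(scores)      # 60

# Measures of dispersion
spread = max(scores) - min(scores)  # range: 35
stdev = statistics.pstdev(scores)   # population standard deviation

# Measure of position: quartile cut points (the "inclusive" method is
# one of several quartile conventions; others give slightly different values)
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(mean, median, mode, spread, (q1, q2, q3))
```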

In quantitative research, descriptive analysis often yields absolute numbers, but those numbers alone are rarely sufficient to explain the rationale behind them. It is nevertheless necessary to choose the research and data analysis method best suited to your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students' average scores in schools. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided sample without generalizing it: for example, when you want to compare the average votes cast in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample collected to represent that population. For example, you can ask around 100 audience members at a movie theater whether they like the movie they are watching. Researchers then use inferential statistics on the collected sample to reason that about 80-90% of people like the movie.
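The movie-theater example can be sketched as a normal-approximation confidence interval for the population proportion; the sample numbers below are hypothetical.

```python
import math

# Estimate the share of all moviegoers who like the film from a
# hypothetical sample of 100 respondents, 85 of whom said yes.
sample_size = 100
liked = 85

p_hat = liked / sample_size  # sample proportion: 0.85

# 95% normal-approximation confidence interval for the proportion
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
low, high = p_hat - margin, p_hat + margin

print(f"{low:.2f} to {high:.2f}")  # 0.78 to 0.92
```

The interval, not the point estimate alone, is what lets the researcher generalize from the sample to the whole audience.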

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and demonstrates something about the population parameter.
  • Hypothesis test: It's about sampling research data to answer the survey research questions. For example, researchers might be interested in understanding whether a newly launched shade of lipstick is good or not, or whether multivitamin capsules help children perform better at games.

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. It is often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but are interested in understanding the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables,  cross-tabulation  is used to analyze the relationship between multiple variables.  Suppose provided data has age and gender categories presented in rows and columns. A two-dimensional cross-tabulation helps for seamless data analysis and research by showing the number of males and females in each age category.
  • Regression analysis: For understanding the strong relationship between two variables, researchers do not look beyond the primary and commonly used regression analysis method, which is also a type of predictive analysis used. In this method, you have an essential factor called the dependent variable. You also have multiple independent variables in regression analysis. You undertake efforts to find out the impact of independent variables on the dependent variable. The values of both independent and dependent variables are assumed as being ascertained in an error-free random manner.
  • Frequency tables: The statistical procedure used to summarize how often each value or category of a variable occurs in the data. Frequency tables make it easy to spot the most and least common responses before applying further statistical tests.
  • Analysis of variance: The statistical procedure used for testing the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation means the research findings were significant. In many contexts, ANOVA testing and variance analysis are similar.
  • Researchers must have the necessary research skills to analyze and manipulate the data, and they should be trained to demonstrate a high standard of research practice. Ideally, researchers must possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Usually, research and data analytics projects differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps design a survey questionnaire, select data collection methods , and choose samples.
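Returning to the cross-tabulation method above: a two-dimensional contingency table over gender and age categories can be built with a plain counter; the respondent records here are made up.

```python
from collections import Counter

# Hypothetical respondent records: (gender, age bracket) pairs.
respondents = [
    ("F", "18-24"), ("M", "18-24"), ("F", "25-34"),
    ("F", "25-34"), ("M", "25-34"), ("M", "35-49"),
]

# Build the contingency table: a count per (gender, age bracket) cell.
table = Counter(respondents)

brackets = ["18-24", "25-34", "35-49"]
for gender in ["F", "M"]:
    row = [table[(gender, b)] for b in brackets]
    print(gender, row)
# F [1, 2, 0]
# M [1, 1, 1]
```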
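For the regression analysis method above, a minimal least-squares sketch with one independent variable looks like this; the x and y values are purely illustrative.

```python
# Simple linear regression by hand: one independent variable x
# predicting a dependent variable y (made-up data).
x = [1, 2, 3, 4, 5]   # e.g. advertising spend (hypothetical units)
y = [2, 4, 5, 4, 5]   # e.g. sales (hypothetical units)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Slope = covariance(x, y) / variance(x); intercept from the means.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
```

Here y is the dependent variable and x the independent one, matching the roles described in the bullet above.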
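And for analysis of variance, the one-way ANOVA F-statistic can be computed by hand as a sanity check; the three groups of scores are hypothetical.

```python
# One-way ANOVA F-statistic for three hypothetical treatment groups.
groups = [
    [80, 85, 90],   # group A
    [70, 75, 80],   # group B
    [60, 65, 70],   # group C
]

all_scores = [s for g in groups for s in g]
grand_mean = sum(all_scores) / len(all_scores)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((s - sum(g) / len(g)) ** 2 for g in groups for s in g)

df_between = len(groups) - 1               # 2
df_within = len(all_scores) - len(groups)  # 6

# F = mean square between / mean square within
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 2))  # 12.0
```

A large F relative to the critical value for (2, 6) degrees of freedom indicates the group means differ significantly.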


  • The primary aim of research and data analysis is to derive ultimate insights that are unbiased. Any mistake in, or bias while, collecting data, selecting an analysis method, or choosing an audience sample is likely to produce a biased inference.
  • No amount of sophistication in research and data analysis can rectify poorly defined objectives or outcome measurements. Whether the design is at fault or the intentions are unclear, a lack of clarity might mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find a way to deal with everyday challenges like outliers, missing data, data alteration, data mining, or developing a graphical representation.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018 alone, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in this hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them with a medium to collect data by creating appealing surveys.


Your Modern Business Guide To Data Analysis Methods And Techniques


Table of Contents

1) What Is Data Analysis?

2) Why Is Data Analysis Important?

3) What Is The Data Analysis Process?

4) Types Of Data Analysis Methods

5) Top Data Analysis Techniques To Apply

6) Quality Criteria For Data Analysis

7) Data Analysis Limitations & Barriers

8) Data Analysis Skills

9) Data Analysis In The Big Data Environment

In our data-rich age, understanding how to analyze and extract true meaning from our business’s digital insights is one of the primary drivers of success.

Despite the colossal volume of data we create every day, a mere 0.5% is actually analyzed and used for data discovery , improvement, and intelligence. While that may not seem like much, considering the amount of digital information we have at our fingertips, half a percent still accounts for a vast amount of data.

With so much data and so little time, knowing how to collect, curate, organize, and make sense of all of this potentially business-boosting information can be a minefield – but online data analysis is the solution.

In science, data analysis uses a more complex approach with advanced techniques to explore and experiment with data. On the other hand, in a business context, data is used to make data-driven decisions that will enable the company to improve its overall performance. In this post, we will cover the analysis of data from an organizational point of view while still going through the scientific and statistical foundations that are fundamental to understanding the basics of data analysis. 

To put all of that into perspective, we will answer a host of important analytical questions and explore analytical methods and techniques, while demonstrating how to perform analysis in the real world with a 17-step blueprint for success.

What Is Data Analysis?

Data analysis is the process of collecting, modeling, and analyzing data using various statistical and logical methods and techniques. Businesses rely on analytics processes and tools to extract insights that support strategic and operational decision-making.

All these various methods are largely based on two core areas: quantitative and qualitative research.


Gaining a better understanding of different techniques and methods in quantitative research as well as qualitative insights will give your analyzing efforts a more clearly defined direction, so it’s worth taking the time to allow this particular knowledge to sink in. Additionally, you will be able to create a comprehensive analytical report that will skyrocket your analysis.

Apart from the qualitative and quantitative categories, there are also other types of data that you should be aware of before diving into complex data analysis processes. These categories include:

  • Big data: Refers to massive data sets that need to be analyzed using advanced software to reveal patterns and trends. It is considered to be one of the best analytical assets as it provides larger volumes of data at a faster rate. 
  • Metadata: Putting it simply, metadata is data that provides insights about other data. It summarizes key information about specific data that makes it easier to find and reuse for later purposes. 
  • Real time data: As its name suggests, real time data is presented as soon as it is acquired. From an organizational perspective, this is the most valuable data as it can help you make important decisions based on the latest developments. Our guide on real time analytics will tell you more about the topic. 
  • Machine data: This is more complex data that is generated solely by machines, such as phones, computers, websites, and embedded systems, without prior human interaction.

Why Is Data Analysis Important?

Before we go into detail about the categories of analysis along with its methods and techniques, you must understand the potential that analyzing data can bring to your organization.

  • Informed decision-making : From a management perspective, you can benefit from analyzing your data as it helps you make decisions based on facts and not simple intuition. For instance, you can understand where to invest your capital, detect growth opportunities, predict your income, or tackle uncommon situations before they become problems. Through this, you can extract relevant insights from all areas in your organization, and with the help of dashboard software , present the data in a professional and interactive way to different stakeholders.
  • Reduce costs : Another great benefit is to reduce costs. With the help of advanced technologies such as predictive analytics, businesses can spot improvement opportunities, trends, and patterns in their data and plan their strategies accordingly. In time, this will help you save money and resources on implementing the wrong strategies. And not just that, by predicting different scenarios such as sales and demand you can also anticipate production and supply. 
  • Target customers better : Customers are arguably the most crucial element in any business. By using analytics to get a 360° vision of all aspects related to your customers, you can understand which channels they use to communicate with you, their demographics, interests, habits, purchasing behaviors, and more. In the long run, it will drive success to your marketing strategies, allow you to identify new potential customers, and avoid wasting resources on targeting the wrong people or sending the wrong message. You can also track customer satisfaction by analyzing your client’s reviews or your customer service department’s performance.

What Is The Data Analysis Process?


When we talk about analyzing data, there is a sequence to follow to extract the needed conclusions. The analysis process consists of 5 key stages. We will cover each of them in more detail later in the post, but to provide the context needed to understand what is coming next, here is a rundown of the 5 essential steps of data analysis.

  • Identify: Before you get your hands dirty with data, you first need to identify why you need it in the first place. The identification is the stage in which you establish the questions you will need to answer. For example, what is the customer's perception of our brand? Or what type of packaging is more engaging to our potential customers? Once the questions are outlined you are ready for the next step. 
  • Collect: As its name suggests, this is the stage where you start collecting the needed data. Here, you define which sources of data you will use and how you will use them. The collection of data can come in different forms such as internal or external sources, surveys, interviews, questionnaires, and focus groups, among others.  An important note here is that the way you collect the data will be different in a quantitative and qualitative scenario. 
  • Clean: Once you have the necessary data, it is time to clean it and get it ready for analysis. Not all the data you collect will be useful; when collecting large amounts of data in different formats, you will very likely find yourself with duplicate or badly formatted records. To avoid this, before you start working with your data, make sure to remove any white spaces, duplicate records, or formatting errors. This way you avoid hurting your analysis with bad-quality data.
  • Analyze : With the help of various techniques such as statistical analysis, regressions, neural networks, text analysis, and more, you can start analyzing and manipulating your data to extract relevant conclusions. At this stage, you find trends, correlations, variations, and patterns that can help you answer the questions you first thought of in the identify stage. Various technologies in the market assist researchers and average users with the management of their data. Some of them include business intelligence and visualization software, predictive analytics, and data mining, among others. 
  • Interpret: Last but not least you have one of the most important steps: it is time to interpret your results. This stage is where the researcher comes up with courses of action based on the findings. For example, here you would understand if your clients prefer packaging that is red or green, plastic or paper, etc. Additionally, at this stage, you can also find some limitations and work on them. 
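The "Clean" stage above can be sketched as a small deduplication and normalization pass; the records are hypothetical.

```python
# Sketch of the Clean stage: strip whitespace, normalize case, and
# drop duplicate records before analysis (records are made up).
raw = [
    {"email": " alice@example.com ", "plan": "Pro"},
    {"email": "bob@example.com", "plan": "Free"},
    {"email": "ALICE@example.com", "plan": "Pro"},  # duplicate of row 1
]

seen = set()
cleaned = []
for record in raw:
    email = record["email"].strip().lower()  # fix whitespace and casing
    if email not in seen:                    # drop duplicate records
        seen.add(email)
        cleaned.append({"email": email, "plan": record["plan"]})

print(len(cleaned))  # 2 unique records remain
```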

Now that you have a basic understanding of the key data analysis steps, let’s look at the top 17 essential methods.

17 Essential Types Of Data Analysis Methods

Before diving into the 17 essential types of methods, it is important that we quickly go over the main analysis categories. Starting with descriptive and moving up to prescriptive analysis, the complexity and effort of data evaluation increase, but so does the added value for the company.

a) Descriptive analysis - What happened.

The descriptive analysis method is the starting point for any analytic reflection, and it aims to answer the question of what happened? It does this by ordering, manipulating, and interpreting raw data from various sources to turn it into valuable insights for your organization.

Performing descriptive analysis is essential, as it enables us to present our insights in a meaningful way. Although it is relevant to mention that this analysis on its own will not allow you to predict future outcomes or tell you the answer to questions like why something happened, it will leave your data organized and ready to conduct further investigations.

b) Exploratory analysis - How to explore data relationships.

As its name suggests, the main aim of exploratory analysis is to explore. Prior to it, there is still no notion of the relationship between the data and the variables. Once the data is investigated, exploratory analysis helps you find connections and generate hypotheses and solutions for specific problems. A typical area of application for it is data mining.

c) Diagnostic analysis - Why it happened.

Diagnostic data analytics empowers analysts and executives by helping them gain a firm contextual understanding of why something happened. If you know why something happened as well as how it happened, you will be able to pinpoint the exact ways of tackling the issue or challenge.

Designed to provide direct and actionable answers to specific questions, this is one of the world’s most important methods in research, alongside other key organizational applications such as retail analytics.

d) Predictive analysis - What will happen.

The predictive method allows you to look into the future to answer the question: what will happen? In order to do this, it uses the results of the previously mentioned descriptive, exploratory, and diagnostic analyses, in addition to machine learning (ML) and artificial intelligence (AI). Through this, you can uncover future trends, potential problems or inefficiencies, connections, and causalities in your data.

With predictive analysis, you can unfold and develop initiatives that will not only enhance your various operational processes but also help you gain an all-important edge over the competition. If you understand why a trend, pattern, or event happened through data, you will be able to develop an informed projection of how things may unfold in particular areas of the business.

e) Prescriptive analysis - How will it happen.

Another of the most effective types of analysis methods in research, prescriptive data techniques go a step beyond predictive analysis in that they revolve around using patterns or trends to develop responsive, practical business strategies.

By drilling down into prescriptive analysis, you will play an active role in the data consumption process by taking well-arranged sets of visual data and using it as a powerful fix to emerging issues in a number of key areas, including marketing, sales, customer experience, HR, fulfillment, finance, logistics analytics , and others.

Top 17 data analysis methods

As mentioned at the beginning of the post, data analysis methods can be divided into two big categories: quantitative and qualitative. Each of these categories holds a powerful analytical value that changes depending on the scenario and type of data you are working with. Below, we will discuss 17 methods that are divided into qualitative and quantitative approaches. 

Without further ado, here are the 17 essential types of data analysis methods with some use cases in the business world: 

A. Quantitative Methods 

To put it simply, quantitative analysis refers to all methods that use numerical data or data that can be turned into numbers (e.g. category variables like gender, age, etc.) to extract valuable insights. It is used to extract valuable conclusions about relationships, differences, and test hypotheses. Below we discuss some of the key quantitative methods. 

1. Cluster analysis

The action of grouping a set of data elements in a way that said elements are more similar (in a particular sense) to each other than to those in other groups – hence the term ‘cluster.’ Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.

Let's look at it from an organizational perspective. In a perfect world, marketers would be able to analyze each customer separately and give them the best-personalized service, but let's face it, with a large customer base, it is practically impossible to do that. That's where clustering comes in. By grouping customers into clusters based on demographics, purchasing behaviors, monetary value, or any other factor that might be relevant for your company, you will be able to immediately optimize your efforts and give your customers the best experience based on their needs.
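As a minimal sketch of the idea, here is a tiny one-dimensional k-means clusterer in pure Python that groups customers by annual spend. The spend figures and cluster count are invented for illustration; a real project would typically use a library such as scikit-learn on multi-dimensional data.

```python
def kmeans_1d(values, k, iterations=10):
    """Cluster 1-D values into k groups by iteratively refining centroids."""
    svals = sorted(values)
    # Initialize centroids evenly across the sorted range (deterministic).
    centroids = [svals[i * (len(svals) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical annual spend per customer: low, mid, and high segments.
spend = [120, 130, 110, 900, 950, 880, 460, 480]
centroids, clusters = kmeans_1d(spend, k=3)
print(sorted(round(c) for c in centroids))  # three segment centers emerge
```

Each resulting cluster can then be treated as a marketing segment with its own messaging and offers.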

2. Cohort analysis

This type of data analysis approach uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others with similar characteristics. By using this methodology, it's possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.

Cohort analysis can be really useful for performing analysis in marketing as it will allow you to understand the impact of your campaigns on specific groups of customers. To exemplify, imagine you send an email campaign encouraging customers to sign up for your site. For this, you create two versions of the campaign with different designs, CTAs, and ad content. Later on, you can use cohort analysis to track the performance of the campaign for a longer period of time and understand which type of content is driving your customers to sign up, repurchase, or engage in other ways.  

A useful tool to start performing cohort analysis is Google Analytics. You can learn more about the benefits and limitations of using cohorts in GA in this useful guide. In the image below, you see an example of how you can visualize a cohort in this tool. The segments (device traffic) are divided into date cohorts (usage of devices) and then analyzed week by week to extract insights into performance.

Cohort analysis chart example from google analytics
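To make the mechanics concrete, here is a hedged sketch of a cohort retention calculation in pure Python: users are grouped by signup month, and we measure what fraction of each cohort was still active one month later. All user records below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical user records: signup month plus the months they were active.
users = [
    {"id": 1, "signup": "2023-01", "active_months": ["2023-01", "2023-02"]},
    {"id": 2, "signup": "2023-01", "active_months": ["2023-01"]},
    {"id": 3, "signup": "2023-02", "active_months": ["2023-02", "2023-03"]},
    {"id": 4, "signup": "2023-02", "active_months": ["2023-02", "2023-03"]},
]

def month_after(month):
    """Return the month string following 'YYYY-MM'."""
    year, m = map(int, month.split("-"))
    year, m = (year + 1, 1) if m == 12 else (year, m + 1)
    return f"{year:04d}-{m:02d}"

# Group users into cohorts by signup month.
cohorts = defaultdict(list)
for user in users:
    cohorts[user["signup"]].append(user)

# For each cohort, compute the share still active the following month.
retention = {}
for month, members in cohorts.items():
    retained = sum(month_after(month) in u["active_months"] for u in members)
    retention[month] = retained / len(members)

print(retention)  # fraction of each cohort active one month after signup
```

Comparing these retention rates across cohorts (for example, cohorts exposed to different campaign versions) is exactly the kind of insight the method is used for.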

3. Regression analysis

Regression uses historical data to understand how a dependent variable's value is affected when one (linear regression) or more independent variables (multiple regression) change or stay the same. By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better decisions in the future.

Let's bring it down with an example. Imagine you did a regression analysis of your sales in 2019 and discovered that variables like product quality, store design, customer service, marketing campaigns, and sales channels affected the overall result. Now you want to use regression to analyze which of these variables changed or if any new ones appeared during 2020. For example, you couldn’t sell as much in your physical store due to COVID lockdowns. Therefore, your sales could’ve either dropped in general or increased in your online channels. Through this, you can understand which independent variables affected the overall performance of your dependent variable, annual sales.
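A minimal sketch of simple linear regression (one independent variable, fitted by ordinary least squares in pure Python) shows the mechanics; the marketing-spend and sales figures are invented toy data, not real results.

```python
def linear_regression(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept

marketing_spend = [10, 20, 30, 40, 50]  # e.g. in thousands, hypothetical
monthly_sales = [25, 45, 65, 85, 105]   # perfectly linear toy data

slope, intercept = linear_regression(marketing_spend, monthly_sales)
print(slope, intercept)        # 2.0 5.0

# Anticipate sales at a spend level outside the observed range.
forecast = slope * 60 + intercept
print(forecast)                # 125.0
```

Multiple regression extends the same idea to several independent variables at once, which is what you would need for the store-design, service, and channel factors in the example above.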

If you want to go deeper into this type of analysis, check out this article and learn more about how you can benefit from regression.

4. Neural networks

The neural network forms the basis for the intelligent algorithms of machine learning. It is a form of analytics that attempts, with minimal intervention, to understand how the human brain would generate insights and predict values. Neural networks learn from each and every data transaction, meaning that they evolve and advance over time.

A typical area of application for neural networks is predictive analytics. There are BI reporting tools that have this feature implemented within them, such as the Predictive Analytics Tool from datapine. This tool enables users to quickly and easily generate all kinds of predictions. All you have to do is select the data to be processed based on your KPIs, and the software automatically calculates forecasts based on historical and current data. Thanks to its user-friendly interface, anyone in your organization can manage it; there’s no need to be an advanced scientist. 

Here is an example of how you can use the predictive analysis tool from datapine:

Example on how to use predictive analytics tool from datapine


5. Factor analysis

Factor analysis, also called “dimension reduction,” is a type of data analysis used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The aim here is to uncover independent latent variables, an ideal method for streamlining specific segments.

A good way to understand this data analysis method is a customer evaluation of a product. The initial assessment is based on different variables like color, shape, wearability, current trends, materials, comfort, the place where they bought the product, and frequency of usage. The list can be endless, depending on what you want to track. In this case, factor analysis comes into the picture by summarizing all of these variables into homogenous groups, for example, by grouping the variables color, materials, quality, and trends into a broader latent variable of design.

If you want to start analyzing data using factor analysis we recommend you take a look at this practical guide from UCLA.

6. Data mining

A method of data analysis that is the umbrella term for engineering metrics and insights for additional value, direction, and context. By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, patterns, and trends to generate advanced knowledge.  When considering how to analyze data, adopting a data mining mindset is essential to success - as such, it’s an area that is worth exploring in greater detail.

An excellent use case of data mining is datapine intelligent data alerts . With the help of artificial intelligence and machine learning, they provide automated signals based on particular commands or occurrences within a dataset. For example, if you’re monitoring supply chain KPIs , you could set an intelligent alarm to trigger when invalid or low-quality data appears. By doing so, you will be able to drill down deep into the issue and fix it swiftly and effectively.

In the following picture, you can see how the intelligent alarms from datapine work. By setting up ranges on daily orders, sessions, and revenues, the alarms will notify you if the goal was not completed or if it exceeded expectations.

Example on how to use intelligent alerts from datapine

7. Time series analysis

As its name suggests, time series analysis is used to analyze a set of data points collected over a specified period of time. Although analysts use this method to monitor the data points in a specific interval of time rather than just monitoring them intermittently, time series analysis is not used merely to collect data over time. Instead, it allows researchers to understand whether variables changed during the study, how the different variables depend on one another, and how they reached the end result.

In a business context, this method is used to understand the causes of different trends and patterns to extract valuable insights. Another way of using this method is with the help of time series forecasting. Powered by predictive technologies, businesses can analyze various data sets over a period of time and forecast different future events. 

A great use case to put time series analysis into perspective is seasonality effects on sales. By using time series forecasting to analyze sales data of a specific product over time, you can understand if sales rise over a specific period of time (e.g. swimwear during summertime, or candy during Halloween). These insights allow you to predict demand and prepare production accordingly.  
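The seasonality example above can be sketched with a simple trailing moving average, one of the most basic time series techniques: it smooths month-to-month noise so the seasonal peak stands out. The monthly swimwear sales figures are hypothetical.

```python
def moving_average(series, window=3):
    """Smooth a series with a simple trailing moving average."""
    return [round(sum(series[i - window + 1:i + 1]) / window, 1)
            for i in range(window - 1, len(series))]

# Hypothetical monthly swimwear sales, Jan..Dec: a clear summer peak.
sales = [20, 22, 25, 40, 70, 110, 120, 100, 60, 30, 22, 20]

smoothed = moving_average(sales)
print(smoothed)       # noise is dampened; the seasonal shape remains
print(max(smoothed))  # the smoothed peak lands in the summer months
```

Real forecasting builds on this idea with models that explicitly separate trend, seasonality, and noise, but the smoothing step already reveals when demand rises so production can be prepared accordingly.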

8. Decision Trees 

The decision tree analysis aims to act as a support tool to make smart and strategic decisions. By visually displaying potential outcomes, consequences, and costs in a tree-like model, researchers and company users can easily evaluate all factors involved and choose the best course of action. Decision trees are helpful to analyze quantitative data and they allow for an improved decision-making process by helping you spot improvement opportunities, reduce costs, and enhance operational efficiency and production.

But how does a decision tree actually work? This method works like a flowchart that starts with the main decision that you need to make and branches out based on the different outcomes and consequences of each decision. Each outcome will outline its own consequences, costs, and gains and, at the end of the analysis, you can compare each of them and make the smartest decision.

Businesses can use them to understand which project is more cost-effective and will bring more earnings in the long run. For example, imagine you need to decide if you want to update your software app or build a new app entirely.  Here you would compare the total costs, the time needed to be invested, potential revenue, and any other factor that might affect your decision.  In the end, you would be able to see which of these two options is more realistic and attainable for your company or research.
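The app decision described above can be sketched as a tiny expected-value comparison, which is the arithmetic at the leaves of a decision tree. All costs, probabilities, and revenues below are hypothetical placeholders.

```python
# Each branch: an upfront cost plus probability-weighted revenue outcomes.
options = {
    "update_app": {"cost": 50_000,
                   "outcomes": [(0.7, 120_000), (0.3, 60_000)]},
    "build_new":  {"cost": 150_000,
                   "outcomes": [(0.5, 300_000), (0.5, 100_000)]},
}

def expected_net(option):
    """Expected revenue across outcomes minus the upfront cost."""
    expected_revenue = sum(p * revenue for p, revenue in option["outcomes"])
    return expected_revenue - option["cost"]

results = {name: expected_net(opt) for name, opt in options.items()}
best = max(results, key=results.get)
print(results, best)
```

Comparing the expected net value of each branch is what lets you see which option is more realistic and attainable before committing resources.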

9. Conjoint analysis 

Last but not least, we have the conjoint analysis. This approach is usually used in surveys to understand how individuals value different attributes of a product or service and it is one of the most effective methods to extract consumer preferences. When it comes to purchasing, some clients might be more price-focused, others more features-focused, and others might have a sustainable focus. Whatever your customer's preferences are, you can find them with conjoint analysis. Through this, companies can define pricing strategies, packaging options, subscription packages, and more. 

A great example of conjoint analysis is in marketing and sales. For instance, a cupcake brand might use conjoint analysis and find that its clients prefer gluten-free options and cupcakes with healthier toppings over super sugary ones. Thus, the cupcake brand can turn these insights into advertisements and promotions to increase sales of this particular type of product. And not just that, conjoint analysis can also help businesses segment their customers based on their interests. This allows them to send different messaging that will bring value to each of the segments. 
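A heavily simplified sketch of the cupcake example: estimate how much each attribute level contributes to a rating by averaging the ratings of the profiles that contain it. This main-effects shortcut is only illustrative; real conjoint studies use regression-based part-worth estimation on carefully designed profile sets, and all profiles and ratings below are invented.

```python
# Hypothetical rated product profiles combining two attributes.
profiles = [
    {"topping": "sugary",  "flour": "regular",     "rating": 4},
    {"topping": "healthy", "flour": "regular",     "rating": 7},
    {"topping": "sugary",  "flour": "gluten-free", "rating": 6},
    {"topping": "healthy", "flour": "gluten-free", "rating": 9},
]

def part_worth(attribute):
    """Average rating per level of one attribute across all profiles."""
    levels = {}
    for p in profiles:
        levels.setdefault(p[attribute], []).append(p["rating"])
    return {level: sum(r) / len(r) for level, r in levels.items()}

print(part_worth("topping"))  # healthy toppings score higher on average
print(part_worth("flour"))    # gluten-free outperforms regular flour
```

The level averages mirror the article's finding: gluten-free and healthier options are preferred, which can then steer promotions and segmentation.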

10. Correspondence Analysis

Also known as reciprocal averaging, correspondence analysis is a method used to analyze the relationship between categorical variables presented within a contingency table. A contingency table is a table that displays two (simple correspondence analysis) or more (multiple correspondence analysis) categorical variables across rows and columns that show the distribution of the data, which is usually answers to a survey or questionnaire on a specific topic. 

This method starts by calculating an “expected value,” which is done by multiplying the total of the row by the total of the column and dividing the result by the grand total of the table. The “expected value” is then subtracted from the observed value, resulting in a “residual,” which is what allows you to extract conclusions about relationships and distribution. The results of this analysis are later displayed using a map that represents the relationships between the different values. The closer two values are on the map, the stronger the relationship. Let’s put it into perspective with an example.

Imagine you are carrying out a market research analysis about outdoor clothing brands and how they are perceived by the public. For this analysis, you ask a group of people to match each brand with a certain attribute which can be durability, innovation, quality materials, etc. When calculating the residual numbers, you can see that brand A has a positive residual for innovation but a negative one for durability. This means that brand A is not positioned as a durable brand in the market, something that competitors could take advantage of. 
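The expected-value and residual step can be sketched directly: expected = (row total × column total) / grand total, and residual = observed − expected. The brand/attribute counts below are invented to match the outdoor-clothing example.

```python
# Contingency table: rows are brands, columns are perceived attributes.
observed = {
    "brand_a": {"durability": 10, "innovation": 40},
    "brand_b": {"durability": 35, "innovation": 15},
}
attributes = ["durability", "innovation"]

row_totals = {b: sum(row.values()) for b, row in observed.items()}
col_totals = {a: sum(row[a] for row in observed.values()) for a in attributes}
grand_total = sum(row_totals.values())

# Residual = observed count minus the count expected under independence.
residuals = {
    brand: {
        attr: observed[brand][attr]
              - row_totals[brand] * col_totals[attr] / grand_total
        for attr in attributes
    }
    for brand in observed
}
print(residuals)
```

Here brand A's residual is positive for innovation and negative for durability, reproducing the interpretation in the example: it is perceived as innovative but not durable.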

11. Multidimensional Scaling (MDS)

MDS is a method used to observe the similarities or disparities between objects, which can be colors, brands, people, geographical coordinates, and more. The objects are plotted using an “MDS map” that positions similar objects together and disparate ones far apart. The (dis)similarities between objects are represented using one or more dimensions that can be observed using a numerical scale. For example, if you want to know how people feel about the COVID-19 vaccine, you can use 1 for “don’t believe in the vaccine at all,” 10 for “firmly believe in the vaccine,” and values from 2 to 9 for responses in between. When analyzing an MDS map, the only thing that matters is the distance between the objects; the orientation of the dimensions is arbitrary and has no meaning at all.

Multidimensional scaling is a valuable technique for market research, especially when it comes to evaluating product or brand positioning. For instance, if a cupcake brand wants to know how they are positioned compared to competitors, it can define 2-3 dimensions such as taste, ingredients, shopping experience, or more, and do a multidimensional scaling analysis to find improvement opportunities as well as areas in which competitors are currently leading. 

Another business example is in procurement when deciding on different suppliers. Decision makers can generate an MDS map to see how the different prices, delivery times, technical services, and more of the different suppliers differ and pick the one that suits their needs the best. 

A final example is proposed by a research paper on "An Improved Study of Multilevel Semantic Network Visualization for Analyzing Sentiment Word of Movie Review Data". The researchers picked a two-dimensional MDS map to display the distances and relationships between different sentiments in movie reviews. They used 36 sentiment words and distributed them based on their emotional distance, as we can see in the image below, where the words "outraged" and "sweet" are on opposite sides of the map, marking the distance between the two emotions very clearly.

Example of multidimensional scaling analysis

Aside from being a valuable technique to analyze dissimilarities, MDS also serves as a dimension-reduction technique for large dimensional data. 

B. Qualitative Methods

Qualitative data analysis methods are defined as the analysis of non-numerical data gathered and produced through techniques such as interviews, focus groups, questionnaires, and more. As opposed to quantitative methods, qualitative data is more subjective and highly valuable in analyzing customer retention and product development.

12. Text analysis

Text analysis, also known in the industry as text mining, works by taking large sets of textual data and arranging them in a way that makes it easier to manage. By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your organization and use it to develop actionable insights that will propel you forward.

Modern software accelerates the application of text analytics. Thanks to the combination of machine learning and intelligent algorithms, you can perform advanced analytical processes such as sentiment analysis. This technique allows you to understand the intentions and emotions of a text, for example, whether it's positive, negative, or neutral, and then give it a score depending on certain factors and categories that are relevant to your brand. Sentiment analysis is often used to monitor brand and product reputation and to understand how successful your customer experience is. To learn more about the topic, check out this insightful article.

By analyzing data from various word-based sources, including product reviews, articles, social media communications, and survey responses, you will gain invaluable insights into your audience, as well as their needs, preferences, and pain points. This will allow you to create campaigns, services, and communications that meet your prospects’ needs on a personal level, growing your audience while boosting customer retention. There are various other “sub-methods” that are an extension of text analysis. Each of them serves a more specific purpose and we will look at them in detail next. 
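A minimal lexicon-based sketch shows the core of the sentiment scoring described above: count positive and negative words and classify the text. The tiny word lists are placeholders, not a production lexicon, and real tools use machine learning rather than plain lookups.

```python
# Hypothetical mini-lexicons; real systems use large curated word lists.
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "slow"}

def sentiment(text):
    """Return ('positive'|'negative'|'neutral', score) for a text."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, score

print(sentiment("Great product I love it"))    # ('positive', 2)
print(sentiment("Terrible and slow support"))  # ('negative', -2)
```

Applied across thousands of reviews or social media posts, even a crude score like this surfaces shifts in brand perception worth investigating further.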

13. Content Analysis

This is a straightforward and very popular method that examines the presence and frequency of certain words, concepts, and subjects in different content formats such as text, image, audio, or video. For example, the number of times the name of a celebrity is mentioned on social media or online tabloids. It does this by coding text data that is later categorized and tabulated in a way that can provide valuable insights, making it the perfect mix of quantitative and qualitative analysis.

There are two types of content analysis. The first one is the conceptual analysis which focuses on explicit data, for instance, the number of times a concept or word is mentioned in a piece of content. The second one is relational analysis, which focuses on the relationship between different concepts or words and how they are connected within a specific context. 

Content analysis is often used by marketers to measure brand reputation and customer behavior. For example, by analyzing customer reviews. It can also be used to analyze customer interviews and find directions for new product development. It is also important to note, that in order to extract the maximum potential out of this analysis method, it is necessary to have a clearly defined research question. 
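The conceptual variant above boils down to counting coded concepts across content, which can be sketched in a few lines. The reviews and concept list are illustrative only.

```python
from collections import Counter
import re

# Hypothetical customer reviews and the concepts we have chosen to code.
reviews = [
    "The battery life is great but the battery drains fast on video",
    "Screen quality is great, battery could be better",
]
concepts = ["battery", "screen", "great"]

counts = Counter()
for review in reviews:
    # Tokenize to lowercase words, stripping punctuation.
    words = re.findall(r"[a-z]+", review.lower())
    for concept in concepts:
        counts[concept] += words.count(concept)

print(dict(counts))  # frequency of each coded concept across all reviews
```

Relational analysis would go one step further and track which concepts co-occur within the same review, rather than just how often each appears.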

14. Thematic Analysis

Very similar to content analysis, thematic analysis also helps in identifying and interpreting patterns in qualitative data, with the main difference being that content analysis can also be applied to quantitative data. The thematic method analyzes large pieces of text data, such as focus group transcripts or interviews, and groups them into themes or categories that come up frequently within the text. It is a great method when trying to figure out people's views and opinions about a certain topic. For example, if you are a brand that cares about sustainability, you can do a survey of your customers to analyze their views and opinions about sustainability and how they apply it to their lives. You can also analyze customer service call transcripts to find common issues and improve your service.

Thematic analysis is a very subjective technique that relies on the researcher’s judgment. Therefore,  to avoid biases, it has 6 steps that include familiarization, coding, generating themes, reviewing themes, defining and naming themes, and writing up. It is also important to note that, because it is a flexible approach, the data can be interpreted in multiple ways and it can be hard to select what data is more important to emphasize. 

15. Narrative Analysis 

A bit more complex in nature than the two previous ones, narrative analysis is used to explore the meaning behind the stories that people tell and most importantly, how they tell them. By looking into the words that people use to describe a situation you can extract valuable conclusions about their perspective on a specific topic. Common sources for narrative data include autobiographies, family stories, opinion pieces, and testimonials, among others. 

From a business perspective, narrative analysis can be useful to analyze customer behaviors and feelings towards a specific product, service, feature, or others. It provides unique and deep insights that can be extremely valuable. However, it has some drawbacks.  

The biggest weakness of this method is that the sample sizes are usually very small due to the complexity and time-consuming nature of the collection of narrative data. Plus, the way a subject tells a story will be significantly influenced by his or her specific experiences, making it very hard to replicate in a subsequent study. 

16. Discourse Analysis

Discourse analysis is used to understand the meaning behind any type of written, verbal, or symbolic discourse based on its political, social, or cultural context. It mixes the analysis of languages and situations together. This means that the way the content is constructed and the meaning behind it is significantly influenced by the culture and society it takes place in. For example, if you are analyzing political speeches you need to consider different context elements such as the politician's background, the current political context of the country, the audience to which the speech is directed, and so on. 

From a business point of view, discourse analysis is a great market research tool. It allows marketers to understand how the norms and ideas of the specific market work and how their customers relate to those ideas. It can be very useful to build a brand mission or develop a unique tone of voice. 

17. Grounded Theory Analysis

Traditionally, researchers decide on a method and hypothesis and start to collect the data to prove that hypothesis. Grounded theory, by contrast, doesn't require an initial research question or hypothesis, as its value lies in the generation of new theories. With the grounded theory method, you can go into the analysis process with an open mind and explore the data to generate new theories through tests and revisions. In fact, it is not necessary to finish collecting the data before starting to analyze it; researchers usually begin to find valuable insights as they gather the data.

All of these elements make grounded theory a very valuable method as theories are fully backed by data instead of initial assumptions. It is a great technique to analyze poorly researched topics or find the causes behind specific company outcomes. For example, product managers and marketers might use the grounded theory to find the causes of high levels of customer churn and look into customer surveys and reviews to develop new theories about the causes. 

How To Analyze Data? Top 17 Data Analysis Techniques To Apply

17 top data analysis techniques by datapine

Now that we’ve answered the questions “what is data analysis’”, why is it important, and covered the different data analysis types, it’s time to dig deeper into how to perform your analysis by working through these 17 essential techniques.

1. Collaborate your needs

Before you begin analyzing or drilling down into any techniques, it’s crucial to sit down collaboratively with all key stakeholders within your organization, decide on your primary campaign or strategic goals, and gain a fundamental understanding of the types of insights that will best benefit your progress or provide you with the level of vision you need to evolve your organization.

2. Establish your questions

Once you’ve outlined your core objectives, you should consider which questions will need answering to help you achieve your mission. This is one of the most important techniques as it will shape the very foundations of your success.

To help you ask the right things and ensure your data works for you, you have to ask the right data analysis questions .

3. Data democratization

After giving your data analytics methodology some real direction, and knowing which questions need answering to extract optimum value from the information available to your organization, you should continue with democratization.

Data democratization is an action that aims to connect data from various sources efficiently and quickly so that anyone in your organization can access it at any given moment. You can extract data in text, images, videos, numbers, or any other format, and then perform cross-database analysis to achieve more advanced insights to share with the rest of the company interactively.

Once you have decided on your most valuable sources, you need to take all of this into a structured format to start collecting your insights. For this purpose, datapine offers an easy all-in-one data connectors feature to integrate all your internal and external sources and manage them at your will. Additionally, datapine’s end-to-end solution automatically updates your data, allowing you to save time and focus on performing the right analysis to grow your company.

data connectors from datapine

4. Think of governance 

When collecting data in a business or research context you always need to think about security and privacy. With data breaches becoming a topic of concern for businesses, the need to protect your client's or subject’s sensitive information becomes critical. 

To ensure that all this is taken care of, you need to think of a data governance strategy. According to Gartner, this concept refers to “the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics.” In simpler words, data governance is a collection of processes, roles, and policies that ensure the efficient use of data while still achieving the main company goals. It ensures that clear roles are in place for who can access the information and how they can access it. In time, this not only ensures that sensitive information is protected but also allows for an efficient analysis as a whole.

5. Clean your data

After harvesting from so many sources you will be left with a vast amount of information that can be overwhelming to deal with. At the same time, you can be faced with incorrect data that can be misleading to your analysis. The smartest thing you can do to avoid dealing with this in the future is to clean the data. This is fundamental before visualizing it, as it will ensure that the insights you extract from it are correct.

There are many things that you need to look for in the cleaning process. The most important one is to eliminate any duplicate observations; this usually appears when using multiple internal and external sources of information. You can also add any missing codes, fix empty fields, and eliminate incorrectly formatted data.

Another usual form of cleaning is done with text data. As we mentioned earlier, most companies today analyze customer reviews, social media comments, questionnaires, and several other text inputs. In order for algorithms to detect patterns, text data needs to be revised to avoid invalid characters or any syntax or spelling errors. 

Most importantly, the aim of cleaning is to prevent you from arriving at false conclusions that can damage your company in the long run. By using clean data, you will also help BI solutions to interact better with your information and create better reports for your organization.
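The cleaning steps above (dropping duplicate observations, filling empty fields, fixing inconsistent formatting) can be sketched in a few lines of pure Python. The customer records and the default value are made up for illustration.

```python
# Hypothetical raw records pulled from multiple sources.
raw = [
    {"email": "Ana@Mail.com ", "country": "de"},
    {"email": "ana@mail.com",  "country": "DE"},  # duplicate after cleanup
    {"email": "bo@mail.com",   "country": ""},    # missing country field
]

def clean(records, default_country="unknown"):
    """Normalize formatting, fill empty fields, and drop duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        email = rec["email"].strip().lower()              # fix formatting
        country = rec["country"].strip().lower() or default_country
        if email in seen:                                 # drop duplicates
            continue
        seen.add(email)
        cleaned.append({"email": email, "country": country})
    return cleaned

print(clean(raw))  # two unique, consistently formatted records remain
```

Note that only after normalization do the first two records reveal themselves as duplicates, which is why formatting fixes should come before deduplication.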

6. Set your KPIs

Once you’ve set your sources, cleaned your data, and established clear-cut questions you want your insights to answer, you need to set a host of key performance indicators (KPIs) that will help you track, measure, and shape your progress in a number of key areas.

KPIs are critical to both qualitative and quantitative analysis research. This is one of the primary methods of data analysis you certainly shouldn’t overlook.

To help you set the best possible KPIs for your initiatives and activities, here is an example of a relevant logistics KPI : transportation-related costs. If you want to see more go explore our collection of key performance indicator examples .

Transportation costs logistics KPIs

7. Omit useless data

Having bestowed your data analysis tools and techniques with true purpose and defined your mission, you should explore the raw data you’ve collected from all sources and use your KPIs as a reference for chopping out any information you deem to be useless.

Trimming the informational fat is one of the most crucial methods of analysis as it will allow you to focus your analytical efforts and squeeze every drop of value from the remaining ‘lean’ information.

Any stats, facts, figures, or metrics that don’t align with your business goals or fit with your KPI management strategies should be eliminated from the equation.

8. Build a data management roadmap

While, at this point, this particular step is optional (you will have already gained a wealth of insight and formed a fairly sound strategy by now), creating a data management roadmap will help your data analysis methods and techniques become successful on a more sustainable basis. These roadmaps, if developed properly, are also built so they can be tweaked and scaled over time.

Invest ample time in developing a roadmap that will help you store, manage, and handle your data internally, and you will make your analysis techniques all the more fluid and functional – one of the most powerful types of data analysis methods available today.

9. Integrate technology

There are many ways to analyze data, but one of the most vital aspects of analytical success in a business context is integrating the right decision support software and technology.

Robust analysis platforms will not only allow you to pull critical data from your most valuable sources while working with dynamic KPIs that will offer you actionable insights; it will also present them in a digestible, visual, interactive format from one central, live dashboard . A data methodology you can count on.

By integrating the right technology within your data analysis methodology, you’ll avoid fragmenting your insights, saving you time and effort while allowing you to enjoy the maximum value from your business’s most valuable insights.

For a look at the power of software for the purpose of analysis and to enhance your methods of analyzing, glance over our selection of dashboard examples .

10. Answer your questions

By considering each of the above efforts, working with the right technology, and fostering a cohesive internal culture where everyone buys into the different ways to analyze data as well as the power of digital intelligence, you will swiftly start to answer your most burning business questions. Arguably, the best way to make your data concepts accessible across the organization is through data visualization.

11. Visualize your data

Online data visualization is a powerful tool as it lets you tell a story with your metrics, allowing users across the organization to extract meaningful insights that aid business evolution – and it covers all the different ways to analyze data.

The purpose of analyzing is to make your entire organization more informed and intelligent, and with the right platform or dashboard, this is simpler than you think, as demonstrated by our marketing dashboard .

An executive dashboard example showcasing high-level marketing KPIs such as cost per lead, MQL, SQL, and cost per customer.

This visual, dynamic, and interactive online dashboard is a data analysis example designed to give Chief Marketing Officers (CMO) an overview of relevant metrics to help them understand if they achieved their monthly goals.

In detail, this example, generated with a modern dashboard creator, displays interactive charts for monthly revenues, costs, net income, and net income per customer; all of them are compared with the previous month so that you can understand how the data fluctuated. In addition, it shows a detailed summary of the number of users, customers, SQLs, and MQLs per month to visualize the whole picture and extract relevant insights or trends for your marketing reports.

The CMO dashboard is perfect for c-level management as it can help them monitor the strategic outcome of their marketing efforts and make data-driven decisions that can benefit the company exponentially.

12. Be careful with the interpretation

We already dedicated an entire post to data interpretation, as it is a fundamental part of the process of data analysis. It gives meaning to the analytical information and aims to derive concise conclusions from the analysis results. Since companies are most often dealing with data from many different sources, the interpretation stage needs to be done carefully and properly in order to avoid misinterpretations.

To help you through the process, here we list three common practices that you need to avoid at all costs when looking at your data:

  • Correlation vs. causation: The human brain is wired to find patterns. This instinct leads to one of the most common mistakes in interpretation: confusing correlation with causation. Although these two aspects can exist simultaneously, it is not correct to assume that because two things happened together, one provoked the other. A piece of advice to avoid falling into this mistake: never trust intuition alone, trust the data. If there is no objective evidence of causation, then always stick to correlation.
  • Confirmation bias: This phenomenon describes the tendency to select and interpret only the data necessary to prove one hypothesis, often ignoring the elements that might disprove it. Even if it's not done on purpose, confirmation bias can represent a real problem, as excluding relevant information can lead to false conclusions and, therefore, bad business decisions. To avoid it, always try to disprove your hypothesis instead of proving it, share your analysis with other team members, and avoid drawing any conclusions before the entire analytical project is finalized.
  • Statistical significance: In short, statistical significance helps analysts understand whether a result is actually meaningful or whether it arose from a sampling error or pure chance. The level of statistical significance needed might depend on the sample size and the industry being analyzed. In any case, ignoring the significance of a result when it might influence decision-making can be a huge mistake.
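These pitfalls are easy to demonstrate in a few lines of code. The sketch below (plain Python, with invented monthly figures) computes Pearson's r for two strongly correlated series and then estimates significance with a simple permutation test. Note that even a near-perfect r and a tiny p-value say nothing about causation: both series could be driven by a third factor such as seasonality.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(xs, ys, trials=10_000, seed=42):
    """Share of random shuffles that show a correlation at least as
    strong as the observed one -- a rough significance estimate."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / trials

# Invented monthly ice cream and sunscreen sales: highly correlated,
# yet neither causes the other (both follow the weather).
ice_cream = [20, 25, 40, 60, 80, 95, 100, 90, 70, 45, 30, 22]
sunscreen = [10, 12, 25, 40, 55, 70, 75, 68, 50, 28, 15, 11]

print(round(pearson(ice_cream, sunscreen), 3))    # close to 1.0
print(permutation_p_value(ice_cream, sunscreen))  # close to 0.0
```

The permutation test is a distribution-free way to sanity-check a correlation; in practice you would also consider sample size and domain knowledge before acting on the result.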

13. Build a narrative

Now, we’re going to look at how you can bring all of these elements together in a way that will benefit your business - starting with a little something called data storytelling.

The human brain responds incredibly well to strong stories or narratives. Once you’ve cleansed, shaped, and visualized your most valuable data using various BI dashboard tools, you should strive to tell a story - one with a clear-cut beginning, middle, and end.

By doing so, you will make your analytical efforts more accessible, digestible, and universal, empowering more people within your organization to use your discoveries to their actionable advantage.

14. Consider autonomous technology

Autonomous technologies, such as artificial intelligence (AI) and machine learning (ML), play a significant role in the advancement of understanding how to analyze data more effectively.

Gartner has predicted that 80% of emerging technologies will be developed with AI foundations. This is a testament to the ever-growing power and value of autonomous technologies.

At the moment, these technologies are revolutionizing the analysis industry. Some examples that we mentioned earlier are neural networks, intelligent alarms, and sentiment analysis.

15. Share the load

If you work with the right tools and dashboards, you will be able to present your metrics in a digestible, value-driven format, allowing almost everyone in the organization to connect with and use relevant data to their advantage.

Modern dashboards consolidate data from various sources, providing access to a wealth of insights in one centralized location, whether you need to monitor recruitment metrics or generate reports to be sent across numerous departments. Moreover, these cutting-edge tools offer access to dashboards from a multitude of devices, meaning that everyone within the business can connect with practical insights remotely - and share the load.

Once everyone is able to work with a data-driven mindset, you will catalyze the success of your business in ways you never thought possible. And when it comes to knowing how to analyze data, this kind of collaborative approach is essential.

16. Data analysis tools

In order to perform high-quality analysis of data, it is fundamental to use tools and software that will ensure the best results. Here is a brief summary of four fundamental categories of data analysis tools for your organization.

  • Business Intelligence: BI tools allow you to process significant amounts of data from several sources in any format. Through this, you can not only analyze and monitor your data to extract relevant insights but also create interactive reports and dashboards to visualize your KPIs and put them to work for your company. datapine is an online BI software focused on delivering powerful analysis features that are accessible to beginner and advanced users alike. As such, it offers a full-service solution that includes cutting-edge data analysis, KPI visualization, live dashboards, reporting, and artificial intelligence technologies to predict trends and minimize risk.
  • Statistical analysis: These tools are usually designed for scientists, statisticians, market researchers, and mathematicians, as they allow them to perform complex statistical analyses with methods like regression analysis, predictive analysis, and statistical modeling. A good tool for this type of analysis is R-Studio, as it offers powerful data modeling and hypothesis testing features that cover both academic and general data analysis. It is an industry favorite thanks to its capabilities for data cleaning, data reduction, and advanced analysis with several statistical methods. Another relevant tool is SPSS from IBM. The software offers advanced statistical analysis for users of all skill levels. Thanks to a vast library of machine learning algorithms, text analysis, and a hypothesis testing approach, it can help your company find relevant insights to drive better decisions. SPSS also works as a cloud service that enables you to run it anywhere.
  • SQL Consoles: SQL is a programming language often used to handle structured data in relational databases. Tools like these are popular among data scientists as they are extremely effective at unlocking these databases' value. Undoubtedly, one of the most widely used SQL tools on the market is MySQL Workbench. It offers several features such as a visual tool for database modeling and monitoring, complete SQL optimization, administration tools, and visual performance dashboards to keep track of KPIs.
  • Data Visualization: These tools are used to represent your data through charts, graphs, and maps that allow you to find patterns and trends in the data. datapine's already mentioned BI platform also offers a wealth of powerful online data visualization tools with several benefits, including compelling data-driven presentations to share with your entire company, the ability to view your data online from any device wherever you are, an interactive dashboard design feature that showcases your results in an understandable way, and online self-service reports that several people can work on simultaneously to enhance team productivity.
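To make the SQL Consoles category concrete, here is a minimal sketch using Python's built-in sqlite3 module as a lightweight stand-in for a full SQL console such as MySQL Workbench; the table and figures are invented for illustration.

```python
import sqlite3

# In-memory database with a made-up orders table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North", 120.0), ("North", 80.0), ("South", 200.0), ("South", 150.0)],
)

# A typical analysis query: total revenue per region, highest first.
rows = con.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM orders GROUP BY region ORDER BY revenue DESC"
).fetchall()
print(rows)  # [('South', 350.0), ('North', 200.0)]
```

The same GROUP BY / ORDER BY pattern carries over directly to production databases such as MySQL or PostgreSQL.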

17. Refine your process constantly 

Last is a step that might seem obvious to some people, but it can be easily ignored if you think you are done. Once you have extracted the needed results, you should always take a retrospective look at your project and think about what you can improve. As you saw throughout this long list of techniques, data analysis is a complex process that requires constant refinement. For this reason, you should always go one step further and keep improving. 

Quality Criteria For Data Analysis

So far we’ve covered a list of methods and techniques that should help you perform efficient data analysis. But how do you measure the quality and validity of your results? This is done with the help of some science quality criteria. Here we will go into a more theoretical area that is critical to understanding the fundamentals of statistical analysis in science. However, you should also be aware of these steps in a business context, as they will allow you to assess the quality of your results in the correct way. Let’s dig in. 

  • Internal validity: The results of a survey are internally valid if they measure what they are supposed to measure and thus provide credible results. In other words, internal validity reflects the trustworthiness of the results and how they can be affected by factors such as the research design, operational definitions, how the variables are measured, and more. For instance, imagine you are interviewing people to ask whether they brush their teeth twice a day. Most will answer yes, but their answers may simply reflect what is socially acceptable, which is to brush your teeth at least twice a day. In this case, you can’t be 100% sure whether respondents actually brush their teeth twice a day or just say that they do; therefore, the internal validity of this interview is very low.
  • External validity: Essentially, external validity refers to the extent to which the results of your research can be applied to a broader context. It basically aims to prove that the findings of a study can be applied in the real world. If the research can be applied to other settings, individuals, and times, then the external validity is high. 
  • Reliability : If your research is reliable, it means that it can be reproduced. If your measurement were repeated under the same conditions, it would produce similar results. This means that your measuring instrument consistently produces reliable results. For example, imagine a doctor building a symptoms questionnaire to detect a specific disease in a patient. Then, various other doctors use this questionnaire but end up diagnosing the same patient with a different condition. This means the questionnaire is not reliable in detecting the initial disease. Another important note here is that in order for your research to be reliable, it also needs to be objective. If the results of a study are the same, independent of who assesses them or interprets them, the study can be considered reliable. Let’s see the objectivity criteria in more detail now. 
  • Objectivity: In data science, objectivity means that the researcher needs to stay fully objective throughout the analysis. The results of a study need to be determined by objective criteria and not by the beliefs, personality, or values of the researcher. Objectivity needs to be ensured when you are gathering the data; for example, when interviewing individuals, the questions need to be asked in a way that doesn't influence the results. Paired with this, objectivity also needs to be considered when interpreting the data. If different researchers reach the same conclusions, then the study is objective. For this last point, you can set predefined criteria to interpret the results to ensure all researchers follow the same steps.

The discussed quality criteria mostly cover potential influences in a quantitative context. Analysis in qualitative research has, by default, additional subjective influences that must be controlled in a different way. Therefore, there are other quality criteria for this kind of research, such as credibility, transferability, dependability, and confirmability. You can see each of them in more detail in this resource.

Data Analysis Limitations & Barriers

Analyzing data is not an easy task. As you’ve seen throughout this post, there are many steps and techniques that you need to apply in order to extract useful information from your research. While a well-performed analysis can bring various benefits to your organization it doesn't come without limitations. In this section, we will discuss some of the main barriers you might encounter when conducting an analysis. Let’s see them more in detail. 

  • Lack of clear goals: No matter how good your data or analysis might be, if you don’t have clear goals or a hypothesis, the process might be worthless. While we mentioned some methods that don’t require a predefined hypothesis, it is always better to enter the analytical process with clear guidelines about what you expect to get out of it, especially in a business context in which data is utilized to support important strategic decisions.
  • Objectivity: Arguably one of the biggest barriers when it comes to data analysis in research is to stay objective. When trying to prove a hypothesis, researchers might find themselves, intentionally or unintentionally, directing the results toward an outcome that they want. To avoid this, always question your assumptions and avoid confusing facts with opinions. You can also show your findings to a research partner or external person to confirm that your results are objective. 
  • Data representation: A fundamental part of the analytical procedure is the way you represent your data. You can use various graphs and charts to represent your findings, but not all of them will work for all purposes. Choosing the wrong visual can not only damage your analysis but can mislead your audience; therefore, it is important to understand when to use each type of visual depending on your analytical goals. Our complete guide on the types of graphs and charts lists 20 different visuals with examples of when to use them.
  • Flawed correlation : Misleading statistics can significantly damage your research. We’ve already pointed out a few interpretation issues previously in the post, but it is an important barrier that we can't avoid addressing here as well. Flawed correlations occur when two variables appear related to each other but they are not. Confusing correlations with causation can lead to a wrong interpretation of results which can lead to building wrong strategies and loss of resources, therefore, it is very important to identify the different interpretation mistakes and avoid them. 
  • Sample size: A very common barrier to a reliable and efficient analysis process is the sample size. In order for the results to be trustworthy, the sample size should be representative of what you are analyzing. For example, imagine you have a company of 1000 employees and you ask the question “do you like working here?” to 50 employees, of which 49 say yes - that is 98%. Now, imagine you ask the same question to all 1000 employees and 980 say yes, which is also 98%. Claiming that 98% of employees like working at the company when the sample size was only 50 is not a representative or trustworthy conclusion. The significance of the results is far more accurate when surveying a bigger sample size.
  • Privacy concerns: In some cases, data collection can be subject to privacy regulations. Businesses gather all kinds of information from their customers, from purchasing behaviors to addresses and phone numbers. If this falls into the wrong hands due to a breach, it can affect the security and confidentiality of your clients. To avoid this issue, you need to collect only the data that is needed for your research and, if you are using sensitive facts, make them anonymous so customers are protected. The misuse of customer data can severely damage a business's reputation, so it is important to keep an eye on privacy.
  • Lack of communication between teams : When it comes to performing data analysis on a business level, it is very likely that each department and team will have different goals and strategies. However, they are all working for the same common goal of helping the business run smoothly and keep growing. When teams are not connected and communicating with each other, it can directly affect the way general strategies are built. To avoid these issues, tools such as data dashboards enable teams to stay connected through data in a visually appealing way. 
  • Innumeracy : Businesses are working with data more and more every day. While there are many BI tools available to perform effective analysis, data literacy is still a constant barrier. Not all employees know how to apply analysis techniques or extract insights from them. To prevent this from happening, you can implement different training opportunities that will prepare every relevant user to deal with data. 
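The sample size barrier in particular can be quantified. The sketch below (plain Python, hypothetical figures) uses the normal approximation to show how the margin of error around an observed share shrinks as the sample grows:

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion
    (normal approximation; z = 1.96 for 95% confidence)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# The same observed share (95% answered "yes") carries very
# different uncertainty depending on how many people were asked.
small = margin_of_error(0.95, 50)    # about +/- 6 percentage points
large = margin_of_error(0.95, 1000)  # about +/- 1.4 percentage points
print(round(small, 3), round(large, 3))
```

This is why a headline percentage from a handful of respondents deserves far less trust than the same percentage from a large, representative sample.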

Key Data Analysis Skills

As you've learned throughout this lengthy guide, analyzing data is a complex task that requires a lot of knowledge and skill. That said, thanks to the rise of self-service tools, the process is far more accessible and agile than it once was. Regardless, there are still some key skills that are valuable to have when working with data; we list the most important ones below.

  • Critical and statistical thinking: To successfully analyze data you need to be creative and think outside the box. That might sound like a strange statement considering that data is often tied to facts. However, a great level of critical thinking is required to uncover connections, come up with a valuable hypothesis, and extract conclusions that go a step beyond the surface. This, of course, needs to be complemented by statistical thinking and an understanding of numbers.
  • Data cleaning: Anyone who has ever worked with data will tell you that the cleaning and preparation process accounts for 80% of a data analyst's work, so the skill is fundamental. Beyond that, failing to clean the data adequately can significantly damage the analysis, which can lead to poor decision-making in a business scenario. While there are multiple tools that automate the cleaning process and eliminate the possibility of human error, it is still a valuable skill to master.
  • Data visualization: Visuals make the information easier to understand and analyze, not only for professional users but especially for non-technical ones. Having the necessary skills to not only choose the right chart type but know when to apply it correctly is key. This also means being able to design visually compelling charts that make the data exploration process more efficient. 
  • SQL: The Structured Query Language or SQL is a programming language used to communicate with databases. It is fundamental knowledge as it enables you to update, manipulate, and organize data from relational databases which are the most common databases used by companies. It is fairly easy to learn and one of the most valuable skills when it comes to data analysis. 
  • Communication skills: This is a skill that is especially valuable in a business environment. Being able to clearly communicate analytical outcomes to colleagues is incredibly important, especially when the information you are trying to convey is complex for non-technical people. This applies to in-person communication as well as written format, for example, when generating a dashboard or report. While this might be considered a “soft” skill compared to the other ones we mentioned, it should not be ignored as you most likely will need to share analytical findings with others no matter the context. 

Data Analysis In The Big Data Environment

Big data is invaluable to today’s businesses, and by using different methods for data analysis, it’s possible to view your data in a way that can help you turn insight into positive action.

To inspire your efforts and put the importance of big data into context, here are some insights that you should know:

  • By 2026, the big data industry is expected to be worth approximately $273.4 billion.
  • 94% of enterprises say that analyzing data is important for their growth and digital transformation. 
  • Companies that exploit the full potential of their data can increase their operating margins by 60% .
  • We have already discussed the benefits of artificial intelligence in this article; that industry's financial impact is expected to grow to $40 billion by 2025.

Data analysis concepts may come in many forms, but fundamentally, any solid methodology will help to make your business more streamlined, cohesive, insightful, and successful than ever before.

Key Takeaways From Data Analysis 

As we reach the end of our data analysis journey, we leave a small summary of the main methods and techniques to perform excellent analysis and grow your business.

17 Essential Types of Data Analysis Methods:

  • Cluster analysis
  • Cohort analysis
  • Regression analysis
  • Factor analysis
  • Neural Networks
  • Data Mining
  • Text analysis
  • Time series analysis
  • Decision trees
  • Conjoint analysis 
  • Correspondence Analysis
  • Multidimensional Scaling 
  • Content analysis 
  • Thematic analysis
  • Narrative analysis 
  • Grounded theory analysis
  • Discourse analysis 

Top 17 Data Analysis Techniques:

  • Collaborate your needs
  • Establish your questions
  • Data democratization
  • Think of data governance 
  • Clean your data
  • Set your KPIs
  • Omit useless data
  • Build a data management roadmap
  • Integrate technology
  • Answer your questions
  • Visualize your data
  • Interpretation of data
  • Consider autonomous technology
  • Build a narrative
  • Share the load
  • Data Analysis tools
  • Refine your process constantly 

We’ve pondered the data analysis definition and drilled down into the practical applications of data-centric analytics, and one thing is clear: by taking measures to arrange your data and making your metrics work for you, it’s possible to transform raw information into action - the kind that will push your business to the next level.

Yes, good data analytics techniques result in enhanced business intelligence (BI). To help you understand this notion in more detail, read our exploration of business intelligence reporting .

And, if you’re ready to perform your own analysis and drill down into your facts and figures while interacting with your data on astonishing visuals, you can try our software for a free, 14-day trial.


Research Method


Data Analysis – Process, Methods and Types


Data Analysis


Definition:

Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It involves applying various statistical and computational techniques to interpret and derive insights from large datasets. The ultimate aim of data analysis is to convert raw data into actionable insights that can inform business decisions, scientific research, and other endeavors.

Data Analysis Process

The following are step-by-step guides to the data analysis process:

Define the Problem

The first step in data analysis is to clearly define the problem or question that needs to be answered. This involves identifying the purpose of the analysis, the data required, and the intended outcome.

Collect the Data

The next step is to collect the relevant data from various sources. This may involve collecting data from surveys, databases, or other sources. It is important to ensure that the data collected is accurate, complete, and relevant to the problem being analyzed.

Clean and Organize the Data

Once the data has been collected, it needs to be cleaned and organized. This involves removing any errors or inconsistencies in the data, filling in missing values, and ensuring that the data is in a format that can be easily analyzed.

Analyze the Data

The next step is to analyze the data using various statistical and analytical techniques. This may involve identifying patterns in the data, conducting statistical tests, or using machine learning algorithms to identify trends and insights.

Interpret the Results

After analyzing the data, the next step is to interpret the results. This involves drawing conclusions based on the analysis and identifying any significant findings or trends.

Communicate the Findings

Once the results have been interpreted, they need to be communicated to stakeholders. This may involve creating reports, visualizations, or presentations to effectively communicate the findings and recommendations.

Take Action

The final step in the data analysis process is to take action based on the findings. This may involve implementing new policies or procedures, making strategic decisions, or taking other actions based on the insights gained from the analysis.

Types of Data Analysis

Types of Data Analysis are as follows:

Descriptive Analysis

This type of analysis involves summarizing and describing the main characteristics of a dataset, such as the mean, median, mode, standard deviation, and range.
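As a minimal sketch of descriptive analysis, Python's standard statistics module covers these summary measures directly; the sales figures below are invented for illustration.

```python
import statistics

sales = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]  # hypothetical daily sales

summary = {
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "mode": statistics.mode(sales),
    "stdev": round(statistics.stdev(sales), 2),
    "range": max(sales) - min(sales),
}
print(summary)
# {'mean': 20.1, 'median': 21.0, 'mode': 22, 'stdev': 5.32, 'range': 18}
```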

Inferential Analysis

This type of analysis involves making inferences about a population based on a sample. Inferential analysis can help determine whether a certain relationship or pattern observed in a sample is likely to be present in the entire population.
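A hedged sketch of inferential analysis in plain Python: estimating a population mean from a sample and attaching a 95% confidence interval via the normal approximation. The order values are invented, and a careful analysis of a small sample would use a t-distribution instead.

```python
import math
import statistics

# A sample of 36 order values drawn from a much larger population.
sample = [48, 52, 50, 47, 53, 49] * 6

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval for the *population* mean:
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(round(low, 2), round(high, 2))
```

The interval, not the point estimate alone, is what lets you claim a pattern observed in the sample is likely present in the whole population.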

Diagnostic Analysis

This type of analysis involves identifying and diagnosing problems or issues within a dataset. Diagnostic analysis can help identify outliers, errors, missing data, or other anomalies in the dataset.

Predictive Analysis

This type of analysis involves using statistical models and algorithms to predict future outcomes or trends based on historical data. Predictive analysis can help businesses and organizations make informed decisions about the future.
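As a minimal predictive-analysis sketch, the snippet below fits a least-squares line to invented monthly revenue figures and extrapolates one month ahead. Real predictive models are far richer, but the principle of learning from historical data is the same.

```python
months = [1, 2, 3, 4, 5, 6]
revenue = [10.0, 12.0, 13.5, 15.0, 17.0, 18.5]  # hypothetical figures

# Ordinary least squares for a single predictor, from scratch.
n = len(months)
mx = sum(months) / n
my = sum(revenue) / n
slope = (
    sum((x - mx) * (y - my) for x, y in zip(months, revenue))
    / sum((x - mx) ** 2 for x in months)
)
intercept = my - slope * mx

# Forecast month 7 by extending the fitted trend line.
forecast = intercept + slope * 7
print(round(forecast, 2))  # 20.23
```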

Prescriptive Analysis

This type of analysis involves recommending a course of action based on the results of previous analyses. Prescriptive analysis can help organizations make data-driven decisions about how to optimize their operations, products, or services.

Exploratory Analysis

This type of analysis involves exploring the relationships and patterns within a dataset to identify new insights and trends. Exploratory analysis is often used in the early stages of research or data analysis to generate hypotheses and identify areas for further investigation.

Data Analysis Methods

Data Analysis Methods are as follows:

Statistical Analysis

This method involves the use of mathematical models and statistical tools to analyze and interpret data. It includes measures of central tendency, correlation analysis, regression analysis, hypothesis testing, and more.

Machine Learning

This method involves the use of algorithms to identify patterns and relationships in data. It includes supervised and unsupervised learning, classification, clustering, and predictive modeling.
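The flavour of supervised learning can be shown in a few lines: below is a toy 1-nearest-neighbour classifier in plain Python. The points and labels are invented, and production work would use a dedicated library rather than this sketch.

```python
import math

# Labelled training points: coordinates -> class.
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

def predict(point):
    """Classify a point by copying the label of its nearest neighbour."""
    nearest = min(training, key=lambda item: math.dist(point, item[0]))
    return nearest[1]

print(predict((2.0, 1.5)))  # "small" -- closest to the first cluster
print(predict((7.0, 8.0)))  # "large" -- closest to the second cluster
```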

Data Mining

This method involves using statistical and machine learning techniques to extract information and insights from large and complex datasets.

Text Analysis

This method involves using natural language processing (NLP) techniques to analyze and interpret text data. It includes sentiment analysis, topic modeling, and entity recognition.
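A deliberately naive sentiment-analysis sketch: counting words from small positive and negative lexicons. The lexicons here are invented, and real NLP pipelines use trained models, but the counting idea is where lexicon-based sentiment analysis starts.

```python
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "slow"}

def sentiment(text):
    """Label text by the balance of positive vs. negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this excellent product"))  # positive
print(sentiment("terrible and slow support"))      # negative
```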

Network Analysis

This method involves analyzing the relationships and connections between entities in a network, such as social networks or computer networks. It includes social network analysis and graph theory.

Time Series Analysis

This method involves analyzing data collected over time to identify patterns and trends. It includes forecasting, decomposition, and smoothing techniques.
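One of the simplest smoothing techniques mentioned here, a trailing moving average, fits in a few lines of plain Python; the visit counts are invented for illustration.

```python
def moving_average(series, window=3):
    """Smooth a series with a simple trailing moving average."""
    return [
        round(sum(series[i - window + 1 : i + 1]) / window, 2)
        for i in range(window - 1, len(series))
    ]

# Hypothetical weekly website visits: noisy, but trending upward.
visits = [100, 140, 120, 160, 150, 190, 180]
print(moving_average(visits))  # [120.0, 140.0, 143.33, 166.67, 173.33]
```

Averaging over a window damps the week-to-week noise so the underlying upward trend becomes visible.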

Spatial Analysis

This method involves analyzing geographic data to identify spatial patterns and relationships. It includes spatial statistics, spatial regression, and geospatial data visualization.

Data Visualization

This method involves using graphs, charts, and other visual representations to help communicate the findings of the analysis. It includes scatter plots, bar charts, heat maps, and interactive dashboards.
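Even without a charting library, the idea behind visualization can be sketched as a text bar chart in plain Python; the regions and figures are invented.

```python
def bar_chart(data):
    """Render label/value pairs as a simple text bar chart."""
    lines = []
    for label, value in data.items():
        lines.append(f"{label:>5} | {'#' * value} {value}")
    return "\n".join(lines)

regions = {"North": 20, "South": 35, "East": 15, "West": 30}
print(bar_chart(regions))
```

Each row's bar length is proportional to its value, so the largest region stands out at a glance - the same principle that makes proper charts effective.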

Qualitative Analysis

This method involves analyzing non-numeric data such as interviews, observations, and open-ended survey responses. It includes thematic analysis, content analysis, and grounded theory.
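A crude content-analysis sketch in plain Python: coding open-ended survey answers by counting theme keywords. The responses, themes, and keywords are all invented, and real qualitative coding is far more interpretive than simple counting.

```python
import re
from collections import Counter

responses = [
    "The staff were friendly but the wait was long",
    "Long wait times, otherwise friendly service",
    "Friendly team, clean rooms",
]
themes = {"friendliness": ["friendly"], "waiting": ["wait", "long"]}

# Tally how often each theme's keywords appear across all responses.
counts = Counter()
for text in responses:
    words = re.findall(r"[a-z]+", text.lower())
    for theme, keywords in themes.items():
        counts[theme] += sum(words.count(k) for k in keywords)

print(dict(counts))  # {'friendliness': 3, 'waiting': 4}
```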

Multi-criteria Decision Analysis

This method involves analyzing multiple criteria and objectives to support decision-making. It includes techniques such as the analytic hierarchy process (AHP), TOPSIS, and ELECTRE.

Data Analysis Tools

There are various data analysis tools available that can help with different aspects of data analysis. Below is a list of some commonly used data analysis tools:

  • Microsoft Excel: A widely used spreadsheet program that allows for data organization, analysis, and visualization.
  • SQL : A programming language used to manage and manipulate relational databases.
  • R : An open-source programming language and software environment for statistical computing and graphics.
  • Python : A general-purpose programming language that is widely used in data analysis and machine learning.
  • Tableau : A data visualization software that allows for interactive and dynamic visualizations of data.
  • SAS : A statistical analysis software used for data management, analysis, and reporting.
  • SPSS : A statistical analysis software used for data analysis, reporting, and modeling.
  • Matlab : A numerical computing software that is widely used in scientific research and engineering.
  • RapidMiner : A data science platform that offers a wide range of data analysis and machine learning tools.

Applications of Data Analysis

Data analysis has numerous applications across various fields. Below are some examples of how data analysis is used in different fields:

  • Business : Data analysis is used to gain insights into customer behavior, market trends, and financial performance. This includes customer segmentation, sales forecasting, and market research.
  • Healthcare : Data analysis is used to identify patterns and trends in patient data, improve patient outcomes, and optimize healthcare operations. This includes clinical decision support, disease surveillance, and healthcare cost analysis.
  • Education : Data analysis is used to measure student performance, evaluate teaching effectiveness, and improve educational programs. This includes assessment analytics, learning analytics, and program evaluation.
  • Finance : Data analysis is used to monitor and evaluate financial performance, identify risks, and make investment decisions. This includes risk management, portfolio optimization, and fraud detection.
  • Government : Data analysis is used to inform policy-making, improve public services, and enhance public safety. This includes crime analysis, disaster response planning, and social welfare program evaluation.
  • Sports : Data analysis is used to gain insights into athlete performance, improve team strategy, and enhance fan engagement. This includes player evaluation, scouting analysis, and game strategy optimization.
  • Marketing : Data analysis is used to measure the effectiveness of marketing campaigns, understand customer behavior, and develop targeted marketing strategies. This includes customer segmentation, marketing attribution analysis, and social media analytics.
  • Environmental science : Data analysis is used to monitor and evaluate environmental conditions, assess the impact of human activities on the environment, and develop environmental policies. This includes climate modeling, ecological forecasting, and pollution monitoring.

When to Use Data Analysis

Data analysis is useful when you need to extract meaningful insights and information from large and complex datasets. It is a crucial step in the decision-making process, as it helps you understand the underlying patterns and relationships within the data, and identify potential areas for improvement or opportunities for growth.

Here are some specific scenarios where data analysis can be particularly helpful:

  • Problem-solving : When you encounter a problem or challenge, data analysis can help you identify the root cause and develop effective solutions.
  • Optimization : Data analysis can help you optimize processes, products, or services to increase efficiency, reduce costs, and improve overall performance.
  • Prediction: Data analysis can help you make predictions about future trends or outcomes, which can inform strategic planning and decision-making.
  • Performance evaluation : Data analysis can help you evaluate the performance of a process, product, or service to identify areas for improvement and potential opportunities for growth.
  • Risk assessment : Data analysis can help you assess and mitigate risks, whether it is financial, operational, or related to safety.
  • Market research : Data analysis can help you understand customer behavior and preferences, identify market trends, and develop effective marketing strategies.
  • Quality control: Data analysis can help you ensure product quality and customer satisfaction by identifying and addressing quality issues.

Purpose of Data Analysis

The primary purposes of data analysis can be summarized as follows:

  • To gain insights: Data analysis allows you to identify patterns and trends in data, which can provide valuable insights into the underlying factors that influence a particular phenomenon or process.
  • To inform decision-making: Data analysis can help you make informed decisions based on the information that is available. By analyzing data, you can identify potential risks, opportunities, and solutions to problems.
  • To improve performance: Data analysis can help you optimize processes, products, or services by identifying areas for improvement and potential opportunities for growth.
  • To measure progress: Data analysis can help you measure progress towards a specific goal or objective, allowing you to track performance over time and adjust your strategies accordingly.
  • To identify new opportunities: Data analysis can help you identify new opportunities for growth and innovation by identifying patterns and trends that may not have been visible before.

Examples of Data Analysis

Some Examples of Data Analysis are as follows:

  • Social Media Monitoring: Companies use data analysis to monitor social media activity in real-time to understand their brand reputation, identify potential customer issues, and track competitors. By analyzing social media data, businesses can make informed decisions on product development, marketing strategies, and customer service.
  • Financial Trading: Financial traders use data analysis to make real-time decisions about buying and selling stocks, bonds, and other financial instruments. By analyzing real-time market data, traders can identify trends and patterns that help them make informed investment decisions.
  • Traffic Monitoring : Cities use data analysis to monitor traffic patterns and make real-time decisions about traffic management. By analyzing data from traffic cameras, sensors, and other sources, cities can identify congestion hotspots and make changes to improve traffic flow.
  • Healthcare Monitoring: Healthcare providers use data analysis to monitor patient health in real-time. By analyzing data from wearable devices, electronic health records, and other sources, healthcare providers can identify potential health issues and provide timely interventions.
  • Online Advertising: Online advertisers use data analysis to make real-time decisions about advertising campaigns. By analyzing data on user behavior and ad performance, advertisers can make adjustments to their campaigns to improve their effectiveness.
  • Sports Analysis : Sports teams use data analysis to make real-time decisions about strategy and player performance. By analyzing data on player movement, ball position, and other variables, coaches can make informed decisions about substitutions, game strategy, and training regimens.
  • Energy Management : Energy companies use data analysis to monitor energy consumption in real-time. By analyzing data on energy usage patterns, companies can identify opportunities to reduce energy consumption and improve efficiency.

Characteristics of Data Analysis

Characteristics of Data Analysis are as follows:

  • Objective : Data analysis should be objective and based on empirical evidence, rather than subjective assumptions or opinions.
  • Systematic : Data analysis should follow a systematic approach, using established methods and procedures for collecting, cleaning, and analyzing data.
  • Accurate : Data analysis should produce accurate results, free from errors and bias. Data should be validated and verified to ensure its quality.
  • Relevant : Data analysis should be relevant to the research question or problem being addressed. It should focus on the data that is most useful for answering the research question or solving the problem.
  • Comprehensive : Data analysis should be comprehensive and consider all relevant factors that may affect the research question or problem.
  • Timely : Data analysis should be conducted in a timely manner, so that the results are available when they are needed.
  • Reproducible : Data analysis should be reproducible, meaning that other researchers should be able to replicate the analysis using the same data and methods.
  • Communicable : Data analysis should be communicated clearly and effectively to stakeholders and other interested parties. The results should be presented in a way that is understandable and useful for decision-making.

Advantages of Data Analysis

Advantages of Data Analysis are as follows:

  • Better decision-making: Data analysis helps in making informed decisions based on facts and evidence, rather than intuition or guesswork.
  • Improved efficiency: Data analysis can identify inefficiencies and bottlenecks in business processes, allowing organizations to optimize their operations and reduce costs.
  • Increased accuracy: Data analysis helps to reduce errors and bias, providing more accurate and reliable information.
  • Better customer service: Data analysis can help organizations understand their customers better, allowing them to provide better customer service and improve customer satisfaction.
  • Competitive advantage: Data analysis can provide organizations with insights into their competitors, allowing them to identify areas where they can gain a competitive advantage.
  • Identification of trends and patterns : Data analysis can identify trends and patterns in data that may not be immediately apparent, helping organizations to make predictions and plan for the future.
  • Improved risk management : Data analysis can help organizations identify potential risks and take proactive steps to mitigate them.
  • Innovation: Data analysis can inspire innovation and new ideas by revealing new opportunities or previously unknown correlations in data.

Limitations of Data Analysis

  • Data quality: The quality of data can impact the accuracy and reliability of analysis results. If data is incomplete, inconsistent, or outdated, the analysis may not provide meaningful insights.
  • Limited scope: Data analysis is limited by the scope of the data available. If data is incomplete or does not capture all relevant factors, the analysis may not provide a complete picture.
  • Human error : Data analysis is often conducted by humans, and errors can occur in data collection, cleaning, and analysis.
  • Cost : Data analysis can be expensive, requiring specialized tools, software, and expertise.
  • Time-consuming : Data analysis can be time-consuming, especially when working with large datasets or conducting complex analyses.
  • Overreliance on data: Data analysis should be complemented with human intuition and expertise. Overreliance on data can lead to a lack of creativity and innovation.
  • Privacy concerns: Data analysis can raise privacy concerns if personal or sensitive information is used without proper consent or security measures.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer


How to conduct a meta-analysis in eight steps: a practical guide

  • Open access
  • Published: 30 November 2021
  • Volume 72, pages 1–19 (2022)


  • Christopher Hansen, Holger Steinmetz & Jörn Block


1 Introduction

“Scientists have known for centuries that a single study will not resolve a major issue. Indeed, a small sample study will not even resolve a minor issue. Thus, the foundation of science is the cumulation of knowledge from the results of many studies.” (Hunter et al. 1982 , p. 10)

Meta-analysis is a central method for knowledge accumulation in many scientific fields (Aguinis et al. 2011c ; Kepes et al. 2013 ). Similar to a narrative review, it serves as a synopsis of a research question or field. However, going beyond a narrative summary of key findings, a meta-analysis adds value in providing a quantitative assessment of the relationship between two target variables or the effectiveness of an intervention (Gurevitch et al. 2018 ). Also, it can be used to test competing theoretical assumptions against each other or to identify important moderators where the results of different primary studies differ from each other (Aguinis et al. 2011b ; Bergh et al. 2016 ). Rooted in the synthesis of the effectiveness of medical and psychological interventions in the 1970s (Glass 2015 ; Gurevitch et al. 2018 ), meta-analysis is nowadays also an established method in management research and related fields.

The increasing importance of meta-analysis in management research has resulted in the publication of guidelines in recent years that discuss the merits and best practices in various fields, such as general management (Bergh et al. 2016 ; Combs et al. 2019 ; Gonzalez-Mulé and Aguinis 2018 ), international business (Steel et al. 2021 ), economics and finance (Geyer-Klingeberg et al. 2020 ; Havranek et al. 2020 ), marketing (Eisend 2017 ; Grewal et al. 2018 ), and organizational studies (DeSimone et al. 2020 ; Rudolph et al. 2020 ). These articles discuss existing and trending methods and propose solutions for often experienced problems. This editorial briefly summarizes the insights of these papers; provides a workflow of the essential steps in conducting a meta-analysis; suggests state-of-the art methodological procedures; and points to other articles for in-depth investigation. Thus, this article has two goals: (1) based on the findings of previous editorials and methodological articles, it defines methodological recommendations for meta-analyses submitted to Management Review Quarterly (MRQ); and (2) it serves as a practical guide for researchers who have little experience with meta-analysis as a method but plan to conduct one in the future.

2 Eight steps in conducting a meta-analysis

2.1 Step 1: defining the research question

The first step in conducting a meta-analysis, as with any other empirical study, is the definition of the research question. Most importantly, the research question determines the realm of constructs to be considered or the type of interventions whose effects shall be analyzed. When defining the research question, two hurdles may arise. First, when defining an adequate study scope, researchers must consider that the number of publications has grown exponentially in many fields of research in recent decades (Fortunato et al. 2018 ). On the one hand, a larger number of studies increases the potentially relevant literature basis and enables researchers to conduct meta-analyses. On the other hand, screening a large number of potentially relevant studies can result in an unmanageable workload. Thus, Steel et al. ( 2021 ) highlight the importance of balancing manageability and relevance when defining the research question. Second, like the number of primary studies, the number of meta-analyses in management research has also grown strongly in recent years (Geyer-Klingeberg et al. 2020 ; Rauch 2020 ; Schwab 2015 ). Therefore, it is likely that one or several meta-analyses already exist for many topics of high scholarly interest. However, this should not deter researchers from investigating their research questions. One possibility is to consider moderators or mediators of a relationship that have previously been ignored. For example, a meta-analysis about startup performance could investigate the impact of different ways to measure the performance construct (e.g., growth vs. profitability vs. survival time) or certain characteristics of the founders as moderators. Another possibility is to replicate previous meta-analyses and test whether their findings can be confirmed with an updated sample of primary studies or newly developed methods.
Frequent replications and updates of meta-analyses are important contributions to cumulative science and are increasingly called for by the research community (Anderson & Kichkha 2017 ; Steel et al. 2021 ). Consistent with its focus on replication studies (Block and Kuckertz 2018 ), MRQ therefore also invites authors to submit replication meta-analyses.

2.2 Step 2: literature search

2.2.1 Search strategies

Similar to conducting a literature review, the search process of a meta-analysis should be systematic, reproducible, and transparent, resulting in a sample that includes all relevant studies (Fisch and Block 2018 ; Gusenbauer and Haddaway 2020 ). There are several identification strategies for relevant primary studies when compiling meta-analytical datasets (Harari et al. 2020 ). First, previous meta-analyses on the same or a related topic may provide lists of included studies that offer a good starting point to identify and become familiar with the relevant literature. This practice is also applicable to topic-related literature reviews, which often summarize the central findings of the reviewed articles in systematic tables. Both article types likely include the most prominent studies of a research field. The most common and important search strategy, however, is a keyword search in electronic databases (Harari et al. 2020 ). This strategy will probably yield the largest number of relevant studies, particularly so-called ‘grey literature’, which may not be considered by literature reviews. Gusenbauer and Haddaway ( 2020 ) provide a detailed overview of 34 scientific databases, of which 18 are multidisciplinary or have a focus on management sciences, along with their suitability for literature synthesis. To prevent biased results due to the scope or journal coverage of one database, researchers should use at least two different databases (DeSimone et al. 2020 ; Martín-Martín et al. 2021 ; Mongeon & Paul-Hus 2016 ). However, a database search can easily lead to an overload of potentially relevant studies. For example, key term searches in Google Scholar for “entrepreneurial intention” and “firm diversification” resulted in more than 660,000 and 810,000 hits, respectively. Therefore, a precise research question and precise search terms using Boolean operators are advisable (Gusenbauer and Haddaway 2020 ).
Addressing the challenge of identifying relevant articles in the growing number of database publications, (semi)automated approaches using text mining and machine learning (Bosco et al. 2017 ; O’Mara-Eves et al. 2015 ; Ouzzani et al. 2016 ; Thomas et al. 2017 ) can also be promising and time-saving search tools in the future. Also, some electronic databases offer the possibility to track forward citations of influential studies and thereby identify further relevant articles. Finally, collecting unpublished or undetected studies through conferences, personal contact with (leading) scholars, or listservs can be strategies to increase the study sample size (Grewal et al. 2018 ; Harari et al. 2020 ; Pigott and Polanin 2020 ).

2.2.2 Study inclusion criteria and sample composition

Next, researchers must decide which studies to include in the meta-analysis. Some guidelines for literature reviews recommend limiting the sample to studies published in renowned academic journals to ensure the quality of findings (e.g., Kraus et al. 2020 ). For meta-analysis, however, Steel et al. ( 2021 ) advocate for the inclusion of all available studies, including grey literature, to prevent selection biases based on availability, cost, familiarity, and language (Rothstein et al. 2005 ), or the “Matthew effect”, which denotes the phenomenon that highly cited articles are found faster than less cited articles (Merton 1968 ). Harrison et al. ( 2017 ) find that the effects of published studies in management are inflated on average by 30% compared to unpublished studies. This so-called publication bias or “file drawer problem” (Rosenthal 1979 ) results from the preference of academia to publish more statistically significant and less statistically insignificant study results. Owen and Li ( 2020 ) showed that publication bias is particularly severe when variables of interest are used as key variables rather than control variables. To consider the true effect size of a target variable or relationship, the inclusion of all types of research outputs is therefore recommended (Polanin et al. 2016 ). Different test procedures to identify publication bias are discussed subsequently in Step 7.

In addition to the decision of whether to include certain study types (i.e., published vs. unpublished studies), there can be other reasons to exclude studies that are identified in the search process. These reasons can be manifold and are primarily related to the specific research question and methodological peculiarities. For example, studies identified by keyword search might not qualify thematically after all, may use unsuitable variable measurements, or may not report usable effect sizes. Furthermore, there might be multiple studies by the same authors using similar datasets. If they do not differ sufficiently in terms of their sample characteristics or variables used, only one of these studies should be included to prevent bias from duplicates (Wood 2008 ; see this article for a detection heuristic).

In general, the screening process should be conducted stepwise, beginning with a removal of duplicate citations from different databases, followed by abstract screening to exclude clearly unsuitable studies and a final full-text screening of the remaining articles (Pigott and Polanin 2020 ). A graphical tool to systematically document the sample selection process is the PRISMA flow diagram (Moher et al. 2009 ). Page et al. ( 2021 ) recently presented an updated version of the PRISMA statement, including an extended item checklist and flow diagram to report the study process and findings.
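The first of these screening steps, removing duplicate citations gathered from different databases, can be sketched in a few lines of Python. The bibliographic records below and the choice of DOI as the matching key are illustrative assumptions, not part of the guidelines above.

```python
# Hedged sketch: deduplicating citation records pulled from several
# databases, keyed on DOI. Records and field names are invented.
records = [
    {"doi": "10.1000/a1", "title": "Diversity and firm performance", "db": "Scopus"},
    {"doi": "10.1000/a1", "title": "Diversity and firm performance", "db": "Web of Science"},
    {"doi": "10.1000/b2", "title": "Founder teams and startup growth", "db": "Scopus"},
]

seen, unique = set(), []
for rec in records:
    if rec["doi"] not in seen:        # keep the first copy of each DOI
        seen.add(rec["doi"])
        unique.append(rec)

print(len(unique))
```

In real screening workflows, records without a DOI would additionally be matched on title and author fields before abstract and full-text screening begin.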

2.3 Step 3: choice of the effect size measure

2.3.1 Types of effect sizes

The two most common meta-analytical effect size measures in management studies are (z-transformed) correlation coefficients and standardized mean differences (Aguinis et al. 2011a ; Geyskens et al. 2009 ). However, meta-analyses in management science and related fields may not be limited to those two effect size measures but rather depend on the subfield of investigation (Borenstein 2009 ; Stanley and Doucouliagos 2012 ). In economics and finance, researchers are more interested in the examination of elasticities and marginal effects extracted from regression models than in pure bivariate correlations (Stanley and Doucouliagos 2012 ). Regression coefficients can also be converted to partial correlation coefficients based on their t-statistics to make regression results comparable across studies (Stanley and Doucouliagos 2012 ). Although some meta-analyses in management research have combined bivariate and partial correlations in their study samples, Aloe ( 2015 ) and Combs et al. ( 2019 ) advise researchers not to use this practice. Most importantly, they argue that the effect size strength of partial correlations depends on the other variables included in the regression model and is therefore incomparable to bivariate correlations (Schmidt and Hunter 2015 ), resulting in a possible bias of the meta-analytic results (Roth et al. 2018 ). We endorse this opinion. If at all, we recommend separate analyses for each measure. In addition to these measures, survival rates, risk ratios or odds ratios, which are common measures in medical research (Borenstein 2009 ), can be suitable effect sizes for specific management research questions, such as understanding the determinants of the survival of startup companies. To summarize, the choice of a suitable effect size is often taken away from the researcher because it is typically dependent on the investigated research question as well as the conventions of the specific research field (Cheung and Vijayakumar 2016 ).

2.3.2 Conversion of effect sizes to a common measure

After having defined the primary effect size measure for the meta-analysis, it might become necessary in the later coding process to convert study findings that are reported in effect sizes that are different from the chosen primary effect size. For example, a study might report only descriptive statistics for two study groups but no correlation coefficient, which is used as the primary effect size measure in the meta-analysis. Different effect size measures can be harmonized using conversion formulae, which are provided by standard method books such as Borenstein et al. ( 2009 ) or Lipsey and Wilson ( 2001 ). There also exist online effect size calculators for meta-analysis.
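As a hedged illustration, two textbook conversions of the kind found in such method books (Fisher's z-transformation of a correlation, and converting a standardized mean difference d to a correlation r) can be written in a few lines of Python:

```python
# Hedged sketch of two textbook conversions (cf. Borenstein et al. 2009):
# Fisher's z-transformation of r, and converting a standardized mean
# difference d to a correlation r for group sizes n1 and n2.
import math

def fisher_z(r):
    """Fisher's z: z = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def d_to_r(d, n1, n2):
    """d-to-r conversion; a corrects for unequal group sizes."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d ** 2 + a)

print(round(fisher_z(0.30), 4))        # z for r = 0.30
print(round(d_to_r(0.50, 50, 50), 4))  # r for d = 0.50 with equal groups
```

For equal group sizes the correction term a reduces to 4, which is the simplified formula often quoted.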

2.4 Step 4: choice of the analytical method used

Choosing which meta-analytical method to use is directly connected to the research question of the meta-analysis. Research questions in meta-analyses can address a relationship between constructs or an effect of an intervention in a general manner, or they can focus on moderating or mediating effects. There are four meta-analytical methods that are primarily used in contemporary management research (Combs et al. 2019 ; Geyer-Klingeberg et al. 2020 ), which allow the investigation of these different types of research questions: traditional univariate meta-analysis, meta-regression, meta-analytic structural equation modeling, and qualitative meta-analysis (Hoon 2013 ). While the first three are quantitative, the last summarizes qualitative findings. Table 1 summarizes the key characteristics of the three quantitative methods.

2.4.1 Univariate meta-analysis

In its traditional form, a meta-analysis reports a weighted mean effect size for the relationship or intervention of investigation and provides information on the magnitude of variance among primary studies (Aguinis et al. 2011c ; Borenstein et al. 2009 ). Accordingly, it serves as a quantitative synthesis of a research field (Borenstein et al. 2009 ; Geyskens et al. 2009 ). Prominent traditional approaches have been developed, for example, by Hedges and Olkin ( 1985 ) or Hunter and Schmidt ( 1990 , 2004 ). However, going beyond its simple summary function, the traditional approach has limitations in explaining the observed variance among findings (Gonzalez-Mulé and Aguinis 2018 ). To identify moderators (or boundary conditions) of the relationship of interest, meta-analysts can create subgroups and investigate differences between those groups (Borenstein and Higgins 2013 ; Hunter and Schmidt 2004 ). Potential moderators can be study characteristics (e.g., whether a study is published vs. unpublished), sample characteristics (e.g., study country, industry focus, or type of survey/experiment participants), or measurement artifacts (e.g., different types of variable measurements). The univariate approach is thus suitable to identify the overall direction of a relationship and can serve as a good starting point for additional analyses. However, due to its limitations in examining boundary conditions and developing theory, the univariate approach on its own is currently oftentimes viewed as not sufficient (Rauch 2020 ; Shaw and Ertug 2017 ).
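A minimal numerical sketch of such a weighted mean effect size is given below: a fixed-effect, inverse-variance pooling of Fisher-z transformed correlations. All study values are invented, and a full analysis would additionally estimate the between-study variance (e.g., in a random-effects model).

```python
# Hedged sketch: fixed-effect (inverse-variance) pooling of correlations
# via Fisher's z-transformation. Study values are invented; Var(z) = 1/(n-3).
import math

studies = [(0.25, 100), (0.40, 60), (0.10, 200)]   # (correlation r, sample size n)

zs = [math.atanh(r) for r, _ in studies]           # z-transform each correlation
ws = [n - 3 for _, n in studies]                   # weight = 1 / Var(z) = n - 3

z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
r_bar = math.tanh(z_bar)                           # back-transform the pooled z

print(round(r_bar, 3))
```

Note how the weighting pulls the pooled estimate toward the largest study (n = 200), which is exactly the behavior inverse-variance weighting is designed to produce.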

2.4.2 Meta-regression analysis

Meta-regression analysis (Hedges and Olkin 1985 ; Lipsey and Wilson 2001 ; Stanley and Jarrell 1989 ) aims to investigate the heterogeneity among observed effect sizes by testing multiple potential moderators simultaneously. In meta-regression, the coded effect size is used as the dependent variable and is regressed on a list of moderator variables. These moderator variables can be categorical variables as described previously in the traditional univariate approach or (semi)continuous variables such as country scores that are merged with the meta-analytical data. Thus, meta-regression analysis overcomes the disadvantages of the traditional approach, which only allows us to investigate moderators singularly using dichotomized subgroups (Combs et al. 2019 ; Gonzalez-Mulé and Aguinis 2018 ). These possibilities allow a more fine-grained analysis of research questions that are related to moderating effects. However, Schmidt ( 2017 ) critically notes that the number of effect sizes in the meta-analytical sample must be sufficiently large to produce reliable results when investigating multiple moderators simultaneously in a meta-regression. For further reading, Tipton et al. ( 2019 ) outline the technical, conceptual, and practical developments of meta-regression over the last decades. Gonzalez-Mulé and Aguinis ( 2018 ) provide an overview of methodological choices and develop evidence-based best practices for future meta-analyses in management using meta-regression.
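The core idea of meta-regression can be sketched as a weighted least-squares regression of coded effect sizes on a single continuous moderator. All numbers below are invented, and a real meta-regression would use dedicated software and report standard errors and heterogeneity statistics.

```python
# Hedged sketch: weighted least-squares meta-regression with one moderator.
# Effect sizes, moderator values, and weights are all invented.
effects = [0.10, 0.25, 0.40, 0.55]   # coded effect sizes (dependent variable)
moderator = [1.0, 2.0, 3.0, 4.0]     # e.g., a continuous study characteristic
weights = [50, 80, 60, 40]           # e.g., inverse sampling variances

sw = sum(weights)
xbar = sum(w * x for w, x in zip(weights, moderator)) / sw   # weighted means
ybar = sum(w * y for w, y in zip(weights, effects)) / sw

num = sum(w * (x - xbar) * (y - ybar)
          for w, x, y in zip(weights, moderator, effects))
den = sum(w * (x - xbar) ** 2 for w, x in zip(weights, moderator))

slope = num / den                    # estimated moderator effect
intercept = ybar - slope * xbar
print(round(slope, 3), round(intercept, 3))
```

In this toy dataset the moderator explains the effect sizes exactly (slope 0.15, intercept -0.05); in real meta-analytic data, residual heterogeneity would remain and must be modeled.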

2.4.3 Meta-analytic structural equation modeling (MASEM)

MASEM is a combination of meta-analysis and structural equation modeling and allows researchers to simultaneously investigate the relationships among several constructs in a path model. Researchers can use MASEM to test several competing theoretical models against each other or to identify mediation mechanisms in a chain of relationships (Bergh et al. 2016 ). This method is typically performed in two steps (Cheung and Chan 2005 ): In Step 1, a pooled correlation matrix is derived, which includes the meta-analytical mean effect sizes for all variable combinations; Step 2 then uses this matrix to fit the path model. While MASEM was based primarily on traditional univariate meta-analysis to derive the pooled correlation matrix in its early years (Viswesvaran and Ones 1995 ), more advanced methods, such as the GLS approach (Becker 1992 , 1995 ) or the TSSEM approach (Cheung and Chan 2005 ), have been subsequently developed. Cheung ( 2015a ) and Jak ( 2015 ) provide an overview of these approaches in their books with exemplary code. For datasets with more complex data structures, Wilson et al. ( 2016 ) also developed a multilevel approach that is related to the TSSEM approach in the second step. Bergh et al. ( 2016 ) discuss nine decision points and develop best practices for MASEM studies.

2.4.4 Qualitative meta-analysis

While the approaches explained above focus on quantitative outcomes of empirical studies, qualitative meta-analysis aims to synthesize qualitative findings from case studies (Hoon 2013 ; Rauch et al. 2014 ). The distinctive feature of qualitative case studies is their potential to provide in-depth information about specific contextual factors or to shed light on reasons for certain phenomena that cannot usually be investigated by quantitative studies (Rauch 2020 ; Rauch et al. 2014 ). In a qualitative meta-analysis, the identified case studies are systematically coded in a meta-synthesis protocol, which is then used to identify influential variables or patterns and to derive a meta-causal network (Hoon 2013 ). Thus, the insights of contextualized and typically nongeneralizable single studies are aggregated to a larger, more generalizable picture (Habersang et al. 2019 ). Although still the exception, this method can thus provide important contributions for academics in terms of theory development (Combs et al., 2019 ; Hoon 2013 ) and for practitioners in terms of evidence-based management or entrepreneurship (Rauch et al. 2014 ). Levitt ( 2018 ) provides a guide and discusses conceptual issues for conducting qualitative meta-analysis in psychology, which is also useful for management researchers.

2.5 Step 5: choice of software

Software solutions to perform meta-analyses range from built-in functions or additional packages of statistical software to software purely focused on meta-analyses and from commercial to open-source solutions. However, in addition to personal preferences, the choice of the most suitable software depends on the complexity of the methods used and the dataset itself (Cheung and Vijayakumar 2016 ). Meta-analysts therefore must carefully check if their preferred software is capable of performing the intended analysis.

Among commercial software providers, Stata (from version 16 on) offers built-in functions to perform various meta-analytical analyses or to produce various plots (Palmer and Sterne 2016 ). For SPSS and SAS, there exist several macros for meta-analyses provided by scholars, such as David B. Wilson or Andy P. Field and Raphael Gillet (Field and Gillett 2010 ). For researchers using the open-source software R (R Core Team 2021 ), Polanin et al. ( 2017 ) provide an overview of 63 meta-analysis packages and their functionalities. For new users, they recommend the package metafor (Viechtbauer 2010 ), which includes most necessary functions and for which the author Wolfgang Viechtbauer provides tutorials on his project website. In addition to packages and macros for statistical software, templates for Microsoft Excel have also been developed to conduct simple meta-analyses, such as Meta-Essentials by Suurmond et al. ( 2017 ). Finally, programs purely dedicated to meta-analysis also exist, such as Comprehensive Meta-Analysis (Borenstein et al. 2013 ) or RevMan by The Cochrane Collaboration ( 2020 ).

2.6 Step 6: coding of effect sizes

2.6.1 Coding sheet

The first step in the coding process is the design of the coding sheet. A universal template does not exist because the design of the coding sheet depends on the methods used, the respective software, and the complexity of the research design. For univariate meta-analysis or meta-regression, data are typically coded in wide format. In its simplest form, when investigating a correlational relationship between two variables using the univariate approach, the coding sheet would contain a column for the study name or identifier, the effect size coded from the primary study, and the study sample size. However, such simple relationships are unlikely in management research because the included studies are typically not identical but differ in several respects. With more complex data structures or moderator variables being investigated, additional columns are added to the coding sheet to reflect the data characteristics. These variables can be coded as dummy, factor, or (semi)continuous variables and later used to perform a subgroup analysis or meta regression. For MASEM, the required data input format can deviate depending on the method used (e.g., TSSEM requires a list of correlation matrices as data input). For qualitative meta-analysis, the coding scheme typically summarizes the key qualitative findings and important contextual and conceptual information (see Hoon ( 2013 ) for a coding scheme for qualitative meta-analysis). Figure  1 shows an exemplary coding scheme for a quantitative meta-analysis on the correlational relationship between top-management team diversity and profitability. In addition to effect and sample sizes, information about the study country, firm type, and variable operationalizations are coded. The list could be extended by further study and sample characteristics.

Figure 1: Exemplary coding sheet for a meta-analysis on the relationship (correlation) between top-management team diversity and profitability

2.6.2 Inclusion of moderator or control variables

It is generally important to consider the intended research model and relevant nontarget variables before coding a meta-analytic dataset. For example, study characteristics can be important moderators or function as control variables in a meta-regression model. Similarly, control variables may be relevant in a MASEM approach to reduce confounding bias. Coding additional variables or constructs after the fact can be arduous if the sample of primary studies is large. However, the decision to include respective moderator or control variables, as in any empirical analysis, should always be based on strong (theoretical) rationales about how these variables can impact the investigated effect (Bernerth and Aguinis 2016 ; Bernerth et al. 2018 ; Thompson and Higgins 2002 ). While substantive moderators refer to theoretical constructs that act as buffers or enhancers of a supposed causal process, methodological moderators are features of the respective research designs that denote the methodological context of the observations and are important for controlling systematic statistical particularities (Rudolph et al. 2020 ). Havránek et al. ( 2020 ) provide a list of recommended variables to code as potential moderators. While researchers may have clear expectations about the effects of some of these moderators, expectations for other moderators may be more tentative, and their analysis may be approached in a rather exploratory fashion. Thus, we argue that researchers should make full use of the meta-analytical design to obtain insights about potential context dependence that a primary study cannot achieve.

2.6.3 Treatment of multiple effect sizes in a study

A long-debated issue in conducting meta-analyses is whether to use only one or all available effect sizes for the same construct within a single primary study. For meta-analyses in management research, this question is fundamental because many empirical studies, particularly those relying on company databases, use multiple variables for the same construct to perform sensitivity analyses, resulting in multiple relevant effect sizes. In this case, researchers can either (randomly) select a single value, calculate a study average, or use the complete set of effect sizes (Bijmolt and Pieters 2001 ; López-López et al. 2018 ). Multiple effect sizes from the same study enrich the meta-analytic dataset and allow researchers to investigate sources of heterogeneity in the relationship of interest, such as different variable operationalizations (López-López et al. 2018 ; Moeyaert et al. 2017 ). However, including more than one effect size from the same study violates the independence assumption for observations (Cheung 2019 ; López-López et al. 2018 ), which can lead to biased results and erroneous conclusions (Gooty et al. 2021 ). We follow the recommendation of current best-practice guides to take advantage of all available effect size observations but to carefully account for their interdependencies using appropriate methods such as multilevel models, panel regression models, or robust variance estimation (Cheung 2019 ; Geyer-Klingeberg et al. 2020 ; Gooty et al. 2021 ; López-López et al. 2018 ; Moeyaert et al. 2017 ).
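
As an illustration of the simplest of these options, the sketch below (with hypothetical data) collapses multiple effect sizes per study into one study average; note that multilevel models or robust variance estimation would instead keep all rows and model the dependency explicitly:

```python
from collections import defaultdict

# Hypothetical long-format dataset: (study_id, effect_size, sampling_variance).
rows = [
    ("Study A", 0.25, 0.02),
    ("Study A", 0.31, 0.02),  # alternative operationalization, same construct
    ("Study B", 0.10, 0.01),
    ("Study C", 0.40, 0.03),
    ("Study C", 0.36, 0.03),
    ("Study C", 0.44, 0.03),
]

# Group effect sizes by study.
by_study = defaultdict(list)
for study, es, var in rows:
    by_study[study].append((es, var))

# One averaged effect size per study. The variance used here is the simple
# mean of the sampling variances, which is conservative (it effectively
# assumes the within-study effects are fully correlated); composite
# formulas exist for intermediate correlations.
averaged = {}
for study, pairs in by_study.items():
    es_mean = sum(e for e, _ in pairs) / len(pairs)
    var_mean = sum(v for _, v in pairs) / len(pairs)
    averaged[study] = (es_mean, var_mean)
```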

2.7 Step 7: analysis

2.7.1 Outlier analysis and tests for publication bias

Before conducting the primary analysis, some preliminary sensitivity analyses might be necessary to ensure the robustness of the meta-analytical findings (Rudolph et al. 2020 ). First, influential outlier observations could bias the observed results, particularly if the total number of effect sizes is small. Several statistical methods can be used to identify outliers in meta-analytical datasets (Aguinis et al. 2013 ; Viechtbauer and Cheung 2010 ). However, there is a debate about whether to keep or omit these observations. In any case, the relevant studies should be closely inspected to find an explanation for their deviating results. As in any primary study, outliers can be a valid representation, albeit of a different population, measure, construct, design or procedure. Thus, inspecting outliers can provide the basis for inferring potential moderators (Aguinis et al. 2013 ; Steel et al. 2021 ). On the other hand, outliers can indicate invalid research, for instance, when unrealistically strong correlations are due to construct overlap (i.e., lack of a clear demarcation between independent and dependent variables), invalid measures, or simply typing errors when coding effect sizes. An advisable step is therefore to compare the results with and without outliers and to base the decision on whether to exclude outlier observations on careful consideration (Geyskens et al. 2009 ; Grewal et al. 2018 ; Kepes et al. 2013 ). Moreover, instead of focusing only on the size of an outlying effect, its leverage should also be considered; Viechtbauer and Cheung ( 2010 ) propose examining a combination of a study's standardized deviation and its leverage.
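
One common diagnostic of this kind is the leave-one-out (externally standardized) residual, in the spirit of Viechtbauer and Cheung (2010). A minimal fixed-effect sketch with hypothetical data:

```python
import math

# Hypothetical effect sizes and sampling variances; the fourth study
# deviates strongly from the rest.
effects   = [0.12, 0.18, 0.15, 0.95, 0.10]
variances = [0.02, 0.03, 0.02, 0.03, 0.01]

def pooled(ys, vs):
    """Inverse-variance weighted (fixed-effect) mean and its variance."""
    w = [1.0 / v for v in vs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w), 1.0 / sum(w)

# Leave-one-out residuals: compare each study against the pooled estimate
# computed from all *other* studies, standardized by the combined uncertainty.
residuals = []
for i, (y, v) in enumerate(zip(effects, variances)):
    mu, var_mu = pooled(effects[:i] + effects[i + 1:],
                        variances[:i] + variances[i + 1:])
    residuals.append((y - mu) / math.sqrt(v + var_mu))

# Flag studies whose residual exceeds the conventional |z| > 1.96 cutoff.
outliers = [i for i, z in enumerate(residuals) if abs(z) > 1.96]
```

Here only the fourth study (index 3) is flagged; whether to retain, exclude, or model it as a moderator then remains a substantive judgment, as discussed above.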

Second, as mentioned in the context of the literature search, potential publication bias may be an issue. Publication bias can be examined in multiple ways (Rothstein et al. 2005 ). First, the funnel plot is a simple graphical tool that provides an overview of the effect size distribution and helps to detect publication bias (Stanley and Doucouliagos 2010 ). A funnel plot can also help identify potential outliers. As mentioned above, a graphical display of deviation (e.g., studentized residuals) and leverage (Cook’s distance) can help detect the presence of outliers and evaluate their influence (Viechtbauer and Cheung 2010 ). Moreover, several statistical procedures can be used to test for publication bias (Harrison et al. 2017 ; Kepes et al. 2012 ), including subgroup comparisons between published and unpublished studies, Begg and Mazumdar’s ( 1994 ) rank correlation test, cumulative meta-analysis (Borenstein et al. 2009 ), the trim-and-fill method (Duval and Tweedie 2000a , b ), Egger et al.’s ( 1997 ) regression test, failsafe N (Rosenthal 1979 ), and selection models (Hedges and Vevea 2005 ; Vevea and Woods 2005 ). In examining potential publication bias, Kepes et al. ( 2012 ) and Harrison et al. ( 2017 ) both recommend not relying on a single test but using multiple conceptually different test procedures (the so-called “triangulation approach”).
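
Egger et al.'s (1997) regression test, for example, regresses each study's standard normal deviate (effect size divided by its standard error) on its precision; a non-zero intercept indicates funnel-plot asymmetry. A self-contained sketch with hypothetical data in which smaller studies report larger effects:

```python
import math

# Hypothetical effect sizes and standard errors; smaller studies (larger
# SEs) report larger effects, the typical funnel-asymmetry pattern.
effects = [0.42, 0.35, 0.28, 0.15, 0.11, 0.08]
ses     = [0.20, 0.17, 0.14, 0.09, 0.07, 0.05]

# Egger's regression: standard normal deviate (effect / SE) on precision
# (1 / SE). With no asymmetry, the intercept should be close to zero.
y = [e / s for e, s in zip(effects, ses)]
x = [1.0 / s for s in ses]
n = len(y)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
intercept = my - slope * mx

# Standard error of the intercept for a t-test with n - 2 df.
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r ** 2 for r in resid) / (n - 2)
se_intercept = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
t_stat = intercept / se_intercept
```

With these hypothetical data the intercept comes out clearly positive, which would be read as suggestive of small-study effects and examined further with the complementary tests listed above.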

2.7.2 Model choice

After controlling and correcting for the potential presence of impactful outliers or publication bias, the next step is the primary analysis, in which meta-analysts must decide between two types of models that rest on different assumptions: fixed-effects and random-effects models (Borenstein et al. 2010 ). Fixed-effects models assume that all observations share a common mean effect size, so that differences are due only to sampling error, while random-effects models assume heterogeneity and allow the true effect sizes to vary across studies (Borenstein et al. 2010 ; Cheung and Vijayakumar 2016 ; Hunter and Schmidt 2004 ). Both models are explained in detail in standard textbooks (e.g., Borenstein et al. 2009 ; Hunter and Schmidt 2004 ; Lipsey and Wilson 2001 ).

In general, the presence of heterogeneity is likely in management meta-analyses because most studies do not have identical empirical settings, which can yield different effect size strengths or directions for the same investigated phenomenon. For example, the identified studies have been conducted in different countries with different institutional settings, or the type of study participants varies (e.g., students vs. employees, blue-collar vs. white-collar workers, or manufacturing vs. service firms). Thus, the vast majority of meta-analyses in management research and related fields use random-effects models (Aguinis et al. 2011a ). In a meta-regression, the random-effects model turns into a so-called mixed-effects model because moderator variables are added as fixed effects to explain the impact of observed study characteristics on effect size variations (Raudenbush 2009 ).
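
The mechanics of the two models can be sketched in a few lines. The example below (hypothetical correlations, treated as Fisher z values with sampling variance approximately 1/(n − 3)) computes the fixed-effect inverse-variance estimate and a random-effects estimate with the DerSimonian-Laird estimator of the between-study variance τ²:

```python
import math

# Hypothetical Fisher z-transformed correlations and sample sizes.
studies = [(0.30, 50), (0.10, 120), (0.45, 35), (0.20, 200)]
effects = [z for z, n in studies]
variances = [1.0 / (n - 3) for _, n in studies]  # Fisher z variance

# Fixed-effect model: inverse-variance weighted mean.
w = [1.0 / v for v in variances]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# Cochran's Q and the DerSimonian-Laird estimate of the
# between-study variance tau^2 (floored at zero).
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
df = len(effects) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects model: weights additionally incorporate tau^2.
w_re = [1.0 / (v + tau2) for v in variances]
random_eff = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se_re = math.sqrt(1.0 / sum(w_re))
```

In practice these estimates come from a dedicated package such as metafor rather than hand-rolled code; the sketch only illustrates why the random-effects weights are flatter (larger τ² shifts weight from large to small studies).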

2.8 Step 8: reporting results

2.8.1 Reporting in the article

The final step in performing a meta-analysis is reporting its results. Most importantly, all steps and methodological decisions should be comprehensible to the reader. DeSimone et al. ( 2020 ) provide an extensive checklist for journal reviewers of meta-analytical studies. This checklist can also be used by authors when performing their analyses and reporting their results to ensure that all important aspects have been addressed. Alternative checklists are provided, for example, by Appelbaum et al. ( 2018 ) or Page et al. ( 2021 ). Similarly, Levitt et al. ( 2018 ) provide a detailed guide for qualitative meta-analysis reporting standards.

For quantitative meta-analyses, tables reporting results should include all important information and test statistics, including mean effect sizes; standard errors and confidence intervals; the number of observations and study samples included; and heterogeneity measures. If the meta-analytic sample is rather small, a forest plot provides a good overview of the different findings and their accuracy; however, such a figure becomes impractical for meta-analyses with several hundred effect sizes. The results displayed in tables and figures must also be explained verbally in the results and discussion sections. Most importantly, authors must answer the primary research question, i.e., whether there is a positive, negative, or no relationship between the variables of interest, or whether the examined intervention has a certain effect. These results should be interpreted with regard to their magnitude (or significance), both economically and statistically. When discussing meta-analytical results, authors must also convey their complexity, including the identified heterogeneity and important moderators, future research directions, and theoretical relevance (DeSimone et al. 2019 ). In particular, the discussion of identified heterogeneity and underlying moderator effects is critical; omitting this information can lead readers to interpret the reported mean effect size as universal for all included primary studies and to ignore the variability of findings when citing the meta-analytic results in their own research (Aytug et al. 2012 ; DeSimone et al. 2019 ).
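
Among the heterogeneity measures to report, Cochran's Q and the derived I² statistic (the share of total variability not attributable to sampling error) are standard. A minimal sketch with hypothetical effect sizes:

```python
# Hypothetical effect sizes with equal sampling variances.
effects   = [0.05, 0.40, 0.10, 0.45]
variances = [0.010, 0.010, 0.010, 0.010]

# Inverse-variance weights and the fixed-effect pooled estimate.
w = [1.0 / v for v in variances]
pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# Cochran's Q: weighted squared deviations from the pooled estimate.
q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
df = len(effects) - 1

# I^2 in percent, floored at zero when Q < df.
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
```

With these hypothetical values, I² is well above the conventional 50% threshold, which would call for the moderator or subgroup analyses discussed above rather than a single universal mean effect.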

2.8.2 Open-science practices

Another increasingly important topic is the public provision of meta-analytical datasets and statistical code via open-source repositories. Open-science practices allow for the validation of results and the re-use of coded data in subsequent meta-analyses (Polanin et al. 2020 ), contributing to the development of cumulative science. Steel et al. ( 2021 ) refer to open-science meta-analyses as a step towards “living systematic reviews” (Elliott et al. 2017 ) with continuous updates in real time. MRQ supports this development and encourages authors to make their datasets publicly available. Moreau and Gamble ( 2020 ), for example, provide various templates and video tutorials for conducting open-science meta-analyses. Several open-science repositories exist, such as the Open Science Framework (OSF; for a tutorial, see Soderberg 2018 ), for preregistering studies and making documents publicly available. Furthermore, several initiatives in the social sciences have been established to develop dynamic meta-analyses, such as metaBUS (Bosco et al. 2015 , 2017 ), MetaLab (Bergmann et al. 2018 ), or PsychOpen CAMA (Burgard et al. 2021 ).

3 Conclusion

This editorial provides a comprehensive overview of the essential steps in conducting and reporting a meta-analysis with references to more in-depth methodological articles. It also serves as a guide for meta-analyses submitted to MRQ and other management journals. MRQ welcomes all types of meta-analyses from all subfields and disciplines of management research.

Gusenbauer and Haddaway ( 2020 ), however, point out that Google Scholar is not appropriate as a primary search engine due to a lack of reproducibility of search results.

One effect size calculator by David B. Wilson is accessible via: https://www.campbellcollaboration.org/escalc/html/EffectSizeCalculator-Home.php .

The macros of David B. Wilson can be downloaded from: http://mason.gmu.edu/~dwilsonb/ .

The macros of Field and Gillett ( 2010 ) can be downloaded from: https://www.discoveringstatistics.com/repository/fieldgillett/how_to_do_a_meta_analysis.html .

The tutorials can be found via: https://www.metafor-project.org/doku.php .

Metafor does not currently provide functions to conduct MASEM. For MASEM, users can, for instance, use the package metaSEM (Cheung 2015b ).

The workbooks can be downloaded from: https://www.erim.eur.nl/research-support/meta-essentials/ .

Aguinis H, Dalton DR, Bosco FA, Pierce CA, Dalton CM (2011a) Meta-analytic choices and judgment calls: Implications for theory building and testing, obtained effect sizes, and scholarly impact. J Manag 37(1):5–38

Aguinis H, Gottfredson RK, Joo H (2013) Best-practice recommendations for defining, identifying, and handling outliers. Organ Res Methods 16(2):270–301

Aguinis H, Gottfredson RK, Wright TA (2011b) Best-practice recommendations for estimating interaction effects using meta-analysis. J Organ Behav 32(8):1033–1043

Aguinis H, Pierce CA, Bosco FA, Dalton DR, Dalton CM (2011c) Debunking myths and urban legends about meta-analysis. Organ Res Methods 14(2):306–331

Aloe AM (2015) Inaccuracy of regression results in replacing bivariate correlations. Res Synth Methods 6(1):21–27

Anderson RG, Kichkha A (2017) Replication, meta-analysis, and research synthesis in economics. Am Econ Rev 107(5):56–59

Appelbaum M, Cooper H, Kline RB, Mayo-Wilson E, Nezu AM, Rao SM (2018) Journal article reporting standards for quantitative research in psychology: the APA publications and communications BOARD task force report. Am Psychol 73(1):3–25

Aytug ZG, Rothstein HR, Zhou W, Kern MC (2012) Revealed or concealed? Transparency of procedures, decisions, and judgment calls in meta-analyses. Organ Res Methods 15(1):103–133

Begg CB, Mazumdar M (1994) Operating characteristics of a rank correlation test for publication bias. Biometrics 50(4):1088–1101. https://doi.org/10.2307/2533446

Bergh DD, Aguinis H, Heavey C, Ketchen DJ, Boyd BK, Su P, Lau CLL, Joo H (2016) Using meta-analytic structural equation modeling to advance strategic management research: Guidelines and an empirical illustration via the strategic leadership-performance relationship. Strateg Manag J 37(3):477–497

Becker BJ (1992) Using results from replicated studies to estimate linear models. J Educ Stat 17(4):341–362

Becker BJ (1995) Corrections to “Using results from replicated studies to estimate linear models.” J Edu Behav Stat 20(1):100–102

Bergmann C, Tsuji S, Piccinini PE, Lewis ML, Braginsky M, Frank MC, Cristia A (2018) Promoting replicability in developmental research through meta-analyses: Insights from language acquisition research. Child Dev 89(6):1996–2009

Bernerth JB, Aguinis H (2016) A critical review and best-practice recommendations for control variable usage. Pers Psychol 69(1):229–283

Bernerth JB, Cole MS, Taylor EC, Walker HJ (2018) Control variables in leadership research: A qualitative and quantitative review. J Manag 44(1):131–160

Bijmolt TH, Pieters RG (2001) Meta-analysis in marketing when studies contain multiple measurements. Mark Lett 12(2):157–169

Block J, Kuckertz A (2018) Seven principles of effective replication studies: Strengthening the evidence base of management research. Manag Rev Quart 68:355–359

Borenstein M (2009) Effect sizes for continuous data. In: Cooper H, Hedges LV, Valentine JC (eds) The handbook of research synthesis and meta-analysis. Russell Sage Foundation, pp 221–235

Borenstein M, Hedges LV, Higgins JPT, Rothstein HR (2009) Introduction to meta-analysis. John Wiley, Chichester

Borenstein M, Hedges LV, Higgins JPT, Rothstein HR (2010) A basic introduction to fixed-effect and random-effects models for meta-analysis. Res Synth Methods 1(2):97–111

Borenstein M, Hedges L, Higgins J, Rothstein H (2013) Comprehensive meta-analysis (version 3). Biostat, Englewood, NJ

Borenstein M, Higgins JP (2013) Meta-analysis and subgroups. Prev Sci 14(2):134–143

Bosco FA, Steel P, Oswald FL, Uggerslev K, Field JG (2015) Cloud-based meta-analysis to bridge science and practice: Welcome to metaBUS. Person Assess Decis 1(1):3–17

Bosco FA, Uggerslev KL, Steel P (2017) MetaBUS as a vehicle for facilitating meta-analysis. Hum Resour Manag Rev 27(1):237–254

Burgard T, Bošnjak M, Studtrucker R (2021) Community-augmented meta-analyses (CAMAs) in psychology: potentials and current systems. Zeitschrift Für Psychologie 229(1):15–23

Cheung MWL (2015a) Meta-analysis: A structural equation modeling approach. John Wiley & Sons, Chichester

Cheung MWL (2015b) metaSEM: An R package for meta-analysis using structural equation modeling. Front Psychol 5:1521

Cheung MWL (2019) A guide to conducting a meta-analysis with non-independent effect sizes. Neuropsychol Rev 29(4):387–396

Cheung MWL, Chan W (2005) Meta-analytic structural equation modeling: a two-stage approach. Psychol Methods 10(1):40–64

Cheung MWL, Vijayakumar R (2016) A guide to conducting a meta-analysis. Neuropsychol Rev 26(2):121–128

Combs JG, Crook TR, Rauch A (2019) Meta-analytic research in management: contemporary approaches unresolved controversies and rising standards. J Manag Stud 56(1):1–18. https://doi.org/10.1111/joms.12427

DeSimone JA, Köhler T, Schoen JL (2019) If it were only that easy: the use of meta-analytic research by organizational scholars. Organ Res Methods 22(4):867–891. https://doi.org/10.1177/1094428118756743

DeSimone JA, Brannick MT, O’Boyle EH, Ryu JW (2020) Recommendations for reviewing meta-analyses in organizational research. Organ Res Methods 56:455–463

Duval S, Tweedie R (2000a) Trim and fill: a simple funnel-plot–based method of testing and adjusting for publication bias in meta-analysis. Biometrics 56(2):455–463

Duval S, Tweedie R (2000b) A nonparametric “trim and fill” method of accounting for publication bias in meta-analysis. J Am Stat Assoc 95(449):89–98

Egger M, Smith GD, Schneider M, Minder C (1997) Bias in meta-analysis detected by a simple, graphical test. BMJ 315(7109):629–634

Eisend M (2017) Meta-Analysis in advertising research. J Advert 46(1):21–35

Elliott JH, Synnot A, Turner T, Simmons M, Akl EA, McDonald S, Salanti G, Meerpohl J, MacLehose H, Hilton J, Tovey D, Shemilt I, Thomas J (2017) Living systematic review: 1. Introduction—the why, what, when, and how. J Clin Epidemiol 91:23–30. https://doi.org/10.1016/j.jclinepi.2017.08.010

Field AP, Gillett R (2010) How to do a meta-analysis. Br J Math Stat Psychol 63(3):665–694

Fisch C, Block J (2018) Six tips for your (systematic) literature review in business and management research. Manag Rev Quart 68:103–106

Fortunato S, Bergstrom CT, Börner K, Evans JA, Helbing D, Milojević S, Petersen AM, Radicchi F, Sinatra R, Uzzi B, Vespignani A (2018) Science of science. Science 359(6379). https://doi.org/10.1126/science.aao0185

Geyer-Klingeberg J, Hang M, Rathgeber A (2020) Meta-analysis in finance research: Opportunities, challenges, and contemporary applications. Int Rev Finan Anal 71:101524

Geyskens I, Krishnan R, Steenkamp JBE, Cunha PV (2009) A review and evaluation of meta-analysis practices in management research. J Manag 35(2):393–419

Glass GV (2015) Meta-analysis at middle age: a personal history. Res Synth Methods 6(3):221–231

Gonzalez-Mulé E, Aguinis H (2018) Advancing theory by assessing boundary conditions with metaregression: a critical review and best-practice recommendations. J Manag 44(6):2246–2273

Gooty J, Banks GC, Loignon AC, Tonidandel S, Williams CE (2021) Meta-analyses as a multi-level model. Organ Res Methods 24(2):389–411. https://doi.org/10.1177/1094428119857471

Grewal D, Puccinelli N, Monroe KB (2018) Meta-analysis: integrating accumulated knowledge. J Acad Mark Sci 46(1):9–30

Gurevitch J, Koricheva J, Nakagawa S, Stewart G (2018) Meta-analysis and the science of research synthesis. Nature 555(7695):175–182

Gusenbauer M, Haddaway NR (2020) Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res Synth Methods 11(2):181–217

Habersang S, Küberling-Jost J, Reihlen M, Seckler C (2019) A process perspective on organizational failure: a qualitative meta-analysis. J Manage Stud 56(1):19–56

Harari MB, Parola HR, Hartwell CJ, Riegelman A (2020) Literature searches in systematic reviews and meta-analyses: A review, evaluation, and recommendations. J Vocat Behav 118:103377

Harrison JS, Banks GC, Pollack JM, O’Boyle EH, Short J (2017) Publication bias in strategic management research. J Manag 43(2):400–425

Havránek T, Stanley TD, Doucouliagos H, Bom P, Geyer-Klingeberg J, Iwasaki I, Reed WR, Rost K, Van Aert RCM (2020) Reporting guidelines for meta-analysis in economics. J Econ Surveys 34(3):469–475

Hedges LV, Olkin I (1985) Statistical methods for meta-analysis. Academic Press, Orlando

Hedges LV, Vevea JL (2005) Selection methods approaches. In: Rothstein HR, Sutton A, Borenstein M (eds) Publication bias in meta-analysis: prevention, assessment, and adjustments. Wiley, Chichester, pp 145–174

Hoon C (2013) Meta-synthesis of qualitative case studies: an approach to theory building. Organ Res Methods 16(4):522–556

Hunter JE, Schmidt FL (1990) Methods of meta-analysis: correcting error and bias in research findings. Sage, Newbury Park

Hunter JE, Schmidt FL (2004) Methods of meta-analysis: correcting error and bias in research findings, 2nd edn. Sage, Thousand Oaks

Hunter JE, Schmidt FL, Jackson GB (1982) Meta-analysis: cumulating research findings across studies. Sage Publications, Beverly Hills

Jak S (2015) Meta-analytic structural equation modelling. Springer, New York, NY

Kepes S, Banks GC, McDaniel M, Whetzel DL (2012) Publication bias in the organizational sciences. Organ Res Methods 15(4):624–662

Kepes S, McDaniel MA, Brannick MT, Banks GC (2013) Meta-analytic reviews in the organizational sciences: Two meta-analytic schools on the way to MARS (the Meta-Analytic Reporting Standards). J Bus Psychol 28(2):123–143

Kraus S, Breier M, Dasí-Rodríguez S (2020) The art of crafting a systematic literature review in entrepreneurship research. Int Entrepreneur Manag J 16(3):1023–1042

Levitt HM (2018) How to conduct a qualitative meta-analysis: tailoring methods to enhance methodological integrity. Psychother Res 28(3):367–378

Levitt HM, Bamberg M, Creswell JW, Frost DM, Josselson R, Suárez-Orozco C (2018) Journal article reporting standards for qualitative primary, qualitative meta-analytic, and mixed methods research in psychology: the APA publications and communications board task force report. Am Psychol 73(1):26

Lipsey MW, Wilson DB (2001) Practical meta-analysis. Sage Publications, Inc.

López-López JA, Page MJ, Lipsey MW, Higgins JP (2018) Dealing with effect size multiplicity in systematic reviews and meta-analyses. Res Synth Methods 9(3):336–351

Martín-Martín A, Thelwall M, Orduna-Malea E, López-Cózar ED (2021) Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations. Scientometrics 126(1):871–906

Merton RK (1968) The Matthew effect in science: the reward and communication systems of science are considered. Science 159(3810):56–63

Moeyaert M, Ugille M, Natasha Beretvas S, Ferron J, Bunuan R, Van den Noortgate W (2017) Methods for dealing with multiple outcomes in meta-analysis: a comparison between averaging effect sizes, robust variance estimation and multilevel meta-analysis. Int J Soc Res Methodol 20(6):559–572

Moher D, Liberati A, Tetzlaff J, Altman DG, Prisma Group (2009) Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS medicine. 6(7):e1000097

Mongeon P, Paul-Hus A (2016) The journal coverage of Web of Science and Scopus: a comparative analysis. Scientometrics 106(1):213–228

Moreau D, Gamble B (2020) Conducting a meta-analysis in the age of open science: Tools, tips, and practical recommendations. Psychol Methods. https://doi.org/10.1037/met0000351

O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S (2015) Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev 4(1):1–22

Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A (2016) Rayyan—a web and mobile app for systematic reviews. Syst Rev 5(1):1–10

Owen E, Li Q (2021) The conditional nature of publication bias: a meta-regression analysis. Polit Sci Res Methods 9(4):867–877

Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372. https://doi.org/10.1136/bmj.n71

Palmer TM, Sterne JAC (eds) (2016) Meta-analysis in stata: an updated collection from the stata journal, 2nd edn. Stata Press, College Station, TX

Pigott TD, Polanin JR (2020) Methodological guidance paper: High-quality meta-analysis in a systematic review. Rev Educ Res 90(1):24–46

Polanin JR, Tanner-Smith EE, Hennessy EA (2016) Estimating the difference between published and unpublished effect sizes: a meta-review. Rev Educ Res 86(1):207–236

Polanin JR, Hennessy EA, Tanner-Smith EE (2017) A review of meta-analysis packages in R. J Edu Behav Stat 42(2):206–242

Polanin JR, Hennessy EA, Tsuji S (2020) Transparency and reproducibility of meta-analyses in psychology: a meta-review. Perspect Psychol Sci 15(4):1026–1041. https://doi.org/10.1177/17456916209064

R Core Team (2021) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Rauch A (2020) Opportunities and threats in reviewing entrepreneurship theory and practice. Entrep Theory Pract 44(5):847–860

Rauch A, van Doorn R, Hulsink W (2014) A qualitative approach to evidence–based entrepreneurship: theoretical considerations and an example involving business clusters. Entrep Theory Pract 38(2):333–368

Raudenbush SW (2009) Analyzing effect sizes: Random-effects models. In: Cooper H, Hedges LV, Valentine JC (eds) The handbook of research synthesis and meta-analysis, 2nd edn. Russell Sage Foundation, New York, NY, pp 295–315

Rosenthal R (1979) The file drawer problem and tolerance for null results. Psychol Bull 86(3):638

Rothstein HR, Sutton AJ, Borenstein M (2005) Publication bias in meta-analysis: prevention, assessment and adjustments. Wiley, Chichester

Roth PL, Le H, Oh I-S, Van Iddekinge CH, Bobko P (2018) Using beta coefficients to impute missing correlations in meta-analysis research: Reasons for caution. J Appl Psychol 103(6):644–658. https://doi.org/10.1037/apl0000293

Rudolph CW, Chang CK, Rauvola RS, Zacher H (2020) Meta-analysis in vocational behavior: a systematic review and recommendations for best practices. J Vocat Behav 118:103397

Schmidt FL (2017) Statistical and measurement pitfalls in the use of meta-regression in meta-analysis. Career Dev Int 22(5):469–476

Schmidt FL, Hunter JE (2015) Methods of meta-analysis: correcting error and bias in research findings. Sage, Thousand Oaks

Schwab A (2015) Why all researchers should report effect sizes and their confidence intervals: Paving the way for meta–analysis and evidence–based management practices. Entrepreneurship Theory Pract 39(4):719–725. https://doi.org/10.1111/etap.12158

Shaw JD, Ertug G (2017) The suitability of simulations and meta-analyses for submissions to Academy of Management Journal. Acad Manag J 60(6):2045–2049

Soderberg CK (2018) Using OSF to share data: A step-by-step guide. Adv Methods Pract Psychol Sci 1(1):115–120

Stanley TD, Doucouliagos H (2010) Picture this: a simple graph that reveals much ado about research. J Econ Surveys 24(1):170–191

Stanley TD, Doucouliagos H (2012) Meta-regression analysis in economics and business. Routledge, London

Stanley TD, Jarrell SB (1989) Meta-regression analysis: a quantitative method of literature surveys. J Econ Surveys 3:54–67

Steel P, Beugelsdijk S, Aguinis H (2021) The anatomy of an award-winning meta-analysis: Recommendations for authors, reviewers, and readers of meta-analytic reviews. J Int Bus Stud 52(1):23–44

Suurmond R, van Rhee H, Hak T (2017) Introduction, comparison, and validation of Meta-Essentials: a free and simple tool for meta-analysis. Res Synth Methods 8(4):537–553

The Cochrane Collaboration (2020). Review Manager (RevMan) [Computer program] (Version 5.4).

Thomas J, Noel-Storr A, Marshall I, Wallace B, McDonald S, Mavergames C, Glasziou P, Shemilt I, Synnot A, Turner T, Elliot J (2017) Living systematic reviews: 2. Combining human and machine effort. J Clin Epidemiol 91:31–37

Thompson SG, Higgins JP (2002) How should meta-regression analyses be undertaken and interpreted? Stat Med 21(11):1559–1573

Tipton E, Pustejovsky JE, Ahmadi H (2019) A history of meta-regression: technical, conceptual, and practical developments between 1974 and 2018. Res Synth Methods 10(2):161–179

Vevea JL, Woods CM (2005) Publication bias in research synthesis: Sensitivity analysis using a priori weight functions. Psychol Methods 10(4):428–443

Viechtbauer W (2010) Conducting meta-analyses in R with the metafor package. J Stat Softw 36(3):1–48

Viechtbauer W, Cheung MWL (2010) Outlier and influence diagnostics for meta-analysis. Res Synth Methods 1(2):112–125

Viswesvaran C, Ones DS (1995) Theory testing: combining psychometric meta-analysis and structural equations modeling. Pers Psychol 48(4):865–885

Wilson SJ, Polanin JR, Lipsey MW (2016) Fitting meta-analytic structural equation models with complex datasets. Res Synth Methods 7(2):121–139. https://doi.org/10.1002/jrsm.1199

Wood JA (2008) Methodology for dealing with duplicate study effects in a meta-analysis. Organ Res Methods 11(1):79–95

Open Access funding enabled and organized by Projekt DEAL. No funding was received to assist with the preparation of this manuscript.

Author information

Authors and affiliations

University of Luxembourg, Luxembourg, Luxembourg

Christopher Hansen

Leibniz Institute for Psychology (ZPID), Trier, Germany

Holger Steinmetz

Trier University, Trier, Germany

Erasmus University Rotterdam, Rotterdam, The Netherlands

Wittener Institut Für Familienunternehmen, Universität Witten/Herdecke, Witten, Germany

Corresponding author

Correspondence to Jörn Block .

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

See Table 1 .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Hansen, C., Steinmetz, H. & Block, J. How to conduct a meta-analysis in eight steps: a practical guide. Manag Rev Q 72 , 1–19 (2022). https://doi.org/10.1007/s11301-021-00247-4

Download citation

Published : 30 November 2021

Issue Date : February 2022

DOI : https://doi.org/10.1007/s11301-021-00247-4




  • Open access
  • Published: 24 May 2024

Research collaboration data platform ensuring general data protection

  • Monica Toma 1   na1 ,
  • Caroline Bönisch 2 , 3   na1 ,
  • Benjamin Löhnhardt 4 ,
  • Michael Kelm 1 ,
  • Hanibal Bohnenberger 6 ,
  • Sven Winkelmann 1 , 5 ,
  • Philipp Ströbel 6 &
  • Tibor Kesztyüs 2  

Scientific Reports volume  14 , Article number:  11887 ( 2024 ) Cite this article


  • Medical research
  • Translational research

Translational data is of paramount importance for medical research and clinical innovation. It has the potential to benefit individuals and organizations, however, the protection of personal data must be guaranteed. Collecting diverse omics data and electronic health records (EHR), re-using the minimized data, as well as providing a reliable data transfer between different institutions are mandatory steps for the development of the promising field of big data and artificial intelligence in medical research. This is made possible within the proposed data platform in this research project. The established data platform enables the collaboration between public and commercial organizations by data transfer from various clinical systems into a cloud for supporting multi-site research while ensuring compliant data governance.


Introduction

Translational data is of paramount importance for medical research and clinical innovation. The combination of different omics data (e.g., genomics, radiomics, proteomics) and clinical health data with big data analytics and artificial intelligence (AI) has the potential to transform healthcare into a proactive P4 medicine that is predictive, preventive, personalized, and participatory 1 . Based on this potential, medical research builds on data that should be easily findable, accessible, interoperable, and re-usable (FAIR) for (secondary) use 2 . Unfortunately, clinical health data is usually stored in so-called data or information silos. These silos hold disparate data sets and tend to restrict access to and reuse of the data 3 , 4 , 5 . However, to enable AI, an extraordinarily large amount of data is needed to train the underlying neural networks 6 . For this purpose, data must be collected and curated appropriately, besides being stored professionally, so that it is reliable for further exploitation 7 . In medical areas like pathology or radiology, where diagnostics rely on medical imaging managed by the Digital Imaging and Communications in Medicine (DICOM) standard, a large amount of data can be collected during examination and treatment, and machine learning is already well established 8 , 9 , 10 .

The Medical Informatics Initiative (MII) 11 , funded by the German Ministry of Education and Research (BMBF), is a joint collaboration project connecting various German university hospitals, research institutions, and businesses, overcoming enclosed clinical health data silos and exchanging medical information. To create interoperable research frameworks, different consortia have been formed within the MII. Every consortium established a Medical Data Integration Center (MeDIC) at German university hospitals to bridge and merge data from different clinical source systems. This pooling of data is mostly done within a (research) data platform. A data platform is a set of technologies that enables the acquisition, storage, curation, and governance of data while ensuring security for its users and applications. With the integration of multi-omics data and electronic health records (EHR), important information can enrich a health data platform 12 .

Additionally, most of the clinical health data within university hospitals contain personal data and are therefore subject to specific privacy protection laws and regulations. Within the European Union (EU), the data protection directive from 1995 was replaced by the General Data Protection Regulation (GDPR) ( https://gdpr.eu/tag/gdpr/ ) in 2016. The GDPR applies to organizations everywhere if they deal with data related to people in the EU. In contrast to the previous legislation, the country-specific protection laws were harmonized within the GDPR. The Regulation now contains detailed requirements for commercial and public organizations when collecting, storing, and managing personal data.

The GDPR sets the scene for lawful, fair, and transparent usage of data in Article 5, which defines the principles related to the processing of personal data. Data should only be collected for specified purposes (purpose limitation) and should be minimized accordingly (data minimization). Personal data should rely on accuracy, integrity, and confidentiality, while storage limitations should be carefully considered. The GDPR also specifies the principle of accountability, according to which the controller is responsible for processing personal data in compliance with the principles of the GDPR and must be able to demonstrate this compliance, as stated in Article 5.2 of the GDPR. Especially when processing is based on the data subject's consent, the controller should be able to demonstrate this consent as defined in Article 7.1 of the GDPR. Furthermore, the purpose-based consent should be not only informed and specific, but unambiguous as well, so that the data subject's wishes are reflected in the processing of personal data.

Each of the principles under GDPR Article 5 applies to all data processing, including processing for research purposes. Scientific research is seen as an important area of public interest; hence, derogations from the general rules are provided in Article 89 of the GDPR. Given a valid legal basis and subject to the principle of proportionality and appropriate safeguards, secondary use of research data is possible: “those measures may include pseudonymization provided that those purposes can be fulfilled in that manner. Where those purposes can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, those purposes shall be fulfilled in that manner” 13 . Nevertheless, secondary use of data in research projects remains a gray area 14 , 15 , 16 , especially when the data is to be used by third parties such as commercial partners.

According to the GDPR, data minimization refers to the requirement that personal data be “adequate, relevant, and limited to what is necessary in relation to the purposes for which they are processed” 17 . This means that data beyond this scope should not be included in the data collection and analysis process. Because of the data minimization requirements, it is essential for data platforms to retain the scope for every data set, making it possible to refer to the scope at every step of data processing. Data minimization is especially essential for data platforms because of the potential threat of violating users' privacy by combining data independently of a dedicated scope or use case 18 .
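As a concrete illustration of the data minimization principle just described, the sketch below whitelists fields per declared purpose, so that items outside the scope never enter the research pipeline. The field names, purpose names, and whitelist structure are hypothetical assumptions, not the platform's actual schema.

```python
# Illustrative sketch of purpose-bound data minimization; all names are hypothetical.
ALLOWED_FIELDS = {
    "tumor_classification_study": {"specimen_id", "diagnosis", "tumor_stage"},
}

def minimize(record: dict, purpose: str) -> dict:
    """Keep only the fields permitted for the stated purpose (purpose limitation)."""
    allowed = ALLOWED_FIELDS.get(purpose)
    if allowed is None:
        raise ValueError(f"No field whitelist registered for purpose {purpose!r}")
    return {k: v for k, v in record.items() if k in allowed}

record = {
    "patient_name": "Jane Doe",       # personal data: must not leave the patient network
    "specimen_id": "S-0042",
    "diagnosis": "colon adenocarcinoma",
    "tumor_stage": "T3N1M0",
}
minimized = minimize(record, "tumor_classification_study")
# 'patient_name' is dropped; only purpose-relevant items remain.
```

Keeping the purpose attached to the whitelist makes the scope referable at every later processing step, as the text requires.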

The challenge. Given the aforementioned GDPR requirements, processing data from the data platform for research purposes necessitates transferring data from clinical data storage into the cloud while considering data privacy issues and legal constraints in a collaborative research environment. A crucial aspect of this work is providing a reliable data transfer between different institutions while ensuring compliance with data privacy regulations such as the GDPR. Since the data is minimized locally by the university medical site, purpose limitation is ensured.

Literature review

To contextualize the work presented, the authors conducted a comprehensive literature review. Literature databases such as PubMed and Embase were searched for publications on the topic of the article, restricted to English and German and to the period from January 2016 to January 2024. The search strategy combined database-specific index terms (e.g., Emtree - Embase Subject Headings) and free terms relevant to the aim of the study with Boolean operators, as shown in Table  1 . The literature search yielded 103 hits across PubMed and Embase. Of these, 67 matches remained after title and abstract screening. The full texts of these hits were then retrieved and examined, after which a further 18 results were excluded. The remaining matches form the foundation of the work presented and are addressed, inter alia, in the Section Related Work.

Related work

The work in this research project is closely related to (research) data platform approaches that exchange data through cloud services between different stakeholders, including both commercial and public organizations.

Froehlicher et al. 19 propose an encrypted, federated learning approach to overcome the hurdle of processing privacy-protected data within a centralized storage. In contrast, the rationale of the present article is to centralize data for research purposes while applying GDPR-compliant strategies to make data interoperable and accessible, rather than relying on technical workarounds that bypass the GDPR requirements.

A technical solution similar to the results presented in this article is shown by Bahmani et al. 20 . They present a platform for minimized data that is transferred from an app (e.g., wearables) to a cloud service. However, the approach of Bahmani et al. includes neither a legal framework for data privacy nor a data contract.

The commentary of Brody et al. introduces a “cloud-based Analysis Commons” 21 , a framework that combines genotype and phenotype data from whole-genome sequencing, provided via multiple studies, by including data-sharing mechanisms. Although this commentary presents a broadly similar approach to bridging the gap between data collection within multiple studies and data transfer to an interoperable analytical platform, it does not touch on implementing the framework in compliance with data privacy principles as required by the GDPR. The exchange of data is secured via a “consortium agreement rather than through the typical series of bilateral agreements” 21 to share data across institutions.

The research data portal for health (“Forschungsdatenportal für Gesundheit”) developed within the MII 22 was made available in September 2022. The portal is currently running in a pilot phase and allows researchers to apply centrally for health data and biological samples for scientific studies. The data to be queried is based on a core data set 23 that was developed within the MII. The approach proposed in this manuscript allows both clinical researchers associated with university hospitals and AI researchers associated with industrial partners to work together on the same dataset at the same time. Furthermore, the data available in the described research platform is already cleared by the ethics committee of the organization uploading the data, so that a new vote is not necessary, in contrast to the approach taken by the MII. Moreover, the exchange between the university hospital and the commercial partner is governed by a data contract with specific data governance measures, including rights and permissions. This data contract is registered in the cloud prior to any data transfer.

Continuing from the MII, the Network of University Medicine (NUM), established in 2020 24 , 25 , contributes through its coordinated efforts and platforms to better preparing German health research, and consequently the healthcare system as a whole, for future pandemics and other crises. NUM started as part of the crisis management against COVID-19, coordinating clinical COVID-19 research across all university hospital sites and fostering collaboration among researchers for practical, patient-centric outcomes and better management of public health crises. The sub-project Radiological Cooperative Network (RACOON) is the first of its kind to bring together all university departments of a medical discipline and establish a nationwide platform for collaborative analysis of radiological image data 26 , 27 , 28 . This platform supports clinical and clinical-epidemiological studies as well as the training of AI models. The project utilizes technology allowing structured data capture from the outset, ensuring data quality, traceability, and long-term usability. The collected data provide valuable insights for epidemiological studies, situational assessments, and early warning mechanisms. Within RACOON, the Joint Imaging Platform (JIP) established by the German Cancer Consortium (DKTK) incorporates federated data analysis technology: the imaging data remain at the site where they originated, and the analysis algorithms are shared within the platform. JIP provides a unified infrastructure across the radiology and nuclear medicine departments of 10 university hospitals in Germany. A core component is “SATORI”, a browser-based application for viewing, curating, and processing medical data. SATORI supports images, videos, and clinical data, with particular benefits for radiological image data.
While this project is very promising and brings great potential, it is designed for radiological data and images, whereas the project addressed in this research manuscript focuses on pathology data. Furthermore, the exchange with an industrial partner differs from the network partners listed for RACOON (university radiology centers and non-university research institutes).

Another infrastructure for the exchange of federated data is GAIA-X 29 . GAIA-X aims to exchange data in a trustworthy environment and give users control over their data. The GAIA-X infrastructure is based on a shared model with two components: data ecosystems and infrastructure ecosystems. Data is exchanged via a trust framework with a set of rules for participation in GAIA-X. This approach differs from the one described in this manuscript in the data contract between the partners involved and in the pseudonymization of the data during exchange.

The results of the literature search led to the conclusion that there are few comparable approaches of research data platforms that exchange medical data via a cloud, and no identical approaches could be identified. In particular, no approach exchanging data under a data contract within a legal framework regarding the GDPR could be found among the research results.

Clinical infrastructure and data minimization

To ensure the exchange of medical data in accordance with GDPR regulations, the network used in this research project is divided using network segmentation so that data with a higher protection class can be handled accordingly. The clinical systems (e.g., pathology systems) are located in the so-called patient network segment (PatLAN) of the research facility, which is kept separate from the research network segment (WissLAN). To keep the stored data to a minimum, a data minimization step is performed in the staging layer between the patient network segment and the research network segment: only data items required for further processing are transferred between the two networks. For data collection, it would have been useful and advisable to use the broad patient consent (as established within the MII) in such a research project; however, at the start of the research project presented here, in 2022, it had not yet been introduced at the UMG. The underlying patient consent is recorded manually on paper, afterwards entered digitally by a study nurse, and passed into the study data pool within the UMG-MeDIC. From there it is provided to the industrial partner as part of the shared data. It includes consent to data release and further processing within the study mentioned in Section “ Methods ”. After collecting the patient consents, personal data is replaced in a pseudonymization process: an independent trusted third party (TTP) takes over the task of replacing the personally identifiable information (PII) with a pseudonym (a unique generated key code). This pseudonymization can only be reversed by the TTP, which is established at the MeDIC. Mapping tables of personal data and assigned pseudonyms are known exclusively to the TTP, which can, if medically advised and if there is a corresponding consent for re-contact, carry out a de-pseudonymization.
The staff of the TTP office is released from the authority of the MeDIC executive board regarding the pseudonymization of personal data. The TTP staff is the only party that can perform de-pseudonymization, based on a documented medical reason.
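The TTP pseudonymization step described above can be sketched as follows. The class, the key-code format, and the policy check are illustrative assumptions, not the UMG implementation; the essential point is that the mapping table never leaves the TTP.

```python
# Hedged sketch of TTP pseudonymization: PII is replaced by a generated key
# code, and only the TTP retains the mapping. Names are illustrative.
import secrets

class TrustedThirdParty:
    def __init__(self):
        self._mapping = {}  # pseudonym -> PII, known only to the TTP

    def pseudonymize(self, pii: str) -> str:
        """Replace PII with a unique generated key code."""
        psn = "PSN-" + secrets.token_hex(8)
        self._mapping[psn] = pii
        return psn

    def depseudonymize(self, psn: str, medical_reason: str) -> str:
        """Reversal only with a documented medical reason, per the TTP policy."""
        if not medical_reason:
            raise PermissionError("De-pseudonymization requires a documented reason")
        return self._mapping[psn]

ttp = TrustedThirdParty()
psn = ttp.pseudonymize("Jane Doe, 1970-01-01")
# Downstream systems see only `psn`; the mapping table stays at the TTP.
```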

Cloud infrastructure

The current status described under Section “ Clinical infrastructure and data minimization ”, together with the approach of making the data available for analysis via a cloud infrastructure, made it necessary to consider cloud services that enable sharing big data in medical research. Efficient data management is more important than ever, helping businesses and hospitals gain analytical insights as well as use machine learning in the medical field (e.g., to predict molecular alterations of tumors 30 ). The first generation of big data management mainly consisted of data warehouses, which provided storage for structured data. Over time, as more unstructured data emerged and was stored within clinical infrastructures, a second generation of big data management platforms, called data lakes, was developed. They incorporate low-cost (cloud) storage (e.g., Amazon Simple Storage Service, Microsoft Azure Storage, Hadoop Distributed File System) while holding generic raw data.

figure 1

Three-layered cloud infrastructure with uniform data access. Figure based on 31 and 32 .

Because combining data from warehouses and data lakes is highly complex for data users, Zaharia et al. 32 propose an integrated architecture that combines a low-cost data lake (cf. Fig.  1 ) with direct file access and the performance features of a data warehouse and database management system (DBMS), such as atomicity, consistency, isolation, durability (ACID) transactions, data versioning, auditing, and indexing on the storage level. All these components can be combined with the three-layer clustering (cf. Fig.  1 ) usually used for data warehouses: a staging area (or bronze layer) for incoming data, a data warehouse (or silver layer) for curated data, and a data access layer (or gold layer, or data mart) for end users or business applications 31 , 33 , 34 . The cloud infrastructure used here is structured accordingly to leverage the benefits of this three-layer clustering.
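The three-layer clustering described above can be sketched minimally as follows: each file enters the staging (bronze) layer, is curated into the silver layer, and is published to the gold layer, with its data-contract reference carried in the metadata at every step. The catalog structure and function names are illustrative assumptions.

```python
# Illustrative bronze/silver/gold routing; not the platform's actual code.
catalog = {"bronze": [], "silver": [], "gold": []}

def ingest(file_name, contract_id):
    """Staging area (bronze): raw incoming file, tagged with its data-contract reference."""
    entry = {"file": file_name, "contract": contract_id}
    catalog["bronze"].append(entry)
    return entry

def curate(entry):
    """Silver layer: curated data; the contract reference travels with the file."""
    curated = dict(entry, curated=True)
    catalog["silver"].append(curated)
    return curated

def publish(entry):
    """Gold layer / data mart: access layer for end users and applications."""
    catalog["gold"].append(entry)
    return entry

raw = ingest("slide_001.tiff", "DC-2022-001")
gold = publish(curate(raw))
# The contract reference is still attached in the gold layer, so the privacy
# constraints of the file remain known in every zone.
```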

Ethics review

Ethical approval for the study was obtained from the Ethics Review Committee of the University Medical Center Göttingen (Ref No. 24/4/20, dated 30.04.2020), and all developments and experiments were performed in accordance with relevant guidelines and regulations. Furthermore, informed consent was obtained from all subjects and/or their legal guardian(s).

Establishing a data transfer from the clinical data storage of a MeDIC into a cloud requires connecting different source systems. A system overview of the approach is shown in Fig.  2 . Firstly, data retrieved from clinical systems (segment PatLAN, the patient network segment, which is separated from the internet) is processed and saved in the MeDIC (segment MeDIC, part of the research network segment WissLAN). Secondly, the data is transferred from the MeDIC to the cloud with a software component (called an edge-device) that ensures authentication and data encryption. The solution proposed in this research project is based on European Privacy Seal-certified cloud products ( https://euprivacyseal.com/de/eps-en-siemens-healthcare-teamplay/  [Online 2024/01/16].) to be privacy compliant with the GDPR. The approach was validated by the development, testing, and deployment of a novel AI tool to predict molecular alterations of tumors 30 based on the data transferred from one clinical institution.

Clinical infrastructure

figure 2

System overview to transfer data from clinical systems to the cloud providing access for commercial partners.

The data transfer from the PatLAN segment to the MeDIC (and vice versa) is only possible under certain conditions. The University Medical Center Göttingen (UMG) established the MeDIC as a data platform to integrate the data from the clinical source systems in patient care (e.g., pathology). Subsequently, the collected data is processed and made available in different formats to different infrastructures (e.g., cloud), depending on the use case.

As described in Fig.  2 , the data in the source systems contain both personal data (IDAT) and medical data (MDAT). The MeDIC receives the data from the clinical source systems in a particular network segment within PatLAN. In this step, the data is handled by an ETL (extract, transform, load) process for data minimization and transformation 35 , meaning that only the minimized data is stored in the MeDIC. This step replaces the personal data (IDAT) with a pseudonym (PSN) as a prerequisite for processing the data in the research network. Only by means of the trusted third party (as described in Section “ Clinical infrastructure and data minimization ”) can the pseudonyms be resolved back to an actual person, and this information is not transferred to the research network segment (WissLAN). In case of a consent withdrawal, the medical data involved is deleted from all storage locations at the MeDIC and the commercial partner. To ensure the revocation, an automatic process is initiated: it deletes the data within the MeDIC and triggers a deletion process at the commercial partner by sending the pseudonyms (PSN) to be deleted.
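The automatic revocation process just described can be sketched as follows: deletion at the MeDIC plus a deletion trigger that forwards only pseudonyms (never IDAT) to the commercial partner. The storage structures and the message channel are illustrative assumptions, not the production implementation.

```python
# Sketch of consent-withdrawal handling; all structures are illustrative.
medic_store = {
    "PSN-0001": {"diagnosis": "colon adenocarcinoma"},
    "PSN-0002": {"diagnosis": "lung adenocarcinoma"},
}
partner_deletion_requests = []  # stands in for the message channel to the partner

def withdraw_consent(psns):
    """Delete the data at the MeDIC and notify the partner by pseudonym only."""
    for psn in psns:
        medic_store.pop(psn, None)          # delete at the MeDIC
    partner_deletion_requests.append(sorted(psns))  # trigger deletion downstream

withdraw_consent(["PSN-0001"])
# "PSN-0001" is removed from the MeDIC store; the partner receives only the PSN.
```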

Data contract

After the data is processed within the clinical infrastructure of the MeDIC, the data received by the commercial partner is stored according to a so-called “data contract”, designed as a questionnaire that specifies data governance measures, including rights and permissions. For the provision of the data via the edge-device to the cloud infrastructure, the minimized data from the MeDIC is used. The contract is submitted and registered in the cloud prior to the data transfer. The “data contract” includes a data protection impact assessment (DPIA) by design to assess the re-identification risks that may arise from the content and context of the data aggregation. A data owner affiliated with the commercial partner is assigned to each specific data set. The data owner must ensure that the data is processed in compliance with the purpose stipulated in the legal obligations. The data contract triggers the correct distribution and storage of data in the respective regional data center. Moreover, only designated parties can process the data, and only to the extent necessary for the permitted purpose. Logs of all data activities are provided. The period of storage and usage is defined, including the obligations to cite the origin of the data or to disclose the results generated by the data usage. Furthermore, it is ensured that when a request to delete a specific data set is received (e.g., withdrawal of consent), this data can be tracked and removed completely in a timely manner.
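A hypothetical data-contract record reflecting the governance measures listed above (purpose, permitted parties, regional storage, storage period, and activity logging) might look like the following; all field names are assumptions for illustration.

```python
# Illustrative data-contract record and access check; not the actual questionnaire.
from datetime import date

contract = {
    "id": "DC-2022-001",
    "purpose": "WSI classification research",
    "permitted_parties": {"umg_medic", "industrial_partner"},
    "region": "eu-central",          # routes storage to the respective regional data center
    "valid_until": date(2026, 12, 31),  # defined period of storage and usage
    "access_log": [],
}

def may_process(party: str, today: date) -> bool:
    """Check whether a party may process the data set; log every attempt."""
    ok = party in contract["permitted_parties"] and today <= contract["valid_until"]
    contract["access_log"].append((party, today.isoformat(), ok))
    return ok
```

The check is consulted before any processing, so only designated parties can act, and the log provides the required record of all data activities.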

Cloud infrastructure supporting research

In addition to the data privacy issues mentioned (see Section “ Data contract ”), data transfer from university medical center (hospital) networks into cloud environments (e.g., OneDrive, GoogleDrive, Dropbox) is often restricted by security rules such as blocked ports or firewall settings. To simplify the firewall configuration, an edge-device was established, tunneling messages and data from and to the cloud through one connection secured by encryption and certificates (SSL/TLS with RSA encryption using keys of at least 4096 bits). The edge-device is set up on a virtual machine within the MeDIC as part of the research network segment, while being configured, operated, and monitored from the cloud (see Fig.  3 ). This enables technical IT personnel to establish data channels for medical end users without on-site involvement. Focusing on the user experience for medical users, an approach similar to Microsoft OneDrive was followed by creating local folders for each upload and download channel, which are connected to a secured cloud storage container.
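The OneDrive-like channel model can be sketched as a local folder mirrored into a storage container. The sync loop below is a simplified, hypothetical stand-in for the edge-device, which in reality tunnels the transfer over the secured TLS connection; folder names and the copy mechanism are assumptions.

```python
# Illustrative folder-to-container sync for one upload channel.
import os
import shutil

def sync_channel(local_folder: str, container_folder: str) -> list:
    """Copy files not yet present in the container; return the names uploaded."""
    uploaded = []
    os.makedirs(container_folder, exist_ok=True)
    for name in sorted(os.listdir(local_folder)):
        src = os.path.join(local_folder, name)
        dst = os.path.join(container_folder, name)
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)   # in the real system: encrypted upload via the edge-device
            uploaded.append(name)
    return uploaded
```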

figure 3

Screenshot of the cloud-based configuration for one edge-device.

For the cloud platform storage of the commercial partner, we use an approach similar to Zaharia et al. 32 by combining Microsoft Azure Data Lake with concepts from data warehouses and direct file access for cloud data analytics platform tools (e.g., cloud-hosted Jupyter Notebooks ( https://jupyter.org/  [Online 2022/09/09])). While supporting ACID transactions (see Section “ Cloud infrastructure supporting research ”), data versioning, lineage, and metadata for each file, it also covers the requirements for handling personal data. Following the three-layer approach of data warehouses, files are uploaded to an ingestion zone, which scans them and associates them with a data contract before they are moved into a bronze data lake storage (layer 1: staging). From this layer the data is extracted, transformed/curated, and loaded/published by data engineers to a silver zone (layer 2). Because the data contract reference is saved in the metadata, the data privacy constraints are always known for each file, regardless of the zone in which it is located. As large amounts of data are being processed, mounting data zones within data analytics platform tools avoids copying large files from one destination to another. Furthermore, cloud-hosted machine learning tools such as MLFlow ( https://mlflow.org/  [Online 2022/08/01].) were employed with direct file access to enable the management of the complete machine learning lifecycle.

Technical evaluation

To show that the orchestration of the MeDIC’s clinical infrastructure together with the cloud infrastructure of the commercial partner is technically feasible, an evaluation of the approach was conducted.

Overall, data from 2000 cancer patients (treated between 2000 and 2020) were transferred from the UMG to two commercial partners. Differences were expected between small and therefore quickly transferable clinical files and large files requiring longer transfer times. For many small files, the number of parallel transfers is important, whereas large files benefit from a few parallel data transfers with high bandwidth each. To evaluate both cases, we transferred whole pathology slide images and multi-omics data. The data transfer is based on the Microsoft Azure platform and corresponding C# libraries, so no problems in terms of scalability were encountered. Nevertheless, occasional connection issues were observed, and comparing the MD5 hashes between source files and destination files revealed that some large files were corrupted. The issue could be traced to regular Windows updates, reboots of the virtual machine, and/or local IT scripts changing firewall settings. Future system designs will provide an automatic validation of source and destination files.
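The automatic source/destination validation envisaged for future system designs could look like the following MD5-based sketch (illustrative code, not the platform's implementation); the chunked reading keeps memory use constant even for multi-gigabyte slide images.

```python
# Illustrative transfer validation via MD5 checksums, as used in the evaluation.
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def transfer_is_intact(src: str, dst: str) -> bool:
    """True when source and destination files have identical content."""
    return md5_of(src) == md5_of(dst)
```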

Within the project Cancer Scout, a research collaboration platform was used for an end-to-end machine learning lifecycle, working on a large dataset that cannot be handled on standard local hardware. In the first step, data scientists curated raw data (bronze data lake zone) into a clean data set (silver zone) by eliminating duplicates and converting whole slide imaging (WSI) iSyntax files to Tagged Image File Format (TIFF) using cloud-hosted Jupyter Notebooks. Furthermore, pathologists annotated the WSIs with cancer subtypes using the cloud-hosted EXACT tool 36 , working directly with data on the cloud data storage. The trained machine learning model for WSI classification is described in Teichmann et al. 30 . Currently, model serving is done with a cloud-based API; however, it still needs to be integrated into a medical decision support tool.

The GDPR was introduced with the goal of protecting the personal data of Europeans when processed in all sectors of the economy. Notably, it fails to provide clear instructions for processing personal data for secondary research purposes, such as under which circumstances key-coded data could be considered anonymous 37 , 38 . Nevertheless, data collection and processing are of paramount importance for further innovation and development of the promising fields of big data and artificial intelligence in drug discovery, clinical trials, personalized medicine, and medical research 39 . While secure data collection enables the collaboration between multiple public and commercial organizations to scientifically explore multi-omics data, it also facilitates medical research by using AI technologies to analyze and identify patterns in large and complex data sets faster and more precisely. From a technical point of view, the task of collecting and transferring medical data from hospitals to a collaborative cloud data platform while ensuring privacy and security is not trivial. To address this issue, a three-layer approach was validated, consisting of clinical data storage, a MeDIC, and a cloud platform. Firstly, the data from the clinical systems are minimized during the ETL process to the MeDIC. Secondly, each data set is linked to a “data contract” when transferred from the MeDIC to the cloud, specifying data governance and defining the rights and permissions to use the data. Currently, only the trusted third party of the MeDIC can link PSN to IDAT, so no data record linkage between different locations is possible. As linking different kinds of data from different institutions increases the risk of identifying a patient (e.g., head CT, genome sequencing), this topic needs further research.

We successfully established a data platform that enables the collaboration between a public and a commercial organization by enabling data transfer from various clinical systems via a MeDIC into a cloud, supporting multi-site research while ensuring compliant data governance. In a first step, the approach was validated through the collaboration between one clinical institution and an industrial partner and is therefore specific to the UMG and MeDIC, as the TTP is located at the MeDIC. Based on a dataset containing 2085 diagnostic slides from 840 colon cancer patients, a new AI algorithm for the classification of WSI in digital pathology 30 was proposed. Considering the literature review, this implementation is, to the authors' knowledge, the first work that implements this concept. To disseminate the results gained from this research project, the following measures were taken to ensure that this research meets the requirements of the FAIR principles as well. The research was made findable (F) by submitting it to a respected journal, providing a DOI (F1, F3) and descriptive keywords (F2). The data are findable and can be requested from the authors (F4). The submission can be retrieved by the given DOI (A1), the access protocol is open and free (A1.1), and the manuscript, once published, is accessible from different online libraries (A2). A formal language for knowledge representation was used (I1), and the manuscript was improved to include vocabulary that follows the FAIR principles (I2). Furthermore, the data were described with as much information and as many relevant attributes as possible (R1).

Data availability

The data that support the findings of this study are available from the University Medical Center Göttingen, but restrictions apply to their availability: the data were used under license for the current study and are not publicly available. The data are, however, available from the authors upon reasonable request and with the permission of the University Medical Center Göttingen. Please contact the corresponding author CB for all data requests.

Hood, L. & Flores, M. A personal view on systems medicine and the emergence of proactive p4 medicine: Predictive, preventive, personalized and participatory. New Biotechnol. 29 (6), 613–24. https://doi.org/10.1016/j.nbt.2012.03.004 (2012).

Wilkinson, M. D. et al. The fair guiding principles for scientific data management and stewardship. Sci. Data https://doi.org/10.1038/sdata.2016.18 (2016).

Patel, J. Bridging data silos using big data integration. Int. J. Database Manag. Syst. https://doi.org/10.5121/ijdms.2019.11301 (2019).

Cherico-Hsii, S. et al. Sharing overdose data across state agencies to inform public health strategies: A case study. Public Health Rep. 131 (2), 258–263. https://doi.org/10.1177/003335491613100209 (2016).

Rosenbaum, L. Bridging the data-sharing divide–seeing the devil in the details, not the other camp. N. Engl. J. Med. https://doi.org/10.1056/NEJMp1704482 (2017).

Shafiee, M. J., Chung, A. G., Khalvati, F., Haider, M. A. & Wong, A. Discovery radiomics via evolutionary deep radiomic sequencer discovery for pathologically proven lung cancer detection. J. Med. Imaging 4 (4), 041305. https://doi.org/10.1117/1.JMI.4.4.041305 (2017).

DeVries, M. et al. Name it! store it! protect it!: A systems approach to managing data in research core facilities. J. Biomol. Tech. 28 (4), 137–141. https://doi.org/10.7171/jbt.17-2804-003 (2017).

Hosny, A., Parmar, C., Quackenbush, J., Schwartz, L. & Aerts, H. Artificial intelligence in radiology. Nat. Rev. Cancer 18 (18), 500–510. https://doi.org/10.1038/s41568-018-0016-5 (2018).

Cui, M. & Zhang, D. Artificial intelligence and computational pathology. Lab. Invest. 101 , 412–422. https://doi.org/10.2217/fon.15.295 (2016).

Mathur, P. & Burns, M. Artificial intelligence in critical care. Int. Anesthesiol. Clin. 57 (2), 89–102. https://doi.org/10.1097/AIA.0000000000000221 (2019).

Semler, S. C., Wissing, F. & Heyder, R. German medical informatics initiative. Methods Inf. Med. https://doi.org/10.3414/ME18-03-0003 (2018).

Casey, J., Schwartz, B., Stewart, W. & Adler, N. Using electronic health records for population health research: A review of methods and applications. Annu. Rev. Public Health 37 (1), 61–81. https://doi.org/10.1146/annurev-publhealth-032315-021353 (2016).

EuropeanDataProtectionSupervisor. A preliminary opinion on data protection and scientific research (2020). https://edps.europa.eu/sites/edp/files/publication/20-01-06_opinion_research_en.pdf,p.17 .

Soini, S. Using electronic health records for population health research: A review of methods and applications. Eur. J. Hum. Genet. https://doi.org/10.1038/s41431-020-0608-x (2020).

Chico, V. The impact of the general data protection regulation on health research. Br. Med. Bull. https://doi.org/10.1093/bmb/ldy038 (2018).

Rumbold, J. M. M. & Pierscionek, B. K. A critique of the regulation of data science in healthcare research in the European union. BMC Med. Ethics https://doi.org/10.1186/s12910-017-0184-y (2017).

EuropeanParliament. General data protection regulation (2016). https://eur-lex.europa.eu/eli/reg/2016/679/oj,p.35 .

Senarath, A. & Arachchilage, N. A. G. A data minimization model for embedding privacy into software systems. Comput. Secur. 87 , 61–81. https://doi.org/10.1016/j.cose.2019.101605 (2019).

Froelicher, D. et al. Truly privacy-preserving federated analytics for precision medicine with multiparty homomorphic encryption. Nat. Commun. 12 (1), 5910. https://doi.org/10.1038/s41467-021-25972-y (2021).

Bahmani, A. et al. A scalable, secure, and interoperable platform for deep data-driven health management. Nat. Commun. 12 , 5757. https://doi.org/10.1038/s41467-021-26040-1 (2021).

Brody, J. A. et al. Analysis commons, a team approach to discovery in a big-data environment for genetic epidemiology. Nat. Commun. 49 , 1560–1563. https://doi.org/10.1038/ng.3968 (2017).

Prokosch, H.-U. et al. Towards a national portal for medical research data (fdpg): Vision, status, and lessons learned. Stud. Health Technol. Inform. 302 , 307–311. https://doi.org/10.3233/SHTI230124 (2023).

Medizininformatik-Initiative. Der Kerndatensatz der Medizininformatik-Initiative, 3.0 (2021).

Schmidt, C. et al. Making covid-19 research data more accessible-building a nationwide information infrastructure. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz https://doi.org/10.1007/s00103-021-03386-x (2021).

Heyder, R. et al. The german network of university medicine: Technical and organizational approaches for research data platforms. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz https://doi.org/10.1007/s00103-022-03649-1 (2023).

Schmidt, M. et al. Codex meets racoon - a concept for collaborative documentation of clinical and radiological covid-19 data. Stud. Health Technol. Inform. https://doi.org/10.3233/SHTI220804 (2022).

RACOON, N. Radiologische Forschung in der Entwicklung. RoFo: Fortschritte auf dem Gebiete der Röntgenstrahlen und der Nuklearmedizin (2022). https://doi.org/10.1055/a-1888-9285.

RACOON, N. RACOON: Das Radiological Cooperative Network zur Beantwortung der großen Fragen in der Radiologie. RoFo: Fortschritte auf dem Gebiete der Röntgenstrahlen und der Nuklearmedizin (2022). https://doi.org/10.1055/a-1544-2240.

Pedreira, V., Barros, D. & Pinto, P. A review of attacks, vulnerabilities, and defenses in industry 4.0 with new challenges on data sovereignty ahead. Sensors 21 , 15. https://doi.org/10.3390/s21155189 (2021).

Teichmann, M., Aichert, A., Bohnenberger, H., Ströbel, P. & Heimann, T. End-to-end learning for image-based detection of molecular alterations in digital pathology. In Wang, L., Dou, Q., Fletcher, P. T., Speidel, S. & Li, S. (eds) Medical Image Computing and Computer Assisted Intervention—MICCAI 2022, 88–98 (Springer Nature, Switzerland, 2022).

Inmon, W. H. Building the Data Warehouse (John Wiley & Sons, 2005).

Zaharia, M., Ghodsi, A., Xin, R. & Armbrust, M. Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics. 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online Proceedings (2021). http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf .

Kimball, R. & Ross, M. The Data Warehouse Toolkit (John Wiley & Sons, 2013).

Lee, D. & Heintz, B. Productionizing machine learning with delta lake. Databricks Engineering Blog (2019). https://databricks.com/de/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html.

Parciak, M. et al. Fairness through automation: Development of an automated medical data integration infrastructure for fair health data in a maximum care university hospital. BMC Med. Inform. Decision Making https://doi.org/10.1186/s12911-023-02195-3 (2023).

Marzahl, C. et al. Exact: A collaboration toolset for algorithm-aided annotation of images with annotation version control. Sci. Rep. 11 (1), 4343. https://doi.org/10.1038/s41598-021-83827-4 (2021).

van Ooijen, I. & Vrabec, H. U. Does the gdpr enhance consumers’ control over personal data? an analysis from a behavioural perspective. J. Consum. Policy https://doi.org/10.1007/s10603-018-9399-7 (2019).

Zarsky, T. Z. Incompatible: The Gdpr in the Age of Big Data (Seton Hall Law Review, 2017).

Mallappallil, M., Sabu, J., Gruessner, A. & Salifu, M. A review of big data and medical research. SAGE Open Med. 8 , 2050312120934839. https://doi.org/10.1177/2050312120934839 (2020).

Acknowledgements

The research presented in this work was funded by the German federal ministry of education and research (BMBF) as part of the Cancer Scout project (13GW0451). We thank all members of the Cancer Scout consortium for their contributions.

Author information

These authors contributed equally: Monica Toma and Caroline Bönisch.

Authors and Affiliations

Siemens Healthineers AG, Erlangen, Germany

Monica Toma, Michael Kelm & Sven Winkelmann

Medical Data Integration Center, Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany

Caroline Bönisch & Tibor Kesztyüs

Faculty of Electrical Engineering and Computer Science, University of Applied Sciences Stralsund, Stralsund, Germany

Caroline Bönisch

Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany

Benjamin Löhnhardt

Nuremberg Institute of Technology, Nuremberg, Germany

Sven Winkelmann

Institute of Pathology, University Medical Center Göttingen, Göttingen, Germany

Hanibal Bohnenberger & Philipp Ströbel

Contributions

T.K. and M.K. coordinated and supervised this research project. T.K., M.K., and P.S. contributed to the conceptualization of the research and were involved in the revision, editing, and final approval of the manuscript. All authors read and approved the final manuscript. C.B. wrote the introduction, including the problem statement, the literature review, and the related-work sections; performed the literature review; and contributed to the clinical infrastructure and data minimization. M.T. wrote the abstract, the results section concerned with the data contract, and the discussion; contributed to the introduction and technical evaluation; and proofread the manuscript. B.L. wrote the results section concerned with the clinical infrastructure and the methods sections concerned with the clinical infrastructure and data minimization. S.W. wrote the results section concerned with the cloud infrastructure, the methods section concerned with the cloud infrastructure, and the technical evaluation.

Corresponding authors

Correspondence to Monica Toma or Caroline Bönisch.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Toma, M., Bönisch, C., Löhnhardt, B. et al. Research collaboration data platform ensuring general data protection. Sci Rep 14, 11887 (2024). https://doi.org/10.1038/s41598-024-61912-8

Received : 10 October 2023

Accepted : 10 May 2024

Published : 24 May 2024

DOI : https://doi.org/10.1038/s41598-024-61912-8


Shingles Vaccination in Medicare Part D After Inflation Reduction Act Elimination of Cost Sharing

  • 1 Program on Medicines and Public Health, University of Southern California Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles
  • 2 Leonard D. Schaeffer Center for Health Policy and Economics, University of Southern California, Los Angeles
  • 3 Sol Price School of Public Policy, University of Southern California, Los Angeles
  • 4 Department of Population Health Sciences, University of Wisconsin-Madison
  • 5 Center for Value-Based Insurance Design, University of Michigan, Ann Arbor

Although vaccinations prevent morbidity and mortality among Medicare beneficiaries, uptake of vaccines recommended by the Advisory Committee on Immunization Practices covered by Medicare Part D (ie, shingles, tetanus, diphtheria, pertussis, and hepatitis A and B) is suboptimal. 1 Unlike commercially insured individuals who have no cost sharing for recommended vaccinations, in 2021, Medicare beneficiaries receiving vaccines covered under Medicare Part D paid $234 million out of pocket (OOP), with a mean OOP cost of $76.94 for shingles vaccines.

Qato DM , Romley JA , Myerson R , Goldman D , Fendrick AM. Shingles Vaccination in Medicare Part D After Inflation Reduction Act Elimination of Cost Sharing. JAMA. Published online May 23, 2024. doi:10.1001/jama.2024.7348

Computer Science > Cryptography and Security

Title: SoK: Leveraging Transformers for Malware Analysis

Abstract: The introduction of transformers has been an important breakthrough for AI research and application, as transformers are the foundation of generative AI. A promising application domain for transformers is cybersecurity, in particular malware analysis, owing to the flexibility of transformer models in handling long sequential features and understanding contextual relationships. However, as the use of transformers for malware analysis is still in its infancy, it is critical to evaluate, systematize, and contextualize the existing literature to foster future research. This Systematization of Knowledge (SoK) paper aims to provide a comprehensive analysis of transformer-based approaches designed for malware analysis. Based on our systematic analysis of existing knowledge, we structure and propose taxonomies based on: (a) how different transformers are adapted, organized, and modified across various use cases; and (b) how diverse feature types and their representation capabilities are reflected. We also provide an inventory of datasets used to explore multiple research avenues in the use of transformers for malware analysis and discuss open challenges along with future research directions. We believe that this SoK paper will help the research community gain detailed insights from existing work and will serve as a foundational resource for novel research using transformers for malware analysis.

America’s best decade, according to data

One simple variable, more than anything, determines when you think the nation peaked.

How do you define the good old days?

Department of Data

The plucky poll slingers at YouGov, who are consistently willing to use their elite-tier survey skills in service of measuring the unmeasurable, asked 2,000 adults which decade had the best and worst music, movies, economy and so forth, across 20 measures. But when we charted them, no consistent pattern emerged.

We did spot some peaks: When asked which decade had the most moral society, the happiest families or the closest-knit communities, White people and Republicans were about twice as likely as Black people and Democrats to point to the 1950s. The difference probably depends on whether you remember that particular decade for “Leave it to Beaver,” drive-in theaters and “12 Angry Men” — or the Red Scare, the murder of Emmett Till and massive resistance to school integration.

“This was a time when Repubs were pretty much running the show and had reason to be happy,” pioneering nostalgia researcher Morris Holbrook told us via email. “Apparently, you could argue that nostalgia is colored by political preferences. Surprise, surprise.”

And he’s right! But any political, racial or gender divides were dwarfed by what happened when we charted the data by generation. Age, more than anything, determines when you think America peaked.

So, we looked at the data another way, measuring the gap between each person’s birth year and their ideal decade. The consistency of the resulting pattern delighted us: It shows that Americans feel nostalgia not for a specific era, but for a specific age.

The good old days when America was “great” aren’t the 1950s. They’re whatever decade you were 11, your parents knew the correct answer to any question, and you’d never heard of war crimes tribunals, microplastics or improvised explosive devices. Or when you were 15 and athletes and musicians still played hard and hadn’t sold out.

Not every flavor of nostalgia peaks as sharply as music does. But by distilling them to the most popular age for each question, we can chart a simple life cycle of nostalgia.

The closest-knit communities were those in our childhood, ages 4 to 7. The happiest families, most moral society and most reliable news reporting came in our early formative years — ages 8 through 11. The best economy, as well as the best radio, television and movies, happened in our early teens — ages 12 through 15.
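The birth-year-to-ideal-decade gap described above can be sketched in a few lines (toy data, not YouGov’s actual responses; anchoring each answer at the decade’s midpoint is an assumption about how one might operationalize it):

```python
# Toy illustration: for each respondent, compute how old they were
# at the midpoint of the decade they named as having the best music.
respondents = [
    {"birth_year": 1955, "best_music_decade": 1970},
    {"birth_year": 1980, "best_music_decade": 1990},
    {"birth_year": 1994, "best_music_decade": 2010},
]

def age_at_decade(birth_year: int, decade_start: int) -> int:
    """Respondent's age at the midpoint of their chosen decade."""
    return (decade_start + 5) - birth_year

gaps = [age_at_decade(r["birth_year"], r["best_music_decade"]) for r in respondents]
print(gaps)                    # [20, 15, 21]
print(sum(gaps) / len(gaps))   # mean "nostalgic age" across respondents
```

Aggregating these per-respondent ages, rather than the raw decades, is what reveals that nostalgia tracks a life stage instead of a calendar era.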

Slightly spendier activities such as fashion, music and sporting events peaked in our late teens — ages 16 through 19 — matching research from the University of South Australia’s Ehrenberg-Bass Institute, which shows music nostalgia centers on age 17.

YouGov didn’t just ask about the best music and the best economy. The pollsters also asked about the worst music and the worst economy. But almost without exception, if you ask an American when times were worst, the most common response will be “right now!”

This holds true even when “now” is clearly not the right answer. For example, when we ask which decade had the worst economy, the most common answer is today. The Great Depression — when, for much of a decade, unemployment exceeded what we saw in the worst month of pandemic shutdowns — comes in a grudging second.

To be sure, other forces seem to be at work. Democrats actually thought the current economy wasn’t as bad as the Great Depression. Republicans disagreed. In fact, measure after measure, Republicans were more negative about the current decade than any other group — even low-income folks in objectively difficult situations.

So, we called the brilliant Joanne Hsu, director of the University of Michigan’s Surveys of Consumers who regularly wrestles with partisan bias in polling.

Hsu said that yes, she sees a huge partisan split in the economy, and yes, Republicans are far more negative than Democrats. But it hasn’t always been that way.

“People whose party is in the White House always have more favorable sentiment than people who don’t,” she told us. “And this has widened over time.”

In a recent analysis , Hsu — who previously worked on some of our favorite surveys at the Federal Reserve — found that while partisanship drove wider gaps in economic expectations than did income, age or education even in the George W. Bush and Barack Obama years, they more than doubled under Donald Trump as Republicans’ optimism soared and Democrats’ hopes fell.

Our attitudes reversed almost the instant President Biden took office, but the gap remains nearly as wide. That is to say, if we’d asked the same questions about the worst decades during the Trump administration, Hsu’s work suggests the partisan gap could have shriveled or even flipped eyeglasses over teakettle.

To understand the swings, Hsu and her friends spent the first part of 2024 asking 2,400 Americans where they get their information about the economy. In a new analysis , she found Republicans who listen to partisan outlets are more likely to be negative, and Democrats who listen to their own version of such news are more positive — and that Republicans are a bit more likely to follow partisan news.

But while Fox and friends drive some negativity, only a fifth of Republicans get their economic news from partisan outlets. And Democrats and independents give a thumbs down to the current decade, too, albeit at much lower rates.

There’s clearly something more fundamental at work. As YouGov’s Carl Bialik points out, when Americans were asked last year which decade they’d most want to live in, the most common answer was now. At some level then, it seems unlikely that we truly believe this decade stinks by almost every measure.

A deeper explanation didn’t land in our laps until halfway through a Zoom call with four well-caffeinated Australian marketing and consumer-behavior researchers: the Ehrenberg-Bass folks behind the music study we cited above. (Their antipodean academic institute has attracted massive sponsorships by replacing typical corporate marketing fluffery with actual evidence.)

Their analysis began when Callum Davies needed to better understand the demographics of American music tastes to interpret streaming data for his impending dissertation. Since they were already asking folks about music, Davies and his colleagues decided they might as well seize the opportunity to update landmark research from Holbrook and Robert Schindler about music nostalgia.

Building on the American scholars’ methods, they asked respondents to listen to a few seconds each of 34 songs, including Justin Timberlake’s “Sexy Back” and Johnny Preston’s “Running Bear.” Then respondents were asked to rate each song on a zero-to-10 scale. (In the latter case, we can’t imagine the high end of the scale got much use, especially if the excerpt included that song’s faux-tribal “hooga-hooga” chant and/or its climactic teen drownings.)

Together, the songs represented top-10 selections from every even-numbered year from 1950 (Bing and Gary Crosby’s “Play a Simple Melody”) to 2016 (Rihanna’s “Work”), allowing researchers to gather our preferences for music released throughout our lives.

Like us, they found that you’ll forever prefer the music of your late teens. But their results show one big difference: There’s no sudden surge of negative ratings for the most recent music.

Marketing researcher Bill Page said that by broadly asking when music, sports or crime were worst, instead of getting ratings for specific years or items, YouGov got answers to a question they didn’t ask.

“When you ask about ‘worst,’ you’re not asking for an actual opinion,” Page said. “You’re asking, ‘Are you predisposed to think things get worse?’”

“There’s plenty of times surveys unintentionally don’t measure what they claim to,” his colleague Zac Anesbury added.

YouGov actually measured what academics call “declinism,” his bigwig colleague Carl Driesener explained. He looked a tiny bit offended when we asked if that was a real term or slang they’d coined on the spot. But in our defense, only a few minutes had passed since they had claimed “cozzie livs” was Australian for “the cost of living crisis.”

Declinists believe the world keeps getting worse. It’s often the natural result of rosy retrospection, or the idea that everything — with the possible exception of “Running Bear” — looks better in memory than it did at the time. This may happen in part because remembering the good bits of the past can help us through difficult times, Page said.

It’s a well-established phenomenon in psychology, articulated by Leigh Thompson, Terence Mitchell and their collaborators in a set of analyses. They found that when asked to rate a trip mid-vacation, we often sound disappointed. But after we get home — when the lost luggage has been found and the biting-fly welts have stopped itching — we’re as positive about the trip as we were in the early planning stage. Sometimes even more so.

So saying the 2020s are the worst decade ever is akin to sobbing about “the worst goldang trip ever” at 3 a.m. in a sketchy flophouse full of Russian-speaking truckers after you’ve run out of cash and spent three days racing around Urumqi looking for the one bank in Western China that takes international cards.

A few decades from now, our memories shaped by grainy photos of auroras and astrolabes, we’ll recall only the bread straight from streetside tandoor-style ovens and the locals who went out of their way to bail out a couple of distraught foreigners.

In other words, the 2020s will be the good old days.

Greetings! The Department of Data curates queries. What are you curious about: How many islands have been completely de-ratted? Where is America’s disc-golf heartland? Who goes to summer camp? Just ask!

If your question inspires a column, we’ll send you an official Department of Data button and ID card. This week’s buttons go to YouGov’s Taylor Orth, who correctly deduced we’d be fascinated by decade-related polls, and Stephanie Killian in Kennesaw, Ga., who also got a button for our music column, with her questions about how many people cling to the music of their youth.

  • Letter to the Editor
  • Open access
  • Published: 27 May 2024

Analyzing global research trends and focal points of pyoderma gangrenosum from 1930 to 2023: visualization and bibliometric analysis

  • Sa’ed H. Zyoud   ORCID: orcid.org/0000-0002-7369-2058 1 , 2  

Journal of Translational Medicine volume  22 , Article number:  508 ( 2024 ) Cite this article


To the Editor, I read with great interest the publication entitled “An approach to the diagnosis and management of patients with pyoderma gangrenosum from an international perspective: results from an expert forum” [ 1 ]. Pyoderma gangrenosum is an ulcerative cutaneous condition with distinctive clinical characteristics, first described in 1930 [ 2 ]. Given the importance of the subject, I searched the databases and found no bibliometric studies on this topic. In recent years, researchers have successfully applied bibliometric analysis in various domains, including dermatology, contributing to the development of novel theories and the assessment of research frontiers. Nonetheless, no comprehensive bibliometric analysis of P. gangrenosum has been performed. This study addresses that gap by conducting a thorough bibliometric analysis of the field of P. gangrenosum at the global level. The goal is to help researchers swiftly grasp the knowledge structure and current focal points of the field, generate new research topic ideas, and enhance the overall quality of research on P. gangrenosum.

This bibliometric analysis sought to delineate research endeavors concerning P. gangrenosum, pinpoint the primary contributing countries, and discern prevalent topics within this domain. Using a descriptive cross-sectional bibliometric methodology, this study extracted pertinent documents from the Scopus database covering the period from 1930 to December 31, 2023. The search strategy included keywords related to ‘pyoderma gangrenosum.’ VOSviewer software (version 1.6.20) was used to illustrate the most recurrent terms and themes [ 3 ]. The retrieved documents were restricted to journal research articles; other document types were excluded.

Overall, 4,326 papers about P. gangrenosum were published between 1930 and 2023. Among these were 3,095 (71.54%) original papers, 548 (12.67%) letters, 477 (11.03%) reviews, and 206 (4.76%) other kinds of articles, such as conference abstracts, editorials, or notes. English was the most frequently used language, with 3,454 publications, followed by French (n = 253), German (n = 190), and Spanish (n = 163); together these four languages accounted for 93.85% of all related publications.
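The document-type and language shares quoted above follow directly from the raw counts; a quick sketch to verify the arithmetic (counts from the text, rounding to two decimals):

```python
# Sanity-check of the publication-share percentages reported in the text.
doc_counts = {"original": 3095, "letters": 548, "reviews": 477, "other": 206}
total = sum(doc_counts.values())  # should be 4,326 papers in total
shares = {k: round(100 * v / total, 2) for k, v in doc_counts.items()}

# Top four languages; their combined share of all 4,326 publications.
lang_counts = {"English": 3454, "French": 253, "German": 190, "Spanish": 163}
top4_share = round(100 * sum(lang_counts.values()) / total, 2)
```

Running this reproduces the percentages given in the text (71.54%, 12.67%, 11.03%, 4.76%, and 93.85% for the top four languages combined).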

Figure 1 shows the distribution of these publications. Between 1930 and 2023, the number of publications on P. gangrenosum grew steadily (R² = 0.9257; P < 0.001). Growth and productivity trends in P. gangrenosum-related publications have been shaped by developments in medical research, clinical practice, and patient care [ 4 , 5 ]. All of these factors have advanced our knowledge of the condition, improved treatment, and helped to create standardized outcome measures for clinical studies.
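To illustrate how such a growth trend is quantified, here is a minimal ordinary-least-squares fit with the R² computed by hand; the yearly counts are invented for illustration and are not the study's data:

```python
# Illustrative only: computing an R^2 for a publication-growth trend.
# The yearly counts below are hypothetical; the paper's series is in Fig. 1.
years = list(range(2014, 2024))
counts = [80, 85, 92, 101, 108, 118, 130, 138, 151, 160]  # invented

n = len(years)
mean_x, mean_y = sum(years) / n, sum(counts) / n
sxx = sum((x - mean_x) ** 2 for x in years)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Coefficient of determination: 1 - residual SS / total SS.
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(years, counts))
ss_tot = sum((y - mean_y) ** 2 for y in counts)
r_squared = 1 - ss_res / ss_tot
```

A near-linear upward series like this one yields an R² close to 1, which is how a value such as 0.9257 summarizes the steadiness of the growth.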

Figure 1: Annual growth of published research related to P. gangrenosum (1930–2023)

The top 10 countries with the most publications on P. gangrenosum are listed in Table 1, led by the USA (n = 1073; 24.80%), the UK (n = 345; 7.98%), Japan (n = 335; 7.74%), and Germany (n = 296; 6.84%). With 65 articles, the Mayo Clinic in the USA led the institutions; Oregon Health & Science University in the USA and Università degli Studi di Milano in Italy followed with 60 articles each.

To create a term co-occurrence map in VOSviewer 1.6.20, terms had to appear in the title and abstract at least forty times under binary counting. The map was built from the terms with the highest relevance scores. The layout algorithm draws frequently co-occurring terms as larger bubbles and places strongly related terms close together; the larger circles in Fig. 2A therefore represent frequently occurring terms in titles and abstracts. Four primary topic clusters are distinguished by color: “treatment modalities” (green), “epidemiology and clinical presentation” (blue), “improved diagnostic methods” (red), and “the links between P. gangrenosum and other morbidities such as inflammatory bowel disease or autoimmune conditions” (yellow).
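The binary-counting rule can be sketched as follows: each term is counted at most once per document, and a co-occurrence is recorded whenever two terms appear in the same title/abstract. The documents and term list below are hypothetical:

```python
# Minimal sketch of VOSviewer-style binary counting: a term counts at most once
# per document; two terms co-occur if both appear in the same title/abstract.
from itertools import combinations
from collections import Counter

docs = [
    "pyoderma gangrenosum treatment with biologics",
    "diagnosis of pyoderma gangrenosum and inflammatory bowel disease",
    "inflammatory bowel disease epidemiology",
]
terms = ["pyoderma gangrenosum", "inflammatory bowel disease",
         "treatment", "diagnosis"]

occurrence = Counter()
cooccurrence = Counter()
for doc in docs:
    present = [t for t in terms if t in doc]  # binary: present or not
    occurrence.update(present)
    cooccurrence.update(combinations(sorted(present), 2))

# In the actual analysis, only terms occurring in at least 40 documents
# are kept before the map is laid out.
```

In a real run the counters would feed the threshold filter (occurrence ≥ 40) and the clustering step that produces the colored map.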

Figure 2: Mapping of terms used in research on P. gangrenosum. A: The co-occurrence network of terms extracted from the titles or abstracts of at least 40 articles. Colors represent groups of terms that are relatively strongly linked to each other. The size of a term reflects the number of P. gangrenosum-related publications in which it appeared, and the distance between two terms is an estimated indication of their relatedness. B: Overlay of the same map by time period; the colors indicate whether a term appeared mainly in early (blue) or late (red) years

Interestingly, after 2012, terms related to “treatment modalities” and “epidemiology and clinical presentation” received more attention, whereas earlier work (pre-2012) focused on “improved diagnostic methods” and “the links between P. gangrenosum and other morbidities such as inflammatory bowel disease or autoimmune conditions.” Figure 2B shows this shift.

In conclusion, there has recently been an increase in P. gangrenosum research, especially in the last decade. The current focus of research is on treatment challenges, obstacles to diagnosis, and connections to underlying diseases. Furthermore, efforts are being made to create core outcome sets and standardized diagnostic criteria for clinical trials. These patterns demonstrate continuous attempts to comprehend, identify, and treat this illness with greater effectiveness. This recent increase in research has important implications for clinical practice. Clinicians can improve patient care by remaining current in emerging trends and areas of interest. Moreover, an in-depth analysis of previous studies can identify knowledge gaps, directing future research efforts toward the most important issues. In the end, a deeper comprehension of the body of research can result in better clinical judgment based on best practices, which could enhance patient outcomes and advance the dermatological field.

Data availability

This published article contains all the information produced or examined in this research. Additional datasets utilized during this study can be obtained from the corresponding author.

References

1. Haddadin OM, Ortega-Loayza AG, Marzano AV, Davis MDP, Dini V, Dissemond J, Hampton PJ, Navarini AA, Shavit E, Tada Y, et al. An approach to diagnosis and management of patients with pyoderma gangrenosum from an international perspective: results from an expert forum. Arch Dermatol Res. 2024;316(3):89.

2. Brunsting LA. Pyoderma (Echthyma) Gangrenosum. Arch Derm Syphilol. 1930;22(4):655–80.

3. van Eck NJ, Waltman L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics. 2010;84(2):523–38.

4. Kridin K, Cohen AD, Amber KT. Underlying systemic diseases in pyoderma gangrenosum: a systematic review and meta-analysis. Am J Clin Dermatol. 2018;19(4):479–87.

5. McKenzie F, Arthur M, Ortega-Loayza AG. Pyoderma gangrenosum: what do we know now? Curr Dermatol Rep. 2018;7(3):147–57.


Acknowledgements

The author thanks An-Najah National University for all the administrative assistance during the implementation of the project.

Funding

No support was received for conducting this study.

Author information

Authors and Affiliations

Department of Clinical and Community Pharmacy, College of Medicine and Health Sciences, An-Najah National University, Nablus, 44839, Palestine

Sa’ed H. Zyoud

Clinical Research Centre, An-Najah National University Hospital, Nablus, 44839, Palestine


Contributions

Sa’ed H. Zyoud significantly contributed to the conceptualization and design of the research project, overseeing data management and analysis, generating figures, and making substantial contributions to the literature search and interpretation. Furthermore, Sa’ed H. Zyoud authored the manuscript, which he reviewed and approved as the sole author.

Corresponding author

Correspondence to Sa’ed H. Zyoud.

Ethics declarations

Ethics approval and consent to participate

Ethics approval was not required because the study did not involve human participants.

Consent for publication

Not applicable.

Competing interests

The author declares that he has no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Zyoud, S.H. Analyzing global research trends and focal points of pyoderma gangrenosum from 1930 to 2023: visualization and bibliometric analysis. J Transl Med 22, 508 (2024). https://doi.org/10.1186/s12967-024-05306-4


Received : 10 May 2024

Accepted : 14 May 2024

Published : 27 May 2024

DOI : https://doi.org/10.1186/s12967-024-05306-4


Journal of Translational Medicine

ISSN: 1479-5876



  • Open access
  • Published: 27 May 2024

Discovery of novel RNA viruses through analysis of fungi-associated next-generation sequencing data

  • Xiang Lu 1 , 2   na1 ,
  • Ziyuan Dai 3   na1 ,
  • Jiaxin Xue 2   na1 ,
  • Wang Li 4 ,
  • Ping Ni 4 ,
  • Juan Xu 4 ,
  • Chenglin Zhou 4 &
  • Wen Zhang 1 , 2 , 4  

BMC Genomics, volume 25, Article number: 517 (2024)


Background

Like all other species, fungi are susceptible to infection by viruses. The diversity of fungal viruses has been rapidly expanding in recent years due to the availability of advanced sequencing technologies. However, compared to other virome studies, the research on fungi-associated viruses remains limited.

Results

In this study, we downloaded and analyzed over 200 public datasets from approximately 40 different Bioprojects to explore potential fungal-associated viral dark matter. A total of 12 novel viral sequences were identified, all of which are RNA viruses, with lengths ranging from 1,769 to 9,516 nucleotides. The amino acid sequence identity of all these viruses with any known virus is below 70%. Through phylogenetic analysis, these RNA viruses were classified into different orders or families, such as Mitoviridae , Benyviridae , Botourmiaviridae , Deltaflexiviridae , Mymonaviridae , Bunyavirales , and Partitiviridae . It is possible that these sequences represent new taxa at the level of family, genus, or species. Furthermore, a co-evolution analysis indicated that the evolutionary history of these viruses within their groups is largely driven by cross-species transmission events.

Conclusions

These findings are of significant importance for understanding the diversity, evolution, and relationships between genome structure and function of fungal viruses. However, further investigation is needed to study their interactions.


Introduction

Viruses are among the most abundant and diverse biological entities on Earth; they are ubiquitous in the natural environment but difficult to culture and detect [ 1 , 2 , 3 ]. In recent decades, major advances in omics have transformed the field of virology, enabling researchers to detect potential viruses in a wide variety of environmental samples, expanding the known diversity of viruses, and probing the viral “dark matter” that may exist in vast quantities [ 4 ]. In most cases, the hosts of these newly discovered viruses show only asymptomatic infections [ 5 , 6 ], and the viruses may even play an important role in maintaining the balance, stability, and sustainable development of the biosphere [ 7 ]. Some viruses, however, are involved in the emergence and development of animal or plant diseases. For example, tobacco mosaic virus (TMV) causes poor growth in tobacco plants, while norovirus is known to cause diarrhea in mammals [ 8 , 9 ]. In the field of fungal research, viral infections have significantly reduced the yield of edible fungi, drawing increasing attention to fungal diseases caused by viruses [ 10 ]. However, because their relevance to health is less apparent [ 11 ], fungal-associated viruses have been understudied compared to viruses affecting humans, animals, or plants.

Mycoviruses (also known as fungal viruses) are widely distributed in various fungi and fungus-like organisms [ 12 ]. The first mycoviruses were discovered in the 1960s by Hollings M in the basidiomycete Agaricus bisporus , an edible cultivated mushroom [ 13 ]. Shortly thereafter, Ellis LF et al. reported mycoviruses in the ascomycete Penicillium stoloniferum , confirming that viral dsRNA is responsible for interferon stimulation in mammals [ 13 , 14 , 15 ]. In recent years, the diversity of known mycoviruses has increased rapidly with the development and widespread application of sequencing technologies [ 16 , 17 , 18 , 19 , 20 ]. According to the classification principles of the International Committee on Taxonomy of Viruses (ICTV), mycoviruses are currently classified into 24 taxa, consisting of 23 families and 1 genus ( Botybirnavirus ) [ 21 ]. Most mycoviruses are double-stranded (ds) RNA viruses, such as the families Totiviridae , Partitiviridae , Reoviridae , Chrysoviridae , Megabirnaviridae , Quadriviridae , and the genus Botybirnavirus , or positive-sense single-stranded (+ ss) RNA viruses, such as the families Alphaflexiviridae , Gammaflexiviridae , Barnaviridae , Hypoviridae , Endornaviridae , Metaviridae and Pseudoviridae . However, negative-sense single-stranded (-ss) RNA viruses (family Mymonaviridae ) and single-stranded (ss) DNA viruses (family Genomoviridae ) have also been described [ 22 ]. The taxonomy of mycoviruses is continually refined as novel mycoviruses that cannot be classified into any established taxon are identified. While the vast majority of fungus-infecting viruses show no infection characteristics and have no significant impact on their hosts, some mycoviruses inhibit the host phenotype, leading to hypovirulence in phytopathogenic fungi [ 23 ]. The use of environmentally friendly, hypovirulence-associated mycoviruses such as Cryphonectria hypovirus 1 (CHV-1) for biological control has been considered a viable alternative to chemical fungicides [ 24 ]. As research has deepened, an increasing number of mycoviruses that cause fungal phenotypic changes have been identified [ 3 , 23 , 25 ]. Therefore, understanding the distribution of these viruses and their effects on hosts will allow us to determine whether their infections can be prevented and treated.

To explore the viral dark matter hidden within fungi, this study collected over 200 available fungal-associated libraries from approximately 40 Bioprojects in the Sequence Read Archive (SRA) database, uncovering novel RNA viruses within them. We further elucidated the genetic relationships between known viruses and these newfound ones, thereby expanding our understanding of fungal-associated viruses and providing assistance to viral taxonomy.

Materials and methods

Genome assembly

To discover novel fungal-associated viruses, we downloaded 236 available libraries from the SRA database, corresponding to 32 fungal species (Supplementary Table 1). Pfastq-dump v0.1.6 ( https://github.com/inutano/pfastq-dump ) was used to convert SRA files to fastq format. Bowtie2 v2.4.5 [ 26 ] was employed to remove host sequences. Primer sequences were trimmed from the raw reads using Trim Galore v0.6.5 ( https://www.bioinformatics.babraham.ac.uk/projects/trim_galore ), and the resulting files underwent quality control with the options ‘–phred33 –length 20 –stringency 3 –fastqc’. Duplicated reads were marked using PRINSEQ-lite v0.20.4 (-derep 1). All SRA datasets were then assembled with an in-house pipeline. Paired-end reads were assembled using SPAdes v3.15.5 [ 27 ] with the option ‘-meta’, while single-end reads were assembled with MEGAHIT v1.2.9 [ 28 ], both with default parameters. The results were imported into Geneious Prime v2022.0.1 ( https://www.geneious.com ) for sorting and manual confirmation. To reduce false negatives during sequence assembly, unmapped contigs and singlets with a sequence length < 500 nt underwent a further round of semi-automatic assembly, and contigs with a sequence length > 1,500 nt after reassembly were retained. Individual contigs were then used as references for mapping against the raw data using the Low Sensitivity/Fastest setting in Geneious Prime. In addition, a mixed assembly was performed using MEGAHIT in combination with BWA v0.7.17 [ 29 ] to recover unused reads that might correspond to low-abundance contigs.
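The length-based contig triage described above can be sketched as a simple filter; the thresholds come from the text, while the contig names and sequences are invented:

```python
# Sketch of the contig triage: contigs > 1,500 nt after reassembly are
# retained; short pieces (< 500 nt) are queued for another round of
# semi-automatic assembly; the rest are set aside.
MIN_KEEP = 1500    # retain threshold (nt), from the text
MAX_REQUEUE = 500  # short-fragment threshold (nt), from the text

def triage(contigs):
    """Split contigs into (retained, requeue, set_aside) by length."""
    retained = {n: s for n, s in contigs.items() if len(s) > MIN_KEEP}
    requeue = {n: s for n, s in contigs.items() if len(s) < MAX_REQUEUE}
    set_aside = {n: s for n, s in contigs.items()
                 if MAX_REQUEUE <= len(s) <= MIN_KEEP}
    return retained, requeue, set_aside

# Invented example contigs of 2,000, 300 and 900 nt.
contigs = {"c1": "A" * 2000, "c2": "A" * 300, "c3": "A" * 900}
retained, requeue, set_aside = triage(contigs)
```

This mirrors only the length bookkeeping; the actual reassembly and mapping steps are done with SPAdes/MEGAHIT and Geneious Prime as described above.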

Searching for novel viruses in fungal libraries

We identified novel viral sequences in the fungal libraries through a series of steps. First, we established a local viral database, consisting of the non-redundant protein (nr) database downloaded in August 2023 along with IMG/VR v3 [ 30 ], for screening assembled contigs. Contigs labeled as “viruses” and exhibiting less than 70% amino acid (aa) sequence identity with their best database match were imported into Geneious Prime for manual mapping. Putative open reading frames (ORFs) were predicted by Geneious Prime using built-in parameters (minimum size: 100) and subsequently verified by comparison to related viruses. ORF annotations were based on comparisons to the Conserved Domain Database (CDD). The manually examined sequences were subjected to genome clustering using MMseqs2 (-k 0 -e 0.001 –min-seq-id 0.95 -c 0.9 –cluster-mode 0) [ 31 ]. After excluding viruses with high aa sequence identity (> 70%) to known viruses, a dataset of 12 RNA viral sequences was obtained. This non-redundant fungal virus dataset was compared against the local database using the BLASTx program built into DIAMOND v2.0.15 [ 32 ], and significant hits with an E-value < 10^−5 were retained. The coverage of each sequence in all libraries was calculated using the pileup tool in BBMap. Taxonomic identification was conducted using TaxonKit [ 33 ], along with the rma2info program integrated into MEGAN6 [ 34 ]. RNA secondary structures of the novel viruses were predicted using RNA Folding Form V2.3 ( http://www.unafold.org/mfold/applications/rna-folding-form-v2.php ).
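A minimal sketch of the identity/E-value screen, assuming DIAMOND's default BLAST tabular columns (query, subject, percent identity, E-value, among others); the rows below are invented:

```python
# Sketch of the novelty screen: keep hits with E-value < 1e-5, then flag
# queries whose best-hit identity is below 70% as candidate novel viruses.
# Rows mimic (qseqid, sseqid, pident, evalue) from tabular BLASTx output.
rows = [
    ("contig1", "known_rdrp_A", 51.47, 1e-40),
    ("contig2", "known_rdrp_B", 92.10, 1e-80),
    ("contig3", "known_rdrp_C", 34.68, 2e-3),  # fails the E-value cut-off
]

significant = [r for r in rows if r[3] < 1e-5]

# Best hit per query by percent identity.
best_hit = {}
for qseqid, sseqid, pident, evalue in significant:
    if qseqid not in best_hit or pident > best_hit[qseqid][1]:
        best_hit[qseqid] = (sseqid, pident)

# Candidate novel viruses: best-hit identity under the 70% demarcation.
novel = [q for q, (s, pid) in best_hit.items() if pid < 70.0]
```

In the real pipeline the same two thresholds (E-value < 10⁻⁵, identity < 70%) are applied to DIAMOND output before manual mapping in Geneious Prime.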

Phylogenetic analysis

To infer phylogenetic relationships, nucleotide and encoded protein sequences of reference strains from the relevant virus groups were downloaded from the NCBI GenBank database, along with sequences of proposed species pending ratification. Related sequences were aligned using the alignment program within CLC Genomics Workbench 10.0, and the resulting alignment was further optimized using MUSCLE in MEGA-X [ 35 ]. Sites containing more than 50% gaps were temporarily removed from the alignments. Maximum-likelihood (ML) trees were then constructed using IQ-TREE v1.6.12 [ 36 ], with 1,000 bootstrap replicates (-bb 1000) and the ModelFinder function (-m MFP). The Interactive Tree Of Life (iTOL) was used for visualizing and editing the phylogenetic trees [ 37 ]. Color-coded distance matrix analyses between the novel viruses and known viruses were performed with Sequence Demarcation Tool v1.2 [ 38 ].
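The gap-stripping step (removing alignment columns with more than 50% gaps) can be sketched as follows; the toy alignment is invented:

```python
# Sketch of the alignment clean-up: columns whose gap fraction exceeds 50%
# are dropped before tree building.
def strip_gappy_columns(alignment, max_gap_frac=0.5):
    """Drop columns whose gap ('-') fraction exceeds max_gap_frac."""
    n = len(alignment)
    keep = [
        i for i in range(len(alignment[0]))
        if sum(seq[i] == "-" for seq in alignment) / n <= max_gap_frac
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]

aln = ["AC-GT",
       "A--GT",
       "AC-GA"]
cleaned = strip_gappy_columns(aln)  # the all-gap column is removed
```

Columns with only a minority of gaps (here, one gap in three sequences) survive the filter, matching the "more than 50% gaps" rule in the text.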

To illustrate cross-species transmission and co-divergence between viruses and their hosts across different virus groups, we reconciled the co-phylogenetic relationships between these viruses and their hosts. The evolutionary tree and topologies of the hosts involved in this study were obtained from the TimeTree website [ 39 ] by entering their Latin names. Viruses whose hosts could not be identified from the published literature or from information provided by the authors were disregarded. Co-phylogenetic plots (or ‘tanglegrams’) generated with the R package phytools [ 40 ] visually represent the correspondence between host and virus trees, with lines connecting hosts to their respective viruses. The event-based program eMPRess [ 41 ] was employed to determine whether pairs of virus groups and their hosts underwent coevolution. This tool reconciles pairs of phylogenetic trees under the Duplication-Transfer-Loss (DTL) model [ 42 ], employing a maximum parsimony formulation to calculate the cost of each coevolution event. The costs of duplication, host-jumping (transfer), and extinction (loss) events were set to 1.0, while host-virus co-divergence was set to zero, as it was considered the null event.
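The DTL parsimony scoring can be sketched directly from the stated costs (duplication, transfer, and loss at 1.0 each; co-divergence free); the event counts below are invented:

```python
# Sketch of DTL parsimony scoring with the costs stated in the text:
# duplication, transfer (host jump) and loss each cost 1.0; co-divergence
# is the free null event.
COSTS = {"duplication": 1.0, "transfer": 1.0, "loss": 1.0, "codivergence": 0.0}

def reconciliation_cost(events):
    """Total parsimony cost of a reconciliation, given event counts."""
    return sum(COSTS[e] * n for e, n in events.items())

# Invented event counts: a reconciliation dominated by host jumps scores
# higher than one dominated by co-divergence.
jumpy = {"transfer": 5, "loss": 2, "codivergence": 3}
codiv = {"duplication": 1, "codivergence": 9}
```

Under this scoring, a history explained mostly by co-divergence is cheaper than one requiring many host jumps, which is how eMPRess distinguishes coevolution from cross-species transmission.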

Data availability

The data reported in this paper have been deposited in the GenBase in National Genomics Data Center [ 43 ], Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under accession numbers C_AA066339.1-C_AA066350.1 that are publicly accessible at https://ngdc.cncb.ac.cn/genbase . Please refer to Table  1 for details.

Results

Twelve novel RNA viruses associated with fungi

We investigated novel fungi-associated viruses by mining publicly available metagenomic and transcriptomic fungal datasets. In total, we collected 236 datasets, which were categorized into four fungal phyla: Ascomycota (159), Basidiomycota (47), Chytridiomycota (15), and Zoopagomycota (15), corresponding to 20, 8, 2, and 2 different fungal genera, respectively (Supplementary Table 1). A total of 12 sequences containing complete coding sequences (CDS) for RNA-dependent RNA polymerase (RdRp) were identified, ranging in length from 1,769 nt to 9,516 nt. All of these sequences have less than 70% aa identity with RdRp sequences from any currently known virus (ranging from 32.97% to 60.43%), potentially representing novel families, genera, or species (Table 1). Some of the identified sequences were shorter than the reference genomes of RNA viruses, suggesting that they represent partial viral genomes. To exclude the possibility of transient viral infections of the hosts or de novo assembly artefacts, we extracted the nucleotide sequences of the coding regions of these 12 sequences and mapped them against all collected libraries to compute coverage (Supplementary Table 2). The results revealed varying degrees of read matches for these viral genomes across different libraries, spanning different fungal species. Although we only analyzed sequences longer than 1,500 nt, it is worth noting that we also discovered other viral reads in many libraries; however, we were unable to assemble them into sufficiently long contigs, possibly due to library construction strategies or sequencing depth. In any case, this preliminary finding reveals a greater diversity of fungal-associated viruses than previously appreciated.
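Per-base coverage, of the kind BBMap's pileup reports, reduces to counting mapped-read depth at each reference position; a minimal sketch with an invented reference length and read intervals:

```python
# Sketch of mean per-base coverage from mapped-read intervals.
# Reference length and read positions are invented for illustration.
def mean_coverage(ref_len, read_intervals):
    """Average per-base depth over a reference of ref_len bases.

    read_intervals are half-open (start, end) positions of mapped reads.
    """
    depth = [0] * ref_len
    for start, end in read_intervals:
        for i in range(start, min(end, ref_len)):
            depth[i] += 1
    return sum(depth) / ref_len

# Three invented reads mapped to a 10-base reference.
cov = mean_coverage(10, [(0, 5), (3, 10), (8, 10)])
```

In the actual analysis such coverage values, computed per library, support the read-matching results summarized in Supplementary Table 2.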

Positive-sense single-stranded RNA viruses

(i) Mitoviridae

Members of the family Mitoviridae (order Cryppavirales ) are monopartite, linear, positive-sense (+) single-stranded (ss) RNA viruses with genomes of approximately 2.5–2.9 kb [ 44 ], carrying a single long open reading frame (ORF) that encodes a putative RdRp. Mitoviruses have no true virions and no structural proteins; the viral genome is transmitted horizontally through mating or vertically from mother to daughter cells [ 45 ]. They use mitochondria as their sites of replication and have typical 5' and 3' untranslated regions (UTRs) of varying sizes, which are responsible for viral translation and replicase recognition [ 46 ]. According to the taxonomic principles of the ICTV, the family Mitoviridae is divided into four genera: Duamitovirus , Kvaramitovirus , Triamitovirus and Unuamitovirus . In this study, two novel viruses belonging to the family Mitoviridae were identified in the same library (SRR12744489; species: Thielaviopsis ethacetica ), named Thielaviopsis ethacetica mitovirus 1 (TeMV01) and Thielaviopsis ethacetica mitovirus 2 (TeMV02) (Fig. 1A). The genome of TeMV01 spans 2,689 nucleotides with a GC content of 32.2%; its 5' and 3' UTRs comprise 406 nt and 36 nt, respectively. The genome of TeMV02 extends 3,087 nucleotides with a GC content of 32.6%; its 5' and 3' UTRs consist of 553 nt and 272 nt, respectively. The 5' and 3' ends of both genomes are predicted to form typical stem-loop structures (Fig. 1B). To determine the evolutionary relationship between these two mitoviruses and other known mitoviruses, a phylogenetic analysis based on RdRp was performed; the viral strains fell into two genetic lineages within the genera Duamitovirus and Unuamitovirus (Fig. 1C). In the genus Unuamitovirus , TeMV01 clustered with Ophiostoma mitovirus 4, exhibiting the highest aa identity of 51.47%, while in the genus Duamitovirus , TeMV02 clustered with a strain isolated from Plasmopara viticola , sharing the highest aa identity of 42.82%. The ICTV guidelines for the taxonomy of the family Mitoviridae establish a species demarcation cutoff of < 70% aa sequence identity [ 47 ]. Based on this recommendation and the phylogenetic inferences, these two viral strains can be presumed to be novel viral species [ 48 ].
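The GC-content figures reported for TeMV01 and TeMV02 are straightforward to compute; a minimal sketch (the sequence below is a short stand-in, not the actual genome):

```python
# Sketch of the GC-content bookkeeping reported for TeMV01/TeMV02.
def gc_content(seq):
    """GC fraction of a nucleotide sequence, as a percentage (one decimal)."""
    seq = seq.upper()
    return round(100 * (seq.count("G") + seq.count("C")) / len(seq), 1)

demo = "ATGCATATATGCGCAATTAT"  # invented 20 nt stand-in
pct = gc_content(demo)
```

Applied to the full 2,689 nt and 3,087 nt genomes, the same calculation yields the 32.2% and 32.6% values given above.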

Figure 1

Identification of novel positive-sense single-stranded RNA viruses in fungal sequencing libraries. A: Genome organization of the two novel mitoviruses; the putative ORF for the viral RdRp is depicted by a green box, and the predicted conserved domain region is displayed in a gray box. B: Predicted RNA secondary structures of the 5'- and 3'-terminal regions. C: ML phylogenetic tree of members of the family Mitoviridae . The best-fit model (LG + F + R6) was estimated using IQ-TREE model selection. Bootstrap values are shown at each branch, with the newly identified viruses in red font. D: The genome organization of GtBeV is depicted at the top; in the middle is the ML phylogenetic tree of members of the family Benyviridae . The best-fit model (VT + F + R5) was estimated using IQ-TREE model selection. At the bottom is the distance matrix analysis of GtBeV identified in Gaeumannomyces tritici , a pairwise sequence comparison produced with the RdRp amino acid sequences within the ML tree. E: The genome organization of CrBV is depicted at the top; in the middle is the ML phylogenetic tree of members of the family Botourmiaviridae . The best-fit model (VT + F + R5) was estimated using IQ-TREE model selection. At the bottom is the distance matrix analysis of CrBV identified in Clonostachys rosea , a pairwise sequence comparison produced with the RdRp amino acid sequences within the ML tree

(ii) Benyviridae

The family Benyviridae comprises multipartite plant viruses with rod-shaped virions approximately 85–390 nm in length and 20 nm in diameter. The family contains a single genus, Benyvirus [ 49 ]. One species within this genus, Beet necrotic yellow vein virus, causes the widespread and highly destructive soil-borne ‘rhizomania’ disease of sugar beet [ 50 ]. A full-length RNA1 sequence related to Benyviridae was detected in Gaeumannomyces tritici (ERR3486062), with a length of 6,479 nt. It possesses a poly(A) tail at the 3' end and is provisionally designated Gaeumannomyces tritici benyvirus (GtBeV). BLASTx results indicate 34.68% aa sequence identity with the best match found (Fig. 1D). The non-structural polyprotein CDS of RNA1 encodes a large replication-associated protein of 1,688 amino acids with a molecular mass of 190 kDa. Four domains corresponding to representative species of the family Benyviridae were predicted in this polyprotein: the viral methyltransferase (Mtr) domain spans nucleotide positions 386 to 1411, the RNA helicase (Hel) domain occupies positions 2113 to 2995, the protease (Pro) domain lies between positions 3142 and 3410, and the RdRp domain is located at positions 4227 to 4796. A phylogenetic analysis integrating RdRp sequences of viruses closely related to GtBeV revealed that GtBeV clusters within the family Benyviridae while exhibiting substantial evolutionary divergence from all other sequences. Consequently, this virus likely represents a novel species in the family Benyviridae .

(iii) Botourmiaviridae

The family Botourmiaviridae comprises viruses infecting plants and filamentous fungi, with mono- or multi-segmented genomes [ 51 ]. Recent research has rapidly expanded the family, from 4 confirmed genera in 2020 to a total of 12. A contig identified in Clonostachys rosea (ERR5928658) by BLASTx exhibited similarity to viruses of the family Botourmiaviridae . After manual mapping, a 2,903 nt genome containing a complete RdRp region was obtained, tentatively named Clonostachys rosea botourmiavirus (CrBV) (Fig. 1E). In a phylogenetic analysis based on RdRp, CrBV clustered with members of the genus Magoulivirus , sharing 56.58% aa identity with a strain identified from Eclipta prostrata . Puzzlingly, however, although the ICTV demarcation criteria state that members of different genera/species within the family Botourmiaviridae share less than 70%/90% identity in their complete RdRp amino acid sequences, the RdRp sequences with accession numbers NC_055143 and NC_076766, both considered members of the genus Magoulivirus , exhibit only 39.05% aa identity to each other. Therefore, CrBV should at least be considered a new species within the family Botourmiaviridae .

(iv) Deltaflexiviridae

An assembled sequence of 3,425 nucleotides, named Lepista sordida deltaflexivirus (LsDV), was obtained from Lepista sordida (DRR252167) and shows homology to the family Deltaflexiviridae within the order Tymovirales . The Tymovirales comprises five recognized families: Alphaflexiviridae , Betaflexiviridae , Deltaflexiviridae , Gammaflexiviridae , and Tymoviridae [ 52 ]. The Deltaflexiviridae currently includes only one genus, Deltaflexivirus ; its members are mostly identified in fungi or plant pathogens [ 53 ]. LsDV was predicted to have a single large ORF, VP1, which starts with an AUG codon at nt 163–165 and ends with a UAG codon at nt 3,418–3,420. This ORF encodes a putative polyprotein of 1,086 aa with a calculated molecular mass of 119 kDa. Two conserved domains, Hel and RdRp, were identified within the VP1 protein (Fig. 2A); the Mtr domain was missing, indicating that the 5' end of this polyprotein is incomplete. In the phylogenetic analysis of RdRp, LsDV was closely related to viruses of the family Deltaflexiviridae and shared 46.61% aa identity with a strain (UUW06602) isolated from Macrotermes carbonarius . Nevertheless, under the species demarcation criteria proposed by the ICTV, LsDV cannot yet be regarded as a novel species because the entire replication-associated polyprotein could not be recovered.

Figure 2

Identification of novel members of family Deltaflexiviridae and Toga-like virus in fungal sequencing libraries. A On the right side of the image is the genome organization of LsDV; the putative ORF for the viral RdRp is depicted by a green box, and the predicted conserved domain region is displayed in a gray box. ML phylogenetic tree of members of the family Deltaflexiviridae . The best-fit model (VT + F + R6) was estimated using IQ-Tree model selection. The bootstrap value is shown at each branch, with the newly identified virus represented in red font. B The genome organization of GtTlV is depicted at the top; the putative ORF for the viral RdRp is depicted by a green box, and the predicted conserved domain region is displayed in a gray box. ML phylogenetic tree of members of the order Martellivirales . The best-fit model (LG + R7) was estimated using IQ-Tree model selection. The bootstrap value is shown at each branch, with the newly identified virus represented in red font

(v) Toga-like virus

Members of the family Togaviridae are primarily transmitted by arthropods and can infect a wide range of vertebrates, including mammals, birds, reptiles, amphibians, and fish [ 54 ]. Currently, this family contains only a single confirmed genus, Alphavirus . A contig of 7,588 nt was discovered in Gaeumannomyces tritici (ERR3486058), with a complete ORF encoding a putative protein of 1,928 aa that shared 60.43% identity (97% coverage) with Fusarium sacchari alphavirus-like virus 1 (QIQ28421). Phylogenetic analysis showed that it did not cluster with classical alphavirus members such as the VEE, WEE, EEE, and SF complexes [ 54 ], but rather with several available sequences annotated as toga-like (Fig.  2 B). It was provisionally named Gaeumannomyces tritici toga-like virus (GtTlV). However, we remain cautious about the accuracy of these so-called toga-like sequences, as they show little significant similarity to members of the order Martellivirales .

Negative-sense single-stranded RNA viruses

(i) Mymonaviridae

Mymonaviridae is a family of fungus-infecting viruses with linear, enveloped, negative-stranded RNA genomes in the order Mononegavirales . Their genomes are approximately 10 kb in size and encode six proteins [ 55 ]. The family Mymonaviridae was established to accommodate Sclerotinia sclerotiorum negative-stranded RNA virus 1 (SsNSRV-1), a novel virus discovered in a hypovirulent strain of Sclerotinia sclerotiorum [ 56 ]. According to the ICTV, the family currently includes 9 genera: Auricularimonavirus , Botrytimonavirus , Hubramonavirus , Lentimonavirus , Penicillimonavirus , Phyllomonavirus , Plasmopamonavirus , Rhizomonavirus and Sclerotimonavirus . Two sequences associated with the family Mymonaviridae , originating from Gaeumannomyces tritici (ERR3486068) and Aspergillus puulaauensis (DRR266546), respectively, were identified and provisionally named Gaeumannomyces tritici mymonavirus (GtMV) and Aspergillus puulaauensis mymonavirus (ApMV). GtMV is 9,339 nt long with a GC content of 52.8%. It was predicted to contain 5 discontinuous ORFs, the largest of which encodes the RdRp; a nucleoprotein and three hypothetical proteins of unknown function were also predicted. A multiple alignment of the nucleotide sequences flanking these ORFs identified a semi-conserved sequence, 5'-UAAAA-CUAGGAGC-3', located downstream of each ORF (Fig.  3 A). These regions are likely gene-junction regions of the GtMV genome, a characteristic feature shared by mononegaviruses [ 57 , 58 ]. For ApMV, a complete CDS encoding a 1,978 aa RdRp was predicted. BLASTx searches showed that GtMV shared 45.22% identity with the RdRp of Soybean leaf-associated negative-stranded RNA virus 2 (YP_010784557), while ApMV shared 55.90% identity with the RdRp of Erysiphe necator associated negative-stranded RNA virus 23 (YP_010802816). Representative members of the family Mymonaviridae were included in the phylogenetic analysis. The results showed that GtMV and ApMV clustered closely with members of the genera Sclerotimonavirus and Plasmopamonavirus , respectively (Fig.  3 B). Members of the genus Plasmopamonavirus are about 6 kb in size and encode a single protein. GtMV and ApMV should therefore be considered as representing new species within their respective genera.
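Gene-junction elements like the semi-conserved 5'-UAAAA-CUAGGAGC-3' motif can be located with a simple pattern scan. The fragment below is a made-up illustration, not the actual GtMV genome, and allowing a short variable spacer between the two blocks is our assumption:

```python
import re

# Hypothetical RNA fragment for illustration only.
genome = "GCAUAAAACUAGGAGCCCGUUAAAAGGCUAGGAGCAU"

# Semi-conserved gene-junction motif: a UAAAA block followed, after a
# short variable spacer (0-5 nt), by CUAGGAGC.
junctions = [m.start() for m in re.finditer(r"UAAAA.{0,5}?CUAGGAGC", genome)]
print(junctions)  # 0-based start positions of candidate junctions
```

In practice the motif would be scanned in the intergenic regions downstream of each predicted ORF, as described above.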

Figure 3

Identification of two new members in the family Mymonaviridae . A At the top is the nucleotide multiple sequence alignment of GtMV with the reference genomes. The putative ORF for the viral RdRp is depicted by a green box, the predicted nucleoprotein is displayed in a yellow box, and three hypothetical proteins are displayed in gray boxes. The comparison of putative semi-conserved regions between ORFs in GtMV is displayed in the 5' to 3' orientation, with conserved sequences highlighted. At the bottom is the genome organization of ApMV; the putative ORF for the viral RdRp is depicted by a green box. B ML phylogenetic tree of members of the family Mymonaviridae . The best-fit model (LG + F + R6) was estimated using IQ-Tree model selection. The bootstrap value is shown at each branch, with the newly identified viruses represented in red font

(ii) Bunyavirales

The Bunyavirales (the only order in the class Ellioviricetes ) is one of the largest groups of segmented negative-sense single-stranded RNA viruses, with mainly tripartite genomes [ 59 ]. The order includes many pathogenic strains that infect arthropods (such as mosquitoes, ticks, and sand flies), plants, protozoans, and vertebrates, and some even cause severe human diseases. The order Bunyavirales consists of 14 viral families: Arenaviridae , Cruliviridae , Discoviridae , Fimoviridae , Hantaviridae , Leishbuviridae , Mypoviridae , Nairoviridae , Peribunyaviridae , Phasmaviridae , Phenuiviridae , Tospoviridae , Tulasviridae and Wupedeviridae . In this study, three complete or near-complete RNA1 sequences related to bunyaviruses were identified and named after their respective hosts: CoBV ( Conidiobolus obscurus bunyavirus; SRR6181013; 7,277 nt), GtBV ( Gaeumannomyces tritici bunyavirus; ERR3486069; 7,364 nt), and TaBV ( Thielaviopsis aethacetica bunyavirus; SRR12744489; 9,516 nt) (Fig.  4 A). The 5' and 3' termini of the GtBV and TaBV RNA segments are complementary, allowing the formation of a panhandle structure [ 60 ] that plays an essential role as a promoter of genome transcription and replication [ 61 ]; for CoBV this could not be assessed, as its 3' terminus was not fully obtained (Fig.  4 B). BLASTx results indicated that these three viruses had identities ranging from 32.97% to 54.20% to their best matches in the GenBank database. Phylogenetic analysis placed CoBV in the family Phasmaviridae , though distant from all of its genera; GtBV clustered well with members of the genus Entovirus of the family Phenuiviridae ; and TaBV did not cluster with any known members of families within Bunyavirales and was therefore provisionally placed within a Bunya-like group (Fig.  4 C). These three sequences should thus be considered as representing potential new families, genera, or species within the order Bunyavirales .
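The panhandle test described above amounts to asking whether the 5' terminus base-pairs with the reverse complement of the 3' terminus. A minimal sketch (the toy sequence and the 8 nt window are our assumptions; the real GtBV/TaBV termini are not reproduced here):

```python
def revcomp_rna(seq: str) -> str:
    """Reverse complement of an RNA sequence."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def forms_panhandle(genome: str, n: int = 8) -> bool:
    """True if the first n bases can pair with the last n bases,
    i.e. the termini are complementary and a panhandle can form."""
    return genome[:n] == revcomp_rna(genome[-n:])

# Toy genome with complementary termini (internal sequence masked as N).
toy = "ACACAAAG" + "N" * 20 + "CUUUGUGU"
print(forms_panhandle(toy))
```

A real analysis would also tolerate a few mismatches or G-U wobble pairs, which this strict string comparison does not.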

Figure 4

Identification of three new members in the order Bunyavirales . A The genome organization of CoBV, GtBV, and TaBV; the putative ORF for the viral RdRp is depicted by a green box, and the predicted conserved domain region is displayed in a gray box. B The complementary structures formed at the 5' and 3' ends of GtBV and TaBV. C ML phylogenetic tree of members of the order Bunyavirales . The best-fit model (VT + F + R8) was estimated using IQ-Tree model selection. The bootstrap value is shown at each branch, with the newly identified viruses represented in red font

Double-stranded RNA viruses

Partitiviridae

The Partitiviridae is a family of small, non-enveloped viruses, approximately 35–40 nm in diameter, with bisegmented double-stranded (ds) RNA genomes. Each segment is about 1.4–3.0 kb in size, giving a total genome of about 4 kb [ 62 ]. The family is currently divided into five genera: Alphapartitivirus , Betapartitivirus , Cryspovirus , Deltapartitivirus and Gammapartitivirus . Each genus has characteristic hosts: plants or fungi for Alphapartitivirus and Betapartitivirus , fungi for Gammapartitivirus , plants for Deltapartitivirus , and protozoa for Cryspovirus [ 62 ]. A complete dsRNA1 sequence associated with the family Partitiviridae , named Neocallimastix californiae partitivirus (NcPV), was retrieved from Neocallimastix californiae (SRR15362281). BLASTp indicated that it shared its highest aa identity, 41.5%, with members of the genus Gammapartitivirus , and the phylogenetic tree constructed based on RdRp confirmed that NcPV falls within this genus (Fig.  5 ). Typical members of the genus Gammapartitivirus have two genome segments, dsRNA1 and dsRNA2, encoding the RdRp and coat protein, respectively [ 62 ]. The dsRNA1 segment of NcPV is 1,769 nt long, with a GC content of 35.8%, and contains a single ORF encoding a 561 aa RdRp. A CDD search revealed that the RdRp of NcPV harbors a catalytic region spanning residues 119 to 427. Regrettably, only the dsRNA1 segment was obtained in full; lacking dsRNA2, we are unable under the ICTV classification principles to propose NcPV as a new species. It is worth noting that, according to the genus demarcation criteria ( https://ictv.global/report/chapter/partitiviridae/partitiviridae ), members of the genus Gammapartitivirus should have a dsRNA1 of 1,645–1,787 nt and an RdRp of 519–539 aa. While the dsRNA1 of NcPV (1,769 nt) falls within this range, its RdRp (561 aa) exceeds the upper limit, challenging this classification criterion. In fact, multiple strains already exceed this criterion, such as those with GenBank accession numbers WBW48344, UDL14336, and QKK35392.
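The comparison of NcPV against the Gammapartitivirus demarcation ranges quoted above reduces to two interval checks (the function and dictionary keys are illustrative, using the values given in the text):

```python
def within_gamma_criteria(dsrna1_nt: int, rdrp_aa: int) -> dict:
    """Check a candidate against the Gammapartitivirus genus criteria
    quoted in the text: dsRNA1 of 1,645-1,787 nt and RdRp of 519-539 aa."""
    return {
        "dsRNA1_ok": 1645 <= dsrna1_nt <= 1787,
        "RdRp_ok": 519 <= rdrp_aa <= 539,
    }

# NcPV: dsRNA1 of 1,769 nt, RdRp of 561 aa
print(within_gamma_criteria(1769, 561))
```

Only the RdRp length fails the check, which is the conflict with the current demarcation criteria discussed above.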

Figure 5

Identification of a new member in the family Partitiviridae . The genome organization of NcPV is depicted at the top; the putative ORF for the viral RdRp is depicted by a green box, and the predicted conserved domain region is displayed in a gray box. At the bottom is the ML phylogenetic tree of members of the family Partitiviridae . The best-fit model (VT + F + R4) was estimated using IQ-Tree model selection. The bootstrap value is shown at each branch, with the newly identified virus represented in red font

Long-term evolutionary relationships between fungal-associated viruses and hosts

Understanding the co-divergence history between viruses and their hosts helps reveal patterns of virus transmission and infection and influences the biodiversity and stability of ecosystems. To explore the frequency of cross-species transmission and co-divergence among fungus-associated viruses, we constructed tanglegrams linking the phylogenetic trees of viral families and their respective hosts (Fig.  6 A). The results indicated that cross-species transmission (host jumping) was consistently the most frequent evolutionary event among all groups of RNA viruses examined in this study (median, 66.79%; range, 60.00% to 79.07%) (Fig.  6 B). This finding is highly consistent with the evolutionary patterns of RNA viruses recently identified by Mifsud et al. in their extensive transcriptome survey of plants [ 63 ]. Members of the families Botourmiaviridae (79.07%) and Deltaflexiviridae (72.41%) were most frequently involved in cross-species transmission. Co-divergence (median, 20.19%; range, 6.98% to 27.78%), duplication (median, 10.60%; range, 0% to 22.45%), and extinction (median, 2.42%; range, 0% to 5.56%) events were progressively less frequent in the evolution of fungus-associated viruses. Specifically, members of the family Benyviridae exhibited the highest frequency of co-divergence events, which also supports the findings reported by Mifsud et al.; some studies propose that members of Benyviridae are transmitted via zoospores of plasmodiophorid protists [ 64 ], and it is speculated that the ancestor of these viruses underwent interkingdom horizontal transfer between plants and protists over evolutionary time [ 65 ]. Members of the family Mitoviridae showed the highest frequency of duplication events, while members of the families Benyviridae and Partitiviridae showed the highest frequency of extinction events.
Not surprisingly, these results are shaped by our currently limited understanding of virus-host relationships. On the one hand, viruses whose hosts could not be recognized from the published literature or from information provided by authors were overlooked. On the other hand, the viruses recorded in reference databases represent just the tip of the iceberg of the entire virosphere. Larger sample sizes in the future are likely to reshape this evolutionary landscape.
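Reconciliation summaries like those in Fig. 6 B are proportions of inferred event counts per viral group. A minimal sketch with hypothetical counts (not the paper's actual data):

```python
def event_frequencies(events: dict) -> dict:
    """Convert raw cophylogeny reconciliation event counts into
    percentages, as in eMPRess-style summaries."""
    total = sum(events.values())
    return {k: round(100 * v / total, 2) for k, v in events.items()}

# Hypothetical counts for one viral family (illustration only)
counts = {"host_jump": 34, "codivergence": 10, "duplication": 5, "extinction": 1}
print(event_frequencies(counts))
```

Repeating this per family and taking the median and range across families yields summary statistics of the kind reported above.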

Figure 6

Co-evolutionary analysis of virus and host. A Tanglegram of phylogenetic trees for virus orders/families and their hosts. Lines and branches are color-coded to indicate host clades. The cophylo function in phytools was employed to enhance congruence between the host (left) and virus (right) phylogenies. B Reconciliation analysis of virus groups. The bar chart illustrates the proportional range of possible evolutionary events, with the frequency of each event displayed at the top of its respective column

Our understanding of the interactions between fungi and their associated viruses has long been constrained by insufficient sampling of fungal species. Advances in metagenomics in recent decades have led to a rapid expansion of the known viral sequence space, but it is far from saturated. The diversity of hosts, the instability of viral genomes (especially those of RNA viruses), and the propensity to exchange genetic material with other viruses of the same host all contribute to the unparalleled diversity of viral genomes [ 66 ]. Fungi are diverse, widely distributed in nature, and closely tied to human life. A few fungi can parasitize immunocompromised humans, but their adverse effects are limited. As decomposers in the food chain, fungi break down the remains of plants and animals and maintain the cycling of matter in the living world [ 67 ]. In agricultural production, many fungi are plant pathogens; about 80% of plant diseases are caused by fungi. However, little is currently known about the diversity of mycoviruses or about how these viruses affect fungal phenotypes, fungus-host interactions, and virus evolution, and the sequencing depth of the fungal libraries in most public databases only meets the needs of studying bacterial genomes. Sampling viruses from a greater diversity of fungal hosts should lead to new and improved evolutionary scenarios.

RNA viruses are widespread in deep-sea sediments [ 68 ], freshwater [ 69 ], sewage [ 70 ], and rhizosphere soils [ 71 ]. Compared to DNA viruses, RNA viruses are less conserved, prone to mutation, and able to transfer between different hosts, potentially giving rise to highly divergent, unrecognized novel viruses; these characteristics make them harder to monitor. All mycoviruses discovered early on were RNA viruses, until 2010, when Yu et al. reported the first DNA virus in fungi, SsHADV-1 [ 72 ]. Since then, new fungus-associated DNA viruses have continually been identified [ 73 , 74 , 75 ]. Viruses have now been found in all major groups of fungi, and approximately 100 fungal species are known to be infected by viruses; in some instances one virus can infect multiple fungi, or one fungus can be infected by several viruses simultaneously. The transmission of mycoviruses differs from that of animal and plant viruses and is mainly categorized into vertical and horizontal transmission [ 76 ]. Vertical transmission refers to the spread of a mycovirus to the next generation through the sexual or asexual spores of the fungus, while horizontal transmission refers to the spread of a mycovirus from one strain to another through hyphal fusion. In the phylum Ascomycota , mycoviruses generally show a low ability to transmit vertically through ascospores but are commonly transmitted vertically to progeny strains through asexual spores [ 77 ].

In this study, we identified two novel species belonging to different genera within the family Mitoviridae . Interestingly, both simultaneously infect the same fungus, Thielaviopsis ethacetica , the causal agent of pineapple sett rot disease in sugarcane [ 78 ]. A previous report likewise identified three different mitoviruses in Fusarium circinatum [ 79 ]. These findings suggest that there may be a certain level of adaptability or symbiosis among members of the family Mitoviridae . Benyviruses are typically considered to infect plants, but recent evidence suggests that they can also infect fungi, such as Agaricus bisporus [ 80 ]; the virus we discovered in Gaeumannomyces tritici further reinforces this. Moreover, members of the family Botourmiaviridae commonly exhibit a broad host range, with viruses closely related to CrBV capable of infecting members of Eukaryota , Viridiplantae , and Metazoa , in addition to fungi (Supplementary Fig. 1). The LsDV identified in this study shared the closest phylogenetic relationship with a virus identified from Macrotermes carbonarius in southern Vietnam (17_N1 + N237) [ 81 ]. M. carbonarius is an open-air foraging species that collects plant litter and wood debris to cultivate fungi in fungal gardens [ 82 ]; termites may thus act as vectors, transmitting deltaflexiviruses to other fungi. Furthermore, the viruses we identified, though typically associated with fungi, also deepen the connections with species from other kingdoms on the tanglegram tree. For example, while Partitiviridae are naturally associated with fungi and plants, NcPV also shows close connections with Metazoa . Indeed, based largely on phylogenetic predictions, various eukaryotic viruses have been found to undergo horizontal transfer among plants, fungi, and animals [ 83 ]. The rice dwarf virus has been demonstrated to infect both plants and insect vectors [ 84 ]; plant-infecting rhabdoviruses, tospoviruses, and tenuiviruses are now known to replicate and spread in vector insects and to shuttle between plants and animals [ 85 ]. Furthermore, Bian et al. demonstrated that plant virus infection enables Cryphonectria hypovirus 1 to undergo horizontal transfer from fungi to plants and to other heterologous fungal species [ 86 ].

Recent studies have greatly expanded the known diversity of mycoviruses [ 87 , 88 ]. Gilbert et al. [ 20 ] investigated publicly available fungal transcriptomes from the subphylum Pezizomycotina and detected 52 novel mycoviruses; Myers et al. [ 18 ] employed both culture-based and transcriptome-mining approaches to identify 85 unique RNA viruses across 333 fungi; Ruiz-Padilla et al. [ 19 ] identified 62 new mycoviral species from 248 Botrytis cinerea field isolates; and Zhou et al. [ 89 ] identified 20 novel viruses from 90 fungal strains across four macrofungus species. Compared to these studies, our work identified fewer novel viruses, possibly for the following reasons: 1) libraries from the same BioProject usually derive from the same strains (or isolates), so the datasets collected for this study contain a certain degree of redundancy; 2) contigs shorter than 1,500 nt were discarded, potentially overlooking short viral molecules; 3) the threshold of 70% aa sequence identity may also have excluded certain viruses; and 4) poly(A)-enriched RNA-seq libraries are likely to miss non-polyadenylated RNA viral genomes.

Taxonomy is a dynamic science, evolving with improvements in analytical methods and the emergence of new data. Identifying and rectifying incorrect classifications when new information becomes available is an ongoing and inevitable process in today's rapidly expanding field of virology. For instance, in 1975, members of the genera Rubivirus and Alphavirus were initially grouped under the family Togaviridae ; however, in 2019, Rubivirus was reclassified into the family Matonaviridae due to recognized differences in transmission modes and virion structures [ 90 ]. Additionally, the conflicts between certain members of the genera Magoulivirus and Gammapartitivirus mentioned here and their current demarcation criteria (e.g., amino acid identity, nucleotide length thresholds) need to be reconsidered.

Taken together, these findings reveal the potential diversity and novelty of fungus-associated viral communities and highlight the genetic similarities among different fungus-associated viruses. They advance our understanding of fungus-associated viruses and underscore the importance of subsequent in-depth investigations into fungus-virus interactions, which will shed light on the important roles of these viruses in the global fungal kingdom.

Availability of data and materials

The data reported in this paper have been deposited in GenBase at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, under accession numbers C_AA066339.1–C_AA066350.1, and are publicly accessible at https://ngdc.cncb.ac.cn/genbase . Please refer to Table  1 for details.

Leigh DM, Peranic K, Prospero S, Cornejo C, Curkovic-Perica M, Kupper Q, et al. Long-read sequencing reveals the evolutionary drivers of intra-host diversity across natural RNA mycovirus infections. Virus Evol. 2021;7(2):veab101. https://doi.org/10.1093/ve/veab101 . Epub 2022/03/19 PubMed PMID: 35299787; PubMed Central PMCID: PMCPMC8923234.

Ghabrial SA, Suzuki N. Viruses of plant pathogenic fungi. Annu Rev Phytopathol. 2009;47:353–84. https://doi.org/10.1146/annurev-phyto-080508-081932 . Epub 2009/04/30 PubMed PMID: 19400634.

Ghabrial SA, Caston JR, Jiang D, Nibert ML, Suzuki N. 50-plus years of fungal viruses. Virology. 2015;479–480:356–68. https://doi.org/10.1016/j.virol.2015.02.034 . Epub 2015/03/17 PubMed PMID: 25771805.

Chen YM, Sadiq S, Tian JH, Chen X, Lin XD, Shen JJ, et al. RNA viromes from terrestrial sites across China expand environmental viral diversity. Nat Microbiol. 2022;7(8):1312–23. https://doi.org/10.1038/s41564-022-01180-2 . Epub 2022/07/29 PubMed PMID: 35902778.

Pearson MN, Beever RE, Boine B, Arthur K. Mycoviruses of filamentous fungi and their relevance to plant pathology. Mol Plant Pathol. 2009;10(1):115–28. https://doi.org/10.1111/j.1364-3703.2008.00503.x . Epub 2009/01/24 PubMed PMID: 19161358; PubMed Central PMCID: PMCPMC6640375.

Santiago-Rodriguez TM, Hollister EB. Unraveling the viral dark matter through viral metagenomics. Front Immunol. 2022;13:1005107. https://doi.org/10.3389/fimmu.2022.1005107 . Epub 2022/10/04 PubMed PMID: 36189246; PubMed Central PMCID: PMCPMC9523745.

Srinivasiah S, Bhavsar J, Thapar K, Liles M, Schoenfeld T, Wommack KE. Phages across the biosphere: contrasts of viruses in soil and aquatic environments. Res Microbiol. 2008;159(5):349–57. https://doi.org/10.1016/j.resmic.2008.04.010 . Epub 2008/06/21 PubMed PMID: 18565737.

Guo W, Yan H, Ren X, Tang R, Sun Y, Wang Y, et al. Berberine induces resistance against tobacco mosaic virus in tobacco. Pest Manag Sci. 2020;76(5):1804–13. https://doi.org/10.1002/ps.5709 . Epub 2019/12/10 PubMed PMID: 31814252.

Villabruna N, Izquierdo-Lara RW, Schapendonk CME, de Bruin E, Chandler F, Thao TTN, et al. Profiling of humoral immune responses to norovirus in children across Europe. Sci Rep. 2022;12(1):14275. https://doi.org/10.1038/s41598-022-18383-6 . Epub 2022/08/23 PubMed PMID: 35995986.

Zhang Y, Gao J, Li Y. Diversity of mycoviruses in edible fungi. Virus Genes. 2022;58(5):377–91. https://doi.org/10.1007/s11262-022-01908-6 . Epub 2022/06/07 PubMed PMID: 35668282.

Shkoporov AN, Clooney AG, Sutton TDS, Ryan FJ, Daly KM, Nolan JA, et al. The human gut virome is highly diverse, stable, and individual specific. Cell Host Microbe. 2019;26(4):527–41. https://doi.org/10.1016/j.chom.2019.09.009 . Epub 2019/10/11 PubMed PMID: 31600503.

Botella L, Janousek J, Maia C, Jung MH, Raco M, Jung T. Marine Oomycetes of the Genus Halophytophthora harbor viruses related to Bunyaviruses. Front Microbiol. 2020;11:1467. https://doi.org/10.3389/fmicb.2020.01467 . Epub 2020/08/08 PubMed PMID: 32760358; PubMed Central PMCID: PMCPMC7375090.

Kotta-Loizou I. Mycoviruses and their role in fungal pathogenesis. Curr Opin Microbiol. 2021;63:10–8. https://doi.org/10.1016/j.mib.2021.05.007 . Epub 2021/06/09 PubMed PMID: 34102567.

Ellis LF, Kleinschmidt WJ. Virus-like particles of a fraction of statolon, a mould product. Nature. 1967;215(5101):649–50. https://doi.org/10.1038/215649a0 . Epub 1967/08/05 PubMed PMID: 6050227.

Banks GT, Buck KW, Chain EB, Himmelweit F, Marks JE, Tyler JM, et al. Viruses in fungi and interferon stimulation. Nature. 1968;218(5141):542–5. https://doi.org/10.1038/218542a0 . Epub 1968/05/11 PubMed PMID: 4967851.

Jia J, Fu Y, Jiang D, Mu F, Cheng J, Lin Y, et al. Interannual dynamics, diversity and evolution of the virome in Sclerotinia sclerotiorum from a single crop field. Virus Evol. 2021;7(1):veab032. https://doi.org/10.1093/ve/veab032 .

Mu F, Li B, Cheng S, Jia J, Jiang D, Fu Y, et al. Nine viruses from eight lineages exhibiting new evolutionary modes that co-infect a hypovirulent phytopathogenic fungus. Plos Pathog. 2021;17(8):e1009823. https://doi.org/10.1371/journal.ppat.1009823 . Epub 2021/08/25 PubMed PMID: 34428260; PubMed Central PMCID: PMCPMC8415603.

Myers JM, Bonds AE, Clemons RA, Thapa NA, Simmons DR, Carter-House D, et al. Survey of early-diverging lineages of fungi reveals abundant and diverse Mycoviruses. mBio. 2020;11(5):e02027. https://doi.org/10.1128/mBio.02027-20 . Epub 2020/09/10 PubMed PMID: 32900807; PubMed Central PMCID: PMCPMC7482067.

Ruiz-Padilla A, Rodriguez-Romero J, Gomez-Cid I, Pacifico D, Ayllon MA. Novel Mycoviruses discovered in the Mycovirome of a Necrotrophic fungus. MBio. 2021;12(3):e03705. https://doi.org/10.1128/mBio.03705-20 . Epub 2021/05/13 PubMed PMID: 33975945; PubMed Central PMCID: PMCPMC8262958.

Gilbert KB, Holcomb EE, Allscheid RL, Carrington JC. Hiding in plain sight: new virus genomes discovered via a systematic analysis of fungal public transcriptomes. Plos One. 2019;14(7):e0219207. https://doi.org/10.1371/journal.pone.0219207 . Epub 2019/07/25 PubMed PMID: 31339899; PubMed Central PMCID: PMCPMC6655640.

Khan HA, Telengech P, Kondo H, Bhatti MF, Suzuki N. Mycovirus hunting revealed the presence of diverse viruses in a single isolate of the Phytopathogenic fungus diplodia seriata from Pakistan. Front Cell Infect Microbiol. 2022;12:913619. https://doi.org/10.3389/fcimb.2022.913619 . Epub 2022/07/19 PubMed PMID: 35846770; PubMed Central PMCID: PMCPMC9277117.

Kotta-Loizou I, Coutts RHA. Mycoviruses in Aspergilli: a comprehensive review. Front Microbiol. 2017;8:1699. https://doi.org/10.3389/fmicb.2017.01699 . Epub 2017/09/22 PubMed PMID: 28932216; PubMed Central PMCID: PMCPMC5592211.

Garcia-Pedrajas MD, Canizares MC, Sarmiento-Villamil JL, Jacquat AG, Dambolena JS. Mycoviruses in biological control: from basic research to field implementation. Phytopathology. 2019;109(11):1828–39. https://doi.org/10.1094/PHYTO-05-19-0166-RVW . Epub 2019/08/10 PubMed PMID: 31398087.

Rigling D, Prospero S. Cryphonectria parasitica, the causal agent of chestnut blight: invasion history, population biology and disease control. Mol Plant Pathol. 2018;19(1):7–20. https://doi.org/10.1111/mpp.12542 . Epub 2017/02/01 PubMed PMID: 28142223; PubMed Central PMCID: PMCPMC6638123.

Okada R, Ichinose S, Takeshita K, Urayama SI, Fukuhara T, Komatsu K, et al. Molecular characterization of a novel mycovirus in Alternaria alternata manifesting two-sided effects: down-regulation of host growth and up-regulation of host plant pathogenicity. Virology. 2018;519:23–32. https://doi.org/10.1016/j.virol.2018.03.027 . Epub 2018/04/10 PubMed PMID: 29631173.

Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. https://doi.org/10.1038/nmeth.1923 . Epub 2012/03/06 PubMed PMID: 22388286; PubMed Central PMCID: PMCPMC3322381.

Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo assembler. Curr Protoc Bioinform. 2020;70(1):e102. https://doi.org/10.1002/cpbi.102 . Epub 2020/06/20 PubMed PMID: 32559359.

Li D, Luo R, Liu CM, Leung CM, Ting HF, Sadakane K, et al. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods. 2016;102:3–11. https://doi.org/10.1016/j.ymeth.2016.02.020 . Epub 2016/03/26 PubMed PMID: 27012178.

Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589–95. https://doi.org/10.1093/bioinformatics/btp698 . Epub 2010/01/19 PubMed PMID: 20080505; PubMed Central PMCID: PMCPMC2828108.

Roux S, Paez-Espino D, Chen IA, Palaniappan K, Ratner A, Chu K, et al. IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res. 2021;49(D1):D764–75. https://doi.org/10.1093/nar/gkaa946 . Epub 2020/11/03 PubMed PMID: 33137183; PubMed Central PMCID: PMCPMC7778971.

Mirdita M, Steinegger M, Soding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35(16):2856–8. https://doi.org/10.1093/bioinformatics/bty1057 . Epub 2019/01/08 PubMed PMID: 30615063; PubMed Central PMCID: PMCPMC6691333.

Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366–8. https://doi.org/10.1038/s41592-021-01101-x . Epub 2021/04/09 PubMed PMID: 33828273; PubMed Central PMCID: PMCPMC8026399.

Shen W, Ren H. TaxonKit: A practical and efficient NCBI taxonomy toolkit. J Genet Genomics. 2021;48(9):844–50. https://doi.org/10.1016/j.jgg.2021.03.006 . Epub 2021/05/19 PubMed PMID: 34001434.

Gautam A, Felderhoff H, Bagci C, Huson DH. Using AnnoTree to get more assignments, faster, in DIAMOND+MEGAN microbiome analysis. mSystems. 2022;7(1):e0140821. https://doi.org/10.1128/msystems.01408-21 . Epub 2022/02/23 PubMed PMID: 35191776; PubMed Central PMCID: PMCPMC8862659.

Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35(6):1547–9. https://doi.org/10.1093/molbev/msy096 . Epub 2018/05/04 PubMed PMID: 29722887; PubMed Central PMCID: PMCPMC5967553.

Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(5):1530–4. https://doi.org/10.1093/molbev/msaa015 . Epub 2020/02/06 PubMed PMID: 32011700; PubMed Central PMCID: PMCPMC7182206.

Letunic I, Bork P. Interactive Tree of Life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. 2024. https://doi.org/10.1093/nar/gkae268 .

Muhire BM, Varsani A, Martin DP. SDT: a virus classification tool based on pairwise sequence alignment and identity calculation. Plos One. 2014;9(9):e108277. https://doi.org/10.1371/journal.pone.0108277 . Epub 2014/09/27 PubMed PMID: 25259891; PubMed Central PMCID: PMCPMC4178126.

Kumar S, Suleski M, Craig JM, Kasprowicz AE, Sanderford M, Li M, et al. TimeTree 5: an expanded resource for species divergence times. Mol Biol Evol. 2022;39(8):msac174. https://doi.org/10.1093/molbev/msac174 . Epub 2022/08/07 PubMed PMID: 35932227; PubMed Central PMCID: PMCPMC9400175.

Revell LJ. phytools 2.0: an updated R ecosystem for phylogenetic comparative methods (and other things). PeerJ. 2024;12:e16505. https://doi.org/10.7717/peerj.16505 . Epub 2024/01/09 PubMed PMID: 38192598; PubMed Central PMCID: PMCPMC10773453.

Santichaivekin S, Yang Q, Liu J, Mawhorter R, Jiang J, Wesley T, et al. eMPRess: a systematic cophylogeny reconciliation tool. Bioinformatics. 2021;37(16):2481–2. https://doi.org/10.1093/bioinformatics/btaa978 . Epub 2020/11/21 PubMed PMID: 33216126.

Ma W, Smirnov D, Libeskind-Hadas R. DTL reconciliation repair. BMC Bioinformatics. 2017;18(Suppl 3):76. https://doi.org/10.1186/s12859-017-1463-9 . Epub 2017/04/01 PubMed PMID: 28361686; PubMed Central PMCID: PMCPMC5374596.

Members C-N, Partners. Database resources of the national genomics data center, China national center for bioinformation in 2024. Nucleic Acids Res. 2024;52(D1):D18–32. https://doi.org/10.1093/nar/gkad1078 . Epub 2023/11/29 PubMed PMID: 38018256; PubMed Central PMCID: PMCPMC10767964.

Article   Google Scholar  

Shafik K, Umer M, You H, Aboushedida H, Wang Z, Ni D, et al. Characterization of a Novel Mitovirus infecting Melanconiella theae isolated from tea plants. Front Microbiol. 2021;12: 757556. https://doi.org/10.3389/fmicb.2021.757556 . Epub 2021/12/07 PubMed PMID: 34867881; PubMed Central PMCID: PMCPMC8635788

Kamaruzzaman M, He G, Wu M, Zhang J, Yang L, Chen W, et al. A novel Partitivirus in the Hypovirulent isolate QT5–19 of the plant pathogenic fungus Botrytis cinerea. Viruses. 2019;11(1):24. https://doi.org/10.3390/v11010024 . Epub 2019/01/06 PubMed PMID: 30609795; PubMed Central PMCID: PMCPMC6356794.

Akata I, Keskin E, Sahin E. Molecular characterization of a new mitovirus hosted by the ectomycorrhizal fungus Albatrellopsis flettii. Arch Virol. 2021;166(12):3449–54. https://doi.org/10.1007/s00705-021-05250-4 . Epub 2021/09/24 PubMed PMID: 34554305.

Walker PJ, Siddell SG, Lefkowitz EJ, Mushegian AR, Adriaenssens EM, Alfenas-Zerbini P, et al. Recent changes to virus taxonomy ratified by the international committee on taxonomy of viruses (2022). Arch Virol. 2022;167(11):2429–40. https://doi.org/10.1007/s00705-022-05516-5 . Epub 2022/08/24 PubMed PMID: 35999326; PubMed Central PMCID: PMCPMC10088433.

Alvarez-Quinto R, Grinstead S, Jones R, Mollov D. Complete genome sequence of a new mitovirus associated with walking iris (Trimezia northiana). Arch Virol. 2023;168(11):273. https://doi.org/10.1007/s00705-023-05901-8 . Epub 2023/10/17 PubMed PMID: 37845386.

Gilmer D, Ratti C, Ictv RC. ICTV Virus taxonomy profile: Benyviridae. J Gen Virol. 2017;98(7):1571–2. https://doi.org/10.1099/jgv.0.000864 . Epub 2017/07/18 PubMed PMID: 28714846; PubMed Central PMCID: PMCPMC5656776.

Wetzel V, Willlems G, Darracq A, Galein Y, Liebe S, Varrelmann M. The Beta vulgaris-derived resistance gene Rz2 confers broad-spectrum resistance against soilborne sugar beet-infecting viruses from different families by recognizing triple gene block protein 1. Mol Plant Pathol. 2021;22(7):829–42. https://doi.org/10.1111/mpp.13066 . Epub 2021/05/06 PubMed PMID: 33951264; PubMed Central PMCID: PMCPMC8232027.

Ayllon MA, Turina M, Xie J, Nerva L, Marzano SL, Donaire L, et al. ICTV Virus taxonomy profile: Botourmiaviridae. J Gen Virol. 2020;101(5):454–5. https://doi.org/10.1099/jgv.0.001409 . Epub 2020/05/08 PubMed PMID: 32375992; PubMed Central PMCID: PMCPMC7414452.

Xiao J, Wang X, Zheng Z, Wu Y, Wang Z, Li H, et al. Molecular characterization of a novel deltaflexivirus infecting the edible fungus Pleurotus ostreatus. Arch Virol. 2023;168(6):162. https://doi.org/10.1007/s00705-023-05789-4 . Epub 2023/05/17 PubMed PMID: 37195309.

Canuti M, Rodrigues B, Lang AS, Dufour SC, Verhoeven JTP. Novel divergent members of the Kitrinoviricota discovered through metagenomics in the intestinal contents of red-backed voles (Clethrionomys gapperi). Int J Mol Sci. 2022;24(1):131. https://doi.org/10.3390/ijms24010131 . Epub 2023/01/09 PubMed PMID: 36613573; PubMed Central PMCID: PMCPMC9820622.

Hermanns K, Zirkel F, Kopp A, Marklewitz M, Rwego IB, Estrada A, et al. Discovery of a novel alphavirus related to Eilat virus. J Gen Virol. 2017;98(1):43–9. https://doi.org/10.1099/jgv.0.000694 . Epub 2017/02/17 PubMed PMID: 28206905.

Jiang D, Ayllon MA, Marzano SL, Ictv RC. ICTV Virus taxonomy profile: Mymonaviridae. J Gen Virol. 2019;100(10):1343–4. https://doi.org/10.1099/jgv.0.001301 . Epub 2019/09/04 PubMed PMID: 31478828.

Liu L, Xie J, Cheng J, Fu Y, Li G, Yi X, et al. Fungal negative-stranded RNA virus that is related to bornaviruses and nyaviruses. Proc Natl Acad Sci U S A. 2014;111(33):12205–10. https://doi.org/10.1073/pnas.1401786111 . Epub 2014/08/06 PubMed PMID: 25092337; PubMed Central PMCID: PMCPMC4143027.

Zhong J, Li P, Gao BD, Zhong SY, Li XG, Hu Z, et al. Novel and diverse mycoviruses co-infecting a single strain of the phytopathogenic fungus Alternaria dianthicola. Front Cell Infect Microbiol. 2022;12:980970. https://doi.org/10.3389/fcimb.2022.980970 . Epub 2022/10/15 PubMed PMID: 36237429; PubMed Central PMCID: PMCPMC9552818.

Wang W, Wang X, Tu C, Yang M, Xiang J, Wang L, et al. Novel Mycoviruses discovered from a Metatranscriptomics survey of the Phytopathogenic Alternaria Fungus. Viruses. 2022;14(11):2552. https://doi.org/10.3390/v14112552 . Epub 2022/11/25 PubMed PMID: 36423161; PubMed Central PMCID: PMCPMC9693364.

Sun Y, Li J, Gao GF, Tien P, Liu W. Bunyavirales ribonucleoproteins: the viral replication and transcription machinery. Crit Rev Microbiol. 2018;44(5):522–40. https://doi.org/10.1080/1040841X.2018.1446901 . Epub 2018/03/09 PubMed PMID: 29516765.

Li P, Bhattacharjee P, Gagkaeva T, Wang S, Guo L. A novel bipartite negative-stranded RNA mycovirus of the order Bunyavirales isolated from the phytopathogenic fungus Fusarium sibiricum. Arch Virol. 2023;169(1):13. https://doi.org/10.1007/s00705-023-05942-z . Epub 2023/12/29 PubMed PMID: 38155262.

Ferron F, Weber F, de la Torre JC, Reguera J. Transcription and replication mechanisms of Bunyaviridae and Arenaviridae L proteins. Virus Res. 2017;234:118–34. https://doi.org/10.1016/j.virusres.2017.01.018 . Epub 2017/02/01 PubMed PMID: 28137457; PubMed Central PMCID: PMCPMC7114536.

Vainio EJ, Chiba S, Ghabrial SA, Maiss E, Roossinck M, Sabanadzovic S, et al. ICTV Virus taxonomy profile: Partitiviridae. J Gen Virol. 2018;99(1):17–8. https://doi.org/10.1099/jgv.0.000985 . Epub 2017/12/08 PubMed PMID: 29214972; PubMed Central PMCID: PMCPMC5882087.

Mifsud JCO, Gallagher RV, Holmes EC, Geoghegan JL. Transcriptome mining expands knowledge of RNA viruses across the plant Kingdom. J Virol. 2022;96(24):e0026022. https://doi.org/10.1128/jvi.00260-22 . Epub 2022/06/01 PubMed PMID: 35638822; PubMed Central PMCID: PMCPMC9769393.

Tamada T, Kondo H. Biological and genetic diversity of plasmodiophorid-transmitted viruses and their vectors. J Gen Plant Pathol. 2013;79:307–20.

Dolja VV, Krupovic M, Koonin EV. Deep roots and splendid boughs of the global plant virome. Annu Rev Phytopathol. 2020;58:23–53.

Koonin EV, Dolja VV, Krupovic M, Varsani A, Wolf YI, Yutin N, et al. Global organization and proposed Megataxonomy of the virus world. Microbiol Mol Biol Rev. 2020;84(2):e00061. https://doi.org/10.1128/MMBR.00061-19 . Epub 2020/03/07 PubMed PMID: 32132243; PubMed Central PMCID: PMCPMC7062200.

Osono T. Role of phyllosphere fungi of forest trees in the development of decomposer fungal communities and decomposition processes of leaf litter. Can J Microbiol. 2006;52(8):701–16. https://doi.org/10.1139/w06-023 . Epub 2006/08/19 PubMed PMID: 16917528.

Li Z, Pan D, Wei G, Pi W, Zhang C, Wang JH, et al. Deep sea sediments associated with cold seeps are a subsurface reservoir of viral diversity. ISME J. 2021;15(8):2366–78. https://doi.org/10.1038/s41396-021-00932-y . Epub 2021/03/03 PubMed PMID: 33649554; PubMed Central PMCID: PMCPMC8319345.

Hierweger MM, Koch MC, Rupp M, Maes P, Di Paola N, Bruggmann R, et al. Novel Filoviruses, Hantavirus, and Rhabdovirus in freshwater fish, Switzerland, 2017. Emerg Infect Dis. 2021;27(12):3082–91. https://doi.org/10.3201/eid2712.210491 . Epub 2021/11/23 PubMed PMID: 34808081; PubMed Central PMCID: PMCPMC8632185.

La Rosa G, Iaconelli M, Mancini P, Bonanno Ferraro G, Veneri C, Bonadonna L, et al. First detection of SARS-CoV-2 in untreated wastewaters in Italy. Sci Total Environ. 2020;736:139652. https://doi.org/10.1016/j.scitotenv.2020.139652 . Epub 2020/05/29 PubMed PMID: 32464333; PubMed Central PMCID: PMCPMC7245320.

Sutela S, Poimala A, Vainio EJ. Viruses of fungi and oomycetes in the soil environment. FEMS Microbiol Ecol. 2019;95(9):fiz119. https://doi.org/10.1093/femsec/fiz119 . Epub 2019/08/01 PubMed PMID: 31365065.

Yu X, Li B, Fu Y, Jiang D, Ghabrial SA, Li G, et al. A geminivirus-related DNA mycovirus that confers hypovirulence to a plant pathogenic fungus. Proc Natl Acad Sci U S A. 2010;107(18):8387–92. https://doi.org/10.1073/pnas.0913535107 . Epub 2010/04/21 PubMed PMID: 20404139; PubMed Central PMCID: PMCPMC2889581.

Li P, Wang S, Zhang L, Qiu D, Zhou X, Guo L. A tripartite ssDNA mycovirus from a plant pathogenic fungus is infectious as cloned DNA and purified virions. Sci Adv. 2020;6(14):eaay9634. https://doi.org/10.1126/sciadv.aay9634 . Epub 2020/04/15 PubMed PMID: 32284975; PubMed Central PMCID: PMCPMC7138691.

Khalifa ME, MacDiarmid RM. A mechanically transmitted DNA Mycovirus is targeted by the defence machinery of its host, Botrytis cinerea. Viruses. 2021;13(7):1315. https://doi.org/10.3390/v13071315 . Epub 2021/08/11 PubMed PMID: 34372522; PubMed Central PMCID: PMCPMC8309985.

Yu X, Li B, Fu Y, Xie J, Cheng J, Ghabrial SA, et al. Extracellular transmission of a DNA mycovirus and its use as a natural fungicide. Proc Natl Acad Sci U S A. 2013;110(4):1452–7. https://doi.org/10.1073/pnas.1213755110 . Epub 2013/01/09 PubMed PMID: 23297222; PubMed Central PMCID: PMCPMC3557086.

Nuss DL. Hypovirulence: mycoviruses at the fungal-plant interface. Nat Rev Microbiol. 2005;3(8):632–42. https://doi.org/10.1038/nrmicro1206 . Epub 2005/08/03 PubMed PMID: 16064055.

Coenen A, Kevei F, Hoekstra RF. Factors affecting the spread of double-stranded RNA viruses in Aspergillus nidulans. Genet Res. 1997;69(1):1–10. https://doi.org/10.1017/s001667239600256x . Epub 1997/02/01 PubMed PMID: 9164170.

Freitas CSA, Maciel LF, Dos Correa Santos RA, Costa O, Maia FCB, Rabelo RS, et al. Bacterial volatile organic compounds induce adverse ultrastructural changes and DNA damage to the sugarcane pathogenic fungus Thielaviopsis ethacetica. Environ Microbiol. 2022;24(3):1430–53. https://doi.org/10.1111/1462-2920.15876 . Epub 2022/01/08 PubMed PMID: 34995419.

Martinez-Alvarez P, Vainio EJ, Botella L, Hantula J, Diez JJ. Three mitovirus strains infecting a single isolate of Fusarium circinatum are the first putative members of the family Narnaviridae detected in a fungus of the genus Fusarium. Arch Virol. 2014;159(8):2153–5. https://doi.org/10.1007/s00705-014-2012-8 . Epub 2014/02/13 PubMed PMID: 24519462.

Deakin G, Dobbs E, Bennett JM, Jones IM, Grogan HM, Burton KS. Multiple viral infections in Agaricus bisporus - characterisation of 18 unique RNA viruses and 8 ORFans identified by deep sequencing. Sci Rep. 2017;7(1):2469. https://doi.org/10.1038/s41598-017-01592-9 . Epub 2017/05/28 PubMed PMID: 28550284; PubMed Central PMCID: PMCPMC5446422.

Litov AG, Zueva AI, Tiunov AV, Van Thinh N, Belyaeva NV, Karganova GG. Virome of three termite species from Southern Vietnam. Viruses. 2022;14(5):860. https://doi.org/10.3390/v14050860 . Epub 2022/05/29 PubMed PMID: 35632601; PubMed Central PMCID: PMCPMC9143207.

Hu J, Neoh KB, Appel AG, Lee CY. Subterranean termite open-air foraging and tolerance to desiccation: Comparative water relation of two sympatric Macrotermes spp. (Blattodea: Termitidae). Comp Biochem Physiol A Mol Integr Physiol. 2012;161(2):201–7. https://doi.org/10.1016/j.cbpa.2011.10.028 . Epub 2011/11/17 PubMed PMID: 22085890.

Kondo H, Botella L, Suzuki N. Mycovirus diversity and evolution revealed/inferred from recent studies. Annu Rev Phytopathol. 2022;60:307–36. https://doi.org/10.1146/annurev-phyto-021621-122122 . Epub 2022/05/25 PubMed PMID: 35609970.

Fukushi T. Relationships between propagative rice viruses and their vectors. 1969.

Google Scholar  

Sun L, Kondo H, Bagus AI. Cross-kingdom virus infection. Encyclopedia of Virology: Volume 1–5. 4th Ed. Elsevier; 2020. pp. 443–9. https://doi.org/10.1016/B978-0-12-809633-8.21320-4 .

Bian R, Andika IB, Pang T, Lian Z, Wei S, Niu E, et al. Facilitative and synergistic interactions between fungal and plant viruses. Proc Natl Acad Sci U S A. 2020;117(7):3779–88. https://doi.org/10.1073/pnas.1915996117 . Epub 2020/02/06 PubMed PMID: 32015104; PubMed Central PMCID: PMCPMC7035501.

Chiapello M, Rodriguez-Romero J, Ayllon MA, Turina M. Analysis of the virome associated to grapevine downy mildew lesions reveals new mycovirus lineages. Virus Evol. 2020;6(2):veaa058. https://doi.org/10.1093/ve/veaa058 . Epub 2020/12/17 PubMed PMID: 33324489; PubMed Central PMCID: PMCPMC7724247.

Sutela S, Forgia M, Vainio EJ, Chiapello M, Daghino S, Vallino M, et al. The virome from a collection of endomycorrhizal fungi reveals new viral taxa with unprecedented genome organization. Virus Evol. 2020;6(2):veaa076. https://doi.org/10.1093/ve/veaa076 . Epub 2020/12/17 PubMed PMID: 33324490; PubMed Central PMCID: PMCPMC7724248.

Zhou K, Zhang F, Deng Y. Comparative analysis of viromes identified in multiple macrofungi. Viruses. 2024;16(4):597. https://doi.org/10.3390/v16040597 . Epub 2024/04/27 PubMed PMID: 38675938; PubMed Central PMCID: PMCPMC11054281.

Siddell SG, Smith DB, Adriaenssens E, Alfenas-Zerbini P, Dutilh BE, Garcia ML, et al. Virus taxonomy and the role of the International Committee on Taxonomy of Viruses (ICTV). J Gen Virol. 2023;104(5):001840. https://doi.org/10.1099/jgv.0.001840 . Epub 2023/05/04 PubMed PMID: 37141106; PubMed Central PMCID: PMCPMC10227694.

Download references

Acknowledgements

All authors participated in the design and interpretation of the studies, analysis of the data, and review of the manuscript. WZ and CZ contributed to the conception and design; XL, ZD, JXU, WL and PN contributed to the collection and assembly of data; XL, ZD and JXE contributed to the data analysis and interpretation.

This research was supported by National Key Research and Development Programs of China [No.2023YFD1801301 and 2022YFC2603801] and the National Natural Science Foundation of China [No.82341106].

Author information

Xiang Lu, Ziyuan Dai and Jiaxin Xue contributed equally to this work.

Authors and Affiliations

Institute of Critical Care Medicine, The Affiliated People’s Hospital, Jiangsu University, Zhenjiang, 212002, China

Xiang Lu & Wen Zhang

Department of Microbiology, School of Medicine, Jiangsu University, Zhenjiang, 212013, China

Xiang Lu, Jiaxin Xue & Wen Zhang

Department of Clinical Laboratory, Affiliated Hospital 6 of Nantong University, Yancheng Third People’s Hospital, Yancheng, Jiangsu, China

Clinical Laboratory Center, The Affiliated Taizhou People’s Hospital of Nanjing Medical University, Taizhou, 225300, China

Wang Li, Ping Ni, Juan Xu, Chenglin Zhou & Wen Zhang


Contributions

Corresponding authors

Correspondence to Juan Xu, Chenglin Zhou or Wen Zhang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.

Supplementary Material 2.

Supplementary Material 3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Lu, X., Dai, Z., Xue, J. et al. Discovery of novel RNA viruses through analysis of fungi-associated next-generation sequencing data. BMC Genomics 25, 517 (2024). https://doi.org/10.1186/s12864-024-10432-w

Download citation

Received: 19 March 2024

Accepted: 20 May 2024

Published: 27 May 2024

DOI: https://doi.org/10.1186/s12864-024-10432-w


BMC Genomics

ISSN: 1471-2164
