Data Analysis in Research: Types & Methods


Content Index

  • What is data analysis in research?
  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis

What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. The second is data reduction through summarization and categorization, which helps find patterns and themes in the data for easy identification and linking. The third is the data analysis itself, which researchers do in both a top-down and a bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that “the data analysis and data interpretation is a process representing the application of deductive and inductive logic to the research and data analysis.”

Why analyze data in research?

Researchers rely heavily on data because they have a story to tell or research problems to solve. It starts with a question, and data is nothing but an answer to that question. But what if there is no question to ask? It is possible to explore data even without a problem. We call this ‘data mining’, and it often reveals interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience’s vision guide them to find the patterns that shape the story they want to tell. One of the essential things expected from researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, data analysis sometimes tells unforeseen yet exciting stories that nobody expected at the outset. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Types of data in research

Every kind of data describes things by assigning a specific value to them. For analysis, you need to organize these values and process and present them in a given context to make them useful. Data can take different forms; here are the primary data types.

  • Qualitative data: When the data presented consists of words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be distinguished into categories, grouped, measured, calculated, or ranked. Example: answers to questions about age, rank, cost, length, weight, scores, etc. all come under this type of data. You can present such data in graphs or charts, or apply statistical analysis methods to it. Outcomes Measurement Systems (OMS) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: It is data presented in groups. However, an item included in the categorical data cannot belong to more than one group. Example: a person responding to a survey about their lifestyle, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data.
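To make the chi-square test mentioned for categorical data more concrete, here is a minimal sketch in plain Python. The survey counts and the function name `chi_square_statistic` are invented for illustration; in practice a statistics package would usually do this calculation and also report a p-value.

```python
# Hypothetical survey counts (made up for illustration):
# smoking habit broken down by marital status.
observed = {
    "married": {"smoker": 20, "non-smoker": 80},
    "single": {"smoker": 30, "non-smoker": 70},
}


def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and degrees of freedom
    for a two-way table given as {row_label: {column_label: count}}."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())

    statistic = 0.0
    for r in rows:
        for c in cols:
            # Count we would expect if the two variables were independent.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (table[r][c] - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(cols) - 1)
    return statistic, dof


statistic, dof = chi_square_statistic(observed)
print(f"chi-square = {statistic:.3f}, degrees of freedom = {dof}")
```

With one degree of freedom, the usual 5% critical value is about 3.84, so a statistic below that would not be taken as evidence that the two categories are related in this made-up sample.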


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from the analysis of numerical data, because qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complicated information is a complicated process. Hence it is typically used for exploratory research and data analysis.

Finding patterns in the qualitative data

Although there are several ways to find patterns in textual information, the word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is manual. Here the researchers usually read the available data and find repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find that “food” and “hunger” are the most commonly used words and will highlight them for further analysis.
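The word-based method is described above as a manual process, but the counting step can be sketched in a few lines of Python. The responses, the stop-word list, and the function name `word_frequencies` below are all made up for illustration; a researcher would still need to read the flagged words in context.

```python
import re
from collections import Counter

# Hypothetical open-ended responses (invented for this sketch).
responses = [
    "Food prices keep rising and hunger is getting worse.",
    "We need clean water and more food for our children.",
    "Hunger is the biggest problem in our village.",
]

# A tiny, hand-picked list of filler words to ignore.
STOP_WORDS = {"a", "and", "for", "in", "is", "our", "the", "we", "more", "keep"}


def word_frequencies(texts, stop_words):
    """Count how often each non-filler word appears across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts


frequencies = word_frequencies(responses, STOP_WORDS)
print(frequencies.most_common(3))
```

In this toy sample, "food" and "hunger" each appear twice and every other word once, which is the kind of signal a researcher would then highlight for further analysis.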


The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example, researchers studying the concept of ‘diabetes’ amongst respondents might analyze when and how each respondent has used or referred to the word ‘diabetes.’
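A keyword-in-context listing can be sketched as follows. The transcript text, the window size of four words, and the function name `keyword_in_context` are assumptions chosen for illustration, not part of any particular tool.

```python
import re


def keyword_in_context(text, keyword, window=4):
    """Return (left, keyword, right) tuples showing up to `window`
    words on either side of each occurrence of `keyword`."""
    words = re.findall(r"[\w']+", text)
    hits = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits


# Hypothetical interview excerpt (invented for this sketch).
transcript = (
    "My mother was diagnosed with diabetes ten years ago. "
    "I worry that diabetes runs in our family, so I watch my diet."
)

for left, word, right in keyword_in_context(transcript, "diabetes"):
    print(f"{left} [{word}] {right}")
```

Reading the surrounding words side by side makes it easier to see whether respondents talk about the keyword as a diagnosis, a fear, a family matter, and so on.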

The scrutiny-based technique is also one of the highly recommended text analysis methods used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique; it shows how specific pieces of text are similar to or different from each other.

For example: To find out the “importance of a resident doctor in a company,” the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method for analyzing polls that use single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable Partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations from the enormous data.


Methods used for data analysis in qualitative research

There are several techniques to analyze the data in qualitative research, but here are some commonly used methods:

  • Content Analysis: It is widely accepted and the most frequently employed technique for data analysis in research methodology. It can be used to analyze documented information from text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions people share are analyzed to find answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. However, this method considers the social context within which the communication between the researcher and respondent takes place. In addition, discourse analysis also focuses on the respondent’s lifestyle and day-to-day environment when deriving any conclusion.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, grounded theory is the best resort for analyzing qualitative data. Grounded theory is applied to study data about a host of similar cases occurring in different settings. Researchers using this method might alter explanations or produce new ones until they arrive at a conclusion.


Data analysis in quantitative research

Preparing data for analysis

The first stage in research and data analysis is to prepare the data for analysis so that the nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets the pre-set standards or is a biased data sample. It is divided into four different stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent has answered all the questions in an online survey or, in an interview, that the interviewer has asked all the questions devised in the questionnaire.

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in some fields incorrectly or skip them accidentally. Data editing is a process wherein the researchers confirm that the provided data is free of such errors. They need to conduct necessary checks and outlier checks to edit the raw data and make it ready for analysis.
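The completeness and outlier checks described above could be sketched like this. The records, field names, and the 1.5 × IQR rule used to flag outliers are assumptions for illustration; the article does not prescribe a specific outlier rule.

```python
import statistics

# Hypothetical survey records (invented for this sketch).
responses = [
    {"id": 1, "age": 34, "satisfaction": 4},
    {"id": 2, "age": 29, "satisfaction": None},  # skipped question
    {"id": 3, "age": 41, "satisfaction": 5},
    {"id": 4, "age": 230, "satisfaction": 3},    # likely a typing error
    {"id": 5, "age": 38, "satisfaction": 4},
    {"id": 6, "age": 45, "satisfaction": 2},
    {"id": 7, "age": 31, "satisfaction": 5},
    {"id": 8, "age": 36, "satisfaction": 3},
]
REQUIRED_FIELDS = ("age", "satisfaction")


def incomplete_ids(rows, required):
    """Return ids of records with at least one required field missing."""
    return [
        row["id"]
        for row in rows
        if any(row.get(field) is None for field in required)
    ]


def outlier_ids(rows, field):
    """Flag values outside 1.5 * IQR of the quartiles (an assumed rule)."""
    values = [row[field] for row in rows if row.get(field) is not None]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [
        row["id"]
        for row in rows
        if row.get(field) is not None and not low <= row[field] <= high
    ]


print("Incomplete:", incomplete_ids(responses, REQUIRED_FIELDS))
print("Age outliers:", outlier_ids(responses, "age"))
```

Flagged records are candidates for review rather than automatic deletion; whether to correct, exclude, or keep them is a judgment the researcher still has to make.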

Phase III: Data Coding

Of the three phases, this is the most critical. It involves grouping and assigning values to the survey responses. If a survey is completed with a sample size of 1,000, the researcher might create age brackets to distinguish the respondents based on their age. It then becomes easier to analyze small data buckets than to deal with the massive data pile.
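As a rough illustration of the age-bracket coding described above, the sketch below assigns each respondent to a bucket and counts the buckets. The bracket boundaries, the sample ages, and the function name `code_age` are assumptions, not a recommended scheme.

```python
from collections import Counter

# Assumed age brackets (inclusive bounds); real studies choose their own.
AGE_BRACKETS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]


def code_age(age):
    """Map a raw age to a bracket label such as '25-34'."""
    for low, high in AGE_BRACKETS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "65+" if age >= 65 else "under 18"


# Hypothetical respondent ages (invented for this sketch).
ages = [19, 23, 31, 37, 42, 58, 67, 29]
coded = [code_age(age) for age in ages]
bucket_counts = Counter(coded)

for bracket, count in sorted(bucket_counts.items()):
    print(bracket, count)
```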


Methods used for data analysis in quantitative research

After the data is prepared for analysis, researchers can use different research and data analysis methods to derive meaningful insights. Statistical analysis is the most favored approach for numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. Statistical methods fall into two groups: ‘descriptive statistics’, used to describe data, and ‘inferential statistics’, which help in comparing the data.

Descriptive statistics

This method is used to describe the basic features of various types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not support conclusions that go beyond the data; any conclusions are still based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.
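A count-and-percent table of this kind can be sketched with the standard library. The answers below are invented for illustration.

```python
from collections import Counter

# Hypothetical answers to a single survey question.
answers = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "yes"]

counts = Counter(answers)
total = len(answers)

# Map each answer to (count, percent of all responses).
frequency_table = {
    answer: (count, 100 * count / total) for answer, count in counts.items()
}

for answer, (count, percent) in frequency_table.items():
    print(f"{answer:<10} {count:>3} {percent:6.1f}%")
```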

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to summarize a distribution with a single representative point.
  • Researchers use this method when they want to showcase the most common or the average response.
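The three measures can be computed with Python's built-in `statistics` module. The scores are made-up sample data.

```python
import statistics

# Hypothetical test scores (invented for this sketch).
scores = [72, 85, 85, 90, 64, 78, 85, 91]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value when sorted
mode_score = statistics.mode(scores)      # most frequent value

print(mean_score, median_score, mode_score)
```

Here the mean (81.25) sits below the median and mode (both 85) because one low score of 64 pulls the average down, which is a reminder that the three measures can tell slightly different stories about the same data.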

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • Range = difference between the highest and lowest points.
  • Variance = average squared difference between each observed score and the mean; standard deviation = square root of the variance.
  • It is used to identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is. It helps them identify the extent to which the data is spread out and whether that spread directly affects the mean.
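A short sketch of these dispersion measures, using invented scores and treating them as a complete population (so the population variance and standard deviation are used; a sample would use `statistics.variance` and `statistics.stdev` instead):

```python
import statistics

# Hypothetical scores (invented); their mean is 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

score_range = max(scores) - min(scores)
population_variance = statistics.pvariance(scores)  # mean squared deviation
population_sd = statistics.pstdev(scores)           # square root of variance

print(score_range, population_variance, population_sd)
```

For this data the range is 7, the variance is 4, and the standard deviation is 2, meaning a typical score lies about two points away from the mean.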

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores helping researchers to identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.
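Percentile ranks and quartiles can be sketched as follows. Several definitions of percentile rank are in use; this sketch assumes the "percentage of scores at or below the given score" convention, and the scores are invented.

```python
import statistics

# Hypothetical exam scores (invented for this sketch).
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]


def percentile_rank(values, score):
    """Percentage of values at or below `score` (one common convention)."""
    at_or_below = sum(1 for value in values if value <= score)
    return 100 * at_or_below / len(values)


# Cut points that split the data into four equal parts, treating the
# data as the whole population of interest (method="inclusive").
quartiles = statistics.quantiles(scores, n=4, method="inclusive")

print(percentile_rank(scores, 80))
print(quartiles)
```

A score of 80 has a percentile rank of 60 under this convention, because six of the ten scores are at or below it.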

In quantitative research, descriptive analysis often gives absolute numbers, but those numbers alone are never sufficient to demonstrate the rationale behind them. Nevertheless, it is necessary to think of the method of research and data analysis that best suits your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students’ average scores in schools. It is better to rely on descriptive statistics when the researchers intend to keep the research or outcome limited to the provided sample without generalizing it. For example, when you want to compare the average voting done in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample collected from that population. For example, you can ask about 100 audience members at a movie theater if they like the movie they are watching. Researchers then use inferential statistics on the collected sample to reason that about 80-90% of people like the movie.

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and demonstrates something about the population parameter.
  • Hypothesis test: It’s about sampling research data to answer the survey research questions. For example, researchers might be interested to understand if the new shade of lipstick recently launched is good or not, or if the multivitamin capsules help children to perform better at games.
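Parameter estimation, the first area listed above, can be illustrated with the movie-theater example. The sketch assumes that 85 of the 100 people asked said they liked the movie (a number invented here) and uses the normal-approximation interval for a proportion, which is one common choice among several.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Approximate confidence interval for a population proportion.

    Uses the normal approximation; z=1.96 corresponds to roughly 95%
    confidence. Works best when n is reasonably large and the
    proportion is not too close to 0 or 1.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# Assumed sample result: 85 of 100 audience members liked the movie.
low, high = proportion_confidence_interval(85, 100)
print(f"Estimated share who like the movie: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 78% to 92%, which is how a sample of 100 can support a statement about the wider audience.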

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. They are often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but still want to understand the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables, cross-tabulation is used to analyze the relationship between multiple variables. Suppose the provided data has age and gender categories presented in rows and columns. A two-dimensional cross-tabulation simplifies analysis by showing the number of males and females in each age category.
  • Regression analysis: To understand the strength of the relationship between two variables, researchers commonly turn to regression analysis, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable and one or more independent variables. You then work to find out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to have been ascertained in an error-free, random manner.
  • Frequency tables: A frequency table lists each distinct value or category of a variable alongside the number of times it occurs, which makes it easy to see how responses are distributed.
  • Analysis of variance: This statistical procedure is used for testing the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation between groups, relative to the variation within them, suggests that the research findings are significant. In many contexts, the terms ANOVA testing and variance analysis are used interchangeably.

Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data, and should be trained to demonstrate a high standard of research practice. Ideally, researchers must possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Usually, research and data analytics projects differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps design a survey questionnaire, select data collection methods, and choose samples.
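To make the cross-tabulation, correlation, and regression methods described earlier more concrete, here is a small standard-library sketch. All data, variable names, and function names are invented for illustration, and the regression is the simplest case with a single independent variable.

```python
import math
from collections import Counter

# --- Cross-tabulation: count respondents per (age group, gender) cell ---
# Hypothetical respondents (invented for this sketch).
respondents = [
    ("18-34", "female"), ("18-34", "male"), ("35-54", "female"),
    ("35-54", "female"), ("55+", "male"), ("18-34", "female"),
]
crosstab = Counter(respondents)
print(crosstab[("18-34", "female")])


# --- Correlation: Pearson's r between two numeric variables ---
def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# --- Regression: least-squares line with one independent variable ---
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the best-fit line y = slope*x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


# Hypothetical data: hours studied (independent) vs. test score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

r = pearson_r(hours, scores)
slope, intercept = least_squares_line(hours, scores)
print(f"r = {r:.3f}, score = {slope:.1f} * hours + {intercept:.1f}")
```

In this toy data each extra hour of study is associated with about 4.1 more points, and the correlation is close to 1. Neither number on its own shows that studying causes higher scores; that conclusion needs an experimental design.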


  • The primary aim of data research and analysis is to derive ultimate insights that are unbiased. Any mistake or bias in collecting data, selecting an analysis method, or choosing an audience sample will lead to a biased inference.
  • No amount of sophistication in research data and analysis is enough to rectify poorly defined objective outcome measurements. Whether the design is at fault or the intentions are unclear, a lack of clarity might mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find a way to deal with everyday challenges like outliers, missing data, data altering, data mining, or developing graphical representations.

The sheer amount of data generated daily is frightening, especially now that data analysis has taken center stage. In 2018, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in the hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them a medium to collect data by creating appealing surveys.


Grad Coach

Qualitative Data Analysis Methods 101:

The “big 6” methods + examples.

By: Kerryn Warren (PhD) | Reviewed By: Eunice Rautenbach (D.Tech) | May 2020 (Updated April 2023)

Qualitative data analysis methods. Wow, that’s a mouthful. 

If you’re new to the world of research, qualitative data analysis can look rather intimidating. So much bulky terminology and so many abstract, fluffy concepts. It certainly can be a minefield!

Don’t worry – in this post, we’ll unpack the most popular analysis methods, one at a time, so that you can approach your analysis with confidence and competence – whether that’s for a dissertation, thesis or really any kind of research project.


What (exactly) is qualitative data analysis?

To understand qualitative data analysis, we need to first understand qualitative data – so let’s step back and ask the question, “what exactly is qualitative data?”.

Qualitative data refers to pretty much any data that’s “not numbers”. In other words, it’s not the stuff you measure using a fixed scale or complex equipment, nor do you analyse it using complex statistics or mathematics.

So, if it’s not numbers, what is it?

Words, you guessed? Well… sometimes, yes. Qualitative data can, and often does, take the form of interview transcripts, documents and open-ended survey responses – but it can also involve the interpretation of images and videos. In other words, qualitative isn’t just limited to text-based data.

So, how’s that different from quantitative data, you ask?

Simply put, qualitative research focuses on words, descriptions, concepts or ideas – while quantitative research focuses on numbers and statistics. Qualitative research investigates the “softer side” of things to explore and describe, while quantitative research focuses on the “hard numbers”, to measure differences between variables and the relationships between them. If you’re keen to learn more about the differences between qual and quant, we’ve got a detailed post over here.


So, qualitative analysis is easier than quantitative, right?

Not quite. In many ways, qualitative data can be challenging and time-consuming to analyse and interpret. At the end of your data collection phase (which itself takes a lot of time), you’ll likely have many pages of text-based data or hours upon hours of audio to work through. You might also have subtle nuances of interactions or discussions that have danced around in your mind, or that you scribbled down in messy field notes. All of this needs to work its way into your analysis.

Making sense of all of this is no small task and you shouldn’t underestimate it. Long story short – qualitative analysis can be a lot of work! Of course, quantitative analysis is no piece of cake either, but it’s important to recognise that qualitative analysis still requires a significant investment in terms of time and effort.


In this post, we’ll explore qualitative data analysis by looking at some of the most common analysis methods we encounter. We’re not going to cover every possible qualitative method and we’re not going to go into heavy detail – we’re just going to give you the big picture. That said, we will of course include links to loads of extra resources so that you can learn more about whichever analysis method interests you.

Without further delay, let’s get into it.

The “Big 6” Qualitative Analysis Methods 

There are many different types of qualitative data analysis, all of which serve different purposes and have unique strengths and weaknesses. We’ll start by outlining the analysis methods and then we’ll dive into the details for each.

The 6 most popular methods (or at least the ones we see at Grad Coach) are:

  • Content analysis
  • Narrative analysis
  • Discourse analysis
  • Thematic analysis
  • Grounded theory (GT)
  • Interpretive phenomenological analysis (IPA)

Let’s take a look at each of them…

QDA Method #1: Qualitative Content Analysis

Content analysis is possibly the most common and straightforward QDA method. At the simplest level, content analysis is used to evaluate patterns within a piece of content (for example, words, phrases or images) or across multiple pieces of content or sources of communication. For example, a collection of newspaper articles or political speeches.

With content analysis, you could, for instance, identify the frequency with which an idea is shared or spoken about – like the number of times a Kardashian is mentioned on Twitter. Or you could identify patterns of deeper underlying interpretations – for instance, by identifying phrases or words in tourist pamphlets that highlight India as an ancient country.

Because content analysis can be used in such a wide variety of ways, it’s important to go into your analysis with a very specific question and goal, or you’ll get lost in the fog. With content analysis, you’ll group large amounts of text into codes, summarise these into categories, and possibly even tabulate the data to calculate the frequency of certain concepts or variables. Because of this, content analysis provides a small splash of quantitative thinking within a qualitative method.
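To give a feel for the “code, then tabulate” step, here’s a deliberately crude sketch. It assumes a tiny keyword-based codebook and a few invented pamphlet sentences; in a real content analysis, a human researcher develops and applies the codes, and simple keyword matching like this would only be a rough first pass.

```python
from collections import Counter

# Hypothetical codebook: each code is triggered by a few keywords.
CODEBOOK = {
    "heritage": ("ancient", "temple", "tradition"),
    "nature": ("beach", "mountain", "wildlife"),
}

# Invented sentences standing in for tourist pamphlet text.
pamphlet_sentences = [
    "Explore ancient temples along the river.",
    "Relax on a quiet beach at sunset.",
    "A living tradition in every ancient city.",
]


def assign_codes(sentence, codebook):
    """Return the set of codes whose keywords appear in the sentence."""
    text = sentence.lower()
    return {
        code
        for code, keywords in codebook.items()
        if any(keyword in text for keyword in keywords)
    }


# Tabulate how many sentences received each code.
code_counts = Counter()
for sentence in pamphlet_sentences:
    code_counts.update(assign_codes(sentence, CODEBOOK))

print(code_counts)
```

In this toy sample, two of the three sentences receive the “heritage” code and one receives “nature”, which is the kind of frequency table the paragraph above describes.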

Naturally, while content analysis is widely useful, it’s not without its drawbacks. One of the main issues with content analysis is that it can be very time-consuming, as it requires lots of reading and re-reading of the texts. Also, because of its multidimensional focus on both qualitative and quantitative aspects, it is sometimes accused of losing important nuances in communication.

Content analysis also tends to concentrate on a very specific timeline and doesn’t take into account what happened before or after that timeline. This isn’t necessarily a bad thing though – just something to be aware of. So, keep these factors in mind if you’re considering content analysis. Every analysis method has its limitations, so don’t be put off by these – just be aware of them!

QDA Method #2: Narrative Analysis 

As the name suggests, narrative analysis is all about listening to people telling stories and analysing what that means. Since stories serve a functional purpose of helping us make sense of the world, we can gain insights into the ways that people deal with and make sense of reality by analysing their stories and the ways they’re told.

You could, for example, use narrative analysis to explore whether how something is being said is important. For instance, the narrative of a prisoner trying to justify their crime could provide insight into their view of the world and the justice system. Similarly, analysing the ways entrepreneurs talk about the struggles in their careers or cancer patients telling stories of hope could provide powerful insights into their mindsets and perspectives. Simply put, narrative analysis is about paying attention to the stories that people tell – and more importantly, the way they tell them.

Of course, the narrative approach has its weaknesses, too. Sample sizes are generally quite small due to the time-consuming process of capturing narratives. Because of this, along with the multitude of social and lifestyle factors which can influence a subject, narrative analysis can be quite difficult to reproduce in subsequent research. This means that it’s difficult to test the findings of some of this research.

Similarly, researcher bias can have a strong influence on the results here, so you need to be particularly careful about the potential biases you can bring into your analysis when using this method. Nevertheless, narrative analysis is still a very useful qualitative analysis method – just keep these limitations in mind and be careful not to draw broad conclusions.

QDA Method #3: Discourse Analysis 

Discourse is simply a fancy word for written or spoken language or debate. So, discourse analysis is all about analysing language within its social context. In other words, analysing language – such as a conversation, a speech, etc – within the culture and society in which it takes place. For example, you could analyse how a janitor speaks to a CEO, or how politicians speak about terrorism.

To truly understand these conversations or speeches, the culture and history of those involved in the communication are important factors to consider. For example, a janitor might speak more casually with a CEO in a company that emphasises equality among workers. Similarly, a politician might speak more about terrorism if there was a recent terrorist incident in the country.

So, as you can see, by using discourse analysis, you can identify how culture, history or power dynamics (to name a few) have an effect on the way concepts are spoken about. So, if your research aims and objectives involve understanding culture or power dynamics, discourse analysis can be a powerful method.

Because there are many social influences in terms of how we speak to each other, the potential use of discourse analysis is vast. Of course, this also means it’s important to have a very specific research question (or questions) in mind when analysing your data and looking for patterns and themes, or you might land up going down a winding rabbit hole.

Discourse analysis can also be very time-consuming as you need to sample the data to the point of saturation – in other words, until no new information and insights emerge. But this is, of course, part of what makes discourse analysis such a powerful technique. So, keep these factors in mind when considering this QDA method.

QDA Method #4: Thematic Analysis

Thematic analysis looks at patterns of meaning in a data set – for example, a set of interviews or focus group transcripts. But what exactly does that… mean? Well, a thematic analysis takes bodies of data (which are often quite large) and groups them according to similarities – in other words, themes. These themes help us make sense of the content and derive meaning from it.

Let’s take a look at an example.

With thematic analysis, you could analyse 100 online reviews of a popular sushi restaurant to find out what patrons think about the place. By reviewing the data, you would then identify the themes that crop up repeatedly within the data – for example, “fresh ingredients” or “friendly wait staff”.

So, as you can see, thematic analysis can be pretty useful for finding out about people’s experiences, views, and opinions. Therefore, if your research aims and objectives involve understanding people’s experience or view of something, thematic analysis can be a great choice.
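If it helps to see the bookkeeping side of the sushi restaurant example, here’s a minimal sketch. It assumes the researcher has already read each review and assigned theme labels by hand (the interpretive part that code can’t do for you); the reviews, labels and variable names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical reviews, each paired with themes a researcher assigned by hand.
coded_reviews = [
    ("The fish tasted like it was caught this morning.",
     {"fresh ingredients"}),
    ("Our waiter was so welcoming and the tuna was very fresh.",
     {"friendly wait staff", "fresh ingredients"}),
    ("Lovely staff, but we waited 40 minutes for a table.",
     {"friendly wait staff", "long wait times"}),
]

# Group the supporting excerpts under each theme.
excerpts_by_theme = defaultdict(list)
for review, themes in coded_reviews:
    for theme in themes:
        excerpts_by_theme[theme].append(review)

# How many reviews touch on each theme?
theme_prevalence = {
    theme: len(reviews) for theme, reviews in excerpts_by_theme.items()
}

for theme, count in sorted(theme_prevalence.items()):
    print(f"{theme}: {count} review(s)")
```

Keeping the excerpts grouped under each theme makes it easier to go back to the original wording when you write up what a theme actually means.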

Since thematic analysis is a bit of an exploratory process, it’s not unusual for your research questions to develop, or even change as you progress through the analysis. While this is somewhat natural in exploratory research, it can also be seen as a disadvantage as it means that data needs to be re-reviewed each time a research question is adjusted. In other words, thematic analysis can be quite time-consuming – but for a good reason. So, keep this in mind if you choose to use thematic analysis for your project and budget extra time for unexpected adjustments.


QDA Method #5: Grounded theory (GT) 

Grounded theory is a powerful qualitative analysis method where the intention is to create a new theory (or theories) using the data at hand, through a series of “tests” and “revisions”. Strictly speaking, GT is more a research design type than an analysis method, but we’ve included it here as it’s often referred to as a method.

What’s most important with grounded theory is that you go into the analysis with an open mind and let the data speak for itself – rather than dragging existing hypotheses or theories into your analysis. In other words, your analysis must develop from the ground up (hence the name). 

Let’s look at an example of GT in action.

Assume you’re interested in developing a theory about what factors influence students to watch a YouTube video about qualitative analysis. Using grounded theory, you’d start with this general overarching question about the given population (i.e., graduate students). First, you’d approach a small sample – for example, five graduate students in a department at a university. Ideally, this sample would be reasonably representative of the broader population. You’d interview these students to identify what factors lead them to watch the video.

After analysing the interview data, a general pattern could emerge. For example, you might notice that graduate students are more likely to watch a video about qualitative analysis if they are just starting on their dissertation journey, or if they have an upcoming test about research methods.

From here, you’ll look for another small sample – for example, five more graduate students in a different department – and see whether this pattern holds true for them. If not, you’ll look for commonalities and adapt your theory accordingly. As this process continues, the theory would develop . As we mentioned earlier, what’s important with grounded theory is that the theory develops from the data – not from some preconceived idea.

So, what are the drawbacks of grounded theory? Well, some argue that there’s a tricky circularity to grounded theory. For it to work, in principle, you should know as little as possible regarding the research question and population, so that you reduce the bias in your interpretation. However, in many circumstances, it’s also thought to be unwise to approach a research question without knowledge of the current literature . In other words, it’s a bit of a “chicken or the egg” situation.

Regardless, grounded theory remains a popular (and powerful) option. Naturally, it’s a very useful method when you’re researching a topic that is completely new or has very little existing research about it, as it allows you to start from scratch and work your way from the ground up .

Grounded theory is used to create a new theory (or theories) by using the data at hand, as opposed to existing theories and frameworks.

QDA Method #6: Interpretive Phenomenological Analysis (IPA)

Interpretive. Phenomenological. Analysis. IPA . Try saying that three times fast…

Let’s just stick with IPA, okay?

IPA is designed to help you understand the personal experiences of a subject (for example, a person or group of people) concerning a major life event, an experience or a situation . This event or experience is the “phenomenon” that makes up the “P” in IPA. Such phenomena may range from relatively common events – such as motherhood, or being involved in a car accident – to those which are extremely rare – for example, someone’s personal experience in a refugee camp. So, IPA is a great choice if your research involves analysing people’s personal experiences of something that happened to them.

It’s important to remember that IPA is subject-centred. In other words, it’s focused on the experiencer. This means that, while you’ll likely use a coding system to identify commonalities, it’s important not to lose the depth of experience or meaning by trying to reduce everything to codes. Also, keep in mind that since your sample size will generally be very small with IPA, you often won’t be able to draw broad conclusions about the generalisability of your findings. But that’s okay as long as it aligns with your research aims and objectives.

Another thing to be aware of with IPA is personal bias . While researcher bias can creep into all forms of research, self-awareness is critically important with IPA, as it can have a major impact on the results. For example, a researcher who was a victim of a crime himself could insert his own feelings of frustration and anger into the way he interprets the experience of someone who was kidnapped. So, if you’re going to undertake IPA, you need to be very self-aware or you could muddy the analysis.

IPA can help you understand the personal experiences of a person or group concerning a major life event, an experience or a situation.

How to choose the right analysis method

In light of all of the qualitative analysis methods we’ve covered so far, you’re probably asking yourself the question, “ How do I choose the right one? ”

Much like all the other methodological decisions you’ll need to make, selecting the right qualitative analysis method largely depends on your research aims, objectives and questions . In other words, the best tool for the job depends on what you’re trying to build. For example:

  • Perhaps your research aims to analyse the use of words and what they reveal about the intention of the storyteller and the cultural context of the time.
  • Perhaps your research aims to develop an understanding of the unique personal experiences of people that have experienced a certain event, or
  • Perhaps your research aims to develop insight regarding the influence of a certain culture on its members.

As you can probably see, each of these research aims is distinctly different, and therefore a different analysis method would be suitable for each one. For example, narrative analysis would likely be a good option for the first aim, while grounded theory wouldn’t be as relevant.

It’s also important to remember that each method has its own set of strengths, weaknesses and general limitations. No single analysis method is perfect. So, depending on the nature of your research, it may make sense to adopt more than one method (this is called triangulation). Keep in mind though that this will of course be quite time-consuming.

As we’ve seen, all of the qualitative analysis methods we’ve discussed make use of coding and theme-generating techniques, but the intent and approach of each analysis method differ quite substantially. So, it’s very important to come into your research with a clear intention before you decide which analysis method (or methods) to use.

Start by reviewing your research aims , objectives and research questions to assess what exactly you’re trying to find out – then select a qualitative analysis method that fits. Never pick a method just because you like it or have experience using it – your analysis method (or methods) must align with your broader research aims and objectives.

No single analysis method is perfect, so it can often make sense to adopt more than one  method (this is called triangulation).

Let’s recap on QDA methods…

In this post, we looked at six popular qualitative data analysis methods:

  • First, we looked at content analysis , a straightforward method that blends a little bit of quant into a primarily qualitative analysis.
  • Then we looked at narrative analysis , which is about analysing how stories are told.
  • Next up was discourse analysis – which is about analysing conversations and interactions.
  • Then we moved on to thematic analysis – which is about identifying themes and patterns.
  • From there, we turned to grounded theory – which is about starting from scratch with a specific question and using the data alone to build a theory in response to that question.
  • And finally, we looked at IPA – which is about understanding people’s unique experiences of a phenomenon.

Of course, these aren’t the only options when it comes to qualitative data analysis, but they’re a great starting point if you’re dipping your toes into qualitative research for the first time.

If you’re still feeling a bit confused, consider our private coaching service , where we hold your hand through the research process to help you develop your best work.






  • Volume 17, Issue 1
  • Qualitative data analysis: a practical example

  • Helen Noble 1 ,
  • Joanna Smith 2
  • 1 School of Nursing and Midwifery, Queen's University Belfast , Belfast , UK
  • 2 Department of Health Sciences , University of Huddersfield , Huddersfield , UK
  • Correspondence to : Dr Helen Noble School of Nursing and Midwifery, Queen's University Belfast, Medical Biology Centre, 97 Lisburn Road, Belfast BT9 7BL, UK; helen.noble{at}qub.ac.uk



The aim of this paper is to equip readers with an understanding of the principles of qualitative data analysis and offer a practical example of how analysis might be undertaken in an interview-based study.

What is qualitative data analysis?

What are the approaches in undertaking qualitative data analysis?

Although qualitative data analysis is inductive and focuses on meaning, approaches in analysing data are diverse with different purposes and ontological (concerned with the nature of being) and epistemological (knowledge and understanding) underpinnings. 2 Identifying an appropriate approach in analysing qualitative data to meet the aim of a study can be challenging. One way to understand qualitative data analysis is to consider the processes involved. 3 Approaches can be divided into four broad groups: quasistatistical approaches such as content analysis; the use of frameworks or matrices such as a framework approach and thematic analysis; interpretative approaches that include interpretative phenomenological analysis and grounded theory; and sociolinguistic approaches such as discourse analysis and conversation analysis. However, there are commonalities across approaches. Data analysis is an interactive process, where data are systematically searched and analysed in order to provide an illuminating description of phenomena; for example, the experience of carers supporting dying patients with renal disease 4 or student nurses’ experiences following assignment referral. 5 Data analysis is an iterative or recurring process, essential to the creativity of the analysis, development of ideas, clarifying meaning and the reworking of concepts as new insights ‘emerge’ or are identified in the data.

Do you need data software packages when analysing qualitative data?

Qualitative data software packages are not a prerequisite for undertaking qualitative analysis but a range of programmes are available that can assist the qualitative researcher. Software programmes vary in design and application but can be divided into text retrievers, code and retrieve packages and theory builders. 6 NVivo and NUD*IST are widely used because they have sophisticated code and retrieve functions and modelling capabilities, which speed up the process of managing large data sets and data retrieval. Repetitions within data can be quantified and memos and hyperlinks attached to data. Analytical processes can be mapped and tracked and linkages across data visualised, leading to theory development. 6 Disadvantages of using qualitative data software packages include the complexity of the software and the fact that some programmes are not compatible with standard text formats. Extensive coding and categorising can result in data becoming unmanageable, and researchers may find that visualising data on screen inhibits conceptualisation of the data.

How do you begin analysing qualitative data?

Despite the diversity of qualitative methods, the subsequent analysis is based on a common set of principles and for interview data includes: transcribing the interviews; immersing oneself within the data to gain detailed insights into the phenomena being explored; developing a data coding system; and linking codes or units of data to form overarching themes/concepts, which may lead to the development of theory. 2 Identifying recurring and significant themes, whereby data are methodically searched to identify patterns in order to provide an illuminating description of a phenomenon, is a central skill in undertaking qualitative data analysis. Table 1 contains an extract of data taken from a research study which included interviews with carers of people with end-stage renal disease managed without dialysis. The extract is taken from a carer who is trying to understand why her mother was not offered dialysis. The first stage of data analysis involves the process of initial coding, whereby each line of the data is considered to identify keywords or phrases; these are sometimes known as in vivo codes (highlighted) because they retain participants’ words.

Table 1. Data extract containing units of data and line-by-line coding

When transcripts have been broken down into manageable sections, the researcher sorts and sifts them, searching for types, classes, sequences, processes, patterns or wholes. The next stage of data analysis involves bringing similar categories together into broader themes. Table 2 provides an example of the early development of codes and categories and how these link to form broad initial themes.

Table 2. Development of initial themes from descriptive codes

Table 3 presents an example of further category development leading to final themes which link to an overarching concept.

Table 3. Development of final themes and overarching concept

How do qualitative researchers ensure data analysis procedures are transparent and robust?

In congruence with quantitative researchers, ensuring qualitative studies are methodologically robust is essential. Qualitative researchers need to be explicit in describing how and why they undertook the research. However, qualitative research is criticised for lacking transparency in relation to the analytical processes employed, which hinders the ability of the reader to critically appraise study findings. 7 In the three tables presented the progress from units of data to coding to theme development is illustrated. ‘Not involved in treatment decisions’ appears in each table and informs one of the final themes. Documenting the movement from units of data to final themes allows for transparency of data analysis. Although other researchers may interpret the data differently, appreciating and understanding how the themes were developed is an essential part of demonstrating the robustness of the findings. Qualitative researchers must demonstrate rigour, associated with openness, relevance to practice and congruence of the methodological approach. 2 In summary, qualitative research is complex in that it produces large amounts of data, and analysis is time consuming. High-quality data analysis requires a researcher with expertise, vision and veracity.
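
One possible way to document the movement from units of data to final themes is to keep each coded unit alongside the category and theme it was assigned to. The sketch below is an illustration only and not the authors' procedure: the label ‘Not involved in treatment decisions’ is taken from the text above, while every other code, category and theme name, the line numbers and the `build_audit_trail` helper are hypothetical placeholders.

```python
from collections import defaultdict

# Each coded unit: (transcript line number, code, category, theme).
# Only "Not involved in treatment decisions" comes from the article;
# all other labels and line numbers are hypothetical placeholders.
coded_units = [
    (3, "Not told about options",
     "Not involved in treatment decisions", "Hypothetical theme A"),
    (7, "Decision made by doctors",
     "Not involved in treatment decisions", "Hypothetical theme A"),
    (12, "Unsure what happens next",
     "Hypothetical category B", "Hypothetical theme B"),
]


def build_audit_trail(units):
    """Group codes under categories, and categories under themes."""
    trail = defaultdict(lambda: defaultdict(list))
    for line_no, code, category, theme in units:
        trail[theme][category].append((line_no, code))
    return trail


audit_trail = build_audit_trail(coded_units)
for theme, categories in audit_trail.items():
    print(theme)
    for category, codes in categories.items():
        print(f"  {category}")
        for line_no, code in codes:
            print(f"    line {line_no}: {code}")
```

Because each entry keeps its transcript line number, a reader can trace any theme back to the units of data that informed it.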

  • Cheater F ,
  • Robshaw M ,
  • McLafferty E ,
  • Maggs-Rapport F

Competing interests None.


PW Skills | Blog

Data Analysis Techniques in Research – Methods, Tools & Examples


Varun Saharawat is a seasoned professional in the fields of SEO and content writing. With a profound knowledge of the intricate aspects of these disciplines, Varun has established himself as a valuable asset in the world of digital marketing and online content creation.


Data analysis techniques in research are essential because they allow researchers to derive meaningful insights from data sets to support their hypotheses or research objectives.

Data Analysis Techniques in Research : While various groups, institutions, and professionals may have diverse approaches to data analysis, a universal definition captures its essence. Data analysis involves refining, transforming, and interpreting raw data to derive actionable insights that guide informed decision-making for businesses.


A straightforward illustration of data analysis emerges when we make everyday decisions, basing our choices on past experiences or predictions of potential outcomes.



What is Data Analysis?

Data analysis is the systematic process of inspecting, cleaning, transforming, and interpreting data with the objective of discovering valuable insights and drawing meaningful conclusions. This process involves several steps:

  • Inspecting : Initial examination of data to understand its structure, quality, and completeness.
  • Cleaning : Removing errors, inconsistencies, or irrelevant information to ensure accurate analysis.
  • Transforming : Converting data into a format suitable for analysis, such as normalization or aggregation.
  • Interpreting : Analyzing the transformed data to identify patterns, trends, and relationships.
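
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a recommended pipeline: the `raw_scores` list is made-up data, the function names are hypothetical, and min-max normalization is just one of many possible transformations.

```python
from statistics import mean

# Hypothetical raw records; None and "n/a" stand in for missing or bad entries.
raw_scores = ["72", "85", None, "n/a", "90", " 64 "]


def inspect_data(records):
    """Inspecting: report the number of rows and missing entries."""
    missing = sum(1 for r in records if r is None)
    return {"rows": len(records), "missing": missing}


def clean_data(records):
    """Cleaning: keep only entries that can be read as a number."""
    cleaned = []
    for r in records:
        try:
            cleaned.append(float(r))
        except (TypeError, ValueError):
            continue  # skip None and non-numeric strings
    return cleaned


def transform_data(values):
    """Transforming: min-max normalization to the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def interpret_data(values):
    """Interpreting: a very simple summary, here just the average."""
    return mean(values)


summary = inspect_data(raw_scores)
scores = clean_data(raw_scores)
normalized = transform_data(scores)
print(summary, scores, interpret_data(scores))
```

In practice each step is usually far more involved, but keeping them as separate functions makes it easier to see which decisions were made where.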

Types of Data Analysis Techniques in Research

Data analysis techniques in research are categorized into qualitative and quantitative methods, each with its specific approaches and tools. These techniques are instrumental in extracting meaningful insights, patterns, and relationships from data to support informed decision-making, validate hypotheses, and derive actionable recommendations. Below is an in-depth exploration of the various types of data analysis techniques commonly employed in research:

1) Qualitative Analysis:

Definition: Qualitative analysis focuses on understanding non-numerical data, such as opinions, concepts, or experiences, to derive insights into human behavior, attitudes, and perceptions.

  • Content Analysis: Examines textual data, such as interview transcripts, articles, or open-ended survey responses, to identify themes, patterns, or trends.
  • Narrative Analysis: Analyzes personal stories or narratives to understand individuals’ experiences, emotions, or perspectives.
  • Ethnographic Studies: Involves observing and analyzing cultural practices, behaviors, and norms within specific communities or settings.

2) Quantitative Analysis:

Quantitative analysis emphasizes numerical data and employs statistical methods to explore relationships, patterns, and trends. It encompasses several approaches:

Descriptive Analysis:

  • Frequency Distribution: Represents the number of occurrences of distinct values within a dataset.
  • Central Tendency: Measures such as mean, median, and mode provide insights into the central values of a dataset.
  • Dispersion: Techniques like variance and standard deviation indicate the spread or variability of data.
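
As a brief illustration, the descriptive measures listed above can be computed with Python's standard library. The `grades` list is hypothetical data chosen only to keep the arithmetic easy to follow.

```python
import statistics
from collections import Counter

grades = [70, 85, 85, 90, 60, 75, 85, 70]  # hypothetical scores

# Frequency distribution: how often each distinct value occurs
frequency = Counter(grades)

# Central tendency
mean_grade = statistics.mean(grades)
median_grade = statistics.median(grades)
mode_grade = statistics.mode(grades)

# Dispersion (sample variance and standard deviation, dividing by n - 1)
variance = statistics.variance(grades)
std_dev = statistics.stdev(grades)

print(frequency.most_common())
print(mean_grade, median_grade, mode_grade)
print(round(variance, 2), round(std_dev, 2))
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `pvariance` and `pstdev` are the population versions.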

Diagnostic Analysis:

  • Regression Analysis: Assesses the relationship between dependent and independent variables, enabling prediction and, where the study design supports it, exploration of possible causal relationships.
  • ANOVA (Analysis of Variance): Examines differences between groups to identify significant variations or effects.

Predictive Analysis:

  • Time Series Forecasting: Uses historical data points to predict future trends or outcomes.
  • Machine Learning Algorithms: Techniques like decision trees, random forests, and neural networks predict outcomes based on patterns in data.

Prescriptive Analysis:

  • Optimization Models: Utilizes linear programming, integer programming, or other optimization techniques to identify the best solutions or strategies.
  • Simulation: Mimics real-world scenarios to evaluate various strategies or decisions and determine optimal outcomes.

Specific Techniques:

  • Monte Carlo Simulation: Models probabilistic outcomes to assess risk and uncertainty.
  • Factor Analysis: Reduces the dimensionality of data by identifying underlying factors or components.
  • Cohort Analysis: Studies specific groups or cohorts over time to understand trends, behaviors, or patterns within these groups.
  • Cluster Analysis: Classifies objects or individuals into homogeneous groups or clusters based on similarities or attributes.
  • Sentiment Analysis: Uses natural language processing and machine learning techniques to determine sentiment, emotions, or opinions from textual data.
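
Of the techniques listed above, Monte Carlo simulation is perhaps the easiest to illustrate briefly. The sketch below estimates the chance that a total cost exceeds a budget by drawing many random scenarios. All figures, the choice of triangular distributions, and the `simulate_project_cost` function are assumptions made for illustration.

```python
import random


def simulate_project_cost(n_trials=10_000, seed=42):
    """Estimate the probability that total cost exceeds a budget.

    All numbers are hypothetical: two cost items are drawn from
    triangular distributions given as (low, high, most likely).
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    budget = 100.0
    over_budget = 0
    for _ in range(n_trials):
        labour = rng.triangular(40, 80, 50)
        materials = rng.triangular(20, 50, 30)
        if labour + materials > budget:
            over_budget += 1
    return over_budget / n_trials


risk = simulate_project_cost()
print(f"Estimated probability of exceeding budget: {risk:.3f}")
```

The estimate becomes more stable as `n_trials` grows, which is the usual trade-off between run time and precision in this kind of simulation.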


Data Analysis Techniques in Research Examples

To provide a clearer understanding of how data analysis techniques are applied in research, let’s consider a hypothetical research study focused on evaluating the impact of online learning platforms on students’ academic performance.

Research Objective:

Determine if students using online learning platforms achieve higher academic performance compared to those relying solely on traditional classroom instruction.

Data Collection:

  • Quantitative Data: Academic scores (grades) of students using online platforms and those using traditional classroom methods.
  • Qualitative Data: Feedback from students regarding their learning experiences, challenges faced, and preferences.

Data Analysis Techniques Applied:

1) Descriptive Analysis:

  • Calculate the mean, median, and mode of academic scores for both groups.
  • Create frequency distributions to represent the distribution of grades in each group.

2) Diagnostic Analysis:

  • Conduct an Analysis of Variance (ANOVA) to determine if there’s a statistically significant difference in academic scores between the two groups.
  • Perform Regression Analysis to assess the relationship between the time spent on online platforms and academic performance.

3) Predictive Analysis:

  • Utilize Time Series Forecasting to predict future academic performance trends based on historical data.
  • Implement Machine Learning algorithms to develop a predictive model that identifies factors contributing to academic success on online platforms.

4) Prescriptive Analysis:

  • Apply Optimization Models to identify the optimal combination of online learning resources (e.g., video lectures, interactive quizzes) that maximize academic performance.
  • Use Simulation Techniques to evaluate different scenarios, such as varying student engagement levels with online resources, to determine the most effective strategies for improving learning outcomes.

5) Specific Techniques:

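  • Conduct Factor Analysis on numerically coded feedback (for example, Likert-scale ratings) to identify underlying factors influencing students’ perceptions and experiences with online learning; open-ended text is better suited to a qualitative method such as content analysis.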
  • Perform Cluster Analysis to segment students based on their engagement levels, preferences, or academic outcomes, enabling targeted interventions or personalized learning strategies.
  • Apply Sentiment Analysis on textual feedback to categorize students’ sentiments as positive, negative, or neutral regarding online learning experiences.
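
To give a sense of how the cluster analysis step might look, here is a minimal sketch of k-means on a single variable. The `weekly_hours` data, the starting centers and the `kmeans_1d` function are hypothetical; a real study would normally cluster on several variables and use an established library.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Plain k-means on one-dimensional data, starting from given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable, so stop
        centers = new_centers
    return centers, clusters


weekly_hours = [1, 2, 2, 3, 9, 10, 11, 12]  # hypothetical hours on the platform
centers, clusters = kmeans_1d(weekly_hours, centers=[0, 5])
print(centers)
print(clusters)
```

With this data the students separate into a low-engagement and a high-engagement group, which could then be compared on their academic outcomes.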

By applying a combination of qualitative and quantitative data analysis techniques, this research example aims to provide comprehensive insights into the effectiveness of online learning platforms.


Data Analysis Techniques in Quantitative Research

Quantitative research involves collecting numerical data to examine relationships, test hypotheses, and make predictions. Various data analysis techniques are employed to interpret and draw conclusions from quantitative data. Here are some key data analysis techniques commonly used in quantitative research:

1) Descriptive Statistics:

  • Description: Descriptive statistics are used to summarize and describe the main aspects of a dataset, such as central tendency (mean, median, mode), variability (range, variance, standard deviation), and distribution (skewness, kurtosis).
  • Applications: Summarizing data, identifying patterns, and providing initial insights into the dataset.

2) Inferential Statistics:

  • Description: Inferential statistics involve making predictions or inferences about a population based on a sample of data. This technique includes hypothesis testing, confidence intervals, t-tests, chi-square tests, analysis of variance (ANOVA), regression analysis, and correlation analysis.
  • Applications: Testing hypotheses, making predictions, and generalizing findings from a sample to a larger population.
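
Two of the inferential tools mentioned above can be sketched with the standard library: a test statistic comparing two groups and a confidence interval for a mean. The samples below are hypothetical, the function names are illustrative, and the interval uses a normal approximation; with samples this small, a t-based interval and a proper p-value from a statistics package would normally be preferred.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


def approx_ci_95(sample):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    z = statistics.NormalDist().inv_cdf(0.975)  # roughly 1.96
    m = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return m - half_width, m + half_width


online = [78, 85, 90, 72, 88, 95]      # hypothetical scores
classroom = [70, 75, 80, 68, 74, 77]   # hypothetical scores

t_stat = welch_t(online, classroom)
ci_low, ci_high = approx_ci_95(online)
print(round(t_stat, 3), round(ci_low, 1), round(ci_high, 1))
```

A larger absolute t statistic indicates a bigger difference between group means relative to the variability within the groups.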

3) Regression Analysis:

  • Description: Regression analysis is a statistical technique used to model and examine the relationship between a dependent variable and one or more independent variables. Linear regression, multiple regression, logistic regression, and nonlinear regression are common types of regression analysis .
  • Applications: Predicting outcomes, identifying relationships between variables, and understanding the impact of independent variables on the dependent variable.
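
For the simplest case, one independent variable, the least-squares line can be fitted by hand. This is a minimal sketch with hypothetical data (hours studied and exam score); `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and sum of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, score)
predicted = intercept + slope * 6  # predicted score for 6 hours of study
print(f"slope={slope:.1f}, intercept={intercept:.1f}, prediction={predicted:.1f}")
```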

4) Correlation Analysis:

  • Description: Correlation analysis is used to measure and assess the strength and direction of the relationship between two or more variables. The Pearson correlation coefficient, Spearman rank correlation coefficient, and Kendall’s tau are commonly used measures of correlation.
  • Applications: Identifying associations between variables and assessing the degree and nature of the relationship.
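
The difference between the Pearson and Spearman coefficients is easiest to see on data that are monotonic but not linear. The sketch below implements both by hand under that assumption; the helper names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(f"Pearson: {pearson(x, y):.3f}, Spearman: {spearman(x, y):.3f}")
```

Spearman reaches 1 here because the ordering of the two variables agrees perfectly, while Pearson stays below 1 because the relationship is curved.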

5) Factor Analysis:

  • Description: Factor analysis is a multivariate statistical technique used to identify and analyze underlying relationships or factors among a set of observed variables. It helps in reducing the dimensionality of data and identifying latent variables or constructs.
  • Applications: Identifying underlying factors or constructs, simplifying data structures, and understanding the underlying relationships among variables.

6) Time Series Analysis:

  • Description: Time series analysis involves analyzing data collected or recorded over a specific period at regular intervals to identify patterns, trends, and seasonality. Techniques such as moving averages, exponential smoothing, autoregressive integrated moving average (ARIMA), and Fourier analysis are used.
  • Applications: Forecasting future trends, analyzing seasonal patterns, and understanding time-dependent relationships in data.
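
Two of the simpler techniques named above, moving averages and exponential smoothing, fit in a few lines. This is a sketch with hypothetical monthly figures; the window size and the smoothing factor `alpha` are illustrative choices, not recommendations.

```python
def moving_average(series, window):
    """Mean of each consecutive run of `window` observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; alpha near 1 follows recent values closely."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical monthly sales figures.
sales = [120, 132, 128, 141, 150, 147, 160]

print(moving_average(sales, window=3))
print(exponential_smoothing(sales, alpha=0.5))
```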

7) ANOVA (Analysis of Variance):

  • Description: Analysis of variance (ANOVA) is a statistical technique used to analyze and compare the means of two or more groups or treatments to determine if they are statistically different from each other. One-way ANOVA, two-way ANOVA, and MANOVA (Multivariate Analysis of Variance) are common types of ANOVA.
  • Applications: Comparing group means, testing hypotheses, and determining the effects of categorical independent variables on a continuous dependent variable.
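
The F statistic in a one-way ANOVA is the ratio of the variability between group means to the variability within groups. The following sketch computes it by hand for three hypothetical groups; turning F into a p-value would additionally need the F distribution, which the Python standard library does not provide.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical outcomes under three treatments.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```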

8) Chi-Square Tests:

  • Description: Chi-square tests are non-parametric statistical tests used to assess the association between categorical variables in a contingency table. The Chi-square test of independence, goodness-of-fit test, and test of homogeneity are common chi-square tests.
  • Applications: Testing relationships between categorical variables, assessing goodness-of-fit, and evaluating independence.
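
For the test of independence, the statistic compares each observed count with the count expected if the two variables were unrelated. A minimal sketch, assuming a hypothetical 2 x 2 contingency table:

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical counts: rows are two groups, columns are yes/no answers.
observed = [[20, 30], [30, 20]]
chi2, dof = chi_square_independence(observed)
print(f"chi-square = {chi2:.2f} with {dof} degree(s) of freedom")
```

With one degree of freedom, the 5% critical value is about 3.841, so a statistic above that would lead to rejecting independence at that level.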

These quantitative data analysis techniques provide researchers with valuable tools and methods to analyze, interpret, and derive meaningful insights from numerical data. The selection of a specific technique often depends on the research objectives, the nature of the data, and the underlying assumptions of the statistical methods being used.

Data Analysis Methods

Data analysis methods refer to the techniques and procedures used to analyze, interpret, and draw conclusions from data. These methods are essential for transforming raw data into meaningful insights, facilitating decision-making processes, and driving strategies across various fields. Here are some common data analysis methods:

1) Descriptive Statistics:

  • Description: Descriptive statistics summarize and organize data to provide a clear and concise overview of the dataset. Measures such as mean, median, mode, range, variance, and standard deviation are commonly used.

2) Inferential Statistics:

  • Description: Inferential statistics involve making predictions or inferences about a population based on a sample of data. Techniques such as hypothesis testing, confidence intervals, and regression analysis are used.

3) Exploratory Data Analysis (EDA):

  • Description: EDA techniques involve visually exploring and analyzing data to discover patterns, relationships, anomalies, and insights. Methods such as scatter plots, histograms, box plots, and correlation matrices are utilized.
  • Applications: Identifying trends, patterns, outliers, and relationships within the dataset.
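
EDA is mostly visual, but the rule a box plot uses to flag outliers is easy to apply directly. The sketch below assumes the common 1.5 x IQR convention and uses hypothetical response times; `iqr_outliers` is an illustrative helper.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical response times in seconds, with one suspicious value.
response_times = [10, 12, 13, 14, 15, 16, 18, 95]
print(iqr_outliers(response_times))
```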

4) Predictive Analytics:

  • Description: Predictive analytics use statistical algorithms and machine learning techniques to analyze historical data and make predictions about future events or outcomes. Techniques such as regression analysis, time series forecasting, and machine learning algorithms (e.g., decision trees, random forests, neural networks) are employed.
  • Applications: Forecasting future trends, predicting outcomes, and identifying potential risks or opportunities.

5) Prescriptive Analytics:

  • Description: Prescriptive analytics involve analyzing data to recommend actions or strategies that optimize specific objectives or outcomes. Optimization techniques, simulation models, and decision-making algorithms are utilized.
  • Applications: Recommending optimal strategies, decision-making support, and resource allocation.

6) Qualitative Data Analysis:

  • Description: Qualitative data analysis involves analyzing non-numerical data, such as text, images, videos, or audio, to identify themes, patterns, and insights. Methods such as content analysis, thematic analysis, and narrative analysis are used.
  • Applications: Understanding human behavior, attitudes, perceptions, and experiences.

7) Big Data Analytics:

  • Description: Big data analytics methods are designed to analyze large volumes of structured and unstructured data to extract valuable insights. Technologies such as Hadoop, Spark, and NoSQL databases are used to process and analyze big data.
  • Applications: Analyzing large datasets, identifying trends, patterns, and insights from big data sources.

8) Text Analytics:

  • Description: Text analytics methods involve analyzing textual data, such as customer reviews, social media posts, emails, and documents, to extract meaningful information and insights. Techniques such as sentiment analysis, text mining, and natural language processing (NLP) are used.
  • Applications: Analyzing customer feedback, monitoring brand reputation, and extracting insights from textual data sources.
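
To make the idea of sentiment analysis concrete, here is a deliberately simple word-list classifier. The word lists are hypothetical and far too small for real use; production systems typically rely on trained models or dedicated NLP libraries rather than on counting words.

```python
import re

# Tiny illustrative lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "helpful", "clear", "love"}
NEGATIVE = {"bad", "confusing", "slow", "boring", "hate"}

def classify_sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

feedback = [
    "The lectures were clear and helpful",
    "The platform is slow and confusing",
    "I attended the course",
]
for comment in feedback:
    print(classify_sentiment(comment), "-", comment)
```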

These data analysis methods are instrumental in transforming data into actionable insights, informing decision-making processes, and driving organizational success across various sectors, including business, healthcare, finance, marketing, and research. The selection of a specific method often depends on the nature of the data, the research objectives, and the analytical requirements of the project or organization.

Data Analysis Tools

Data analysis tools are essential instruments that facilitate the process of examining, cleaning, transforming, and modeling data to uncover useful information, make informed decisions, and drive strategies. Here are some prominent data analysis tools widely used across various industries:

1) Microsoft Excel:

  • Description: A spreadsheet software that offers basic to advanced data analysis features, including pivot tables, data visualization tools, and statistical functions.
  • Applications: Data cleaning, basic statistical analysis, visualization, and reporting.

2) R Programming Language:

  • Description: An open-source programming language specifically designed for statistical computing and data visualization.
  • Applications: Advanced statistical analysis, data manipulation, visualization, and machine learning.

3) Python (with Libraries like Pandas, NumPy, Matplotlib, and Seaborn):

  • Description: A versatile programming language with libraries that support data manipulation, analysis, and visualization.
  • Applications: Data cleaning, statistical analysis, machine learning, and data visualization.

4) SPSS (Statistical Package for the Social Sciences):

  • Description: A comprehensive statistical software suite used for data analysis, data mining, and predictive analytics.
  • Applications: Descriptive statistics, hypothesis testing, regression analysis, and advanced analytics.

5) SAS (Statistical Analysis System):

  • Description: A software suite used for advanced analytics, multivariate analysis, and predictive modeling.
  • Applications: Data management, statistical analysis, predictive modeling, and business intelligence.

6) Tableau:

  • Description: A data visualization tool that allows users to create interactive and shareable dashboards and reports.
  • Applications: Data visualization, business intelligence, and interactive dashboard creation.

7) Power BI:

  • Description: A business analytics tool developed by Microsoft that provides interactive visualizations and business intelligence capabilities.
  • Applications: Data visualization, business intelligence, reporting, and dashboard creation.

8) SQL (Structured Query Language) Databases (e.g., MySQL, PostgreSQL, Microsoft SQL Server):

  • Description: Database management systems that support data storage, retrieval, and manipulation using SQL queries.
  • Applications: Data retrieval, data cleaning, data transformation, and database management.
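
As a small illustration of retrieval and cleaning with SQL, the sketch below uses SQLite through Python's built-in sqlite3 module instead of the server databases named above, purely so that it runs without any setup. The table and column names are hypothetical.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses (respondent_id INTEGER, region TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?)",
    [(1, "North", 4.0), (2, "North", 5.0), (3, "South", 3.0), (4, "South", None)],
)

# Drop missing scores, then count and average per region.
rows = conn.execute(
    """
    SELECT region, COUNT(score) AS n, AVG(score) AS mean_score
    FROM responses
    WHERE score IS NOT NULL
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for region, n, mean_score in rows:
    print(region, n, mean_score)
```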

9) Apache Spark:

  • Description: A fast and general-purpose distributed computing system designed for big data processing and analytics.
  • Applications: Big data processing, machine learning, data streaming, and real-time analytics.

10) IBM SPSS Modeler:

  • Description: A data mining software application used for building predictive models and conducting advanced analytics.
  • Applications: Predictive modeling, data mining, statistical analysis, and decision optimization.

These tools serve various purposes and cater to different data analysis needs, from basic statistical analysis and data visualization to advanced analytics, machine learning, and big data processing. The choice of a specific tool often depends on the nature of the data, the complexity of the analysis, and the specific requirements of the project or organization.

Importance of Data Analysis in Research

The importance of data analysis in research cannot be overstated; it serves as the backbone of any scientific investigation or study. Here are several key reasons why data analysis is crucial in the research process:

  • Data analysis helps ensure that the results obtained are valid and reliable. By systematically examining the data, researchers can identify any inconsistencies or anomalies that may affect the credibility of the findings.
  • Effective data analysis provides researchers with the necessary information to make informed decisions. By interpreting the collected data, researchers can draw conclusions, make predictions, or formulate recommendations based on evidence rather than intuition or guesswork.
  • Data analysis allows researchers to identify patterns, trends, and relationships within the data. This can lead to a deeper understanding of the research topic, enabling researchers to uncover insights that may not be immediately apparent.
  • In empirical research, data analysis plays a critical role in testing hypotheses. Researchers collect data to either support or refute their hypotheses, and data analysis provides the tools and techniques to evaluate these hypotheses rigorously.
  • Transparent and well-executed data analysis enhances the credibility of research findings. By clearly documenting the data analysis methods and procedures, researchers allow others to replicate the study, thereby contributing to the reproducibility of research findings.
  • In fields such as business or healthcare, data analysis helps organizations allocate resources more efficiently. By analyzing data on consumer behavior, market trends, or patient outcomes, organizations can make strategic decisions about resource allocation, budgeting, and planning.
  • In public policy and social sciences, data analysis is instrumental in developing and evaluating policies and interventions. By analyzing data on social, economic, or environmental factors, policymakers can assess the effectiveness of existing policies and inform the development of new ones.
  • Data analysis allows for continuous improvement in research methods and practices. By analyzing past research projects, identifying areas for improvement, and implementing changes based on data-driven insights, researchers can refine their approaches and enhance the quality of future research endeavors.

Data Analysis Techniques in Research FAQs

What are the 5 techniques for data analysis?

The five techniques for data analysis include:

  • Descriptive Analysis
  • Diagnostic Analysis
  • Predictive Analysis
  • Prescriptive Analysis
  • Qualitative Analysis

What are techniques of data analysis in research?

Techniques of data analysis in research encompass both qualitative and quantitative methods. These techniques involve processes like summarizing raw data, investigating causes of events, forecasting future outcomes, offering recommendations based on predictions, and examining non-numerical data to understand concepts or experiences.

What are the 3 methods of data analysis?

The three primary methods of data analysis are:

  • Qualitative Analysis
  • Quantitative Analysis
  • Mixed-Methods Analysis

What are the four types of data analysis techniques?

The four types of data analysis techniques are:

  • Descriptive Analysis
  • Diagnostic Analysis
  • Predictive Analysis
  • Prescriptive Analysis

Data analysis write-ups

What should a data-analysis write-up look like?

Writing up the results of a data analysis is not a skill that anyone is born with. It requires practice and, at least in the beginning, a bit of guidance.


When writing your report, organization will set you free. A good outline is: 1) overview of the problem, 2) your data and modeling approach, 3) the results of your data analysis (plots, numbers, etc), and 4) your substantive conclusions.

1) Overview

Describe the problem. What substantive question are you trying to address? This needn’t be long, but it should be clear.

2) Data and model

What data did you use to address the question, and how did you do it? When describing your approach, be specific. For example:

  • Don’t say, “I ran a regression” when you instead can say, “I fit a linear regression model to predict price that included a house’s size and neighborhood as predictors.”
  • Justify important features of your modeling approach. For example: “Neighborhood was included as a categorical predictor in the model because Figure 2 indicated clear differences in price across the neighborhoods.”

Sometimes your Data and Model section will contain plots or tables, and sometimes it won’t. If you feel that a plot helps the reader understand the problem or data set itself, as opposed to your results, then go ahead and include it. A great example here is Tables 1 and 2 in the main paper on the PREDIMED study. These tables help the reader understand some important properties of the data and approach, but not the results of the study itself.

3) Results

In your results section, include any figures and tables necessary to make your case. Label them (Figure 1, 2, etc.), give them informative captions, and refer to them in the text by their numbered labels where you discuss them. Typical items to include here are pictures of the data, pictures and tables that show the fitted model, and tables of model coefficients and summaries.

4) Conclusion

What did you learn from the analysis? What is the answer, if any, to the question you set out to address?

General advice

Make the sections as short or long as they need to be. For example, a conclusions section is often pretty short, while a results section is usually a bit longer.

It’s OK to use the first person to avoid awkward or bizarre sentence constructions, but try to do so sparingly.

Do not include computer code unless explicitly called for. Note: model outputs do not count as computer code. Outputs should be used as evidence in your results section (ideally formatted in a nice way). By code, I mean the sequence of commands you used to process the data and produce the outputs.

When in doubt, use shorter words and sentences.

A very common way for reports to go wrong is when the writer simply narrates the thought process he or she followed: “First I did this, but it didn’t work. Then I did something else, and I found A, B, and C. I wasn’t really sure what to make of B, but C was interesting, so I followed up with D and E. Then having done this…” Do not do this. The desire for specificity is admirable, but the overall effect is one of amateurism. Follow the recommended outline above.

Here’s a good example of a write-up for an analysis of a few relatively simple problems. Because the problems are so straightforward, there’s not much of a need for an outline of the kind described above. Nonetheless, the spirit of these guidelines is clearly in evidence. Notice the clear exposition, the labeled figures and tables that are referred to in the text, and the careful integration of visual and numerical evidence into the overall argument. This is one worth emulating.

National Academies Press: OpenBook

Effective Experiment Design and Data Analysis in Transportation Research (2012)

Chapter 3: Examples of Effective Experiment Design and Data Analysis in Transportation Research

About this Chapter

This chapter provides a wide variety of examples of research questions. The examples demonstrate varying levels of detail with regard to experiment designs and the statistical analyses required. The number and types of examples were selected after consulting with many practitioners. The attempt was made to provide a couple of detailed examples in each of several areas of transportation practice. For each type of problem or analysis, some comments also appear about research topics in other areas that might be addressed using the same approach. Questions that were briefly introduced in Chapter 2 are addressed in considerably more depth in the context of these examples.

All the examples are organized and presented using the outline below. Where applicable, references to the two-volume primer produced under NCHRP Project 20-45 have been provided to encourage the reader to obtain more detail about calculation techniques and more technical discussion of issues.

Basic Outline for Examples

The numbered outline below is the model for the structure of all of the examples that follow.

1. Research Question/Problem Statement: A simple statement of the research question is given. For example, in the maintenance category, does crack sealant A perform better than crack sealant B?

2. Identification and Description of Variables: The dependent and independent variables are identified and described. The latter includes an indication of whether, for example, the variables are discrete or continuous.

3. Data Collection: A hypothetical scenario is presented to describe how, where, and when data should be collected. As appropriate, reference is made to conventions or requirements for some types of data (e.g., if delay times at an intersection are being calculated before and after some treatment, the data collected need to be consistent with the requirements in the Highway Capacity Manual). Typical problems are addressed, such as sample size, the need for control groups, and so forth.

4. Specification of Analysis Technique and Data Analysis: The links between successfully framing the research question, fully describing the variables that need to be considered, and the specification of the appropriate analysis technique are highlighted in each example. References to NCHRP Project 20-45 are provided for additional detail. The appropriate types of statistical test(s) are described for the specific example.

5. Interpreting the Results: In each example, results that can be expected from the analysis are discussed in terms of what they mean from a statistical perspective (e.g., the t-test result from a comparison of means indicates whether the mean values of two distributions can be considered to be equal with a specified degree of confidence) as well as an operational perspective (e.g., judging whether the difference is large enough to make an operational difference). In each example, the typical results and their limitations are discussed.

6. Conclusion and Discussion: This section recaps how the early steps in the process lead directly to the later ones. Comments are made regarding how changes in the early steps can affect not only the results of the analysis but also the appropriateness of the approach.

7. Applications in Other Areas of Transportation Research: Each example includes a short list of typical applications in other areas of transportation research for which the approach or analysis technique would be appropriate.

Techniques Covered in the Examples

The determination of what kinds of statistical techniques to include in the examples was made after consulting with a variety of professionals and examining responses to a survey of research-oriented practitioners. The examples are not exhaustive insofar as not every type of statistical analysis is covered. However, the attempt has been made to cover a representative sample of techniques that the practitioner is most likely to encounter in undertaking or supervising research-oriented projects. The following techniques are introduced in one or more examples:

  • Descriptive statistics
  • Fitting distributions/goodness of fit (used in one example)
  • Simple one- and two-sample comparison of means
  • Simple comparisons of multiple means using analysis of variance (ANOVA)
  • Factorial designs (also ANOVA)
  • Simple comparisons of means before and after some treatment
  • Complex before-and-after comparisons involving control groups
  • Trend analysis
  • Regression
  • Logit analysis (used in one example)
  • Survey design and analysis
  • Simulation
  • Non-parametric methods (used in one example)

Although the attempt has been made to make the examples as readable as possible, some technical terms may be unfamiliar to some readers. Detailed definitions for most applicable statistical terms are available in the glossary in NCHRP Project 20-45, Volume 2, Appendix A. Most definitions used here are consistent with those contained in NCHRP Project 20-45, which contains useful information for everyone from the beginning researcher to the most accomplished statistician.

Some variations appear in the notations used in the examples. For example, in statistical analysis an alternate hypothesis may be represented by Ha or by H1, and readers will find both notations used in this report. The examples were developed by several authors with differing backgrounds, and latitude was deliberately given to the authors to use the notations with which they are most familiar. The variations have been included purposefully to acquaint readers with the fact that the same concepts (e.g., something as simple as a mean value) may be noted in various ways by different authors or analysts.

Finally, the more widely used techniques, such as analysis of variance (ANOVA), are applied in more than one example. Readers interested in ANOVA are encouraged to read all the ANOVA examples as each example presents different aspects of or perspectives on the approach, and computational techniques presented in one example may not be repeated in later examples (although a citation typically is provided).

Areas Covered in the Examples

Transportation research is very broad, encompassing many fields. Based on consultation with many research-oriented professionals and a survey of practitioners, key areas of research were identified. Although these areas have lots of overlap, explicit examples in the following areas are included:

  • Construction
  • Environment
  • Lab testing and instrumentation
  • Maintenance
  • Materials
  • Pavements
  • Public transportation
  • Structures/bridges
  • Traffic operations
  • Traffic safety
  • Transportation planning
  • Work zones

The 21 examples provided on the following pages begin with the most straightforward analytical approaches (i.e., descriptive statistics) and progress to more sophisticated approaches. Table 1 lists the examples along with the area of research and method of analysis for each example.

Example 1: Structures/Bridges; Descriptive Statistics

Area: Structures/bridges

Method of Analysis: Descriptive statistics (exploring and presenting data to describe existing conditions and develop a basis for further analysis)

1. Research Question/Problem Statement: An engineer for a state agency wants to determine the functional and structural condition of a select number of highway bridges located across the state. Data are obtained for 100 bridges scheduled for routine inspection. The data will be used to develop bridge rehabilitation and/or replacement programs. The objective of this analysis is to provide an overview of the bridge conditions, and to present various methods to display the data in a concise and meaningful manner.

Question/Issue: Use collected data to describe existing conditions and prepare for future analysis. In this case, bridge inspection data from the state are to be studied and summarized.

2. Identification and Description of Variables: Bridge inspection generally entails collection of numerous variables that include location information, traffic data, structural elements’ type and condition, and functional characteristics. In this example, the variables are: bridge condition ratings of the deck, superstructure, and substructure; and overall condition of the bridge. Based on the severity of deterioration and the extent of spread through a bridge component, a condition rating is assigned on a discrete scale from 0 (failed) to 9 (excellent). These ratings (in addition to several other factors) are used in categorization of a bridge in one of three overall conditions: not deficient; structurally deficient; or functionally obsolete.

Table 1. Examples provided in this report.

| Example | Area | Method of Analysis |
|---|---|---|
| 1 | Structures/bridges | Descriptive statistics (exploring and presenting data to describe existing conditions) |
| 2 | Public transport | Descriptive statistics (organizing and presenting data to describe a system or component) |
| 3 | Environment | Descriptive statistics (organizing and presenting data to explain current conditions) |
| 4 | Traffic operations | Goodness of fit (chi-square test; determining if observed/collected data fit a certain distribution) |
| 5 | Construction | Simple comparisons to specified values (t-test to compare the mean value of a small sample to a standard or other requirement) |
| 6 | Maintenance | Simple two-sample comparison (t-test for paired comparisons; comparing the mean values of two sets of matched data) |
| 7 | Materials | Simple two-sample comparisons (t-test for paired comparisons and the F-test for comparing variances) |
| 8 | Laboratory testing and/or instrumentation | Simple ANOVA (comparing the mean values of more than two samples using the F-test) |
| 9 | Materials | Simple ANOVA (comparing more than two mean values and the F-test for equality of means) |
| 10 | Pavements | Simple ANOVA (comparing the mean values of more than two samples using the F-test) |
| 11 | Pavements | Factorial design (an ANOVA approach exploring the effects of varying more than one independent variable) |
| 12 | Work zones | Simple before-and-after comparisons (exploring the effect of some treatment before it is applied versus after it is applied) |
| 13 | Traffic safety | Complex before-and-after comparisons using control groups (examining the effect of some treatment or application with consideration of other factors) |
| 14 | Work zones | Trend analysis (examining, describing, and modeling how something changes over time) |
| 15 | Structures/bridges | Trend analysis (examining a trend over time) |
| 16 | Transportation planning | Multiple regression analysis (developing and testing proposed linear models with more than one independent variable) |
| 17 | Traffic operations | Regression analysis (developing a model to predict the values that a dependent variable can take as a function of one or more independent variables) |
| 18 | Transportation planning | Logit and related analysis (developing predictive models when the dependent variable is dichotomous) |
| 19 | Public transit | Survey design and analysis (organizing survey data for statistical analysis) |
| 20 | Traffic operations | Simulation (using field data to simulate or model operations or outcomes) |
| 21 | Traffic safety | Non-parametric methods (methods to be used when data do not follow assumed or conventional distributions) |

14 effective experiment Design and Data analysis in transportation research 3. Data Collection: Data are collected at 100 scheduled locations by bridge inspectors. It is important to note that the bridge condition rating scale is based on subjective categories, and there may be inherent variability among inspectors in their assignment of ratings to bridge components. A sample of data is compiled to document the bridge condition rating of the three primary structural components and the overall condition by location and ownership (Table 2). Notice that the overall condition of a bridge is not necessarily based only on the condition rating of its components (e.g., they cannot just be added). 4. Specification of Analysis Technique and Data Analysis: The two primary variables of inter- est are bridge condition rating and overall condition. The overall condition of the bridge is a categorical variable with three possible values: not deficient; structurally deficient; and functionally obsolete. The frequencies of these values in the given data set are calculated and displayed in the pie chart below. A pie chart provides a visualization of the relative proportions of bridges falling into each category that is often easier to communicate to the reader than a table showing the same information (Figure 1). Another way to look at the overall bridge condition variable is by cross-tabulation of the three condition categories with the two location categories (urban and rural), as shown in Table 3. A cross-tabulation provides the joint distribution of two (or more) variables such that each cell represents the frequency of occurrence of a specific combination of pos- sible values. For example, as seen in Table 3, there are 10 structurally deficient bridges in rural areas, which represent 11.4% of all rural area bridges inspected. The numbers in the parentheses are column percentages and add up to 100%. 
Table 3 also shows that 88 of the bridges inspected were located in rural areas, whereas 12 were located in urban areas.

The mean values of the bridge condition rating variable for deck, superstructure, and substructure are shown in Table 4. These have been calculated by taking the sum of all the values and then dividing by the total number of cases (100 in this example). Generally, a condition rating of 4 or below indicates deficiency in a structural component. For the purpose of comparison, the mean bridge condition rating of the 13 structurally deficient bridges also is provided. Notice that while the rating scale for the bridge conditions is discrete with values ranging from 0 (failure) to 9 (excellent), the average bridge condition variable is continuous. Therefore, an average score of 6.47 would indicate overall condition of all bridges to be between 6 (satisfactory) and 7 (good). The combined bridge condition rating of deck, superstructure, and substructure is not defined; therefore calculating the mean of the three components’ average rating would make no sense. Also, the average bridge condition rating of functionally obsolete bridges is not calculated because other functional characteristics also accounted for this designation.

The distributions of the bridge condition ratings for deck, superstructure, and substructure are shown in Figure 2. Based on the cut-off point of 4, approximately 7% of all bridge decks, 2% of all superstructures, and 5% of all substructures are deficient.

Bridge No.  Owner         Location  Deck  Superstructure  Substructure  Overall Condition
1           State         Rural     8     8               8             ND*
7           Local agency  Rural     6     6               6             FO*
39          State         Urban     6     6               2             SD*
69          State park    Rural     7     5               5             SD
92          City          Urban     5     6               6             ND
(Deck, Superstructure, and Substructure are bridge condition ratings.)
*ND = not deficient; FO = functionally obsolete; SD = structurally deficient.
Table 2. Sample bridge inspection data.

Figure 1. Highway bridge conditions (pie chart): structurally deficient (SD), 13%; functionally obsolete (FO), 10%; neither SD nor FO, 77%.

5. Interpreting the Results: The results indicate that a majority of bridges (77%) are not structurally or functionally deficient. The inspections were carried out on bridges primarily located in rural areas (88 out of 100). The bridge condition variable may also be cross-tabulated with the ownership variable to determine distribution by jurisdiction. The average condition ratings for the three bridge components for all bridges lie between 6 (satisfactory, some minor problems) and 7 (good, no problems noted).

6. Conclusion and Discussion: This example illustrates how to summarize and present quantitative and qualitative data on bridge conditions. It is important to understand the measurement scale of variables in order to interpret the results correctly.
Bridge inspection data collected over time may also be analyzed to determine trends in the condition of bridges in a given area. Trend analysis is addressed in Example 15 (structures).

                        Rural        Urban       Total
Structurally deficient  10 (11.4%)   3 (25.0%)   13
Functionally obsolete   6 (6.8%)     4 (33.3%)   10
Not deficient           72 (81.8%)   5 (41.7%)   77
Total                   88 (100%)    12 (100%)   100
Table 3. Cross-tabulation of bridge condition by location.

Rating Category                                                                     Mean Value
Overall average bridge condition rating (deck)                                      6.20
Overall average bridge condition rating (superstructure)                            6.47
Overall average bridge condition rating (substructure)                              6.08
Average bridge condition rating of structurally deficient bridges (deck)            4.92
Average bridge condition rating of structurally deficient bridges (superstructure)  5.30
Average bridge condition rating of structurally deficient bridges (substructure)    4.54
Table 4. Bridge condition ratings.

Figure 2. Bridge condition ratings (percentage of structures, 0 to 45, at each condition rating from 9 to 0, shown separately for deck, superstructure, and substructure).

7. Applications in Other Areas of Transportation Research: Descriptive statistics could be used to present data in other areas of transportation research, such as:
• Transportation Planning: to assess the distribution of travel times between origin-destination pairs in an urban area. Overall averages could also be calculated.
• Traffic Operations: to analyze the average delay per vehicle at a railroad crossing.
• Traffic Operations/Safety: to examine the frequency of turning violations at driveways with various turning restrictions.
• Work Zones, Environment: to assess the average energy consumption during various stages of construction.

Example 2: Public Transport; Descriptive Statistics

Area: Public transport

Method of Analysis: Descriptive statistics (organizing and presenting data to describe a system or component)

1. Research Question/Problem Statement: The manager of a transit agency would like to present information to the board of commissioners on changes in revenue that resulted from a change in the fare. The transit system provides three basic types of service: local bus routes, express bus routes, and demand-responsive bus service. There are 15 local bus routes, 10 express routes, and 1 demand-responsive system.

Question/Issue
Use data to describe some change over time. In this instance, data from 2008 and 2009 are used to describe the change in revenue on each route/part of a transit system when the fare structure was changed from variable (per mile) to fixed fares.

2. Identification and Description of Variables: Revenue data are available for each route on the local and express bus system and the demand-responsive system as a whole for the years 2008 and 2009.

3. Data Collection: Revenue data were collected on each route for both 2008 and 2009. The annual revenue for the demand-responsive system was also collected. These data are shown in Table 5.

4. Specification of Analysis Technique and Data Analysis: The objective of this analysis is to present the impact of changing the fare system in a series of graphs.
The presentation is intended to show the impact on each component of the transit system as well as the impact on overall system revenue. The impact of the fare change on the overall revenue is best shown with a bar graph (Figure 3). The variation in the impact across system components can be illustrated in a similar graph (Figure 4). A pie chart also can be used to illustrate the relative impact on each system component (Figure 5).

Bus Route                  2008 Revenue  2009 Revenue
Local Route 1              $350,500      $365,700
Local Route 2              $263,000      $271,500
Local Route 3              $450,800      $460,700
Local Route 4              $294,300      $306,400
Local Route 5              $173,900      $184,600
Local Route 6              $367,800      $375,100
Local Route 7              $415,800      $430,300
Local Route 8              $145,600      $149,100
Local Route 9              $248,200      $260,800
Local Route 10             $310,400      $318,300
Local Route 11             $444,300      $459,200
Local Route 12             $208,400      $205,600
Local Route 13             $407,600      $412,400
Local Route 14             $161,500      $169,300
Local Route 15             $325,100      $340,200
Express Route 1            $85,400       $83,600
Express Route 2            $110,300      $109,200
Express Route 3            $65,800       $66,200
Express Route 4            $125,300      $127,600
Express Route 5            $90,800       $90,400
Express Route 6            $125,800      $123,400
Express Route 7            $87,200       $86,900
Express Route 8            $68,300       $67,200
Express Route 9            $110,100      $112,300
Express Route 10           $73,200       $72,100
Demand-Responsive System   $510,100      $521,300
Table 5. Revenue by route or type of service and year.

Figure 3. Impact of fare change on overall revenue (bar graph of total system revenue in million $: 6.02 in 2008 and 6.17 in 2009).
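The totals and percentage changes behind these graphs can be derived directly from the route-level data. The sketch below is one possible way to do so in Python, assuming the figures in Table 5 (entered in thousands of dollars, and reading the Express Route 8 entry for 2008 as $68,300); the helper names `totals` and `pct_change` are illustrative only.

```python
# Revenue by route in thousands of dollars, transcribed from Table 5 as (2008, 2009).
local = [
    (350.5, 365.7), (263.0, 271.5), (450.8, 460.7), (294.3, 306.4), (173.9, 184.6),
    (367.8, 375.1), (415.8, 430.3), (145.6, 149.1), (248.2, 260.8), (310.4, 318.3),
    (444.3, 459.2), (208.4, 205.6), (407.6, 412.4), (161.5, 169.3), (325.1, 340.2),
]
express = [
    (85.4, 83.6), (110.3, 109.2), (65.8, 66.2), (125.3, 127.6), (90.8, 90.4),
    (125.8, 123.4), (87.2, 86.9), (68.3, 67.2), (110.1, 112.3), (73.2, 72.1),
]
demand_responsive = [(510.1, 521.3)]


def totals(routes):
    """Return (2008 total, 2009 total) for a list of (2008, 2009) pairs."""
    return sum(r[0] for r in routes), sum(r[1] for r in routes)


def pct_change(before, after):
    """Percent change from the earlier value to the later value."""
    return 100.0 * (after - before) / before


components = {
    "Local buses": local,
    "Express buses": express,
    "Demand responsive": demand_responsive,
}

# Per-component totals and percent change in revenue.
summary = {}
for name, routes in components.items():
    before, after = totals(routes)
    summary[name] = (before, after, pct_change(before, after))

# System-wide totals and each component's share of 2009 revenue.
system_2008 = sum(v[0] for v in summary.values())
system_2009 = sum(v[1] for v in summary.values())
share_2009 = {name: 100.0 * v[1] / system_2009 for name, v in summary.items()}

for name, (before, after, change) in summary.items():
    print(f"{name:<18} {before / 1000:.2f} -> {after / 1000:.2f} million $ ({change:+.1f}%)")
print(f"System total       {system_2008 / 1000:.2f} -> {system_2009 / 1000:.2f} million $")
```

The percent change here is computed on each component's total revenue rather than as an average of route-by-route percentages; the two approaches can differ slightly.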

Figure 4. Variation in impact of fare change across system components (bar graph of revenue in million $ for 2008 and 2009: local buses, 4.57 and 4.71; express buses, 0.94 and 0.94; demand responsive, 0.51 and 0.52).

Figure 5. Pie charts illustrating percent of revenue from each component of a transit system (2008: local buses 75.8%, express buses 15.7%, demand responsive 8.5%; 2009: local buses 76.3%, express buses 15.2%, demand responsive 8.5%).

If it is important to display the variability in the impact within the various bus routes in the local bus or express bus operations, this also can be illustrated (Figure 6). This type of diagram shows the maximum value, minimum value, and mean value of the percent increase in revenue across the 15 local bus routes and the 10 express bus routes.

Figure 6. Graph showing variation in revenue increase by type of bus route (percent increase in revenue: local bus routes, minimum -1.3, mean 3.1, maximum 6.2; express bus routes, minimum -2.1, mean -0.4, maximum 2.0).

5. Interpreting the Results: These results indicate that changing from a variable fare based on trip length (2008) to a fixed fare (2009) on both the local bus routes and the express bus routes had little effect on revenue. On the local bus routes, there was an average increase in revenue of 3.1%. On the express bus routes, there was an average decrease in revenue of 0.4%. These changes altered the percentage of the total system revenue attributed to the local bus routes and the express bus routes. The local bus routes generated 76.3% of the revenue in 2009, compared to 75.8% in 2008. The percentage of revenue generated by the express bus routes dropped from 15.7% to 15.2%, and the demand-responsive system generated 8.5% in both 2008 and 2009.

6. Conclusion and Discussion: The total revenue increased from $6.02 million to $6.17 million. The cost of operating a variable fare system is greater than that of operating a fixed fare system; hence, net income probably increased even more (more revenue, lower cost for fare collection), and the decision to modify the fare system seems reasonable. Notice that the entire discussion also is based on the assumption that no other factors changed between 2008 and 2009 that might have affected total revenues. One of the implicit assumptions is that the number of riders remained relatively constant from 1 year to the next. If the ridership had changed, the statistics reported would have to be changed. Using the measure revenue/rider, for example, would help control (or normalize) for the variation in ridership.

7. Applications in Other Areas in Transportation Research: Descriptive statistics are widely used and can convey a great deal of information to a reader. They also can be used to present data in many areas of transportation research, including:
• Transportation Planning: to display public response frequency or percentage to various alternative designs.
• Traffic Operations: to display the frequency or percentage of crashes by route type or by the type of traffic control devices present at an intersection.
• Airport Engineering: to display the arrival pattern of passengers or flights by hour or other time period.
• Public Transit: to display the average load factor on buses by time of day.

Example 3: Environment; Descriptive Statistics

Area: Environment

Method of Analysis: Descriptive statistics (organizing and presenting data to explain current conditions)

1. Research Question/Problem Statement: The planning and programming director in Environmental City wants to determine the current ozone concentration in the city. These data will be compared to data collected after the projects included in the Transportation Improvement Program (TIP) have been completed to determine the effects of these projects on the environment. Because the terrain, the presence of hills or tall buildings, the prevailing wind direction, and the sample station location relative to high volume roads or industrial sites all affect the ozone level, multiple samples are required to determine the ozone concentration level in a city. For this example, air samples are obtained each weekday in the month of July (21 days) at 14 air-sampling stations in the city: 7 in the central city and 7 in the outlying areas of the city. The objective of the analysis is to determine the ozone concentration in the central city, the outlying areas of the city, and the city as a whole.

Question/Issue
Use collected data to describe existing conditions and prepare for future analysis. In this example, air pollution levels in the central city, the outlying areas, and the overall city are to be described.

2. Identification and Description of Variables: The variable to be analyzed is the 8-hour average ozone concentration in parts per million (ppm) at each of the 14 air-sampling stations. The 8-hour average concentration is the basis for the EPA standard, and July is selected because ozone levels are temperature sensitive and increase with a rise in the temperature.

3. Data Collection: Ozone concentrations in ppm are recorded for each hour of the day at each of the 14 air-sampling stations. The highest average concentration for any 8-hour period during the day is recorded and tabulated. This results in 294 concentration observations (14 stations for 21 days). Table 6 and Table 7 show the data for the seven central city locations and the seven outlying area locations.

Day  Station 1  2      3      4      5      6      7      ∑
1    0.079      0.084  0.081  0.083  0.088  0.086  0.089  0.590
2    0.082      0.087  0.088  0.086  0.086  0.087  0.081  0.597
3    0.080      0.081  0.077  0.072  0.084  0.083  0.081  0.558
4    0.083      0.086  0.082  0.079  0.086  0.087  0.089  0.592
5    0.082      0.087  0.080  0.075  0.090  0.089  0.085  0.588
6    0.075      0.084  0.079  0.076  0.080  0.083  0.081  0.558
7    0.078      0.079  0.080  0.074  0.078  0.080  0.075  0.544
8    0.081      0.077  0.082  0.081  0.076  0.079  0.074  0.540
9    0.088      0.084  0.083  0.085  0.083  0.083  0.088  0.594
10   0.085      0.087  0.086  0.089  0.088  0.087  0.090  0.612
11   0.079      0.082  0.082  0.089  0.091  0.089  0.090  0.602
12   0.078      0.080  0.081  0.086  0.088  0.089  0.089  0.591
13   0.081      0.079  0.077  0.083  0.084  0.085  0.087  0.576
14   0.083      0.080  0.079  0.081  0.080  0.082  0.083  0.568
15   0.084      0.083  0.080  0.085  0.082  0.086  0.085  0.585
16   0.086      0.087  0.085  0.087  0.089  0.090  0.089  0.613
17   0.082      0.085  0.083  0.090  0.087  0.088  0.089  0.604
18   0.080      0.081  0.080  0.087  0.085  0.086  0.088  0.587
19   0.080      0.083  0.077  0.083  0.085  0.084  0.087  0.579
20   0.081      0.084  0.079  0.082  0.081  0.083  0.088  0.578
21   0.082      0.084  0.080  0.081  0.082  0.083  0.085  0.577
∑    1.709      1.744  1.701  1.734  1.773  1.789  1.793  12.243
Table 6. Central city 8-hour ozone concentration samples (ppm).

Day  Station 8  9      10     11     12     13     14     ∑
1    0.072      0.074  0.073  0.071  0.079  0.070  0.074  0.513
2    0.074      0.075  0.077  0.075  0.081  0.075  0.077  0.534
3    0.070      0.072  0.074  0.074  0.083  0.078  0.080  0.531
4    0.067      0.070  0.071  0.077  0.080  0.077  0.081  0.523
5    0.064      0.067  0.068  0.072  0.079  0.078  0.079  0.507
6    0.069      0.068  0.066  0.070  0.075  0.079  0.082  0.509
7    0.071      0.069  0.070  0.071  0.074  0.071  0.077  0.503
8    0.073      0.072  0.074  0.072  0.076  0.073  0.078  0.518
9    0.072      0.075  0.077  0.074  0.078  0.074  0.080  0.530
10   0.074      0.077  0.079  0.077  0.080  0.076  0.079  0.542
11   0.070      0.072  0.075  0.074  0.079  0.074  0.078  0.522
12   0.068      0.067  0.068  0.070  0.074  0.070  0.075  0.492
13   0.065      0.063  0.067  0.068  0.072  0.067  0.071  0.473
14   0.063      0.062  0.067  0.069  0.073  0.068  0.073  0.475
15   0.064      0.064  0.066  0.067  0.070  0.066  0.070  0.467
16   0.061      0.059  0.062  0.062  0.067  0.064  0.069  0.434
17   0.065      0.061  0.060  0.064  0.069  0.066  0.073  0.458
18   0.067      0.063  0.065  0.068  0.073  0.069  0.076  0.499
19   0.069      0.067  0.068  0.072  0.077  0.071  0.078  0.502
20   0.071      0.069  0.070  0.074  0.080  0.074  0.077  0.515
21   0.070      0.065  0.072  0.076  0.079  0.073  0.079  0.514
∑    1.439      1.431  1.409  1.497  1.598  1.513  1.606  10.553
Table 7. Outlying area 8-hour ozone concentration samples (ppm).

4. Specification of Analysis Technique and Data Analysis: Much of the data used in analyzing transportation issues has year-to-year, month-to-month, day-to-day, and even hour-to-hour variations. For this reason, making only one observation, or even a few observations, may not accurately describe the phenomenon being observed. Thus, standard practice is to obtain several observations and report the mean value of all observations.

In this example, the phenomenon being observed is the daily ozone concentration at a series of air-sampling locations. The statistic to be estimated is the mean value of this variable over the test period selected. The mean value of any data set ($\bar{x}$) equals the sum of all observations in the set divided by the total number of observations in the set (n):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The variables of interest stated in the research question are the average ozone concentration for the central city, the outlying areas, and the total city. Thus, there are three data sets: the first table, the second table, and the sum of the two tables. The first data set has a sample size of 147; the second data set also has a sample size of 147, and the third data set contains 294 observations.

Using the formula just shown, the mean value of the ozone concentration in the central city is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{147} x_i}{147} = \frac{12.243}{147} = 0.083 \text{ ppm}$$

The mean value of the ozone concentration in the outlying areas of the city is:

$$\bar{x} = \frac{\sum_{i=1}^{147} x_i}{147} = \frac{10.553}{147} = 0.072 \text{ ppm}$$

The mean value of the ozone concentration for the entire city is:

$$\bar{x} = \frac{\sum_{i=1}^{294} x_i}{294} = \frac{22.796}{294} = 0.078 \text{ ppm}$$
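The three area means can be checked with a short script. This is a minimal sketch in Python, assuming the grand totals and column totals printed in Table 6 and Table 7; the function name `mean` and the constant `OBS_PER_AREA` are illustrative, not part of the report.

```python
def mean(total, count):
    """Mean of a data set, given the sum of its observations and their number."""
    return total / count


OBS_PER_AREA = 7 * 21  # 7 stations x 21 weekdays = 147 observations per area

central_total = 12.243   # grand total of Table 6 (ppm)
outlying_total = 10.553  # grand total of Table 7 (ppm)

central_mean = mean(central_total, OBS_PER_AREA)
outlying_mean = mean(outlying_total, OBS_PER_AREA)
city_mean = mean(central_total + outlying_total, 2 * OBS_PER_AREA)

# The same formula applied to one column of Table 6 gives a station mean
# over the 21 sampling days (column totals for Stations 1 through 7).
station_totals = [1.709, 1.744, 1.701, 1.734, 1.773, 1.789, 1.793]
station_means = [mean(total, 21) for total in station_totals]

print(f"Central city mean:  {central_mean:.3f} ppm")
print(f"Outlying area mean: {outlying_mean:.3f} ppm")
print(f"Entire city mean:   {city_mean:.3f} ppm")
print("Central station means:", [f"{m:.3f}" for m in station_means])
```

Working from the printed totals keeps the sketch short; with the full tables loaded into a spreadsheet or array, the same division applies to any row, column, or block of observations.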

Using the same equation, the mean value for each air-sampling location can be found by summing the value of the ozone concentration in the column representing that location and dividing by the 21 observations at that location. For example, considering Sample Station 1, the mean value of the ozone concentration is 1.709/21 = 0.081 ppm.

Similarly, the mean value of the ozone concentrations for any specific day can be found by summing the ozone concentration values in the row representing that day and dividing by the number of stations. For example, for Day 1, the mean value of the ozone concentration in the central city is 0.590/7 = 0.084 ppm. In the outlying areas of the city, it is 0.513/7 = 0.073 ppm, and for the entire city it is 1.103/14 = 0.079 ppm.

The highest and lowest values of the ozone concentration can be obtained by searching the two tables. The highest ozone concentration (0.091 ppm) is logged as having occurred at Station 5 on Day 11. The lowest ozone concentration (0.059 ppm) occurred at Station 9 on Day 16.

The variation by sample location can be illustrated in the form of a frequency diagram. A graph can be used to show the variation in the average ozone concentration for the seven sample stations in the central city (Figure 7). Notice that all of these calculations (and more) can be done very easily if all the data are put in a spreadsheet and various statistical functions used. Graphs and other displays also can be made within the spreadsheet.

5. Interpreting the Results: In this example, the data are not tested to determine whether they fit a known distribution or whether one average value is significantly higher or lower than another. It can only be reported that, as recorded in July, the mean ozone concentration in the central city was greater than the concentration in the outlying areas of the city.
(For testing whether the data fit a known distribution, see Example 4 on fitting distributions and goodness of fit. For comparing mean values, see Examples 5 through 7.) It is known that ozone concentration varies by day and by location of the air-sampling equipment. If there is some threshold value of importance, such as the ozone concentration level considered acceptable by the EPA, these data could be used to determine the number of days that this level was exceeded, or the number of stations that recorded an ozone concentration above this threshold. This is done by comparing each day or each station with the threshold value. It must be noted that, as presented, this example is not a statistical comparison per se (i.e., there has been no significance testing or formal statistical comparison).

Figure 7. Average ozone concentration for seven central city sampling stations (ppm): Station 1, 0.081; Station 2, 0.083; Station 3, 0.081; Station 4, 0.083; Station 5, 0.084; Station 6, 0.085; Station 7, 0.085.

6. Conclusion and Discussion: This example illustrates how to determine and present quantitative information about a data set containing values of a varying parameter. If a similar set of data were captured each month, the variation in ozone concentration could be analyzed to describe the variation over the year. Similarly, if data were captured at these same locations in July of every year, the trend in ozone concentration over time could be determined.

7. Applications in Other Areas in Transportation: These descriptive statistics techniques can be used to present data in other areas of transportation research, such as:
• Traffic Operations/Safety and Transportation Planning
  – to analyze the average speed of vehicles on streets with a speed limit of 45 miles per hour (mph) in residential, commercial, and industrial areas by sampling a number of streets in each of these area types.
  – to examine the average emergency vehicle response time to various areas of the city or county, by analyzing dispatch and arrival times for emergency calls to each area of interest.
• Pavement Engineering: to analyze the average number of potholes per mile on pavement as a function of the age of pavement, by sampling a number of streets where the pavement age falls in discrete categories (0 to 5 years, 5 to 10 years, 10 to 15 years, and greater than 15 years).
• Traffic Safety: to evaluate the average number of crashes per month at intersections with two-way STOP control versus four-way STOP control by sampling a number of intersections in each category over time.
Example 4: Traffic Operations; Goodness of Fit

Area: Traffic operations

Method of Analysis: Goodness of fit (chi-square test; determining if observed distributions of data fit hypothesized standard distributions)

1. Research Question/Problem Statement: A research team is developing a model to estimate travel times of various types of personal travel (modes) on a path shared by bicyclists, in-line skaters, and others. One version of the model relies on the assertion that the distribution of speeds for each mode conforms to the normal distribution. (For a helpful definition of this and other statistical terms, see the glossary in NCHRP Project 20-45, Volume 2, Appendix A.) Based on a literature review, the researchers are sure that bicycle speeds are normally distributed. However, the shapes of the speed distributions for other users are unknown. Thus, the objective is to determine if skater speeds are normally distributed in this instance.

Question/Issue
Do collected data fit a specific type of probability distribution? In this example, do the speeds of in-line skaters on a shared-use path follow a normal distribution (are they normally distributed)?

2. Identification and Description of Variables: The only variable collected is the speed of in-line skaters passing through short sections of the shared-use path.

3. Data Collection: The team collects speeds using a video camera placed where most path users would not notice it. The speed of each free-flowing skater (i.e., each skater who is not closely following another path user) is calculated from the times that the skater passes two benchmarks on the path visible in the camera frame. Several days of data collection allow a large sample of 219 skaters to be measured. (An implicit assumption is made that there is no variation in the data by day.) The data have a familiar bell shape; that is, when graphed, they look like they are normally distributed (Figure 8). Each bar in the figure shows the number of observations per 1.00-mph-wide speed bin. There are 10 observations between 6.00 mph and 6.99 mph.

4. Specification of Analysis Technique and Data Analysis: This analysis involves several preliminary steps followed by two major steps. In the preliminaries, the team calculates the mean and standard deviation from the data sample as 10.17 mph and 2.79 mph, respectively, using standard formulas described in NCHRP Project 20-45, Volume 2, Chapter 6, Section C under the heading “Frequency Distributions, Variance, Standard Deviation, Histograms, and Boxplots.” Then the team forms bins of observations of sufficient size to conduct the analysis. For this analysis, the team forms bins containing at least four observations each, which means forming a bin for speeds of 5 mph and lower and a bin for speeds of 17 mph or higher. There is some argument regarding the minimum allowable cell size. Some analysts argue that the minimum is five; others argue that the cell size can be smaller. Smaller numbers of observations in a bin may distort the results. When in doubt, the analysis can be done with different assumptions regarding the cell size. The left two columns in Table 8 show the data ready for analysis.

The first major step of the analysis is to generate the theoretical normal distribution to compare to the field data. To do this, the team calculates a value of Z, the standard normal variable for each bin i, using the following equation:

$$Z_i = \frac{x - \mu}{\sigma}$$

where x is the speed in miles per hour (mph) corresponding to the bin, µ is the mean speed, and σ is the standard deviation of all of the observations in the speed sample in mph.
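The conversion from a speed to a Z value, and from a pair of Z values to an expected number of observations, can be sketched as follows. This is an illustrative Python example, assuming the sample mean (10.17 mph), standard deviation (2.79 mph), and sample size (219) reported above, with boundaries of 5 mph and 6 mph chosen for illustration. It computes cumulative normal probabilities with `math.erf` instead of looking up areas in a printed table, so results may differ from tabulated values in the last digit; the function names are hypothetical.

```python
import math

MEAN_SPEED = 10.17   # sample mean, mph
STD_SPEED = 2.79     # sample standard deviation, mph
SAMPLE_SIZE = 219    # number of skaters observed


def z_value(speed):
    """Standard normal variable Z = (x - mu) / sigma for a given speed."""
    return (speed - MEAN_SPEED) / STD_SPEED


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_count(lower, upper):
    """Expected observations between two speeds if speeds are normally distributed."""
    area = normal_cdf(z_value(upper)) - normal_cdf(z_value(lower))
    return SAMPLE_SIZE * area


z5 = z_value(5.0)
z6 = z_value(6.0)
expected = expected_count(5.0, 6.0)

print(f"Z at 5 mph: {z5:.2f}")
print(f"Z at 6 mph: {z6:.2f}")
print(f"Expected observations between 5 and 6 mph: {expected:.2f}")
```

Taking the difference of two cumulative probabilities gives the area under the normal curve between the two boundaries; multiplying that area by the sample size converts it to an expected frequency.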
For example (and with reference to the data in Table 8), for a speed of 5 mph the value of Z will be (5 - 10.17)/2.79 = -1.85 and for a speed of 6 mph, the value of Z will be (6 - 10.17)/2.79 = -1.50. The team then consults a table of standard normal values (i.e., NCHRP Project 20-45, Volume 2, Appendix C, Table C-1) to convert these Z values into A values representing the area under the standard normal distribution curve. The A value for a Z of -1.85 is 0.468, while the A value for a Z of -1.50 is 0.432. The difference between these two A values, representing the area under the standard normal probability curve corresponding to the speed of 6 mph, is 0.036 (calculated 0.468 - 0.432 = 0.036). The team multiplies 0.036 by the total sample size (219), to estimate that there should be 7.78 skaters with a speed of 6 mph if the speeds follow the normal distribution. The team follows a similar procedure for all speeds. Notice that the areas under the curve can also be calculated in a simple Excel spreadsheet using the “NORMDIST” function for a given x value and the average speed of 10.17 and standard deviation of 2.79. The values shown in Table 8 have been estimated using the Excel function.

Figure 8. Distribution of observed in-line skater speeds (histogram of the number of observations, 0 to 40, by speed in mph, 1 to 23).

The second major step of the analysis is to use the chi-square test (as described in NCHRP Project 20-45, Volume 2, Chapter 6, Section F) to determine if the theoretical normal distribution is significantly different from the actual data distribution. The team computes a chi-square value for each bin i using the formula:

$$\chi_i^2 = \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the number of actual observations in bin i and $E_i$ is the expected number of observations in bin i estimated by using the theoretical distribution. For the bin of 6 mph speeds, O = 10 (from the table), E = 7.78 (calculated), and the $\chi_i^2$ contribution for that cell is 0.637. The sum of the $\chi_i^2$ values for all bins is 19.519.

The degrees of freedom (df) used for this application of the chi-square test are the number of bins minus 1 minus the number of parameters in the distribution of interest. Given that the normal distribution has two parameters, the mean and the standard deviation (see May, Traffic Flow Fundamentals, 1990, p. 40), in this example the degrees of freedom equal 9 (calculated 12 - 1 - 2 = 9). From a standard table of chi-square values (NCHRP Project 20-45, Volume 2, Appendix C, Table C-2), the team finds that the critical value at the 95% confidence level for this case (with df = 9) is 16.9. The calculated value of the statistic is ~19.5, more than the tabular value. The results of all of these observations and calculations are shown in Table 8.

5. Interpreting the Results: The calculated chi-square value of ~19.5 is greater than the critical chi-square value of 16.9.
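As a check on the arithmetic behind this comparison, the bin-by-bin chi-square contributions can be summed in a few lines. The sketch below is in Python and assumes the observed and predicted counts as printed in Table 8; the critical value of 16.9 (df = 9) is taken from the text, not computed. Note that Table 8 as printed lists 13 bins, whereas the degrees-of-freedom calculation in the text uses 12; with 13 bins the same formula would give df = 10, whose 5% critical value (about 18.3) is also below the computed statistic, so the conclusion would not change.

```python
# Observed counts and counts predicted by the normal distribution, per bin (Table 8).
observed = [6, 10, 18, 24, 37, 38, 24, 21, 15, 13, 4, 4, 5]
expected = [6.98, 7.78, 13.21, 19.78, 26.07, 30.26, 30.93,
            27.85, 22.08, 15.42, 9.48, 5.13, 4.03]

CRITICAL_VALUE = 16.9  # chi-square critical value at the 95% level, df = 9 (from the text)


def chi_square_terms(obs, exp):
    """Per-bin contributions (O - E)^2 / E to the chi-square statistic."""
    return [(o - e) ** 2 / e for o, e in zip(obs, exp)]


terms = chi_square_terms(observed, expected)
chi_square = sum(terms)

# The hypothesis of normality is rejected when the statistic exceeds the critical value.
reject_normality = chi_square > CRITICAL_VALUE

print(f"Chi-square statistic: {chi_square:.2f}")
print(f"Reject normal distribution at the 95% level: {reject_normality}")
```

Because the predicted counts in the table are rounded to two decimals, the sum computed here can differ from the reported 19.519 in the second decimal place.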
The team concludes, therefore, that the normal distribution is significantly different from the distribution of the speed sample at the 95% level (i.e., that the in-line skater speed data do not appear to be normally distributed). Larger variations between the observed and expected distributions lead to higher values of the statistic and would be interpreted as it being less likely that the data are distributed according to the hypothesized distribution. Conversely, smaller variations between observed and expected distributions result in lower values of the statistic, which would suggest that it is more likely that the data are normally distributed because the observed values would fit better with the expected values.

Speed (mph)     Number of Observations  Number Predicted by Normal Distribution  Chi-Square Value
Under 5.99      6                       6.98                                     0.137
6.00 to 6.99    10                      7.78                                     0.637
7.00 to 7.99    18                      13.21                                    1.734
8.00 to 8.99    24                      19.78                                    0.902
9.00 to 9.99    37                      26.07                                    4.585
10.00 to 10.99  38                      30.26                                    1.980
11.00 to 11.99  24                      30.93                                    1.554
12.00 to 12.99  21                      27.85                                    1.685
13.00 to 13.99  15                      22.08                                    2.271
14.00 to 14.99  13                      15.42                                    0.379
15.00 to 15.99  4                       9.48                                     3.169
16.00 to 16.99  4                       5.13                                     0.251
17.00 and over  5                       4.03                                     0.234
Total           219                     219                                      19.519
Table 8. Observations, theoretical predictions, and chi-square values for each bin.

6. Conclusion and Discussion: In this case, the results suggest that the normal distribution is not a good fit to free-flow speeds of in-line skaters on shared-use paths. Interestingly, if the 23 mph observation is considered to be an outlier and discarded, the results of the analysis yield a different conclusion (that the data are normally distributed). Some researchers use a simple rule that an outlier exists if the observation is more than three standard deviations from the mean value. (In this example, the 23 mph observation is, indeed, more than three standard deviations from the mean.) If there is concern with discarding the observation as an outlier, it would be easy enough in this example to repeat the data collection exercise.

Looking at the data plotted above, it is reasonably apparent that the well-known normal distribution should be a good fit (at least without the value of 23). However, the results from the statistical test could not confirm the suspicion. In other cases, the type of distribution may not be so obvious, the distributions in question may be obscure, or some distribution parameters may need to be calibrated for a good fit. In these cases, the statistical test is much more valuable.

The chi-square test also can be used simply to compare two observed distributions to see if they are the same, independent of any underlying probability distribution.
For example, if it is desired to know if the distribution of traffic volume by vehicle type (e.g., automobiles, light trucks, and so on) is the same at two different freeway locations, the two distributions can be compared to see if they are similar.

The consequences of an error in the procedure outlined here can be severe. This is because the distributions chosen as a result of the procedure often become the heart of predictive models used by many other engineers and planners. A poorly-chosen distribution will often provide erroneous predictions for many years to come.

7. Applications in Other Areas of Transportation Research: Fitting distributions to data samples is important in several areas of transportation research, such as:
• Traffic Operations: to analyze shapes of vehicle headway distributions, which are of great interest, especially as a precursor to calibrating and using simulation models.
• Traffic Safety: to analyze collision frequency data. Analysts often assume that the Poisson distribution is a good fit for collision frequency data and must use the method described here to validate the claim.
• Pavement Engineering: to form models of pavement wear or otherwise compare results obtained using different designs, as it is often required to check the distributions of the parameters used (e.g., roughness).

Example 5: Construction; Simple Comparisons to Specified Values

Area: Construction

Method of Analysis: Simple comparisons to specified values: using Student’s t-test to compare the mean value of a small sample to a standard or other requirement (i.e., to a population with a known mean and unknown standard deviation or variance)

1. Research Question/Problem Statement: A contractor wants to determine if a specified soil compaction can be achieved on a segment of the road under construction by using an on-site roller or if a new roller must be brought in.

The cost of obtaining samples for many construction materials and practices is quite high. As a result, decisions often must be made based on a small number of samples. The appropriate statistical technique for comparing the mean value of a small sample with a standard or requirement is Student’s t-test. Formally, the working, or null, hypothesis (Ho) and the alternative hypothesis (Ha) can be stated as follows:
Ho: The soil compaction achieved using the on-site roller (CA) is less than a specified value (CS); that is, (CA < CS).
Ha: The soil compaction achieved using the on-site roller (CA) is greater than or equal to the specified value (CS); that is, (CA ≥ CS).

Question/Issue: Determine whether a sample mean exceeds a specified value. Alternatively, determine the probability of obtaining a sample mean ($\bar{x}$) from a sample of size n, if the universe being sampled has a true mean less than or equal to a population mean with an unknown variance. In this example, is an observed mean of soil compaction samples equal to or greater than a specified value?

2. Identification and Description of Variables: The variable to be used is the soil density results of nuclear densometer tests. These values will be used to determine whether the use of the on-site roller is adequate to meet the contract-specified soil density obtained in the laboratory (Proctor density) of 95%.

3. Data Collection: A 125-foot section of road is constructed and compacted with the on-site roller, and four samples of the soil density are obtained (25 feet, 50 feet, 75 feet, and 100 feet from the beginning of the test section).

4. Specification of Analysis Technique and Data Analysis: For small samples (n < 30) where the population mean is known but the population standard deviation is unknown, it is not appropriate to describe the distribution of the sample mean with a normal distribution.
The appropriate distribution is called Student’s distribution (t-distribution or t-statistic). The equation for Student’s t-statistic is:

$$t = \frac{\bar{x} - \bar{x}'}{S / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\bar{x}'$ is the population mean (or specified standard), S is the sample standard deviation, and n is the sample size. The four nuclear densometer readings were 98%, 97%, 93% and 99%. Then, showing some simple sample calculations,

$$\bar{X} = \frac{\sum_{i=1}^{4} X_i}{4} = \frac{98 + 97 + 93 + 99}{4} = \frac{387}{4} = 96.75\%$$

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{20.75}{3}} = 2.63\%$$

and using the equation for t above,

$$t = \frac{96.75 - 95.00}{2.63 / 2} = \frac{1.75}{1.32} = 1.33$$

The calculated value of the t-statistic (1.33) is most typically compared to the tabularized values of the t-statistic (e.g., NCHRP Project 20-45, Volume 2, Appendix C, Table C-4) for a given significance level (typically called t critical or tcrit). For a sample size of n = 4 having 3 (n - 1) degrees of freedom (df), the values for tcrit are: 1.638 for α = 0.10 and 2.353 for α = 0.05 (two common values of α for testing, the latter being most common).

Important: The specification of the significance level (α level) for testing should be done before actual testing and interpretation of results are done. In many instances, the appropriate level is defined by the agency doing the testing, a specified testing standard, or simply common practice. Generally speaking, selection of a smaller value for α (e.g., α = 0.05 versus α = 0.10) sets a more stringent standard.

In this example, because the calculated value of t (1.33) is less than the critical value (2.353, given α = 0.05), the null hypothesis cannot be rejected. That is, the engineer cannot be confident that the mean value from the densometer tests (96.75%) is greater than the required specification (95%). If a lower confidence level is chosen (e.g., α = 0.15), the value for tcrit would change to 1.250, which means the null hypothesis would be rejected. A lower confidence level can have serious implications. For example, there is an approximately 15% chance that the standard will not be met. That level of risk may or may not be acceptable to the contractor or the agency. Notice that in many standards the required significance level is stated (typically α = 0.05). It should be emphasized that the confidence level should be chosen before calculations and testing are done.
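The calculation above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `one_sample_t` and the dictionary `T_CRIT` are illustrative choices, and the critical values are copied from the text rather than computed.

```python
import math
import statistics

# Nuclear densometer readings (% of Proctor density) quoted in the example.
readings = [98.0, 97.0, 93.0, 99.0]
specified = 95.0  # contract-specified density (%)

# Critical values for df = 3, copied from the text (not computed here).
T_CRIT = {0.10: 1.638, 0.05: 2.353}


def one_sample_t(sample, target):
    """Return (mean, sample standard deviation, t-statistic)."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # uses n - 1 in the denominator
    t = (mean - target) / (s / math.sqrt(n))
    return mean, s, t


mean, s, t_stat = one_sample_t(readings, specified)

alpha = 0.05  # chosen before looking at the data
reject_null = t_stat > T_CRIT[alpha]

print(f"mean = {mean:.2f}%, S = {s:.2f}%, t = {t_stat:.2f}")
print(f"reject Ho at alpha = {alpha}: {reject_null}")
```

With the four readings, the script reports a mean of 96.75%, S of 2.63%, and t of 1.33, so Ho is not rejected at α = 0.05.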
It is not generally permissible to change the confidence level after calculations have been performed. Doing this would be akin to arguing that standards can be relaxed if a test gives an answer that the analyst doesn’t like.

The results of small sample tests often are sensitive to the number of samples that can be obtained at a reasonable cost. (The mean value may change considerably as more data are added.) In this example, if it were possible to obtain nine independent samples (as opposed to four) and the mean value and sample standard deviation were the same as with the four samples, the calculation of the t-statistic would be:

$$t = \frac{96.75 - 95.00}{2.63 / 3} = 1.99$$

Comparing the value of t (with a larger sample size) to the appropriate tcrit (for n - 1 = 8 df and α = 0.05) of 1.860 changes the outcome. That is, the calculated value of the t-statistic is now larger than the tabularized value of tcrit, and the null hypothesis is rejected. Thus, it is accepted that the mean of the densometer readings meets or exceeds the standard. It should be noted, however, that the inclusion of additional tests may yield a different mean value and standard deviation, in which case the results could be different.

5. Interpreting the Results: By themselves, the results of the statistical analysis are insufficient to answer the question as to whether a new roller should be brought to the project site. These results only provide information the contractor can use to make this decision. The ultimate decision should be based on these probabilities and knowledge of the cost of each option. What is the cost of bringing in a new roller now? What is the cost of starting the project and then determining the current roller is not adequate and then bringing in a new roller? Will this decision result in a delay in project completion, and does the contract include an incentive for early completion and/or a penalty for missing the completion date?
If it is possible to conduct additional independent densometer tests, what is the cost of conducting them?
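The sensitivity to sample size discussed above can be illustrated numerically. The sketch below holds the mean and standard deviation fixed at the example’s values (the same hypothetical "what if" used in the text) and compares the resulting t with the α = 0.05 critical values quoted for n = 4 and n = 9. The names `t_for_n` and `T_CRIT_05` are illustrative.

```python
import math

# Summary values from the example, held fixed while n changes (an assumption
# the text itself flags: real additional tests could shift both values).
mean, s, specified = 96.75, 2.63, 95.0

# Critical values at alpha = 0.05 quoted in the text, keyed by sample size n.
T_CRIT_05 = {4: 2.353, 9: 1.860}


def t_for_n(n):
    """t-statistic for a sample of size n with the fixed mean and S."""
    return (mean - specified) / (s / math.sqrt(n))


results = {n: (t_for_n(n), t_for_n(n) > crit) for n, crit in T_CRIT_05.items()}

for n, (t, reject) in results.items():
    print(f"n = {n}: t = {t:.2f}, reject Ho: {reject}")
```

For n = 9 the script gives t of about 2.0, in line with the 1.99 reported above, and the decision flips from not rejecting Ho to rejecting it.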

If there is a severe penalty for missing the deadline (or a significant reward for finishing early), the contractor may be willing to incur the cost of bringing in a new roller rather than accepting a 15% probability of being delayed.

6. Conclusion and Discussion: In some cases the decision about which alternative is preferable can be expressed in the form of a probability (or level of confidence) required to make a decision. The decision criterion is then expressed in a hypothesis and the probability of rejecting that hypothesis. In this example, if the hypothesis to be tested is “Using the on-site roller will provide an average soil density of 95% or higher” and the level of confidence is set at 95%, given a sample of four tests the decision will be to bring in a new roller. However, if nine independent tests could be conducted, the results in this example would lead to a decision to use the on-site roller.

7. Applications in Other Areas in Transportation Research: Simple comparisons to specified values can be used in a variety of areas of transportation research. Some examples include:
• Traffic Operations: to compare the average annual number of crashes at intersections with roundabouts with the average annual number of crashes at signalized intersections.
• Pavement Engineering: to test the compressive strength of concrete slabs.
• Maintenance: to test the results of a proposed new deicer compound.

Example 6: Maintenance; Simple Two-Sample Comparisons
Area: Maintenance
Method of Analysis: Simple two-sample comparisons (t-test for paired comparisons; comparing the mean values of two sets of matched data)

1.
Research Question/Problem Statement: As a part of a quality control and quality assurance (QC/QA) program for highway maintenance and construction, an agency engineer wants to compare and identify discrepancies in the contractor’s testing procedures or equipment in making measurements on materials being used. Specifically, compacted air voids in asphalt mixtures are being measured. In this instance, the agency’s test results need to be compared, one-to-one, with the contractor’s test results. Samples are drawn or made and then literally split and tested, one by the contractor, one by the agency. Then the pairs of measurements are analyzed. A paired t-test will be used to make the comparison. (For another type of two-sample comparison, see Example 7.)

Question/Issue: Use collected data to test if two sets of results are similar. Specifically, do two testing procedures to determine air voids produce the same results?

Stated in formal terms, the null and alternative hypotheses are:
Ho: There is no mean difference in air voids between agency and contractor test results: $H_o: \bar{X}_d = 0$
Ha: There is a mean difference in air voids between agency and contractor test results: $H_a: \bar{X}_d \neq 0$
(For definitions and more discussion about the formulation of formal hypotheses for testing, see NCHRP Project 20-45, Volume 2, Appendix A and Volume 1, Chapter 2, “Hypothesis.”)

2. Identification and Description of Variables: The testing procedure for laboratory-compacted air voids in the asphalt mixture needs to be verified. The split-sample test results for laboratory-compacted air voids are shown in Table 9. Twenty samples are prepared using the same asphalt mixture. Half of the samples are prepared in the agency’s laboratory and the other half in the contractor’s laboratory. Given this arrangement, there are basically two variables of concern: who did the testing and the air void determination.

3. Data Collection: A sufficient quantity of asphalt mix to make 10 lots is produced in an asphalt plant located on a highway project. Each of the 10 lots is collected, split into two samples, and labeled. A sample from each lot, 4 inches in diameter and 2 inches in height, is prepared in the contractor’s laboratory to determine the air voids in the compacted samples. A matched set of samples is prepared in the agency’s laboratory and a similar volumetric procedure is used to determine the agency’s lab-compacted air voids. The lab-compacted air void contents in the asphalt mixture for both the contractor and agency are shown in Table 9.

4. Specification of Analysis Technique and Data Analysis: A paired (two-sided) t-test will be used to determine whether a difference exists between the contractor and agency results. As noted above, in a paired t-test the null hypothesis is that the mean of the differences between each pair of two tests is 0 (there is no difference between the means). The null hypothesis can be expressed as follows:

$$H_o: \bar{X}_d = 0$$

The alternate hypothesis, that the two means are not equal, can be expressed as follows:

$$H_a: \bar{X}_d \neq 0$$

The t-statistic for the paired measurements (i.e., the difference between the split-sample test results) is calculated using the following equation:

$$t = \frac{\bar{X}_d - 0}{s_d / \sqrt{n}}$$

Using the actual data (with the magnitude of the mean difference, 0.88), the value of the t-statistic is calculated as follows:

$$t = \frac{0.88 - 0}{0.7 / \sqrt{10}} = 4$$
Sample       Contractor air voids (%)   Agency air voids (%)   Difference
1            4.37                       4.15                    0.21
2            3.76                       5.39                   -1.63
3            4.10                       4.47                   -0.37
4            4.39                       4.52                   -0.13
5            4.06                       5.36                   -1.29
6            4.14                       5.01                   -0.87
7            3.92                       5.23                   -1.30
8            3.38                       4.97                   -1.60
9            4.12                       4.37                   -0.25
10           3.68                       5.29                   -1.61
Mean (X̄)     3.99                       4.88                   X̄d = -0.88
Std. dev. (S) 0.31                      0.46                   sd = 0.70

Table 9. Laboratory-compacted air voids in split samples.
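As a check on the paired calculation, the following sketch recomputes the differences and the t-statistic from the Table 9 values using only the Python standard library. Variable names are illustrative, and the critical value is the t(0.025, 9) = 2.262 quoted in the example. Differences computed from the rounded table entries can differ from the printed Difference column in the last digit, presumably because the report worked from unrounded data.

```python
import math
import statistics

# Air voids (%) from Table 9, in sample order 1..10.
contractor = [4.37, 3.76, 4.10, 4.39, 4.06, 4.14, 3.92, 3.38, 4.12, 3.68]
agency = [4.15, 5.39, 4.47, 4.52, 5.36, 5.01, 5.23, 4.97, 4.37, 5.29]

# Paired differences (contractor minus agency), one per split sample.
diffs = [c - a for c, a in zip(contractor, agency)]
n = len(diffs)

mean_d = statistics.mean(diffs)
s_d = statistics.stdev(diffs)  # sample standard deviation of the differences
t_stat = (mean_d - 0.0) / (s_d / math.sqrt(n))

T_CRIT = 2.262  # two-sided, alpha = 0.05, df = 9 (copied from the text)
reject_null = abs(t_stat) > T_CRIT

print(f"mean difference = {mean_d:.2f}, s_d = {s_d:.2f}, |t| = {abs(t_stat):.1f}")
print(f"reject Ho: {reject_null}")
```

The sign of t is negative here because the agency values are generally higher; the two-sided test compares its magnitude with the critical value.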

For n - 1 (10 - 1 = 9) degrees of freedom and α = 0.05, the tcrit value can be looked up using a t-table (e.g., NCHRP Project 20-45, Volume 2, Appendix C, Table C-4):

$$t_{0.025,\,9} = 2.262$$

For a more detailed description of the t-statistic, see the glossary in NCHRP Project 20-45, Volume 2, Appendix A.

5. Interpreting the Results: Given that t = 4 > t0.025, 9 = 2.262, the engineer would reject the null hypothesis and conclude that the results of the paired tests are different. This means that the contractor and agency test results from paired measurements indicate that the test method, technicians, and/or test equipment are not providing similar results. Notice that the engineer cannot conclude anything about the material or production variation or what has caused the differences to occur.

6. Conclusion and Discussion: The results of the test indicate that a statistically significant difference exists between the test results from the two groups. When making such comparisons, it is important that random sampling be used when obtaining the samples. Also, because sources of variability influence the population parameters, the two sets of test results must have been sampled over the same time period, and the same sampling and testing procedures must have been used. It is best if one sample is drawn and then literally split in two, then another sample drawn, and so on.

The identification of a difference is just that: notice that a difference exists. The reason for the difference must still be determined. A common misinterpretation is that the result of the t-test provides the probability of the null hypothesis being true. Another way to look at the t-test result in this example is to conclude that some alternative hypothesis provides a better description of the data. The result does not, however, indicate that the alternative hypothesis is true.
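One way to gauge the size of the difference found in this example, rather than only its statistical significance, is a confidence interval for the mean paired difference. The sketch below is an illustration, not part of the original analysis; it uses the summary values from Table 9 and the critical value t(0.025, 9) = 2.262 quoted in the example.

```python
import math

# Summary statistics for the paired differences, taken from Table 9.
mean_d, s_d, n = -0.88, 0.70, 10
T_CRIT = 2.262  # t(0.025, 9), copied from the text

# Half-width of the 95% confidence interval for the mean difference.
half_width = T_CRIT * s_d / math.sqrt(n)
ci_low, ci_high = mean_d - half_width, mean_d + half_width

print(f"95% CI for mean difference: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval, roughly -1.38 to -0.38 percentage points of air voids, excludes zero, which agrees with the rejection of the null hypothesis and also shows how large the discrepancy may plausibly be.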
To ensure practical significance, it is necessary to assess the magnitude of the difference being tested. This can be done by computing confidence intervals, which are used to quantify the range of effect size and are often more useful than simple hypothesis testing. Failure to reject a hypothesis also provides important information. Possible explanations include: occurrence of a type-II error (erroneous acceptance of the null hypothesis); small sample size; difference too small to detect; expected difference did not occur in data; there is no difference/effect. Proper experiment design and data collection can minimize the impact of some of these issues. (For a more comprehensive discussion of this topic, see NCHRP Project 20-45, Volume 2, Chapter 1.)

7. Applications in Other Areas of Transportation Research: The application of the t-test to compare two mean values in other areas of transportation research may include:
• Traffic Operations: to evaluate average delay in bus arrivals at various bus stops.
• Traffic Operations/Safety: to determine the effect of two enforcement methods on reduction in a particular traffic violation.
• Pavement Engineering: to investigate average performance of two pavement sections.
• Environment: to compare average vehicular emissions at two locations in a city.

Example 7: Materials; Simple Two-Sample Comparisons
Area: Materials
Method of Analysis: Simple two-sample comparisons (using the t-test to compare the mean values of two samples and the F-test for comparing variances)

1. Research Question/Problem Statement: As a part of dispute resolution during quality control and quality assurance, a highway agency engineer wants to validate a contractor’s test results concerning asphalt content. In this example, the engineer wants to compare the results

of two sets of tests: one from the contractor and one from the agency. Formally, the (null) hypothesis to be tested, Ho, is that the contractor’s tests and the agency’s tests are from the same population. In other words, the null hypothesis is that the means of the two data sets will be equal, as will the standard deviations. Notice that in the latter instance the variances are actually being compared.

Test results were also compared in Example 6. In that example, the comparison was based on split samples. The same test specimens were tested by two different analysts using different equipment to see if the same results could be obtained by both. The major difference between Example 6 and Example 7 is that, in this example, the two samples are randomly selected from the same pavement section.

Question/Issue: Use collected data to test if two measured mean values are the same. In this instance, are two mean values of asphalt content the same?

Stated in formal terms, the null and alternative hypotheses can be expressed as follows:
Ho: There is no difference in asphalt content between agency and contractor test results: $H_o: (\mu_c - \mu_a) = 0$
Ha: There is a difference in asphalt content between agency and contractor test results: $H_a: (\mu_c - \mu_a) \neq 0$

2. Identification and Description of Variables: The contractor runs 12 asphalt content tests and the agency engineer runs 6 asphalt content tests over the same period of time, using the same random sampling and testing procedures. The question is whether it is likely that the tests have come from the same population based on their variability.

3. Data Collection: If the agency’s objective is simply to identify discrepancies in the testing procedures or equipment, then verification testing should be done on split samples (as in Example 6). Using split samples, the difference in the measured variable can more easily be attributed to testing procedures.
A paired t-test should be used. (For more information, see NCHRP Project 20-45, Volume 2, Chapter 4, Section A, “Analysis of Variance Methodology.”) A split sample occurs when a physical sample (of whatever is being tested) is drawn and then literally split into two testable samples.

On the other hand, if the agency’s objective is to identify discrepancies in the overall material, process, sampling, and testing processes, then validation testing should be done on independent samples. Notice the use of these terms. It is important to distinguish between testing to verify only the testing process (verification) versus testing to compare the overall production, sampling, and testing processes (validation). If independent samples are used, the agency test results still can be compared with contractor test results (using a simple t-test for comparing two means). If the test results are consistent, then the agency and contractor tests can be combined for contract compliance determination.

4. Specification of Analysis Technique and Data Analysis: When comparing the two data sets, it is important to compare both the means and the variances because the assumption when using the t-test requires equal variances for each of the two groups. A different test is used in each instance. The F-test provides a method for comparing the variances (the standard deviation squared) of two sets of data. Differences in means are assessed by the t-test. Generally, construction processes and material properties are assumed to follow a normal distribution.

In this example, a normal distribution is assumed. (The assumption of normality also can be tested, as in Example 4.) The ratios of variances follow an F-distribution, while the means of relatively small samples follow a t-distribution. Using these distributions, hypothesis tests can be conducted using the same concepts that have been discussed in prior examples. (For more information about the F-test and the t-distribution, see NCHRP Project 20-45, Volume 2, Chapter 4, Section A, “Compute the F-ratio Test Statistic.” For more information about the t-distribution, see NCHRP Project 20-45, Volume 2, Chapter 4, Section A.)

For samples from the same normal population, the statistic F (the ratio of the two-sample variances) has a sampling distribution called the F-distribution. For validation and verification testing, the F-test is based on the ratio of the sample variance of the contractor’s test results ($s_c^2$) and the sample variance of the agency’s test results ($s_a^2$). Similarly, the t-test can be used to test whether the sample mean of the contractor’s tests, $\bar{X}_c$, and the agency’s tests, $\bar{X}_a$, came from populations with the same mean.

Consider the asphalt content test results from the contractor samples and agency samples (Table 10). In this instance, the F-test is used to determine whether the variance observed for the contractor’s tests differs from the variance observed for the agency’s tests.

Using the F-test

Step 1. Compute the variance ($s^2$), for each set of tests: $s_c^2 = 0.064$ and $s_a^2 = 0.092$. As an example, $s_c^2$ can be calculated as:

$$s_c^2 = \frac{\sum_i (x_i - \bar{X}_c)^2}{n - 1} = \frac{(6.4 - 6.1)^2}{11} + \frac{(6.2 - 6.1)^2}{11} + \cdots + \frac{(6.0 - 6.1)^2}{11} + \frac{(5.7 - 6.1)^2}{11} = 0.0645$$

Step 2. Compute

$$F_{calc} = \frac{s_a^2}{s_c^2} = \frac{0.092}{0.064} = 1.43$$
Sample   Contractor Samples   Agency Samples
1        6.4                  5.4
2        6.2                  5.8
3        6.0                  6.2
4        6.6                  5.4
5        6.1                  5.6
6        6.0                  5.8
7        6.3
8        6.1
9        5.9
10       5.8
11       6.0
12       5.7

Descriptive Statistics   Contractor      Agency
Mean                     X̄c = 6.1        X̄a = 5.7
Variance                 sc² = 0.064     sa² = 0.092
Standard deviation       sc = 0.25       sa = 0.30
Sample size              nc = 12         na = 6

Table 10. Asphalt content test results from independent samples.
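Steps 1 and 2 of the F-test can be reproduced from the Table 10 data. The sketch below uses only the Python standard library; the variable names are illustrative, and the critical value F = 5.32 (α = 0.01, 5 and 11 degrees of freedom) is copied from the example rather than computed.

```python
import statistics

# Asphalt content results from Table 10.
contractor = [6.4, 6.2, 6.0, 6.6, 6.1, 6.0, 6.3, 6.1, 5.9, 5.8, 6.0, 5.7]
agency = [5.4, 5.8, 6.2, 5.4, 5.6, 5.8]

# Sample variances (n - 1 in the denominator).
var_c = statistics.variance(contractor)
var_a = statistics.variance(agency)

# Ratio of agency variance to contractor variance, as in the example.
f_calc = var_a / var_c

F_CRIT = 5.32  # alpha = 0.01, df = (5, 11), copied from the text
variances_differ = f_calc > F_CRIT

print(f"s_c^2 = {var_c:.3f}, s_a^2 = {var_a:.3f}, F = {f_calc:.2f}")
print(f"evidence of unequal variances: {variances_differ}")
```

Because the computed ratio is well below the critical value, the equal-variance form of the t-test is a reasonable next step.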

Step 3. Determine Fcrit from the F-distribution table, making sure to use the correct degrees of freedom (df) for the numerator (the number of observations minus 1, or na - 1 = 6 - 1 = 5) and the denominator (nc - 1 = 12 - 1 = 11). For α = 0.01, Fcrit = 5.32. The critical F-value can be found from tables (see NCHRP Project 20-45, Volume 2, Appendix C, Table C-5). Read the F-value for 1 - α = 0.99, numerator and denominator degrees of freedom 5 and 11, respectively. Interpolation can be used if exact degrees of freedom are not available in the table. Alternatively, a statistical function in Microsoft Excel™ can be used to determine the F-value.

Step 4. Compare the two values to determine if Fcalc < Fcrit. If Fcalc < Fcrit is true, then the variances can be treated as equal; if not, they are treated as unequal. In this example, Fcalc (1.43) is, in fact, less than Fcrit (5.32) and, thus, there is no evidence of unequal variances. Given this result, the t-test for the case of equal variances is used to determine whether to declare that the mean of the contractor’s tests differs from the mean of the agency’s tests.

Using the t-test

Step 1. Compute the sample means ($\bar{X}$) for each set of tests: $\bar{X}_c = 6.1$ and $\bar{X}_a = 5.7$.

Step 2. Compute the pooled variance $s_p^2$ from the individual sample variances:

$$s_p^2 = \frac{s_c^2 (n_c - 1) + s_a^2 (n_a - 1)}{n_c + n_a - 2} = \frac{0.064 (12 - 1) + 0.092 (6 - 1)}{12 + 6 - 2} = 0.0731$$

Step 3. Compute the t-statistic using the following equation for equal variance:

$$t = \frac{\bar{X}_c - \bar{X}_a}{\sqrt{\dfrac{s_p^2}{n_c} + \dfrac{s_p^2}{n_a}}} = \frac{6.1 - 5.7}{\sqrt{\dfrac{0.0731}{12} + \dfrac{0.0731}{6}}} = 2.9$$

$$t_{0.005,\,16} = 2.921$$

(For more information, see NCHRP Project 20-45, Volume 2, Appendix C, Table C-4 for A = 1 - α/2 and v = 16.)

5. Interpreting the Results: Given that F < Fcrit (i.e., 1.43 < 5.32), there is no reason to believe that the two sets of data have different variances. That is, they could have come from the same population.
Therefore, the t-test can be used to compare the means using equal variance. Because t < tcrit (i.e., 2.9 < 2.921), the engineer does not reject the null hypothesis and, thus, assumes that the sample means are equal. The final conclusion is that it is likely that the contractor and agency test results represent the same process. In other words, with a 99% confidence level, it can be said that the agency’s test results are not different from the contractor’s and therefore validate the contractor tests.

6. Conclusion and Discussion: The simple t-test can be used to validate the contractor’s test results by conducting independent sampling from the same pavement at the same time. Before conducting a formal t-test to compare the sample means, the assumption of equal variances needs to be evaluated. This can be accomplished by comparing sample variances using the F-test. The interpretation of results will be misleading if the equal variance assumption is not validated. If the variances of two populations being compared for their means are different, the mean comparison will reflect the difference between two separate populations. Finally, based on the comparison of means, one can conclude that the construction materials have consistent properties as validated by two independent sources (contractor and agency). This sort of comparison is developed further in Example 8, which illustrates tests for the equality of more than two mean values.
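The pooled-variance t-test described in this example can be sketched as follows, again with only the Python standard library. Names are illustrative, and the critical value t(0.005, 16) = 2.921 is copied from the example. The script works from the raw Table 10 data, so intermediate values can differ slightly from those obtained with the rounded means 6.1 and 5.7.

```python
import math
import statistics

# Asphalt content results from Table 10.
contractor = [6.4, 6.2, 6.0, 6.6, 6.1, 6.0, 6.3, 6.1, 5.9, 5.8, 6.0, 5.7]
agency = [5.4, 5.8, 6.2, 5.4, 5.6, 5.8]

n_c, n_a = len(contractor), len(agency)
var_c = statistics.variance(contractor)
var_a = statistics.variance(agency)

# Pooled variance: each sample variance weighted by its degrees of freedom.
pooled_var = (var_c * (n_c - 1) + var_a * (n_a - 1)) / (n_c + n_a - 2)

# t-statistic for two independent samples under the equal-variance assumption.
mean_diff = statistics.mean(contractor) - statistics.mean(agency)
t_stat = mean_diff / math.sqrt(pooled_var / n_c + pooled_var / n_a)

T_CRIT = 2.921  # two-sided, alpha = 0.01, df = 16 (copied from the text)
reject_null = abs(t_stat) > T_CRIT

print(f"pooled variance = {pooled_var:.4f}, t = {t_stat:.1f}")
print(f"reject Ho: {reject_null}")
```

The computed t of about 2.9 falls just short of the critical value, which is why the example does not reject the null hypothesis at the 99% confidence level.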

7. Applications in Other Areas of Transportation Research: The simple t-test can be used to compare means of two independent samples. Applications for this method in other areas of transportation research may include:
• Traffic Operations
  – to compare average speeds at two locations along a route.
  – to evaluate average delay times at two intersections in an urban area.
• Pavement Engineering: to investigate the difference in average performance of two pavement sections.
• Maintenance: to determine the effects of two maintenance treatments on average life extension of two pavement sections.

Example 8: Laboratory Testing/Instrumentation; Simple Analysis of Variance (ANOVA)
Area: Laboratory testing and/or instrumentation
Method of Analysis: Simple analysis of variance (ANOVA) comparing the mean values of more than two samples and using the F-test

1. Research Question/Problem Statement: An engineer wants to test and compare the compressive strength of five different concrete mix designs that vary in coarse aggregate type, gradation, and water/cement ratio. An experiment is conducted in a laboratory where five different concrete mixes are produced based on given specifications, and tested for compressive strength using the ASTM International standard procedures. In this example, the comparison involves inference on parameters from more than two populations. The purpose of the analysis, in other words, is to test whether all mix designs are similar to each other in mean compressive strength or whether some differences actually exist. ANOVA is the statistical procedure used to test the basic hypothesis illustrated in this example.

Question/Issue: Compare the means of more than two samples. In this instance, compare the compressive strengths of five concrete mix designs with different combinations of aggregates, gradation, and water/cement ratio.
More formally, test the following hypotheses:
Ho: There is no difference in mean compressive strength for the various (five) concrete mix types.
Ha: At least one of the concrete mix types has a different compressive strength.

2. Identification and Description of Variables: In this experiment, the factor of interest (independent variable) is the concrete mix design, which has five levels based on different coarse aggregate types, gradation, and water/cement ratios (denoted by t and labeled A through E in Table 11). Compressive strength is a continuous response (dependent) variable, measured in pounds per square inch (psi) for each specimen. Because only one factor is of interest in this experiment, the statistical method illustrated is often called a one-way ANOVA or simple ANOVA.

3. Data Collection: For each of the five mix designs, three replicates each of cylinders 4 inches in diameter and 8 inches in height are made and cured for 28 days. After 28 days, all 15 specimens are tested for compressive strength using the standard ASTM International test. The compressive strength data and summary statistics are provided for each mix design in Table 11. In this example, resource constraints have limited the number of replicates for each mix design to

three. (For a discussion on sample size determination based on statistical power requirements, see NCHRP Project 20-45, Volume 2, Chapter 1, “Sample Size Determination.”)

4. Specification of Analysis Technique and Data Analysis: To perform a one-way ANOVA, preliminary calculations are carried out to compute the overall mean ($\bar{y}_{..}$), the sample means ($\bar{y}_{i.}$), and the sample variances ($s_i^2$) given the total sample size (nT = 15) as shown in Table 11. The basic strategy for ANOVA is to compare the variance between levels or groups (specifically, the variation between sample means) to the variance within levels. This comparison is used to determine if the levels explain a significant portion of the variance. (Details for performing a one-way ANOVA are given in NCHRP Project 20-45, Volume 2, Chapter 4, Section A, “Analysis of Variance Methodology.”)

ANOVA is based on partitioning of the total sum of squares (TSS, a measure of overall variability) into within-level and between-levels components. The TSS is defined as the sum of the squares of the differences of each observation ($y_{ij}$) from the overall mean ($\bar{y}_{..}$). The TSS, between-levels sum of squares (SSB), and within-level sum of squares (SSE) are computed as follows.

$$TSS = \sum_{i,j} (y_{ij} - \bar{y}_{..})^2 = 4839620.90$$

$$SSB = \sum_{i,j} (\bar{y}_{i.} - \bar{y}_{..})^2 = 4331513.60$$

$$SSE = \sum_{i,j} (y_{ij} - \bar{y}_{i.})^2 = 508107.30$$

The next step is to compute the between-levels mean square (MSB) and within-levels mean square (MSE) based on respective degrees of freedom (df). The total degrees of freedom (dfT), between-levels degrees of freedom (dfB), and within-levels degrees of freedom (dfE) for one-way ANOVA are computed as follows:

$$df_T = n_T - 1 = 15 - 1 = 14$$

$$df_B = t - 1 = 5 - 1 = 4$$

$$df_E = n_T - t = 15 - 5 = 10$$

where nT = the total sample size and t = the total number of levels or groups. The next step of the ANOVA procedure is to compute the F-statistic.
Replicate            Mix Design A    Mix Design B    Mix Design C    Mix Design D    Mix Design E
1                    y11 = 5416      y21 = 5292      y31 = 4097      y41 = 5056      y51 = 4165
2                    y12 = 5125      y22 = 4779      y32 = 3695      y42 = 5216      y52 = 3849
3                    y13 = 4847      y23 = 4824      y33 = 4109      y43 = 5235      y53 = 4089
Mean                 ȳ1. = 5129      ȳ2. = 4965      ȳ3. = 3967      ȳ4. = 5169      ȳ5. = 4034
Standard deviation   s1 = 284.52     s2 = 284.08     s3 = 235.64     s4 = 98.32      s5 = 164.94
Overall mean         ȳ.. = 4653

Table 11. Concrete compressive strength (psi) after 28 days.

The F-statistic is the ratio of two variances: the variance due to interaction between the levels, and the variance due to differences within the levels. Under the null hypothesis, the between-levels mean square (MSB) and within-levels mean square (MSE) provide two independent estimates of the variance. If the means for different levels of mix design are truly different from each other, the MSB will tend

to be larger than the MSE, such that it will be more likely to reject the null hypothesis. For this example, the calculations for MSB, MSE, and F are as follows:

$$MSB = \frac{SSB}{df_B} = 1082878.40$$

$$MSE = \frac{SSE}{df_E} = 50810.70$$

$$F = \frac{MSB}{MSE} = 21.31$$

If there are no effects due to level, the F-statistic will tend to be smaller. If there are effects due to level, the F-statistic will tend to be larger, as is the case in this example. ANOVA computations usually are summarized in the form of a table. Table 12 summarizes the computations for this example.

The final step is to determine Fcrit from the F-distribution table (e.g., NCHRP Project 20-45, Volume 2, Appendix C, Table C-5) with t - 1 (5 - 1 = 4) degrees of freedom for the numerator and nT - t (15 - 5 = 10) degrees of freedom for the denominator. For a significance level of α = 0.01, Fcrit is found (in Table C-5) to be 5.99. Given that F > Fcrit (21.31 > 5.99), the null hypothesis that all mix designs have equal compressive strength is rejected, supporting the conclusion that at least two mix designs are different from each other in their mean effect.

Table 12 also shows the p-value calculated using a computer program. The p-value is the probability that a sample would result in a statistic value at least as extreme as the one observed if the null hypothesis were true. The p-value of 0.0000698408 is well below the chosen significance level of 0.01.

5. Interpreting the Results: The ANOVA results in rejection of the null hypothesis at α = 0.01. That is, the mean values are judged to be statistically different. However, the ANOVA result does not indicate where the difference lies. For example, does the compressive strength of mix design A differ from that of mix design C or D?
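The one-way ANOVA computations for this example can be reproduced from the Table 11 data. The sketch below uses only the Python standard library and applies to balanced or unbalanced groups for this single-factor case; the variable names are illustrative, and the critical value Fcrit = 5.99 is copied from the example rather than computed.

```python
import statistics

# Compressive strength (psi) after 28 days, from Table 11.
strength = {
    "A": [5416, 5125, 4847],
    "B": [5292, 4779, 4824],
    "C": [4097, 3695, 4109],
    "D": [5056, 5216, 5235],
    "E": [4165, 3849, 4089],
}

all_obs = [y for obs in strength.values() for y in obs]
grand_mean = statistics.mean(all_obs)
n_total, n_levels = len(all_obs), len(strength)

# Between-levels sum of squares: group-mean deviations, once per observation.
ssb = sum(len(obs) * (statistics.mean(obs) - grand_mean) ** 2
          for obs in strength.values())
# Within-levels sum of squares: deviations from each group's own mean.
sse = sum((y - statistics.mean(obs)) ** 2
          for obs in strength.values() for y in obs)
tss = ssb + sse

msb = ssb / (n_levels - 1)        # df_B = t - 1
mse = sse / (n_total - n_levels)  # df_E = n_T - t
f_stat = msb / mse
r_squared = ssb / tss             # share of total variation explained by mix

F_CRIT = 5.99  # alpha = 0.01, df = (4, 10), copied from the text
print(f"SSB = {ssb:.1f}, SSE = {sse:.1f}, TSS = {tss:.1f}")
print(f"F = {f_stat:.2f}, R^2 = {r_squared:.2f}, reject Ho: {f_stat > F_CRIT}")
```

The script reproduces the sums of squares and the F-statistic of 21.31 reported for this example.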
To carry out multiple mean comparisons of this kind (for example, mix design A versus C or D), the analyst must control the experiment-wise error rate (EER) by employing more conservative methods such as Tukey’s test, Bonferroni’s test, or Scheffé’s test, as appropriate. (Details for ANOVA are given in NCHRP Project 20-45, Volume 2, Chapter 4, Section A, “Analysis of Variance Methodology.”)

The coefficient of determination (R²) provides a rough indication of how well the statistical model fits the data. For this example, R² is calculated as follows:

$$R^2 = \frac{\mathrm{SSB}}{\mathrm{TSS}} = \frac{4331513.60}{4839620.90} = 0.90$$

For this example, R² indicates that the one-way ANOVA classification model accounts for 90% of the total variation in the data. In the controlled laboratory experiment demonstrated in this example, R² = 0.90 indicates a fairly acceptable fit of the statistical model to the data.

Table 12. ANOVA results.
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F | Probability > F (Significance)
Between | 4331513.60 | 4 | 1082878.40 | 21.31 | 0.0000698408
Within | 508107.30 | 10 | 50810.70 | |
Total | 4839620.90 | 14 | | |

6. Conclusion and Discussion: This example illustrates a simple one-way ANOVA where inference regarding parameters (mean values) from more than two populations or treatments was desired. The focus of computations was the construction of the ANOVA table. Before proceeding with ANOVA, however, an analyst must verify that the assumptions of common variance and data normality are satisfied within each group/level. The results do not establish the cause of difference in compressive strength between mix designs in any way. The experimental setup and analytical procedure shown in this example may be used to test other properties of mix designs such as flexure strength. If another factor (for example, water/cement ratio with levels low or high) is added to the analysis, the classification will become a two-way ANOVA. (In this report, two-way ANOVA is demonstrated in Example 11.) Notice that the equations shown in Example 8 may only be used for one-way ANOVA for balanced designs, meaning that in this experiment there are equal numbers of replicates for each level within a factor. (For a discussion of computations on unbalanced designs and multifactor designs, see NCHRP Project 20-45.)

7. Applications in Other Areas of Transportation Research: Examples of applications of one-way ANOVA in other areas of transportation research include:
• Traffic Operations: to determine the effect of various traffic calming devices on average speeds in residential areas.
• Traffic Operations/Safety: to study the effect of weather conditions on accidents in a given time period.
• Work Zones: to compare the effect of different placements of work zone signs on reduction in highway speeds at some downstream point.
• Materials: to investigate the effect of recycled aggregates on compressive and flexural strength of concrete.

Example 9: Materials; Simple Analysis of Variance (ANOVA)
Area: Materials
Method of Analysis: Simple analysis of variance (ANOVA) comparing more than two mean values and using the F-test for equality of means

1. Research Question/Problem Statement: To illustrate how increasingly detailed analysis may be appropriate, Example 9 is an extension of the two-sample comparison presented in Example 7. As a part of dispute resolution during quality control and quality assurance, let’s say the highway agency engineer from Example 7 decides to reconfirm the contractor’s test results for asphalt content. The agency hires an independent consultant to verify both the contractor- and agency-measured asphalt contents. It now becomes necessary to compare more than two mean values. A simple one-way analysis of variance (ANOVA) can be used to analyze the asphalt contents measured by three different parties.

Question/Issue: Extend a comparison of two mean values to compare three (or more) mean values. Specifically, use data collected by several (>2) different parties to see if the results (mean values) are the same. Formally, test the following null (Ho) and alternative (Ha) hypotheses:
Ho: There is no difference in asphalt content among the three different parties: $\mu_{\text{contractor}} = \mu_{\text{agency}} = \mu_{\text{consultant}}$
Ha: At least one of the parties has a different measured asphalt content.

2. Identification and Description of Variables: The independent consultant runs 12 additional asphalt content tests by taking independent samples from the same pavement section as the agency and contractor. The question is whether it is likely that the tests came from the same population, based on their variability.

3. Data Collection: The descriptive statistics (mean, standard deviation, and sample size) for the asphalt content data collected by the three parties are shown in Table 13. Notice that the contractor and the independent consultant have each taken 12 measurements, while the agency has taken only six. The data for the contractor and the agency are the same as presented in Example 7. For brevity, the consultant’s raw observations are not repeated here. The mean value and standard deviation for the consultant’s data are calculated using the same formulas and equations that were used in Example 7.

4. Specification of Analysis Technique and Data Analysis: The agency engineer can use one-way ANOVA to resolve this question. (Details for one-way ANOVA are available in NCHRP Project 20-45, Volume 2, Chapter 4, Section A, “Analysis of Variance Methodology.”) The objective of the ANOVA is to determine whether the variance observed in the dependent variable (in this case, asphalt content) is due to the differences among the samples (different from one party to another) or due to the differences within the samples. ANOVA is basically an extension of two-sample comparisons to cases when three or more samples are being compared. More formally, the engineer is testing whether the between-sample variability is large relative to the within-sample variability, as stated in the formal hypothesis. This type of comparison also may be referred to as between-groups versus within-groups variance.
Rejection of the null hypothesis (that the mean values are the same) gives the engineer some information concerning differences among the population means; however, it does not indicate which means actually differ from each other. Rejection of the null hypothesis tells the engineer that differences exist, but it does not specify that $\bar{X}_1$ differs from $\bar{X}_2$ or from $\bar{X}_3$. To control the experiment-wise error rate (EER) for multiple mean comparisons, a conservative test, Tukey’s procedure for unplanned comparisons, can be used. (Information about Tukey’s procedure can be found in almost any good statistics textbook, such as those by Freund and Wilson [2003] and Kutner et al. [2005].) The F-statistic calculated for determining the effect of who (agency, contractor, or consultant) measured the asphalt content is given in Table 14. (See Example 8 for a more detailed discussion of the calculations necessary to create Table 14.)

Table 13. Asphalt content data summary.
Party Type | Mean asphalt content, % ($\bar{X}_i$) | Standard deviation ($s_i$) | Sample size ($n_i$)
Contractor | 6.1 | 0.254 | 12
Agency | 5.7 | 0.303 | 6
Consultant | 5.12 | 0.186 | 12

Although the ANOVA results reveal whether there are overall differences, it is always good practice to visually examine the data. For example, Figure 9 shows the mean and associated 95% confidence intervals (CI) of the mean asphalt content measured by each of the three parties involved in the testing.

5. Interpreting the Results: A simple one-way ANOVA is conducted to determine whether there is a difference in mean asphalt content as measured by the three different parties. The analysis shows that the F-statistic is significant (p-value < 0.05), meaning that at least two of the means are significantly different from each other. The engineer can use Tukey’s procedure for comparisons of multiple means, or he or she can observe the plotted 95% confidence intervals to figure out which means are actually (and significantly) different from each other (see Figure 9). Because their confidence intervals overlap, the plot indicates that the asphalt contents measured by the contractor and the agency are only somewhat different. (These same conclusions were obtained in Example 7.) However, the mean asphalt content obtained by the consultant is significantly different from (and lower than) that obtained by both of the other parties. This is evident because the confidence interval for the consultant does not overlap with the confidence interval of either of the other two parties.

Table 14. ANOVA results.
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F | Significance
Between groups | 5.6 | 2 | 2.8 | 49.1 | 0.000
Within groups | 1.5 | 27 | 0.06 | |
Total | 7.2 | 29 | | |

Figure 9. Mean and confidence intervals for asphalt content data.
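Because the consultant’s raw observations are not listed, the ANOVA for this example can only be approximated here from the summary statistics in Table 13. The sketch below shows how the sums of squares follow from group means, standard deviations, and sample sizes in an unbalanced design. The names are illustrative assumptions, and because the inputs are rounded, the output should be expected to land near, not exactly on, the values in Table 14.

```python
# One-way ANOVA from group summaries (mean, standard deviation, n), Table 13.
# Names are illustrative. Inputs are rounded, so results only approximate Table 14.

summaries = {
    "contractor": (6.1, 0.254, 12),
    "agency": (5.7, 0.303, 6),
    "consultant": (5.12, 0.186, 12),
}


def anova_from_summaries(groups):
    """Return (SSB, SSE, df_between, df_within, F) for possibly unequal n."""
    total_n = sum(n for _, _, n in groups.values())
    # Weighted grand mean, since the groups have different sample sizes.
    grand_mean = sum(mean * n for mean, _, n in groups.values()) / total_n

    ssb = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups.values())
    # Each sample variance s^2 equals its within-group SS divided by (n - 1).
    sse = sum((n - 1) * sd ** 2 for _, sd, n in groups.values())

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ssb / df_between) / (sse / df_within)
    return ssb, sse, df_between, df_within, f_stat


ssb, sse, df_b, df_w, f_stat = anova_from_summaries(summaries)
print(f"SSB = {ssb:.2f} (df = {df_b}), SSE = {sse:.2f} (df = {df_w}), F = {f_stat:.1f}")
```

The degrees of freedom (2 and 27) match Table 14 exactly. The sums of squares and F differ slightly from the tabulated 5.6, 1.5, and 49.1, presumably because the table was computed from unrounded data.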

6. Conclusion and Discussion: This example uses a simple one-way ANOVA to compare the mean values of three sets of results using data drawn from the same test section. The error bar plots for data from the three different parties visually illustrate the statistical differences in the multiple means. However, the F-test for multiple means should be used to formally test the hypothesis of the equality of means. The interpretation of results will be misleading if the variances of populations being compared for their mean difference are not equal. Based on the comparison of the three means, it can be concluded that the construction material in this example may not have consistent properties, as indicated by the results from the independent consultant.

7. Applications in Other Areas of Transportation Research: Simple one-way ANOVA is often used when more than two means must be compared. Examples of applications in other areas of transportation research include:
• Traffic Safety/Operations: to evaluate the effect of intersection type on the average number of accidents per month. Three or more types of intersections (e.g., signalized, non-signalized, and rotary) could be selected for study in an urban area having similar traffic volumes and vehicle mix.
• Pavement Engineering
– to investigate the effect of hot-mix asphalt (HMA) layer thickness on fatigue cracking after 20 years of service life. Three HMA layer thicknesses (5 inches, 6 inches, and 7 inches) are to be involved in this study, and other factors (i.e., traffic, climate, and subbase/base thicknesses and subgrade types) need to be similar.
– to determine the effect of climatic conditions on rutting performance of flexible pavements. Three or more climatic conditions (e.g., wet-freeze, wet-no-freeze, dry-freeze, and dry-no-freeze) need to be considered while other factors (i.e., traffic, HMA, and subbase/base thicknesses and subgrade types) need to be similar.

Example 10: Pavements; Simple Analysis of Variance (ANOVA)
Area: Pavements
Method of Analysis: Simple analysis of variance (ANOVA) comparing the mean values of more than two samples and using the F-test

1. Research Question/Problem Statement: The aggregate coefficient of thermal expansion (CTE) in Portland cement concrete (PCC) is a critical factor affecting thermal behavior of PCC slabs in concrete pavements. In addition, the interaction between slab curling (caused by the thermal gradient) and axle loads is assumed to be a critical factor for concrete pavement performance in terms of cracking. To verify the effect of aggregate CTE on slab cracking, a pavement engineer wants to conduct a simple observational study by collecting field pavement performance data on three different types of pavement. For this example, three types of aggregate (limestone, dolomite, and gravel) are being used in concrete pavement construction and yield the following CTEs:
• 4 in./in. per °F
• 5 in./in. per °F
• 6.5 in./in. per °F
It is necessary to compare more than two mean values. A simple one-way ANOVA is used to analyze the observed slab cracking performance by the three different concrete mixes with different aggregate types based on geology (limestone, dolomite, and gravel). All other factors that might cause variation in cracking are assumed to be held constant.

2. Identification and Description of Variables: The engineer identifies 1-mile sections of uniform pavement within the state highway network with similar attributes (aggregate type, slab thickness, joint spacing, traffic, and climate). Field performance, in terms of the observed percentage of slab cracked (“% slab cracked,” i.e., how much of each slab is cracked) for each pavement section after about 20 years of service, is considered in the analysis. The available pavement data are grouped (stratified) based on the aggregate type (CTE value). The % slab cracked after 20 years is the dependent variable, while CTE of aggregates is the independent variable. The question is whether pavement sections having different types of aggregate (CTE values) exhibit similar performance based on their variability.

Question/Issue: Compare the means of more than two samples. Specifically, is the cracking performance of concrete pavements designed using more than two different types of aggregates the same? Stated a bit differently, is the performance of three different types of concrete pavement statistically different (are the mean performance measures different)?

3. Data Collection: From the data stratified by CTE, the engineer randomly selects nine pavement sections within each CTE category (i.e., 4, 5, and 6.5 in./in. per °F). The sample size is based on the statistical power (1 − β) requirements. (For a discussion on sample size determination based on statistical power requirements, see NCHRP Project 20-45, Volume 2, Chapter 1, “Sample Size Determination.”) The descriptive statistics for the data, organized by the three CTE categories, are shown in Table 15.

Table 15. Pavement performance data (% slab cracked after 20 years).
CTE (in./in. per °F) | Mean ($\bar{X}_i$) | Standard deviation ($s_i$) | Sample size ($n_i$)
4 | 37 | 4.8 | 9
5 | 53.7 | 6.1 | 9
6.5 | 72.5 | 6.3 | 9

4. Specification of Analysis Technique and Data Analysis: Because the engineer is concerned with the comparison of more than two mean values, the easiest way to make the statistical comparison is to perform a one-way ANOVA (see NCHRP Project 20-45, Volume 2, Chapter 4). The comparison will help to determine whether the between-section variability is large relative to the within-section variability. More formally, the following hypotheses are tested:
Ho: All mean values are equal (i.e., µ1 = µ2 = µ3).
Ha: At least one of the means is different from the rest.
Although rejection of the null hypothesis gives the engineer some information concerning difference among the population means, it does not tell the engineer anything about how the means differ from each other. For example, does µ1 differ from µ2 or µ3? To control the experiment-wise error rate (EER) for multiple mean comparisons, a conservative test, Tukey’s procedure for unplanned comparisons, can be used. (Information about Tukey’s procedure can be found in almost any good statistics textbook, such as those by Freund and Wilson [2003] and Kutner et al. [2005].) The F-statistic calculated for determining the effect of CTE on % slab cracked after 20 years is shown in Table 16.

The data in Table 16 have been produced by considering the original data and following the procedures presented in earlier examples. The emphasis in this example is on understanding what the table of results provides the researcher. Also in this example, the test for homogeneity of variances (Levene test) shows no significant difference among the standard deviations of % slab cracked for different CTE values. Figure 10 presents the mean and associated 95% confidence intervals of the average % slab cracked (also called the mean and error bars) measured for the three CTE categories considered.

Table 16. ANOVA results.
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F | Significance
Between groups | 5652.7 | 2 | 2826.3 | 84.1 | 0.000
Within groups | 806.9 | 24 | 33.6 | |
Total | 6459.6 | 26 | | |

Figure 10. Error bars for % slab cracked with different CTE.

5. Interpreting the Results: A simple one-way ANOVA is conducted to determine if there is a difference among the mean values for % slab cracked for different CTE values. The analysis shows that the F-statistic is significant (p-value < 0.05), meaning that at least two of the means are statistically significantly different from each other. To gain more insight, the engineer can use Tukey’s procedure to specifically compare the mean values, or the engineer may simply observe the plotted 95% confidence intervals to ascertain which means are significantly different from each other (see Figure 10). The plotted results show that the mean % slab cracked varies significantly for different CTE values: there is no overlap between the different mean/error bars. Figure 10 also shows that the mean % slab cracked is significantly higher for pavement sections having a higher CTE value. (For more information about Tukey’s procedure, see NCHRP Project 20-45, Volume 2, Chapter 4.)

6. Conclusion and Discussion: In this example, simple one-way ANOVA is used to assess the effect of CTE on cracking performance of rigid pavements. The F-test for multiple means is used to formally test the (null) hypothesis of mean equality. The confidence interval plots for data from pavements having three different CTE values visually illustrate the statistical differences in the three means. The interpretation of results will be misleading if the variances of populations being compared for their mean difference are not equal or if a proper multiple mean comparisons procedure is not adopted. Based on the comparison of the three means in this example, the engineer can conclude that the pavement slabs having aggregates with a higher CTE value will exhibit more cracking than those with lower CTE values, given that all other variables (e.g., climate effects) remain constant.

7. Applications in Other Areas of Transportation Research: Simple one-way ANOVA is widely used and can be employed whenever multiple means within a factor are to be compared with one another. Potential applications in other areas of transportation research include:
• Traffic Operations: to evaluate the effect of commuting time on level of service (LOS) of an urban highway. Mean travel times for three periods (e.g., morning, afternoon, and evening) could be selected for specified highway sections to collect the traffic volume and headway data in all lanes.
• Traffic Safety: to determine the effect of shoulder width on accident rates on rural highways. More than two shoulder widths (e.g., 0 feet, 6 feet, 9 feet, and 12 feet) should be selected in this study.
• Pavement Engineering: to investigate the impact of air void content on flexible pavement fatigue performance. Pavement sections having three or more air void contents (e.g., 3%, 5%, and 7%) in the surface HMA layer could be selected to compare their average fatigue cracking performance after the same period of service (e.g., 15 years).
• Materials: to study the effect of aggregate gradation on the rutting performance of flexible pavements. Three types of aggregate gradations (fine, intermediate, and coarse) could be adopted in the laboratory to make different HMA mix samples. Performance testing could be conducted in the laboratory to measure rut depths for a given number of load cycles.
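The visual check described in Example 10 (non-overlapping 95% confidence intervals in Figure 10) can be approximated from the summary statistics in Table 15. The sketch below is one way to do it, under two assumptions that are not stated in the report: each interval is built from its own group standard deviation, and the critical value 2.306 is the standard two-sided 95% t-value for 8 degrees of freedom taken from a t-table. All names are illustrative.

```python
import math

# Two-sided 95% t critical value for n - 1 = 8 degrees of freedom.
# Taken from a standard t-table; not a value quoted in the report.
T_CRIT_8DF = 2.306

# CTE category -> (mean % slab cracked, standard deviation, n), from Table 15.
table_15 = {
    4.0: (37.0, 4.8, 9),
    5.0: (53.7, 6.1, 9),
    6.5: (72.5, 6.3, 9),
}


def confidence_interval(mean, sd, n, t_crit):
    """Return the (lower, upper) confidence limits for a group mean."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


intervals = {
    cte: confidence_interval(mean, sd, n, T_CRIT_8DF)
    for cte, (mean, sd, n) in table_15.items()
}

for cte, (low, high) in intervals.items():
    print(f"CTE {cte}: {low:.1f} to {high:.1f} % slab cracked")

ctes = sorted(intervals)
any_overlap = any(
    overlaps(intervals[a], intervals[b])
    for i, a in enumerate(ctes)
    for b in ctes[i + 1:]
)
print("Some intervals overlap" if any_overlap else "No intervals overlap")
```

Under these assumptions none of the three intervals overlap, which agrees with the description of Figure 10. Interval overlap is only an informal screen, so the F-test and a multiple-comparison procedure such as Tukey’s remain the formal tools.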
Example 11: Pavements; Factorial Design (ANOVA Approach)
Area: Pavements
Method of Analysis: Factorial design (an ANOVA approach used to explore the effects of varying more than one independent variable)

1. Research Question/Problem Statement: Extending the information from Example 10 (a simple ANOVA example for pavements), the pavement engineer has verified that the coefficient of thermal expansion (CTE) in Portland cement concrete (PCC) is a critical factor affecting thermal behavior of PCC slabs in concrete pavements and significantly affects concrete pavement performance in terms of cracking. The engineer now wants to investigate the effects of another factor, joint spacing (JS), in addition to CTE. To study the combined effects of PCC CTE and JS on slab cracking, the engineer needs to conduct a factorial design study by collecting field pavement performance data. As before, three CTEs will be considered:
• 4 in./in. per °F,
• 5 in./in. per °F, and
• 6.5 in./in. per °F.
Now, three different joint spacings (12 ft, 16 ft, and 20 ft) also will be considered. For this example, it is necessary to compare multiple means within each factor (main effects) and the interaction between the two factors (interactive effects). The statistical technique involved is called a multifactorial two-way ANOVA.

2. Identification and Description of Variables: The engineer identifies uniform 1-mile pavement sections within the state highway network with similar attributes (e.g., slab thickness, traffic, and climate). The field performance, in terms of observed percentage of each slab cracked (% slab cracked) after about 20 years of service for each pavement section, is considered the dependent (or response) variable in the analysis. The available pavement data are stratified based on CTE and JS. CTE and JS are considered the independent variables. The question is whether pavement sections having different CTE and JS exhibit similar performance based on their variability.

Question/Issue: Use collected data to determine the effects of varying more than one independent variable on some measured outcome. In this example, compare the cracking performance of concrete pavements considering two independent variables: (1) coefficients of thermal expansion (CTE) as measured using more than two types of aggregate and (2) differing joint spacing (JS). More formally, the hypotheses can be stated as follows:
Ho: αi = 0, No difference in % slabs cracked for different CTE values.
Ho: γj = 0, No difference in % slabs cracked for different JS values.
Ho: (αγ)ij = 0, for all i and j, No difference in % slabs cracked for different CTE and JS combinations.

3. Data Collection: The descriptive statistics for % slab cracked data by three CTE and three JS categories are shown in Table 17. From the data stratified by CTE and JS, the engineer has randomly selected three pavement sections within each of the nine combinations of CTE and JS values. (In other words, for each of the nine pavement sections from Example 10, the engineer has selected three JS.)

Table 17. Summary of cracking data. Entries are mean % slab cracked, with the standard deviation in parentheses; n = 3 in each cell.
Joint spacing (ft) | CTE = 4 | CTE = 5 | CTE = 6.5 | Marginal (over CTE)
12 | 32.4 (0.1) | 46.8 (1.8) | 65.3 (3.2) | 48.2 (14.4)
16 | 36.0 (2.4) | 54 (2.9) | 73 (1.1) | 54.3 (16.1)
20 | 42.7 (2.4) | 60.3 (0.5) | 79.1 (2.0) | 60.7 (15.9)
Marginal (over JS) | 37.0 (4.8) | 53.7 (6.1) | 72.5 (6.3) | 54.4 (15.8)
Note: CTE is in in./in. per °F.

4. Specification of Analysis Technique and Data Analysis: The engineer can use two-way ANOVA test statistics to determine whether the between-section variability is large relative to the within-section variability for each factor to test the following null hypotheses:
• Ho: αi = 0
• Ho: γj = 0
• Ho: (αγ)ij = 0
As mentioned before, although rejection of the null hypothesis does give the engineer some information concerning differences among the population means (i.e., there are differences among them), it does not clarify which means differ from each other. For example, does µ1 differ from µ2 or µ3? To control the experiment-wise error rate (EER) for the comparison of multiple means, a conservative test, Tukey’s procedure for an unplanned comparison, can be used. (Information about two-way ANOVA is available in NCHRP Project 20-45, Volume 2, Chapter 4. Information about Tukey’s procedure can be found in almost any good statistics textbook, such as those by Freund and Wilson [2003] and Kutner et al. [2005].)

The results of the two-way ANOVA are shown in Table 18. From the first two lines it can be seen that both of the main effects, CTE and JS, are significant in explaining cracking behavior (i.e., both p-values < 0.05). However, the interaction (CTE × JS) is not significant (i.e., the p-value is 0.999, much greater than 0.05). Also, the test for homogeneity of variances (Levene statistic) shows that there is no significant difference among the standard deviations of % slab cracked for different CTE and JS values. Figure 11 illustrates the main and interactive effects of CTE and JS on % slabs cracked.

5. Interpreting the Results: A two-way (multifactorial) ANOVA is conducted to determine if a difference exists among the mean values for “% slab cracked” for different CTE and JS values. The analysis shows that the main effects of both CTE and JS are significant, while the interaction effect is not significant (p-value > 0.05). These results show that, when CTE and JS are considered jointly, each significantly affects slab cracking on its own. Given these results, the conclusions will be based on the main effects alone without considering interaction effects. Had the interaction effect been significant, the conclusions would instead have to be based on the interaction effects. To gain more insight, the engineer can use Tukey’s procedure to compare specific multiple means within each factor, or the engineer can simply observe the plotted means in Figure 11 to ascertain which means are significantly different from each other. The plotted results show that the mean % slab cracked varies significantly for different CTE and JS values, and that CTE seems to be more influential than JS.
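The internal arithmetic of the two-way ANOVA table can be cross-checked with a few lines of code. The sketch below takes the sums of squares and degrees of freedom reported in Table 18 as given, recovers the residual sum of squares from the cell standard deviations in Table 17 (with n = 3 per cell, each cell contributes (n − 1)·s²), and recomputes the mean squares and F-ratios. The names are illustrative assumptions, and the script does not attempt to re-derive the main-effect or interaction sums of squares, since the raw section data are not listed.

```python
# Cross-check of the two-way ANOVA table (Table 18) using reported quantities.
# Names are illustrative; raw observations are not available in the text.

N_PER_CELL = 3

# Cell standard deviations from Table 17 (rows: JS = 12, 16, 20 ft;
# columns: CTE = 4, 5, 6.5).
cell_sds = [
    [0.1, 1.8, 3.2],
    [2.4, 2.9, 1.1],
    [2.4, 0.5, 2.0],
]

# Residual (error) sum of squares: each cell contributes (n - 1) * s^2.
sse = sum((N_PER_CELL - 1) * sd ** 2 for row in cell_sds for sd in row)
df_error = sum(len(row) for row in cell_sds) * (N_PER_CELL - 1)
mse = sse / df_error

# Effect -> (sum of squares, degrees of freedom), as reported in Table 18.
effects = {
    "CTE": (5677.74, 2),
    "JS": (703.26, 2),
    "CTE x JS": (0.12, 4),
}

f_ratios = {name: (ss / df) / mse for name, (ss, df) in effects.items()}

print(f"SSE = {sse:.2f}, df = {df_error}, MSE = {mse:.2f}")
for name, f_value in f_ratios.items():
    print(f"{name}: MS = {effects[name][0] / effects[name][1]:.2f}, F = {f_value:.3f}")
```

The recovered residual sum of squares and mean square agree with the Residual/error row of Table 18, and the F-ratios agree with the tabulated values to within rounding.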
In the interaction plot of Figure 11, all lines are almost parallel to each other when plotted for both factors together, showing no interactive effects between the levels of the two factors.

Table 18. ANOVA results.
Source | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F | Significance
CTE | 5677.74 | 2 | 2838.87 | 657.16 | 0.000
JS | 703.26 | 2 | 351.63 | 81.40 | 0.000
CTE × JS | 0.12 | 4 | 0.03 | 0.007 | 0.999
Residual/error | 77.76 | 18 | 4.32 | |
Total | 6458.88 | 26 | | |

Figure 11. Main and interaction effects of CTE and JS on slab cracking. (Two panels: a main effects plot of mean % slabs cracked for CTE and for JS, and an interaction plot of mean % slabs cracked versus joint spacing (ft) with one line each for CTE = 4.0, 5.0, and 6.5.)

6. Conclusion and Discussion: The two-way ANOVA can be used to verify the combined effects of CTE and JS on cracking performance of rigid pavements. The marginal mean plot for cracking having three different CTE and JS levels visually illustrates the differences in the multiple means. The plot of cell means for cracking within the levels of each factor can indicate the presence of interactive effect between two factors (in this example, CTE and JS). However, the F-test for multiple means should be used to formally test the hypothesis of mean equality. Finally, based on the comparison of three means within each factor (CTE and JS), the engineer can conclude that the pavement slabs having aggregates with higher CTE and JS values will exhibit more cracking than those with lower CTE and JS values. In this example, the effect of CTE on concrete pavement cracking seems to be more critical than that of JS.

7. Applications in Other Areas of Transportation Research: Multifactorial designs can be used when more than one factor is considered in a study. Possible applications of these methods can extend to all transportation-related areas, including:
• Pavement Engineering
– to determine the effects of base type and base thickness on pavement performance of flexible pavements. Two or more levels can be considered within each factor; for example, two base types (aggregate and asphalt-treated bases) and three base thicknesses (8 inches, 12 inches, and 18 inches).
– to investigate the impact of pavement surface conditions and vehicle type on fuel consumption. The researcher can select pavement sections with three levels of ride quality (smooth, rough, and very rough) and three types of vehicles (cars, vans, and trucks).
The fuel consumptions can be measured for each vehicle type on all surface conditions to determine their impact. • Materials – to study the effects of aggregate gradation and surface on tensile strength of hot-mix asphalt (HMA). The engineer can evaluate two levels of gradation (fine and coarse) and two types of aggregate surfaces (smooth and rough). The samples can be prepared for all the combinations of aggregate gradations and surfaces for determination of tensile strength in the laboratory. – to compare the impact of curing and cement types on the compressive strength of concrete mixture. The engineer can design concrete mixes in laboratory utilizing two cement types (Type I & Type III). The concrete samples can be cured in three different ways for 24 hours and 7 days (normal curing, water bath, and room temperature). Example 12: Work Zones; Simple Before-and-After Comparisons Area: Work zones Method of Analysis: Simple before-and-after comparisons (exploring the effect of some treat- ment before it is applied versus after it is applied) 1. Research Question/Problem Statement: The crash rate in work zones has been found to be higher than the crash rate on the same roads when a work zone is not present. For this reason, the speed limit in construction zones often is set lower than the prevailing non-work-zone speed limit. The state DOT decides to implement photo-radar speed enforcement in a work zone to determine if this speed-enforcement technique reduces the average speed of free- flowing vehicles in the traffic stream. They measure the speeds of a sample of free-flowing vehicles prior to installing the photo-radar speed-enforcement equipment in a work zone and

48 effective experiment Design and Data analysis in transportation research then measure the speeds of free-flowing vehicles at the same location after implementing the photo-radar system. Question/Issue Use collected data to determine whether a difference exists between results before and after some treatment is applied. For this example, does a photo-radar speed- enforcement system reduce the speed of free-flowing vehicles in a work zone, and, if so, is the reduction statistically significant? 2. Identification and Description of Variables: The variable to be analyzed is the mean speed of vehicles before and after the implementation of a photo-radar speed-enforcement system in a work zone. 3. Data Collection: The speeds of individual free-flowing vehicles are recorded for 30 minutes on a Tuesday between 10:00 a.m. and 10:30 a.m. before installing the photo-radar system. After the system is installed, the speeds of individual free-flowing vehicles are recorded for 30 minutes on a Tuesday between 10:00 a.m. and 10:30 a.m. The before sample contains 120 observations and the after sample contains 100 observations. 4. Specification of Analysis Technique and Data Analysis: A test of the significance of the difference between two means requires a statement of the hypothesis to be tested (Ho) and a statement of the alternate hypothesis (H1). In this example, these hypotheses can be stated as follows: Ho: There is no difference in the mean speed of free-flowing vehicles before and after the photo-radar speed-enforcement system is displayed. H1: There is a difference in the mean speed of free-flowing vehicles before and after the photo-radar speed-enforcement system is displayed. Because these two samples are independent, a simple t-test is appropriate to test the stated hypotheses. This test requires the following procedure: Step 1. 
Compute the mean speed ($\bar{x}$) for the before sample ($\bar{x}_b$) and the after sample ($\bar{x}_a$) using the following equation:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}; \quad n_b = 120 \text{ and } n_a = 100$$

Results: $\bar{x}_b$ = 53.1 mph and $\bar{x}_a$ = 50.5 mph.

Step 2. Compute the variance ($S^2$) for each sample using the following equation:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where $n_a$ = 100; $\bar{x}_a$ = 50.5 mph; $n_b$ = 120; and $\bar{x}_b$ = 53.1 mph.

Results:

$$S_b^2 = \frac{\sum (x_b - \bar{x}_b)^2}{n_b - 1} = 12.06 \quad \text{and} \quad S_a^2 = \frac{\sum (x_a - \bar{x}_a)^2}{n_a - 1} = 12.97$$

Step 3. Compute the pooled variance of the two samples using the following equation:

$$S_p^2 = \frac{\sum (x_a - \bar{x}_a)^2 + \sum (x_b - \bar{x}_b)^2}{n_b + n_a - 2}$$

Results: $S_p^2$ = 12.472 and $S_p$ = 3.532.

Step 4. Compute the t-statistic using the following equation:

$$t = \frac{\bar{x}_b - \bar{x}_a}{S_p \sqrt{\dfrac{n_a + n_b}{n_a n_b}}}$$

Result:

$$t = \frac{53.1 - 50.5}{3.532 \sqrt{\dfrac{100 + 120}{(100)(120)}}} = 5.43$$

5. Interpreting the Results: The results of the sample t-test are obtained by comparing the value of the calculated t-statistic (5.43 in this example) with the value of the t-statistic for the level of confidence desired. For a level of confidence of 95%, the t-statistic must be greater than 1.96 to reject the null hypothesis ($H_0$) that the use of a photo-radar speed-enforcement system does not change the speed of free-flowing vehicles. (For more information, see NCHRP Project 20-45, Volume 2, Appendix C, Table C-4.)

6. Conclusion and Discussion: The sample problem illustrates the use of a statistical test to determine whether the difference in the value of the variable of interest between the before conditions and the after conditions is statistically significant. The before condition is without photo-radar speed enforcement, and the after condition is with photo-radar speed enforcement. In this sample problem, the computed t-statistic (5.43) is greater than the critical t-statistic (1.96), so the null hypothesis is rejected. This means the change in the speed of free-flowing vehicles when the photo-radar speed-enforcement system is used is statistically significant. The assumption is made that all other factors that would affect the speed of free-flowing vehicles (e.g., traffic mix, weather, or construction activity) are the same in the before-and-after conditions. The t-test is robust to moderate departures from normality; however, the normality assumption should still be checked, for example by using box plots. For significant departures from the normality and equal-variance assumptions, non-parametric tests should be conducted. (For more information, see NCHRP Project 20-45, Volume 2, Chapter 6, Section C and also Example 21.)
The reliability of the results in this example could be improved by using a control group. As the example has been constructed, there is an assumption that the only thing that changed at this site was the use of photo-radar speed enforcement; that is, it is assumed that all observed differences are attributable to the use of the photo-radar. If other factors—even something as simple as a general decrease in vehicle speeds in the area—might have impacted speed changes, the effect of the photo-radar speed enforcement would have to be adjusted for those other factors. Measurements taken at a control site (ideally identical to the experiment site) during the same time periods could be used to detect background changes and then to adjust the photo-radar effects. Such a situation is explored in Example 13. 7. Applications in Other Areas in Transportation: The before-and-after comparison can be used whenever two independent samples of data are (or can be assumed to be) normally distributed with equal variance. Applications of before-and-after comparison in other areas of transportation research may include: • Traffic Operations – to compare the average delay to vehicles approaching a signalized intersection when a fixed time signal is changed to an actuated signal or a traffic-adaptive signal. – to compare the average number of vehicles entering and leaving a driveway when access is changed from full access to right-in, right-out only. • Traffic Safety – to compare the average number of crashes on a section of road before and after the road is resurfaced. – to compare the average number of speeding citations issued per day when a stationary operation is changed to a mobile operation. • Maintenance—to compare the average number of citizen complaints per day when a change is made in the snow plowing policy.
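The four steps of the Example 12 t-test can be retraced from the reported summary statistics alone. The following is a minimal Python sketch, not part of the original report; the function name `pooled_t_test` and its argument names are illustrative assumptions. The means, variances, and sample sizes are the values given in Steps 1 and 2, and 1.96 is the critical value quoted in the text.

```python
import math

def pooled_t_test(mean_b, var_b, n_b, mean_a, var_a, n_a):
    """Pooled-variance t-statistic for two independent samples.

    Hypothetical helper for illustration; it follows Steps 3 and 4 of Example 12.
    """
    # Step 3: pooled variance. (n - 1) * S^2 recovers each sum of squared deviations.
    pooled_var = ((n_b - 1) * var_b + (n_a - 1) * var_a) / (n_b + n_a - 2)
    pooled_sd = math.sqrt(pooled_var)
    # Step 4: t-statistic.
    t = (mean_b - mean_a) / (pooled_sd * math.sqrt((n_a + n_b) / (n_a * n_b)))
    return pooled_var, pooled_sd, t

# Summary statistics reported in Steps 1 and 2 (speeds in mph).
pooled_var, pooled_sd, t_stat = pooled_t_test(
    mean_b=53.1, var_b=12.06, n_b=120,
    mean_a=50.5, var_a=12.97, n_a=100,
)

CRITICAL_T = 1.96  # critical value quoted in the text for 95% confidence
reject_null = abs(t_stat) > CRITICAL_T

print(f"pooled variance = {pooled_var:.3f}, pooled sd = {pooled_sd:.3f}")
print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```

Because the inputs are rounded summary values, the results may differ from the published figures in the last decimal place.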

50 effective experiment Design and Data analysis in transportation research Example 13: Traffic Safety; Complex Before-and-After Comparisons and Controls Area: Traffic safety Method of Analysis: Complex before-and-after comparisons using control groups (examining the effect of some treatment or application with consideration of other factors that may also have an effect) 1. Research Question/Problem Statement: A state safety engineer wants to estimate the effec- tiveness of fluorescent orange warning signs as compared to standard orange signs in work zones on freeways and other multilane highways. Drivers can see fluorescent signs from a longer distance than standard signs, especially in low-visibility conditions, and the extra cost of the fluorescent material is not too high. Work-zone safety is a perennial concern, especially on freeways and multilane highways where speeds and traffic volumes are high. Question/Issue How can background effects be separated from the effects of a treatment or application? Compared to standard orange signs, do fluorescent orange warning signs increase safety in work zones on freeways and multilane highways? 2. Identification and Description of Variables: The engineer quickly concludes that there is a need to collect and analyze safety surrogate measures (e.g., traffic conflicts and late lane changes) rather than collision data. It would take a long time and require experimentation at many work zones before a large sample of collision data could be ready for analysis on this question. Surrogate measures relate to collisions, but they are much more numerous and it is easier to collect a large sample of them in a short time. For a study of traffic safety, surrogate measures might include near-collisions (traffic conflicts), vehicle speeds, or locations of lane changes. In this example, the engineer chooses to use the location of the lane-change maneuver made by drivers in a lane to be closed entering a work zone. 
This particular surrogate safety measure is a measure of effectiveness (MOE). The hypothesis is that the farther downstream at which a driver makes a lane change out of a lane to be closed—when the highway is still below capacity—the safer the work zone. 3. Data Collection: The engineer establishes site selection criteria and begins examining all active work zones on freeways and multilane highways in the state for possible inclusion in the study. The site selection criteria include items such as an active work zone, a cooperative contractor, no interchanges within the approach area, and the desired lane geometry. Seven work zones meet the criteria and are included in the study. The engineer decides to use a before-and-after (sometimes designated B/A or b/a) experiment design with randomly selected control sites. The latter are sites in the same population as the treatment sites; that is, they meet the same selection criteria but are untreated (i.e., standard warning signs are employed, not the fluorescent orange signs). This is a strong experiment design because it minimizes three common types of bias in experiments: history, maturation, and regression to the mean. History bias exists when changes (e.g., new laws or large weather events) happen at about the same time as the treatment in an experiment, so that the engineer or analyst cannot separate the effect of the treatment from the effects of the other events. Maturation bias exists when gradual changes occur throughout an extended experiment period and cannot be separated from the effects of the treatment. Examples of maturation bias might involve changes like the aging of driver populations or new vehicles with more air bags. History and maturation biases are referred to as specification errors and are described in more detail in NCHRP Project 20-45, Volume 2,

examples of effective experiment Design and Data analysis in transportation research 51 Chapter 1, in the section “Quasi-Experiments.” Regression-to-the-mean bias exists when sites with the highest MOE levels in the before time period are treated. If the MOE level falls in the after period, the analyst can never be sure how much of the fall was due to the treatment and how much was due to natural fluctuations in the values of the MOE back toward its usual mean value. A before-and-after study with randomly selected control sites minimizes these biases because their effects are expected to apply just as much to the treatment sites as to the control sites. In this example, the engineer randomly selects four of the seven work zones to receive fluorescent orange signs. The other three randomly selected work zones received standard orange signs and are the control sites. After the signs have been in place for a few weeks (a common tactic in before-and-after studies to allow regular drivers to get used to the change), the engineer collects data at all seven sites. The location of each vehicle’s lane-change maneuver out of the lane to be closed is measured from video tape recorded for several hours at each site. Table 19 shows the lane-change data at the midpoint between the first warning sign and beginning of the taper. Notice that the same number of vehicles is observed in the before-and- after periods for each type of site. 4. Specification of Analysis Technique and Data Analysis: Depending on their format, data from a before-and-after experiment with control sites may be analyzed several ways. The data in the table lend themselves to analysis with a chi-square test to see whether the distributions between the before-and-after conditions are the same at both the treatment and control sites. 
(For more information about chi-square testing, see NCHRP Project 20-45, Volume 2, Chapter 6, Section E, “Chi-Square Test for Independence.”)

To perform the chi-square test on the data for Example 13, the engineer first computes the expected value in each cell. For the cell corresponding to the before time period for control sites, this value is computed as the row total (3361) times the column total (2738) divided by the grand total (6714):

$$\frac{3361 \times 2738}{6714} = 1371 \text{ vehicles}$$

The engineer next computes the chi-square value for each cell using the following equation:

$$\chi_i^2 = \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the number of actual observations in cell i and $E_i$ is the expected number of observations in cell i. For example, the chi-square value in the cell corresponding to the before time period for control sites is $(1262 - 1371)^2 / 1371 = 8.6$ (using the unrounded expected value of 1370.6). The engineer then sums the chi-square values from all four cells to get 29.1. That sum is then compared to the critical chi-square value for the significance level of 0.025 with 1 degree of freedom (degrees of freedom = (number of rows - 1) × (number of columns - 1)), which is shown on a standard chi-square distribution table to be 5.02 (see NCHRP Project 20-45, Volume 2, Appendix C, Table C-2). A significance level of 0.025 is not uncommon in such experiments (although 0.05 is a general default value); it is a demanding standard, but not an impossible one to meet.

Table 19. Lane-change data for before-and-after comparison using controls.
Number of vehicles observed in lane to be closed at midpoint:

| Time Period | Control | Treatment | Total |
| Before      | 1262    | 2099      | 3361  |
| After       | 1476    | 1877      | 3353  |
| Total       | 2738    | 3976      | 6714  |
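The chi-square computation described for Table 19 can be checked with a few lines of code. This is a minimal Python sketch, not part of the original report; the variable names are illustrative assumptions, and the critical value 5.02 is taken from the text rather than computed.

```python
# Observed counts from Table 19.
# Rows: before, after. Columns: control, treatment.
observed = [
    [1262, 2099],
    [1476, 1877],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic = sum over all cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom = (rows - 1) * (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_VALUE = 5.02  # quoted in the text for significance 0.025 and 1 degree of freedom
reject_null = chi_square > CRITICAL_VALUE

print(f"expected before/control = {expected[0][0]:.1f}")
print(f"chi-square = {chi_square:.1f}, df = {degrees_of_freedom}, reject H0: {reject_null}")
```

Working from the observed counts rather than hand-rounded expected values avoids small rounding differences in the per-cell contributions.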

52 effective experiment Design and Data analysis in transportation research 5. Interpreting the Results: Because the calculated chi-square value is greater than the critical chi-square value, the engineer concludes that there is a statistically significant difference in the number of vehicles in the lane to be closed at the midpoint between the before-and-after time periods for the treatment sites relative to what would be expected based on the control sites. In other words, there is a difference that is due to the treatment. 6. Conclusion and Discussion: The experiment results show that fluorescent orange signs in work zone approaches like those tested would likely have a safety benefit. Although the engi- neer cannot reasonably estimate the number of collisions that would be avoided by using this treatment, the before-and-after study with control using a safety surrogate measure makes it clear that some collisions will be avoided. The strength of the experiment design with randomly selected control sites means that agencies can have confidence in the results. The consequences of an error in an analysis like this that results in the wrong conclusion can be devastating. If the error leads an agency to use a safety measure more than it should, precious safety funds will be wasted that could be put to better use. If the error leads an agency to use the safety measure less often than it should, money will be spent on measures that do not prevent as many collisions. With safety funds in such short supply, solid analyses that lead to effective decisions on countermeasure deployment are of great importance. A before-and-after experiment with control is difficult to arrange in practice. Such an experiment is practically impossible using collision data, because that would mean leaving some higher collision sites untreated during the experiment. Such experiments are more plausible using surrogate measures like the one described in this example. 7. 
Applications in Other Areas of Transportation Research: Before-and-after experiments with randomly selected control sites are difficult to arrange in transportation safety and other areas of transportation research. The instinct to apply treatments to the worst sites, rather than randomly—as this method requires—is difficult to overcome. Despite the difficulties, such experiments are sometimes performed in: • Traffic Operations—to test traffic control strategies at a number of different intersections. • Pavement Engineering—to compare new pavement designs and maintenance processes to current designs and practice. • Materials—to compare new materials, mixes, or processes to standard mixtures or processes. Example 14: Work Zones; Trend Analysis Area: Work zones Method of Analysis: Trend analysis (examining, describing, and modeling how something changes over time) 1. Research Question/Problem Statement: Measurements conducted over time often reveal patterns of change called trends. A model may be used to predict some future measurement, or the relative success of a different treatment or policy may be assessed. For example, work/ construction zone safety has been a concern for highway officials, engineers, and planners for many years. Is there a pattern of change? Question/Issue Can a linear model represent change over time? In this particular example, is there a trend over time for motor vehicle crashes in work zones? The problem is to predict values of crash frequency at specific points in time. Although the question is simple, the statistical modeling becomes sophisticated very quickly.

2. Identification and Description of Variables: Highway safety, or rather the lack of it, is revealed by the total number of fatalities due to motor vehicle crashes. The percentage of those deaths occurring in work zones reveals a pattern over time (Figure 12). The trend in the plotted data is modeled using the following equation:

$$WZP = a + b(YEAR) + u$$

where WZP = work zone percentage of total fatalities, YEAR = calendar year, and u = an error term, as used here.

3. Data Collection: The base data are obtained from the Fatality Analysis Reporting System maintained by the National Highway Traffic Safety Administration (NHTSA), as reported at www.workzonesafety.org. The data are state specific as well as for the country as a whole, and cover a period of 26 years from 1982 through 2007. The numbers of fatalities from motor vehicle crashes in and not in construction/maintenance zones (work zones) are used to compute the percentage of fatalities in work zones for each of the 26 years.

4. Specification of Analysis Techniques and Data Analysis: Ordinary least squares (OLS) regression is used to develop the general model specified above. The discussion in this example focuses on the resulting model and the related statistics. (See also examples 15, 16, and 17 for details on calculations. For more information about OLS regression, see NCHRP Project 20-45, Volume 2, Chapter 4, Section B, “Linear Regression.”) Looking at the data in Figure 12 another way, the fitted model is:

$$WZP = -91.523 + 0.047(YEAR)$$

t-values: -8.34 (intercept) and 8.51 (slope)
p-values: 0.000 (intercept) and 0.000 (slope)
R = 0.867; $R^2$ = 0.751

The trend is significant: the line (trend) shows an increase of 0.047 percentage points each year. Generally, this trend shows that work-zone fatalities are increasing as a percentage of total fatalities.

5.
Interpreting the Results: The model is a good fit and generally shows that work-zone fatalities were an increasing problem over the period 1982 through 2007. This is a trend that highway officials, engineers, and planners would like to change. The analyst is therefore interested in anticipating the trajectory of the trend. Here the trend suggests that things are getting worse.

Figure 12. Percentage of all motor vehicle fatalities occurring in work zones.
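To make the fitted trend concrete, the reported equation can be evaluated at a few years inside the data range. This is a minimal Python sketch, not part of the original report: the function name `predicted_wzp` is an illustrative assumption, and the coefficients are the rounded values reported for Example 14, so the outputs are approximate.

```python
# Rounded OLS coefficients reported for Example 14.
INTERCEPT = -91.523
SLOPE = 0.047  # percentage points per calendar year

def predicted_wzp(year):
    """Predicted work-zone percentage of total fatalities for a calendar year."""
    return INTERCEPT + SLOPE * year

# Evaluate only within the 1982-2007 range covered by the data.
for year in (1982, 1995, 2007):
    print(f"{year}: {predicted_wzp(year):.2f}% of fatalities in work zones")
```

Years outside 1982 through 2007 would be extrapolations and are deliberately left out of the loop.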

How far might authorities let things go: 5%? 10%? 25%? Caution must be exercised when interpreting a trend beyond the limits of the available data. Technically the slope, or b-coefficient, is the trend of the relationship. The a-term from the regression, also called the intercept, is the value of WZP when the independent variable equals zero. The intercept for the trend in this example would technically indicate that the percentage of motor vehicle fatalities in work zones in the year zero would be -91.5%. This is absurd on many levels. There could be no motor vehicles in year zero, and what is a negative percentage of the total? The absurdity of the intercept in this example reveals that trends are limited concepts, limited to a relevant time frame.

Figure 12 also suggests that the trend, while valid for the 26 years in aggregate, does not work very well for the last 5 years, during which the percentages are consistently falling, not rising. Something seems to have changed around 2002; perhaps the highway officials, engineers, and planners took action to change the trend, in which case the trend reversal would be considered a policy success.

Finally, some underlying assumptions must be considered. For example, there is an implicit assumption that the types of roads with construction zones are similar from year to year. If this assumption is not correct (e.g., if a greater number of high speed roads, where fatalities may be more likely, are worked on in some years than in others), then interpreting the trend may not make much sense.

6. Conclusion and Discussion: The computation of this dependent variable (the percent of motor-vehicle fatalities occurring in work zones, or WZP) is influenced by changes in the number of work-zone fatalities and the number of non-work-zone fatalities. To some extent, both of these are random variables. Accordingly, it is difficult to distinguish a trend or trend reversal from a short series of possibly random movements in the same direction. Statistically, more observations permit greater confidence in non-randomness. It is also possible that a data series might be recorded that contains regular, non-random movements that are unrelated to a trend. Consider the dependent variable above (WZP), but measured using monthly data instead of annual data. Further, imagine looking at such data for a state in the upper Midwest instead of for the nation as a whole. In this new situation, the WZP might fall off or halt altogether each winter (when construction and maintenance work are minimized), only to rise again in the spring (reflecting renewed work-zone activity). This change is not a trend per se, nor is it random. Rather, it is cyclical.

7. Applications in Other Areas of Transportation Research: Applications of trend analysis models in other areas of transportation research include:
• Transportation Safety: to identify trends in traffic crashes (e.g., motor vehicle/deer) over time on some part of the roadway system (e.g., freeways).
• Public Transportation: to determine the trend in rail passenger trips over time (e.g., in response to increasing gas prices).
• Pavement Engineering: to monitor the number of miles of pavement that is below some service-life threshold over time.
• Environment: to monitor the hours of truck idling time in rest areas over time.

Example 15: Structures/Bridges; Trend Analysis
Area: Structures/bridges
Method of Analysis: Trend analysis (examining a trend over time)

1. Research Question/Problem Statement: A state agency wants to monitor trends in the condition of bridge superstructures in order to perform long-term needs assessment for bridge rehabilitation or replacement. Bridge condition rating data will be analyzed for bridge

superstructures that have been inspected over a period of 15 years. The objective of this study is to examine the overall pattern of change in the indicator variable over time.

Question/Issue
Use collected data to determine if the values that some variables have taken show an increasing trend or a decreasing trend over time. In this example, determine if levels of structural deficiency in bridge superstructures have been increasing or decreasing over time, and determine how rapidly the increase or decrease has occurred.

2. Identification and Description of Variables: Bridge inspection generally entails collection of numerous variables including location information, traffic data, structural elements (type and condition), and functional characteristics. Based on the severity of deterioration and the extent of spread through a bridge component, a condition rating is assigned on a discrete scale from 0 (failed) to 9 (excellent). Generally a condition rating of 4 or below indicates deficiency in a structural component. The state agency inspects approximately 300 bridges every year (denominator). The number of superstructures that receive a rating of 4 or below each year (number of events, numerator) also is recorded. The agency is concerned with the change in overall rate (calculated per 100) of structurally deficient bridge superstructures. This rate, which is simply the ratio of the numerator to the denominator, is the indicator (dependent variable) to be examined for trend over a time period of 15 years. Notice that the unit of analysis is the time period and not the individual bridge superstructures.

3. Data Collection: Data are collected for bridges scheduled for inspection each year. It is important to note that the bridge condition rating scale is based on subjective categories, and therefore there may be inherent variability among inspectors in their assignments of ratings to bridge superstructures. Also, it is assumed that during the time period for which the trend analysis is conducted, no major changes are introduced in the bridge inspection methods. Sample data provided in Table 20 show the rate (per 100), number of bridges per year that received a score of four or below, and total number of bridges inspected per year.

Table 20. Sample bridge inspection data.

| No. | Year | Rate (per 100) | Number of Events (Numerator) | Number of Bridges Inspected (Denominator) |
| 1   | 1990 | 8.33           | 25                           | 300                                       |
| 2   | 1991 | 8.70           | 26                           | 299                                       |
| 5   | 1994 | 10.54          | 31                           | 294                                       |
| 11  | 2000 | 13.55          | 42                           | 310                                       |
| 15  | 2004 | 14.61          | 45                           | 308                                       |

4. Specification of Analysis Technique and Data Analysis: The data set consists of 15 observations, one for each year. Figure 13 shows a scatter plot of the rate (dependent variable) versus time in years. The scatter plot does not indicate the presence of any outliers. The scatter plot shows a seemingly increasing linear trend in the rate of deficient superstructures over time. No need for data transformation or smoothing is apparent from the examination of the scatter plot in Figure 13. To determine whether the apparent linear trend is statistically significant in this data, ordinary least squares (OLS) regression can be employed.

Figure 13. Scatter plot of time versus rate (x-axis: time in years; y-axis: rate per 100).

The linear regression model takes the following form:

$$y_i = \beta_0 + \beta_1 x_i + e_i$$

where i = 1, 2, . . . , n (n = 15 in this example), y = dependent variable (rate of structurally deficient bridge superstructures), x = independent variable (time), $\beta_0$ = y-intercept (only provides reference point), $\beta_1$ = slope (change in y per unit change in x), and $e_i$ = residual error.

The first step is to estimate $\beta_0$ and $\beta_1$ in the regression function. The residual errors (e) are assumed to be independently and identically distributed (i.e., they are mutually independent and have the same probability distribution). The estimates $\hat{\beta}_1$ and $\hat{\beta}_0$ can be computed using the following equations:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 0.454$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 8.396$$

where $\bar{y}$ is the overall mean of the dependent variable and $\bar{x}$ is the overall mean of the independent variable. The prediction equation for rate of structurally deficient bridge superstructures over time can be written using the following equation:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 8.396 + 0.454x$$

That is, as time increases by a year, the rate of structurally deficient bridge superstructures increases by 0.454 per 100 bridges. The plot of the regression line is shown in Figure 14. Figure 14 indicates some small variability about the regression line. To conduct hypothesis testing for the regression relationship ($H_0$: $\beta_1 = 0$), assessment of this variability and the assumption of normality would be required. (For a discussion on assumptions for residual errors, see NCHRP Project 20-45, Volume 2, Chapter 4.)

Like analysis of variance (ANOVA, described in examples 8, 9, and 10), statistical inference is initiated by partitioning the total sum of squares (TSS) into the error sum of squares (SSE)

and the model sum of squares (SSR). That is, TSS = SSE + SSR. The TSS is defined as the sum of the squares of the difference of each observation from the overall mean. In other words, deviation of observation from overall mean (TSS) = deviation of observation from prediction (SSE) + deviation of prediction from overall mean (SSR). For our example,

$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 60.892$$

$$SSR = \hat{\beta}_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 57.790$$

$$SSE = TSS - SSR = 3.102$$

Regression analysis computations are usually summarized in a table (see Table 21). The mean squares (MSR, MSE) are computed by dividing the sums of squares by the corresponding model and error degrees of freedom. If the null hypothesis ($H_0$: $\beta_1 = 0$) is true, the expected value of MSR is equal to the expected value of MSE, so that F = MSR/MSE should be a random draw from an F-distribution with 1 and n - 2 degrees of freedom. From the regression shown in Table 21, F is computed to be 242.143, and the probability of getting a value larger than the F computed is extremely small. Therefore, the null hypothesis is rejected; that is, the slope is significantly different from zero, and the linearly increasing trend is found to be statistically significant. Notice that a slope of zero implies that knowing a value of the independent variable provides no insight on the value of the dependent variable.

5. Interpreting the Results: The linear regression model does not imply any cause-and-effect relationship between the independent and dependent variables. The y-intercept only provides a reference point, and the relationship need not be linear outside the data range.
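The summary statistics for this regression can be rebuilt from the reported sums of squares. This is a minimal Python sketch, not part of the original report. It assumes that time is coded 1 through 15 (as the scatter plot axis suggests), and it uses a tabulated two-sided 95% t-value of 2.160 for 13 degrees of freedom, which is supplied here and does not come from the source.

```python
import math

# Values reported for Example 15.
n = 15           # number of yearly observations
ssr = 57.790     # model (regression) sum of squares
sse = 3.102      # error sum of squares
slope = 0.454    # estimated slope

tss = ssr + sse          # total sum of squares
msr = ssr / 1            # model mean square (1 degree of freedom)
mse = sse / (n - 2)      # error mean square (n - 2 degrees of freedom)
f_stat = msr / mse
r_squared = ssr / tss    # share of total variation explained by the model

# Assumption: the independent variable is the time index 1, 2, ..., 15.
x = list(range(1, n + 1))
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Standard error of the slope and a 95% confidence interval.
se_slope = math.sqrt(mse / sxx)
T_CRIT = 2.160  # assumed tabulated t-value, two-sided 95%, 13 degrees of freedom
ci_low = slope - T_CRIT * se_slope
ci_high = slope + T_CRIT * se_slope

print(f"MSE = {mse:.3f}, F = {f_stat:.1f}, R^2 = {r_squared:.3f}")
print(f"95% CI for slope: [{ci_low:.3f}, {ci_high:.3f}]")
```

Because the inputs are rounded, the F-statistic from this sketch differs slightly from the published 242.143.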
Figure 14. Plot of regression line, y = 8.396 + 0.454x, $R^2$ = 0.949 (x-axis: time in years; y-axis: rate per 100).

Table 21. Analysis of regression table.

| Source     | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square  | F       | Significance |
| Regression | 57.790              | 1                       | 57.790 (MSR) | 242.143 | 8.769e-10    |
| Error      | 3.102               | 13                      | 0.239 (MSE)  |         |              |
| Total      | 60.892              | 14                      |              |         |              |

The 95% confidence interval for $\beta_1$ is computed as [0.391, 0.517]; that is, the analyst is 95% confident that the true mean increase in the rate of structurally deficient bridge superstructures is between

0.391% and 0.517% per year. (For a discussion on computing confidence intervals, see NCHRP Project 20-45, Volume 2, Chapter 4.) The coefficient of determination ($R^2$) provides an indication of the model fit. For this example, $R^2$ is calculated using the following equation:

$$R^2 = \frac{SSR}{TSS} = 1 - \frac{SSE}{TSS} = 0.949$$

The $R^2$ indicates that the regression model accounts for 94.9% of the total variation in the (hypothetical) data. It should be noted that such a high value of $R^2$ is almost impossible to attain from analysis of real observational data collected over a long time. Also, distributional assumptions must be checked before proceeding with linear regression, as serious violations may indicate the need for data transformation, use of non-linear regression or non-parametric methods, and so on.

6. Conclusion and Discussion: In this example, simple linear regression has been used to determine the trend in the rate of structurally deficient bridge superstructures in a geographic area. In addition to assessing the overall patterns of change, trend analysis may be performed to:
• study the levels of indicators of change (or dependent variables) in different time periods to evaluate the impact of technical advances or policy changes;
• compare different geographic areas or different populations with perhaps varying degrees of exposure in absolute and relative terms; and
• make projections to monitor progress toward an objective. However, given the dynamic nature of trend data, many of these applications require more sophisticated techniques than simple linear regression.

An important aspect of examining trends over time is the accuracy of numerator and denominator data. For example, bridge structures may be examined more than once during the analysis time period, and retrofit measures may be taken at some deficient bridges. Also, the age of structures is not accounted for in this analysis.
For the purpose of this example, it is assumed that these (and other similar) effects are negligible and do not confound the data. In real-life application, however, if the analysis time period is very long, it becomes extremely important to account for changes in factors that may have affected the dependent variable(s) and their measurement. An example of the latter could be changes in the volume of heavy trucks using the bridge, changes in maintenance policies, or changes in plowing and salting regimes. 7. Applications in Other Areas of Transportation Research: Trend analysis is carried out in many areas of transportation research, such as: • Transportation Planning/Traffic Operations—to determine the need for capital improve- ments by examining traffic growth over time. • Traffic Safety—to study the trends in overall, fatal, and/or injury crash rates over time in a geographic area. • Pavement Engineering—to assess the long-term performance of pavements under varying loads. • Environment—to monitor the emission levels from commercial traffic over time with growth of industrial areas. Example 16: Transportation Planning; Multiple Regression Analysis Area: Transportation planning Method of Analysis: Multiple regression analysis (testing proposed linear models with more than one independent variable when all variables are continuous)

1. Research Question/Problem Statement: Transportation planners and engineers often work on variations of the classic four-step transportation planning process for estimating travel demand. The first step, trip generation, generally involves developing a model that can be used to predict the number of trips originating or ending in a zone, which is a geographical subdivision of a corridor, city, or region (also referred to as a traffic analysis zone or TAZ). The objective is to develop a statistical relationship (a model) that can be used to explain the variation in a dependent variable based on the variation of one or more independent variables. In this example, ordinary least squares (OLS) regression is used to develop a model between trips generated (the dependent variable) and demographic, socio-economic, and employment variables (independent variables) at the household level.

Question/Issue
Can a linear relationship (model) be developed between a dependent variable and one or more independent variables? In this application, the dependent variable is the number of trips produced by households. Independent variables include persons, workers, and vehicles in a household, household income, and average age of persons in the household. The basic question is whether the relationship between the dependent (Y) and independent (X) variables can be represented by a linear model using two coefficients (a and b), expressed as follows:

$$Y = a + b \cdot X$$

where a = the intercept and b = the slope of the line. If the relationship being examined involves more than one independent variable, the equation will simply have more terms. In addition, in a more formal presentation, the equation will also include an error term, $\varepsilon$, added at the end.

2. Identification and Description of Variables: Data for four-step modeling of travel demand or for calibration of any specific model (e.g., trip generation or trip origins) come from a variety of sources, ranging from the U.S. Census to mail or telephone surveys. The data that are collected will depend, in part, on the specific purpose of the modeling effort. Data appropriate for a trip-generation model typically are collected from some sort of household survey. For the dependent variable in a trip-generation model, data must be collected on trip-making characteristics. These characteristics could include something as simple as the total trips made by a household in a day or involve more complicated breakdowns by trip purpose (e.g., work-related trips versus shopping trips) and time of day (e.g., trips made during peak and non-peak hours). The basic issue that must be addressed is to determine the purpose of the proposed model: What is to be estimated or predicted? Weekdays and work trips normally are associated with peak congestion and are often the focus of these models.

For the independent variable(s), the analyst must first give some thought to what would be the likely causes for household trips to vary. For example, it makes sense intuitively that household size might be pertinent (i.e., it seems reasonable that more persons in the household would lead to a higher number of household trips). Household members could be divided into workers and non-workers, two variables instead of one. Likewise, other socio-economic characteristics, such as income-related variables, might also make sense as candidate variables for the model. Data are collected on a range of candidate variables, and the analysis process is used to sort through these variables to determine which combination leads to the best model.

To be used in ordinary regression modeling, variables need to be continuous; that is, measured on a ratio or interval scale. Nominal data may be incorporated through the use of indicator (dummy) variables. (For more information on continuous variables, see NCHRP Project 20-45, Volume 2, Chapter 1; for more information on dummy variables, see NCHRP Project 20-45, Volume 2, Chapter 4).

3. Data Collection: As noted, data for modeling travel demand often come from surveys designed especially for the modeling effort. Data also may be available from centralized sources such as a state DOT or local metropolitan planning organization (MPO).

4. Specification of Analysis Techniques and Data Analysis: In this example, data for 178 households in a small city in the Midwest have been provided by the state DOT. The data are obtained from surveys of about 15,000 households all across the state. This example uses only a tiny portion of the data set (see Table 22). Based on the data, a fairly obvious relationship is initially hypothesized: more persons in a household (PERS) should produce more person-trips (TRIPS).

In its simplest form, the regression model has one dependent variable and one independent variable. The underlying assumption is that variation in the independent variable causes the variation in the dependent variable. For example, the dependent variable might be $TRIPS_i$ (the count of total trips made on a typical weekday), and the independent variable might be PERS (the total number of persons, or occupants, in the household). Expressing the relationship between TRIPS and PERS for the ith household in a sample of households results in the following hypothesized model:

$$TRIPS_i = a + b \cdot PERS_i + \varepsilon_i$$

where a and b are coefficients to be determined by ordinary least squares (OLS) regression analysis and $\varepsilon_i$ is the error term.

The difference between the value of TRIPS for any household predicted using the developed equation and the actual observed value of TRIPS for that same household is called the residual. The resulting model is an equation for the best fit straight line (for the given data) where a is the intercept and b is the slope of the line. (For more information about fitted regression and measures of fit see NCHRP Project 20-45, Volume 2, Chapter 4).

In Table 22, R is the multiple R, the correlation coefficient in the case of the simplest linear regression involving one variable (also called univariate regression). The $R^2$ (coefficient of determination) may be interpreted as the proportion of the variance of the dependent variable explained by the fitted regression model. The adjusted $R^2$ corrects for the number of independent variables in the equation. A “perfect” $R^2$ of 1.0 could be obtained if one included enough independent variables (e.g., one for each observation), but doing so would hardly be useful.

| Coefficients | t-values (statistics) | p-values | Measures of Fit |
|---|---|---|---|
| a = 3.347 | 4.626 | 0.000 | R = 0.510 |
| b = 2.001 | 7.515 | 0.000 | $R^2$ = 0.260 |
| | | | Adjusted $R^2$ = 0.255 |

Table 22. Regression model statistics.
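The calculation behind statistics such as those in Table 22 can be sketched in a few lines. The following is a minimal illustration only, assuming a small set of hypothetical household records (the `pers` and `trips` lists are invented for the sketch and are not the survey data); the function name `simple_ols` is likewise an assumption.

```python
import math

def simple_ols(x, y):
    """Fit y = a + b*x by ordinary least squares and return fit statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                       # slope
    a = mean_y - b * mean_x             # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)            # unexplained variation
    tss = sum((yi - mean_y) ** 2 for yi in y)       # total variation
    r2 = 1 - sse / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)       # one independent variable
    se_b = math.sqrt(sse / (n - 2) / sxx)           # standard error of the slope
    return {"a": a, "b": b, "r2": r2, "adj_r2": adj_r2,
            "t_b": b / se_b, "residuals": residuals}

# Hypothetical households: persons (PERS) and weekday trips (TRIPS).
pers = [1, 2, 2, 3, 4, 4, 5, 6]
trips = [4, 8, 6, 10, 11, 13, 12, 16]
fit = simple_ols(pers, trips)
print({k: v for k, v in fit.items() if k != "residuals"})
```

With an intercept in the model the residuals sum to zero, and the adjusted $R^2$ is never larger than the unadjusted value, which is the pattern seen in Table 22.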

Restating the now-calibrated model with the coefficients from Table 22,

$$TRIPS = 3.347 + 2.001 \cdot PERS$$

The statistical significance of each coefficient estimate is evaluated with the p-values of calculated t-statistics, provided the errors are normally distributed. The p-values (also known as probability values) generally indicate whether the coefficients are significantly different from zero (which they need to be in order for the model to be useful). More formally stated, a p-value is the probability of obtaining a t-statistic at least as extreme as the one calculated if the true coefficient were zero, so a small p-value implies a small risk of a Type I error in declaring the coefficient nonzero.

In this example, the t- and p-values shown in Table 22 indicate that both a and b are significantly different from zero at better than the 99.9% confidence level. P-values are generally offered as two-tail (two-sided hypothesis testing) test values in results from most computer packages; one-tail (one-sided) values may sometimes be obtained by dividing the printed p-values by two. (For more information about one-sided versus two-sided hypothesis testing, see NCHRP Project 20-45, Volume 2, Chapter 4.)

The $R^2$ may be tested with an F-statistic; in this example, the F was calculated as 56.469 (degrees of freedom = 1, 176) (See NCHRP Project 20-45, Volume 2, Chapter 4). This means that the model explains a significant amount of the variation in the dependent variable.

A plot of the estimated model (line) and the actual data is shown in Figure 15. A strict interpretation of this model suggests that a household with zero occupants (PERS = 0) will produce 3.347 trips per day. Clearly, this is not feasible because there can’t be a household of zero persons, which illustrates the kind of problem encountered when a model is extrapolated beyond the range of the data used for the calibration. In other words, a formal test of the intercept (the a) is not always meaningful or appropriate.
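As a brief illustration of how the calibrated univariate model might be applied, the sketch below uses the coefficients reported in Table 22. The function names are assumptions introduced for the example. It also shows the relationship, valid for a regression with a single independent variable, that the F-statistic is approximately the square of the slope's t-value.

```python
# Coefficients as reported in Table 22.
A_INTERCEPT = 3.347
B_SLOPE = 2.001

def predict_trips(pers):
    """Predicted weekday trips for a household with `pers` occupants."""
    return A_INTERCEPT + B_SLOPE * pers

def one_tail_p(two_tail_p):
    """One-sided p-value from a printed two-sided p-value."""
    return two_tail_p / 2

# Predictions inside the calibration range (PERS from 1 to 10 in Figure 15).
for pers in (1, 4, 10):
    print(pers, round(predict_trips(pers), 3))

# Extrapolating to PERS = 0 simply returns the intercept, which has no
# physical meaning for a household.
print(predict_trips(0))

# With one independent variable, F is the square of the slope's t-value;
# the small gap from the reported 56.469 reflects rounding of the t-value.
f_from_t = 7.515 ** 2
print(round(f_from_t, 3))
```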
Extension of the Model to Multivariate Regression: When the list of potential independent variables is considered, the researcher or analyst might determine that more than one cause for variation in the dependent variable may exist. In the current example, the question of whether there is more than one cause for variation in the number of trips can be considered.

Figure 15. Plot of the line for the estimated model (x-axis: PERS; y-axis: TRIPS).

The model just discussed for evaluating the effect of one independent variable is called a univariate model. Should the final model for this example be multivariate? Before determining the final model, the analyst may want to consider whether a variable or variables exist that further clarify what has already been modeled (e.g., more persons cause more trips). The variable PERS is a crude measure, made up of workers and non-workers. Most households have one or two workers. It can be shown that a measure of the non-workers in the household is more effective in explaining trips than is total persons; so a new variable, persons minus workers (DEP), is calculated.

Next, variables may exist that address entirely different causal relationships. It might be hypothesized that as the number of registered motor vehicles available in the household (VEH) increases, the number of trips will increase. It may also be argued that as household income (INC, measured in thousands of dollars) increases, the number of trips will increase. Finally, it may be argued that as the average age of household occupants (AVEAGE) increases, the number of trips will decrease because retired people generally make fewer trips. Each of these statements is based upon a logical argument (hypothesis). Given these arguments, the hypothesized multivariate model takes the following form:

$$TRIPS_i = a + b \cdot DEP_i + c \cdot VEH_i + d \cdot INC_i + e \cdot AVEAGE_i + \varepsilon_i$$

The results from fitting the multivariate model are given in Table 23. Results of the analysis of variance (ANOVA) for the overall model are shown in Table 24.

5. Interpreting the Results: It is common for regression packages to provide some values in scientific notation, as shown for the coefficient d and some of the p-values in Table 23. The coefficient d, showing the relationship of TRIPS with INC, is printed as 1.907E-05, which is read as $1.907 \times 10^{-5}$ or 0.00001907.
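Scientific notation is easy to misread by one decimal place, so a quick conversion can be a useful check. The short sketch below parses two of the values printed in Table 23; the variable names are chosen for illustration only.

```python
# Values printed in scientific notation in Table 23.
d = float("1.907E-05")     # coefficient on INC
p_a = float("3.57E-09")    # p-value for the intercept

# Fixed-point form: the exponent -05 moves the decimal point five places left.
print(f"{d:.8f}")
print(f"{p_a:.11f}")
```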
All coefficients are of the expected sign and significantly different from 0 (at the 0.05 level) except for d. However, testing the intercept makes little sense. (The intercept value would be the number of trips for a household with 0 vehicles, 0 income, 0 average age, and 0 dependents, a most unlikely household.) The overall model is significant as shown by the F-ratio and its p-value, meaning that the model explains a significant amount of the variation in the dependent variable. This model should reliably explain 33% of the variance of household trip generation. Caution should be exercised when interpreting the significance of the $R^2$ and the overall model because it is not uncommon to have a significant F-statistic when some of the coefficients in the equation are not significant. The analyst may want to consider recalibrating the model without the income variable because the coefficient d was insignificant.

| Coefficients | t-values (statistics) | p-values | Measures of Fit |
|---|---|---|---|
| a = 8.564 | 6.274 | 3.57E-09* | R = 0.589 |
| b = 0.899 | 2.832 | 0.005 | $R^2$ = 0.347 |
| c = 1.067 | 3.360 | 0.001 | Adjusted $R^2$ = 0.330 |
| d = 1.907E-05* | 1.927 | 0.056 | |
| e = -0.098 | -4.808 | 3.68E-06 | |

*See note about scientific notation in Section 5, Interpreting the Results.

Table 23. Results from fitting the multivariate model.

| ANOVA | Sum of Squares (SS) | Degrees of Freedom (df) | F-ratio | p-value |
|---|---|---|---|---|
| Regression | 1487.5 | 4 | 19.952 | 3.4E-13 |
| Residual | 2795.7 | 150 | | |

Table 24. ANOVA results for the overall model.

6. Conclusion and Discussion: Regression, particularly OLS regression, relies on several assumptions about the data, the nature of the relationships, and the results. Data are assumed to be interval or ratio scale. Independent variables generally are assumed to be measured without error, so all error is attributed to the model fit. Furthermore, independent variables should be independent of one another. This is a serious concern because the presence in the model of related independent variables, called multicollinearity, compromises the t-tests and confuses the interpretation of coefficients. Tests of this problem are available in most statistical software packages that include regression. Look for Variance Inflation Factor (VIF) and/or Tolerance tests; most packages will have one or the other, and some will have both.

In the example above where PERS is divided into DEP and workers, knowing any two variables allows the calculation of the third. Including all three variables in the model would be a case of extreme multicollinearity and, logically, would make no sense. In this instance, because one variable is a linear combination of the other two, the calculations required (within the analysis program) to calibrate the model would actually fail. If the independent variables are simply highly correlated, the regression coefficients (at a minimum) may not have intuitive meaning.
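The failure that occurs when one variable is an exact linear combination of others can be demonstrated directly. In the sketch below, which uses hypothetical household counts invented for the illustration, the matrix of cross-products that OLS must invert has a determinant of exactly zero when PERS, DEP, and workers are all included, so no unique set of coefficients exists. The helper names `gram` and `det3` are assumptions.

```python
def gram(columns):
    """Return the cross-product matrix X'X for a list of data columns."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
            for ci in columns]

def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Hypothetical households: non-workers (DEP) and workers.
dep = [0, 1, 2, 3, 1, 0]
workers = [1, 2, 2, 1, 1, 2]
pers = [x + w for x, w in zip(dep, workers)]   # PERS = DEP + workers

# All three related variables together: the matrix is singular.
det_all_three = det3(gram([pers, dep, workers]))

# An intercept plus DEP and workers only: the matrix can be inverted.
intercept = [1] * len(dep)
det_two_plus_intercept = det3(gram([intercept, dep, workers]))

print(det_all_three, det_two_plus_intercept)
```

Integer data are used so that the zero determinant is exact rather than a small floating-point value.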
In general, equations or models with highly correlated independent variables are to be avoided; alternative models that examine one variable or the other, but not both, should be analyzed. It is also important to analyze the error distributions. Several assumptions relate to the errors and their distributions (normality, constant variance, uncorrelated, etc.). In transportation planning, spatial variables and associations might become important; they require more elaborate constructs and often different estimation processes (e.g., Bayesian, Maximum Likelihood). (For more information about errors and error distributions, see NCHRP Project 20-45, Volume 2, Chapter 4.)

Other logical considerations also exist. For example, for the measurement units of the different variables, does the magnitude of the result of multiplying the coefficient and the measured variable make sense and/or have a reasonable effect on the predicted magnitude of the dependent variable? Perhaps more importantly, do the independent variables make sense? In this example, does it make sense that changes in the number of vehicles in the household would cause an increase or decrease in the number of trips? These are measures of operational significance that go beyond consideration of statistical significance, but are no less important.

7. Applications in Other Areas of Transportation Research: Regression is a very important technique across many areas of transportation research, including:
• Transportation Planning
  – to include the other half of trip generation, e.g., predicting trip destinations as a function of employment levels by various types (factory, commercial), square footage of shopping center space, and so forth.
  – to investigate the trip distribution stage of the 4-step model (log transformation of the gravity model).
• Public Transportation: to predict loss/liability on subsidized freight rail lines (function of segment ton-miles, maintenance budgets and/or standards, operating speeds, etc.) for self-insurance computations.
• Pavement Engineering: to model pavement deterioration (or performance) as a function of easily monitored predictor variables.
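The measures of fit reported for the multivariate model in Example 16 can be cross-checked from the ANOVA sums of squares alone. The following sketch uses only the values printed in Table 24; the variable names are chosen for illustration.

```python
# Sums of squares and degrees of freedom as printed in Table 24.
ss_reg, df_reg = 1487.5, 4
ss_res, df_res = 2795.7, 150

# F-ratio: mean square for regression divided by mean square for residuals.
f_ratio = (ss_reg / df_reg) / (ss_res / df_res)

# R-squared: share of total variation attributed to the regression.
r2 = ss_reg / (ss_reg + ss_res)

# Adjusted R-squared penalizes the number of independent variables.
n_obs = df_reg + df_res + 1          # observations implied by the table
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_res

print(round(f_ratio, 3), round(r2, 3), round(adj_r2, 3))
```

The three results agree with the F-ratio in Table 24 and with the $R^2$ and adjusted $R^2$ in Table 23 to the precision printed there.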

Example 17: Traffic Operations; Regression Analysis
Area: Traffic operations
Method of Analysis: Regression analysis (developing a model to predict the values that some variable can take as a function of one or more other variables, when not all variables are assumed to be continuous)

1. Research Question/Problem Statement: An engineer is concerned about false capacity at intersections being designed in a specified district. False capacity occurs where a lane is dropped just beyond a signalized intersection. Drivers approaching the intersection and knowing that the lane is going to be dropped shortly afterward avoid the lane. However, engineers estimating the capacity and level of service of the intersection during design have no reliable way to estimate the percentage of traffic that will avoid the lane (the lane distribution).

Question/Issue
Develop a model that can be used to predict the values that a dependent variable can take as a function of changes in the values of the independent variables. In this particular instance, how can engineers make a good estimate of the lane distribution of traffic volume in the case of a lane drop just beyond an intersection? Can a linear model be developed that can be used to predict this distribution based on other variables?

The basic question is whether a linear relationship exists between the dependent variable (Y; in this case, the lane distribution percentage) and some independent variable(s) (X). The relationship can be expressed using the following equation:

$$Y = a + b \cdot X$$

where a is the intercept and b is the slope of the line (see NCHRP Project 20-45, Volume 2, Chapter 4, Section B).

2. Identification and Description of Variables: The dependent variable of interest in this example is the volume of traffic in each lane on the approach to a signalized intersection with a lane drop just beyond.
The traffic volumes by lane are converted into lane utilization factors ($f_{LU}$), to be consistent with standard highway capacity techniques. The Highway Capacity Manual defines $f_{LU}$ using the following equation:

$$f_{LU} = \frac{v_g}{v_{g1} \cdot N}$$

where $v_g$ is the flow rate in a lane group in vehicles per hour, $v_{g1}$ is the flow rate in the lane with the highest flow rate of any in the group in vehicles per hour, and N is the number of lanes in the lane group. The engineer thinks that lane utilization might be explained by one or more of 15 different factors, including the type of lane drop, the distance from the intersection to the lane drop, the taper length, and the heavy vehicle percentage. All of the variables are continuous except the type of lane drop. The type of lane drop is used to categorize the sites.

3. Data Collection: The engineer locates 46 lane-drop sites in the area and collects data at these sites by means of video recording. The engineer records for up to 3 hours at each site. The data are summarized in 15-minute periods, again to be consistent with standard highway capacity practice. For one type of lane-drop geometry, with two through lanes and an exclusive right-turn lane on the approach to the signalized intersection, the engineer ends up with 88 valid data points (some sites have provided more than one data point), covering 15 minutes each, to use in equation (model) development.

4. Specification of Analysis Technique and Data Analysis: Multiple (or multivariate) regression is a standard statistical technique to develop predictive equations. (More information on this topic is given in NCHRP Project 20-45, Volume 2, Chapter 4, Section B). The engineer performs five steps to develop the predictive equation.

Step 1. The engineer examines plots of each of the 15 candidate variables versus $f_{LU}$ to see if there is a relationship and to see what forms the relationships might take.

Step 2. The engineer screens all 15 candidate variables for multicollinearity. (Multicollinearity occurs when two variables are related to each other and essentially contribute the same information to the prediction.) Multicollinearity can lead to models with poor predicting power and other problems. The engineer examines the variables for multicollinearity by
• looking at plots of each of the 15 candidate variables against every other candidate variable;
• calculating the correlation coefficient for each of the 15 candidate independent variables against every other candidate variable; and
• using more sophisticated tests (such as the variance inflation factor) that are available in statistical software.

Step 3. The engineer reduces the set of candidate variables to eight. Next, the engineer uses statistical software to select variables and estimate the coefficients for each selected variable, assuming that the regression equation has a linear form. To select variables, the engineer employs forward selection (adding variables one at a time until the equation fit ceases to improve significantly) and backward elimination (starting with all candidate variables in the equation and removing them one by one until the equation fit starts to deteriorate).
The equation fit is measured by $R^2$ (for more information, see NCHRP Project 20-45, Volume 2, Chapter 4, Section B, under the heading, “Descriptive Measures of Association Between X and Y”), which shows how well the equation fits the data on a scale from 0 to 1, and other factors provided by statistical software. In this case, forward selection and backward elimination result in an equation with five variables:
• Drop: Lane drop type, a 0 or 1 depending on the type;
• Left: Left turn status, a 0 or 1 depending on the types of left turns allowed;
• Length: The distance from the intersection to the lane drop, in feet ÷ 1000;
• Volume: The average lane volume, in vehicles per hour per lane ÷ 1000; and
• Sign: The number of signs warning of the lane drop.

Notice that the first two variables are discrete variables and had to assume a zero-or-one format to work within the regression model. Each of the five variables has a coefficient that is significantly different from zero at the 95% confidence level, as measured by a t-test. (For more information, see NCHRP Project 20-45, Volume 2, Chapter 4, Section B, “How Are t-statistics Interpreted?”)

Step 4. Once an initial model has been developed, the engineer plots the residuals for the tentative equation to see whether the assumed linear form is correct. A residual is the difference, for each observation, between the prediction the equation makes for $f_{LU}$ and the actual value of $f_{LU}$. In this example, a plot of the predicted value versus the residual for each of the 88 data points shows a fan-like shape, which indicates that the linear form is not appropriate. (NCHRP Project 20-45, Volume 2, Chapter 4, Section B, Figure 6 provides examples of residual plots that are and are not desirable.)
The engineer experiments with several other model forms, including non-linear equations that involve transformations of variables, before settling on a lognormal form that provides a good R2 value of 0.73 and a desirable shape for the residual plot.
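The correlation screening described in Step 2 can be sketched as follows. The data are hypothetical values invented for the illustration, and the function names are assumptions. Note one simplification: for a single pair of predictors the variance inflation factor reduces to $1/(1 - r^2)$, whereas statistical packages compute it for each variable by regressing that variable on all of the other candidates.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_vif(r):
    """VIF for a two-predictor model, given their correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Hypothetical site measurements (not the engineer's data).
candidates = {
    "distance_ft": [200, 350, 500, 650, 800, 950],
    "taper_ft": [100, 180, 240, 330, 410, 470],
    "heavy_pct": [4, 9, 3, 8, 2, 7],
}

# Correlate every candidate against every other candidate.
names = list(candidates)
screen = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson_r(candidates[first], candidates[second])
        screen[(first, second)] = (r, pairwise_vif(r))
        print(first, second, round(r, 3), round(pairwise_vif(r), 1))
```

In this invented data set the distance and taper length move together almost perfectly, so only one of the two would be carried forward as a candidate; a VIF above roughly 10 is a commonly used warning level.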

Step 5. Finally, the engineer examines the candidate equation for logic and practicality, asking whether the variables make sense, whether the signs of the variables make sense, and whether the variables can be collected easily by design engineers. Satisfied that the answers to these questions are “yes,” the engineer expresses the final equation (model) as follows:

$$f_{LU} = \exp(-0.539 - 0.218 \cdot Drop + 0.148 \cdot Left + 0.178 \cdot Length + 0.627 \cdot Volume - 0.105 \cdot Sign)$$

5. Interpreting the Results: The process described in this example results in a useful equation for estimating the lane utilization in a lane to be dropped, thereby avoiding the estimation of false capacity. The equation has five terms and is non-linear, which will make its use a bit challenging. However, the database is large, the equation fits the data well, and the equation is logical, which should boost the confidence of potential users. If potential users apply the equation within the ranges of the data used for the calibration, the equation should provide good predictions. Applying any model outside the range of the data on which it was calibrated increases the likelihood of an inaccurate prediction.

6. Conclusion and Discussion: Regression is a powerful statistical technique that provides models engineers can use to make predictions in the absence of direct observation. Engineers tempted to use regression techniques should notice from this and other examples that the effort is substantial. Engineers using regression techniques should not skip any of the steps described above, as doing so may result in equations that provide poor predictions to users. Analysts considering developing a regression model to help make needed predictions should not be intimidated by the process.
Although there are many pitfalls in developing a regression model, analysts considering making the effort should also consider the alternative: how the prediction will be made in the absence of a model. In the absence of a model, predictions of important factors like lane utilization would be made using tradition, opinion, or simple heuristics. With guidance from NCHRP Project 20-45 and other texts, and with good software available to make the calculations, credible regression models often can be developed that perform better than the traditional prediction methods.

Because regression models developed by transportation engineers are often reused in later studies by others, the stakes are high. The consequences of a model that makes poor predictions can be severe in terms of suboptimal decisions. Lane utilization models often are employed in traffic studies conducted to analyze new development proposals. A model that under-predicts utilization in a lane to be dropped may mean that the development is turned down due to the anticipated traffic impacts or that the developer has to pay for additional and unnecessary traffic mitigation measures. On the other hand, a model that over-predicts utilization in a lane to be dropped may mean that the development is approved with insufficient traffic mitigation measures in place, resulting in traffic delays, collisions, and the need for later intervention by a public agency.

7. Applications in Other Areas of Transportation Research: Regression is used in almost all areas of transportation research, including:
• Transportation Planning: to create equations to predict trip generation and mode split.
• Traffic Safety: to create equations to predict the number of collisions expected on a particular section of road.
• Pavement Engineering/Materials: to predict long-term wear and condition of pavements.
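For readers who want to apply the results of Example 17, the sketch below evaluates both the Highway Capacity Manual definition of the lane utilization factor and the final fitted equation from Step 5. The function and parameter names are assumptions, the coefficient signs are taken in the order they appear in the final equation, and the site values in the usage lines are hypothetical.

```python
import math

def hcm_lane_utilization(lane_flows):
    """f_LU = v_g / (v_g1 * N) for a list of per-lane flow rates (veh/h)."""
    v_g = sum(lane_flows)        # flow rate for the whole lane group
    v_g1 = max(lane_flows)       # flow rate in the busiest lane
    return v_g / (v_g1 * len(lane_flows))

def predicted_f_lu(drop, left, length_kft, volume_kvphpl, signs):
    """Fitted model from Step 5.

    drop, left      -- 0 or 1 indicator variables
    length_kft      -- distance to the lane drop, in feet divided by 1000
    volume_kvphpl   -- average lane volume, in veh/h/lane divided by 1000
    signs           -- number of signs warning of the lane drop
    """
    exponent = (-0.539 - 0.218 * drop + 0.148 * left
                + 0.178 * length_kft + 0.627 * volume_kvphpl
                - 0.105 * signs)
    return math.exp(exponent)

# Observed utilization for a hypothetical two-lane group.
print(round(hcm_lane_utilization([600, 400]), 3))

# Predicted utilization for a hypothetical site: 500 ft to the lane drop,
# 400 veh/h/lane, two warning signs.
print(round(predicted_f_lu(drop=1, left=0, length_kft=0.5,
                           volume_kvphpl=0.4, signs=2), 3))
```

Because Length and Volume are scaled by 1000 in the fitted equation, passing raw feet or vehicles per hour would give badly inflated predictions.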
Example 18: Transportation Planning; Logit and Related Analysis Area: Transportation planning Method of Analysis: Logit and related analysis (developing predictive models when the dependent variable is dichotomous—e.g., 0 or 1)

1. Research Question/Problem Statement: Transportation planners often utilize variations of the classic four-step transportation planning process for predicting travel demand. Trip generation, trip distribution, mode split, and trip assignment are used to predict traffic flows under a variety of forecasted changes in networks, population, land use, and controls. Mode split, deciding which mode of transportation a traveler will take, requires predicting mutually exclusive outcomes. For example, will a traveler utilize public transit or drive his or her own car?

Question/Issue
Can a linear model be developed that can be used to predict the probability that one of two choices will be made? In this example, the question is whether a household will use public transit (or not). Rather than being continuous (as in linear regression), the dependent variable is reduced to two categories, a dichotomous variable (e.g., yes or no, 0 or 1). Although the question is simple, the statistical modeling becomes sophisticated very quickly.

2. Identification and Description of Variables: Considering a typical, traditional urban area in the United States, it is reasonable to argue that the likelihood of taking public transit to work (Y) will be a function of income (X). Generally, more income means less likelihood of taking public transit. This can be modeled using the following equation:

$$Y_i = \beta_1 + \beta_2 X_i + u_i$$

where $X_i$ = family income, Y = 1 if the family uses public transit, and Y = 0 if the family doesn’t use public transit.

3. Data Collection: These data normally are obtained from travel surveys conducted at the local level (e.g., by a metropolitan area or specific city), although the agency that collects the data often is a state DOT.

4. Specification of Analysis Techniques and Data Analysis: In this example the dependent variable is dichotomous and is modeled as a linear function of an explanatory variable. Consider the equation $E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i$. Notice that if $P_i$ = probability that Y = 1 (household utilizes transit), then $(1 - P_i)$ = probability that Y = 0 (doesn’t utilize transit). This has been called a linear probability model. Note that within this expression, “i” refers to a household. Thus, Y has the distribution shown in Table 25. Any attempt to estimate this relationship with standard (OLS) regression is saddled with many problems (e.g., non-normality of errors, heteroscedasticity, and the possibility that the predicted Y will be outside the range 0 to 1, to say nothing of pretty terrible $R^2$ values).

| Values that Y Takes | Probability | Meaning/Interpretation |
|---|---|---|
| 1 | $P_i$ | Household uses transit |
| 0 | $1 - P_i$ | Household does not use transit |
| Total | 1.0 | |

Table 25. Distribution of Y.

An alternative formulation for estimating $P_i$, the cumulative logistic distribution, is expressed by the following equation:

$$P_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 X_i)}}$$

This function can be plotted as a lazy Z-curve: on the left, with low values of X (low household income), the probability starts near 1, and on the right, with high values of X, it ends near 0 (Figure 16). Notice that, even at 0 income, not all households use transit. The curve is said to be asymptotic to 1 and 0. The value of $P_i$ varies between 1 and 0 in relation to income, X.

Figure 16. Plot of cumulative logistic distribution showing a lazy Z-curve.

Manipulating the definition of the cumulative logistic distribution from above,

$$\left(1 + e^{-(\beta_1 + \beta_2 X_i)}\right) P_i = 1$$

$$P_i + P_i \, e^{-(\beta_1 + \beta_2 X_i)} = 1$$

$$P_i \, e^{-(\beta_1 + \beta_2 X_i)} = 1 - P_i$$

$$e^{-(\beta_1 + \beta_2 X_i)} = \frac{1 - P_i}{P_i}$$

and

$$e^{(\beta_1 + \beta_2 X_i)} = \frac{P_i}{1 - P_i}$$

The final expression is the ratio of the probability of utilizing public transit divided by the probability of not utilizing public transit. It is called the odds ratio here (strictly speaking, it is the odds of utilizing transit). Next, taking the natural log of both sides (and reversing) results in the following equation:

$$L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_1 + \beta_2 X_i$$

L is called the logit, and this is called a logit model. The left side is the natural log of the odds ratio. Unfortunately, this odds ratio is meaningless for individual households where the probability is either 0 or 1 (utilize or not utilize). If the analyst uses standard OLS regression on this equation, with data for individual households, there is a problem because when $P_i$ happens to equal either 0 or 1 (which is all the time!), the odds ratio will, as a result, equal either 0 or infinity (and the logarithm will be undefined) for all observations. However, by using groups of households the problem can be mitigated.

Table 26 presents data based on a survey of 701 households, more than half of which use transit (380). The income data are recorded for intervals; here, interval mid-points ($X_j$) are shown. The number of households in each income category is tallied ($N_j$), as is the number of households in each income category that utilizes public transit ($n_j$). It is important to note that while there are more than 700 households (i), the number of observations (categories, j) is only 13. Using these data, for each income bracket, the probability of taking transit can be estimated as follows:

$$\hat{P}_j = \frac{n_j}{N_j}$$

This equation is an expression of relative frequency (i.e., it expresses the proportion in income bracket “j” using transit).

An examination of Table 26 shows clearly that there is a progression of these relative frequencies, with higher income brackets showing lower relative frequencies, just as was hypothesized. We can calculate the odds ratio for each income bracket listed in Table 26 and estimate the following logit function with OLS regression:

$$L_j = \ln\left(\frac{n_j / N_j}{1 - n_j / N_j}\right) = \beta_1 + \beta_2 X_j$$

The results of this regression are shown in Table 27. The results also can be expressed as an equation:

$$\text{LogOddsRatio} = 1.037 - 0.00003863 \cdot X$$

5. Interpreting the Results: This model provides a very good fit. The estimates of the coefficients can be inserted in the original cumulative logistic function to directly estimate the probability of using transit for any given X (income level). Indeed, the logistic graph in Figure 16 is produced with the estimated function.
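Inserting the estimated coefficients into the cumulative logistic function takes only a few lines. The sketch below uses the intercept of 1.037 and the slope of -0.00003863 from the estimated equation; the function names are assumptions, and the income levels are simply sample points for illustration.

```python
import math

# Estimated coefficients of the logit model (income X in dollars).
B1 = 1.037
B2 = -0.00003863

def transit_probability(income):
    """Estimated probability that a household at this income uses transit."""
    return 1.0 / (1.0 + math.exp(-(B1 + B2 * income)))

def logit(p):
    """Natural log of the odds p / (1 - p); the inverse of the function above."""
    return math.log(p / (1.0 - p))

# Trace the lazy Z-curve at a few income levels.
for income in (0, 6000, 20000, 40000, 75000, 150000):
    print(income, round(transit_probability(income), 3))
```

Applying `logit` to the output of `transit_probability` returns the linear expression `B1 + B2 * income`, which is the algebra of the derivation run in reverse.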
Xj ($)     Nj (Households)   nj (Utilizing Transit)   Pj (Defined Above)
$6,000     40                30                       0.750
$8,000     55                39                       0.709
$10,000    65                43                       0.662
$13,000    88                58                       0.659
$15,000    118               69                       0.585
$20,000    81                44                       0.543
$25,000    70                33                       0.471
$30,000    62                25                       0.403
$35,000    40                16                       0.400
$40,000    30                11                       0.367
$50,000    22                6                        0.273
$60,000    18                4                        0.222
$75,000    12                2                        0.167
Total:     701               380

Table 26. Data examined by groups of households.
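The grouped estimation described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's own procedure: it takes the bracket data from Table 26, forms the log-odds for each bracket, fits a straight line by closed-form OLS, and then inserts the coefficients back into the cumulative logistic function. The function names (`grouped_logits`, `ols_line`, `transit_probability`) are made up for this sketch.

```python
import math

# Table 26: income bracket mid-point X_j, households N_j, transit users n_j
income = [6000, 8000, 10000, 13000, 15000, 20000, 25000,
          30000, 35000, 40000, 50000, 60000, 75000]
households = [40, 55, 65, 88, 118, 81, 70, 62, 40, 30, 22, 18, 12]
transit = [30, 39, 43, 58, 69, 44, 33, 25, 16, 11, 6, 4, 2]


def grouped_logits(n_transit, n_total):
    """Log-odds ln(P / (1 - P)) per bracket, with P = n_j / N_j."""
    return [math.log((n / N) / (1 - n / N)) for n, N in zip(n_transit, n_total)]


def ols_line(x, y):
    """Closed-form simple OLS fit y = b1 + b2 * x; returns (b1, b2)."""
    k = len(x)
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b2 = sxy / sxx
    b1 = mean_y - b2 * mean_x
    return b1, b2


def transit_probability(x, b1, b2):
    """Cumulative logistic function P = 1 / (1 + exp(-(b1 + b2 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b1 + b2 * x)))


logits = grouped_logits(transit, households)
beta1, beta2 = ols_line(income, logits)
print(f"beta1 = {beta1:.3f}, beta2 = {beta2:.8f}")
print(f"P(transit | income = $20,000) = {transit_probability(20000, beta1, beta2):.3f}")
```

The fitted intercept and slope should land close to the values the example reports for its regression. The last line shows how the logit is inverted to give a probability for any income level.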

6. Conclusion and Discussion: This approach to estimation is not without further problems. For example, the N within each income bracket needs to be sufficiently large that the relative frequency (and therefore the resulting odds ratio) is accurately estimated. Many statisticians would say that a minimum of 25 is reasonable.

This approach also is limited by the fact that only one independent variable is used (income). Common sense suggests that the right-hand side of the function could logically be expanded to include more than one predictor variable (more Xs). For example, it could be argued that educational level might act, along with income, to account for the probability of using transit. However, combining predictor variables severely impinges on the categories (the j) used in this OLS regression formulation. To illustrate, assume that five educational categories are used in addition to the 13 income brackets (e.g., Grade 8 or less, Grade 9 to high school graduate, some college, BA or BS degree, and graduate degree). For such an OLS regression analysis to work, data would be needed for 5 × 13, or 65 categories.

Ideally, other travel modes should also be considered. In the example developed here, only transit and not-transit are considered. In some locations it is entirely reasonable to examine private auto versus bus versus bicycle versus subway versus light rail (involving five modes, not just two). This notion of a polychotomous logistic regression is possible. However, five modes cannot be estimated with the OLS regression technique employed above. The logit above is a variant of the binomial distribution and the polychotomous logistic model is a variant of the multinomial distribution (see NCHRP Project 20-45, Volume 2, Chapter 5). Estimation of these more advanced models requires maximum likelihood methods (as described in NCHRP Project 20-45, Volume 2, Chapter 5).
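The two sample-size concerns raised above can be checked directly against the Table 26 counts. The sketch below is only an illustration: it flags income brackets that fall under the 25-household rule of thumb, and it computes the average number of households per cell if the 701 households were spread over the 65 income-by-education cells. The even spread is an assumption made here for illustration; the report gives no education data.

```python
MIN_GROUP_SIZE = 25          # rule-of-thumb minimum N per group quoted in the text
EDUCATION_CATEGORIES = 5     # number of education levels in the text's illustration

# Table 26: income bracket mid-points and household counts
income = [6000, 8000, 10000, 13000, 15000, 20000, 25000,
          30000, 35000, 40000, 50000, 60000, 75000]
households = [40, 55, 65, 88, 118, 81, 70, 62, 40, 30, 22, 18, 12]

# Brackets whose relative frequency rests on fewer than 25 households
small_brackets = [(x, n) for x, n in zip(income, households) if n < MIN_GROUP_SIZE]

# Cells needed once education is crossed with income, and the average
# households per cell under the (hypothetical) even spread
cells = EDUCATION_CATEGORIES * len(income)
avg_per_cell = sum(households) / cells

print("Brackets below the minimum:", small_brackets)
print(f"{cells} cells, about {avg_per_cell:.1f} households per cell on average")
```

Even with one predictor, the three highest income brackets fall short of the suggested minimum. Crossing income with education would leave the average cell well below it, which is the practical reason the text points toward maximum likelihood methods for richer models.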
Other model variants are based upon other cumulative probability distributions. For example, there is the probit model, in which the normal cumulative distribution function is used. The probit model is very similar to the logit model, but it is more difficult to estimate.

Coefficients          t-values (statistics)   p-values   Measures of “Fit”
β1 = 1.037            12.156                  0.000      R = 0.980
β2 = -0.00003863      -16.407                 0.000      R2 = 0.961
                                                         adjusted R2 = 0.957

Table 27. Results of OLS regression.

7. Applications in Other Areas of Transportation Research: Applications of logit and related models abound within transportation studies. In any situation in which human behavior is reduced to discrete choices, this category of models may be applied. Examples in other areas of transportation research include:
• Transportation Planning: to model any “choice” issue, such as shopping destination choices.
• Traffic Safety: to model dichotomous responses (e.g., did a motorist slow down or not) in response to traffic control devices.
• Highway Design: to model public reactions to proposed design solutions (e.g., support or not support proposed road diets, installation of roundabouts, or use of traffic calming techniques).

Example 19: Public Transit; Survey Design and Analysis
Area: Public transit
Method of Analysis: Survey design and analysis (organizing survey data for statistical analysis)

1. Research Question/Problem Statement: The transit director is considering changes to the fare structure and the service characteristics of the transit system. To assist in determining which changes would be most effective or efficient, a survey of the current transit riders is developed.

Question/Issue
Use and analysis of data collected in a survey. Results from a survey of transit users are used to estimate the change in ridership that would result from a change in the service or fare.

2. Identification and Description of Variables: Two types of variables are needed for this analysis. The first is data on the characteristics of the riders, such as gender, age, and access to an automobile. These data are discrete variables. The second is data on the riders’ stated responses to proposed changes in the fare or service characteristics. These data also are treated as discrete variables. Although some, like the fare, could theoretically be continuous, they are normally expressed in discrete increments (e.g., $1.00, $1.25, $1.50).

3. Data Collection: These data are normally collected by agencies conducting a survey of the transit users. The initial step in the experiment design is to choose the variables to be collected for each of these two data sets. The second step is to determine how to categorize the data. Both steps are generally based on past experience and common sense.

Some of the variables used to describe the characteristics of the transit user are dichotomous, such as gender (male or female) and access to an automobile (yes or no). Other variables, such as age, are grouped into discrete categories within which the transit riding characteristics are similar. For example, one would not expect there to be a difference between the transit trip needs of a 14-year-old student and a 15-year-old student. Thus, the survey responses of these two ages would be assigned to the same age category. However, experience (and common sense) leads one to differentiate a 19-year-old transit user from a 65-year-old transit user, because their purposes for taking trips and their perspectives on the relative value of the fare and the service components are both likely to be different.

Obtaining user responses to changes in the fare or service is generally done in one of two ways. The first is to make a statement and ask the responder to mark one of several choices: strongly agree, agree, neither agree nor disagree, disagree, and strongly disagree. The number of statements used in the survey depends on how many parameter changes are being contemplated. Typical statements include:
  1. I would increase the number of trips I make each month if the fare were reduced by $0.xx.
  2. I would increase the number of trips I make each month if I could purchase a monthly pass.
  3. I would increase the number of trips I make each month if the waiting time at the stop were reduced by 10 minutes.
  4. I would increase the number of trips I make each month if express services were available from my origin to my destination.

The second format is to propose a change and provide multiple choices for the responder. Typical questions for this format are:
  1. If the fare were increased by $0.xx per trip I would:
     a) not change the number of trips per month
     b) reduce the non-commute trips
     c) reduce both the commute and non-commute trips
     d) switch modes
  2. If express service were offered for an additional $0.xx per trip I would:
     a) not change the number of trips per month on this local service
     b) make additional trips each month
     c) shift from the local service to the express service

These surveys generally are administered by handing a survey form to people as they enter the transit vehicle and collecting them as people depart the transit vehicle. The surveys also can be administered by mail, telephone, or in a face-to-face interview.

In constructing the questions, care should be taken to use terms with which the respondents will be familiar. For example, if the system does not currently offer “express” service, this term will need to be defined in the survey. Other technical terms should be avoided. Similarly, the word “mode” is often used by transportation professionals but is not commonly used by the public at large.

The length of a survey is almost always an issue as well. To avoid asking too many questions, each question needs to be reviewed to see if it is really necessary and will produce useful data (as opposed to just being something that would be nice to know).

4. Specification of Analysis Technique and Data Analysis: The results of these surveys often are displayed in tables or in frequency distribution diagrams (see also Example 1 and Example 2). Table 28 lists responses to a sample question posed in the form of a statement. Figure 17 shows the frequency diagram for these data.

                   Strongly Agree   Agree   Neither Agree nor Disagree   Disagree   Strongly Disagree
Total responses    450              600     300                          400        100

Table 28. Table of responses to sample statement, “I would increase the number of trips I make each month if the fare were reduced by $0.xx.”

Figure 17. Frequency diagram for total responses to sample statement.

Similar presentations can be made for any of the groupings included in the first type of variables discussed above. For example, if gender is included as a Type 1 question, the results might appear as shown in Table 29, with the corresponding frequency diagram in Figure 18.

                   Strongly Agree   Agree   Neither Agree nor Disagree   Disagree   Strongly Disagree
Male               200              275     200                          200        70
Female             250              325     100                          200        30
Total responses    450              600     300                          400        100

Table 29. Contingency table showing responses by gender to sample statement, “I would increase the number of trips I make each month if the fare were reduced by $0.xx.”

Figure 18. Frequency diagram showing responses by gender to sample statement.

Presentations of the data can be made for any combination of the discrete variable groups included in the survey. For example, to display responses of female users over 65 years old, all of the survey forms on which these two characteristics (female and over 65 years old) are checked could be extracted and recorded in a table and shown in a frequency diagram.

5. Interpreting the Results: Survey data can be used to compare the responses to fare or service changes of different groups of transit users. This flexibility can be important in determining which changes would impact various segments of transit users. The information can be used to evaluate various fare and service options being considered and allows the transit agency to design promotions to obtain the greatest increase in ridership. For example, by creating frequency diagrams to display the responses to statements 2, 3, and 4 listed in Section 3, the engineer can compare the impact of changing the fare versus changing the headway or providing express services in the corridor.

Organizing response data according to different characteristics of the user produces contingency tables like the one illustrated for males and females. This table format can be used to conduct chi-square analysis to determine if there is any statistically significant difference among the various groups. (Chi-square analysis is described in more detail in Example 4.)

6. Conclusions and Discussion: This example illustrates how to obtain and present quantitative information using surveys. Although survey results provide reasonably good estimates of the relative importance users place on different transit attributes (fare, waiting time, hours of service, etc.) when determining how often they would use the system, the magnitude of users’ responses often is overstated. Experience shows that what users say they would do (their stated preference) generally differs from what they actually do (their revealed preference).
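The chi-square analysis mentioned in Section 5 can be sketched for the gender contingency table (Table 29). This is a minimal pure-Python illustration of the standard chi-square test of independence, not a procedure prescribed by this example. The function name `chi_square_independence` is made up here, and the critical value 9.488 is the usual tabulated chi-square value for 4 degrees of freedom at the 0.05 level.

```python
# Table 29: responses by gender, ordered from "strongly agree" to "strongly disagree"
observed = {
    "Male":   [200, 275, 200, 200, 70],
    "Female": [250, 325, 100, 200, 30],
}

CRITICAL_VALUE_05_DF4 = 9.488  # tabulated chi-square, alpha = 0.05, 4 degrees of freedom


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a rows-by-columns table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            # Expected count if response were independent of gender
            expected = row_total * col_total / grand_total
            statistic += (obs - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


chi2, dof = chi_square_independence(observed)
print(f"chi-square = {chi2:.1f} with {dof} degrees of freedom")
print("Differs by gender at the 5% level:", chi2 > CRITICAL_VALUE_05_DF4)
```

For these counts the statistic is far above the critical value, so the response pattern of male and female riders would be judged different. Most of the difference comes from the "neither agree nor disagree" column.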

In this example, 1,050 of the 1,850 respondents (57%) have responded that they would use the bus service more frequently if the fare were decreased by $0.xx. Five hundred respondents (27%) have indicated that they would not use the bus service more frequently, and 300 respondents (16%) have indicated that they are not sure if they would change their bus use frequency. These percentages show the stated preferences of the users. The engineer does not yet know the revealed preferences of the users, but experience suggests that it is unlikely that 57% of the riders would actually increase the number of trips they make.

7. Applications in Other Areas of Transportation Research: Survey design and analysis techniques can be used to collect and present data in many areas of transportation research, including:
• Transportation Planning: to assess public response to a proposal to enact a local motor fuel tax to improve road maintenance in a city or county.
• Traffic Operations: to assess public response to implementing road diets (e.g., 4-lane to 3-lane conversions) on different corridors in a city.
• Highway Design: to assess public response to proposed alternative cross-section designs, such as a boulevard design versus an undivided multilane design in a corridor.

Example 20: Traffic Operations; Simulation
Area: Traffic operations
Method of Analysis: Simulation (using field data to simulate, or model, operations or outcomes)

1. Research Question/Problem Statement: A team of engineers wants to determine whether one or more unconventional intersection designs will produce lower travel times than a conventional design at typical intersections for a given number of lanes. There is no way to collect field data to compare alternative intersection designs at a particular site.
Macroscopic traffic operations models like those in the Highway Capacity Manual do a good job of estimating delay at specific points but are unable to provide travel time estimates for unconventional designs that consist of several smaller intersections and road segments. Microscopic simulation models measure the behaviors of individual vehicles as they traverse the highway network. Such simulation models are therefore very flexible in the types of networks and measures that can be examined. The team in this example turns to a simulation model to determine how other intersection designs might work.

Question/Issue
Developing and using a computer simulation model to examine operations in a computer environment. In this example, a traffic operations simulation model is used to show whether one or more unconventional intersection designs will produce lower travel times than a conventional design at typical intersections for a given number of lanes.

2. Identification and Description of Variables: The engineering team simulates seven different intersections to provide the needed scope for their findings. At each intersection, the team examines three different sets of traffic volumes: volumes from the evening (p.m.) peak hour, a typical midday off-peak hour, and a volume that is 15% greater than the p.m. peak hour to represent future conditions. At each intersection, the team models the current conventional intersection geometry and seven unconventional designs: the quadrant roadway, median U-turn, superstreet, bowtie, jughandle, split intersection, and continuous flow intersection.

Traffic simulation models break the roadway network into nodes (intersections) and links (segments between intersections). Therefore, the engineering team has to design each of the alternatives at each test site in terms of numbers of lanes, lane lengths, and such, and then faithfully translate that geometry into links and nodes that the simulation model can use. For each combination of traffic volume and intersection design, the team uses software to find the optimum signal timing and uses that during the simulation. To avoid bias, the team keeps all other factors (e.g., network size, numbers of lanes, turn lane lengths, truck percentages, average vehicle speeds) constant in all simulation runs.

3. Data Collection: The field data collection necessary in this effort consists of noting the current intersection geometries at the seven test intersections and counting the turning movements in the time periods described above. In many simulation efforts, it is also necessary to collect field data to calibrate and validate the simulation model. Calibration is the process by which simulation output is compared to actual measurements for some key measure(s) such as travel time. If a difference is found between the simulation output and the actual measurement, the simulation inputs are changed until the difference disappears. Validation is a test of the calibrated simulation model, comparing simulation output to a previously unused sample of actual field measurements. In this example, however, the team determines that it is unnecessary to collect calibration and validation data because a recent project has successfully calibrated and validated very similar models of most of these same unconventional designs.

The engineering team uses the CORSIM traffic operations simulation model. Well known and widely used, CORSIM models the movement of each vehicle through a specified network in small time increments.
CORSIM is a good choice for this example because it was originally designed for problems of this type, has produced appropriate results, has excellent animation and other debugging features, runs quickly in these kinds of cases, and is well-supported by the software developers.

The team makes two CORSIM runs with different random number seeds for each combination of volume and design at each intersection, or 48 runs for each intersection altogether. It is necessary to make more than one run (or replication) of each simulation combination with different random number seeds because of the randomness built into simulation models. The experiment design in this case allows the team to reduce the number of replications to two; typical practice in simulations when one is making simple comparisons between two variables is to make at least 5 to 10 replications. Each run lasts 30 simulated minutes.

Table 30 shows the simulation data for one of the seven intersections. The lowest travel time produced in each case is marked with an asterisk. Notice that Table 30 does not show data for the bowtie design. That design became congested (gridlocked) and produced essentially infinite travel times for this intersection. Handling overly congested networks is a difficult problem in many efforts and with several different simulation software packages. The best current advice is for analysts to not push their networks too hard and to scan often for gridlock.

               Total Travel Time, Vehicle-hours, Average of Two Simulation Runs
Time of Day    Conventional   Quadrant   Median U   Superstreet   Jughandle   Split   Continuous
Midday         67             64         61         74            63          59*     75
P.M. peak      121            95*        119        179           139         114     106
Peak + 15%     170            135*       145        245           164         180     142
*Lowest total travel time.

Table 30. Simulation results for different designs and time of day.

4. Specification of Analysis Technique and Data Analysis: The experiment assembled in this example uses a factorial design. (Factorial design also is discussed in Example 11.) The team analyzes the data from this factorial experiment using analysis of variance (ANOVA). Because the experimenter has complete control in a simulation, it is common to use efficient designs like factorials and efficient analysis methods like ANOVA to squeeze all possible information out of the effort.

Statistical tests comparing the individual mean values of key results by factor are common ways to follow up on ANOVA results. Although ANOVA will reveal which factors make a significant contribution to the overall variance in the dependent variable, means tests will show which levels of a significant factor differ from the other levels. In this example, the team uses Tukey’s means test, which is available as part of the battery of standard tests accompanying ANOVA in statistical software. (For more information about ANOVA, see NCHRP Project 20-45, Volume 2, Chapter 4, Section A.)

5. Interpreting the Results: For the data shown in Table 30, the ANOVA reveals that the volume and design factors are statistically significant at the 99.99% confidence level. Furthermore, the interaction between the volume and design factors also is statistically significant at the 99.99% level. The means tests on the design factors show that the quadrant roadway is significantly different from (has a lower overall travel time than) the other designs at the 95% level. The next-best designs overall are the median U-turn and the continuous flow intersection; these are not statistically different from each other at the 95% level. The third tier of designs consists of the conventional and the split, which are statistically different from all others at the 95% level but not from each other. Finally, the jughandle and the superstreet designs are statistically different from each other and from all other designs at the 95% level according to the means test.

Through the simulation, the team learns that several designs appear to be more efficient than the conventional design, especially at higher volume levels.
From the results at all seven intersections, the team sees that the quadrant roadway and median U-turn designs generally lead to the lowest travel times, especially with the higher volume levels.

6. Conclusion and Discussion: Simulation is an effective tool to analyze traffic operations, as at the seven intersections of interest in this example. No other tool would allow such a robust comparison of many different designs and provide the results for travel times in a larger network rather than delays at a single spot. The simulation conducted in this example also allows the team to conduct an efficient factorial design, which maximizes the information provided from the effort. Simulation is a useful tool in research for traffic operations because it
• affords the ability to conduct randomized experiments,
• allows the examination of details that other methods cannot provide, and
• allows the analysis of large and complex networks.

In practice, simulation also is popular because of the vivid and realistic animation output provided by common software packages. The superb animations allow analysts to spot and treat flaws in the design or model and provide agencies an effective tool by which to share designs with politicians and the public.

Although simulation results can sometimes be surprising, more often they confirm what the analysts already suspect based on simpler analyses. In the example described here, the analysts suspected that the quadrant roadway and median U-turn designs would perform well because these designs had performed well in prior Highway Capacity Manual calculations. In many studies, simulations provide rich detail and vivid animation but no big surprises.

7. Applications in Other Areas of Transportation Research: Simulations are critical analysis methods in several areas of transportation research. Besides traffic operations, simulations are used in research related to:
• Maintenance: to model the lifetime performance of traffic signs.
• Traffic Safety:
  – to examine vehicle performance and driver behaviors or performance.
  – to predict the number of collisions from a new roadway design (potentially, given the recent development of the FHWA SSAM program).
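As a small illustration of the factorial layout used in Example 20, the sketch below enumerates the run plan for one intersection (three volume levels, eight designs, two random number seeds) and then picks the design with the lowest total travel time at each volume level from the Table 30 averages. The seed labels are placeholders invented for this sketch. No simulation is run and no ANOVA is performed here, because the individual replicate results are not given in the example.

```python
from itertools import product

volumes = ["Midday", "P.M. peak", "Peak + 15%"]
designs = ["Conventional", "Quadrant", "Median U", "Superstreet",
           "Bowtie", "Jughandle", "Split", "Continuous"]
seeds = ["seed A", "seed B"]  # placeholder labels for the two random number seeds

# Full factorial run plan for one intersection: every volume x design x seed
run_plan = list(product(volumes, designs, seeds))
print(f"{len(run_plan)} runs per intersection")

# Table 30: total travel time (vehicle-hours), average of two runs.
# The bowtie design is absent because it gridlocked at this intersection.
table_30 = {
    "Midday":     {"Conventional": 67, "Quadrant": 64, "Median U": 61,
                   "Superstreet": 74, "Jughandle": 63, "Split": 59, "Continuous": 75},
    "P.M. peak":  {"Conventional": 121, "Quadrant": 95, "Median U": 119,
                   "Superstreet": 179, "Jughandle": 139, "Split": 114, "Continuous": 106},
    "Peak + 15%": {"Conventional": 170, "Quadrant": 135, "Median U": 145,
                   "Superstreet": 245, "Jughandle": 164, "Split": 180, "Continuous": 142},
}

# Design with the lowest average travel time at each volume level
best_design = {volume: min(row, key=row.get) for volume, row in table_30.items()}
for volume, design in best_design.items():
    print(f"{volume}: {design} ({table_30[volume][design]} vehicle-hours)")
```

Picking the row minimum only describes the averages. Whether the differences between designs are statistically meaningful is the question the ANOVA and Tukey's means test answer in the example.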

Example 21: Traffic Safety; Non-parametric Methods
Area: Traffic safety
Method of Analysis: Non-parametric methods (methods used when data do not follow assumed or conventional distributions, such as when comparing median values)

1. Research Question/Problem Statement: A city traffic engineer has been receiving many citizen complaints about the perceived lack of safety at unsignalized midblock crosswalks. Apparently, some motorists seem surprised by pedestrians in the crosswalks and do not yield to the pedestrians. The engineer believes that larger and brighter warning signs may be an inexpensive way to enhance safety at these locations.

Question/Issue
Determine whether some treatment has an effect when data to be tested do not follow known distributions. In this example, a nonparametric method is used to determine whether larger and brighter warning signs improve pedestrian safety at unsignalized midblock crosswalks. The null hypothesis and alternative hypothesis are stated as follows:
Ho: There is no difference in the median values of the number of conflicts before and after a treatment.
Ha: There is a difference in the median values.

2. Identification and Description of Variables: The engineer would like to collect collision data at crosswalks with improved signs, but it would take a long time at a large sample of crosswalks to collect a reasonable sample size of collisions to answer the question. Instead, the engineer collects data for conflicts, which are near-collisions in which one or both of the involved parties brake or swerve within 2 seconds of a potential collision to avoid it. Research literature has shown that conflicts are related to collisions, and because conflicts are much more numerous than collisions, it is much quicker to collect a good sample size.
Conflict data are not nearly as widely used as collision data, however, and the underlying distribution of conflict data is not clear. Thus, the use of non-parametric methods seems appropriate.

3. Data Collection: The engineer identifies seven test crosswalks in the city based on large pedestrian volumes and the presence of convenient vantage points for observing conflicts. The engineering staff collects data on traffic conflicts for 2 full days at each of the seven crosswalks with standard warning signs. The engineer then has larger and brighter warning signs installed at the seven sites. After waiting at least 1 month at each site after sign installation, the staff again collects traffic conflicts for 2 full days, making sure that weather, light, and as many other conditions as possible are similar between the before-and-after data collection periods at each site.

4. Specification of Analysis Technique and Data Analysis: A nonparametric statistical test is an efficient way to analyze data when the underlying distribution is unclear (as in this example using conflict data) and when the sample size is small (as in this example with its small number of sites). Several such tests, such as the sign test and the Wilcoxon signed-rank test, are plausible in this example. (For more information about nonparametric tests, see NCHRP Project 20-45, Volume 2, Chapter 6, Section D, “Hypothesis About Population Medians for Independent Samples.”) The decision is made to use the Wilcoxon signed-rank test because it is a more powerful test for paired numerical measurements than other tests, and this example uses paired (before-and-after) measurements. The sign test is a popular nonparametric test for paired data but loses information contained in numerical measurements by reducing the data to a series of positive or negative signs.
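To make the contrast between the two tests concrete, the sketch below computes both statistics from the seven before-and-after differences that this example reports in the Difference row of Table 31. It is a hedged illustration only. The helper names (`midranks`, `exact_upper_tail`) are invented for this sketch, and the exact p-value by enumeration is an addition for illustration; the example itself relies on a table of critical values.

```python
from itertools import product
from math import comb

# Difference row of Table 31 as printed (standard signs minus larger, brighter signs)
differences = [15, 7, 2, 3, 7, -12, -16]


def midranks(values):
    """Rank values from low to high; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def exact_upper_tail(ranks, observed):
    """P(sum of positive ranks >= observed) when each sign is equally likely."""
    patterns = list(product([0, 1], repeat=len(ranks)))
    hits = sum(1 for signs in patterns
               if sum(r for r, s in zip(ranks, signs) if s) >= observed)
    return hits / len(patterns)


# Sign test: only the direction of each difference is used (no zero differences here)
n = len(differences)
n_positive = sum(d > 0 for d in differences)
sign_test_p = sum(comb(n, k) for k in range(n_positive, n + 1)) / 2 ** n

# Wilcoxon signed-rank: ranks of |difference|, summed over the positive differences
abs_ranks = midranks([abs(d) for d in differences])
w_plus = sum(r for r, d in zip(abs_ranks, differences) if d > 0)
signed_rank_p = exact_upper_tail(abs_ranks, w_plus)

print(f"Sign test: {n_positive} of {n} positive, one-sided p = {sign_test_p:.3f}")
print(f"Signed-rank: ranks = {abs_ranks}, W+ = {w_plus}, one-sided p = {signed_rank_p:.3f}")
```

The sign test sees only five positive differences out of seven. The signed-rank statistic also uses the sizes of the differences, so the two largest absolute differences, which are negative, weigh against the treatment.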

Having decided on the Wilcoxon signed-rank test, the engineer arranges the data (see Table 31). The third row of the table is the difference between the frequencies of the two conflict measurements at each site. The last row shows the rank order of the sites from lowest to highest based on the absolute value of the difference. Site 3 has the least difference (35 - 33 = 2) while Site 7 has the greatest difference (45 - 61 = -16).

                               Site 1   Site 2   Site 3   Site 4   Site 5   Site 6   Site 7
Standard signs                 170      39       35       32       32       19       45
Larger and brighter signs      155      26       33       29       25       31       61
Difference                     15       7        2        3        7        -12      -16
Rank of absolute difference    6        3.5      1        2        3.5      5        7

Table 31. Number of conflicts recorded during each (equal) time period at each site.

The Wilcoxon signed-rank test ranks the differences from low to high in terms of absolute values. In this case, that would be 2, 3, 7, 7, 12, 15, and 16. The test statistic, x, is the sum of the ranks that have positive differences. In this example, x = 1 + 2 + 3.5 + 3.5 + 6 = 16. Notice that all sites except Sites 6 and 7 had positive differences. Notice also that the tied differences were assigned ranks equal to the average of the ranks they would have received if they were just slightly different from each other.

The engineer then consults a table for the Wilcoxon signed-rank test to get a critical value against which to compare. (Such a table appears in NCHRP Project 20-45, Volume 2, Appendix C, Table C-8.) The standard table for a sample size of seven shows that the critical value for a one-tailed test (testing whether there is an improvement) with a confidence level of 95% is x = 24.

5. Interpreting the Results: Because the calculated value (x = 16) is less than the critical value (x = 24), the engineer concludes that there is not a statistically significant difference between the number of conflicts recorded with standard signs and the number of conflicts recorded with larger and brighter signs.

6. Conclusion and Discussion: Nonparametric tests do not require the engineer to make restrictive assumptions about an underlying distribution and are therefore good choices in cases like this, in which the sample size is small and the data collected do not have a familiar underlying distribution. Many nonparametric tests are available, so analysts should do some reading and searching before settling on the best one for any particular case. Once a nonparametric test is determined, it is usually easy to apply.

This example also illustrates one of the potential pitfalls of statistical testing. The engineer’s conclusion is that there is not a statistically significant difference between the number of conflicts recorded with standard signs and the number of conflicts recorded with larger and brighter signs. That conclusion does not necessarily mean that larger and brighter signs are a bad idea at sites similar to those tested. Notice that in this experiment, larger and brighter signs produced lower conflict frequencies at five of the seven sites, and the average number of conflicts per site was lower with the larger and brighter signs. Given that signs are relatively inexpensive, they may be a good idea at sites like those tested. A statistical test can provide useful information, especially about the quality of the experiment, but analysts must be careful not to interpret the results of a statistical test too strictly.

In this example, the greatest danger to the validity of the test result lies not in the statistical test but in the underlying before-and-after test setup. For the results to be valid, it is necessary that the only important change that affects conflicts at the test sites during data collection be the new signs. The engineer has kept the duration short between the before-and-after data collection periods, which helps minimize the chances of other important changes. However, if there is any reason to suspect other important changes, these test results should be viewed skeptically and a more sophisticated test strategy should be employed.

7. Applications in Other Areas of Transportation Research: Nonparametric tests are helpful when researchers are working with small sample sizes or sample data wherein the underlying distribution is unknown. Examples of other areas of transportation research in which nonparametric tests may be applied include:
• Transportation Planning, Public Transportation: to analyze data from surveys and questionnaires when the scale of the response calls into question the underlying distribution. Such data are often analyzed in transportation planning and public transportation.
• Traffic Operations: to analyze small samples of speed or volume data.
• Structures, Pavements: to analyze quality ratings of pavements, bridges, and other transportation assets. Such ratings also use scales.

Resources

The examples used in this report have included references to the following resources. Researchers are encouraged to consult these resources for more information about statistical procedures.

Freund, R. J. and W. J. Wilson (2003). Statistical Methods. 2d ed. Burlington, MA: Academic Press. See page 256 for a discussion of Tukey’s procedure.

Kutner, M. et al. (2005). Applied Linear Statistical Models. 5th ed. Boston: McGraw-Hill. See page 746 for a discussion of Tukey’s procedure.

NCHRP CD-22: Scientific Approaches to Transportation Research, Vol. 1 and 2. 2002. Transportation Research Board of the National Academies, Washington, D.C.
This two-volume electronic manual developed under NCHRP Project 20-45 provides a comprehensive source of information on the conduct of research. The manual includes state-of-the-art techniques for problem statement development; literature searching; development of the research work plan; execution of the experiment; data collection, management, quality control, and reporting of results; and evaluation of the effectiveness of the research, as well as the requirements for the systematic, professional, and ethical conduct of transportation research.

For readers’ convenience, the references to NCHRP Project 20-45 from the various examples contained in this report are summarized here by topic and location in NCHRP CD-22. More information about NCHRP CD-22 is available at http://www.trb.org/Main/Blurbs/152122.aspx.

  • Analysis of Variance (one-way ANOVA and two-way ANOVA): See Volume 2, Chapter 4, Section A, Analysis of Variance Methodology (pp. 113, 119–31).
  • Assumptions for residual errors: See Volume 2, Chapter 4.
  • Box plots; Q-Q plots: See Volume 2, Chapter 6, Section C.
  • Chi-square test: See Volume 2, Chapter 6, Sections E (Chi-Square Test for Independence) and F.
  • Chi-square values: See Volume 2, Appendix C, Table C-2.
  • Computations on unbalanced designs and multi-factorial designs: See Volume 2, Chapter 4, Section A, Analysis of Variance Methodology (pp. 119–31).
  • Confidence intervals: See Volume 2, Chapter 4.
  • Correlation coefficient: See Volume 2, Appendix A, Glossary, Correlation Coefficient.
  • Critical F-value: See Volume 2, Appendix C, Table C-5.
  • Desirable and undesirable residual plots (scatter plots): See Volume 2, Chapter 4, Section B, Figure 6.

  • Equation fit: See Volume 2, Chapter 4, Glossary, Descriptive Measures of Association Between X and Y.
  • Error distributions (normality, constant variance, uncorrelated, etc.): See Volume 2, Chapter 4 (pp. 146–55).
  • Experiment design and data collection: See Volume 2, Chapter 1.
  • Fcrit and F-distribution table: See Volume 2, Appendix C, Table C-5.
  • F-test (or F-test): See Volume 2, Chapter 4, Section A, Compute the F-ratio Test Statistic (p. 124).
  • Formulation of formal hypotheses for testing: See Volume 1, Chapter 2, Hypothesis; Volume 2, Appendix A, Glossary.
  • History and maturation biases (specification errors): See Volume 2, Chapter 1, Quasi-Experiments.
  • Indicator (dummy) variables: See Volume 2, Chapter 4 (pp. 142–45).
  • Intercept and slope: See Volume 2, Chapter 4 (pp. 140–42).
  • Maximum likelihood methods: See Volume 2, Chapter 5 (pp. 208–11).
  • Mean and standard deviation formulas: See Volume 2, Chapter 6, Table C, Frequency Distributions, Variance, Standard Deviation, Histograms, and Boxplots.
  • Measured ratio or interval scale: See Volume 2, Chapter 1 (p. 83).
  • Multinomial distribution and polychotomous logistical model: See Volume 2, Chapter 5 (pp. 211–18).
  • Multiple (multivariate) regression: See Volume 2, Chapter 4, Section B.
  • Non-parametric tests: See Volume 2, Chapter 6, Section D.
  • Normal distribution: See Volume 2, Appendix A, Glossary, Normal Distribution.
  • One- and two-sided hypothesis testing (one- and two-tail test values): See Volume 2, Chapter 4 (pp. 161 and 164–5).
  • Ordinary least squares (OLS) regression: See Volume 2, Chapter 4, Section B, Linear Regression.
  • Sample size and confidence: See Volume 2, Chapter 1, Sample Size Determination.
  • Sample size determination based on statistical power requirements: See Volume 2, Chapter 1, Sample Size Determination (p. 94).
  • Sign test and the Wilcoxon signed-rank (Wilcoxon rank-sum) test: See Volume 2, Chapter 6, Section D, and Appendix C, Table C-8, Hypothesis About Population Medians for Independent Samples.
  • Split samples: See Volume 2, Chapter 4, Section A, Analysis of Variance Methodology (pp. 119–31).
  • Standard chi-square distribution table: See Volume 2, Appendix C, Table C-2.
  • Standard normal values: See Volume 2, Appendix C, Table C-1.
  • tcrit values: See Volume 2, Appendix C, Table C-4.
  • t-statistic: See Volume 2, Appendix A, Glossary.
  • t-statistic using equation for equal variance: See Volume 2, Appendix C, Table C-4.
  • t-test: See Volume 2, Chapter 4, Section B, How are t-statistics Interpreted?
  • Tabularized values of t-statistic: See Volume 2, Appendix C, Table C-4.
  • Tukey’s test, Bonferroni’s test, Scheffe’s test: See Volume 2, Chapter 4, Section A, Analysis of Variance Methodology (pp. 119–31).
  • Types of data and implications for selection of analysis techniques: See Volume 2, Chapter 1, Identification of Empirical Setting.
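To make the ranking step behind Table 31 concrete, the following is a minimal sketch of how signed ranks could be computed from the Difference row. It assumes the test applied is of the Wilcoxon signed-rank type, which the "Rank of absolute difference" row suggests but this excerpt does not state outright. The function name average_ranks is hypothetical, and the differences are copied as printed in the table.

```python
# Differences (standard signs minus larger and brighter signs), copied from
# the "Difference" row of Table 31 as printed. No other data are assumed.
differences = [15, 7, 2, 3, 7, -12, -16]


def average_ranks(values):
    """Rank values from smallest to largest; tied values share their mean rank."""
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(ordered):
        end = start
        # Extend the window over every value tied with the current one.
        while (end + 1 < len(ordered)
               and values[ordered[end + 1]] == values[ordered[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[ordered[k]] = mean_rank
        start = end + 1
    return ranks


# Rank the absolute differences, then sum the ranks by the sign of the difference.
# (This sketch assumes there are no zero differences, as is the case here.)
abs_ranks = average_ranks([abs(d) for d in differences])
w_plus = sum(r for d, r in zip(differences, abs_ranks) if d > 0)
w_minus = sum(r for d, r in zip(differences, abs_ranks) if d < 0)
test_statistic = min(w_plus, w_minus)

print("ranks of |difference|:", abs_ranks)
print("W+ =", w_plus, "W- =", w_minus, "T =", test_statistic)
```

The smaller rank sum would then be compared with a tabulated critical value for seven pairs, such as the table the index above points to in Appendix C.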


TRB’s National Cooperative Highway Research Program (NCHRP) Report 727: Effective Experiment Design and Data Analysis in Transportation Research describes the factors that may be considered in designing experiments and presents 21 typical transportation examples illustrating the experiment design process, including selection of appropriate statistical tests.

The report is a companion to NCHRP CD-22, Scientific Approaches to Transportation Research, Volumes 1 and 2, which present detailed information on statistical methods.



Qualitative case study data analysis: an example from practice


  • 1 School of Nursing and Midwifery, National University of Ireland, Galway, Republic of Ireland.
  • PMID: 25976531
  • DOI: 10.7748/nr.22.5.8.e1307

Aim: To illustrate an approach to data analysis in qualitative case study methodology.

Background: There is often little detail in case study research about how data were analysed. However, it is important that comprehensive analysis procedures are used because there are often large sets of data from multiple sources of evidence. Furthermore, the ability to describe in detail how the analysis was conducted ensures rigour in reporting qualitative research.

Data sources: The research example used is a multiple case study that explored the role of the clinical skills laboratory in preparing students for the real world of practice. Data analysis was conducted using a framework guided by the four stages of analysis outlined by Morse (1994): comprehending, synthesising, theorising and recontextualising. The specific strategies for analysis in these stages centred on the work of Miles and Huberman (1994), which has been successfully used in case study research. The data were managed using NVivo software.

Review methods: Literature examining qualitative data analysis was reviewed and strategies illustrated by the case study example provided.

Discussion: Each stage of the analysis framework is described with illustration from the research example for the purpose of highlighting the benefits of a systematic approach to handling large data sets from multiple sources.

Conclusion: By providing an example of how each stage of the analysis was conducted, it is hoped that researchers will be able to consider the benefits of such an approach to their own case study analysis.

Implications for research/practice: This paper illustrates specific strategies that can be employed when conducting data analysis in case study research and other qualitative research designs.

Keywords: Case study data analysis; case study research methodology; clinical skills research; qualitative case study methodology; qualitative data analysis; qualitative research.

  • Case-Control Studies*
  • Data Interpretation, Statistical*
  • Nursing Research / methods*
  • Qualitative Research*
  • Research Design


What is data analysis? Examples and how to get started


Even with years of professional experience working with data, the term "data analysis" still sets off a panic button in my soul. And yes, when it comes to serious data analysis for your business, you'll eventually want data scientists on your side. But if you're just getting started, no panic attacks are required.


Quick review: What is data analysis?

Data analysis is the process of examining, filtering, adapting, and modeling data to help solve problems. Data analysis helps determine what is and isn't working, so you can make the changes needed to achieve your business goals. 

Keep in mind that data analysis includes analyzing both quantitative data (e.g., profits and sales) and qualitative data (e.g., surveys and case studies) to paint the whole picture. Here are two simple examples (of a nuanced topic) to show you what I mean.

An example of quantitative data analysis is an online jewelry store owner using inventory data to forecast and improve reordering accuracy. The owner looks at their sales from the past six months and sees that, on average, they sold 210 gold pieces and 105 silver pieces per month, but they only had 100 gold pieces and 100 silver pieces in stock. Because monthly demand for gold is twice the demand for silver, an even split of stock leaves gold undersupplied. The next time they order inventory, they order twice as many gold pieces as silver to meet customer demand.
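Here's a small sketch of the arithmetic behind that reordering decision. The sales figures come from the example above, while the helper name demand_ratios is made up for illustration.

```python
# Average monthly sales from the jewelry store example above.
avg_monthly_sales = {"gold": 210, "silver": 105}


def demand_ratios(sales):
    """Hypothetical helper: express each item's demand relative to the slowest seller."""
    baseline = min(sales.values())
    return {item: units / baseline for item, units in sales.items()}


ratios = demand_ratios(avg_monthly_sales)
print(ratios)  # gold sells at twice the rate of silver
```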

An example of qualitative data analysis is a fitness studio owner collecting customer feedback to improve class offerings. The studio owner sends out an open-ended survey asking customers what types of exercises they enjoy the most. The owner then performs qualitative content analysis to identify the most frequently suggested exercises and incorporates these into future workout classes.

Why is data analysis important?

Here's why it's worth implementing data analysis for your business:

Understand your target audience: You might think you know how to best target your audience, but are your assumptions backed by data? Data analysis can help answer questions like, "What demographics define my target audience?" or "What is my audience motivated by?"

Inform decisions: You don't need to toss and turn over a decision when the data points clearly to the answer. For instance, a restaurant could analyze which dishes on the menu are selling the most, helping them decide which ones to keep and which ones to change.

Adjust budgets: Similarly, data analysis can highlight areas in your business that are performing well and are worth investing more in, as well as areas that aren't generating enough revenue and should be cut. For example, a B2B software company might discover their product for enterprises is thriving while their small business solution lags behind. This discovery could prompt them to allocate more budget toward the enterprise product, resulting in better resource utilization.

Identify and solve problems: Let's say a cell phone manufacturer notices data showing a lot of customers returning a certain model. When they investigate, they find that model also happens to have the highest number of crashes. Once they identify and solve the technical issue, they can reduce the number of returns.

Types of data analysis (with examples)

There are five main types of data analysis—with increasingly scary-sounding names. Each one serves a different purpose, so take a look to see which makes the most sense for your situation. It's ok if you can't pronounce the one you choose. 


Text analysis: What is happening?

Here are a few methods used to perform text analysis, to give you a sense of how it's different from a human reading through the text: 

Word frequency identifies the most frequently used words. For example, a restaurant monitors social media mentions and measures the frequency of positive and negative keywords like "delicious" or "expensive" to determine how customers feel about their experience. 

Language detection indicates the language of text. For example, a global software company may use language detection on support tickets to connect customers with the appropriate agent. 

Keyword extraction automatically identifies the most used terms. For example, instead of sifting through thousands of reviews, a popular brand uses a keyword extractor to summarize the words or phrases that are most relevant. 
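To give a feel for the word frequency method described above, here's a minimal sketch using only Python's standard library. The mentions and the keyword lists are invented for illustration, and a real text analysis tool would handle things like misspellings and phrases that this sketch ignores.

```python
import re
from collections import Counter

# Hypothetical social media mentions; not real customer data.
mentions = [
    "The pasta was delicious and the staff were friendly",
    "Delicious dessert but a bit expensive",
    "Too expensive for the portion size",
    "Delicious! Will come back",
]

# Hypothetical keyword lists a restaurant might track.
positive = {"delicious", "friendly"}
negative = {"expensive"}

# Lowercase each mention and split it into words before counting.
words = Counter(
    word
    for text in mentions
    for word in re.findall(r"[a-z']+", text.lower())
)

positive_hits = sum(words[w] for w in positive)
negative_hits = sum(words[w] for w in negative)
print("positive:", positive_hits, "negative:", negative_hits)
```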

Statistical analysis: What happened?

Statistical analysis pulls past data to identify meaningful trends. Two primary categories of statistical analysis exist: descriptive and inferential.

Descriptive analysis

Here are a few methods used to perform descriptive analysis: 

Measures of frequency identify how frequently an event occurs. For example, a popular coffee chain sends out a survey asking customers what their favorite holiday drink is and uses measures of frequency to determine how often a particular drink is selected. 

Measures of central tendency use the mean, median, and mode to identify the typical value in a dataset. For example, a dating app company might use measures of central tendency to determine the average age of its users.

Measures of dispersion measure how data is distributed across a range. For example, HR may use measures of dispersion to determine what salary to offer in a given field. 
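The three kinds of descriptive measures above can be sketched with Python's built-in statistics module. The list of ages is invented for illustration and loosely follows the dating app example.

```python
import statistics
from collections import Counter

# Hypothetical user ages for a dating app; illustrative only.
ages = [22, 25, 25, 27, 30, 31, 25, 40, 29, 26]

# Measures of frequency: how often each age appears.
frequency = Counter(ages)

# Measures of central tendency.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# Measures of dispersion.
age_range = max(ages) - min(ages)
std_dev = statistics.stdev(ages)  # sample standard deviation

print("mean:", mean_age, "median:", median_age, "mode:", mode_age)
print("range:", age_range, "standard deviation:", round(std_dev, 2))
```

Note how the single 40-year-old pulls the mean above the median, which is one reason to look at more than one measure.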

Inferential analysis

Inferential analysis uses a sample of data to draw conclusions about a much larger population. This type of analysis is used when the population you're interested in analyzing is very large. 

Here are a few methods used when performing inferential analysis: 

Hypothesis testing checks whether an observed effect is likely to be real or could have occurred by chance. For example, a business uses hypothesis testing to determine if increased sales were the result of a specific marketing campaign. 

Regression analysis shows the effect of independent variables on a dependent variable. For example, a rental car company may use regression analysis to determine the relationship between wait times and number of bad reviews. 
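As a rough sketch of the regression idea above, the snippet below fits a straight line with ordinary least squares for a single independent variable. The wait times and review counts are made up, and a real analysis would also check how well the line fits the data.

```python
# Hypothetical data: average wait time (minutes) and bad reviews per week.
wait_minutes = [5, 10, 15, 20, 25]
bad_reviews = [1, 3, 4, 6, 8]

n = len(wait_minutes)
mean_x = sum(wait_minutes) / n
mean_y = sum(bad_reviews) / n

# Ordinary least squares for one independent variable:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(wait_minutes, bad_reviews))
sxx = sum((x - mean_x) ** 2 for x in wait_minutes)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"bad reviews ~ {intercept:.2f} + {slope:.2f} * wait minutes")
```

With this invented data, the slope says each extra minute of waiting goes along with about a third of an additional bad review per week.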

Diagnostic analysis: Why did it happen?

Diagnostic analysis, also referred to as root cause analysis, uncovers the causes of certain events or results. 

Here are a few methods used to perform diagnostic analysis: 

Time-series analysis analyzes data collected over a period of time. A retail store may use time-series analysis to determine that sales increase between October and December every year. 

Correlation analysis determines the strength of the relationship between variables. For example, a local ice cream shop may determine that as the temperature in the area rises, so do ice cream sales. 
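For a sense of how correlation strength is measured, here's a minimal sketch that computes the Pearson correlation coefficient by hand. The temperatures and sales figures are invented, and pearson_r is a name chosen for this illustration.

```python
import math

# Hypothetical daily high temperature (degrees F) and ice cream sales (units).
temperature = [60, 65, 70, 75, 80, 85]
sales = [20, 24, 30, 33, 41, 44]


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(temperature, sales)
print(f"r = {r:.3f}")  # values near 1 indicate a strong positive relationship
```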

Predictive analysis: What is likely to happen?

Predictive analysis aims to anticipate future developments and events. By analyzing past data, companies can predict future scenarios and make strategic decisions.  

Here are a few methods used to perform predictive analysis: 

Decision trees map out possible courses of action and outcomes. For example, a business may use a decision tree when deciding whether to downsize or expand. 

Prescriptive analysis: What action should we take?

The highest level of analysis, prescriptive analysis, aims to find the best action plan. Typically, AI tools model different outcomes to predict the best approach. While these tools serve to provide insight, they don't replace human consideration, so always use your human brain before going with the conclusion of your prescriptive analysis. Otherwise, your GPS might drive you into a lake.

Here are a few methods used to perform prescriptive analysis: 

Algorithms are used in technology to perform specific tasks. For example, banks use prescriptive algorithms to monitor customers' spending and recommend that they deactivate their credit card if fraud is suspected. 

Data analysis process: How to get started

The actual analysis is just one step in a much bigger process of using data to move your business forward. Here's a quick look at all the steps you need to take to make sure you're making informed decisions. 


Data decision

As with almost any project, the first step is to determine what problem you're trying to solve through data analysis. 

Make sure you get specific here. For example, a food delivery service may want to understand why customers are canceling their subscriptions. But to enable the most effective data analysis, they should pose a more targeted question, such as "How can we reduce customer churn without raising costs?" 

Data collection

Next, collect the required data from both internal and external sources. 

Internal data comes from within your business (think CRM software, internal reports, and archives), and helps you understand your business and processes.

External data originates from outside of the company (surveys, questionnaires, public data) and helps you understand your industry and your customers. 

Data cleaning

Data can be seriously misleading if it's not clean. So before you analyze, make sure you review the data you collected.  Depending on the type of data you have, cleanup will look different, but it might include: 

Removing unnecessary information 

Addressing structural errors like misspellings

Deleting duplicates

Trimming whitespace

Human checking for accuracy 
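A few of the cleanup steps above can be sketched in plain Python. The rows, the field names, and the corrections lookup are all made up for illustration, and the final human check still has to happen outside the code.

```python
# Hypothetical raw survey rows; names and values are made up.
raw_rows = [
    {"name": "  Ana Lopez ", "city": "Chicgo"},
    {"name": "Ana Lopez", "city": "Chicago"},
    {"name": "Ben Kim", "city": " Denver"},
]

# Hypothetical lookup of known misspellings and their corrections.
corrections = {"Chicgo": "Chicago"}


def clean(rows):
    seen = set()
    cleaned = []
    for row in rows:
        # Trim whitespace from every field.
        tidy = {key: value.strip() for key, value in row.items()}
        # Fix known structural errors such as misspellings.
        tidy["city"] = corrections.get(tidy["city"], tidy["city"])
        # Delete rows that duplicate one already kept.
        fingerprint = tuple(sorted(tidy.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(tidy)
    return cleaned


clean_rows = clean(raw_rows)
print(clean_rows)
```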

Data analysis

Now that you've compiled and cleaned the data, use one or more of the above types of data analysis to find relationships, patterns, and trends. 

Data analysis tools can speed up the data analysis process and reduce the risk of human error. Here are some examples.

Spreadsheets sort, filter, analyze, and visualize data. 

Structured query language (SQL) tools manage and extract data in relational databases. 
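To show what extracting data with SQL can look like, here's a small sketch that uses Python's built-in sqlite3 module with an in-memory database. The orders table and its contents are invented, echoing the restaurant example from earlier.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (dish TEXT, quantity INTEGER)")

# Hypothetical order data for a restaurant.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("pasta", 3), ("salad", 1), ("pasta", 2), ("soup", 4)],
)

# Total quantity sold per dish, best sellers first.
rows = conn.execute(
    "SELECT dish, SUM(quantity) AS total "
    "FROM orders GROUP BY dish ORDER BY total DESC, dish"
).fetchall()
conn.close()

for dish, total in rows:
    print(dish, total)
```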

Data interpretation

After you analyze the data, you'll need to go back to the original question you posed and draw conclusions from your findings. Here are some common pitfalls to avoid:

Correlation vs. causation: Just because two variables are associated doesn't mean that one causes the other. 

Confirmation bias: This occurs when you interpret data in a way that confirms your own preconceived notions. To avoid this, have multiple people interpret the data. 

Small sample size: If your sample size is too small or doesn't represent the demographics of your customers, you may get misleading results. If you run into this, consider increasing your sample size to get a more accurate representation. 

Data visualization

Frequently asked questions

Need a quick summary or still have a few nagging data analysis questions? I'm here for you.

What are the five types of data analysis?

The five types of data analysis are text analysis, statistical analysis, diagnostic analysis, predictive analysis, and prescriptive analysis. Each type offers a unique lens for understanding data: text analysis provides insights into text-based content, statistical analysis focuses on numerical trends, diagnostic analysis looks into problem causes, predictive analysis deals with what may happen in the future, and prescriptive analysis gives actionable recommendations.

What is the data analysis process?

The data analysis process involves data decision, collection, cleaning, analysis, interpretation, and visualization. Every stage comes together to transform raw data into meaningful insights. Decision determines what data to collect, collection gathers the relevant information, cleaning ensures accuracy, analysis uncovers patterns, interpretation assigns meaning, and visualization presents the insights.

What is the main purpose of data analysis?

In business, the main purpose of data analysis is to uncover patterns, trends, and anomalies, and then use that information to make decisions, solve problems, and reach your business goals.


This article was originally published in October 2022 and has since been updated with contributions from Cecilia Gillen. The most recent update was in September 2023.


Shea Stevens

Shea is a content writer currently living in Charlotte, North Carolina. After graduating with a degree in Marketing from East Carolina University, she joined the digital marketing industry focusing on content and social media. In her free time, you can find Shea visiting her local farmers market, attending a country music concert, or planning her next adventure.


6 How to Analyze Data in a Primary Research Study

Melody Denny and Lindsay Clark

This chapter introduces students to the idea of working with primary research data grounded in qualitative inquiry, closed- and open-ended methods, and research ethics (Driscoll; Mackey and Gass; Morse; Scott and Garner). We know this can seem intimidating to students, so we will walk them through the process of analyzing primary research, using information from public datasets including the Pew Research Center. Using sample data on teen social media use, we share our processes for analyzing sample data to demonstrate different approaches for analyzing primary research data (Charmaz; Creswell; Merriam and Tisdale; Saldaña). We also include links to additional public data sets, chapter discussion prompts, and sample activities for students to apply these strategies.

At this point in your education, you are familiar with what is known as secondary research or what many students think of as library research. Secondary research makes use of sources most often found in the library or, these days, online (books, journal articles, magazines, and many others). There’s another kind of research that you may or may not be familiar with: primary research. The Purdue OWL defines primary research as “any type of research you collect yourself” and lists examples as interviews, observations, and surveys (“What is Primary Research”).

Primary research is typically divided into two main types—quantitative and qualitative research. These two methods (or a mix of these) are used by many fields of study, so providing a singular definition for these is a bit tricky. Sheard explains that “quantitative research…deals with data that are numerical or that can be converted into numbers. The basic methods used to investigate numerical data are called ‘statistics’” (429). Guest, et al. explain that qualitative research is “information that is difficult to obtain through more quantitatively-oriented methods of data collection” and is used more “to answer the whys and hows of human behavior, opinion, and experience” (1).

This chapter focuses on qualitative methods that explore people’s behaviors, interpretations, and opinions. Rather than being only a reader and reporter of research, you can use primary research to become a creator of research. Primary research provides opportunities to collect information based on your specific research questions and generate new knowledge from those questions to share with others. Generally, primary research tends to follow these steps:

  1. Develop a research question. Secondary research often uses this as a starting point as well. With primary research, however, rather than using library research to answer your research question, you’ll also collect data yourself to answer the question you developed. Data, in this case, is the information you collect yourself through methods such as interviews, surveys, and observations.
  2. Decide on a research method. According to Scott and Garner, “A research method is a recognized way of collecting or producing [primary data], such as a survey, interview, or content analysis of documents” (8). In other words, the method is how you obtain the data.
  3. Collect data. Merriam and Tisdale clarify what it means to collect data: “data collection is about asking, watching, and reviewing” (105-106). Primary research might include asking questions via surveys or interviews, watching or observing interactions or events, and examining documents or other texts.
  4. Analyze data. Once data is collected, it must then be analyzed. “Data analysis is the process of making sense out of the data… Basically, data analysis is the process used to answer your research question(s)” (Merriam and Tisdale 202). It’s worth noting that many researchers collect data and analyze at the same time, so while these may seem like different steps in the process, they actually overlap.
  5. Report findings. Once the researcher has spent time understanding and interpreting the data, they are then ready to write about their research, often called “findings.” You may also see this referred to as “results.”

While the entire research process is discussed, this chapter focuses on the analysis stage of the process (step 4). Depending on where you are in the research process, you may need to spend more time on step 1, 2, or 3 and review Driscoll’s “Introduction to Primary Research” (Volume 2 of Writing Spaces).

Primary research can seem daunting, and some students might think that they can’t do primary research, that this type of research is for professionals and scholars, but that’s simply not true. It’s true that primary research data can be difficult to collect and even more difficult to analyze, but the findings are typically very revealing. This chapter and the examples included break down this research process and demonstrate how general curiosity can lead to exciting chances to learn and share information that is relevant and interesting. The goal of this chapter is to provide you with some information about data analysis and walk you through some activities to prepare you for your own data analysis. The next section discusses analyzing data from closed-ended methods and open-ended methods.

Data from Primary Research

As stated above, this chapter doesn’t focus on methods, but before moving on to analysis, it’s important to clarify a few things related to methods as they are directly connected to analyzing data. As a quick reminder, a research method is how researchers collect their data, such as surveys, interviews, or textual analysis. No matter which method is used, researchers need to think about the types of questions to ask in order to answer their overall research question. Generally, there are two types of questions to consider: closed-ended and open-ended. The next section provides examples of the data you might receive from asking closed-ended and open-ended questions and options for analyzing and presenting that data.

Data from Closed-Ended Methods

The data that is generated by closed-ended questions on methods such as surveys and polls is often easier to organize. Because the way respondents could answer those questions is limited to specific answers (Yes/No, numbered scales, multiple choice), the data can be analyzed by each question or by looking at the responses individually or as a whole. Though there are several approaches to analyzing the data that comes from closed-ended questions, this section will introduce you to a few different ways to make sense of this kind of data.

Closed-ended questions are those that have limited answers, like multiple choice or check-all-that-apply questions. These questions mean that respondents can provide only the answers given or they may select an “other” option. An example of a closed-ended question could be “Do you use YouTube? Yes, No, Sometimes.” Closed-ended questions have their perks because they (mostly) keep participants from misinterpreting the question or providing unhelpful responses. They also make data analysis a bit easier.

If you were to ask the “Yes, No, Sometimes” question about YouTube to 20 of your closest friends, you may get responses like Yes = 18, No = 1, and Sometimes = 1. But, if you were to ask a more detailed question like “Which of the following social media platforms do you use?” and provide respondents with a check-all-that-apply option, like “Facebook, YouTube, Twitter, Instagram, Snapchat, Reddit, and Tumblr,” you would get a very different set of data. This data might look like Facebook = 17, YouTube = 18, Twitter = 12, Instagram = 20, Snapchat = 15, Reddit = 8, and Tumblr = 3. The big takeaway here is that how you ask the question determines the type of data you collect.

Analyzing Closed-Ended Data

Now that you have data, it’s time to think about analyzing and presenting that data. Luckily, the Pew Research Center conducted a similar study that can be used as an example. The Pew Research Center is a “nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping the world. It conducts public opinion polling, demographic research, media content analysis and other empirical social science research” (“About Pew Research Center”). The information provided below comes from their public dataset “Teens, Social Media & Technology 2018” (Anderson and Jiang). This example is used to show how you might analyze this type of data once collected and what that data might look like. “Teens, Social Media & Technology 2018” reported responses to questions related to which online platforms teens use and which they use most often. In figure 1 below, Pew researchers show the final product of their analysis of the data:

Social Media Usage Statistics

Pew analyzed their data and organized the findings by percentages to show what they discovered. They had 743 teens who responded to these questions, so presenting their findings in percentages helps readers better “see” the data overall (rather than saying YouTube = 631 and Instagram = 535). However, results can be represented in different ways. When the Pew researchers were deciding how to present their data, they could have reported the frequency, or the number of people who said they used YouTube, Instagram, and Snapchat.

In the scenario of polling 20 of your closest friends, you, too, would need to decide how to present your data: Facebook = 17, YouTube = 18, Twitter = 12, Instagram = 20, Snapchat = 15, Reddit = 8, and Tumblr = 3. In your case, you might want to present the frequency (number) of responses rather than the percentages of responses like Pew did. You could choose a bar graph like Pew or maybe a simple table to show your data.
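The choice between frequencies and percentages can be sketched in a few lines of Python. This is only an illustration using the counts from the 20-friend scenario above; the variable and function names are assumptions made for this example, not part of any survey tool.

```python
# Counts from the scenario of polling 20 friends described above.
responses = {
    "Facebook": 17,
    "YouTube": 18,
    "Twitter": 12,
    "Instagram": 20,
    "Snapchat": 15,
    "Reddit": 8,
    "Tumblr": 3,
}
total_respondents = 20  # check-all-that-apply, so the counts do not sum to 20


def to_percentages(counts, n):
    """Convert raw frequency counts to whole-number percentages of n respondents."""
    return {name: round(100 * count / n) for name, count in counts.items()}


percentages = to_percentages(responses, total_respondents)

# Print a simple table, most-used platform first.
print(f"{'Platform':<10} {'Count':>5} {'Percent':>8}")
for name, count in sorted(responses.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} {count:>5} {percentages[name]:>7}%")
```

With only 20 respondents, showing the count next to the percentage lets readers judge for themselves how much weight each number can bear.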

Looking again at the Pew data, researchers could use this data to generate further insights or questions about user preferences. For example, one could highlight the fact that 85% of respondents reported using YouTube, while only 7% reported using Reddit. Why is that? What conclusions might you be able to make based on these data? Does the data make you wonder if any additional questions might be explored? If you want to learn more about your respondents’ opinions or preferences, you might need to ask open-ended questions.

Data from Open-Ended Methods

Whereas closed-ended questions limit how respondents might answer, open-ended questions do not limit respondents’ answers and allow them to answer more freely. An example of an open-ended question, to build off the question above, could be “Why do you use social media? Explain.” This type of question gives respondents more space to fully explain their responses. Open-ended questions can make the data varied because each respondent may answer differently. These questions, which can provide fruitful responses, can also mean unexpected responses or responses that don’t help to answer the overall research question, which can sometimes make data analysis challenging.

In that same Pew Research Center data, respondents were likely limited in how they were able to answer by selecting social media platforms from a list. Pew also shares selected data (Appendix A), and based on these data, it can be assumed they also asked open-ended questions, something about the positive or negative effects of social media platforms. Because their research method included both closed-ended questions about which platforms teens use as well as open-ended questions that invited their thoughts about social media, Pew researchers were able to learn more about these participants’ thoughts and perceptions. To give us, the readers, a clearer idea of how they justified their presentation of the data, Pew offers 15 sample excerpts from those open-ended questions. They explain that these excerpts are what the researchers believe are representative of the larger data set. We explain below how we might analyze those excerpts.

Analyzing Open-Ended Data

As Driscoll reminds us, ethical considerations impact all stages of the research process, and researchers should act ethically throughout the entire research process. You already know a little something about research ethics. For example, you know that ethical writers cite sources used in research papers by giving credit to the person who created that information. When creating primary sources, you have a few different ethical considerations for analyzing data, which will be discussed below.

To demonstrate how to analyze data from open-ended methods, we explain how we (Melody and Lindsay) analyzed the 15 excerpts from the Pew data using open coding. Open coding means analyzing the data without any predetermined categories or themes; researchers are just seeing what emerges or seems significant (Charmaz). Creswell suggests four specific steps when coding qualitative data, though he also stresses that these steps are iterative, meaning that researchers may need to revisit a step anywhere throughout the process. We use these four steps to explain our analysis process, including how we ethically coded the data, interpreted what the coding process revealed, and worked together to identify and explain categories we saw in the data.

Step 1: Organizing and Preparing the Data

The first part of the analysis stage is organizing the data before examining it. When organizing data, researchers must be careful to work with primary data ethically because that data often represents actual people’s information and opinions. Therefore, researchers need to carefully organize the data in such a way as to not identify their participants or reveal who they are. This is a key component of The Belmont Report, guidelines published in 1979 meant to guide researchers and help protect participants. Using pseudonyms or assigning numbers or codes (in place of names) to the data is a recommended ethical step to maintain participants’ confidentiality in a study. Anonymizing data, or removing names, has the additional effect of reducing researcher bias, which can occur when researchers are so familiar with their own data and participants that the researchers may begin to think they already know the answers or see connections prior to analysis (Driscoll). By assigning pseudonyms, researchers can also take a more objective look at each participant’s answers without being persuaded by participant identity.
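One way to picture the step of assigning codes in place of names is the short Python sketch below. The participant names and answers are invented for illustration, and the function name and the "P01" code format are assumptions rather than a required convention.

```python
def assign_participant_codes(names, prefix="P"):
    """Map each distinct name to a neutral code such as P01, P02, and so on."""
    codes = {}
    for name in names:
        if name not in codes:
            codes[name] = f"{prefix}{len(codes) + 1:02d}"
    return codes


# Hypothetical raw responses; the names and answers are invented for illustration.
raw_responses = [
    ("Jordan", "I mostly use it to keep up with friends."),
    ("Sam", "It distracts me from homework."),
    ("Jordan", "I also follow a few news accounts."),
]

codes = assign_participant_codes(name for name, _ in raw_responses)

# Keep only the code with each answer; the name-to-code key should be stored
# separately from the data that will be analyzed and shared.
anonymized = [(codes[name], answer) for name, answer in raw_responses]

for code, answer in anonymized:
    print(code, answer)
```

The same participant always receives the same code, so answers can still be connected to one another during analysis without revealing who gave them.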

The first part of coding is to make notations while reading through the data (Merriam and Tisdell). At this point, researchers are open to many possibilities regarding their data. This is also where researchers begin to construct categories. Offering a simple example to illustrate this decision-making process, Merriam and Tisdell ask us to imagine sorting and categorizing two hundred grocery store items (204). Some items could be sorted into more than one category; for example, ice cream could be categorized as “frozen” or as “dessert.” How you decide to sort that item depends on your research question and what you want to learn.

For this step, we, Melody and Lindsay, each created a separate document that included the 15 excerpts. Melody created a table for the quotes, leaving a column for her coding notes, and Lindsay added spaces between the excerpts for her notes. For our practice analysis, we analyzed the data independently, and then shared what we did to compare, verify, and refine our analysis. This brings a second, objective view to the analysis, reduces the effect of researcher bias, and ensures that your analysis can be verified and supported by the data. To support your analysis, you need to demonstrate how you developed the opinions and conclusions you have about your data. After all, when researchers share their analyses, readers often won’t see all of the raw data, so they need to be able to trust the analysis process.

Step 2: Reading through All the Data

Creswell suggests getting a general sense of the data to understand its overall meaning. As you start reading through your data, you might begin to recognize trends, patterns, or recurring features that give you ideas about how to both analyze and later present the data. When we read through the interview excerpts of these 15 participants’ opinions of social media, we both realized that there were two major types of comments: positive and negative. This might be similar to categorizing the items in the grocery store (mentioned above) into fresh/frozen foods and non-perishable items.

To better organize the data for further analysis, Melody marked each positive comment with a plus sign and each negative comment with a minus sign. Lindsay color-coded the comments (red for negative, indicated by boldface type below; green for positive, indicated by grey type below) and then organized them on the page by type. This approach is in line with Merriam and Tisdell’s explanation of coding: “assigning some sort of shorthand designation to various aspects of your data so that you can easily retrieve specific pieces of the data. The designations can be single words, letters, numbers, phrases, colors, or combinations of these” (199). While we took different approaches, as shown in the two sections below, both allowed us to visually recognize the major sections of the data:

Lindsay’s Coding Round 1, which shows her color coding indicated by boldface type

“[Social media] allows us to communicate freely and see what everyone else is doing. [It] gives us a voice that can reach many people.” (Boy, age 15)

“It makes it harder for people to socialize in real life, because they become accustomed to not interacting with people in person.” (Girl, age 15)

“[Teens] would rather go scrolling on their phones instead of doing their homework, and it’s so easy to do so. It’s just a huge distraction.” (Boy, age 17)

“It enables people to connect with friends easily and be able to make new friends as well.” (Boy, age 15)

“I think social media have a positive effect because it lets you talk to family members far away.” (Girl, age 14)

“Because teens are killing people all because of the things they see on social media or because of the things that happened on social media.” (Girl, age 14)

“We can connect easier with people from different places and we are more likely to ask for help through social media which can save people.” (Girl, age 15)

Melody’s Coding Round 1, showing her use of plus and minus signs to classify the comments as positive or negative, respectively

+ “[Social media] allows us to communicate freely and see what everyone else is doing. [It] gives us a voice that can reach many people.” (Boy, age 15)

– “It makes it harder for people to socialize in real life, because they become accustomed to not interacting with people in person.” (Girl, age 15)

– “[Teens] would rather go scrolling on their phones instead of doing their homework, and it’s so easy to do so. It’s just a huge distraction.” (Boy, age 17)

+ “It enables people to connect with friends easily and be able to make new friends as well.” (Boy, age 15)

+ “I think social media have a positive effect because it lets you talk to family members far away.” (Girl, age 14)

– “Because teens are killing people all because of the things they see on social media or because of the things that happened on social media.” (Girl, age 14)

+ “We can connect easier with people from different places and we are more likely to ask for help through social media which can save people.” (Girl, age 15)
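A first-round coding like the one above can also be kept as simple structured data, which makes tallying straightforward. The sketch below is a minimal illustration in Python; it records only the speaker label and the plus or minus sign for the seven excerpts shown, and the variable names are assumptions made for this example.

```python
from collections import Counter

# Round 1 labels for the seven excerpts shown above, in order.
# "+" marks a comment read as positive, "-" one read as negative.
round_one = [
    ("Boy, age 15", "+"),
    ("Girl, age 15", "-"),
    ("Boy, age 17", "-"),
    ("Boy, age 15", "+"),
    ("Girl, age 14", "+"),
    ("Girl, age 14", "-"),
    ("Girl, age 15", "+"),
]

# Count how many excerpts received each label.
tally = Counter(label for _, label in round_one)
print(tally["+"], "positive,", tally["-"], "negative")
```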

Step 3: Doing Detailed Coding Analysis of the Data

It’s important to mention that Creswell dedicates pages of description to coding data because there are various ways of approaching detailed analysis. To code our data, we added a descriptive word or phrase that “symbolically assigns a summative, salient, essence-capturing, and/or evocative attribute” to a portion of data (Saldaña 3). From the grocery store example above, that could mean looking at the category of frozen foods and dividing them into entrees, side dishes, desserts, appetizers, etc. We both coded for topics or what the teens were generally talking about in their responses. For example, one excerpt reads “Social media allows us to communicate freely and see what everyone else is doing. It gives us a voice that can reach many people.” To code that piece of data, researchers might assign words like communication, voice, or connection to explain what the data is describing.

In this way, we created the codes from what the data said, describing what we read in those excerpts. Notice in the section below that, even though we coded independently, we described these pieces of data in similar ways using bolded keywords:

Melody’s Coding Round 2, with key words added to summarize the meanings of the different quotes

– “Gives people a bigger audience to speak and teach hate and belittle each other.” (Boy, age 13) bullying

– “It provides a fake image of someone’s life. It sometimes makes me feel that their life is perfect when it is not.” (Girl, age 15) fake

+ “Because a lot of things created or made can spread joy.” (Boy, age 17) reaching people

+ “I feel that social media can make people my age feel less lonely or alone. It creates a space where you can interact with people.” (Girl, age 15) connection

+ “[Social media] allows us to communicate freely and see what everyone else is doing. [It] gives us a voice that can reach many people.” (Boy, age 15) reaching people

Lindsay’s Coding Round 2, with key words added in capital letters to summarize the meanings of the quotations

“Gives people a bigger audience to speak and teach hate and belittle each other.” (Boy, age 13) OPPORTUNITIES TO COMMUNICATE NEGATIVELY/MORE EASILY

“It provides a fake image of someone’s life. It sometimes makes me feel that their life is perfect when it is not.” (Girl, age 15) FAKE, NOT REALITY

“Because a lot of things created or made can spread joy.” (Boy, age 17) SPREAD JOY

“I feel that social media can make people my age feel less lonely or alone. It creates a space where you can interact with people.” (Girl, age 15) INTERACTION, LESS LONELY

“[Social media] allows us to communicate freely and see what everyone else is doing. [It] gives us a voice that can reach many people.” (Boy, age 15) COMMUNICATE, VOICE

Though there are methods that allow for researchers to use predetermined codes (like from previous studies), “the traditional approach…is to allow the codes to emerge during the data analysis” (Creswell 187).

Step 4: Using the Codes to Create a Description Using Categories, Themes, Settings, or People

Our individual coding happened in phases, as we developed keywords and descriptions that could then be defined and relabeled into concise coding categories (Saldaña 11). We shared our work from Steps 1-3 to further define categories and determine which themes were most prominent in the data. A few times, we interpreted something differently and had to discuss and come to an agreement about which category was best.

In our process, one excerpt was interpreted as negative by one of us and positive by the other. Together we discussed and confirmed which comments were positive or negative and identified themes that seemed to appear more than once, such as positive feelings towards the interactional element of social media use and the negative impact of social media use on social skills. When two coders compare their results, this allows for qualitative validity, which means “the researcher checks for the accuracy of the findings” (Creswell 190). This could also be referred to as intercoder reliability (Lavrakas). For intercoder reliability, researchers sometimes calculate how often they agree as a percentage. Like many other aspects of primary research, there is no consensus on how best to establish or calculate intercoder reliability, but generally speaking, it’s a good idea to have someone else check your work and ensure you are ethically analyzing and reporting your data.
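Simple percent agreement, the most basic of those calculations, can be sketched as follows. This is a minimal Python illustration and only one of several possible approaches; the two label lists are assumptions invented for this example (not the actual coding sheets) and are set up to differ on a single excerpt, mirroring the one disagreement described above.

```python
def percent_agreement(coder_a, coder_b):
    """Return the share of items (0-100) on which two coders assigned the same label."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must label the same number of items.")
    if not coder_a:
        raise ValueError("At least one coded item is required.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)


# Illustrative labels for 15 excerpts (assumed, not the actual coding sheets).
# The two lists differ on the first item only.
coder_one = ["+", "-", "-", "+", "+", "+", "-", "-", "+", "+", "-", "+", "+", "-", "-"]
coder_two = ["-", "-", "-", "+", "+", "+", "-", "-", "+", "+", "-", "+", "+", "-", "-"]

print(f"{percent_agreement(coder_one, coder_two):.1f}% agreement")
```

Percent agreement is easy to compute and explain, but it does not account for agreement that would happen by chance, which is one reason researchers have not settled on a single measure.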

Interpreting Coded Data

Once we agreed on the common categories and themes in this dataset, we worked together on the final analysis phase of interpreting the data, asking “what does it mean?” Data interpretation includes “trying to give sense to the data by creatively producing insights about it” (Gibson and Brown 6). Though we acknowledge that this sample of only 15 excerpts is small, and it might be difficult to make claims about teens and social media from just this data, we can share a few insights we had as part of this practice activity.

Overall, we could report the frequency counts and percentages that came from our analysis. For example, we counted 8 positive comments and 7 negative comments about social media. Presented differently, those 8 positive comments represent 53% of the responses, so slightly over half. If we focus on just the positive comments, we are able to identify two common themes among those 8 responses: Interaction and Expression. People who felt positively about social media use identified the ability to connect with people and voice their feelings and opinions as the main reasons. When analyzing only the 7 negative responses, we identified themes of Bullying and Social Skills as recurring reasons people are critical of social media use among teens. Identifying these topics and themes in the data allows us to begin thinking about what we can learn and share with others about this data.

How we represent what we have learned from our data can demonstrate our ethical approach to data analysis. In short, we only want to make claims we can support, and we want to make those claims ethically, being careful to not exaggerate or be misleading.

To better understand a few common ethical dilemmas regarding the presentation of data, think about this example: A few years ago, Lindsay taught a class that had only four students. On her course evaluations, those four students rated the class experience as “Excellent.” If she reports that 100% of her students answered “Excellent,” is she being truthful? Yes. Do you see any potential ethical considerations here? If she said that 4/4 gave that rating, does that change how her data might be perceived by others? While Lindsay could show the raw data to support her claims, important contextual information could be missing if she just says 100%. Perhaps others would assume this was a regular class of 20-30 students, which would make that claim seem more meaningful and impressive than it might be.

Another word for this is cherry picking. Cherry picking refers to making conclusions based on thin (or not enough) data or focusing on data that’s not necessarily representative of the larger dataset (Morse). For example, if Lindsay reported the comment that one of her students made about this being the “best class ever,” she would be telling the truth but really only focusing on the reported opinion of 25% of the class (1 out of 4). Ideally, researchers want to make claims about the data based on ideas that are prominent, trending, or repeated. Less prominent pieces of data, like the opinion of that one student, are known as outliers, or data that seem to “be atypical of the rest of the dataset” (Mackey and Gass 257). Focusing on those less-representative portions might misrepresent or overshadow the aspects of the data that are prominent or meaningful, which could create ethical problems for your study. With these ethical considerations in mind, the last step of conducting primary research would be to write about the analysis and interpretation to share your process with others.
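One small habit that addresses the course evaluation dilemma is to report the count and the sample size alongside every percentage. The Python sketch below illustrates that idea; the function name and output wording are assumptions made for this example.

```python
def describe_share(count, total, label):
    """Report a percentage together with the raw count and sample size."""
    if total <= 0:
        raise ValueError("total must be a positive number of responses")
    percent = round(100 * count / total)
    return f"{percent}% ({count} of {total}) {label}"


print(describe_share(4, 4, "rated the class Excellent"))
print(describe_share(1, 4, "called it the best class ever"))
print(describe_share(8, 15, "of the Pew excerpts were coded positive"))
```

Seeing "100% (4 of 4)" or "53% (8 of 15)" gives readers the context they need to decide how much a percentage really tells them.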

This chapter has introduced you to ethically analyzing data within the primary research tradition by focusing on close-ended and open-ended data. We’ve provided you with examples of how data might be analyzed, interpreted, and presented to help you understand the process of making sense of your data. This is just one way to approach data analysis, but no matter your research method, having a systematic approach is recommended. Data analysis is a key component in the overall primary research process, and we hope that you are now excited and curious to participate in a primary research project.

Works Cited

“About Pew Research Center.” Pew Research Center, 2020. www.pewresearch.org/about/. Accessed 28 Dec 2020.

Anderson, Monica, and Jingjing Jiang. “Teens, Social Media & Technology 2018.” Pew Research Center, May 2018, www.pewresearch.org/internet/2018/05/31/teens-social-media-technology-2018/.

The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research, Office for Human Research Protections, www.hhs.gov/ohrp/regulations-and-policy/belmont-report/read-the-belmont-report/index.html. 18 Apr. 1979.

Charmaz, Kathy. “Grounded Theory.” Approaches to Qualitative Research: A Reader on Theory and Practice, edited by Sharlene Nagy Hesse-Biber and Patricia Leavy, Oxford UP, 2004, pp. 496-521.

Corpus of Contemporary American English (COCA). (n.d.). Retrieved April 11, 2021, from https://www.english-corpora.org/coca/

Creswell, John W. Research Design: Qualitative, Quantitative, and Mixed Methods Approaches, 3rd edition, Sage, 2009.

Data.gov. (2020). Retrieved April 11, 2021, from https://www.data.gov/

Driscoll, Dana Lynn. “Introduction to Primary Research: Observations, Surveys, and Interviews.” Writing Spaces: Readings on Writing, Volume 2, Parlor Press, 2011, pp. 153-174.

Explore Census Data. (n.d.). United States Census Bureau. Retrieved April 11, 2021, from https://data.census.gov/cedsci/

Gibson, William J., and Andrew Brown. Working with Qualitative Data. London, Sage, 2009.

Google Trends. (n.d.). Retrieved April 11, 2021, from https://trends.google.com/trends/explore

Guest, Greg, et al. Collecting Qualitative Data: A Field Manual for Applied Research. Sage, 2013.

HealthData.gov. (n.d.). Retrieved April 11, 2021, from https://healthdata.gov/

Lavrakas, Paul J. Encyclopedia of Survey Research Methods. Sage, 2008.

Mackey, Allison, and Sue M. Gass. Second Language Research: Methodology and Design. Lawrence Erlbaum Associates, 2005.

Merriam, Sharan B., and Elizabeth J. Tisdell. Qualitative Research: A Guide to Design and Implementation, John Wiley & Sons, Incorporated, 2015. ProQuest Ebook Central, https://ebookcentral.proquest.com/lib/unco/detail.action?docID=2089475.

Michigan Corpus of Academic Spoken English. (n.d.). Retrieved April 11, 2021, from https://quod.lib.umich.edu/cgi/c/corpus/corpus?c=micase;page=simple

Morse, Janice M. “‘Cherry Picking’: Writing from Thin Data.” Qualitative Health Research, vol. 20, no. 1, 2009, p. 3.

Pew Research Center. (2021). Retrieved April 11, 2021, from https://www.pewresearch.org/

Saldaña, Johnny. The Coding Manual for Qualitative Researchers, 2nd edition, Sage, 2013.

Scott, Greg, and Roberta Garner. Doing Qualitative Research: Designs, Methods, and Techniques, 1st edition, Pearson, 2012.

Sheard, Judithe. “Quantitative Data Analysis.” Research Methods Information, Systems, and Contexts, edited by Kirsty Williamson and Graeme Johanson, Elsevier, 2018, pp. 429-452.

Teens and Social Media, Google Trends, trends.google.com/trends/explore?-date=all&q=teens%20and%20social%20media. Accessed 15 Jul. 2020.

“What is Primary Research and How Do I Get Started?” The Writing Lab and OWL at Purdue and Purdue U, 2020. owl.purdue.edu/owl. Accessed 21 Dec. 2020.

Zhao, Alice. “How Text Messages Change from Dating to Marriage.” Huffington Post, 21 Oct. 2014, www.huffpost.com.

Appendix A

“My mom had to get a ride to the library to get what I have in my hand all the time. She reminds me of that a lot.” (Girl, age 14)

“Gives people a bigger audience to speak and teach hate and belittle each other.” (Boy, age 13)

“It provides a fake image of someone’s life. It sometimes makes me feel that their life is perfect when it is not.” (Girl, age 15)

“Because a lot of things created or made can spread joy.” (Boy, age 17)

“I feel that social media can make people my age feel less lonely or alone. It creates a space where you can interact with people.” (Girl, age 15)

“[Social media] allows us to communicate freely and see what everyone else is doing. [It] gives us a voice that can reach many people.” (Boy, age 15)

“It makes it harder for people to socialize in real life, because they become accustomed to not interacting with people in person.” (Girl, age 15)

“[Teens] would rather go scrolling on their phones instead of doing their homework, and it’s so easy to do so. It’s just a huge distraction.” (Boy, age 17)

“It enables people to connect with friends easily and be able to make new friends as well.” (Boy, age 15)

“I think social media have a positive effect because it lets you talk to family members far away.” (Girl, age 14)

“Because teens are killing people all because of the things they see on social media or because of the things that happened on social media.” (Girl, age 14)

“We can connect easier with people from different places and we are more likely to ask for help through social media which can save people.” (Girl, age 15)

“It has given many kids my age an outlet to express their opinions and emotions, and connect with people who feel the same way.” (Girl, age 15)

“People can say whatever they want with anonymity and I think that has a negative impact.” (Boy, age 15)

“It has a negative impact on social (in-person) interactions.” (Boy, age 17)

Teacher Resources for How to Analyze Data in a Primary Research Study

Overview and Teaching Strategies

This chapter is intended as an overview of analyzing qualitative research data and was written as a follow-up piece to Dana Lynn Driscoll’s “Introduction to Primary Research: Observations, Surveys, and Interviews” in Volume 2 of this collection. This chapter could work well for leading students through their own data analysis of a primary research project or for introducing students to the idea of primary research by using outside data sources, those in the chapter and provided in the activities below, or data you have access to.

In our experience, students usually have limited exposure to primary research methods outside of conducting a small survey for other courses, like sociology. We have found that few of our students have been formally introduced to primary research and analysis. Therefore, this chapter strives to briefly introduce students to primary research while focusing on analysis. We’ve presented analysis by categorizing data as open-ended and closed-ended without getting into too many details about qualitative versus quantitative. Our students tend to produce data collection tools with a mix of these types of questions, so we feel it’s important to cover the analysis of both.

In this chapter, we bring students real examples of primary data and lead them through analysis by showing examples. Any of these exercises and the activities below may be easily supplemented with additional outside data. One way that teachers can bring in outside data is through the use of public datasets.

Public Data Sets

There are many public data sets that teachers can use to acquaint their students with analyzing data. Be aware that some of these datasets are for experienced researchers and provide the data in CSV files or include metadata, all of which is probably too advanced for most of our students. But if you are comfortable converting this data, it could be valuable for a data analysis activity.

  • In the chapter, we pulled from Pew Research, and their website contains many free and downloadable data sets (Pew Research Center).
  • The site Data.gov provides searchable datasets, but you can also explore their data by clicking on “data” and seeing what kinds of reports they offer.
  • The U.S. Census Bureau offers some datasets as well (Explore Census Data). Much of this data is presented in reports, but teachers could pull information from reports and have students analyze the data and compare their results to those in the report, much like we did with the Pew Research data in the chapter.
  • Similarly, HealthData.gov offers research-based reports packed with data for students to analyze.
  • In one of the activities below, we used Google Trends to look at searches over a period of time. There are some interesting data and visuals provided on the homepage to help students get started.
  • If you’re looking for something a bit more academic, the Michigan Corpus of Academic Spoken English is a great database of transcripts from academic interactions and situations.
  • Similarly, the Corpus of Contemporary American English allows users to search for words or word strings to see their frequency and in which genre and when these occur.
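For teachers who do want to try one of the CSV downloads mentioned above, the sketch below shows one possible way to tally the answers in a single column using Python’s standard library. The CSV content, column names, and responses are invented for illustration and do not come from any of the datasets listed; with a real download, the inline text would be replaced by the opened file.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export; the column names and rows are invented for illustration.
csv_text = """respondent_id,uses_youtube
1,Yes
2,Yes
3,No
4,Sometimes
5,Yes
"""

# DictReader treats the first row as column headers.
reader = csv.DictReader(io.StringIO(csv_text))

# Tally how many respondents gave each answer.
answers = Counter(row["uses_youtube"] for row in reader)

for answer, count in answers.most_common():
    print(answer, count)
```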

Before moving on to student activities, we’d like to offer one additional suggestion for teachers to consider.

Class Google Form

One thing that Melody does in almost all of her research-based writing courses is ask students to complete a Google Form at the beginning of the semester. Sometimes, these forms are about their experiences with research. Other times, they revolve around a class topic (recently, she’s been interested in Generation Z or iGeneration and has asked students questions related to that). Then, when it’s time to start thinking about primary research, she uses that Google Form to help students understand more about the primary research process. Here are some ways that teachers can use the data gathered from a Google Form given to students.

  • Ask students to look at the questions asked on the survey and deduce the overall research question.
  • Ask students to look at the types of questions asked (open- and closed-ended) and consider why they were constructed that way.
  • Ask students to evaluate the wording of the questions asked.
  • Ask students to examine the results of a few (or more) of the questions on the survey. This can be done in groups with each group looking at 1-3 questions, depending on the size of your Google Form.
  • Ask students to think about how they might present that data in visual form. Yes, Google provides some visuals, but you can give them the raw data and see what they come up with.
  • Ask students to come up with 1-3 major takeaways based on all the data.

This exercise allows students to work with real data and data that’s directly related to them and their classmates. It’s also completely within ethical boundaries because it’s data collected in the classroom, for educational purposes, and it stays within the classroom.

Below we offer some guiding questions to help move students through the chapter and the activities as well as some additional activities.

Discussion Questions

  • In the opening of this chapter, we introduced you to primary research, or “any type of research you collect yourself” (“What is Primary Research”). Have you completed primary research before? How did you decide on your research method, based on your research question? If you have not worked on primary research before, brainstorm a potential research question for a topic you want to know more about. Discuss what research method you might use, including closed- or open-ended methods and why.
  • Looking at the chart from the Pew Research dataset, “Teens, Social Media & Technology 2018,” would you agree that the distributions among online platforms remain similar, or have trends changed?
  • What do you make of the “none of the above” category on the Pew table? Do you think teens are using online platforms that aren’t listed, or do you think those respondents don’t use any online platforms?

google trends for "social media"

  • When analyzing data from open-ended questions, which step seems most challenging to you? Explain.

Activity #1: TurnItIn and Infographics

Infographics can be a great way to help you see and understand data, while also giving you a way to think about presenting your own data. Multiple infographics are available on TurnItIn, downloadable for free, that provide information about plagiarism.

Figure 3, titled “The Plagiarism Spectrum,” shows the “severity” and “frequency” of each type of plagiarism, based on survey findings from nearly 900 high school and college instructors from around the world. TurnItIn encourages educators to print this infographic and hang it in their classroom:

plagiarism spectrum

This infographic provides some great data analysis examples: specific categories with definitions (and visual representation of their categories), frequency counts with bar graphs, and color gradient bars to show higher vs. lower numbers.

  • Write a summary of how this infographic presents data.
  • How do you think they analyzed the data based on this visual?

Activity #2: How Text Messages Change from Dating to Marriage

In Alice Zhao’s Huffington Post piece, she analyzes text messages that she collected during her relationship with her boyfriend, turned fiancé, turned husband to answer the question of how text messages (or communication) change over the course of a relationship. While Zhao offers some insight into her data, she also provides readers with some really cool graphics that you can use to practice your analysis skills.

These first graphics are word clouds. In figure 4, Zhao put her textual data into a program that creates these images based on the most frequently occurring words. Word clouds are another option for analyzing your data. If you have a lot of textual data and want to know what participants said the most, placing your data into a word cloud program is an easy way to “see” the data in a new way. This is usually one of the first steps of analysis, and additional analysis is almost always needed.
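A word cloud is built on a simple frequency count, and that count can be reproduced directly. The following is a minimal sketch rather than the tool Zhao used: the sample messages and the short stop-word list are hypothetical, and `top_words` is a name chosen here for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "i", "you", "it", "is"}

def top_words(messages, n=5):
    """Return the n most frequent non-stop words across all messages."""
    counts = Counter()
    for message in messages:
        # Lowercase the text and keep runs of letters/apostrophes as words.
        words = re.findall(r"[a-z']+", message.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical messages, not Zhao's data.
sample_messages = [
    "Hey, are you home?",
    "Hey! Dinner tonight?",
    "Ok, home soon",
]
print(top_words(sample_messages, n=3))
```

A word cloud program then scales each word by its count. As noted above, the counts are only a starting point, because they say nothing about the context in which a word was used.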

Zhao’s Word Cloud Sampling

  • What do you notice about the texts from 2008 to 2014?
  • What do you notice between her texts (me) and his texts (him)?

Zhao also provided this graphic (figure 5), a comparative look at what she saw as the most frequently occurring words from the word clouds. This could be another step in your data analysis procedure: zooming in on a few key aspects and digging a bit deeper.

Zhao’s Bar Graph

  • What do you make of this data? Why might the word “hey” occur more frequently in the dating time frame and the word “ok” occur more frequently in the married time frame?

As part of her research, Zhao also looked at the time of day text messages were sent, shown below in figure 6:

Zhao’s Plot Graph of Time of Day

Here, Zhao looked at messages sent a month after their first date, a month after their engagement, and a month after their wedding.
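If your own data carries timestamps, a time-of-day view like this one starts from a count of messages per hour. The sketch below assumes timestamps stored as ISO-format strings; the sample values are hypothetical and `messages_per_hour` is an illustrative name.

```python
from collections import Counter
from datetime import datetime

def messages_per_hour(timestamps):
    """Count messages for each hour of the day (0-23)."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    # Fill in zeros so every hour appears, which makes plotting easier.
    return [hours.get(h, 0) for h in range(24)]

# Hypothetical timestamps in ISO format, not Zhao's data.
sample = ["2008-10-01T15:05:00", "2008-10-01T15:40:00", "2008-10-02T23:10:00"]
counts = messages_per_hour(sample)
print(counts[15], counts[23])
```

Running the same count separately for each period (for example, the month after a first date and the month after a wedding) gives the series you would compare in a plot.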

  • She offers her own interpretation of figure 6 in her piece, but what do you think of this?
  • Also make note of this graphic. It’s a great way to look at the data another way. If your data may be time sensitive, this type of graphic may help you better analyze and understand your data.
  • This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0) and is subject to the Writing Spaces Terms of Use. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ , email [email protected] , or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. To view the Writing Spaces Terms of Use, visit http://writingspaces.org/terms-of-use .

How to Analyze Data in a Primary Research Study Copyright © 2021 by Melody Denny and Lindsay Clark is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License , except where otherwise noted.


Chapter 4 – Data Analysis and Discussion (example)

Disclaimer: This is not a sample of our professional work. The paper has been produced by a student. You can view samples of our work here . Opinions, suggestions, recommendations and results in this piece are those of the author and should not be taken as our company views.

Type of Academic Paper – Dissertation Chapter

Academic Subject – Marketing

Word Count – 2964 words

Reliability Analysis

Before conducting any analysis, the reliability of the data was assessed using Cronbach’s Alpha. The reliability analysis was performed on the complete questionnaire data. The Cronbach’s Alpha was found to be 0.922, as shown in table 4.1 below. The complete output of the reliability analysis is given in the appendix.

Table 4.1: Reliability Analysis (N=200)

A Cronbach’s Alpha value of 0.7 or above is generally considered acceptable, and values above 0.9 are commonly described as excellent. The Cronbach’s Alpha value of the data was found to be 0.922; therefore, the questionnaire data had excellent reliability. The value was computed across all 29 items of the questionnaire, which indicates that the items are internally consistent and suitable for further analysis.
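For readers who want to see what the statistic measures, Cronbach’s Alpha compares the sum of the individual item variances with the variance of the respondents’ total scores. The sketch below is a minimal illustration under stated assumptions: the responses are hypothetical (the study’s raw data are not given), and `cronbach_alpha` is an illustrative name rather than an SPSS function.

```python
from statistics import variance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (all the same length)."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical responses from 4 respondents to 3 Likert items.
responses = [[1, 2, 1], [2, 3, 3], [4, 4, 5], [5, 5, 4]]
print(round(cronbach_alpha(responses), 3))
```

Alpha approaches 1 when the items rise and fall together across respondents. It describes internal consistency and is not a percentage of correct results.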

Frequency Distribution Analysis

First, the frequency distribution analysis was performed on the demographic variables using SPSS to identify the respondents’ demographic composition. Section 1 of the questionnaire had 5 demographic questions to identify the gender, age group, annual income, marital status, and education level of the research sample. The frequency distribution results shown in table 4.2 below indicated that there were 200 respondents in total, of whom 50% were male and 50% were female. This shows that the research sample was balanced by gender, as males and females had equal representation in the sample.

Moreover, the frequency distribution analysis covered three age groups: ‘20-35’, ‘36-60’ and ‘Above 60’. 39% of the respondents belonged to the ‘20-35’ age group, 56.5% belonged to the ‘36-60’ age group, and the remaining 4.5% belonged to the ‘Above 60’ age group.

Furthermore, the annual income level was divided into four categories. The income values were in GBP. It was found that 13% of the respondents had income ‘up to 30000’, 27% had income between ‘31000 to 50000’, 52.5% had income between ‘51000 to 100000’, and 7.5% had income ‘Above 100000’. This suggests that most of the respondents had an annual income between ‘51000 to 100000’ GBP.
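The reported percentages can be converted back into respondent counts as a quick consistency check. The sketch below uses the income percentages and sample size stated in the text; the function name is illustrative.

```python
def percentages_to_counts(percentages, n):
    """Convert category percentages into respondent counts for a sample of size n."""
    return {label: round(n * pct / 100) for label, pct in percentages.items()}

# Income percentages as reported in the text (N=200, values in GBP).
income_pct = {
    "up to 30000": 13.0,
    "31000 to 50000": 27.0,
    "51000 to 100000": 52.5,
    "above 100000": 7.5,
}
counts = percentages_to_counts(income_pct, n=200)
modal = max(counts, key=counts.get)   # category with the most respondents
print(counts, modal)
```

The counts should add up to the sample size, and the modal category is the one to report as the most common.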

The frequency distribution analysis indicated that 61% of respondents were single, while 39% were married, as indicated in table 4.2. This means that most of the respondents were single. The education level of the respondents was analyzed using four categories: diploma, graduate, master, and doctorate. The results showed that 37% of the respondents were diploma holders, 46% were graduates, 16% had master-level education, and only 2% had a doctorate. This suggests that most of the respondents were either graduates or diploma holders.

Table 4.2: Frequency Distribution of the Demographic Characteristics of the respondents (N=200)

Multiple Regression Analysis

The hypotheses were tested using multiple linear regression analysis to determine which of the independent variables had a significant positive effect on customer loyalty towards five-star hotel brands. The results of the regression analysis are summarized in table 4.3. The complete SPSS output of the regression analysis is given in the appendix.

Table 4.3: Multiple regression analysis showing the predictive values of the independent variables (brand image, corporate identity, public relation, perceived quality, and trustworthiness) on customer loyalty (N=200)

Predictors: (Constant), Trustworthiness, Public Relation, Brand Image, Corporate Identity, Perceived Quality

Dependent Variable: Customer Loyalty

The significance value (p-value) of the ANOVA was found to be 0.000, as shown in the above table, which is less than 0.05. This suggested that the regression model fitted the data significantly. Moreover, the adjusted R-Square value was 0.897, which indicated that the model’s predictors explained 89.7% of the variation in customer loyalty.

Furthermore, the significance of the effect of each of the 5 predicting variables on customer loyalty was identified based on its sig. value. The effect of a predicting variable is considered significant if its sig. value is less than 0.05 or, as a rule of thumb, if the absolute value of its t-statistic is greater than 2. It was found that the variable ‘brand image’ had sig. value (0.046), the variable ‘corporate identity’ had sig. value (0.482), the variable ‘public relation’ had sig. value (0.400), the variable ‘perceived quality’ had sig. value (0.000), and the variable ‘trustworthiness’ had sig. value (0.652).
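The decision rule described here is a simple threshold comparison. The sketch below applies it to the sig. values reported in the text; the function and variable names are illustrative.

```python
ALPHA = 0.05  # significance level used in the text

# Sig. values as reported in the text for each predictor.
sig_values = {
    "brand image": 0.046,
    "corporate identity": 0.482,
    "public relation": 0.400,
    "perceived quality": 0.000,   # SPSS shows very small p-values as 0.000
    "trustworthiness": 0.652,
}

def significant_predictors(p_values, alpha=ALPHA):
    """Return the predictors whose p-value is below alpha."""
    return [name for name, p in p_values.items() if p < alpha]

print(significant_predictors(sig_values))
```

Note that ‘brand image’ passes the threshold only narrowly (0.046), so its effect is much less certain than that of ‘perceived quality’.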


Hypotheses Assessment

Based on the regression analysis, it was found that brand image and perceived quality have a significant positive effect on customer loyalty. In contrast, corporate identity, public relations, and trustworthiness have an insignificant effect on customer loyalty. Therefore, the two hypotheses H1 and H4 were accepted, whereas the three hypotheses H2, H3, and H5 were rejected, as indicated in table 4.4.

Table 4.4: Hypothesis Assessment Summary Table (N=200)

The insignificant variables (corporate identity, public relation, and trustworthiness) were excluded from model equation 1. After excluding them, the final equation becomes as follows:

Customer loyalty = α + 0.074 (Brand image) + 0.991 (Perceived quality) + ε

The above equation suggests that a 1 unit increase in brand image is likely to result in a 0.074 unit increase in customer loyalty, while a 1 unit increase in perceived quality is likely to result in a 0.991 unit increase in customer loyalty.
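The interpretation of the coefficients can be checked with a few lines of code. Because the value of the constant α is not reported in the text, the sketch below computes only the predicted change in customer loyalty, not its level; the function name is illustrative.

```python
# Coefficients from the final equation in the text.
B_BRAND_IMAGE = 0.074
B_PERCEIVED_QUALITY = 0.991

def loyalty_change(d_brand_image, d_perceived_quality):
    """Predicted change in customer loyalty for given changes in the two predictors."""
    return B_BRAND_IMAGE * d_brand_image + B_PERCEIVED_QUALITY * d_perceived_quality

print(loyalty_change(1, 0))   # one-unit increase in brand image only
print(loyalty_change(0, 1))   # one-unit increase in perceived quality only
print(loyalty_change(1, 1))   # one-unit increase in both
```

The comparison makes the relative sizes plain: under this model, a one-unit change in perceived quality moves predicted loyalty more than thirteen times as much as a one-unit change in brand image.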

Cross Tabulation Analysis

To further explore the results, the demographic variables were cross-tabulated against the respondents’ answers regarding customer loyalty using SPSS. The five demographic variables (gender, age group, annual income, marital status, and education level) were cross-tabulated against the five questions regarding customer loyalty to identify how loyalty towards five-star hotels of the UK differs across demographic groups. The results of the cross-tabulation analysis are given in the appendix, where they are also presented graphically in bar charts.

Cross Tabulation of Gender against Customer Loyalty

Gender was cross-tabulated against questions 1 to 5 of the questionnaire to identify the differences between male and female respondents’ answers regarding customer loyalty towards five-star hotels of the UK. The results indicated that out of 100 males, 57% strongly agreed that they stay at one hotel, while out of 100 females, 80% strongly agreed that they stay at one hotel. This shows that, in comparison with males, females agreed more strongly that they stay at one hotel and were found to be more loyal towards their respective hotel brands.

The cross-tabulation results further indicated that out of 100 males, 53% agreed that they always say positive things about their respective hotel brand to other people. In contrast, out of 100 females, 77% strongly agreed. Based on the results, females were found to be in more agreement than males that they always say positive things about their respective hotel brand to other people.

It was further found that out of 100 males, 53% strongly agreed that they recommend their hotel brand to others, whereas out of 100 females, 74% strongly agreed with this statement. This result also suggested that females were more in agreement than males that they recommend their hotel brand to others.

Moreover, it was found that out of 100 males, 54% strongly agreed that they don’t seek alternative hotel brands, while out of 100 females, 79% strongly agreed with this statement. This result also suggested that females agreed more than males that they don’t seek alternative hotel brands, and so were found to be more loyal than males.

Furthermore, it was identified that out of 100 male respondents, 56% strongly agreed that they would continue to go to the same hotel irrespective of the prices, whereas out of 100 females, 79% strongly agreed. Based on this result, females agreed more than males that they would continue to go to the same hotel irrespective of the prices, so females were found to be more loyal than males.

After cross-tabulating ‘gender’ against the responses to the 5 questions regarding customer loyalty, females were found to be more loyal customers of the five-star hotel brands than males, as they were more in agreement than males that they stay at one hotel, always say positive things about their hotel brand to other people, recommend their hotel brand to others, don’t seek alternative hotel brands, and would continue to go to the same hotel irrespective of the prices.
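The gender comparison rests on differences in percentages alone. One way to check whether such differences exceed what chance would produce is a chi-square test on each 2x2 table; this is a supplementary sketch, not an analysis reported in the chapter. It assumes each question is collapsed into “top agreement category” versus “all other answers”, uses the percentages reported above with 100 respondents per gender, and uses illustrative function and variable names.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Chi-square critical value for 1 degree of freedom at the 0.05 level.
CRITICAL_VALUE = 3.841

# Count in the top agreement category (male, female) per loyalty question,
# taken from the percentages in the text (100 respondents per gender).
top_agreement = {"Q1": (57, 80), "Q2": (53, 77), "Q3": (53, 74),
                 "Q4": (54, 79), "Q5": (56, 79)}

results = {}
for question, (male, female) in top_agreement.items():
    # Rows are genders; columns are "top category" and "all other answers".
    results[question] = chi_square_2x2(male, 100 - male, female, 100 - female)
    print(question, round(results[question], 2), results[question] > CRITICAL_VALUE)
```

A statistic above the critical value suggests that the gap between male and female respondents on that question is unlikely to be due to sampling variation alone.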

Cross Tabulation of Age Group against Customer Loyalty

Afterward, the second demographic variable, ‘age group’, was cross-tabulated against questions 1 to 5 of the questionnaire to identify the differences in customer loyalty among customers of different age groups. The results indicated that out of 78 respondents between 20 and 35 years of age, 61.5% strongly agreed that they always stay at one hotel. Out of 113 respondents between 36 and 60 years of age, 72.6% strongly agreed that they always stay at one hotel, and out of 9 respondents above 60 years of age, 77.8% agreed that they always stay at one hotel. This indicated that customers in the 36-60 and above 60 age groups were more loyal to their hotel brands, as they were keener to stay with one hotel brand.

Content removed…

Cross Tabulation of Annual Income against Customer Loyalty

The third demographic variable, ‘annual income’, was cross-tabulated against questions 1 to 5 of the questionnaire to identify which customers were most loyal based on their respective annual income levels. The results indicated that out of 26 respondents who had annual income up to 30000 GBP, 84.6% strongly agreed that they always stay at one hotel. Out of 54 respondents who had annual income from 31000 to 50000 GBP, 98.1% agreed that they always stay at one hotel. Out of 105 respondents who had annual income from 51000 to 100000 GBP, 49.5% strongly agreed that they always stay at one hotel, while out of 15 respondents who had annual income above 100000 GBP, 66.7% agreed that they always stay at one hotel. This indicated that customers with annual income from 31000 to 50000 GBP were more loyal to their hotel brands than customers at other annual income levels.

Cross Tabulation of Marital Status against Customer Loyalty

Furthermore, the fourth demographic variable, ‘marital status’, was cross-tabulated against questions 1 to 5 of the questionnaire to understand the differences between married and unmarried respondents regarding customer loyalty towards five-star hotels of the UK. The cross-tabulation analysis results indicated that out of 122 single respondents, 59.8% strongly agreed that they stay at one hotel, whereas out of 78 married respondents, around 82% agreed that they stay at one hotel. Thus, married customers were more loyal to their hotel brands than unmarried customers, because a larger share of married customers prefer to stay with one hotel brand.

Likewise, out of 122 single respondents, 55.7% strongly agreed that they always say positive things about their hotel brands to other people, whereas out of 78 married respondents, 79.5% strongly agreed. Hence, it can be said that married customers have more customer loyalty, as they are in more agreement than singles that they always give positive feedback about their respective hotel brand to other people.

Cross Tabulation of Education Level against Customer Loyalty

Subsequently, the fifth demographic variable, ‘education level’, was cross-tabulated against questions 1 to 5 of the questionnaire to identify which customers were most loyal based on their respective education levels. The results indicated that 67.6% of the diploma holders (50 respondents) strongly agreed that they always stay at one hotel, as did 69.6% of the graduates (64 respondents) and 68.8% of the respondents with a master’s degree (22 respondents). However, out of 2 respondents with doctorates, 50% strongly agreed that they always stay at one hotel. This indicated that customers who were graduates were slightly more loyal than customers with diplomas, master’s degrees, or doctorates.

Moreover, 66.2% of the diploma holders strongly agreed that they always say positive things about their hotel brand to other people, compared with 64.1% of the graduates and 65.5% of the respondents with a master’s degree, while 50% of the respondents with doctorates agreed with the statement. Based on this result, diploma holders showed the highest agreement, although the differences among diploma holders, graduates, and master’s degree holders were small.


In this subsection, the findings of this study are compared and contrasted with the literature to identify which of the past research supports the present research findings. This present study based on regression analysis suggested that brand image can have a significant positive effect on the customer loyalty of five-star hotels in the UK. This finding was supported by the research of Heung et al. (1996), who also suggested that the hotel’s brand image can play a vital role in preserving a high ratio of customer loyalty.

Moreover, the present study also suggested that perceived quality was the second factor with a significant positive effect on customer loyalty. Perceived quality was evaluated based on service quality, comfort, staff courtesy, customer satisfaction, and service quality expectations. In this regard, the research of Tat and Raymond (2000) supports the findings of this study: staff service quality was found to affect customer loyalty and the level of satisfaction. Teas (1994) also found that service quality affects customer loyalty, and further found that staff empathy (staff courtesy) towards customers can affect customer loyalty. The research of Rowley and Dawes (1999) also supports the finding of the present study: users’ expectations about the quality and nature of the services affect customer loyalty. A study by Oberoi and Hales (1990) likewise agrees with the present study’s findings, as they found that the quality of staff service affects customer loyalty.

Summary of the Findings

  • The brand image was found to have a significant positive effect on customer loyalty. Therefore customer loyalty is likely to increase with the increase in brand image.
  • The corporate identity was found to have an insignificant effect on customer loyalty. Therefore customer loyalty is not likely to increase with the increase in corporate identity.
  • Public relations was found to have an insignificant effect on customer loyalty. Therefore customer loyalty is not likely to increase with the increase in public relations.
  • Perceived quality was found to have a significant positive effect on customer loyalty. Therefore customer loyalty is likely to increase with the increase in perceived quality.
  • Trustworthiness was found to have an insignificant effect on customer loyalty. Therefore customer loyalty is not likely to increase with the increase in trustworthiness.
  • The female customers were found to be more loyal customers of the five-star hotel brands than male customers.
  • The customers of age from 36 to 60 years were more loyal to their hotel brands than the customers of age from 20 to 35 and above 60.
  • The customers who had annual income from 31000 to 50000 were more loyal customers of their respective hotel brands than those who had an annual income level of less than 31000 or more than 50000.
  • The married respondents showed more customer loyalty towards five-star hotel brands of the UK than unmarried respondents.
  • The customers who had bachelor’s degrees and the customers who had master’s degrees were more loyal than the customers who had a diploma or doctorate.



Frequently Asked Questions

How to write the results chapter of a dissertation?

To write the Results chapter of a dissertation:

  • Present findings objectively.
  • Use tables, graphs, or charts for clarity.
  • Refer to research questions/hypotheses.
  • Provide sufficient details.
  • Avoid interpretation; save that for the Discussion chapter.







Generalized fused Lasso for grouped data in generalized linear models

  • Original Paper
  • Open access
  • Published: 25 May 2024
  • Volume 34, article number 124 (2024)


  • Mineaki Ohishi 1  


Generalized fused Lasso (GFL) is a powerful method based on adjacent relationships or the network structure of data. It is used in a number of research areas, including clustering, discrete smoothing, and spatio-temporal analysis. When applying GFL, the specific optimization method used is an important issue. In generalized linear models, efficient algorithms based on the coordinate descent method have been developed for trend filtering under the binomial and Poisson distributions. However, to apply GFL to other distributions, such as the negative binomial distribution, which is used to deal with overdispersion in the Poisson distribution, or the gamma and inverse Gaussian distributions, which are used for positive continuous data, an algorithm for each individual distribution must be developed. To unify GFL for distributions in the exponential family, this paper proposes a coordinate descent algorithm for generalized linear models. To illustrate the method, a real data example of spatio-temporal analysis is provided.


1 Introduction

Assume we have grouped data such that \(y_{j 1}, \ldots , y_{j n_j}\) are observations of the j th group ( \(j \in \{ 1, \ldots , m \}\) ) for m groups. Further, assume the following generalized linear models (GLMs; Nelder and Wedderburn 1972 ) with canonical parameter \(\theta _{ji}\ (i \in \{ 1, \ldots , n_j \})\) and dispersion parameter \(\phi > 0\) :

where \(y_{j i}\) is independent with respect to j and i , \(a_{ji}\) is a constant defined by

\(a (\cdot ) > 0\) , \(b (\cdot )\) , and \(c (\cdot )\) are known functions, and \(b (\cdot )\) is differentiable.

The \(\theta _{ji}\) has the following structure:

where \(h (\cdot )\) is a known differentiable function, \(\beta _j\) is an unknown parameter, and \(q_{ji}\) is a known term called the offset, which is zero in many cases. Although \(\theta _{ji}\) depends not only on the group but also on the individual, the j th group is characterized by a common parameter \(\beta _j\) . We are thus interested in describing the relationship among the m groups. Here, the expectation of \(y_{ji}\) is given by

where \(\mu (\cdot )\) is a known function and \(\dot{b} (\cdot )\) is the derivative of \(b (\cdot )\), i.e., \(\dot{b} (\theta ) = d b (\theta ) / d \theta \). Furthermore, \(\mu ^{-1} (\cdot )\) is a link function, and \(h (\cdot )\) is an identity function, i.e., \(h (\eta ) = \eta \), when \(\mu ^{-1} (\cdot )\) is a canonical link. Tables 1, 2, and 3 summarize the relationships between model (1) and each individual distribution. In this paper, we consider clustering for m groups or discrete smoothing via generalized fused Lasso (GFL; e.g., Höfling et al. 2010; Ohishi et al. 2021).
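The relationships described in this passage can be written compactly. The following is a sketch consistent with the definitions given here, assuming that the offset enters the linear predictor additively; the symbol \(\eta _{ji}\) for the linear predictor is introduced only for illustration.

```latex
\begin{align*}
\eta_{ji} &= \beta_j + q_{ji}, \\
\theta_{ji} &= h(\eta_{ji}), \\
\mathrm{E}[y_{ji}] &= \dot{b}(\theta_{ji}) = \mu(\eta_{ji}).
\end{align*}
```

Under a canonical link, \(h (\eta ) = \eta \), so \(\mu (\cdot )\) coincides with \(\dot{b} (\cdot )\). For a Poisson distribution, for instance, \(b (\theta ) = \exp (\theta )\) and the expectation becomes \(\exp (\beta _j + q_{ji})\).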

GFL is an extension of fused Lasso (Tibshirani et al. 2005 ) which can incorporate relationships among multiple variables, such as adjacent relationships and network structure, into parameter estimation. For example, Xin et al. ( 2014 ) applied GFL to the diagnosis of Alzheimer’s disease by expressing the structure of structural magnetic resonance images of human brains as a 3D grid graph; Ohishi et al. ( 2021 ) applied GFL to model spatial data based on geographical adjacency. Although the GFL in these particular instances is based on one factor (brain structure or a geographical relationship), it can deal with relationships based on multiple factors. For example, we can define an adjacent relationship for spatio-temporal cases based on two factors by combining geographical adjacency and the order of time. Yamamura et al. ( 2021 ), Ohishi et al. ( 2022 ), and Yamamura et al. ( 2023 ) dealt with multivariate trend filtering (e.g., Tibshirani 2014 ) based on multiple factors via GFL and applied it to the estimation of spatio-temporal trends. Yamamura et al. ( 2021 ) and Ohishi et al. ( 2022 ) used a logistic regression model, which coincides with model ( 1 ) when \(n_j = 1, q_{j i} = 0\ (\forall j \in \{ 1, \ldots , m \}; \forall i \in \{ 1, \ldots , n_j \})\) under a binomial distribution. Since this relationship holds by the reproductive property of the binomial distribution, their methods can also be applied to grouped data. Yamamura et al. ( 2023 ) used a Poisson regression model, which coincides with model ( 1 ) when \(n_j = 1\ (\forall j \in \{ 1, \ldots , m \})\) under a Poisson distribution. As is the case for Yamamura et al. ( 2021 ) and Ohishi et al. ( 2022 ), the method of Yamamura et al. ( 2023 ) can also be applied to grouped data from the reproductive property of the Poisson distribution. Yamamura et al. ( 2021 ), Ohishi et al. ( 2022 ) and Yamamura et al. ( 2023 ) proposed coordinate descent algorithms to obtain the GFL estimator. 
Although optimization problems for GLMs, such as logistic and Poisson regression models, are generally solved by linear approximation, Ohishi et al. ( 2022 ) and Yamamura et al. ( 2023 ) directly minimize coordinate-wise objective functions and derive update equations of a solution in closed form. Although Yamamura et al. ( 2021 ) minimized the coordinate-wise objective functions using linear approximation, Ohishi et al. ( 2022 ) showed numerically that direct minimization can provide the solution faster and more accurately than minimization using a linear approximation. Ohishi et al. ( 2021 ) also derived an explicit update equation for the coordinate descent algorithm, which corresponds to model ( 1 ) under the Gaussian distribution. As described, coordinate descent algorithms have been developed to produce GFL estimators for three specific distributions; however, none have been proposed for other distributions. For example, we have an option of using the negative binomial distribution to deal with overdispersion in the Poisson distribution (e.g., Gardner et al. 1995 ; Ver Hoef and Boveng 2007 ), or the gamma or inverse Gaussian distribution for positive continuous data. To apply GFL to these distributions, it is necessary to derive update equations for each distribution individually.

In this paper, we propose a coordinate descent algorithm to obtain GFL estimators for model ( 1 ) in order to unify the GFL approach for distributions in the exponential family. The negative log-likelihood function for model ( 1 ) is given by

We estimate parameter vector \({\varvec{\beta }}= (\beta _1, \ldots , \beta _m)'\) by minimizing the following function defined by removing terms that do not depend on \({\varvec{\beta }}\) from the above equation and by adding a GFL penalty:

where \(\lambda \) is a non-negative tuning parameter, \(D_j \subseteq \{ 1, \ldots , m \} \backslash \{ j \}\) is an index set expressing the adjacent relationship among groups and satisfying \(\ell \in D_j \Leftrightarrow j \in D_\ell \), and \(w_{j \ell }\) is a positive weight satisfying \(w_{j \ell } = w_{\ell j}\). The GFL penalty shrinks the difference between two adjacent groups \(|\beta _j - \beta _\ell |\) and often gives a solution satisfying \(|\beta _j - \beta _\ell | = 0\ (\Leftrightarrow \beta _j = \beta _\ell )\). That is, GFL can estimate some parameters to be exactly equal, thus enabling the clustering of m groups or the accomplishment of discrete smoothing. To obtain the GFL estimator for \({\varvec{\beta }}\), we minimize the objective function (3) via a coordinate descent algorithm. As in Ohishi et al. (2022) and Yamamura et al. (2023), we directly minimize coordinate-wise objective functions without the use of approximations. For ordinary situations, where a canonical link (\(h (\eta ) = \eta \)) is used and there is no offset (\(q_{j i} = 0\)), and for several other situations, the update equation of a solution can be derived in closed form.
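To make the effect of the penalty concrete, the following sketch evaluates a GFL-penalized objective for one special case: grouped Poisson data with the canonical log link and no offset. It is an illustration under stated assumptions, not the paper's implementation or the GFLglm API. In particular, it assumes \(a_{ji} = 1\), drops terms that do not depend on \({\varvec{\beta }}\), and sums the penalty over every \(j\) and every \(\ell \in D_j\); the function name and data are hypothetical.

```python
import math

def gfl_poisson_objective(beta, y, neighbors, lam, weights=None):
    """Penalized negative log-likelihood (up to constants) for grouped Poisson data.

    beta[j]      : parameter of group j (log of the group mean)
    y[j]         : list of counts observed in group j
    neighbors[j] : indices of the groups adjacent to group j (symmetric)
    weights      : optional dict {(j, l): w}; defaults to 1 for every pair
    """
    # Poisson with canonical log link and no offset: b(theta) = exp(theta).
    fit = sum(math.exp(beta[j]) - y_ji * beta[j]
              for j, group in enumerate(y) for y_ji in group)
    # Fused penalty; summing over j and l in D_j visits each adjacent pair
    # from both sides (an assumption of this sketch).
    penalty = sum((weights or {}).get((j, l), 1.0) * abs(beta[j] - beta[l])
                  for j, adj in enumerate(neighbors) for l in adj)
    return fit + lam * penalty

# Hypothetical example: three groups on a chain 0 - 1 - 2.
neighbors = [[1], [0, 2], [1]]
y = [[2, 3], [2, 4], [9, 11]]
beta_free = [math.log(2.5), math.log(3.0), math.log(10.0)]     # group-wise means
beta_fused = [math.log(2.75), math.log(2.75), math.log(10.0)]  # groups 0 and 1 fused
for lam in (0.0, 1.0):
    print(lam,
          gfl_poisson_objective(beta_free, y, neighbors, lam),
          gfl_poisson_objective(beta_fused, y, neighbors, lam))
```

With \(\lambda = 0\) the unfused group-wise estimates give the smaller objective value, whereas with \(\lambda = 1\) the solution in which groups 0 and 1 share a parameter is preferred. This is the clustering behaviour described above.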

Table 4 summarizes the relationships between each individual distribution and its update equation. Here, \(\bigcirc \) indicates that the update equation can be obtained in closed form, and \(\times \) indicates that it cannot. Even when the update equation cannot be obtained in closed form, the proposed method can specify an interval that includes the solution, which means we can easily obtain the solution by a simple numerical search. Note that the proposed method is provided via the R package GFLglm (Ohishi 2024). The dataset used in the real data example is available via GFLglm.

As related work, Tang and Song ( 2016 ) proposed a fused Lasso approach to regression coefficient clustering, namely FLARCC. In FLARCC, regression coefficients are estimated by minimizing a negative log-likelihood function with a fused Lasso type penalty. However, our GFL approach in ( 3 ) shrinks and estimates parameters based on the adjacency relationship, whereas FLARCC restricts the pairs of parameters used in the penalty based on the order of the initial parameter values. Hence, FLARCC cannot be applied to minimize ( 3 ), and its purpose differs from ours. When the complete graph structure is used as the adjacency relationship in ( 3 ), however, the two methods share the same purpose, namely clustering without any constraint, although their objective functions differ. Devriendt et al. ( 2021 ) proposed the SMuRF algorithm, which is based on a proximal gradient method for multi-type penalized sparse regression and can be applied to minimize ( 3 ). As demonstrated in Ohishi et al. ( 2022 ), since the proximal gradient method involves an approximation of the objective function, its minimization procedure may be inefficient. We can therefore expect that our algorithm, which minimizes the objective function directly, provides the solution faster and more accurately than the SMuRF algorithm. Furthermore, Choi and Lee ( 2019 ) reported a serious phenomenon in the fused Lasso approach for the binomial distribution: fusion among parameters does not occur. Such a phenomenon can occur when modeling a discrete response, as in logistic regression and Poisson regression. Although our framework includes, as a special case, the situation in which this phenomenon occurs, it poses no practical concern.

The remainder of the paper is organized as follows: In Sect. 2 , we give an overview of the coordinate descent algorithm and derive the objective functions for each step. In Sect. 3 , we discuss the coordinate-wise minimization in the coordinate descent algorithm and derive update equations, which are available in closed form in many cases. In Sect. 4 , we evaluate the performance of the proposed method via numerical simulation. In Sect. 5 , we provide a real data example. Section 6 concludes the paper. Technical details are given in the Appendix.

2 Preliminaries

As in Ohishi et al. ( 2022 ) and Yamamura et al. ( 2023 ), we minimize the objective function ( 3 ) using a coordinate descent algorithm. Algorithm 1 gives an overview of the algorithm.

Algorithm 1 Overview of the coordinate descent algorithm

The descent cycle updates the parameters separately, and several parameters are often updated to be exactly equal. If several parameters are exactly equal, their updates can become stuck. To avoid this, the fusion cycle simultaneously updates equal parameters (see Friedman et al. 2007 ). In each cycle of the coordinate descent, the following function is essentially minimized:

where \(a_i\) and \(w_\ell \) are positive constants and \(z_\ell \ (\ell = 1, \ldots , r)\) are constants satisfying \(z_1< \cdots < z_{r}\) . The minimization of f ( x ) is described in Sect.  3 , and the following subsections show that an objective function in each cycle is essentially equal to f ( x ).

2.1 Descent cycle

The descent cycle repeats coordinate-wise minimizations of the objective function \(L ({\varvec{\beta }})\) in ( 3 ). To obtain a coordinate-wise objective function, we extract terms that depend on \(\beta _j\ (j \in \{ 1, \ldots , m \})\) from \(L ({\varvec{\beta }})\) . As described in Ohishi et al. ( 2021 ), the penalty term can be decomposed as

Then, only the first term depends on \(\beta _j\) . By regarding terms that do not depend on \(\beta _j\) as constants and removing them from \(L ({\varvec{\beta }})\) , the coordinate-wise objective function is obtained as

where \(\hat{\beta }_\ell \) indicates that \(\beta _\ell \) is fixed at its given value. By sorting elements of \(D_j\) in increasing order of \(\hat{\beta }_\ell \ (\forall \ell \in D_j)\) , we can see that \(L_j (\beta )\) essentially equals f ( x ) in ( 4 ). If there exist \(\ell _1, \ell _2 \in D_j\ (\ell _1 \ne \ell _2)\) such that \(\hat{\beta }_{\ell _1} = \hat{\beta }_{\ell _2}\) , we can temporarily redefine \(D_j\) and \(w_{j \ell }\) as

Since GFL estimates several parameters as being equal, this redefinition is required in most updates.
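The redefinition can be illustrated with a short Python sketch. It relies only on the identity \(w_1 |x - z| + w_2 |x - z| = (w_1 + w_2) |x - z|\) : neighbors whose current values coincide are collapsed into a single non-differentiable point whose weight is the sum of the tied weights. The function name merge_tied_neighbors is hypothetical, and the sketch is one possible reading of the redefinition rather than the author's exact formula.

```python
def merge_tied_neighbors(z, w):
    """Collapse equal neighbor values into one knot with summed weight.

    z: current values of the neighbors of one group (may contain ties).
    w: positive weights, same length as z.
    Returns (knots, knot_weights) with knots strictly increasing, which
    is the ordering z_1 < ... < z_r required by f(x).
    """
    merged = {}
    for value, weight in zip(z, w):
        merged[value] = merged.get(value, 0.0) + weight
    knots = sorted(merged)
    return knots, [merged[k] for k in knots]


# Hypothetical example: five neighbors, two pairs of which are tied.
knots, knot_weights = merge_tied_neighbors(
    [1.2, 0.3, 1.2, 0.3, 2.0], [1.0, 0.5, 2.0, 0.5, 1.0]
)
```

Exact floating-point comparison is used deliberately here, because fused parameters are set to the same value by the update rather than to nearly equal values.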

2.2 Fusion cycle

In the fusion cycle, equal parameters are replaced by a common parameter and \(L ({\varvec{\beta }})\) is minimized with respect to the common parameter. Let \(\hat{\beta }_1, \ldots , \hat{\beta }_m\) be current solutions for \(\beta _1, \ldots , \beta _m\) , and \(\hat{\xi }_1, \ldots , \hat{\xi }_t\ (t < m)\) be their distinct values. The relationship among the current solutions and their distinct values is specified as

That is, the following statements are true:

Then, the \(\beta _j\ (\forall j \in E_k)\) are replaced by a common parameter \(\xi _k\) and \(L ({\varvec{\beta }})\) is minimized with respect to \(\xi _k\) . Hence, to obtain a coordinate-wise objective function, we extract terms that depend on \(\xi _k\ (k = 1, \ldots , t)\) from \(L ({\varvec{\beta }})\) .
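As an illustration of the bookkeeping in the fusion cycle, the following Python sketch computes the distinct values \(\hat{\xi }_k\) and the index sets \(E_k\) from a vector of current solutions. The function name fusion_groups, the 0-based indices, and the choice to sort the distinct values are assumptions made for this example.

```python
def fusion_groups(beta_hat):
    """Return (xi, E): the distinct values of beta_hat and, for each
    distinct value, the list of indices j that share it."""
    index_sets = {}
    for j, value in enumerate(beta_hat):
        index_sets.setdefault(value, []).append(j)
    xi = sorted(index_sets)
    return xi, [index_sets[value] for value in xi]


# Hypothetical current solution with m = 5 groups and t = 2 distinct values.
xi, E = fusion_groups([0.4, 1.1, 0.4, 0.4, 1.1])
```

Each index set then shares one common parameter, so the fusion cycle in this example has two coordinates to update instead of five.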

We can decompose the first term of \(L ({\varvec{\beta }})\) as

Furthermore, as in Ohishi et al. ( 2021 ), the penalty term of \(L ({\varvec{\beta }})\) can be decomposed as

By regarding terms that do not depend on \(\xi _k\) as constants and removing them from \(L ({\varvec{\beta }})\) , the coordinate-wise objective function is obtained as

As in the descent cycle, we can see that \(L_k^*(\xi )\) essentially equals f ( x ) in ( 4 ).

3 Main results

In this section, to obtain update equations for the descent and fusion cycles of the coordinate descent algorithm, we describe the minimization of f ( x ) in ( 4 ). Following Ohishi et al. ( 2022 ) and Yamamura et al. ( 2023 ), we directly minimize f ( x ). One of the difficulties of the minimization of f ( x ) is that f ( x ) has multiple non-differentiable points \(z_1, \ldots , z_r\) . We cope with this difficulty by using a subdifferential. The subdifferential of f ( x ) at \(\tilde{x} \in \mathbb {R}\) is given by

where \(g_- (x)\) and \(g_+ (x)\) are left and right derivatives defined by

Then, \(\tilde{x}\) is a stationary point of f ( x ) if \(0 \in \partial f (\tilde{x})\) . For details of a subdifferential, see, e.g., Rockafellar ( 1970 ), Parts V and VI. In the following subsections, we separately describe the minimization of f ( x ) in cases where a canonical link and a general link are used.

3.1 Canonical link

We first describe the minimization of f ( x ) in ( 4 ) with a canonical link, i.e., \(h (\eta ) = \eta \) . That is, the update equation of the coordinate descent algorithm is given by minimizing the following function:

Notice that f ( x ) in ( 7 ) is strictly convex. Hence, \(\tilde{x}\) is the minimizer of f ( x ) if and only if \(0 \in \partial f (\tilde{x})\) . First, based on this relationship, we derive the condition that f ( x ) attains the minimum at a non-differentiable point \(z_\ell \) .

The subdifferential of f ( x ) at \(z_\ell \) is given by

Hence, if there exists \(\ell _\star \in \{ 1, \ldots , r \}\) such that \(0 \in \partial f (z_{\ell _\star })\) , f ( x ) attains the minimum at \(x = z_{\ell _\star }\) and \(\ell _\star \) uniquely exists because of the strict convexity of f ( x ).

On the other hand, when \(\ell _\star \) does not exist, we can specify an interval that includes the minimizer by checking the signs of the left and right derivatives at each non-differentiable point. Let \(s (x) = (\textrm{sign}(g_- (x)), \textrm{sign}(g_+ (x)))\) . From \(z_1< \cdots < z_r\) and the strict convexity of f ( x ), we have

Then, the minimizer of f ( x ) exists in the following interval:

Hence, it is sufficient to search for the minimizer in \(R_*\) . For all \(x \in R_*\) , the following equation holds:

This result allows us to rewrite the penalty term in f ( x ) as

Hence, f ( x ) is rewritten in non-absolute form as

The function f ( x ) is differentiable on \(R_*\) , and its derivative is given by

Then, the solution \(x_*\) of \(d f (x) / d x = 0\) is the minimizer of f ( x ). Hence, we have the following theorem.

Theorem 1

Let \(\hat{x}\) be the minimizer of f ( x ) in ( 7 ). Then, \(\hat{x}\) is given by

where \(\ell _*\) exists if and only if \(\ell _\star \) does not exist.

We can execute Algorithm 1 by applying Theorem  1 to ( 5 ) and ( 6 ) in the descent and fusion cycles, respectively. Thus, a detailed implementation of Algorithm 1 when using a canonical link is provided in Algorithm 2.
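The logic behind Theorem 1 can be sketched in Python for one concrete case. The sketch assumes a Poisson model with the canonical link and no offset, so that the smooth part of the coordinate-wise objective is \(\sum _i a_i (e^x - y_i x)\) , and it writes the penalty as \(c \sum _\ell w_\ell |x - z_\ell |\) with a generic constant c , because the exact constant in ( 7 ) is not reproduced here. The function name and argument names are hypothetical; this is an illustration, not the GFLglm implementation.

```python
import math


def minimize_poisson_coordinate(y, a, z, w, c):
    """Minimize sum_i a_i*(exp(x) - y_i*x) + c*sum_l w_l*|x - z_l| over x.

    Assumes a_i > 0, w_l > 0, c > 0 and z strictly increasing.
    """
    A = sum(a)
    S = sum(ai * yi for ai, yi in zip(a, y))
    W = sum(w)

    def smooth_derivative(x):
        return A * math.exp(x) - S

    lower = 0.0              # total weight of the knots already passed
    interval_slope = -c * W  # penalty slope on (-inf, z_1)
    for z_l, w_l in zip(z, w):
        left = smooth_derivative(z_l) + c * (2.0 * lower - W)
        right = smooth_derivative(z_l) + c * (2.0 * (lower + w_l) - W)
        if left <= 0.0 <= right:
            return z_l       # 0 lies in the subdifferential at this knot
        if left > 0.0:
            break            # the minimizer lies just left of this knot
        lower += w_l
        interval_slope = c * (2.0 * lower - W)
    # Stationary point of the differentiable piece on the located interval.
    return math.log((S - interval_slope) / A)


# Hypothetical data: three observations with unit constants a_i.
fused = minimize_poisson_coordinate([2, 3, 1], [1, 1, 1], [0.7], [1.0], 1.0)
interior = minimize_poisson_coordinate(
    [2, 3, 1], [1, 1, 1], [0.0, 1.0], [1.0, 1.0], 0.5
)
```

In the first call the minimizer is attained at the non-differentiable point 0.7, which corresponds to a fusion with the neighbor. In the second call the minimizer lies strictly between the two knots and is obtained from the closed-form stationary point.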

Algorithm 2 The coordinate descent algorithm for a canonical link

To apply Theorem  1 , we need to obtain \(x_*\) . In many cases, \(x_*\) can be obtained in closed form according to the following proposition.

Proposition 2

Let \(x_*\) be the solution of \(d f (x) / d x = 0\) and \(q_0\) be a value such that \(q_1 = \cdots = q_d = q_0\) . Then, \(x_*\) is given as follows:

When \(q_0\) exists, \(x_*\) is given in a general form as

Even when \(q_0\) does not exist, \(x_*\) for the Gaussian distribution is given by

and \(x_*\) for the Poisson distribution is given by

For example, \(q_0\) exists and \(q_0 = 0\) holds for GLMs without an offset. When \(q_0\) does not exist, \(x_*\) must be derived separately for each distribution. For the Gaussian and Poisson distributions, since \(\mu (x + q)\) can be separated with respect to x and q , \(x_*\) can be obtained in closed form. Note that \(x_*\) for a Gaussian distribution when \(q_0\) exists and equals 0 coincides with the result in Ohishi et al. ( 2021 ). For distributions for which such a separation is impossible, such as the binomial distribution, a numerical search is required to obtain \(x_*\) . However, we can easily obtain \(x_*\) by a simple algorithm, such as a line search, because f ( x ) is strictly convex and has its minimizer in the interval \(R_*\) . Moreover, when \(x_*\) can be obtained in closed form as in Proposition 2 , the minimization of f ( x ) has a computational complexity of \(O (r (d + r))\) .
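For the binomial case with differing offsets, the numerical search might look like the following Python sketch. It uses plain bisection on the derivative in place of the line search mentioned above, which is a substitution made for simplicity. It also assumes that a finite bracket with a sign change of the derivative is already known, and that the smooth part is the usual binomial negative log-likelihood \(\sum _i \{ n_i \log (1 + e^{x + q_i}) - y_i (x + q_i) \}\) with unit constants. All names are hypothetical.

```python
import math


def bisect_root(derivative, lo, hi, tol=1e-10, max_iter=200):
    """Find x in [lo, hi] with derivative(x) = 0, assuming derivative is
    increasing and derivative(lo) < 0 < derivative(hi)."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if derivative(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def logistic_derivative(x, y, n_trials, q, slope):
    """Derivative of sum_i {n_i*log(1+exp(x+q_i)) - y_i*(x+q_i)} + slope*x,
    where slope is the constant penalty slope on the current interval."""
    total = slope
    for yi, ni, qi in zip(y, n_trials, q):
        total += ni / (1.0 + math.exp(-(x + qi))) - yi
    return total


# Hypothetical data: two binomial observations with different offsets.
y, n_trials, q = [3, 5], [10, 10], [0.0, 0.5]
x_star = bisect_root(
    lambda x: logistic_derivative(x, y, n_trials, q, 0.0), -2.0, 2.0
)
```

Because the derivative is increasing on the interval, bisection converges to the unique stationary point. A faster root finder could be substituted without changing the surrounding algorithm.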

3.2 General link

Here, we consider the minimization of f ( x ) in ( 4 ) with a general link, i.e., \(h (\cdot )\) is a general differentiable function. Then, although strict convexity of f ( x ) is not guaranteed, its continuity is maintained. This means the uniqueness of the minimizer of f ( x ) is not guaranteed, but we can obtain minimizer candidates by using the same procedure as in the previous subsection.

where \(\dot{h} (x) = d h (x) / d x\) . Since \(z_\ell \) satisfying \(0 \in \partial f (z_\ell )\) is a stationary point of f ( x ), such points are minimizer candidates of f ( x ). Next, we define intervals as \(R_\ell = (z_\ell , z_{\ell +1})\ (\ell = 0, 1, \ldots , r)\) . For \(x \in R_\ell \) , f ( x ) can be written in non-absolute form as

We can then search for minimizer candidates of f ( x ) by piecewise minimization. That is, \(x \in R_\ell \) minimizing \(f_\ell (x)\) is a minimizer candidate. Hence, we have the following theorem.

Theorem 3

Let \(\hat{x}\) be the minimizer of f ( x ) in ( 4 ) and define a set \(\mathcal {S}\) by

Now, suppose that

where \(\dot{f}_\ell (x) = d f_\ell (x) / dx\) . Then, \(\mathcal {S}\) is the set of minimizer candidates of f ( x ) and \(\hat{x}\) is given by

The assumption ( 8 ) excludes the case in which f ( x ) attains the minimum at \(x = \pm \infty \) . Moreover, we have the following corollary (the proof is given in Appendix A).

Corollary 4

Suppose that for all \(\ell \in \{ 0, 1, \ldots , r \}\) ,

is true, and that ( 8 ) holds. Then, f ( x ) is strictly convex and \(\# (\mathcal {S}) = 1\) , where \(\mathcal {S}\) is given in Theorem  3 . Moreover, the unique element of \(\mathcal {S}\) is the minimizer of f ( x ) and is given as in Theorem  1 .

To execute Algorithm 1 for GLMs with a general link, we can replace Theorem  1 with Theorem  3 or Corollary  4 in Algorithm 2. The next subsection gives specific examples of using a general link.

3.2.1 Examples

This subsection focuses on the negative binomial, gamma, and inverse Gaussian distributions with a log-link as examples of using a general link. In the framework of regression, the negative binomial distribution is often used to deal with overdispersion in Poisson regression, making it natural to use a log-link. Note that NB-C and NB2 indicate negative binomial regression with canonical and log-links, respectively (for details, see, e.g., Hilbe 2011 ). The gamma and inverse Gaussian distributions are used to model positive continuous data. Their expectations must be positive. However, their canonical links do not guarantee that their expectations will, in fact, be positive. Hence, a log-link rather than a canonical link is often used for these distributions (e.g., Algamal 2018 ; Dunn and Smyth 2018 , Chap. 11). Here, we consider coordinate-wise minimizations for the three distributions with a log-link.

For \(x \in R_\ell \) , f ( x ) in ( 4 ) and its first- and second-order derivatives are given by

Inverse Gaussian:

We can see that \(\ddot{f}_\ell (x) > 0\) holds for all \(\ell \in \{ 0, 1, \ldots , r \}\) , for NB2 and the gamma distribution. Hence, the minimizers of f ( x ) can be uniquely obtained from Corollary  4 . On the other hand, the uniqueness of the minimizer for the inverse Gaussian distribution is not guaranteed; however, we have \(v_0 < 0\) , \(v_r > 0\) , and

This implies \(x< \min \{\log (u_1 / u_2), z_1 \} \Rightarrow \dot{f}_0 (x) < 0\) and \(x> \max \{ \log (u_1 / u_2), z_r \} \Rightarrow \dot{f}_r (x) > 0\) . Hence, the minimizer for the inverse Gaussian distribution can be obtained by Theorem  3 .

We now give specific solutions. From above, we have the following proposition.

Proposition 5

Let \(\tilde{x}_\ell \) be a stationary point of \(f_\ell (x)\) . If \(\tilde{x}_\ell \) exists, it is given by

NB2 only when \(\exists q_0\ s.t.\ q_1 = \cdots = q_d = q_0\) :

Moreover, a relationship between \(\tilde{x}_\ell \) and the minimizer of f ( x ) is given by

NB2 and Gamma:

3.3 Some comments regarding implementation

3.3.1 Dispersion parameter estimation

In the previous subsections, we discussed the estimation of \(\beta _j\) , which corresponds to the estimation of the canonical parameter \(\theta _{ji}\) . The GLMs in ( 1 ) also have a dispersion parameter \(\phi \) . Although \(\phi \) is fixed at one for the binomial and Poisson distributions, it is unknown for other distributions, and, hence, we need to estimate the value of \(\phi \) . The Pearson estimator is often used as a suitable estimator (e.g., Dunn and Smyth 2018 , Chap. 6). Let \(\hat{\beta }_1, \ldots , \hat{\beta }_m\) be estimators of \(\beta _1, \ldots , \beta _m\) , t be the number of their distinct values, and \(\hat{\zeta }_{ji} = \mu (\hat{\beta }_j + q_{ji})\) . Then, the Pearson estimator of \(\phi \) is given by

where \(V (\cdot )\) is a variance function (see Table  2 ). For distributions other than the negative binomial distribution, the estimator of \(\phi \) can be obtained after \({\varvec{\beta }}\) is estimated since the estimation of \({\varvec{\beta }}\) does not depend on \(\phi \) . For the negative binomial distribution, the estimation of \({\varvec{\beta }}\) depends on \(\phi \) because \(\mu (\cdot )\) and \(b (\cdot )\) depend on \(\phi \) . Hence, we need to add a step updating \(\phi \) and repeat updates of \({\varvec{\beta }}\) and \(\phi \) alternately. Moreover, this Pearson estimator is used for the diagnosis of overdispersion in the binomial and Poisson distributions. If \(\hat{\phi } > 1\) , it is doubtful that the model is appropriate.
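A minimal Python sketch of a Pearson-type estimator is given below. It assumes the usual form, namely the sum of squared residuals scaled by the variance function and divided by \(n - t\) , where t is the number of distinct estimates. The displayed formula is not reproduced here, so this divisor is an assumption. The function name and the fitted values are hypothetical.

```python
def pearson_dispersion(y, fitted, variance, n_distinct):
    """Pearson-type estimate: sum of (y - mu)^2 / V(mu), divided by n - t."""
    n = len(y)
    chi2 = sum((yi - mi) ** 2 / variance(mi) for yi, mi in zip(y, fitted))
    return chi2 / (n - n_distinct)


# Hypothetical example: six counts, two distinct fitted means, and the
# Poisson variance function V(mu) = mu.
y = [2, 0, 3, 7, 5, 9]
fitted = [2.0, 2.0, 2.0, 7.0, 7.0, 7.0]
phi_hat = pearson_dispersion(y, fitted, variance=lambda mu: mu, n_distinct=2)
```

Here phi_hat is close to 1, which is what the diagnosis of overdispersion described above would expect when a Poisson model is adequate.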

3.3.2 Penalty weights

The objective function \(L ({\varvec{\beta }})\) in ( 3 ) includes penalty weights, and the GFL estimation proceeds with the given weights. Although setting \(w_{j \ell } = 1\) is usual, this may cause a problem of over-shrinkage because all pairs of parameters are shrunk uniformly by the common \(\lambda \) . As one option to avoid this problem, we can use the following weight based on the adaptive Lasso (Zou 2006 ):

where \(\tilde{\beta }_j\) is an estimator of \(\beta _j\) and the maximum likelihood estimator (MLE) may be a reasonable choice for it. If there exists \(q_{j 0}\) such that \(q_{j1} = \cdots = q_{j n_j} = q_{j 0}\) , the MLE is given in the following closed form:

For other cases, see Appendix B.
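The following Python sketch illustrates adaptive-Lasso-type weights under two assumptions that go beyond the text above: the weight is taken to be the reciprocal of the absolute difference between initial estimates, and the initial estimates are computed for a Poisson model with a log-link, for which the group-wise MLE with a common offset is the log of the group mean minus the offset. All function names are hypothetical.

```python
import math


def poisson_group_mle(y_groups, offsets):
    """MLE of beta_j for a Poisson model with a log-link when all
    individuals in group j share the offset offsets[j].
    Requires a positive group mean."""
    return [math.log(sum(y) / len(y)) - q for y, q in zip(y_groups, offsets)]


def adaptive_weights(beta_tilde, neighbors):
    """w_jl = 1 / |beta_tilde_j - beta_tilde_l| for each adjacent pair.
    Requires distinct initial estimates for adjacent groups."""
    return {
        (j, l): 1.0 / abs(beta_tilde[j] - beta_tilde[l])
        for j, d_j in neighbors.items()
        for l in d_j
    }


# Hypothetical example: three groups on a chain graph, no offset.
y_groups = [[1, 3], [4, 4], [8, 8]]
beta_tilde = poisson_group_mle(y_groups, offsets=[0.0, 0.0, 0.0])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
weights = adaptive_weights(beta_tilde, neighbors)
```

Pairs whose initial estimates are close receive large weights and are therefore shrunk together more strongly, which is the intended effect of this kind of weighting.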

3.3.3 Tuning parameter selection

It is important for a penalized estimation method, such as GFL estimation, to select a tuning parameter, which, in this paper, is represented as \(\lambda \) in ( 3 ). Because \(\lambda \) adjusts the strength of the penalty relative to the model fit, we need to select a good value of \(\lambda \) in order to obtain a good estimator. The optimal value of \(\lambda \) is commonly selected from candidates by minimizing, e.g., a cross-validation error or a model selection criterion. For a given \(\lambda _{\max }\) , candidates for \(\lambda \) are selected from the interval \([0, \lambda _{\max }]\) . Following Ohishi et al. ( 2021 ), \(\lambda _{\max }\) is defined as a value such that all \(\beta _j\ (j \in \{ 1, \ldots , m \})\) are updated as \(\hat{\beta }_{\max }\) when the current solution of \({\varvec{\beta }}\) is \(\hat{{\varvec{\beta }}}_{\max } = \hat{\beta }_{\max } {\varvec{1}}_m\) , where \(\hat{{\varvec{\beta }}}_{\max }\) is the MLE under \({\varvec{\beta }}= \beta {\varvec{1}}_m\) (see Appendix B) and \({\varvec{1}}_m\) is the m -dimensional vector of ones. When the current solution of \({\varvec{\beta }}\) is \(\hat{{\varvec{\beta }}}_{\max }\) , the discussion in Sect. 3.2 gives the condition under which \(\beta _j\) is updated as \(\hat{\beta }_{\max }\) as

Hence, \(\lambda _{\max }\) is given by

3.3.4 Extension

In this paper, we proposed the algorithm for the model ( 1 ) with the structure ( 2 ), which means the model does not have any explanatory variables. However, the proposed method can also be applied to the model with explanatory variables by simple modifications.

Let \({\varvec{x}}_{ji}\) and \({\varvec{\beta }}_j\) be p -dimensional vectors of explanatory variables and regression coefficients, respectively. We rewrite \(\eta _{ji}\) in ( 2 ) as

Focusing on the k th ( \(k \in \{ 1, \ldots , p \}\) ) explanatory variable, we have

where \(x_{jil}\) and \(\beta _{jl}\) are the l th elements of \({\varvec{x}}_{ji}\) and \({\varvec{\beta }}_j\) , respectively. The coordinate descent method updates each \(\beta _{jk}\) by regarding \(\tilde{q}_{jik}\) as a constant. Thus, in each cycle of the coordinate descent for GFL problem with explanatory variables, the following function is essentially minimized:

where \(z_{0i}\) is a constant. Hence, we can search for \(z_\ell \ (\ell \in \{ 1, \ldots , r \})\) satisfying \(0 \in \partial f (z_\ell )\) by a similar procedure. On the other hand, we can obtain the explicit minimizer of \(f_\ell (x)\ (x \in R_\ell )\) only for the Gaussian distribution and cannot obtain it for other distributions because \(\mu (\cdot )\) is not separable with respect to the product. However, the minimizer can still be found easily by a numerical search.
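The decomposition used in this extension can be sketched as follows. The sketch assumes that the offset \(q_{ji}\) is retained in the linear predictor, which is an assumption since the displayed equations are not reproduced here. It is meant as a reading aid rather than as the exact equations of the paper.

```latex
\[
\eta_{ji}
  = {\varvec{x}}_{ji}' {\varvec{\beta}}_j + q_{ji}
  = x_{jik} \beta_{jk} + \tilde{q}_{jik},
\qquad
\tilde{q}_{jik} = \sum_{l \ne k} x_{jil} \beta_{jl} + q_{ji} .
\]
```

Under this reading, \(\tilde{q}_{jik}\) plays the role of an offset when \(\beta _{jk}\) is updated, and the only structural change from the model without explanatory variables is that the coordinate is multiplied by the constant \(x_{jik}\) .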

4 Simulation

In this section, we focus on modeling count data and examine through simulation whether our proposed method can select the true cluster when clustering groups. For count data, Poisson regression and NB2 are often used. Hence, we compare the performance of the two approaches for various settings of the dispersion parameter. Note that GFL for Poisson regression has already been proposed by Yamamura et al. ( 2023 ) and that our contribution is to apply GFL to NB2. Note, too, that simulation studies were not conducted in Yamamura et al. ( 2023 ). Moreover, the R packages metafuse and smurf (e.g., Tang et al. 2016 ; Reynkens et al. 2023 ), which implement FLARCC and the SMuRF algorithm, respectively, can deal with Poisson regression but cannot deal with NB2.

Let \(m^*\) be the number of true clusters and \(E_k^*\subset \{ 1, \ldots , m \}\ (k \in \{ 1, \ldots , m^*\})\) be an index set specifying groups in the k th true cluster. Then, we generate simulation data from

We consider four cases of m and \(m^*\) as \((m, m^*) = (10, 3), (10, 6), (20, 6), (20, 12)\) , and use the same settings as Ohishi et al. ( 2021 ) for adjacent relationships of m groups and true clusters (see Figs.  1 and 2 ).
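A data-generating step of this kind might be sketched in Python as follows. The sketch assumes that NB2 has variance \(\mu + \phi \mu ^2\) , so that \(\phi = 0\) reduces to the Poisson distribution, and it draws NB2 counts as a gamma mixture of Poisson counts. The true cluster means, the cluster assignment, and all function names are hypothetical and are not the settings used in the paper.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for small means."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_count(mean, phi, rng):
    """Poisson draw if phi == 0, else NB2 with variance mean + phi*mean**2."""
    if phi > 0.0:
        # Gamma mixing variable with mean 1 and variance phi.
        mean *= rng.gammavariate(1.0 / phi, phi)
    return sample_poisson(mean, rng)


def simulate(cluster_of_group, cluster_log_means, n0, phi, seed=0):
    """Draw n0 counts for each group; groups in the same true cluster
    share the same log-mean."""
    rng = random.Random(seed)
    return [
        [sample_count(math.exp(cluster_log_means[k]), phi, rng)
         for _ in range(n0)]
        for k in cluster_of_group
    ]


# Hypothetical setting: four groups in three true clusters.
data = simulate([0, 0, 1, 2], [0.0, 1.0, 2.0], n0=5, phi=0.5, seed=1)
```

Fixing the seed makes a single replication reproducible. A Monte Carlo study would repeat this step with different seeds.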

Fig. 1 Adjacent relationship and true clusters when \(m = 10\)

Fig. 2 Adjacent relationship and true clusters when \(m = 20\)

The sample sizes for each group are common, i.e., \(n_1 = \cdots = n_m = n_0\) . Furthermore, the estimation of \(\phi \) , the definition of the penalty weights, and the candidates for \(\lambda \) follow Sect.  3.3 , and the optimal value of \(\lambda \) is selected based on the minimization of BIC (Schwarz 1978 ) from 100 candidates. Here, the simulation studies are conducted based on Monte Carlo simulation with 1,000 iterations.
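The selection step can be sketched in Python as below. The sketch assumes that BIC is computed as twice the negative log-likelihood plus the number of distinct estimates times \(\log n\) . Using the number of clusters as the degrees of freedom is an assumption made for this example, and the candidate values are hypothetical numbers rather than results from the paper.

```python
import math


def select_by_bic(candidates, n):
    """candidates: list of (lam, neg_log_lik, n_clusters) tuples.
    Returns the lam whose fit minimizes 2*neg_log_lik + n_clusters*log(n)."""
    def bic(item):
        _, neg_log_lik, n_clusters = item
        return 2.0 * neg_log_lik + n_clusters * math.log(n)
    return min(candidates, key=bic)[0]


# Hypothetical fits for three candidate values of lambda with n = 200.
candidates = [(0.0, 100.0, 10), (0.5, 102.0, 4), (1.0, 120.0, 1)]
best_lam = select_by_bic(candidates, n=200)
```

In this example the middle candidate wins: it gives up little in fit compared with the unpenalized solution while using far fewer distinct values.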

4.1 Comparison with existing methods

Before comparing the performances of Poisson regression and NB2, we compare our proposed method with two existing methods, FLARCC and the SMuRF algorithm, for Poisson regression (i.e., \(\phi = 0\) ). As described in Sect. 1 , although the purpose of FLARCC differs from ours, the two are equivalent when the complete graph structure is used as the adjacency relationship. Hence, we use the complete graph structure in the comparison with FLARCC. On the other hand, the SMuRF algorithm can be applied to minimize ( 3 ). Thus, for the SMuRF algorithm, we compare the minimum value of the objective function and the runtime under a given \(\lambda \) .

Table 5 summarizes the results of the comparison with FLARCC, in which SP is the selection probability (%) of the true cluster, and time is the runtime (in seconds). We can see that the SP values for both methods approach 100% as \(n_0\) increases. Moreover, FLARCC is always better than the proposed method in terms of SP. This result is reasonable. In the proposed method, there are many possible clustering patterns, and each group has \(m-1\) choices of groups to fuse with. In FLARCC, on the other hand, each group has at most two choices because of the restriction. It is natural to expect that wrong fusions occur less often when there are fewer choices. The MLE plays an important role in this explanation. In FLARCC, the MLE is used to restrict the fusion patterns. In the proposed method, the MLE is used for the penalty weights, which help to identify whether two parameters are equal. If the restriction in FLARCC is correct, the penalty weights in the proposed method would also perform well. In such a situation, FLARCC, which has fewer choices, can be expected to perform better. If the restriction in FLARCC is wrong, the penalty weights in the proposed method would not perform well either. In such a situation, the proposed method, which has more choices, is more prone to wrong fusions. Since the MLE becomes stable as n increases, the difference between the two methods becomes small as \(n_0\) increases. Hence, if the purpose is clustering without any adjacency, using FLARCC is better. However, recall that the proposed method is designed to minimize the objective function ( 3 ), and FLARCC cannot be applied to minimize this objective function. Moreover, FLARCC requires a reparameterization for its estimation process, and hence its estimates must be transformed back to the original parameterization. This may be the reason why FLARCC is slower than the proposed method.

Table 6 summarizes the results of the comparison with the SMuRF algorithm, in which difR, win, and time are defined by

respectively, and \(\lambda _j = \tau _j \lambda _{\max }\) , where \(\tau _1 = 1/100\) , \(\tau _2 = 1/10\) , \(\tau _3 = 1/2\) , and \(L_1^\star \) and \(L_2^\star \) are the minimum values of the objective function ( 3 ) obtained by the proposed method and the SMuRF algorithm, respectively. We can see that the difR value is always positive, and that the win value is 100% in most cases and around 70% even at its lowest. This means that the proposed method minimized the objective function better than the SMuRF algorithm. Notice that the actual difR value is very small since the displayed value is multiplied by 1,000. That is, the difference between the two minimum values is not very large. The difR values also tell us that the difference becomes larger as \(\lambda \) increases and smaller as n increases. Moreover, the time values show that the proposed method was faster than the SMuRF algorithm in most cases. Hence, we found that the proposed method can minimize the objective function faster and more accurately than the SMuRF algorithm.

4.2 Poisson vs. NB2

In this subsection, we compare Poisson regression and NB2. Tables 7 and 8 summarize the results for \(m = 10, 20\) , respectively, in which SP is the selection probability (%) of the true cluster, \(\hat{\phi }\) is the Pearson estimator of \(\phi \) , and time is the runtime (in seconds). Table 9 summarizes the standard errors of \(\hat{\phi }\) . First, focusing on \(\phi =0\) , i.e., the case in which the true model follows the Poisson distribution, the value of SP using Poisson regression approaches 100% as \(n_0\) increases. Furthermore, we can say that Poisson regression provides good estimation since \(\hat{\phi }\) is approximately 1. On the other hand, NB2 is unable to select the true cluster. The reason for this may be that the dispersion parameter in the negative binomial distribution is positive. Moreover, the standard errors tell us that the estimation of Poisson regression is more stable than that of NB2. Next, we focus on \(\phi > 0\) . Here, Poisson regression exhibits overdispersion since \(\hat{\phi }\) is larger than 1, and, hence, it is unable to select the true cluster. On the other hand, the SP value for NB2 approaches 100% as \(n_0\) increases. Furthermore, \(\hat{\phi }\) is roughly the true value, indicating that NB2 can provide good estimation. The standard errors also support the good performance of NB2. Finally, it is apparent that Poisson regression is always faster than NB2. The reason for this may be that Poisson regression requires only the estimation of \({\varvec{\beta }}\) , whereas NB2 requires alternately repeating the estimation of \({\varvec{\beta }}\) and \(\phi \) . We can conclude from this simulation that Poisson regression is better when the true model follows a Poisson distribution and that NB2 can effectively deal with overdispersion in Poisson regression.

5 Real data example

In this section, we apply our method to the estimation of spatio-temporal trends using real crime data. The data consist of the number of recognized crimes committed in the Tokyo area as collected by the Metropolitan Police Department, available at TOKYO OPEN DATA ( https://portal.data.metro.tokyo.lg.jp/ ). Although these data were aggregated for each chou-chou (level 4), the finest regional division, we integrate the data for each chou-oaza (level 3) and apply our method by regarding level 3 as individuals and the city (level 2) as the group (see Fig. 3 ).

Fig. 3 Divisions of Tokyo

There are 53 groups as a division of space, and spatial adjacency is defined by the regional relationships of level 2. We use six years of data, from 2017 to 2022. The sample size is \(n = 9{,}570\) . Temporal adjacency is defined using a chain graph for the six time points. According to Yamamura et al. ( 2021 ), we can define adjacent spatio-temporal relationships for \(m = 318\ (= 53 \times 6)\) groups by combining spatial and temporal adjacencies. Furthermore, following Yamamura et al. ( 2023 ), we use population as a variable for the offset. The population data were obtained from the results of the population census, as provided in e-Stat ( https://www.e-stat.go.jp/en ). Since the population census is conducted every five years, we use the population in 2015 for the crimes in 2017 to 2019 and the population in 2020 for the crimes in 2020 to 2022.
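One natural way to combine the two adjacencies is sketched below in Python: each space-time group is adjacent to its spatial neighbors at the same time point and to the same area at the previous and next time points. This construction is an assumption made for illustration and is not claimed to be the exact definition in Yamamura et al. ( 2021 ). The function name and the toy spatial graph are hypothetical.

```python
def spatio_temporal_neighbors(spatial_neighbors, n_times):
    """Combine a spatial adjacency with a chain graph over time points.

    Node (s, t) is indexed as t * n_areas + s. It is adjacent to (s2, t)
    for every spatial neighbor s2 of s, and to (s, t - 1) and (s, t + 1)
    whenever those time points exist.
    """
    n_areas = len(spatial_neighbors)
    neighbors = {}
    for t in range(n_times):
        for s in range(n_areas):
            adj = [t * n_areas + s2 for s2 in spatial_neighbors[s]]
            if t > 0:
                adj.append((t - 1) * n_areas + s)
            if t < n_times - 1:
                adj.append((t + 1) * n_areas + s)
            neighbors[t * n_areas + s] = sorted(adj)
    return neighbors


# Hypothetical toy example: three areas on a line, two time points.
spatial = {0: [1], 1: [0, 2], 2: [1]}
neighbors = spatio_temporal_neighbors(spatial, n_times=2)
```

With 53 areas and six time points, the same construction yields 318 groups, which matches the number of groups stated above.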

In this analysis, we apply our method to the above crime data, with \(n = 9{,}570\) individuals aggregated into \(m = 318\) groups, and estimate the spatio-temporal trends in the data. Specifically, \(y_{ji}\) , the number of crimes in the i th region of the j th group, is modeled based on the Poisson and negative binomial distributions, respectively, as

where \(q_{j i}\) is a logarithm transformation of the population and canonical and log-links are used, respectively. Estimation of the dispersion parameter, the setting of penalty weights, and the candidates for the tuning parameter follow Sect.  3.3 . The optimal tuning parameter is selected from 100 candidates based on the minimization of BIC. Table  10 summarizes the estimation results.

Here, \(\hat{\phi }\) denotes the Pearson estimator of the dispersion parameter. Since the value of \(\hat{\phi }\) in the Poisson regression is far larger than 1, there is overdispersion, and we can say that using Poisson regression is inappropriate. To cope with this overdispersion, we adopted NB2. The cluster value in the table indicates the number of clusters obtained by GFL. Poisson regression and NB2 clustered the \(m = 318\) groups into 160 and 109 clusters, respectively. Figure 4 is a yearly choropleth map of the GFL estimates of \({\varvec{\beta }}\) using NB2. The map shows that the larger the value, the more likely crime is to occur, and the smaller the value, the less likely. As this figure illustrates, we can visualize how the trend varies over time and space.

Fig. 4 GFL estimates of \({\varvec{\beta }}\) by NB2

6 Conclusion

To unify models based on a variety of distributions, we proposed a coordinate descent algorithm to obtain GFL estimators for GLMs. Although Yamamura et al. ( 2021 ), Ohishi et al. ( 2022 ), and Yamamura et al. ( 2023 ) dealt with GFL for the binomial and Poisson distributions, our method is more general, covering both these distributions and others. The proposed method repeats the partial update of parameters and directly solves sub-problems without any approximations of the objective function. In many cases, the solution can be updated in closed form. Indeed, in the ordinary situation where a canonical link is used and there is no offset, we can always update the solution in closed form. Moreover, even when an explicit update is impossible, we can easily update the solution using a simple numerical search since the interval containing the solution can be specified. Hence, our algorithm can efficiently search for the solution. The simulation studies demonstrated this efficiency in terms of computation time.

Data availability

Data are available at https://portal.data.metro.tokyo.lg.jp/ and https://www.e-stat.go.jp/en .

We arranged and used the following production: Tokyo Metropolitan Government & Metropolitan Police Department. The number of recognized cases by region, crime type, and method (yearly total; in Japanese), https://creativecommons.org/licenses/by/4.0/deed.en .

Algamal, Z.Y.: Developing a ridge estimator for the gamma regression model. J. Chemom. 32 , 3054 (2018). https://doi.org/10.1002/cem.3054

Choi, H., Lee, S.: Convex clustering for binary data. Adv. Data Anal. Classif. 13 , 991–1018 (2019). https://doi.org/10.1007/s11634-018-0350-1

Devriendt, S., Antonio, K., Reynkens, T., Verbelen, R.: Sparse regression with multi-type regularized feature modeling. Insur. Math. Econ. 96 , 248–261 (2021). https://doi.org/10.1016/j.insmatheco.2020.11.010

Dunn, P.K., Smyth, G.K.: Generalized Linear Models With Examples in R. Springer, New York (2018)

Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1 , 302–332 (2007). https://doi.org/10.1214/07-AOAS131

Gardner, W., Mulvey, E.P., Shaw, E.C.: Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychol. Bull. 118 , 392–404 (1995). https://doi.org/10.1037/0033-2909.118.3.392

Hilbe, J.M.: Negative Binomial Regression, 2nd edn. Cambridge University Press, Cambridge (2011)

Höfling, H., Binder, H., Schumacher, M.: A coordinate-wise optimization algorithm for the fused Lasso. arXiv:1011.6409v1 (2010)

Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc. Ser. A 135 , 370–384 (1972). https://doi.org/10.2307/2344614

Ohishi, M.: GFLglm: Generalized Fused Lasso for Grouped Data in Generalized Linear Models (2024). R package version 0.1.0. https://github.com/ohishim/GFLglm

Ohishi, M., Fukui, K., Okamura, K., Itoh, Y., Yanagihara, H.: Coordinate optimization for generalized fused Lasso. Comm. Stat. Theory Methods 50 , 5955–5973 (2021). https://doi.org/10.1080/03610926.2021.1931888

Ohishi, M., Yamamura, M., Yanagihara, H.: Coordinate descent algorithm of generalized fused Lasso logistic regression for multivariate trend filtering. Jpn. J. Stat. Data Sci. 5 , 535–551 (2022). https://doi.org/10.1007/s42081-022-00162-2

Reynkens, T., Devriendt, S., Antonio, K.: Smurf: Sparse Multi-Type Regularized Feature Modeling (2023). R package version 1.1.5. https://CRAN.R-project.org/package=smurf

Rockafellar, R.T.: Convex Analysis. Princeton University Press, New Jersey (1970)

Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6 , 461–464 (1978). https://doi.org/10.1214/aos/1176344136

Tang, L., Song, P.X.K.: Fused Lasso approach in regression coefficients clustering—learning parameter heterogeneity in data integration. J. Mach. Learn. Res. 17 , 1–23 (2016)

Tang, L., Zhou, L., Song, P.X.K.: Metafuse: Fused Lasso Approach in Regression Coefficient Clustering (2016). R package version 2.0-1. https://CRAN.R-project.org/package=metafuse

Tibshirani, R.J.: Adaptive piecewise polynomial estimation via trend filtering. Ann. Stat. 42 , 285–323 (2014). https://doi.org/10.1214/13-AOS1189

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused Lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 , 91–108 (2005). https://doi.org/10.1111/j.1467-9868.2005.00490.x

Ver Hoef, J.M., Boveng, P.L.: Quasi-Poisson vs. negative binomial regression: How should we model overdispersed count data? Ecology 88 , 2766–2772 (2007). https://doi.org/10.1890/07-0043.1

Xin, B., Kawahara, Y., Wang, Y., Gao, W.: Efficient generalized fused Lasso and its application to the diagnosis of Alzheimer’s disease. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp. 2163–2169. AAAI Press, California (2014)

Yamamura, M., Ohishi, M., Yanagihara, H.: Spatio-temporal adaptive fused Lasso for proportion data. In: Czarnowski, I., Howlett, R.J., Jain, L.C. (eds.) Intelligent Decision Technologies, pp. 479–489. Springer, Singapore (2021). https://doi.org/10.1007/978-981-16-2765-1_40


Yamamura, M., Ohishi, M., Yanagihara, H.: Spatio-temporal analysis of rates derived from count data using generalized fused Lasso. In: Czarnowski, I., Howlett, R.J., Jain, L.C. (eds.) Intelligent Decision Technologies, pp. 225–234. Springer, Singapore (2023). https://doi.org/10.1007/978-981-99-2969-6_20

Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101 , 1418–1429 (2006). https://doi.org/10.1198/016214506000000735



The author thanks Prof. Hirokazu Yanagihara of Hiroshima University for his many helpful comments and FORTE Science Communications ( https://www.forte-science.co.jp/ ) for English language editing of the first draft. Moreover, the author also thanks the associate editor and the two reviewers for their valuable comments. Furthermore, this work was partially supported by JSPS KAKENHI Grant Number JP20H04151, JP21K13834, JSPS Bilateral Program Grant Number JPJSBP120219927, and ISM Cooperative Research Program (2023-ISMCRP-4105).

Author information

Authors and affiliations.

Center for Data-Driven Science and Artificial Intelligence, Tohoku University, Kawauchi 41, Aoba-ku, Sendai, Miyagi, 980-8576, Japan

Mineaki Ohishi



M.O. contributed the whole paper.

Corresponding author

Correspondence to Mineaki Ohishi .

Ethics declarations

Conflict of interest.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proof of Corollary 4

Suppose that for all \(\ell \in \{ 0, 1, \ldots , r \}\) , the statement

is true and that ( 8 ) holds. Then, \(\dot{f}_\ell (x)\) is strictly increasing on \(R_\ell \) and hence, \(f_\ell (x)\) is strictly convex. Moreover, for all \(\ell \in \{ 1, \ldots , r \}\) , there is the following relationship among a derivative and one-sided derivatives:

This fact and ( 8 ) imply the strict convexity of f ( x ) on \(\mathbb {R}\) and hence, the minimizer uniquely exists.

Appendix B: Derivation of MLEs

We first describe the derivation of the MLE of \(\beta _j\) . For distributions with a convex likelihood function, the MLE is obtained by solving

In Tables  2 and 3 , all distributions, with the exception of the inverse Gaussian distribution with log-link, have convexity. The MLE of \(\beta _j\) is given in closed form in the following cases:

\(q_{j1} = \cdots = q_{j n_j} = q_{j0}\) :

Poisson or Gamma with log-link:

Other distributions, including the inverse Gaussian distribution with log-link, require a numerical search. Furthermore, the negative binomial distribution requires the repeated updating of \({\varvec{\beta }}\) and \(\phi \) alternately.
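As a hedged illustration of the distinction between a closed-form MLE and one that needs a numerical search, the sketch below uses a binomial model with logit link, chosen here only for illustration. The names `mle_bisection`, `y` (success counts), `m` (numbers of trials), and `offsets` (known offsets, playing the role assumed here for the \(q_{ji}\)) are hypothetical and not taken from the paper. The score function is strictly decreasing in \(\beta\), and an interval containing its root can be written down explicitly, so bisection suffices. When all offsets are equal, the interval collapses to a single point, which is the closed-form solution.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def score(beta, y, m, offsets):
    """Derivative of the binomial log-likelihood in beta (logit link)."""
    return sum(yi - mi * sigmoid(beta + oi)
               for yi, mi, oi in zip(y, m, offsets))

def mle_bisection(y, m, offsets, tol=1e-10):
    """Root of the strictly decreasing score, found by bisection.

    Assumes 0 < sum(y) < sum(m) so that the MLE is finite.
    """
    p = sum(y) / sum(m)
    centre = math.log(p / (1.0 - p))
    lo = centre - max(offsets)   # score(lo) >= 0
    hi = centre - min(offsets)   # score(hi) <= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, y, m, offsets) > 0:
            lo = mid             # root lies to the right of mid
        else:
            hi = mid             # root lies at or to the left of mid
    return 0.5 * (lo + hi)
```

With equal offsets, for instance `mle_bisection([3, 5], [10, 10], [0.2, 0.2])`, the loop never runs and the result is the closed-form value `log(0.4 / 0.6) - 0.2`. With unequal offsets the same call performs the numerical search on the bracketing interval.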

Next, we describe the derivation of \(\beta _{\max }\) . The \(\beta _{\max }\) is the MLE of \(\beta \) under \({\varvec{\beta }}= \beta {\varvec{1}}_m\) , and for distributions with a convex likelihood function, its value is obtained by solving

Notice that this is essentially equal to the derivation of the MLE of \(\beta _j\) . Hence, \(\beta _{\max }\) is given in closed form in the following cases:

\(q_{j i} = q_0\ (\forall j, i)\) :

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Ohishi, M. Generalized fused Lasso for grouped data in generalized linear models. Stat Comput 34 , 124 (2024). https://doi.org/10.1007/s11222-024-10433-5


Received : 12 February 2024

Accepted : 21 April 2024

Published : 25 May 2024

DOI : https://doi.org/10.1007/s11222-024-10433-5


  • Grouped data
  • Coordinate descent algorithm
  • Generalized fused Lasso
  • Generalized linear models
  • Multivariate trend filtering

Products, Solutions, and Services

Want some help finding the Cisco products that fit your needs? You're in the right place. If you want troubleshooting help, documentation, other support, or downloads, visit our  technical support area .

Contact Cisco

  • Get a call from Sales

Call Sales:

  • 1-800-553-6387
  • US/CAN | 5am-5pm PT
  • Product / Technical Support
  • Training & Certification

Products by technology


  • Software-defined networking
  • Cisco Silicon One
  • Cloud and network management
  • Interfaces and modules
  • Optical networking
  • See all Networking

Wireless and Mobility


  • Access points
  • Outdoor and industrial access points
  • Controllers
  • See all Wireless and Mobility


  • Secure Firewall
  • Secure Endpoint
  • Secure Email
  • Secure Access
  • Multicloud Defense
  • See all Security



  • Collaboration endpoints
  • Conferencing
  • Cisco Contact Center
  • Unified communications
  • Experience Management
  • See all Collaboration

Data Center


  • Servers: Cisco Unified Computing System
  • Cloud Networking
  • Hyperconverged infrastructure
  • Storage networking
  • See all Data Center


  • Nexus Dashboard Insights
  • Network analytics
  • Cisco Secure Network Analytics (Stealthwatch)


  • Video endpoints
  • Cisco Vision
  • See all Video


Internet of Things (IoT)

  • Industrial Networking
  • Industrial Routers and Gateways
  • Industrial Security
  • Industrial Switching
  • Industrial Wireless
  • Industrial Connectivity Management
  • Extended Enterprise
  • Data Management
  • See all industrial IoT


  • Cisco+ (as-a-service)
  • Cisco buying programs
  • Cisco Nexus Dashboard
  • Cisco Networking Software
  • Cisco DNA Software for Wireless
  • Cisco DNA Software for Switching
  • Cisco DNA Software for SD-WAN and Routing
  • Cisco Intersight for Compute and Cloud
  • Cisco ONE for Data Center Compute and Cloud
  • See all Software
  • Product index

Products by business type

Service Providers


Small Business



Midsize business

Cisco can provide your organization with solutions for everything from networking and data center to collaboration and security. Find the options best suited to your business needs.

  • By technology
  • By industry
  • See all solutions

CX Services

Cisco and our partners can help you transform with less risk and effort while making sure your technology delivers tangible business value.

  • See all services

Design Zone: Cisco design guides by category

Data center

  • See all Cisco design guides

End-of-sale and end-of-life

  • End-of-sale and end-of-life products
  • End-of-Life Policy
  • Cisco Commerce Build & Price
  • Cisco Software Central
  • Cisco Feature Navigator
  • See all product tools
  • Cisco Mobile Apps
  • Design Zone: Cisco design guides
  • Cisco DevNet
  • Marketplace Solutions Catalog
  • Product approvals
  • Product identification standard
  • Product warranties
  • Cisco Security Advisories
  • Security Vulnerability Policy
  • Visio stencils
  • Local Resellers
  • Technical Support

data analysis example in research paper


  1. FREE 10+ Sample Data Analysis Templates in PDF


  2. Data analysis in research


  3. What Is Data Analysis In Research Example


  4. FREE 10+ Sample Data Analysis Templates in PDF


  5. FREE 42+ Research Paper Examples in PDF


  6. (PDF) Practical Data Analysis: An Example



  1. Qualitative Data Analysis Procedures in Linguistics

  2. Gamification in Data Analysis

  3. Data Analysis in Research

  4. Importance of abstract in a research paper

  5. chapter -6: data analysis and presentation

  6. Analysis of Data? Some Examples to Explore


  1. A practical guide to data analysis in general literature reviews

    This article is a practical guide to conducting data analysis in general literature reviews. The general literature review is a synthesis and analysis of published research on a relevant clinical issue, and is a common format for academic theses at the bachelor's and master's levels in nursing, physiotherapy, occupational therapy, public health and other related fields.

  2. The Beginner's Guide to Statistical Analysis

    Learn how to plan, collect, and analyze quantitative data for research using five steps and two examples. Find out how to write hypotheses, choose a research design, measure variables, and interpret results.

  3. Data Analysis in Research: Types & Methods

    Definition of research in data analysis: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments, which makes sense. Three essential things occur during the data ...

  4. PDF Structure of a Data Analysis Report

    Examples of distractions include: - Extra sentences, overly formal or flowery prose, or at the other extreme overly casual or overly ... The data analysis report isn't quite like a research paper or term paper in a class, nor like a research article in a journal. It is meant, primarily, to start an organized conversation between you and ...

  5. A Really Simple Guide to Quantitative Data Analysis

    It is important to know what kind of data you are planning to collect or analyse as this will affect your analysis method. A 12 step approach to quantitative data analysis. Step 1: Start with ...

  6. What Is Data Analysis? (With Examples)

    Learn what data analysis is, how to do it, and why it's important for various fields and industries. Explore the data analysis process, types of data analysis, and recommended courses to get started on Coursera.

  7. Qualitative Data Analysis Methods: Top 6 + Examples

    QDA Method #1: Qualitative Content Analysis. Content analysis is possibly the most common and straightforward QDA method. At the simplest level, content analysis is used to evaluate patterns within a piece of content (for example, words, phrases or images) or across multiple pieces of content or sources of communication. For example, a collection of newspaper articles or political speeches.

  8. Qualitative data analysis: a practical example

    The aim of this paper is to equip readers with an understanding of the principles of qualitative data analysis and offer a practical example of how analysis might be undertaken in an interview-based study. Qualitative research is a generic term that refers to a group of methods, and ways of collecting and analysing data that are interpretative or explanatory in nature and focus on meaning.

  9. How to Write a Results Section

    The most logical way to structure quantitative results is to frame them around your research questions or hypotheses. For each question or hypothesis, share: A reminder of the type of analysis you used (e.g., a two-sample t test or simple linear regression). A more detailed description of your analysis should go in your methodology section.

  10. Data Analysis in Research

    Discover data analysis techniques, methods, and approaches, and study examples of data analysis in research papers. Updated: 11/21/2023 Table of Contents

  11. Data Analysis Techniques In Research

    Data analysis techniques in research are essential because they allow researchers to derive meaningful insights from data sets to support their hypotheses or research objectives. Data Analysis Techniques in Research: While various groups, institutions, and professionals may have diverse approaches to data analysis, a universal definition captures its essence.

  12. Data Analysis in Quantitative Research

    Learn how to choose appropriate analysis models for different types of research questions and data in health and social sciences. This chapter provides introductory guides, examples, and SPSS outputs for nominal, ordinal, and scale levels of measurement.

  13. (PDF) Practical Data Analysis: An Example

    18 2 Practical Data Analysis: An Example. Fig. 2.1 A histogram for the distribution of the value of attribute age using 8 bins. Fig. 2.2 A histogram for the distribution of the value of attribute ...

  14. Data analysis write-ups

    When writing your report, organization will set you free. A good outline is: 1) overview of the problem, 2) your data and modeling approach, 3) the results of your data analysis (plots, numbers, etc), and 4) your substantive conclusions. 1) Overview. Describe the problem.

  15. Chapter 3

    10 Examples of Effective Experiment Design and Data Analysis in Transportation Research About this Chapter This chapter provides a wide variety of examples of research questions. The examples demonstrate varying levels of detail with regard to experiment designs and the statistical analyses required.

  16. How to write data analysis in a research paper?

    Learn the steps and methods for conducting statistical analysis of quantitative data in a research paper. Find out how to plan your hypothesis, design, sample, summarize, test, and interpret your results.

  17. Qualitative case study data analysis: an example from practice

    Data sources: The research example used is a multiple case study that explored the role of the clinical skills laboratory in preparing students for the real world of practice. Data analysis was conducted using a framework guided by the four stages of analysis outlined by Morse (1994): comprehending, synthesising, theorising and recontextualising.

  18. What is data analysis? Examples and how to start

    Learn what data analysis is, why it's important, and how to use it for your business. Explore five types of data analysis with examples and tips for each.

  19. How to Analyze Data in a Primary Research Study

    6 How to Analyze Data in a Primary Research Study. Melody Denny and Lindsay Clark. Overview. This chapter introduces students to the idea of working with primary research data grounded in qualitative inquiry, closed- and open-ended methods, and research ethics (Driscoll; Mackey and Gass; Morse; Scott and Garner). [1] We know this can seem intimidating to students, so we will walk them through ...

  20. (PDF) Quantitative Data Analysis

    The final section contains sample papers generated by undergraduates illustrating three major forms of quantitative research - primary data collection, secondary data analysis, and content analysis.

  21. PDF Chapter 4: Analysis and Interpretation of Results

    This report analyses the data collected from questionnaires and interviews on the effectiveness of AIDS education workshops in South Africa. It presents the quantitative and qualitative results, findings and recommendations based on the research questions and hypothesis.

  22. Chapter 4

    Moreover, the frequency distribution analysis suggested three age groups; '20-35', '36-60' and 'Above 60'. 39% of the respondents belonged to the '20-35' age group, while 56.5% of the respondents belonged to the '36-60' age group and the remaining 4.5% belonged to the age group of 'Above 60'. Furthermore, the annual ...

  23. Understanding Different Types of Data in Statistics

    Qualitative data is also termed categorical data because it can be sorted into categories rather than numbers. It answers important questions like "how things happened" or "why they happened." For example, data on characteristics like loyalty, truthfulness, creativity, and others are qualitative data. Common examples of qualitative data:

  24. Generalized fused Lasso for grouped data in generalized ...

    Generalized fused Lasso (GFL) is a powerful method based on adjacent relationships or the network structure of data. It is used in a number of research areas, including clustering, discrete smoothing, and spatio-temporal analysis. When applying GFL, the specific optimization method used is an important issue. In generalized linear models, efficient algorithms based on the coordinate descent ...

  25. Earth Sciences Research Journal

    Taking the Xinshan iron ore mine as an example, this paper, based on collecting and analyzing the actual production data and similar simulation test data of this iron ore mine, analyses various factors affecting ore depletion by bottomless column segmental chipping method by using hierarchical analysis method (AHP) and fuzzy comprehensive evaluation method (FCE), and establishes an evaluation ...

  26. Figures at a glance

    In October each year, the Mid-Year Trends report is released to provide updated figures and analysis for the initial six months of the current year (from 1 January to 30 June). These figures are preliminary, and the final data is included in the subsequent Global Trends report released in June of the following year.

  27. Products, Solutions, and Services

    Cisco can provide your organization with solutions for everything from networking and data center to collaboration and security. Find the options best suited to your business needs. By technology; By industry; See all solutions; CX Services.