Using Science to Inform Educational Practices

Descriptive Research

There are many research methods available to psychologists in their efforts to understand, describe, and explain behavior. Some rely on observational techniques; others involve interactions between the researcher and the individuals being studied, ranging from a series of simple questions to extensive, in-depth interviews; still others are well-controlled experiments. The main categories of psychological research are descriptive, correlational, and experimental research. Each of these methods has unique strengths and weaknesses, and each may be appropriate only for certain types of research questions.

Research studies that do not test specific relationships between variables are called descriptive studies. For this method, the research question or hypothesis can be about a single variable (e.g., How accurate are people’s first impressions?) or can be broad and exploratory (e.g., What is it like to be a working mother diagnosed with depression?). The variable of the study is measured and reported without any further relationship analysis. A researcher might choose this method to report information such as a tally, an average, or a list of responses. Descriptive research can answer interesting and important questions, but it cannot answer questions about relationships between variables.

Video 2.4.1. Descriptive Research Design provides an explanation and examples of quantitative descriptive research. A closed-captioned version of this video is available here.

Descriptive research is distinct from correlational research, in which researchers formally test whether a relationship exists between two or more variables. Experimental research goes a step beyond descriptive and correlational research: it randomly assigns people to different conditions and uses hypothesis testing to make inferences about causal relationships between variables. We will discuss each of these methods in more depth later.

Table 2.4.1. Comparison of research design methods

Candela Citations

  • Descriptive Research. Authored by : Nicole Arduini-Van Hoose. Provided by : Hudson Valley Community College. Retrieved from : https://courses.lumenlearning.com/edpsy/chapter/descriptive-research/. License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike
  • Descriptive Research. Authored by : Nicole Arduini-Van Hoose. Provided by : Hudson Valley Community College. Retrieved from : https://courses.lumenlearning.com/adolescent/chapter/descriptive-research/. License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike

Educational Psychology Copyright © 2020 by Nicole Arduini-Van Hoose is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

What is Descriptive Research and How is it Used?

Introduction

  • What does descriptive research mean?
  • Why would you use a descriptive research design?
  • What are the characteristics of descriptive research?
  • Examples of descriptive research
  • What are the data collection methods in descriptive research?
  • How do you analyze descriptive research data?
  • Ensuring validity and reliability in the findings

Conducting descriptive research offers researchers a way to present phenomena as they naturally occur. Rooted in an open-ended and non-experimental nature, this type of research focuses on portraying the details of specific phenomena or contexts, helping readers gain a clearer understanding of topics of interest.

From businesses gauging customer satisfaction to educators assessing classroom dynamics, the data collected from descriptive research provides invaluable insights across various fields.

This article aims to illuminate the essence, utility, characteristics, and methods associated with descriptive research, guiding those who wish to harness its potential in their respective domains.

At its core, descriptive research refers to a systematic approach used by researchers to collect, analyze, and present data about real-life phenomena in order to describe them in their natural context. It primarily aims to describe what exists, based on empirical observations.

Unlike experimental research, where variables are manipulated to observe outcomes, descriptive research deals with the "as-is" scenario to facilitate further research by providing a framework or new insights on which continuing studies can build.

Definition of descriptive research

Descriptive research is defined as a research method that observes and describes the characteristics of a particular group, situation, or phenomenon.

The goal is not to establish cause-and-effect relationships but rather to provide a detailed account of the situation.

The difference between descriptive and exploratory research

While both descriptive and exploratory research seek to provide insights into a topic or phenomenon, they differ in their focus. Exploratory research is more about investigating a topic to develop preliminary insights or to identify potential areas of interest.

In contrast, descriptive research offers detailed accounts and descriptions of the observed phenomenon, seeking to paint a full picture of what's happening.

The evolution of descriptive research in academia

Historically, descriptive research has played a foundational role in numerous academic disciplines. Anthropologists, for instance, used this approach to document cultures and societies. Psychologists have employed it to capture behaviors, emotions, and reactions.

Over time, the method has evolved, incorporating technological advancements and adapting to contemporary needs, yet its essence remains rooted in describing a phenomenon or setting as it is.

Descriptive research serves as a cornerstone in the research landscape for its ability to provide a detailed snapshot of life. Its unique qualities and methods make it an invaluable method for various research purposes. Here's why:

Benefits of obtaining a clear picture

Descriptive research captures the present state of phenomena, offering researchers a detailed reflection of situations. This unaltered representation is crucial for sectors like marketing, where understanding current consumer behavior can shape future strategies.

Facilitating data interpretation

Given its straightforward nature, descriptive research can provide data that is easier to interpret, both for researchers and their audiences. Rather than analyzing complex statistical relationships among variables, researchers present detailed descriptions of their qualitative observations. Researchers can engage in in-depth analysis relating to their research question, while audiences can also draw insights from their own interpretations or reflections on potential underlying patterns.

Enhancing the clarity of the research problem

By presenting things as they are, descriptive research can help elucidate ambiguous research questions. A well-executed descriptive study can shine light on overlooked aspects of a problem, paving the way for further investigative research.

Addressing practical problems

In real-world scenarios, it's not always feasible to manipulate variables or set up controlled experiments. For instance, in social sciences, understanding cultural norms without interference is paramount. Descriptive research allows for such non-intrusive insights, ensuring genuine understanding.

Building a foundation for future research

Often, descriptive studies act as stepping stones for more complex research endeavors. By establishing baseline data and highlighting patterns, they create a platform upon which more intricate hypotheses can be built and tested in subsequent studies.

Descriptive research is distinguished by a set of hallmark characteristics that set it apart from other research methodologies. Recognizing these features can help researchers effectively design, implement, and interpret descriptive studies.

Specificity in the research question

As with all research, descriptive research starts with a well-defined research question aiming to detail a particular phenomenon. The specificity ensures that the study remains focused on gathering relevant data without unnecessary deviations.

Focus on the present situation

While some research methods aim to predict future trends or uncover historical truths, descriptive research is predominantly concerned with the present. It seeks to capture the current state of affairs, such as understanding today's consumer habits or documenting a newly observed phenomenon.

Standardized and structured methodology

To ensure credibility and consistency in results, descriptive research often employs standardized methods. Whether it's using a fixed set of survey questions or adhering to specific observation protocols, this structured approach ensures that data is collected uniformly, making it easier to compare and analyze.

Non-manipulative approach in observation

One of the standout features of descriptive research is its non-invasive nature. Researchers observe and document without influencing the research subject or the environment. This passive stance ensures that the data gathered is a genuine reflection of the phenomenon under study.

Replicability and consistency in results

Due to its structured methodology, findings from descriptive research can often be replicated in different settings or with different samples. This consistency adds to the credibility of the results, reinforcing the validity of the insights drawn from the study.

Numerous fields and sectors conduct descriptive research for its versatile and detailed nature. Through its focus on presenting things as they naturally occur, it provides insights into a myriad of scenarios. Here are some tangible examples from diverse domains:

Conducting market research

Businesses often turn to data analysis through descriptive research to understand the demographics of their target market. For instance, a company launching a new product might survey potential customers to understand their age, gender, income level, and purchasing habits, offering valuable data for targeted marketing strategies.

Evaluating employee behaviors

Organizations rely on descriptive research designs to assess the behavior and attitudes of their employees. By conducting observations or surveys, companies can gather data on workplace satisfaction, collaboration patterns, or the impact of a new office layout on productivity.

Understanding consumer preferences

Brands aiming to understand their consumers' likes and dislikes often use descriptive research. By observing shopping behaviors or conducting product feedback surveys , they can gauge preferences and adjust their offerings accordingly.

Documenting historical patterns

Historians and anthropologists employ descriptive research to identify patterns through analysis of events or cultural practices. For instance, a historian might detail the daily life in a particular era, while an anthropologist might document rituals and ceremonies of a specific tribe.

Assessing student performance

Educational researchers can utilize descriptive studies to understand the effectiveness of teaching methodologies. By observing classrooms or surveying students, they can measure data trends and gauge the impact of a new teaching technique or curriculum on student engagement and performance.

Descriptive research methods aim to authentically represent situations and phenomena. These techniques ensure the collection of comprehensive and reliable data about the subject of interest.

The most appropriate descriptive research method depends on the research question and resources available for your research study.

Surveys and questionnaires

One of the most familiar tools in the researcher's arsenal, surveys and questionnaires offer a structured means of collecting data from a vast audience. Through carefully designed questions, researchers can obtain standardized responses that lend themselves to straightforward comparison and analysis in quantitative and qualitative research.

Survey research can manifest in various formats, from face-to-face interactions and telephone conversations to digital platforms. While surveys can reach a broad audience and generate quantitative data ripe for statistical analysis, they also come with the challenge of potential biases in design and rely heavily on respondent honesty.

Observations and case studies

Direct or participant observation is a method wherein researchers actively watch and document behaviors or events. A researcher might, for instance, observe the dynamics within a classroom or the behaviors of shoppers in a market setting.

Case studies provide an even deeper dive, focusing on a thorough analysis of a specific individual, group, or event. These methods present the advantage of capturing real-time, detailed data, but they might also be time-intensive and can sometimes introduce observer bias.

Interviews and focus groups

Interviews, whether they follow a structured script or flow more organically, are a powerful means to extract detailed insights directly from participants. On the other hand, focus groups gather multiple participants for discussions, aiming to gather diverse and collective opinions on a particular topic or product.

These methods offer the benefit of deep insights and adaptability in data collection. However, they necessitate skilled interviewers, and focus group settings might see individual opinions being influenced by group dynamics.

Document and content analysis

Here, instead of generating new data, researchers examine existing documents or content. This can range from studying historical records and newspapers to analyzing media content or literature.

Analyzing existing content offers the advantage of accessibility and can provide insights over longer time frames. However, the reliability and relevance of the content are paramount, and researchers must approach this method with a discerning eye.
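As a minimal sketch of one common content-analysis step, the snippet below counts term frequencies across a small set of hypothetical documents; the document texts, stopword list, and function name are all illustrative assumptions (real analyses use richer coding schemes and curated stopword lists):

```python
from collections import Counter
import re

# Hypothetical archive of short documents (e.g., newspaper excerpts).
documents = [
    "Local schools report rising enrollment this year.",
    "Enrollment at local colleges is rising as well.",
    "City council debates school funding for next year.",
]

STOPWORDS = frozenset({"this", "is", "as", "at", "the", "for"})

def term_frequencies(docs, stopwords=STOPWORDS):
    """Tokenize each document, drop stopwords, and tally the remaining terms."""
    words = []
    for doc in docs:
        words += [w for w in re.findall(r"[a-z]+", doc.lower())
                  if w not in stopwords]
    return Counter(words)

freqs = term_frequencies(documents)
# Frequently recurring terms (e.g., "enrollment", "rising") can point to
# themes worth closer, qualitative reading.
```

Frequency counts like these do not replace careful reading; they simply flag recurring terms that may merit deeper examination.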

Descriptive research data, rich in details and insights, necessitates meticulous analysis to derive meaningful conclusions. The analysis process transforms raw data into structured findings that can be communicated and acted upon.

Qualitative content analysis

For data collected through interviews, focus groups, observations, or open-ended survey questions, qualitative content analysis is a popular choice. This involves examining non-numerical data to identify patterns, themes, or categories.

By coding responses or observations, researchers can identify recurring elements, making it easier to comprehend larger data sets and draw insights.

Using descriptive statistics

When dealing with quantitative data from surveys or experiments, descriptive statistics are invaluable. Measures such as mean, median, mode, standard deviation, and frequency distributions help summarize data sets, providing a snapshot of the overall patterns.

Graphical representations like histograms, pie charts, or bar graphs can further help in visualizing these statistics.
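The summary measures above can be computed with Python's standard library alone; the ratings below are hypothetical survey responses on a 1–5 scale, used purely for illustration:

```python
from statistics import mean, median, mode, stdev
from collections import Counter

# Hypothetical survey data: ten respondents' satisfaction ratings (1-5).
ratings = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4]

print("mean:", mean(ratings))        # 3.8
print("median:", median(ratings))    # 4.0
print("mode:", mode(ratings))        # 4
print("stdev:", round(stdev(ratings), 2))

# Frequency distribution: how often each rating was given.
print("frequencies:", Counter(ratings))
```

The frequency distribution from `Counter` is also the raw material for a histogram or bar chart, should a visual summary be needed.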

Coding and categorizing the data

Both qualitative and quantitative data often require coding. Coding involves assigning labels to specific responses or behaviors to group similar segments of data. This categorization aids in identifying patterns, especially in vast data sets.

For instance, responses to open-ended questions in a survey can be coded based on keywords or sentiments, allowing for a more structured analysis.
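A minimal sketch of such keyword-based coding follows; the responses, codebook categories, and keywords are all hypothetical (in practice, codebooks are developed iteratively and coding is usually cross-checked by multiple coders):

```python
# Hypothetical open-ended survey responses.
responses = [
    "The price was too high for what you get.",
    "Friendly staff, very helpful service.",
    "Delivery took two weeks, far too slow.",
    "Great value and excellent customer service.",
]

# Hypothetical codebook: category -> keywords that trigger that code.
codebook = {
    "cost": ["price", "value", "expensive"],
    "service": ["staff", "service", "helpful"],
    "delivery": ["delivery", "slow", "late"],
}

def code_response(text):
    """Assign every code whose keywords appear in the response."""
    text = text.lower()
    return sorted(code for code, keywords in codebook.items()
                  if any(kw in text for kw in keywords))

coded = [code_response(r) for r in responses]
# A response can carry multiple codes, e.g. both "cost" and "service".
```

Once responses are coded, tallying the codes turns open-ended text into a frequency distribution that can be summarized like any other descriptive data.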

Visual representation through graphs and charts

Visual aids like graphs, charts, and plots can simplify complex data, making it more accessible and understandable. Whether it's showcasing frequency distributions through histograms or mapping out relationships with networks, visual representations can elucidate trends and patterns effectively.

In the realm of research, the credibility of findings is paramount. Without trustworthiness in the results, even the most meticulously gathered data can lose its value. Two cornerstones that bolster the credibility of research outcomes are validity and reliability.

Validity: Measuring the right thing

Validity addresses the accuracy of the research. It seeks to answer the question: Is the research genuinely measuring what it aims to measure? In descriptive research, where the objective is to paint an authentic picture of the current state of affairs, ensuring validity is crucial.

For instance, if a study aims to understand consumer preferences for a product category, the questions posed should genuinely reflect those preferences and not veer into unrelated territories. Multiple forms of validity, including content, criterion, and construct validity, can be examined to ensure that the research instruments and processes are aligned with the research goals.

Reliability: Consistency in findings

Reliability, on the other hand, pertains to the consistency of the research findings. When a study demonstrates reliability, this suggests that others could repeat the study and the outcomes would remain consistent across repetitions.

In descriptive research, factors like the clarity of survey questions, the training of observers, and the standardization of interview protocols play a role in enhancing reliability. Techniques such as test-retest and internal consistency measurements can be employed to assess and improve reliability.
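Test-retest reliability is often summarized with a correlation between two administrations of the same measure. Here is a minimal sketch using hypothetical scores and a hand-rolled Pearson correlation (real studies would use larger samples and established statistical software):

```python
# Hypothetical test-retest data: the same six respondents answered the
# same survey item twice, two weeks apart.
time1 = [4, 2, 5, 3, 4, 1]
time2 = [5, 2, 4, 3, 4, 2]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(time1, time2)
# A correlation near 1 suggests the item yields consistent scores on
# repetition; values far below 1 flag unstable measurement.
```

For these hypothetical data, r is roughly 0.86, which would usually be read as reasonably consistent measurement.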

  • What is descriptive research?

Last updated: 5 February 2023. Reviewed by Cathy Heath.

Descriptive research is a common investigatory model used by researchers in various fields, including social sciences, linguistics, and academia.

Read on to understand the characteristics of descriptive research and explore its underlying techniques, processes, and procedures.

Descriptive research is a non-experimental research method. It enables researchers to precisely and methodically describe a population, circumstance, or phenomenon.

As the name suggests, descriptive research describes the characteristics of the group, situation, or phenomenon being studied without manipulating variables or testing hypotheses. This can be reported using surveys, observational studies, and case studies. You can use both quantitative and qualitative methods to compile the data.

Besides making observations and then comparing and analyzing them, descriptive studies often build foundational knowledge and inform solutions to practical issues. They typically aim to answer how, when, and where an event occurred, and what the problem or phenomenon is.

  • Characteristics of descriptive research

The following are some of the characteristics of descriptive research:

Quantitativeness

Descriptive research can be quantitative as it gathers quantifiable data to statistically analyze a population sample. These numbers can show patterns, connections, and trends over time and can be discovered using surveys, polls, and experiments.

Qualitativeness

Descriptive research can also be qualitative. It gives meaning and context to the numbers supplied by quantitative descriptive research.

Researchers can use tools like interviews, focus groups, and ethnographic studies to illustrate why things are the way they are and to help characterize the research problem. In this respect, qualitative descriptive work is more explanatory than exploratory or experimental research.

Uncontrolled variables

Descriptive research differs from experimental research in that researchers cannot manipulate the variables. They are recognized, scrutinized, and quantified instead. This is one of its most prominent features.

Cross-sectional studies

Descriptive research is often cross-sectional, examining several areas of the same group at once. It involves obtaining data on multiple variables at the individual level during a certain period. It’s helpful when trying to understand a larger community’s habits or preferences.

Carried out in a natural environment

Descriptive studies are usually carried out in the participants’ everyday environment, which allows researchers to avoid influencing respondents by collecting data in a natural setting. You can use online surveys or survey questions to collect data, or simply observe.

Basis for further research

You can further dissect descriptive research’s outcomes and use them for different types of investigation. The outcomes also serve as a foundation for subsequent investigations and can guide future studies. For example, you can use the data obtained in descriptive research to help determine future research designs.

  • Descriptive research methods

There are three basic approaches for gathering data in descriptive research: observational, case study, and survey.

Surveys

You can use surveys to gather data in descriptive research. This involves gathering information from many people using questionnaires and interviews.

Surveys remain the dominant research tool for descriptive research design. Researchers can conduct various investigations and collect multiple types of data (quantitative and qualitative) using surveys with diverse designs.

You can conduct surveys over the phone, online, or in person. Your survey might be a brief interview or conversation with a set of prepared questions intended to obtain quick information from the primary source.

Observation

This descriptive research method involves observing and gathering data on a population or phenomena without manipulating variables. It is employed in psychology, market research, and other social science studies to track and understand human behavior.

Observation is an essential component of descriptive research. It entails systematically gathering and analyzing data to characterize the variables in the study. This strategy usually allows for both qualitative and quantitative data analysis.

Case studies

A case study can outline a specific topic’s traits. The topic might be a person, group, event, or organization.

It involves using a subset of a larger group as a sample to characterize the features of that larger group.

Knowledge gained from a case study can sometimes be generalized to benefit a broader audience, though such generalizations should be made cautiously.

This approach entails carefully examining a particular group, person, or event over time. You can learn something new about the study topic by using a small group to better understand the dynamics of the entire group.

  • Types of descriptive research

There are several types of descriptive study. The most well-known include cross-sectional studies, census surveys, sample surveys, case reports, and comparison studies.

Case reports and case series

In the healthcare and medical fields, a case report is used to explain a patient’s circumstances when suffering from an uncommon illness or displaying certain symptoms. Case reports and case series are both collections of related cases. They have aided the advancement of medical knowledge on countless occasions.

Descriptive-normative survey

The normative component is an addition to the descriptive survey. In a descriptive-normative survey, you compare the study’s results to a norm.

Descriptive survey

This type of descriptive research employs surveys to collect information on various topics. The data help determine the degree to which certain conditions exist in the population.

Sample survey

You can extrapolate or generalize the information you obtain from sample surveys to the larger group being researched.

Correlative survey

Correlative surveys help establish if there is a positive, negative, or neutral connection between two variables.

Census survey

Performing census surveys involves gathering relevant data on several aspects of a given population. These units include individuals, families, organizations, objects, characteristics, and properties.

Cross-sectional study

In a cross-sectional study, you gather data on the variables of interest from a specific population at a single point in time. Cross-sectional studies provide a glimpse of a phenomenon’s prevalence and features in a population. They raise few ethical challenges and are quite simple and inexpensive to carry out.

Comparative studies

These studies compare the conditions or characteristics of two or more subjects. The subjects may include research variables, organizations, plans, and people.

Comparison points, assumed similarities, and criteria of comparison are three important factors that affect how well and how accurately comparative studies are conducted.

For instance, descriptive research can help determine how many CEOs hold a bachelor’s degree and what proportion of low-income households receive government help.

  • Pros and cons

The primary advantage of descriptive research designs is that researchers can create a reliable and beneficial database for additional study. To conduct any inquiry, you need access to reliable information sources that can give you a firm understanding of a situation.

Quantitative studies are time- and resource-intensive, so knowing the hypotheses viable for testing is crucial. The basic overview of descriptive research provides helpful hints as to which variables are worth quantitatively examining. This is why it’s employed as a precursor to quantitative research designs.

Some experts view this research as untrustworthy and unscientific because no variables are manipulated, so there is no way to statistically test the findings against a hypothesis.

Cause-and-effect relationships also can’t be established through descriptive investigations. Additionally, observational findings can be difficult to replicate, which limits review and verification of the results.

The absence of statistical and in-depth analysis and the rather superficial character of the investigative procedure are drawbacks of this research approach.

  • Descriptive research examples and applications

Several descriptive research examples are emphasized based on their types, purposes, and applications. Research questions often begin with “What is …” These studies help find solutions to practical issues in social science, physical science, and education.

Here are some examples and applications of descriptive research:

Determining consumer perception and behavior

Organizations use descriptive research designs to determine how various demographic groups react to a certain product or service.

For example, a business looking to sell to its target market should research the market’s behavior first. When researching human behavior in response to a cause or event, the researcher pays attention to the traits, actions, and responses before drawing a conclusion.

Scientific classification

Scientific descriptive research enables the classification of organisms and their traits and constituents.

Measuring data trends

A descriptive study design’s statistical capabilities allow researchers to track data trends over time. It’s frequently used to determine the study target’s current circumstances and underlying patterns.

Conducting comparisons

Organizations can use a descriptive research approach to learn how various demographics react to a certain product or service. For example, you can study how the target market responds to a competitor’s product and use that information to infer their behavior.

  • Bottom line

A descriptive research design is suitable for exploring certain topics and serving as a prelude to larger quantitative investigations. It provides a comprehensive understanding of the “what” of the group or thing you’re investigating.

This research type acts as the cornerstone of other research methodologies. It is distinctive because it can use quantitative and qualitative research approaches at the same time.

What is descriptive research design?

Descriptive research design aims to systematically obtain information to describe a phenomenon, situation, or population. More specifically, it helps answer the what, when, where, and how questions regarding the research problem rather than the why.

How does descriptive research compare to qualitative research?

Despite certain parallels, descriptive research concentrates on describing phenomena, while qualitative research aims to understand people better.

How do you analyze descriptive research data?

Data analysis involves using various methodologies, enabling the researcher to evaluate and provide results regarding validity and reliability.

Descriptive Research in Psychology

Sometimes you need to dig deeper than the pure statistics

John Loeppky is a freelance journalist based in Regina, Saskatchewan, Canada, who has written about disability and health for outlets of all kinds.

Types of Descriptive Research and the Methods Used

  • Advantages & Limitations of Descriptive Research

Best Practices for Conducting Descriptive Research

Descriptive research is one of the key tools in any psychology researcher’s toolbox for creating and leading a project that is both equitable and effective. Because psychology, as a field, loves definitions, let’s start with one. The University of Minnesota’s Introduction to Psychology defines this type of research as one that is “...designed to provide a snapshot of the current state of affairs.”

That’s pretty broad, so what does it mean in practice? Heather Derry-Vick, PhD, an assistant professor in psychiatry at Hackensack Meridian School of Medicine, helps us put it into perspective. “Descriptive research really focuses on defining, understanding, and measuring a phenomenon or an experience,” she says. “Not trying to change a person's experience or outcome, or even really looking at the mechanisms for why that might be happening, but more so describing an experience or a process as it unfolds naturally.”

Within the descriptive research methodology there are multiple types, including the following.

Descriptive Survey Research

This involves going beyond a typical tool like a Likert scale, where you place your response to a prompt on a one-to-five scale. We already know that scales like this can be ineffective, particularly when studying pain, for example.

When that's the case, using a descriptive methodology can help dig deeper into how a person is thinking, feeling, and acting rather than simply quantifying it in a way that might be unclear or confusing.

Descriptive Observational Research

Think of observational research like an ethically-focused version of people-watching. One example would be watching the patterns of children on a playground—perhaps when looking at a concept like risky play or seeking to observe social behaviors between children of different ages.

Descriptive Case Study Research

A descriptive approach to a case study is akin to a biography of a person, homing in on the experiences of a small group to extrapolate to larger themes. We most commonly see descriptive case studies when those in the psychology field are using past clients as an example to illustrate a point.

Correlational Descriptive Research

While descriptive research is often about the here and now, this form of the methodology allows researchers to make connections between groups of people. As an example from her research, Derry-Vick says she uses this method to identify how gender might play a role in cancer scan anxiety, aka scanxiety.

Dr. Derry-Vick's research uses surveys and interviews to get a sense of how cancer patients are feeling and what they are experiencing both in the course of their treatment and in the lead-up to their next scan, which can be a significant source of stress.

David Marlon, PsyD, MBA, who works as a clinician and as CEO at Vegas Stronger, and whose research focused on leadership styles at community-based clinics, says that using descriptive research allowed him to get beyond the numbers.

In his case, that means going beyond data points like how many unhoused people found stable housing over a certain period or how many people became drug-free, and identifying the reasons for those changes.

Those [data points] are some practical, quantitative tools that are helpful. But when I question them on how safe they feel, when I question them on the depth of the bond or the therapeutic alliance, when I talk to them about their processing of traumas, wellbeing...these are things that don't really fall on to a yes, no, or even on a Likert scale.

For the portion of his thesis that was focused on descriptive research, Marlon used semi-structured interviews to look at the how and the why of transformational leadership and its impact on clinics’ clients and staff.

Advantages & Limitations of Descriptive Research

So, if the advantages of descriptive research include that it centers the research participants, gives us a clear picture of what is happening to a person in a particular moment, and offers nuanced insight into how a situation is perceived by the very person affected, are there drawbacks? Yes, there are. Dr. Derry-Vick says it’s important to keep in mind that just because descriptive research tells us something is happening doesn’t mean it necessarily leads us to the resolution of a given problem.

I think that, by design, the descriptive research might not tell you why a phenomenon is happening. So it might tell you, very well, how often it's happening, or what the levels are, or help you understand it in depth. But that may or may not always tell you information about the causes or mechanisms for why something is happening.

Another limitation she identifies is that it also can’t tell you, on its own, whether a particular treatment pathway is having the desired effect.

“Descriptive research in and of itself can't really tell you whether a specific approach is going to be helpful until you take in a different approach to actually test it.”

Marlon, who believes in a multi-disciplinary approach, says that his subfield—addictions—is one where descriptive research has its limits, but helps readers go beyond preconceived notions of what addictions treatment looks and feels like when it is effective. “If we talked to and interviewed and got descriptive information from the clinicians and the clients, a much more precise picture would be painted, showing the need for a client's specific multidisciplinary approach augmented with a variety of modalities," he says. "If you tried to look at my discipline in a pure quantitative approach, it wouldn't begin to tell the real story.”

Best Practices for Conducting Descriptive Research

Because you’re controlling far fewer variables than in other forms of research, it’s important to identify whether those you are describing, your study participants, should be informed that they are part of a study.

For example, if you’re observing and describing who is buying what in a grocery store to identify patterns, then you might not need to identify yourself.

However, if you’re asking people about their fear of certain treatment, or how their marginalized identities impact their mental health in a particular way, there is far more of a pressure to think deeply about how you, as the researcher, are connected to the people you are researching.

Many descriptive research projects use interviews as a form of data gathering and, as a result, have ethical and practical concerns attached. Thankfully, there are plenty of guides from established researchers about how to best conduct these interviews and/or formulate surveys.

While descriptive research has its limits, it is commonly used by researchers to get a clear vantage point on what is happening in a given situation.

Tools like surveys, interviews, and observation are often employed to dive deeper into a given issue and really highlight the human element in psychological research. At its core, descriptive research is rooted in a collaborative style that allows deeper insights when used effectively.

University of Minnesota. Introduction to Psychology .


Chapter 2: Psychological Research

Descriptive Research

Psychologists use descriptive, experimental, and correlational methods to conduct research. Descriptive, or qualitative, methods include the case study, naturalistic observation, surveys, archival research, longitudinal research, and cross-sectional research.


There are many research methods available to psychologists in their efforts to understand, describe, and explain behavior and the cognitive and biological processes that underlie it. Some methods rely on observational techniques. Other approaches involve interactions between the researcher and the individuals who are being studied—ranging from a series of simple questions to extensive, in-depth interviews—to well-controlled experiments.

The three main categories of psychological research are descriptive, correlational, and experimental research. Research studies that do not test specific relationships between variables are called descriptive, or qualitative, studies . These studies are used to describe general or specific behaviors and attributes that are observed and measured. In the early stages of research it might be difficult to form a hypothesis, especially when there is not any existing literature in the area. In these situations designing an experiment would be premature, as the question of interest is not yet clearly defined as a hypothesis. Often a researcher will begin with a non-experimental approach, such as a descriptive study, to gather more information about the topic before designing an experiment or correlational study to address a specific hypothesis.
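To make this concrete, here is a minimal sketch of a purely descriptive analysis (the sleep-hours data are invented for illustration): a single variable is summarized with a tally and an average, and no relationship between variables is tested.

```python
from collections import Counter
import statistics

# Hypothetical single-variable descriptive data: hours of sleep
# reported by 12 students (no relationship between variables is tested)
hours = [7, 6, 8, 7, 5, 9, 7, 6, 8, 7, 6, 7]

tally = Counter(hours)              # how many students gave each answer
average = statistics.mean(hours)    # a simple descriptive summary

print(dict(sorted(tally.items())))  # {5: 1, 6: 3, 7: 5, 8: 2, 9: 1}
print(round(average, 2))            # 6.92
```

Note that nothing here tests a hypothesis about how sleep relates to any other variable; the output simply describes the sample.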

Video 1.  Descriptive Research Design  provides explanation and examples for quantitative descriptive research. A closed-captioned version of this video is available here .

Descriptive research is distinct from correlational research , in which psychologists formally test whether a relationship exists between two or more variables. Experimental research goes a step further beyond descriptive and correlational research and randomly assigns people to different conditions, using hypothesis testing to make inferences about how these conditions affect behavior. It aims to determine if one variable directly impacts and causes another. Correlational and experimental research both typically use hypothesis testing, whereas descriptive research does not.

Each of these research methods has unique strengths and weaknesses, and each method may only be appropriate for certain types of research questions. For example, studies that rely primarily on observation produce incredible amounts of information, but the ability to apply this information to the larger population is somewhat limited because of small sample sizes. Survey research, on the other hand, allows researchers to easily collect data from relatively large samples. While this allows for results to be generalized to the larger population more easily, the information that can be collected on any given survey is somewhat limited and subject to problems associated with any type of self-reported data. Some researchers conduct archival research by using existing records. While this can be a fairly inexpensive way to collect data that can provide insight into a number of research questions, researchers using this approach have no control over how or what kind of data were collected.

Correlational research can find a relationship between two variables, but the only way a researcher can claim that the relationship between the variables is cause and effect is to perform an experiment. In experimental research, which will be discussed later in the text, there is a tremendous amount of control over variables of interest. While this is a powerful approach, experiments are often conducted in very artificial settings. This calls into question the validity of experimental findings with regard to how they would apply in real-world settings. In addition, many of the questions that psychologists would like to answer cannot be pursued through experimental research because of ethical concerns.

Data Collection

Regardless of the method of research, data collection will be necessary. The method of data collection selected will primarily depend on the type of information the researcher needs for their study; however, other factors, such as time, resources, and even ethical considerations can influence the selection of a data collection method. All of these factors need to be considered when selecting a data collection method because each method has unique strengths and weaknesses. We will discuss the uses and assessment of the most common data collection methods: observation, surveys, archival data, and tests.

Observation

If you want to understand how behavior occurs, one of the best ways to gain information is to simply observe the behavior in its natural context. However, people might change their behavior in unexpected ways if they know they are being observed. How do researchers obtain accurate information when people tend to hide their natural behavior? As an example, imagine that your professor asks everyone in your class to raise their hand if they always wash their hands after using the restroom. Chances are that almost everyone in the classroom will raise their hand, but do you think hand washing after every trip to the restroom is really that universal?

This is very similar to the phenomenon mentioned earlier in this module: many individuals do not feel comfortable answering a question honestly. But if we are committed to finding out the facts about handwashing, we have other options available to us.

Suppose we send a classmate into the restroom to actually watch whether everyone washes their hands after using the restroom. Will our observer blend into the restroom environment by wearing a white lab coat, sitting with a clipboard, and staring at the sinks? We want our researcher to be inconspicuous—perhaps standing at one of the sinks pretending to put in contact lenses while secretly recording the relevant information. This type of observational study is called naturalistic observation : observing behavior in its natural setting. To better understand peer exclusion, Suzanne Fanger collaborated with colleagues at the University of Texas to observe the behavior of preschool children on a playground. How did the observers remain inconspicuous over the duration of the study? They equipped a few of the children with wireless microphones (which the children quickly forgot about) and observed while taking notes from a distance. Also, the children in that particular preschool (a “laboratory preschool”) were accustomed to having observers on the playground (Fanger, Frankel, & Hazen, 2012).

A photograph shows two police cars driving, one with its lights flashing.

Figure 1 . Seeing a police car behind you would probably affect your driving behavior. (credit: Michael Gil)

It is critical that the observer be as unobtrusive and as inconspicuous as possible: when people know they are being watched, they are less likely to behave naturally. If you have any doubt about this, ask yourself how your driving behavior might differ in two situations: In the first situation, you are driving down a deserted highway during the middle of the day; in the second situation, you are being followed by a police car down the same deserted highway (Figure 1).

It should be pointed out that naturalistic observation is not limited to research involving humans. Indeed, some of the best-known examples of naturalistic observation involve researchers going into the field to observe various kinds of animals in their own environments. As with human studies, the researchers maintain their distance and avoid interfering with the animal subjects so as not to influence their natural behaviors. Scientists have used this technique to study social hierarchies and interactions among animals ranging from ground squirrels to gorillas. The information provided by these studies is invaluable in understanding how those animals organize socially and communicate with one another. The primatologist Jane Goodall, for example, spent nearly five decades observing the behavior of chimpanzees in Africa (Figure 2). As an illustration of the types of concerns that a researcher might encounter in naturalistic observation, some scientists criticized Goodall for giving the chimps names instead of referring to them by numbers—using names was thought to undermine the emotional detachment required for the objectivity of the study (McKie, 2010).

(a) A photograph shows Jane Goodall speaking from a lectern. (b) A photograph shows a chimpanzee’s face.

Figure 2 . (a) Jane Goodall made a career of conducting naturalistic observations of (b) chimpanzee behavior. (credit “Jane Goodall”: modification of work by Erik Hersman; “chimpanzee”: modification of work by “Afrika Force”/Flickr.com)

The greatest benefit of naturalistic observation is the validity, or accuracy, of information collected unobtrusively in a natural setting. Having individuals behave as they normally would in a given situation means that we have a higher degree of ecological validity, or realism, than we might achieve with other research approaches. Therefore, our ability to generalize the findings of the research to real-world situations is enhanced. If done correctly, we need not worry about people or animals modifying their behavior simply because they are being observed. Sometimes, people may assume that reality programs give us a glimpse into authentic human behavior. However, the principle of inconspicuous observation is violated as reality stars are followed by camera crews and are interviewed on camera for personal confessionals. Given that environment, we must doubt how natural and realistic their behaviors are.

The major downside of naturalistic observation is that such studies are often difficult to set up and control. In our restroom study, what if you stood in the restroom all day prepared to record people’s handwashing behavior and no one came in? Or, what if you have been closely observing a troop of gorillas for weeks only to find that they migrated to a new place while you were sleeping in your tent? The benefit of realistic data comes at a cost. As a researcher, you have no control over when (or if) you have behavior to observe. In addition, this type of observational research often requires significant investments of time, money, and a good dose of luck.

Sometimes studies involve structured observation. In these cases, people are observed while engaging in set, specific tasks. An excellent example of structured observation comes from the Strange Situation procedure developed by Mary Ainsworth (you will read more about this in the module on lifespan development). The Strange Situation is used to evaluate attachment styles that exist between an infant and caregiver. In this scenario, caregivers bring their infants into a room filled with toys. The Strange Situation involves a number of phases, including a stranger coming into the room, the caregiver leaving the room, and the caregiver’s return to the room. The infant’s behavior is closely monitored at each phase, but it is the behavior of the infant upon being reunited with the caregiver that is most telling in terms of characterizing the infant’s attachment style with the caregiver.

Another potential problem in observational research is observer bias . Generally, people who act as observers are closely involved in the research project and may unconsciously skew their observations to fit their research goals or expectations. To protect against this type of bias, researchers should have clear criteria established for the types of behaviors recorded and how those behaviors should be classified. In addition, researchers often compare observations of the same event by multiple observers, in order to test inter-rater reliability : a measure of reliability that assesses the consistency of observations by different observers.
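Inter-rater reliability can be quantified in several ways. A minimal sketch (the behavior codes and ratings are invented for illustration) computes simple percent agreement and Cohen's kappa, which corrects percent agreement for the agreement expected by chance:

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of observations on which two raters gave the same code."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    # Expected chance agreement from each rater's marginal code frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical observers coding the same 10 playground episodes
a = ["inclusion", "exclusion", "inclusion", "inclusion", "exclusion",
     "inclusion", "exclusion", "inclusion", "inclusion", "inclusion"]
b = ["inclusion", "exclusion", "inclusion", "exclusion", "exclusion",
     "inclusion", "exclusion", "inclusion", "inclusion", "inclusion"]

print(percent_agreement(a, b))        # 0.9
print(round(cohens_kappa(a, b), 3))   # kappa is lower than raw agreement
```

Kappa is lower than raw percent agreement here because two raters who mostly code "inclusion" would agree fairly often even by chance.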

Surveys

Often, psychologists develop surveys as a means of gathering data. Surveys are lists of questions to be answered by research participants, and can be delivered as paper-and-pencil questionnaires, administered electronically, or conducted verbally (Figure 3). Generally, the survey itself can be completed in a short time, and the ease of administering a survey makes it easy to collect data from a large number of people.

Surveys allow researchers to gather data from larger samples than may be afforded by other research methods . A sample is a subset of individuals selected from a population , which is the overall group of individuals that the researchers are interested in. Researchers study the sample and seek to generalize their findings to the population.
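The sample/population distinction can be illustrated with a small simulation (all numbers invented): a random sample's mean is used to estimate the mean of a much larger population that was never fully measured.

```python
import random
import statistics

random.seed(42)  # reproducible draw

# Hypothetical population: daily study minutes for 10,000 students
population = [random.gauss(120, 30) for _ in range(10_000)]

# Researchers rarely measure everyone; they draw a sample instead
sample = random.sample(population, k=200)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {pop_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
# With a sufficiently large random sample, the two means are close,
# which is what lets researchers generalize from sample to population.
```

Random selection is what makes the generalization defensible; a sample of, say, only early-morning library users would not stand in for the whole population.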

A sample online survey reads, “Dear visitor, your opinion is important to us. We would like to invite you to participate in a short survey to gather your opinions and feedback on your news consumption habits. The survey will take approximately 10-15 minutes. Simply click the “Yes” button below to launch the survey. Would you like to participate?” Two buttons are labeled “yes” and “no.”

Figure 3 . Surveys can be administered in a number of ways, including electronically administered research, like the survey shown here. (credit: Robert Nyman)

Surveys have both strengths and weaknesses in comparison to case studies. By using surveys, we can collect information from a larger sample of people. A larger sample is better able to reflect the actual diversity of the population, thus allowing better generalizability. Therefore, if our sample is sufficiently large and diverse, we can assume that the data we collect from the survey can be generalized to the larger population with more certainty than the information collected through a case study. However, given the greater number of people involved, we are not able to collect the same depth of information on each person that would be collected in a case study.

Another potential weakness of surveys is something we touched on earlier in this module: people don’t always give accurate responses. They may lie, misremember, or answer questions in a way that they think makes them look good. For example, people may report drinking less alcohol than is actually the case.

Any number of research questions can be answered through the use of surveys. One real-world example is the research conducted by Jenkins, Ruppel, Kizer, Yehl, and Griffin (2012) about the backlash against the US Arab-American community following the terrorist attacks of September 11, 2001. Jenkins and colleagues wanted to determine to what extent these negative attitudes toward Arab-Americans still existed nearly a decade after the attacks occurred. In one study, 140 research participants filled out a survey with 10 questions, including questions asking directly about the participant’s overt prejudicial attitudes toward people of various ethnicities. The survey also asked indirect questions about how likely the participant would be to interact with a person of a given ethnicity in a variety of settings (such as, “How likely do you think it is that you would introduce yourself to a person of Arab-American descent?”). The results of the research suggested that participants were unwilling to report prejudicial attitudes toward any ethnic group. However, there were significant differences between their pattern of responses to questions about social interaction with Arab-Americans compared to other ethnic groups: they indicated less willingness for social interaction with Arab-Americans compared to the other ethnic groups. This suggested that the participants harbored subtle forms of prejudice against Arab-Americans, despite their assertions that this was not the case (Jenkins et al., 2012).

Archival Data and Case Studies

Some researchers gain access to large amounts of data without interacting with a single research participant. Instead, they use existing records to answer various research questions. This type of research approach is known as archival research, which relies on examining past records or data sets for interesting patterns or relationships.

For example, a researcher might access the academic records of all individuals who enrolled in college within the past ten years and calculate how long it took them to complete their degrees, as well as course loads, grades, and extracurricular involvement. Archival research could provide important information about who is most likely to complete their education, and it could help identify important risk factors for struggling students (Figure 4).

(a) A photograph shows stacks of paper files on shelves. (b) A photograph shows a computer.

Figure 4 . A researcher doing archival research examines records, whether archived as a (a) hardcopy or (b) electronically. (credit “paper files”: modification of work by “Newtown graffiti”/Flickr; “computer”: modification of work by INPIVIC Family/Flickr)

In comparing archival research to other research methods, there are several important distinctions. For one, the researcher employing archival research never directly interacts with research participants. Therefore, the investment of time and money to collect data is considerably less with archival research. Additionally, researchers have no control over what information was originally collected. Therefore, research questions have to be tailored so they can be answered within the structure of the existing data sets. There is also no guarantee of consistency between the records from one source to another, which might make comparing and contrasting different data sets problematic.


Tests

A good test will aid researchers in assessing a particular psychological construct. What is a good test? Researchers want a test that is standardized, reliable, and valid. A standardized test is one that is administered, scored, and analyzed in the same way for each participant. This minimizes differences in test scores due to confounding factors, such as variability in the testing environment or scoring process, and assures that scores are comparable. Reliability refers to the consistency of a measure. Researchers consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (interrater reliability). Validity is the extent to which the scores from a measure represent the variable they are intended to measure. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to.
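Test-retest reliability is typically summarized as the correlation between two administrations of the same test. A minimal sketch (scores invented for illustration), using the Pearson correlation coefficient:

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between paired scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# The same 8 hypothetical participants tested twice, two weeks apart
time1 = [98, 105, 112, 91, 120, 103, 110, 95]
time2 = [101, 104, 115, 94, 118, 100, 112, 97]

r = pearson_r(time1, time2)
print(round(r, 3))  # a value near 1.0 indicates good test-retest reliability
```

The same correlation machinery, applied to two raters' scores instead of two time points, would yield an interrater reliability estimate.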

There are various types of tests used in psychological research. Self-report measures are those in which participants report on their own thoughts, feelings, and actions, such as the Rosenberg Self-Esteem Scale or the Big Five Personality Test. Some tests measure performance, ability, aptitude, or skill, like the Stanford-Binet Intelligence Scale or the SAT. There are also tests that measure physiological states, including electrical activity or blood flow in the brain.

Video 2.  Methods of Data Collection  explains various means for gathering data for quantitative and qualitative research. A closed-captioned version of this video is available here .

Studying Changes over Time

Sometimes, especially in developmental research, the researcher is interested in examining changes over time and will need to consider a research design that will capture these changes. Remember,  research methods  are tools that are used to collect information, while  research design  is the strategy or blueprint for deciding how to collect and analyze information. Research design dictates which methods are used and how. There are three types of developmental research designs: cross-sectional, longitudinal, and sequential.

Video 3.  Developmental Research Designs

Cross-Sectional Design

The majority of developmental studies use cross-sectional designs because they are less time-consuming and less expensive than other developmental designs.  Cross-sectional research  designs are used to examine behavior in participants of different ages who are tested at the same point in time. Let’s suppose that researchers are interested in the relationship between intelligence and aging. They might have a hypothesis that intelligence declines as people get older. The researchers might choose to give a particular intelligence test to individuals who are 20 years old, individuals who are 50 years old, and individuals who are 80 years old at the same time and compare the data from each age group. This research is cross-sectional in design because the researchers plan to examine the intelligence scores of individuals of different ages within the same study at the same time; they are taking a “cross-section” of people at one point in time. Let’s say that the comparisons find that the 80-year-old adults score lower on the intelligence test than the 50-year-old adults, and the 50-year-old adults score lower on the intelligence test than the 20-year-old adults. Based on these data, the researchers might conclude that individuals become less intelligent as they get older. Would that be a valid (accurate) interpretation of the results?


Figure 5. Example of cross-sectional research design

No, that would not be a valid conclusion because the researchers did not follow individuals as they aged from 20 to 50 to 80 years old. One of the primary limitations of cross-sectional research is that the results yield information about age  differences  not necessarily  changes  over time. That is, although the study described above can show that the 80-year-olds scored lower on the intelligence test than the 50-year-olds, and the 50-year-olds scored lower than the 20-year-olds, the data used for this conclusion were collected from different individuals (or groups). It could be, for instance, that when these 20-year-olds get older, they will still score just as high on the intelligence test as they did at age 20. Similarly, maybe the 80-year-olds would have scored relatively low on the intelligence test when they were young; the researchers don’t know for certain because they did not follow the same individuals as they got older.
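The age-group comparison described above can be sketched in a few lines (all scores are fabricated). Note that the code can only report differences between groups measured at one point in time, not changes within people:

```python
import statistics

# Hypothetical cross-sectional data: one test score per person,
# collected from three different age groups at the same point in time
scores_by_age = {
    20: [105, 110, 98, 112, 107],
    50: [101, 99, 104, 95, 103],
    80: [92, 96, 88, 94, 90],
}

for age, scores in scores_by_age.items():
    print(f"age {age}: mean score {statistics.mean(scores):.1f}")

# Means decline across groups, but because each group contains
# different people (different cohorts), this shows age *differences*,
# not necessarily age-related *change* within individuals.
```

A longitudinal design would instead keep one list per person, with a score for each wave of testing.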

With each cohort being members of a different generation, it is also possible that the differences found between the groups are not due to age, per se, but due to cohort effects. Differences between these cohorts’ IQ results could be due to differences in life experiences specific to their generation, such as differences in education, economic conditions, advances in technology, or changes in health and nutrition standards, and not due to age-related changes.

Another disadvantage of cross-sectional research is that it is limited to one time of measurement. Data are collected at one point in time, and it’s possible that something could have happened in that year in history that affected all of the participants, although possibly each cohort may have been affected differently.

Longitudinal Research Design


Longitudinal research designs are used to examine behavior in the same individuals over time. For instance, with our example of studying intelligence and aging, a researcher might conduct a longitudinal study to examine whether 20-year-olds become less intelligent with age over time. To this end, a researcher might give an intelligence test to individuals when they are 20 years old, again when they are 50 years old, and then again when they are 80 years old. This study is longitudinal in nature because the researcher plans to study the same individuals as they age. Based on these data, the pattern of intelligence and age might look different than from the cross-sectional research; it might be found that participants’ intelligence scores are higher at age 50 than at age 20 and then remain stable or decline a little by age 80. How can that be when cross-sectional research revealed declines in intelligence with age?


Figure 6. Example of a longitudinal research design

Since longitudinal research happens over a period of time (which could be short term, as in months, but is often longer, as in years), there is a risk of attrition.  Attrition  occurs when participants fail to complete all portions of a study. Participants may move, change their phone numbers, die, or simply become disinterested in participating over time. Researchers should account for the possibility of attrition by enrolling a larger sample into their study initially, as some participants will likely drop out over time. There is also something known as  selective attrition— this means that certain groups of individuals may tend to drop out. It is often the least healthy, least educated, and lower socioeconomic participants who tend to drop out over time. That means that the remaining participants may no longer be representative of the whole population, as they are, in general, healthier, better educated, and have more money. This could be a factor in why our hypothetical research found a more optimistic picture of intelligence and aging as the years went by. What can researchers do about selective attrition? At each time of testing, they could randomly recruit more participants from the same cohort as the original members to replace those who have dropped out.
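Selective attrition can be illustrated with a small simulation (all numbers invented): if lower-scoring participants are more likely to drop out between waves, the mean of the remaining sample drifts upward even though no individual's score changed.

```python
import random
import statistics

random.seed(7)  # reproducible simulation

# Wave 1: 1,000 hypothetical participants with test scores
wave1 = [random.gauss(100, 15) for _ in range(1000)]

# Wave 2: every person's score is unchanged, but lower scorers are
# retained with lower probability (selective attrition)
wave2 = [s for s in wave1 if random.random() < (0.3 if s < 100 else 0.8)]

print(f"wave 1 mean: {statistics.mean(wave1):.1f}  (n={len(wave1)})")
print(f"wave 2 mean: {statistics.mean(wave2):.1f}  (n={len(wave2)})")
# The wave 2 mean is higher even though nobody's score changed, so the
# remaining sample no longer represents the original cohort.
```

This is why a longitudinal finding of stable or improving scores must be checked against who dropped out, not just who remained.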

The results from longitudinal studies may also be impacted by repeated assessments. Consider how well you would do on a math test if you were given the exact same exam every day for a week. Your performance would likely improve over time, not necessarily because you developed better math abilities, but because you were continuously practicing the same math problems. This phenomenon is known as a practice effect. Practice effects occur when participants become better at a task over time because they have done it again and again (not due to natural psychological development). So our participants may have become familiar with the intelligence test each time (and with the computerized testing administration).

Another limitation of longitudinal research is that the data are limited to only one cohort. As an example, think about how comfortable the participants in the 2010 cohort of 20-year-olds are with computers. Since only one cohort is being studied, there is no way to know if findings would be different from other cohorts. In addition, changes that are found as individuals age over time could be due to age or to time of measurement effects. That is, the participants are tested at different periods in history, so the variables of age and time of measurement could be confounded (mixed up). For example, what if there is a major shift in workplace training and education between 2020 and 2040, and many of the participants experience a lot more formal education in adulthood, which positively impacts their intelligence scores in 2040? Researchers wouldn’t know if the intelligence scores increased due to growing older or due to a more educated workforce over time between measurements.

Sequential Research Design

Sequential research  designs include elements of both longitudinal and cross-sectional research designs. Similar to longitudinal designs, sequential research features participants who are followed over time; similar to cross-sectional designs, sequential research includes participants of different ages. This research design is also distinct from those discussed previously in that individuals of different ages are enrolled into a study at various points in time, allowing researchers to examine age-related changes, track development within the same individuals as they age, and account for the possibility of cohort and/or time of measurement effects.

Consider, once again, our example of intelligence and aging. In a study with a sequential design, a researcher might recruit three separate groups of participants (Groups A, B, and C). Group A would be recruited when they are 20 years old in 2010 and would be tested again when they are 50 and 80 years old in 2040 and 2070, respectively (similar in design to the longitudinal study described previously). Group B would be recruited when they are 20 years old in 2040 and would be tested again when they are 50 years old in 2070. Group C would be recruited when they are 20 years old in 2070, and so on.
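The testing schedule described above can be sketched in a few lines; the group labels and measurement years are taken directly from the example in the text.

```python
# Sketch of the sequential design described above: each cohort enters the
# study at age 20 in a different year, and every cohort is retested at
# each later measurement year.
measurement_years = [2010, 2040, 2070]
entry_year = {"Group A": 2010, "Group B": 2040, "Group C": 2070}

schedule = {
    group: {year: 20 + (year - start) for year in measurement_years if year >= start}
    for group, start in entry_year.items()
}

for group, waves in schedule.items():
    print(group, waves)
# Group A is tested at ages 20, 50, and 80; Group B at 20 and 50;
# Group C at 20 (with further waves beyond 2070 continuing the pattern).
```

Reading down a column of this schedule gives a cross-sectional comparison (different ages in the same year), while reading across a row gives a longitudinal one (the same cohort at different ages).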


Figure 7. Example of sequential research design

Studies with sequential designs are powerful because they allow for both longitudinal and cross-sectional comparisons—changes and/or stability with age over time can be measured and compared with differences between age and cohort groups. This research design also allows for the examination of cohort and time of measurement effects. For example, the researcher could examine the intelligence scores of 20-year-olds at different times in history and in different cohorts (follow the yellow diagonal lines in Figure 7). This might be examined by researchers who are interested in sociocultural and historical changes (because we know that lifespan development is multidisciplinary). One way of looking at the usefulness of the various developmental research designs was described by Schaie and Baltes (1975): cross-sectional and longitudinal designs might reveal change patterns, while sequential designs might identify developmental origins for the observed change patterns.

Since it includes elements of longitudinal and cross-sectional designs, sequential research has many of the same strengths and limitations as these other approaches. For example, sequential work may require less time and effort than longitudinal research (if data are collected more frequently than over the 30-year spans in our example) but more time and effort than cross-sectional research. Although practice effects may be an issue if participants are asked to complete the same tasks or assessments over time, attrition may be less problematic than what is commonly experienced in longitudinal research since participants may not have to remain involved in the study for such a long period of time.

Comparing Developmental Research Designs

When considering the best research design to use in their research, scientists think about their main research question and the best way to come up with an answer. A table of advantages and disadvantages for each of the described research designs is provided here to help you as you consider what sorts of studies would be best conducted using each of these different approaches.

Table 1.  Advantages and disadvantages of different research designs

  • Introductory content. Provided by : Lumen Learning. License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike
  • Modification, adaptation, and original content. Provided by : Lumen Learning. License : CC BY-SA: Attribution-ShareAlike
  • Paragraph on correlation. Authored by : Christie Napa Scollon. Provided by : Singapore Management University. Located at : http://nobaproject.com/modules/research-designs?r=MTc0ODYsMjMzNjQ%3D . Project : The Noba Project. License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike
  • Psychology, Approaches to Research. Authored by : OpenStax College. Located at : http://cnx.org/contents/[email protected]:mfArybye@7/Analyzing-Findings . License : CC BY: Attribution . License Terms : Download for free at http://cnx.org/contents/[email protected]
  • Lec 2 | MIT 9.00SC Introduction to Psychology, Spring 2011. Authored by : John Gabrieli. Provided by : MIT OpenCourseWare. Located at : https://www.youtube.com/watch?v=syXplPKQb_o . License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike
  • Descriptive Research. Provided by : Boundless. Located at : https://courses.lumenlearning.com/boundless-psychology/ . License : CC BY-SA: Attribution-ShareAlike
  • Researchers review documents. Authored by : National Cancer Institute. Provided by : Wikimedia. Located at : https://commons.wikimedia.org/wiki/File:Researchers_review_documents.jpg . License : Public Domain: No Known Copyright



Descriptive Research: What It Is and How to Use It

Understanding the who, what, and where of a situation or target group is an essential part of effective research and informed business decision-making.

For example, you might want to understand what percentage of CEOs have a bachelor’s degree or higher. Or you might want to understand what percentage of low-income families receive government support, and what kind of support they receive.

Descriptive research is what will be used in these types of studies.

In this guide, we’ll look at the main issues relating to descriptive research to give you a better understanding of what it is, and how and why you can use it.


What is descriptive research?

Descriptive research is a research method used to determine the characteristics of a population or a particular phenomenon.

Using descriptive research, you can identify patterns in the characteristics of a group and establish almost everything you need to understand about a situation, apart from why it has happened.

Market researchers use descriptive research for a range of commercial purposes to guide key decisions.

For example, you could use descriptive research to understand fashion trends in a given city when planning your clothing collection for the year. You could conduct in-depth analysis of the demographic makeup of your target area and use the resulting data to establish buying patterns.

Conducting descriptive research wouldn’t, however, tell you why shoppers are buying a particular type of fashion item.

Descriptive research design

Descriptive research design uses a range of both qualitative and quantitative data (although quantitative data are usually primary) to gather the information needed to accurately describe a particular problem or hypothesis.

As a survey method, descriptive research designs will help researchers identify characteristics in their target market or particular population.

These characteristics in the population sample can be identified, observed and measured to guide decisions.

Descriptive research characteristics

While there are several descriptive research methods you can deploy for data collection, descriptive research does have a number of predictable characteristics.

Here are a few of the things to consider:

Measure data trends with statistical outcomes

Descriptive research is often popular for survey research because it generates answers in a statistical form, which makes it easy for researchers to carry out a simple statistical analysis to interpret what the data is saying.
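As a minimal sketch of what that simple statistical analysis might look like, here is a tally of hypothetical survey responses (the question and the answers are invented for illustration):

```python
from collections import Counter

# Hypothetical answers to "What is your highest level of education?"
responses = [
    "bachelor's", "high school", "bachelor's", "master's", "bachelor's",
    "high school", "master's", "bachelor's", "doctorate", "bachelor's",
]

counts = Counter(responses)
total = len(responses)

# A frequency table with percentages is often all a descriptive
# question needs.
for level, n in counts.most_common():
    print(f"{level:<12} {n:>2}  ({100 * n / total:.0f}%)")
```

A tally like this answers the descriptive question ("what proportion holds each degree?") without making any claim about why the distribution looks the way it does.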

Descriptive research design is ideal for further research

Because the data collection for descriptive research produces statistical outcomes, it can also be used as secondary data for another research study.

Plus, the data collected from descriptive research can be subjected to other types of data analysis.

Uncontrolled variables

A key characteristic of the descriptive research method is that the variables under study are observed rather than controlled by the researchers. This is because descriptive research aims to capture the natural behavior of the research subject.

It’s carried out in a natural environment

Descriptive research is often carried out in a natural environment. This is because researchers aim to gather data in a natural setting to avoid swaying respondents.

Data can be gathered using survey questions or online surveys.

For example, if you want to understand the fashion trends we mentioned earlier, you would set up a study in which a researcher observes people in their natural environment to understand their habits and preferences.

Descriptive research allows for cross-sectional study

Because of the nature of descriptive research design and the randomness of the sample group being observed, descriptive research is ideal for cross-sectional studies: the demographics of the group can vary widely, and your aim is to gain insights from within the group.

This can be highly beneficial when you’re looking to understand the behaviors or preferences of a wider population.

Descriptive research advantages

There are many advantages to using descriptive research, some of them include:

Cost effectiveness

Because the elements needed for descriptive research design are not specific or highly targeted (and occur within the respondent’s natural environment), this type of study is relatively cheap to carry out.

Multiple types of data can be collected

A big advantage of this research type is that you can use it to collect both quantitative and qualitative data. This means you can use the statistics gathered to identify underlying patterns in your respondents’ behavior.

Descriptive research disadvantages

Potential reliability issues

When conducting descriptive research, it’s important that the initial survey questions are properly formulated. If not, the answers may be unreliable and risk the credibility of your study.

Potential limitations

As we’ve mentioned, descriptive research design is ideal for understanding the what, who or where of a situation or phenomenon.

However, it can’t help you understand the cause or effect of the behavior. This means you’ll need to conduct further research to get a more complete picture of a situation.

Descriptive research methods

Because descriptive research methods include a range of quantitative and qualitative research, there are several research methods you can use.

Use case studies

Case studies in descriptive research involve conducting in-depth and detailed studies in which researchers get a specific person or case to answer questions.

Case studies shouldn’t be used to generate results; rather, they should be used to build or establish hypotheses that you can expand into further market research. For example, you could gather detailed data about a specific business phenomenon and then use this deeper understanding of that specific case to inform further research.

Use observational methods

This type of study uses qualitative observations to understand human behavior within a particular group.

By understanding how the different demographics respond within your sample you can identify patterns and trends.

As an observational method, descriptive research will not tell you the cause of any particular behaviors, but that could be established with further research.

Use survey research

Surveys are one of the most cost effective ways to gather descriptive data.

An online survey or questionnaire can be used in descriptive studies to gather quantitative information about a particular problem.

Survey research is ideal if you’re using descriptive research as your primary research.

Descriptive research examples

Descriptive research is used for a number of commercial purposes or when organizations need to understand the behaviors or opinions of a population.

One of the most familiar examples of descriptive research, used in every democratic country, is election polling.

Using descriptive research, researchers survey voters to understand which of the available parties or candidates people are more likely to choose.

Researchers can then analyze the data to estimate what the election result will be.
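As a hedged illustration of that analysis (the respondent counts are invented), a pollster’s most basic calculation is a sample proportion with the standard normal-approximation margin of error:

```python
import math

# Illustrative poll: 1,000 respondents, 540 of whom favour candidate A.
n = 1000
supporters = 540
p_hat = supporters / n

# Standard 95% margin of error for a sample proportion
# (normal approximation).
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

print(f"estimated support: {100 * p_hat:.1f}% +/- {100 * margin:.1f} points")
```

With these numbers the margin works out to roughly three percentage points, which is why polls of about a thousand respondents so often quote a "±3%" figure.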

In a commercial setting, retailers often use descriptive research to figure out trends in shopping and buying decisions.

By gathering information on the habits of shoppers, retailers can get a better understanding of the purchases being made.

Another example, widely used around the world, is the national census, which takes place to understand the makeup of a country’s population.

The research will provide a more accurate picture of a population’s demographic makeup and help to understand changes over time in areas like population age, health and education level.



Descriptive Research 101: Definition, Methods and Examples

Parvathi Vijayamohan

8 April 2024

Table of Contents

  • What is Descriptive Research?
  • Key Characteristics of Descriptive Research
  • Descriptive Research Methods: The 3 You Need to Know!
  • Observation
  • Case Studies
  • 7 Types of Descriptive Research
  • Descriptive Research: Examples to Build Your Next Study
  • Tips to Excel at Descriptive Research

Imagine you are a detective called to a crime scene. Your job is to study the scene and report whatever you find: whether that’s the half-smoked cigarette on the table or the large “RACHE” written in blood on the wall. That, in a nutshell, is  descriptive research .

Researchers often need to do descriptive research on a problem before they attempt to solve it. So in this guide, we’ll take you through:

  • What is descriptive research + characteristics
  • Descriptive research methods
  • Types of descriptive research
  • Descriptive research examples
  • Tips to excel at the descriptive method


What is Descriptive Research?

Definition: As its name suggests, descriptive research  describes  the characteristics of the problem, phenomenon, situation, or group under study.

So the goal of all descriptive studies is to  explore  the background, details, and existing patterns in the problem to fully understand it. In other words, preliminary research.

However, descriptive research can be both  preliminary and conclusive . You can use the data from a descriptive study to make reports and get insights for further planning.

What descriptive research isn’t: Descriptive research finds the  what/when/where  of a problem, not the  why/how .

Because of this, we can’t use the descriptive method to explore cause-and-effect relationships where one variable (like a person’s job role) affects another variable (like their monthly income).

  • Answers the “what,” “when,” and “where”  of a research problem. For this reason, it is popularly used in  market research ,  awareness surveys , and  opinion polls .
  • Sets the stage  for a research problem. As an early part of the research process, descriptive studies help you dive deeper into the topic.
  • Opens the door  for further research. You can use descriptive data as the basis for more profound research, analysis and studies.
  • Qualitative and quantitative . It is possible to get a balanced mix of numerical responses and open-ended answers from the descriptive method.
  • No control or interference with the variables . The researcher simply observes and reports on them. However, specific research software has filters that allow researchers to zoom in on one variable.
  • Done in natural settings . You can get the best results from descriptive research by talking to people, surveying them, or observing them in a suitable environment. For example, suppose you are beta testing an app feature on your website. In that case, descriptive research invites users to try the feature, tracks their behavior, and then asks their opinions.
  • Can be applied to many research methods and areas. Examples include healthcare, SaaS, psychology, political studies, education, and pop culture.

Descriptive Research Methods: The Top Three You Need to Know!

Survey Research

In short, survey research is a brief interview or conversation with a set of prepared questions about a topic.

So you create a questionnaire, share it, and analyze the data you collect for further action.


  • Surveys can be hyper-local, regional, or global, depending on your objectives.
  • Share surveys in-person, offline, via SMS, email, or QR codes – so many options!
  • Easy to automate if you want to conduct many surveys over a period.

Observation

The observational method is a type of descriptive research in which you, the researcher, observe ongoing behavior.

Now, there are several (non-creepy) ways you can observe someone. In fact, observational research has three main approaches:

  • Covert observation: In true spy fashion, the researcher mixes in with the group undetected or observes from a distance.
  • Overt observation : The researcher identifies himself as a researcher – “The name’s Bond. J. Bond.” – and explains the purpose of the study.
  • Participatory observation : The researcher participates in what he is observing to understand his topic better.
  • Observation is one of the most accurate ways to get data on a subject’s behavior in a natural setting.
  • You don’t need to rely on people’s willingness to share information.
  • Observation is a universal method that can be applied to any area of research.

Case Studies

In the case study method, you do a detailed study of a specific group, person, or event over a period of time.

This brings us to a frequently asked question: “What’s the difference between case studies and longitudinal studies?”

A case study will go  very in-depth into the subject with one-on-one interviews, observations, and archival research. They are also qualitative, though sometimes they will use numbers and stats.

An example of longitudinal research would be a study of the health of night shift employees vs. general shift employees over a decade. An example of a case study would involve in-depth interviews with Casey, an assistant director of nursing who’s handled the night shift at the hospital for ten years now.

  • Due to the focus on a few people, case studies can give you a tremendous amount of information.
  • Because of the time and effort involved, a case study engages both researchers and participants.
  • Case studies are helpful for ethically investigating unusual, complex, or challenging subjects. An example would be a study of the habits of long-term cocaine users.

Descriptive Research: Examples to Build Your Next Study

1. Case Study: Airbnb’s Growth Strategy

In an excellent case study, Tam Al Saad, Principal Consultant, Strategy + Growth at Webprofits, deep-dives into how Airbnb attracted and retained 150 million users.

“What Airbnb offers isn’t a cheap place to sleep when you’re on holiday; it’s the opportunity to experience your destination as a local would. It’s the chance to meet the locals, experience the markets, and find non-touristy places.

Sure, you can visit the Louvre, see Buckingham Palace, and climb the Empire State Building, but you can do it as if it were your hometown while staying in a place that has character and feels like a home.” – Tam al Saad, Principal Consultant, Strategy + Growth at Webprofits

2. Observation – Better Tech Experiences for the Elderly

We often think that our elders are hopeless with technology. But we’re not getting any younger either, and tech is changing at breakneck speed! This article by Annemieke Hendricks shares a wonderful example in which researchers compare levels of technological familiarity between age groups and how that familiarity influences usage.

“It is generally assumed that older adults have difficulty using modern electronic devices, such as mobile telephones or computers. Because this age group is growing in most countries, changing products and processes to adapt to their needs is increasingly more important. “ – Annemieke Hendricks, Marketing Communication Specialist, Noldus

3. Surveys – Decoding Sleep with SurveySparrow

SRI International (formerly Stanford Research Institute) – an independent, non-profit research center – wanted to investigate the impact of stress on an adolescent’s sleep. To get those insights, two actions were essential: tracking sleep patterns through wearable devices and sending surveys at a pre-set time –  the pre-sleep period.

“With SurveySparrow’s recurring surveys feature, SRI was able to share engaging surveys with their participants exactly at the time they wanted and at the frequency they preferred.”

Read more about this project : How SRI International decoded sleep patterns with SurveySparrow

Tips to Excel at Descriptive Research

#1: Answer the six Ws –

  • Who should we consider?
  • What information do we need?
  • When should we collect the information?
  • Where should we collect the information?
  • Why are we obtaining the information?
  • Way to collect the information

#2: Introduce and explain your methodological approach

#3: Describe your methods of data collection and/or selection.

#4: Describe your methods of analysis.

#5: Explain the reasoning behind your choices.

#6: Collect data.

#7: Analyze the data. Use software to speed up the process and reduce overthinking and human error.

#8: Report your conclusions and how you drew the results.

Wrapping Up

That’s all, folks!



Enago Academy

Bridging the Gap: Overcome these 7 flaws in descriptive research design


Descriptive research design is a powerful tool used by scientists and researchers to gather information about a particular group or phenomenon. This type of research provides a detailed and accurate picture of the characteristics and behaviors of a particular population or subject. By observing and collecting data on a given topic, descriptive research helps researchers gain a deeper understanding of a specific issue and provides valuable insights that can inform future studies.

In this blog, we will explore the definition, characteristics, and common flaws in descriptive research design, and provide tips on how to avoid these pitfalls to produce high-quality results. Whether you are a seasoned researcher or a student just starting, understanding the fundamentals of descriptive research design is essential to conducting successful scientific studies.


What Is Descriptive Research Design?

The descriptive research design involves observing and collecting data on a given topic without attempting to infer cause-and-effect relationships. The goal of descriptive research is to provide a comprehensive and accurate picture of the population or phenomenon being studied and to describe the relationships, patterns, and trends that exist within the data.

Descriptive research methods can include surveys, observational studies, and case studies, and the data collected can be qualitative or quantitative. The findings from descriptive research provide valuable insights and inform future research, but do not establish cause-and-effect relationships.

Importance of Descriptive Research in Scientific Studies

1. Understanding of a Population or Phenomenon

Descriptive research provides a comprehensive picture of the characteristics and behaviors of a particular population or phenomenon, allowing researchers to gain a deeper understanding of the topic.

2. Baseline Information

The information gathered through descriptive research can serve as a baseline for future research and provide a foundation for further studies.

3. Informative Data

Descriptive research can provide valuable information and insights into a particular topic, which can inform future research, policy decisions, and programs.

4. Sampling Validation

Descriptive research can be used to validate sampling methods and to help researchers determine the best approach for their study.

5. Cost Effective

Descriptive research is often less expensive and less time-consuming than other research methods, making it a cost-effective way to gather information about a particular population or phenomenon.

6. Easy to Replicate

Descriptive research is straightforward to replicate, making it a reliable way to gather and compare information from multiple sources.

Key Characteristics of Descriptive Research Design

1. Purpose

The primary purpose of descriptive research is to describe the characteristics, behaviors, and attributes of a particular population or phenomenon.

2. Participants and Sampling

Descriptive research studies a particular population or sample that is representative of the larger population being studied. Furthermore, sampling methods can include convenience, stratified, or random sampling.
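The difference between simple random and stratified sampling can be shown in a short sketch; the population counts and age bands below are invented for illustration.

```python
import random

random.seed(0)

# Hypothetical sampling frame: 1,000 people tagged with an age band.
population = (
    [{"age_band": "18-34"} for _ in range(500)]
    + [{"age_band": "35-54"} for _ in range(300)]
    + [{"age_band": "55+"} for _ in range(200)]
)

# Simple random sampling: every person has an equal chance of selection,
# so the sample's age mix will only approximate the population's.
simple = random.sample(population, 100)

# Stratified sampling: draw from each age band in proportion to its size,
# so the sample's composition matches the population's exactly.
stratified = []
for band in ("18-34", "35-54", "55+"):
    stratum = [p for p in population if p["age_band"] == band]
    quota = round(100 * len(stratum) / len(population))
    stratified += random.sample(stratum, quota)

print(len(simple), len(stratified))  # 100 100
```

Stratifying guarantees the 50/30/20 age mix in every draw, whereas the simple random sample only matches it on average; convenience sampling, the third method named above, gives up even that guarantee in exchange for ease of recruitment.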

3. Data Collection Techniques

Descriptive research typically involves the collection of both qualitative and quantitative data through methods such as surveys, observational studies, case studies, or focus groups.

4. Data Analysis

Descriptive research data is analyzed to identify patterns, relationships, and trends within the data. Statistical techniques, such as frequency distributions and descriptive statistics, are commonly used to summarize and describe the data.
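For example, a handful of hypothetical quantitative responses can be summarized with Python's built-in statistics module (the sleep-hours data below are invented for illustration):

```python
import statistics

# Hypothetical quantitative responses: hours of sleep reported by ten
# participants in a descriptive sleep study.
hours = [6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 7.0, 8.5, 7.0, 6.5]

# The classic descriptive statistics: sample size, central tendency,
# and spread.
print("n      =", len(hours))
print("mean   =", round(statistics.mean(hours), 2))   # 6.95
print("median =", statistics.median(hours))           # 7.0
print("stdev  =", round(statistics.stdev(hours), 2))
```

These summaries describe the sample; they make no causal claim, which is exactly the boundary of descriptive analysis noted above.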

5. Focus on Description

Descriptive research is focused on describing and summarizing the characteristics of a particular population or phenomenon. It does not make causal inferences.

6. Non-Experimental

Descriptive research is non-experimental, meaning that the researcher does not manipulate variables or control conditions. The researcher simply observes and collects data on the population or phenomenon being studied.

When Can a Researcher Conduct Descriptive Research?

A researcher can conduct descriptive research in the following situations:

  • To better understand a particular population or phenomenon
  • To describe the relationships between variables
  • To describe patterns and trends
  • To validate sampling methods and determine the best approach for a study
  • To compare data from multiple sources.

Types of Descriptive Research Design

1. Survey Research

Surveys are a type of descriptive research that involves collecting data through self-administered or interviewer-administered questionnaires. They can be administered in person, by mail, or online, and can collect both qualitative and quantitative data.

2. Observational Research

Observational research involves observing and collecting data on a particular population or phenomenon without manipulating variables or controlling conditions. It can be conducted in naturalistic settings or controlled laboratory settings.

3. Case Study Research

Case study research is a type of descriptive research that focuses on a single individual, group, or event. It involves collecting detailed information on the subject through a variety of methods, including interviews, observations, and examination of documents.

4. Focus Group Research

Focus group research involves bringing together a small group of people to discuss a particular topic or product. Furthermore, the group is usually moderated by a researcher and the discussion is recorded for later analysis.

5. Ethnographic Research

Ethnographic research involves conducting detailed observations of a particular culture or community. It is often used to gain a deep understanding of the beliefs, behaviors, and practices of a particular group.

Advantages of Descriptive Research Design

1. Provides a Comprehensive Understanding

Descriptive research provides a comprehensive picture of the characteristics, behaviors, and attributes of a particular population or phenomenon, which can be useful in informing future research and policy decisions.

2. Non-invasive

Descriptive research is non-invasive and does not manipulate variables or control conditions, making it suitable for studying sensitive topics where experimental manipulation would raise ethical concerns.

3. Flexibility

Descriptive research allows for a wide range of data collection methods, including surveys, observational studies, case studies, and focus groups, making it a flexible and versatile research method.

4. Cost-effective

Descriptive research is often less expensive and less time-consuming than other research methods, making it a cost-effective option for many researchers.

5. Easy to Replicate

Descriptive research is easy to replicate, making it a reliable way to gather and compare information from multiple sources.

6. Informs Future Research

The insights gained from descriptive research can inform future research as well as policy decisions and programs.

Disadvantages of Descriptive Research Design

1. Limited Scope

Descriptive research only provides a snapshot of the current situation and cannot establish cause-and-effect relationships.

2. Dependence on Existing Data

Descriptive research relies on existing data, which may not always be comprehensive or accurate.

3. Lack of Control

Researchers have no control over the variables in descriptive research, which can limit the conclusions that can be drawn.

4. Researcher Bias

The researcher’s own biases and preconceptions can influence the interpretation of the data.

5. Lack of Generalizability

Descriptive research findings may not be applicable to other populations or situations.

6. Lack of Depth

Descriptive research provides a surface-level understanding of a phenomenon, rather than a deep understanding.

7. Time-consuming

Descriptive research often requires a large amount of data collection and analysis, which can be time-consuming and resource-intensive.

7 Ways to Avoid Common Flaws While Designing Descriptive Research


1. Clearly define the research question

A clearly defined research question is the foundation of any research study, and it is important to ensure that the question is both specific and relevant to the topic being studied.

2. Choose the appropriate research design

Choosing the appropriate research design for a study is crucial to the success of the study. Moreover, researchers should choose a design that best fits the research question and the type of data needed to answer it.

3. Select a representative sample

Selecting a representative sample is important to ensure that the findings of the study are generalizable to the population being studied. Researchers should use a sampling method that provides a random and representative sample of the population.
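
A minimal Python sketch of drawing a simple random sample from a sampling frame; the frame here is a hypothetical list of participant IDs:

```python
import random

frame = list(range(1, 501))     # 500 potential participants
rng = random.Random(42)         # seeded for reproducibility
sample = rng.sample(frame, 50)  # each ID has an equal chance of selection

print(len(sample), len(set(sample)))  # 50 50 (sampling without replacement)
```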

4. Use valid and reliable data collection methods

Using valid and reliable data collection methods is important to ensure that the data collected is accurate and can be used to answer the research question. Researchers should choose methods that are appropriate for the study and that can be administered consistently and systematically.

5. Minimize bias

Bias can significantly impact the validity and reliability of research findings.  Furthermore, it is important to minimize bias in all aspects of the study, from the selection of participants to the analysis of data.

6. Ensure adequate sample size

An adequate sample size is important to ensure that the results of the study are statistically significant and can be generalized to the population being studied.
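
The standard formula for the sample size needed to estimate a proportion, n = z²·p·(1−p) / e², can be computed directly. The defaults below assume 95% confidence (z = 1.96), the conservative choice p = 0.5, and a ±5% margin of error:

```python
import math

# n = z^2 * p * (1 - p) / e^2
# z: z-score for the confidence level, p: expected proportion,
# e: margin of error. Round up, since n must be a whole number.
def sample_size(z=1.96, p=0.5, e=0.05):
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(sample_size())  # 385 respondents for 95% confidence, +/-5% margin
```

Tightening the margin of error to ±3% (`sample_size(e=0.03)`) raises the requirement to 1068 respondents, which illustrates why precision is expensive.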

7. Use appropriate data analysis techniques

The appropriate data analysis technique depends on the type of data collected and the research question being asked. Researchers should choose techniques that are appropriate for the data and the question being asked.

Have you worked on descriptive research designs? How was your experience creating a descriptive design? What challenges did you face? Do write to us or leave a comment below and share your insights on descriptive research designs!

Descriptive Research Design – Types, Methods and Examples


Definition:

Descriptive research design is a type of research methodology that aims to describe or document the characteristics, behaviors, attitudes, opinions, or perceptions of a group or population being studied.

Descriptive research design does not attempt to establish cause-and-effect relationships between variables or make predictions about future outcomes. Instead, it focuses on providing a detailed and accurate representation of the data collected, which can be useful for generating hypotheses, exploring trends, and identifying patterns in the data.

Types of Descriptive Research Design

Types of Descriptive Research Design are as follows:

Cross-sectional Study

This involves collecting data at a single point in time from a sample or population to describe their characteristics or behaviors. For example, a researcher may conduct a cross-sectional study to investigate the prevalence of certain health conditions among a population, or to describe the attitudes and beliefs of a particular group.

Longitudinal Study

This involves collecting data over an extended period of time, often through repeated observations or surveys of the same group or population. Longitudinal studies can be used to track changes in attitudes, behaviors, or outcomes over time, or to investigate the effects of interventions or treatments.

Case Study

This involves an in-depth examination of a single individual, group, or situation to gain a detailed understanding of its characteristics or dynamics. Case studies are often used in psychology, sociology, and business to explore complex phenomena or to generate hypotheses for further research.

Survey Research

This involves collecting data from a sample or population through standardized questionnaires or interviews. Surveys can be used to describe attitudes, opinions, behaviors, or demographic characteristics of a group, and can be conducted in person, by phone, or online.

Observational Research

This involves observing and documenting the behavior or interactions of individuals or groups in a natural or controlled setting. Observational studies can be used to describe social, cultural, or environmental phenomena, or to investigate the effects of interventions or treatments.

Correlational Research

This involves examining the relationships between two or more variables to describe their patterns or associations. Correlational studies can be used to identify potential causal relationships or to explore the strength and direction of relationships between variables.

Data Analysis Methods

Descriptive research design data analysis methods depend on the type of data collected and the research question being addressed. Here are some common methods of data analysis for descriptive research:

Descriptive Statistics

This method involves analyzing data to summarize and describe the key features of a sample or population. Descriptive statistics can include measures of central tendency (e.g., mean, median, mode) and measures of variability (e.g., range, standard deviation).
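
As an illustration, the measures named above can be computed with Python's standard statistics module (the scores are invented):

```python
import statistics

scores = [72, 85, 90, 78, 85, 95, 64, 85, 88, 70]

print(statistics.mean(scores))    # 81.2  (central tendency)
print(statistics.median(scores))  # 85
print(statistics.mode(scores))    # 85
print(max(scores) - min(scores))  # 31    (range)
print(statistics.stdev(scores))   # sample standard deviation
```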

Cross-tabulation

This method involves analyzing data by creating a table that shows the frequency of two or more variables together. Cross-tabulation can help identify patterns or relationships between variables.
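
A cross-tabulation can be sketched in plain Python by counting joint category pairs; the records below are invented for illustration (in practice a library such as pandas provides `crosstab` for this):

```python
from collections import Counter

# Each record pairs two categorical variables: (gender, response).
records = [("female", "yes"), ("female", "no"), ("male", "yes"),
           ("male", "yes"), ("female", "yes"), ("male", "no")]

table = Counter(records)  # joint frequency of each (gender, response) pair

print(table[("female", "yes")])  # 2
print(table[("male", "yes")])    # 2
```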

Content Analysis

This method involves analyzing qualitative data (e.g., text, images, audio) to identify themes, patterns, or trends. Content analysis can be used to describe the characteristics of a sample or population, or to identify factors that influence attitudes or behaviors.
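
A minimal sketch of frequency-based content analysis, assuming a hypothetical keyword dictionary and invented responses:

```python
# Count how often each predefined theme's keywords occur across texts.
responses = [
    "The staff were friendly and the service was fast",
    "Service was slow but the staff were friendly",
    "Fast service, friendly staff",
]
themes = {"service speed": {"fast", "slow"},
          "staff attitude": {"friendly", "rude"}}

counts = {theme: 0 for theme in themes}
for text in responses:
    words = set(text.lower().replace(",", "").split())
    for theme, keywords in themes.items():
        counts[theme] += len(words & keywords)

print(counts["staff attitude"])  # 3: "friendly" appears in every response
```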

Qualitative Coding

This method involves analyzing qualitative data by assigning codes to segments of data based on their meaning or content. Qualitative coding can be used to identify common themes, patterns, or categories within the data.
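
Qualitative coding is normally done by human coders, but the tallying step can be sketched with a hypothetical codebook and invented interview excerpts:

```python
from collections import Counter

segments = [
    "I worry about the cost of childcare",
    "My schedule leaves no time for myself",
    "Childcare costs keep rising every year",
]
# Hypothetical codebook: each code has keyword cues that trigger it.
codebook = {"financial strain": ("cost", "costs"),
            "time pressure": ("schedule", "time")}

def code_segment(text):
    text = text.lower()
    return [code for code, cues in codebook.items()
            if any(cue in text for cue in cues)]

tallies = Counter(code for seg in segments for code in code_segment(seg))
print(tallies["financial strain"])  # 2 segments mention costs
```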

Visualization

This method involves creating graphs or charts to represent data visually. Visualization can help identify patterns or relationships between variables and make it easier to communicate findings to others.
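
As a toy illustration, a text-based bar chart can be built from frequency counts (a plotting library such as matplotlib would normally be used, but the principle is the same):

```python
# Scale each category's count to a fixed-width bar of '#' characters.
freq = {"agree": 5, "neutral": 3, "disagree": 2}

def bar_chart(freq, width=10):
    peak = max(freq.values())
    return {k: "#" * round(v / peak * width) for k, v in freq.items()}

chart = bar_chart(freq)
print(chart["agree"])     # '##########'
print(chart["disagree"])  # '####'
```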

Comparative Analysis

This method involves comparing data across different groups or time periods to identify similarities and differences. Comparative analysis can help describe changes in attitudes or behaviors over time or differences between subgroups within a population.
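
A minimal sketch of comparative analysis: computing the same descriptive measure for two groups (here, two invented time periods) and reporting the difference:

```python
import statistics

scores_2022 = [70, 75, 80, 72, 78]
scores_2023 = [74, 80, 85, 79, 82]

mean_22 = statistics.mean(scores_2022)
mean_23 = statistics.mean(scores_2023)
print(mean_23 - mean_22)  # 5.0-point increase year over year
```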

Applications of Descriptive Research Design

Descriptive research design has numerous applications in various fields. Some of the common applications of descriptive research design are:

  • Market research: Descriptive research design is widely used in market research to understand consumer preferences, behavior, and attitudes. This helps companies to develop new products and services, improve marketing strategies, and increase customer satisfaction.
  • Health research: Descriptive research design is used in health research to describe the prevalence and distribution of a disease or health condition in a population. This helps healthcare providers to develop prevention and treatment strategies.
  • Educational research: Descriptive research design is used in educational research to describe the performance of students, schools, or educational programs. This helps educators to improve teaching methods and develop effective educational programs.
  • Social science research: Descriptive research design is used in social science research to describe social phenomena such as cultural norms, values, and beliefs. This helps researchers to understand social behavior and develop effective policies.
  • Public opinion research: Descriptive research design is used in public opinion research to understand the opinions and attitudes of the general public on various issues. This helps policymakers to develop effective policies that are aligned with public opinion.
  • Environmental research: Descriptive research design is used in environmental research to describe the environmental conditions of a particular region or ecosystem. This helps policymakers and environmentalists to develop effective conservation and preservation strategies.

Descriptive Research Design Examples

Here are some real-time examples of descriptive research designs:

  • A restaurant chain wants to understand the demographics and attitudes of its customers. They conduct a survey asking customers about their age, gender, income, frequency of visits, favorite menu items, and overall satisfaction. The survey data is analyzed using descriptive statistics and cross-tabulation to describe the characteristics of their customer base.
  • A medical researcher wants to describe the prevalence and risk factors of a particular disease in a population. They conduct a cross-sectional study in which they collect data from a sample of individuals using a standardized questionnaire. The data is analyzed using descriptive statistics and cross-tabulation to identify patterns in the prevalence and risk factors of the disease.
  • An education researcher wants to describe the learning outcomes of students in a particular school district. They collect test scores from a representative sample of students in the district and use descriptive statistics to calculate the mean, median, and standard deviation of the scores. They also create visualizations such as histograms and box plots to show the distribution of scores.
  • A marketing team wants to understand the attitudes and behaviors of consumers towards a new product. They conduct a series of focus groups and use qualitative coding to identify common themes and patterns in the data. They also create visualizations such as word clouds to show the most frequently mentioned topics.
  • An environmental scientist wants to describe the biodiversity of a particular ecosystem. They conduct an observational study in which they collect data on the species and abundance of plants and animals in the ecosystem. The data is analyzed using descriptive statistics to describe the diversity and richness of the ecosystem.
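
As a sketch of the last example, species richness and the Shannon diversity index, two common descriptive measures of biodiversity, can be computed from hypothetical abundance counts:

```python
import math

# Richness = number of distinct species observed.
# Shannon index H' = -sum(p_i * ln(p_i)) summarizes richness and evenness.
abundance = {"oak": 40, "maple": 30, "birch": 20, "pine": 10}

richness = len(abundance)
total = sum(abundance.values())
shannon = -sum((n / total) * math.log(n / total) for n in abundance.values())

print(richness)           # 4
print(round(shannon, 2))  # 1.28
```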

How to Conduct Descriptive Research Design

To conduct a descriptive research design, you can follow these general steps:

  • Define your research question: Clearly define the research question or problem that you want to address. Your research question should be specific and focused to guide your data collection and analysis.
  • Choose your research method: Select the most appropriate research method for your research question. As discussed earlier, common research methods for descriptive research include surveys, case studies, observational studies, cross-sectional studies, and longitudinal studies.
  • Design your study: Plan the details of your study, including the sampling strategy, data collection methods, and data analysis plan. Determine the sample size and sampling method, decide on the data collection tools (such as questionnaires, interviews, or observations), and outline your data analysis plan.
  • Collect data: Collect data from your sample or population using the data collection tools you have chosen. Ensure that you follow ethical guidelines for research and obtain informed consent from participants.
  • Analyze data: Use appropriate statistical or qualitative analysis methods to analyze your data. As discussed earlier, common data analysis methods for descriptive research include descriptive statistics, cross-tabulation, content analysis, qualitative coding, visualization, and comparative analysis.
  • Interpret results: Interpret your findings in light of your research question and objectives. Identify patterns, trends, and relationships in the data, and describe the characteristics of your sample or population.
  • Draw conclusions and report results: Draw conclusions based on your analysis and interpretation of the data. Report your results in a clear and concise manner, using appropriate tables, graphs, or figures to present your findings. Ensure that your report follows accepted research standards and guidelines.

When to Use Descriptive Research Design

Descriptive research design is used in situations where the researcher wants to describe a population or phenomenon in detail. It is used to gather information about the current status or condition of a group or phenomenon without making any causal inferences. Descriptive research design is useful in the following situations:

  • Exploratory research: Descriptive research design is often used in exploratory research to gain an initial understanding of a phenomenon or population.
  • Identifying trends: Descriptive research design can be used to identify trends or patterns in a population, such as changes in consumer behavior or attitudes over time.
  • Market research: Descriptive research design is commonly used in market research to understand consumer preferences, behavior, and attitudes.
  • Health research: Descriptive research design is useful in health research to describe the prevalence and distribution of a disease or health condition in a population.
  • Social science research: Descriptive research design is used in social science research to describe social phenomena such as cultural norms, values, and beliefs.
  • Educational research: Descriptive research design is used in educational research to describe the performance of students, schools, or educational programs.

Purpose of Descriptive Research Design

The main purpose of descriptive research design is to describe and measure the characteristics of a population or phenomenon in a systematic and objective manner. It involves collecting data that describe the current status or condition of the population or phenomenon of interest, without manipulating or altering any variables.

The purpose of descriptive research design can be summarized as follows:

  • To provide an accurate description of a population or phenomenon: Descriptive research design aims to provide a comprehensive and accurate description of a population or phenomenon of interest. This can help researchers to develop a better understanding of the characteristics of the population or phenomenon.
  • To identify trends and patterns: Descriptive research design can help researchers to identify trends and patterns in the data, such as changes in behavior or attitudes over time. This can be useful for making predictions and developing strategies.
  • To generate hypotheses: Descriptive research design can be used to generate hypotheses or research questions that can be tested in future studies. For example, if a descriptive study finds a correlation between two variables, this could lead to the development of a hypothesis about the causal relationship between the variables.
  • To establish a baseline: Descriptive research design can establish a baseline or starting point for future research. This can be useful for comparing data from different time periods or populations.

Characteristics of Descriptive Research Design

Descriptive research design has several key characteristics that distinguish it from other research designs. Some of the main characteristics of descriptive research design are:

  • Objective : Descriptive research design is objective in nature, which means that it focuses on collecting factual and accurate data without any personal bias. The researcher aims to report the data objectively without any personal interpretation.
  • Non-experimental: Descriptive research design is non-experimental, which means that the researcher does not manipulate any variables. The researcher simply observes and records the behavior or characteristics of the population or phenomenon of interest.
  • Quantitative : Descriptive research design is quantitative in nature, which means that it involves collecting numerical data that can be analyzed using statistical techniques. This helps to provide a more precise and accurate description of the population or phenomenon.
  • Cross-sectional: Descriptive research design is often cross-sectional, which means that the data is collected at a single point in time. This can be useful for understanding the current state of the population or phenomenon, but it may not provide information about changes over time.
  • Large sample size: Descriptive research design typically involves a large sample size, which helps to ensure that the data is representative of the population of interest. A large sample size also helps to increase the reliability and validity of the data.
  • Systematic and structured: Descriptive research design involves a systematic and structured approach to data collection, which helps to ensure that the data is accurate and reliable. This involves using standardized procedures for data collection, such as surveys, questionnaires, or observation checklists.

Advantages of Descriptive Research Design

Descriptive research design has several advantages that make it a popular choice for researchers. Some of the main advantages of descriptive research design are:

  • Provides an accurate description: Descriptive research design is focused on accurately describing the characteristics of a population or phenomenon. This can help researchers to develop a better understanding of the subject of interest.
  • Easy to conduct: Descriptive research design is relatively easy to conduct and requires minimal resources compared to other research designs. It can be conducted quickly and efficiently, and data can be collected through surveys, questionnaires, or observations.
  • Useful for generating hypotheses: Descriptive research design can be used to generate hypotheses or research questions that can be tested in future studies. For example, if a descriptive study finds a correlation between two variables, this could lead to the development of a hypothesis about the causal relationship between the variables.
  • Large sample size : Descriptive research design typically involves a large sample size, which helps to ensure that the data is representative of the population of interest. A large sample size also helps to increase the reliability and validity of the data.
  • Can be used to monitor changes : Descriptive research design can be used to monitor changes over time in a population or phenomenon. This can be useful for identifying trends and patterns, and for making predictions about future behavior or attitudes.
  • Can be used in a variety of fields : Descriptive research design can be used in a variety of fields, including social sciences, healthcare, business, and education.

Limitation of Descriptive Research Design

Descriptive research design also has some limitations that researchers should consider before using this design. Some of the main limitations of descriptive research design are:

  • Cannot establish cause and effect: Descriptive research design cannot establish cause and effect relationships between variables. It only provides a description of the characteristics of the population or phenomenon of interest.
  • Limited generalizability: The results of a descriptive study may not be generalizable to other populations or situations. This is because descriptive research design often involves a specific sample or situation, which may not be representative of the broader population.
  • Potential for bias: Descriptive research design can be subject to bias, particularly if the researcher is not objective in their data collection or interpretation. This can lead to inaccurate or incomplete descriptions of the population or phenomenon of interest.
  • Limited depth: Descriptive research design may provide a superficial description of the population or phenomenon of interest. It does not delve into the underlying causes or mechanisms behind the observed behavior or characteristics.
  • Limited utility for theory development: Descriptive research design may not be useful for developing theories about the relationship between variables. It only provides a description of the variables themselves.
  • Relies on self-report data: Descriptive research design often relies on self-report data, such as surveys or questionnaires. This type of data may be subject to biases, such as social desirability bias or recall bias.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer

Descriptive Research: Definition, Methods & Examples

  • August 19, 2021

Voxco’s Descriptive Research guide helps uncover the how, when, what, and where questions in a research problem


Suppose you are the manager of a convenience store and you have to prepare a report. Findings such as which product sells the most, what time of day the store is busiest, or which products customers request most often are all examples of descriptive research.

It is often the first step of any research project, since the data you gather sets the stage for the research question: it helps you identify and frame the problem before exploring it fully.

In this blog, we’ll discuss the characteristics, types, pros & cons, and three ways to conduct this research type to help you in your next market research.

What is descriptive research?

Descriptive research refers to the research method that describes the characteristics of the variables you are studying. It focuses on answering the “what” of the research question rather than the “why”: the goal is to describe the nature of the demographics under study, not to explain why they look the way they do.

It is called an observational research method as none of the variables in the study are influenced during the research process.

For example, let’s assume that a UK-based brand is trying to establish itself in New York and wants to understand the demographics of the buyers who generally purchase from brands similar to it. 

In descriptive research, the information gathered from the survey focuses only on the population’s demographics. It will uncover the buying patterns of different age cohorts in New York, but it will not study why those patterns exist. The brand wants to understand the buying behavior of the population, not why such associations exist.

This is a form of quantitative market research or social research, typically conducted as survey research using quantitative variables on market research or social research software.


What are the characteristics of descriptive research?

Among the many, the following are the main characteristics of this research type:

  • Quantitative research
  • Nature of variables
  • Cross-sectional studies
  • Directs future research

Let’s discuss these four characteristics in detail. 

1. Quantitative research:

Descriptive research is quantitative: it collects data and analyzes it statistically. It allows a researcher to gather measurable information and describe the demographics of a sample with the help of statistical analysis. Thus, it is a quantitative research method.

2. Nature of variables:

The variables included in this research are uncontrolled. They are not manipulated in any way. Descriptive research mostly uses observational methods; thus, the researcher cannot control the nature and behavior of the variables under study.

3. Cross-sectional studies:

In this research type, different sections of the same group are studied. For instance, in order to study the fashion preferences of New Yorkers, the researcher can study Gen Z as well as Millennials from the same population in New York.

4. Directs future research:

Since this research identifies patterns between variables and describes them, researchers can build on the data collected here to investigate why such patterns exist and how the variables are associated. Hence, it points researchers toward insightful market research.

What are the methods of conducting descriptive research?

Primarily, there are three descriptive research methods: 

  • Observation,
  • Survey, &
  • Case study.

We have explained below how you can conduct this research type in each of these three ways. Each method helps gather descriptive data and sets the scene for thorough research.


1. Observational method

All research has some component of observation, and this observation can be quantitative or qualitative. A quantitative observation involves objectively collecting data that is primarily in numerical form. 

The data collected should be related to or understood in terms of quantity.

Quantitative observations are analyzed with the help of survey analytics software . 

Examples of quantitative observations include observation of any variable related to a numerical value such as age, shape, weight, height, scale, etc.

For example, a researcher can understand a customer’s satisfaction with their recent purchases by asking them to rate their satisfaction on a Likert scale ranging from 1 (extremely unsatisfied) to 7 (extremely satisfied).
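As a sketch, summarizing such Likert-scale ratings needs nothing more than a tally and an average; the responses below are hypothetical:

```python
from collections import Counter
from statistics import mean

# Hypothetical Likert-scale responses, 1 (extremely unsatisfied) to 7 (extremely satisfied)
responses = [7, 5, 6, 4, 7, 3, 6, 5, 7, 6]

# Descriptive research reports the distribution as-is: a tally and an average,
# with no attempt to explain *why* satisfaction takes these values.
tally = Counter(responses)
average = mean(responses)
top_box = sum(1 for r in responses if r >= 6) / len(responses)  # share rating 6 or 7

print(f"Tally: {dict(sorted(tally.items()))}")
print(f"Average satisfaction: {average:.1f}")
print(f"Top-box share (6-7): {top_box:.0%}")
```

Reporting the tally, the mean, and a top-box share is exactly the kind of single-variable description this method produces.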

Qualitative observations monitor the characteristics of a phenomenon and do not involve numerical measurements.

Using this type of descriptive research, you can observe respondents in a naturalistic environment from a distance. Since the respondents are in a natural environment, the observed characteristics are richer and offer more insight. 

For instance, you can observe customers in a supermarket and note down their selection and purchasing patterns. This offers a detailed picture of the customer.

In any kind of research, you should ensure high survey response rates for improved quality of insights.  

2. Survey method

The survey method includes recording the answers of respondents through surveys or questionnaires. Surveys can include polls as well. They are the most common tool for collecting market research data. 

Surveys are generally used to collect feedback from respondents. A good survey taps into both open-ended and closed-ended questions .

The biggest advantage of the survey method is that it can be conducted using online or offline survey tools . One of the reasons why the survey method is the go-to option for descriptive research is that it entails the collection of large amounts of data in a limited span of time.

3. Case study method

The in-depth study of an individual or a group is known as a case study. Case studies usually lead to developing a hypothesis to explore a phenomenon further. Case studies are limited in their scope in that they don’t allow the researcher to make cause-effect conclusions or accurate predictions. 

This is because these associations could reflect bias on the researchers’ part rather than a naturally occurring phenomenon. Another reason case studies are limited in scope is that they may simply reflect an atypical respondent in the survey. 

An atypical respondent refers to someone who is different from the average consumer, and if researchers make judgments about the entire target population based on this consumer, it can affect the external validity of the study.


What are the types of descriptive research?

There are seven types of descriptive research, based on when you conduct them and what type of data you collect. We have explained these seven types in brief, with examples, to help you better understand them.

1. Cross-sectional: 

A descriptive method of studying a particular section of the target population at a specific point in time. 

Example: Tracking the use of social media by Gen Z in the Netherlands. 

2. Longitudinal: 

This type of descriptive study is conducted for an extended period on a group of people. 

Example: Monitoring changes in the volume of cyber-bullying among Millennials from 2022 to 2024. 

3. Normative: 

In this descriptive method, we compare the result of a study with an existing norm. 

Example: Comparing legal verdicts in similar types of cases. 

4. Relational/Correlational:

We investigate the type of relationships (correlation) between two variables in this type of descriptive research. 

Example: Investigating the relationship between video games and mental health. 

5. Comparative: 

A descriptive study that compares two or more people, groups, or conditions based on a specific aspect. 

Example: Comparing the salary of two employees in similar job roles from two companies. 

6. Classification: 

This type of research arranges collected data into classes based on specific criteria to analyze them. 

Example: Classification of customers based on their buying behavior. 

7. Archival: 

A descriptive study where you search for past records and extract information.

Example: Tracking a company’s sales data over a decade. 

We have been discussing the descriptive method with examples. So now let’s see how you can use this research type in a real-world application.


Examples of Descriptive Research in Market Research


This research type helps you gather the necessary information you need to understand the problem. It sets the scene to conduct further research. But how can you use this research method in the real world? 

We have explained its real-world application in three scenarios to help you determine where and when you want to use this research type. 

1. Sales Studies

You can use this research type to analyze the potential of the market, what is currently trending in the market, and which products may perform well in terms of sales. You can also study what circumstances influence the market shares and when they are likely to increase or decrease. 

This research type can help you gather the demographic data of the consumers.

2. Consumer Perception and Behavior Studies

You can use this research method to analyze what consumers think about the brand. You can evaluate their perceptions about the products sold by a particular brand and the uses of other competitive products. 

Using descriptive research, you can also analyze what advertising strategies have worked to increase the positive perceptions of the brand. You can assess consumers’ consumption behavior and how it is influenced by product pricing.

3. Market Characteristics Studies

Another way you can use this research method is by analyzing the distribution of the products in the market. You can gather contextual data on questions such as “which countries have more sales” or “which countries have fewer products but the product sells out quickly”, etc. 

You can also analyze the brand management of competitors; what strategy is working for them and what is not.

What are the applications of descriptive research?

This research method is used for a variety of reasons. Even after outlining survey goals and designs and collecting information through surveys, there is no guarantee that the research you conduct will meet the predictions you have made. 

Here are some popular ways in which organizations use this research type:

1. Defining the characteristics of respondents

Since most descriptive research methods use closed-ended questions for data collection, they help in drawing objective conclusions about the respondents.

It helps in deriving patterns, traits, and behaviors of respondents. It also aims to understand respondents’ attitudes and opinions about certain phenomena.

For instance, researchers can understand how many hours young adults spend on the internet, their opinions about social media platforms, and how important they consider these platforms to be. This information will help the company make informed decisions regarding its products and brands. 


2. Analyzing trends in data

You can use statistical data analysis to understand the trends in data over time. 

For instance, consider an apparel company that drops a new line of clothing; they may research how Gen Z and Millennials react to the new launch. If they discover that the new range of clothes has worked effectively for one group (Gen Z) but not the other, the company may stop producing clothes for the other group.
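A minimal sketch of such a group-level trend summary, using hypothetical purchase counts for the two cohorts:

```python
# Hypothetical purchase counts per customer for a new clothing line, split by cohort.
# A descriptive trend analysis simply summarizes each group; it does not test why they differ.
purchases = {
    "Gen Z":       [3, 4, 5, 4, 6, 5],
    "Millennials": [1, 0, 2, 1, 1, 0],
}

for cohort, counts in purchases.items():
    avg = sum(counts) / len(counts)
    print(f"{cohort}: {avg:.2f} items per customer on average")
```

A gap between the two averages is the kind of pattern that would prompt the follow-up ("why?") research discussed earlier.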


3. Comparing different groups

Closely related to the previous point is comparing different groups of customers based on their demographics. With descriptive research, you can study how different groups of people respond to specific services offered by a company. 

For instance, how do income, age, gender, etc. influence the spending behaviors of consumers?

This research method helps companies understand what they should do to increase their brand appeal in different groups of the population. 

4. Validating existing patterns of respondents

Since it is non-invasive and mostly makes use of quantitative data, you can observe and validate the purchasing patterns that currently exist among customers. 

You can also use the findings as the basis of a more in-depth study in the future. 

5. Conducting research at different times

Descriptive research can be conducted at different periods of time in order to see whether the patterns are similar or dissimilar at different points in time. You can also replicate the studies to verify the findings of the original study to draw accurate conclusions.

6. Finding correlations among variables

This method is also used to measure the degree of association between variables. 

For instance, suppose the focus is on men’s age and their expenditure on sports products. 

The study may find a negative correlation between the two variables, indicating that as men’s age increases, they spend less on sports products.
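The degree of association can be quantified with the Pearson correlation coefficient. The sketch below uses hypothetical age and spending figures:

```python
from math import sqrt

# Hypothetical figures: men's age vs. monthly spend on sports products.
ages  = [20, 25, 30, 35, 40, 45]
spend = [90, 80, 70, 55, 50, 40]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(ages, spend)
print(f"r = {r:.2f}")  # a value near -1 means spending falls as age rises
```

A strongly negative r describes the association; deciding *why* it exists would require a different study design.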


Descriptive Research Examples

A descriptive method of research aims to gather answers for how, what, when, and where. 

Let’s use some examples to understand how a descriptive method of research is used. 

Before investing in housing at any location, you would want to conduct your own research to understand 

  • How is the market changing?
  • When or at what time of year is it changing?
  • Where would you make more profit?

This type of research is an example of a descriptive study. 

A company studies the behavior of its customers to identify its target market before it launches a new product. This is another use case of how brands use descriptive research. 

The company may conduct this research by observing the customer’s reaction and behavior toward a competitor’s product. 

Or, they can conduct surveys to ask customers their opinions of the new product before its launch. 

A restaurant planning to open a branch in a new locality will research to understand the behavior of the people living there. They will survey the people to know their choice of flavor, taste, foods, drinks, and more. 

Now that we’ve seen how you can use this research method for your research purpose, let’s also see the advantages & disadvantages of the research.

What Are the Advantages of Descriptive Research?

Descriptive research is often the preliminary research method: most researchers use it to discover the problem they should prioritize. Before diving into experiments, let's look at some of the reasons why you should conduct this research. 

1. Primary data collection

In descriptive research, data is collected through primary data collection methods such as case studies, observational methods, and surveys. This kind of data collection provides rich information that can be used for future research as well. It can also be used to develop hypotheses or define your research objective.

2. Multiple data collection

Descriptive research can also be conducted by collecting qualitative or quantitative data . Hence, it is more varied, flexible, and diverse and tends to be thorough and elaborate.


3. Observation of natural behavior 

The observational method of this research allows researchers to observe respondents' behavior in natural settings, which tends to yield honest, high-quality data.

4. Cost-effective

Descriptive research is cost-effective, and its data collection can be done quickly.

What Are the Disadvantages of Descriptive Research?

Descriptive research also has some disadvantages. Let’s learn about these cons so you can wisely decide when you should use this research to keep the disadvantages to a minimum. 

1. Misleading information

Respondents can give misleading or incorrect responses if they feel that the questions touch on intimate matters. Respondents can also be affected by the observer's presence and may alter their behavior or pretend. This is known as the observer effect.

2. Biases in studies

The researchers’ own opinions or biases may affect the results of the study. This is known as the experimenter effect.

3. Representativeness issue 

There is also the problem of data representativeness. It occurs when a case study or the data of a small sample does not adequately represent the whole population.

4. Limited scope

Descriptive research has a limited scope: it only analyzes the “what” of a research question and does not evaluate the “why” or “how”.


Wrapping up

That sums up our descriptive research guide. Descriptive research is a broad concept that demands a conceptual framework for descriptive design and a thorough understanding of descriptive survey design . 


This research method enables you to explain and describe the characteristics of a target population. The descriptive research method helps you uncover deeper insights into various aspects of the target population, such as who, what, when, where, and how. 

There are many data collection methods you can use to collect descriptive research data. For example, you can perform the research via surveys (online, phone, or offline), case studies, observations, and archival research.

Here are some key characteristics of this research methodology: 

  • It helps you describe the characteristics, behavior, opinions, and perspectives of the population or research subject. 
  • The data gathered from descriptive research is a reliable and comprehensive source of explanation of the research subject. 
  • The researcher focuses on observing and reporting the natural relationship between the variables; there is no manipulation of variables or establishing of a cause-and-effect relationship.

Descriptive research offers many advantages: 

  • Descriptive research methods are simple and easy to design and conduct; they require less research expertise than more complex research designs. 
  • This research method is more cost-effective than other research methodologies, particularly experimental research designs. 
  • It enables you to collect both qualitative and quantitative data, which helps extract valuable insights and supports further root-cause analysis.

Descriptive research methodology also has some limitations: 

  • The insights it generates may be specific to the population under study, which limits your ability to generalize the results to a wider population and makes the data less representative. 
  • The data collection approaches and observation biases can introduce bias into the research, which can negatively impact the accuracy and reliability of the findings.


Perspect Clin Res, v.10(1), Jan-Mar 2019

Study designs: Part 2 – Descriptive studies

Rakesh Aggarwal

Department of Gastroenterology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, Uttar Pradesh, India

Priya Ranganathan

1 Department of Anaesthesiology, Tata Memorial Centre, Mumbai, Maharashtra, India

One of the first steps in planning a research study is the choice of study design. The available study designs are divided broadly into two types – observational and interventional. Of the various observational study designs, the descriptive design is the simplest. It allows the researcher to study and describe the distribution of one or more variables, without regard to any causal or other hypotheses. This article discusses the subtypes of descriptive study design, and their strengths and limitations.

INTRODUCTION

In our previous article in this series,[ 1 ] we introduced the concept of “study designs”– as “the set of methods and procedures used to collect and analyze data on variables specified in a particular research question.” Study designs are primarily of two types – observational and interventional, with the former being loosely divided into “descriptive” and “analytical.” In this article, we discuss the descriptive study designs.

WHAT IS A DESCRIPTIVE STUDY?

A descriptive study is one that is designed to describe the distribution of one or more variables, without regard to any causal or other hypothesis.

TYPES OF DESCRIPTIVE STUDIES

Descriptive studies can be of several types, namely, case reports, case series, cross-sectional studies, and ecological studies. In the first three of these, data are collected on individuals, whereas the last one uses aggregated data for groups.

Case reports and case series

A case report refers to the description of a patient with an unusual disease or with simultaneous occurrence of more than one condition. A case series is similar, except that it is an aggregation of multiple (often only a few) similar cases. Many case reports and case series are anecdotal and of limited value. However, some of these bring to the fore a hitherto unrecognized disease and play an important role in advancing medical science. For instance, HIV/AIDS was first recognized through a case report of disseminated Kaposi's sarcoma in a young homosexual man,[ 2 ] and a case series of such men with Pneumocystis carinii pneumonia.[ 3 ]

In other cases, description of a chance observation may open an entirely new line of investigation. Some examples include: fatal disseminated Bacillus Calmette–Guérin infection in a baby born to a mother taking infliximab for Crohn's disease suggesting that administration of infliximab may bring about reactivation of tuberculosis,[ 4 ] progressive multifocal leukoencephalopathy following natalizumab treatment – describing a new adverse effect of drugs that target cell adhesion molecule α4-integrin,[ 5 ] and demonstration of a tumor caused by invasive transformed cancer cells from a colonizing tapeworm in an HIV-infected person.[ 6 ]

Cross-sectional studies

Studies with a cross-sectional study design involve the collection of information on the presence or level of one or more variables of interest (health-related characteristic), whether exposure (e.g., a risk factor) or outcome (e.g., a disease) as they exist in a defined population at one particular time. If these data are analyzed only to determine the distribution of one or more variables, these are “descriptive.” However, often, in a cross-sectional study, the investigator also assesses the relationship between the presence of an exposure and that of an outcome. Such cross-sectional studies are referred to as “analytical” and will be discussed in the next article in this series.

Cross-sectional studies can be thought of as providing a “snapshot” of the frequency and characteristics of a disease in a population at a particular point in time. These are very good for measuring the prevalence of a disease or of a risk factor in a population. Thus, these are very helpful in assessing the disease burden and healthcare needs.

Let us look at a study that was aimed to assess the prevalence of myopia among Indian children.[ 7 ] In this study, trained health workers visited schools in Delhi and tested visual acuity in all children studying in classes 1–9. Of the 9884 children screened, 1297 (13.1%) had myopia (defined as spherical refractive error of −0.50 diopters (D) or worse in either or both eyes), and the mean myopic error was −1.86 ± 1.4 D. Furthermore, overall, 322 (3.3%), 247 (2.5%) and 3 children had mild, moderate, and severe visual impairment, respectively. These parts of the study looked at the prevalence and degree of myopia or of visual impairment, and did not assess the relationship of one variable with another or test a causative hypothesis – these qualify as a descriptive cross-sectional study. These data would be helpful to a health planner to assess the need for a school eye health program, and to know the proportion of children in her jurisdiction who would need corrective glasses.
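The prevalence figure quoted above is simple arithmetic over the reported counts:

```python
# Counts reported in the myopia study: 1297 of 9884 screened children had myopia.
screened = 9884
myopic = 1297

# Prevalence in a descriptive cross-sectional study is just the proportion
# of the sample with the condition at the time of the survey.
prevalence = myopic / screened
print(f"Prevalence of myopia: {prevalence:.1%}")  # 13.1%, matching the paper
```

The same proportion, computed over subgroups (mild, moderate, severe), yields the 3.3% and 2.5% visual-impairment figures in the study.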

The authors did, subsequently in the paper, look at the relationship of myopia (an outcome) with children's age, gender, socioeconomic status, type of school, mother's education, etc. (each of which qualifies as an exposure). Those parts of the paper look at the relationship between different variables and thus qualify as having “analytical” cross-sectional design.

Sometimes, cross-sectional studies are repeated after a time interval in the same population (using the same subjects as were included in the initial study, or a fresh sample) to identify temporal trends in the occurrence of one or more variables, and to determine the incidence of a disease (i.e., number of new cases) or its natural history. Indeed, the investigators in the myopia study above visited the same children and reassessed them a year later. This separate follow-up study[ 8 ] showed that “new” myopia had developed in 3.4% of children (incidence rate), with a mean change of −1.09 ± 0.55 D. Among those with myopia at the time of the initial survey, 49.2% showed progression of myopia with a mean change of −0.27 ± 0.42 D.

Cross-sectional studies are usually simple to do and inexpensive. Furthermore, these usually do not pose much of a challenge from an ethics viewpoint.

However, this design does carry a risk of bias, i.e., the results of the study may not represent the true situation in the population. This could arise from either selection bias or measurement bias. The former relates to differences between the population and the sample studied. The myopia study included only those children who attended school, and the prevalence of myopia could have been different in those who did not attend school (e.g., those with severe myopia may not be able to see the blackboard and hence may have been more likely to drop out of school). The measurement bias in this study would relate to the accuracy of measurement and the cutoff used. If the investigators had used a cutoff of −0.25 D (instead of −0.50 D) to define myopia, the prevalence would have been higher. Furthermore, if the measurements were not done accurately, some cases with myopia could have been missed, or vice versa, affecting the study results.
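The sensitivity of prevalence to the chosen cutoff can be illustrated with a quick sketch; the refractive errors below are invented purely for illustration:

```python
# Invented spherical refractive errors (in diopters) for ten children, used only to
# illustrate how the cutoff defining "myopia" changes the measured prevalence.
errors = [-0.25, -0.75, -1.50, 0.00, -0.30, -2.25, -0.10, -0.50, 0.25, -0.40]

def prevalence(cutoff):
    """Share of children whose refractive error is at or below (worse than) the cutoff."""
    return sum(1 for e in errors if e <= cutoff) / len(errors)

print(f"Cutoff -0.50 D: {prevalence(-0.50):.0%}")  # stricter definition
print(f"Cutoff -0.25 D: {prevalence(-0.25):.0%}")  # looser cutoff => higher prevalence
```

Relaxing the cutoff always classifies at least as many children as myopic, which is why the choice of definition must be reported alongside the prevalence itself.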

Ecological studies

Ecological (also sometimes called correlational) study design involves looking for association between an exposure and an outcome across populations rather than in individuals. For instance, a study in the United States found a relation between household firearm ownership in various states and the firearm death rates during the period 2007–2010.[ 9 ] Thus, in this study, the unit of assessment was a state and not an individual.

These studies are convenient to do since the data have often already been collected and are available from a reliable source. This design is particularly useful when the differences in exposure between individuals within a group are much smaller than the differences in exposure between groups. For instance, the intake of particular food items is likely to vary less between people in a particular group but can vary widely across groups, for example, people living in different countries.

However, the ecological study design has some important limitations. First, an association between exposure and outcome at the group level may not be true at the individual level (a phenomenon also referred to as “ecological fallacy”).[ 10 ] Second, the association may be related to a third factor which in turn is related to both the exposure and the outcome, the so-called “confounding”. For instance, an ecological association between higher income level and greater cardiovascular mortality across countries may be related to a higher prevalence of obesity. Third, migration of people between regions with different exposure levels may also introduce an error. A fourth consideration may be the use of differing definitions for exposure, outcome or both in different populations.

Descriptive studies, irrespective of the subtype, are often very easy to conduct. For case reports, case series, and ecological studies, the data are already available. For cross-sectional studies, the data can be easily collected (usually in one encounter). Thus, these study designs are often inexpensive and quick, and require little effort. Furthermore, these studies often do not face serious ethics scrutiny, except when the information collected is of a confidential nature (e.g., sexual practices, substance use, etc.).

Descriptive studies are useful for estimating the burden of disease (e.g., prevalence or incidence) in a population. This information is useful for resource planning. For instance, information on prevalence of cataract in a city may help the government decide on the appropriate number of ophthalmologic facilities. Data from descriptive studies done in different populations or done at different times in the same population may help identify geographic variation and temporal change in the frequency of disease. This may help generate hypotheses regarding the cause of the disease, which can then be verified using another, more complex design.
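A prevalence estimate from a cross-sectional sample is just a proportion, usually reported with a confidence interval. The sketch below uses the standard normal-approximation (Wald) interval, which is a textbook formula rather than anything specific to this article, and invented survey numbers.

```python
# Estimating disease burden from a cross-sectional sample (sketch):
# point prevalence with a normal-approximation (Wald) 95% confidence interval.

def prevalence_ci(cases, n, z=1.96):
    """Return (prevalence, lower bound, upper bound), clipped to [0, 1]."""
    p = cases / n
    half_width = z * (p * (1 - p) / n) ** 0.5
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical survey: 120 people with cataract among 1,000 examined.
p, lo, hi = prevalence_ci(120, 1000)
print(p, lo, hi)   # roughly 0.12 (0.10 to 0.14)
```

An estimate like "12% (95% CI 10–14%)" is the kind of figure a government could use when deciding how many ophthalmologic facilities a city needs.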

DISADVANTAGES

As with other study designs, descriptive studies have their own pitfalls. Case reports and case series refer to a solitary patient or to only a few cases, who may represent a chance occurrence. Conclusions based on them therefore run the risk of being non-representative, and hence unreliable. In cross-sectional studies, the validity of the results depends heavily on whether the study sample is representative of the population to be studied, and on whether all individual measurements were made using an accurate and identical tool. If the information on a variable cannot be obtained accurately, for instance in a study where the participants are asked about socially unacceptable (e.g., promiscuity) or illegal (e.g., substance use) behavior, the results are unlikely to be reliable.

Financial support and sponsorship

Conflicts of interest

There are no conflicts of interest.


Foundations of Clinical Research: Applications to Practice, 3e

Chapter 14:  Descriptive Research


Descriptive research is designed to document the factors that describe characteristics, behaviors and conditions of individuals and groups. For example, researchers have used this approach to describe a sample of individuals with spinal cord injuries with respect to gender, age, and cause and severity of injury to see whether these properties were similar to those described in the past. 1 Descriptive studies have documented the biomechanical parameters of wheelchair propulsion, 2 and the clinical characteristics of stroke. 3 As our diagram of the continuum of research shows, descriptive and exploratory elements are commonly combined, depending on how the investigator conceptualizes the research question.

Descriptive studies document the nature of existing phenomena and describe how variables change over time. They will generally be structured around a set of guiding questions or research objectives to generate data or characterize a situation of interest. Often this information can be used as a basis for formulation of research hypotheses that can be tested using exploratory or experimental techniques. The descriptive data supply the foundation for classifying individuals, for identifying relevant variables, and for asking new research questions.

Descriptive studies may involve prospective or retrospective data collection, and may be designed using longitudinal or cross-sectional methods (see Chapter 13 ). Surveys and secondary analysis of clinical databases are often used as sources of data for descriptive analysis. Several types of research can be categorized as descriptive, including developmental research, normative research, qualitative research and case studies. The purpose of this chapter is to describe these approaches.

Developmental Research

Concepts of human development, whether they are related to cognition, perceptual-motor control, communication, physiological change, or psychological processes, are important elements of a clinical knowledge base. Valid interpretation of clinical outcomes depends on our ability to develop a clear picture of those we treat, their characteristics and performance expectations under different conditions. Developmental research involves the description of developmental change and the sequencing of behaviors in people over time. Developmental studies have contributed to the theoretical foundations of clinical practice in many ways. For example, the classic descriptive studies of Gesell and Amatruda 4 and McGraw 5 provide the basis for much of the research on sequencing of motor development in infants and children. Erikson's studies of life span development have contributed to an understanding of psychological growth through old age. 6


Video description: A comprehensive survey of deep learning approaches

  • Open access
  • Published: 11 April 2023
  • Volume 56 , pages 13293–13372, ( 2023 )


  • Ghazala Rafiq 1 ,
  • Muhammad Rafiq   ORCID: orcid.org/0000-0001-6713-8766 2 &
  • Gyu Sang Choi 1  


Video description refers to understanding visual content and transforming that acquired understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing in conjunction with real-time and practical applications. Deep learning-based approaches employed for video description have demonstrated enhanced results compared to conventional approaches. The current literature lacks a thorough interpretation of the recently developed and employed sequence-to-sequence techniques for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture employing a specific composition of CNN, RNN, or the variants LSTM or GRU as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on specific distinctive features to achieve high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. The transformer mechanism is a modern and efficient transductive architecture for robust output. Free from recurrence, and solely based on self-attention, it allows parallelization along with training on a massive amount of data, and can fully utilize the available GPUs for most NLP tasks. Recently, with the emergence of several versions of transformers, long-term dependency handling is no longer an issue for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes; they can find promising directions in this survey.


1 Introduction

The global-village phenomenon is strengthening day by day. Technological advancements, the abundance of devices, automation, and the widespread availability of the internet have connected people like never before. People exchange texts, images, and videos to communicate, resulting in a massive amount of textual and visual data. These copiously available videos, linked with accurate processing, can help address numerous real-world challenges across many disciplines. Human beings possess the inherent capability to understand visual content and language intricacies; for machines to be comparably capable, a proper understanding of images and their consequent interpretation is essential for content description. The primary objective of video description is to provide a concise and accurate textual alternative to visual content. Researchers put considerable effort into understanding visual characteristics and generating an eloquent interpretation, i.e., a video description that is a blend of vision and language encompassing the two prominent domains of CV and NLP (Bhatt et al. 2017). Scientists from both areas have worked together on getting appropriate insights from images and videos, and then accurately and precisely interpreting them by considering all the elements appearing in the video frames, such as the objects, actions, interactions, backgrounds, and overlapping scenes with localization information and, most importantly, their temporal sequence. Table 1 lists the abbreviations used, with their full forms.

The importance of video description is evident from its practical, real-time applications, including efficient searching and indexing of videos on the internet, human-robot interaction in industrial zones, and facilitation of autonomous vehicle driving; video descriptions can also outline procedures in instructional/tutorial videos for industry, education, and the household (e.g., recipes). The visually impaired can gain useful information from a video that incorporates audio descriptions. Long surveillance videos can be transformed into short texts for quick previews. Sign language videos can be converted into natural language descriptions. Automatic, accurate, and precise video/movie subtitling is another important and practical application of the video description task.

1.1 Classical approach

Video description research began with the classical approach (Rohrbach et al. 2013; Kojima et al. 2002; Khan and Gotoh 2012; Barbu et al. 2012; Das et al. 2013; Hakeem et al. 2004), in which the subject, verb, and object (SVO) were identified in a constrained-domain video and then fitted into a standard predefined template. These classical methods were effective only for short video clips with a limited number of objects and minimal interactions. For semantic verification, Das et al. (2013) developed a hybrid model addressing the issues in Khan and Gotoh (2012) and Barbu et al. (2012), combining the best aspects of bottom-up and top-down exploitation of the rich semantic spaces of both visual and textual features; it produced high-relevance content beyond simple keyword annotations. SVO tuple-based methods can be split into two phases to improve the performance of a video captioning system: phase I, content identification, and phase II, sentence generation for the objects/events/actions identified in phase I. Identification (phase I) methods include edge detection/color matching (Kojima et al. 2002), the Scale-Invariant Feature Transform (SIFT) (Lowe 1999), and context-based object recognition (Torralba et al. 2003), whereas for sentence generation there are the HALogen representation (Langkilde-Geary and Knight 2002) and Head-driven Phrase Structure Grammar (HPSG) (Levine and Meurers 2006).
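The template-filling idea can be sketched in a few lines: once a subject-verb-object tuple has been identified (phase I), it is slotted into a fixed template (phase II). The tuple and template below are invented for illustration and are not from any cited paper.

```python
# Sketch of template-based caption generation: a detected SVO tuple is
# inserted into a fixed, predefined sentence template. This is why such
# methods cannot produce variable-length or semantically rich captions.

TEMPLATE = "A {subject} is {verb} a {object}."

def caption_from_svo(svo):
    subject, verb, obj = svo
    return TEMPLATE.format(subject=subject, verb=verb, object=obj)

print(caption_from_svo(("person", "riding", "bicycle")))
# A person is riding a bicycle.
```

Every caption this produces has the same length and structure, which mirrors the limitation discussed above: grammatical correctness is guaranteed, expressiveness is not.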

Figure 1: Hierarchical structure of this paper.

The methods adopted for image/video captioning can be divided into two broad categories: retrieval-based and template-based approaches. In retrieval-based methods, captions are retrieved from a set of existing captions: candidate captions are first found by locating visually similar frames (with their provided captions) in the training dataset, and the most appropriate caption is then selected from the candidates. Although retrieval-based captions are grammatically correct, generating frame- or video-specific captions this way is very challenging. Template-based approaches use fixed templates with blank slots for the generated caption's subject, verb, and object. These methods also generate grammatically correct captions but cannot produce variable-length captions, because their dependence on fixed, predefined templates prevents them from generating semantically rich natural language sentences; hence, their output is not analogous to human annotations.
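A minimal retrieval-based sketch: the query frame's feature vector is compared against training features by cosine similarity, and the caption attached to the nearest neighbour is reused. The feature vectors and captions below are made up for illustration.

```python
# Retrieval-based captioning sketch: reuse the caption of the most visually
# similar training example, measured by cosine similarity of feature vectors.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

training_set = [            # (toy feature vector, attached caption)
    ([1.0, 0.0, 0.2], "a dog runs on grass"),
    ([0.1, 1.0, 0.9], "a man rides a horse"),
    ([0.0, 0.2, 1.0], "a boat sails at sea"),
]

def retrieve_caption(query_features):
    best = max(training_set, key=lambda item: cosine(query_features, item[0]))
    return best[1]

print(retrieve_caption([0.9, 0.1, 0.1]))   # a dog runs on grass
```

The retrieved sentence is always grammatical (a human wrote it), but if no training frame resembles the query, the caption cannot be specific to the new video, which is exactly the weakness noted above.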

1.2 Video captioning

The deep learning models employed for video description tasks primarily follow the Encoder–Decoder structure, the most productive sequence-to-sequence modeling technique. Describing a video can itself be defined as a sequence-to-sequence task, since it takes a sequence of visual representations as input and produces a sequence of generated words as output. The ED architecture gained considerable attention in earlier research on neural machine translation, where it was used for text translation from one language to another. The task of describing videos can be partitioned into two major sections: the visual model, for understanding visual content correctly (without missing any information), and the language model, for transforming learned visual information into grammatically correct natural language sentences. Since computers only understand numbers, arrays, and matrices, the learned visual representations are stored as a *context vector*, a collection of numbers communicating visual information to the language model. The language model then extracts the connotation of each context vector and accordingly generates semantically aligned words one by one. Represented mathematically, the language model establishes the probability of generating word \(w_t\) at time \(t\) conditioned on the previously generated words \(w_1, \dots, w_{t-1}\) of the preceding time steps, i.e., \(P(w_t \mid w_1, \dots, w_{t-1})\), where \(w_{i}\) is the word generated at time step \(i\).
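The word-by-word generation process \(P(w_t \mid w_1, \dots, w_{t-1})\) can be illustrated with a toy model. Here a hand-made bigram table (conditioning only on the previous word, a simplification of the full conditional) stands in for the trained language model, and decoding is greedy; all probabilities are invented.

```python
# Toy illustration of conditional word generation: at each step the word
# maximizing P(w | previous word) is emitted, until <EOS> is produced.

BIGRAMS = {                      # invented probabilities, for illustration only
    "<BOS>": {"a": 0.7, "the": 0.3},
    "a": {"man": 0.6, "dog": 0.4},
    "man": {"walks": 0.8, "<EOS>": 0.2},
    "dog": {"runs": 0.9, "<EOS>": 0.1},
    "walks": {"<EOS>": 1.0},
    "runs": {"<EOS>": 1.0},
}

def greedy_generate(max_len=10):
    words, prev = [], "<BOS>"
    for _ in range(max_len):
        prev = max(BIGRAMS[prev], key=BIGRAMS[prev].get)  # argmax of P(w | prev)
        if prev == "<EOS>":
            break
        words.append(prev)
    return " ".join(words)

print(greedy_generate())   # a man walks
```

In a real decoder the probability table is replaced by a neural network conditioned on the context vector and the full generated prefix, but the step-by-step sampling loop has the same shape.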

Figure 2 demonstrates the basic deep learning-based model employing visual and language models for video description. Following the ED architecture for video description, the standard ED structure employs a combination of the convolutional neural network, the recurrent neural network, or the variants LSTM or GRU as encoder and decoder blocks. RNNs have demonstrated comparable results for sequential data processing, but for long sequences their implementation is unattractive: the associated vanishing and exploding gradient problems, as well as the recurrent dependence of each step on previous-step computations, hinder parallel processing of the sequence and degrade overall performance. To upgrade the performance of the standard ED architecture, it can be equipped with an attention mechanism, reinforcement learning, or a transformer mechanism. Attention mechanisms focus on specific areas of the frame and achieve high-quality results. RL employed within the ED architecture can progressively deliver state-of-the-art captions through its own agent-environment interactions. The transformer mechanism is an efficient architecture for robust output: it contains no convolution or recurrence and is built solely on self-attention. The transformer allows parallelization along with training on a massive amount of data, with the capability to fully utilize the available GPUs for most machine learning tasks; its parallel processing capability drastically reduces training time and enables efficient, highly accurate model training. Recently, with the emergence of several versions of transformers, long-term dependency handling is no longer an issue.

Figure 2: The basic model for video description (dense video captioning) examines a long video comprising multiple scenes or events, i.e., Event 1, Event 2, up until the last identified event. After localization (identification of start and end times) of each event, the paragraph-like (multi-sentence) description is generated by coherently combining the captions generated for each event, catering to concurrent and overlapping events.

1.3 Dense video captioning/ video description

Comprehending the localized events of a video appropriately and then transforming them accurately into a textual format is called dense video captioning, or simply, video description. This task of describing complex and diverse visual perceptions establishes a connection between the two world-leading realms of computer vision and natural language processing. Capturing the scenes, objects, and activities in a video, as well as the spatial-temporal relationships and the temporal order, is crucial for precise and grammatically correct multi-line text narration.

Nevertheless, the task of automatically describing video is challenging. The model employed to generate a caption characterizing a long-duration video, or a short clip consisting of a significant number of frames, requires not only an understanding of sequential visual data but also the capability to provide a syntactically and semantically accurate translation of that understanding into natural language. The numerous objects, events, and actions in the video, their interactions and relationships, and the order in which they happen must all be captured accurately and explained properly using natural sentences. Whether they belong to an open or a constrained domain, videos mostly contain numerous scenes or events. The dependencies between the events are captured by using contextual information from previous (past) and coming (future) events, and all events are then jointly described using natural language. This is analogous to dense image captioning, which describes regions in space after localization; similarly, with the help of a transformer, Yuan et al. (2022) transform 2D images into 3D objects with color- and texture-aware information for dense captioning. Dense video captioning (Krishna et al. 2017) localizes events in time and afterwards expresses them. These events can intersect with other events and hence are challenging to describe appropriately. Dense video captions capture details of event localization and their co-occurrence (Aafaq et al. 2022).

Terminologies associated with video description have specific implications. Keeping current research in mind, the task of video captioning can be divided into two sections: mono-sentence caption generation and multi-sentence (paragraph) caption generation. A mono-sentence caption is supposed to be a precise yet fully informative, abstractive sentence representing the whole video, whereas a multi-sentence (dense) caption is supposed to temporally localize and describe all events in the video, including intersecting and overlapping events. Here, event localization refers to identification of each event in the video with its start and end times; event description means expressing each localized event in much more detail, resulting in multiple sentences or paragraphs (like a dense summary of the whole video). Generating such fine-grained captions requires a mechanism that is both expressive and subtle: its purpose is to capture the temporal dynamics of the visuals present in the video and join them with syntactically and semantically correct natural language representations.

Problem setup: video captioning/description

For video captioning (single-sentence): suppose we have a video \(V\) containing \(N\) frames, \(V=\{f_{1},f_{2},\dots,f_{N}\}\) (\(f\) representing a frame), and our aim is to generate a single-sentence textual caption \(T\) comprising \(n\) words, \(T=\{w_{1},w_{2},\dots,w_{n}\}\) (\(w\) representing a word), representing the video content. Semantically aligned words are generated one by one, each conditioned on the previously generated words: at time \(t\), word \(w_t\) is generated with probability \(P(w_t \mid w_1, \dots, w_{t-1})\), where \(w_{i}\) represents the word generated at time step \(i\).

For video description (dense captioning): particular to videos containing multiple scenes or events, event localization (Krishna et al. 2017) is the identification of the start and end times of each event in the video. These localized events must be comprehended semantically and transformed into precise, grammatically correct, multi-sentence natural language explanations. For a video \(V\) containing \(N\) events, \(V=\{E_{1},E_{2},\dots,E_{N}\}\) (\(E\) representing an event), each event needs to be identified as \(E_{1}=\{EST, w_{1},w_{2},\dots,w_{A}, EET\}\), with event start time (EST) and event end time (EET); a certain number of words \(A\) expresses event \(E_{1}\), and similarly the localized \(E_{2}=\{EST, w_{1},w_{2},\dots,w_{B}, EET\}\) has a certain number of words \(B\) to express event \(E_{2}\), and so on, until all events in the video are understood. Each event can be expressed with a different number of words (\(A\), \(B\), ...) depending on its duration. The aim is to gather all localized event descriptions and generate a semantically and grammatically correct, coherent, paragraph-like description of the video, avoiding redundancy.
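The dense-captioning output structure described above can be sketched as plain data: each localized event carries a start time, an end time, and its own words, and the final description joins the per-event sentences in temporal order. The events below are invented for illustration.

```python
# Sketch of the dense-captioning problem setup: localized events
# E_i = {EST, w_1..w_k, EET} are combined into a paragraph-like description.

events = [
    {"start": 12.0, "end": 20.5, "words": ["the", "chef", "chops", "onions"]},
    {"start": 0.0, "end": 14.0, "words": ["a", "chef", "enters", "the", "kitchen"]},
]

def describe(events):
    ordered = sorted(events, key=lambda e: e["start"])   # temporal order by EST
    sentences = [" ".join(e["words"]).capitalize() + "." for e in ordered]
    return " ".join(sentences)

print(describe(events))
# A chef enters the kitchen. The chef chops onions.
```

Note that the two hypothetical events overlap in time (12.0–14.0 s), which is precisely the concurrent/overlapping-event case a dense captioner must handle; a real system would also need to avoid redundancy between the sentences.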

This survey aims to present inclusive insights into the deep learning-based techniques implemented for video description, supported by the most recent research. During the past few years, the field of image and video captioning has exhibited remarkable success and achieved impressive state-of-the-art results, but the current literature lacks a thorough discussion of the techniques and methodologies adopted over time. The key motivation behind this work is to fill that gap and give researchers a clear understanding of the employed approaches. Our contributions are as follows.

We provide an elaborate view of the latest deep learning-based techniques for video description, with up-to-date supporting articles from the literature.

Besides the standard ED architecture, a detailed exploration of deep RL, attention mechanisms, and transformer mechanisms for video descriptions is performed.

We categorize and compare the key components of the models, and the substantially crucial information is highlighted for in-depth insights and quick understanding, making it expedient for researchers who are involved in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes, to find the state of the art in a single go.

Finally, we identify future research directions for further improvement in video description systems.

Outline of the survey. This paper is organized as shown in Figure 1. Section 2 offers a brief discussion of the available surveys on the topic; these surveys primarily focus on simple Encoder–Decoder based models. Section 3 presents the deep learning based techniques employed for video description in detail. First, the standard encoder–decoder architecture employing CNN-RNN, RNN-RNN, and CNN-CNN compositions is explored, followed by a thorough discussion. Second, we describe the fusion of attention mechanisms into the encoder–decoder system, enabling video captioning models to focus on specific distinctive features. Third, we present recent transformer-based state-of-the-art methods and analyze them for video description generation. Finally, successful strategies for optimizing the generated descriptions through deep reinforcement learning are discussed in detail. The limitations and challenges of each technique are also presented, along with its working strategy, computational concept, and literature review, in the respective subsections. In Section 4, we analyze and compare the benchmark results produced by state-of-the-art methods, segregated chronologically by dataset; a brief overview of the evaluation metrics and datasets used for video description is also provided there. Finally, Section 5 concludes the review with some future directions.

2 Literature review

Computer vision mainly deals with classification, detection, and segmentation tasks (Rafiq et al. 2020; Agyeman et al. 2021). The first part of video captioning, i.e., temporal action recognition, belongs solely to computer vision, whereas the second part (caption generation) bridges computer vision and natural language processing. Captioning itself splits into two types: one for actions that are simple to recognize and describe, and the other for actions too complex to be described with simple, short natural language sentences.

The selection of appropriate components plays a substantial role in generating accurate and truthful output. A thorough empirical analysis of each component in the ED framework was presented in Aafaq et al. (2019b). Significant performance gains were demonstrated by careful selection of efficient and capable mechanisms for four major constituent components: feature extraction, feature transformation, word embedding, and language modeling. The authors emphasized which efficient mechanisms can be adopted for these four factors and how to generate state-of-the-art results. For feature extraction, five different CNN models (3D-CNN, VGG-16, VGG-19, Inception-v3, and Inception-ResNet-v2) were analyzed; the authors concluded that C3D is a common choice because of its ability to process both individual frames and short video clips, while among the 2D-CNN models, Inception-ResNet-v2 performed best. For feature transformation, temporal encoding was favored over mean pooling, since mean pooling results in a considerable loss of information, whereas temporal encoding can capture highly reliable temporal dynamics of the whole video without noteworthy loss of information, providing a positive counterbalance for system performance. In the literature, two methods are commonly used for word embedding: randomly initializing the embedding vector and computing a task-specific embedding, which cannot capture rich semantics, or using pre-trained embeddings. The authors examined four pre-trained embeddings (Word2Vec, FastText, and the GloVe variants glove6B and glove840B) as well as randomly initialized embedding; FastText, with operative word embedding, performed prominently. Finally, in language modeling, the depth (number of layers) of the system is crucial for superior performance, along with various hyperparameters, e.g., internal state size, the number of processed frames, fine-tuned word embedding, and dropout regularization.
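Why mean pooling loses information is easy to show concretely: averaging per-frame features yields the same vector regardless of frame order, so any temporal dynamics are discarded. The toy 3-dimensional frame features below are invented for illustration.

```python
# Mean pooling of per-frame features discards temporal order: a clip and its
# time-reversed copy pool to the identical vector.

def mean_pool(frames):
    """Average a list of equal-length feature vectors dimension by dimension."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]

forward = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reversed_clip = forward[::-1]

print(mean_pool(forward) == mean_pool(reversed_clip))   # True: order is lost
```

A temporal encoding, by contrast, keeps (or encodes) the frame order, which is why Aafaq et al. (2019b) favored it over mean pooling.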

This research work features deep learning-based frameworks for video description, the ED framework in particular. It is clear from Table 2 that all available surveys on video description have focused primarily on simple ED-based frameworks. Several among them, notably Li et al. (2019a), Chen et al. (2019b), Aafaq et al. (2019c), Amaresh and Chitrakala (2019), Su (2018), and Wu (2017), briefly discussed the application of an attention mechanism, and Li et al. (2019a), Chen et al. (2019b), and Aafaq et al. (2019c) gave only an overview of reinforcement learning within the encoder–decoder, but none of them elaborated on these architectures and the related literature in detail, or explored the employment of the transformer mechanism for video captioning. In this survey, all four approaches are described in detail, with state-of-the-art articles proving their worth.

To take full advantage of advanced state-of-the-art hardware, i.e., GPUs, it is essential to adopt models/mechanisms that can fully exploit these hardware structures. The sequential nature of RNNs cannot utilize the parallelization offered by GPUs, resulting in inferior performance and slow training. The transformer proposes an efficient alternative to recurrence and convolution: capable of parallel processing, accelerated training, and handling long-term dependencies, it is space-efficient, much faster, solely self-attention-based, and the model of choice for current advanced hardware.
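The self-attention operation that replaces recurrence can be written compactly. The sketch below is a minimal scaled dot-product attention in plain Python, with toy query/key/value vectors and no learned projection matrices, so it illustrates the mechanism rather than a full transformer layer.

```python
# Minimal scaled dot-product attention: each query attends to all keys at
# once (no recurrence), and the output is a softmax-weighted sum of values.
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    d_k = len(keys[0])
    out = []
    for q_vec in queries:
        scores = [sum(qi * ki for qi, ki in zip(q_vec, k)) / math.sqrt(d_k)
                  for k in keys]                       # scaled dot products
        weights = softmax(scores)                      # attention distribution
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])   # weighted sum of values
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, k, v))   # the weight mass favours the first key
```

Because every query-key score can be computed independently, the whole operation maps onto parallel hardware, unlike an RNN step that must wait for the previous step to finish.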

3 Techniques/approaches

Inspired by technological advancements, researchers have experimented with deep neural networks for the automatic caption generation task. The early frameworks comprised the standard ED structure, but with methodical progress, new high-tech approaches have been fused with the standard structure to produce more expressive and flexible natural language sentences with richer semantics. In this paper, we classify the adopted techniques into four categories according to their technological evolution over time: the standard ED approach, the fusion of attention mechanisms into the standard ED structure, the adoption of the transformer mechanism for robust performance, and decision-based DRL approaches, which are prominent in accurate natural language caption generation and optimization. We discuss these techniques one by one in detail in this section.

Figure 3: The standard Encoder–Decoder architecture.

3.1 Standard encoder–decoder approaches

The ED approach is a neural network configuration, as shown in Figure 3. The architecture is partitioned into two components, the encoder and the decoder, and has proven to be a cutting-edge technology. It has been employed by the research community around the globe to solve sophisticated tasks, e.g., image captioning, video description, text and video summarization (Rafiq et al. 2020), visual question-answering systems, conversational modeling, and movement classification.

The ED framework comprises two neural networks (NNs): an encoder, which maps the input sequence to an internal representation R, and a decoder, which generates the output sequence from that representation.

The vector R produced by the encoder is an internal representation that captures the context and meaning of the input, and is known as a *context vector* or *thought vector*. The choice of encoder structure mainly depends on the type of *input*: for text, the best encoder architecture is the RNN, while for an image/frame or video clip as input, the CNN structure has proved best suited for context-vector/visual-feature extraction (Xu et al. 2015). However, deliberation regarding CNN versus RNN selection (Yin et al. 2017), and their behavioral differences for NLP, is ongoing among researchers. The fusion of these two architectures has accomplished outstanding results, since they process information through different techniques and complement one another.

The context vector R generated by the encoder is input to the second neural network in the system, i.e., the *decoder*, which generates the corresponding output. Selection of the decoder architecture depends on the type of *output*. In the video description task, where meaningful textual information is required as output from the input video, the RNN is the architecture mostly employed. RNN variants like long short-term memory (LSTM) and the gated recurrent unit (GRU) are popular in natural language processing research because of their ability to handle long-term dependencies. The decoder RNN \(\theta\) functionality at any given time \(t\) can be written as

\((O_t, h_t) = \theta (O_{t-1}, h_{t-1}),\)

where \(O_t\) represents the output at time t, and \(h_t\) is the internal/hidden state of the RNN, whereas \(h_{(t-1)}\) and \(O_{(t-1)}\) represent the hidden state and the output of the previous time step, \((t-1)\). The RNN operates repeatedly until the end-of-sequence \(\langle EOS\rangle\) token is generated. LSTMs and GRUs, with improved performance, replace the basic RNN structure.
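The decoder recurrence described above can be sketched in a few lines of NumPy. This is a minimal, illustrative version only: the tanh nonlinearity, the specific weight matrices, and the greedy argmax word selection are assumptions for demonstration, not the exact decoder of any cited system.

```python
import numpy as np

def rnn_decoder_step(h_prev, o_prev_embed, W_h, W_o, W_out):
    # One step of the recurrence: new hidden state h_t from (h_{t-1}, O_{t-1}),
    # then output logits over the vocabulary from h_t.
    h_t = np.tanh(W_h @ h_prev + W_o @ o_prev_embed)
    logits = W_out @ h_t
    return logits, h_t

def greedy_decode(context_vec, embed, W_h, W_o, W_out, eos_id, max_len=20):
    """Run the decoder until the <EOS> token (or max_len) is reached."""
    h = context_vec.copy()       # hidden state initialized with context vector R
    o_embed = embed[0]           # embedding of an assumed <BOS> token (id 0)
    tokens = []
    for _ in range(max_len):
        logits, h = rnn_decoder_step(h, o_embed, W_h, W_o, W_out)
        tok = int(np.argmax(logits))   # greedy word choice (illustrative)
        tokens.append(tok)
        if tok == eos_id:
            break
        o_embed = embed[tok]     # feed the generated word back in
    return tokens
```

In a real system the step function would be an LSTM or GRU cell and the word would typically be sampled or beam-searched rather than taken greedily.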

Figure 4

Composition of the standard ED architecture for video description based on the literature explored for this research work

Specific to video description, the encoder can be treated as the visual model for the system, whereas the decoder is responsible for language modeling. Two-dimensional (2D) or 3D convolutional neural networks are mostly used as the encoder for computing a context vector of fixed or variable length. The context vector can be called a vector representation or a visual feature. After extraction, certain transformations are applied to these visual features, e.g., mean/max pooling or temporal encoding. The resultant transformed visual features are then fed into the language model for description generation. The ED framework has been the most popular paradigm for video description tasks in recent years, so the authors in Aafaq et al. (2019b) partitioned the ED structure for video description into four essential components: a CNN model for visual feature extraction, the transformations applied to the extracted visual features, the language model, and the word embedding within the language model. Since each of these components contributes substantially to system performance, intelligent selection is essential; by keeping in mind the pros and cons of each selected component, one can straightforwardly estimate the overall performance of the description system. Blohm et al. (2018) explored the behavioral variance between CNNs and RNNs using the MovieQA dataset, with 11 models trained from different random initializations of both an RNN-LSTM and a CNN, and observed that RNN-LSTM models outperformed CNN models by a large margin, although both share the same weaknesses. To probe these weaknesses, they tested the transferability of adversarial examples across models, optimizing adversarial examples on the CNN models to fool the RNN models, and vice versa. Degradation in performance was observed for both CNNs and RNNs, and was remedied by including some adversarial examples in the training data.
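As a rough illustration of the feature transformations mentioned above, the following sketch applies mean pooling and a naive chunk-wise temporal encoding to per-frame CNN features. The function names and the chunking scheme are illustrative assumptions; real systems use richer temporal encodings.

```python
import numpy as np

def mean_pool_features(frame_features):
    """Collapse per-frame CNN features of shape (T, D) into one context vector (D,).
    Simple, but discards temporal order entirely."""
    return frame_features.mean(axis=0)

def temporal_encode(frame_features, n_chunks=4):
    """Naive temporal encoding: mean-pool within consecutive chunks of frames,
    then concatenate, preserving coarse temporal order (a stand-in for the
    richer temporal encodings used in the literature)."""
    chunks = np.array_split(frame_features, n_chunks, axis=0)
    return np.concatenate([c.mean(axis=0) for c in chunks])
```

Mean pooling yields a single fixed-size vector regardless of video length, while the chunked variant trades a larger vector for coarse temporal information.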

Three compositions of encoder and decoder for video description available in the literature (CNN-RNN, RNN-RNN, and CNN-CNN) are summarized in Table 3 for convenience, along with their visual and language components, contributions, and shortcomings (if any). Figure 4 shows these compositions as percentages. In recent research, transformers have also been exploited as the visual or language component of the ED structure: Seo et al. (2022) employed ViViT (video vision transformer) (Arnab et al. 2021) and BERT as the encoder with a GPT-2-based decoder, and Zhao et al. (2022) used an encoder composed of transformer encoder blocks to extract video features in a global view, reducing the loss of intermediate hidden-layer information.

3.1.1 CNN–RNN

The conventional ED pipeline typically comprises a CNN as a visual model for extracting visual features from each frame of the video, employing an RNN as a language model for generating the captions word by word. VSJM-Net (Aafaq et al. 2022) presented a visual and semantic joint embedding network employed to detect proposals as well as to learn the visual and semantic space. vc-HRNAT (Gao et al. 2022), using hierarchical representations, is capable of learning in a self-supervised environment with multi-level semantic representation learning of video concepts. However, the system lacks the ability to visualize concepts of objects and actions that are absent or unclear in videos. VNS-GRU (Chen et al. 2020), a semantic GRU model with variational dropout and layer normalization, is trained using professional learning. For feature generation, the system utilizes ResNeXt-101 pre-trained on ImageNet (Deng et al. 2009) at the frame level, and an Efficient Convolutional Network (ECN) (Zolfaghari et al. 2018) pre-trained on Kinetics-400 at the video level. The model can learn unique words and delicate grammar based on vocabulary and tagging mechanisms. Similarly, a system comprising 2D and 3D ConvNets with a semantic detection network (SDN) as the encoder and a semantic-assisted LSTM as the decoder was proposed in Chen et al. (2019a) to overcome the limitations of short and inappropriate descriptions, poor training approaches, and the non-availability of critical semantic features. Static spatial as well as dynamic spatio-temporal features are involved, along with a scheduled sampling strategy for self-learning of long sentences. A proposed sentence-length-modulated loss ensures optimization as well as thorough and detailed captions. To enhance the visual encoding mechanism for captioning, GRU-EVE (Aafaq et al. 2019a) was the first to emphasize feature encoding for semantically robust descriptions, using a 2D/3D CNN with the short Fourier transform as a visual model and a two-layered GRU as a language model for capturing spatio-temporal video dynamics. A 2D-CNN (InceptionResNetv2; Szegedy et al. 2017) pre-trained on the ImageNet dataset and a 3D-CNN (C3D; Tran et al. 2015) pre-trained on the Sports-1M dataset (Karpathy et al. 2014) are used for feature extraction. The extracted features are then processed hierarchically with the short Fourier transform, and the visual features are semantically enriched. The approach showed that applying the short Fourier transform to 2D-CNN features produces better results than the 3D-CNN. Feature extraction techniques play a significant role in the generation of an accurate caption. Both static and dynamic feature extraction were explored in SEmantic Feature Learning and Attention-Based Caption Generation (SeFLA) (Lee and Kim 2018). The paper suggested a multi-modal feature learning system with an attention mechanism. This research explains the prominence of semantics acquired using an LSTM, along with broad-spectrum visual features extracted using a ResNet CNN, for generating accurate descriptions. Semantics were further categorized as static or dynamic, where static (a noun in the description) refers to the object, person, and background, whereas dynamic (a verb in the description) corresponds to the action taking place within the input video, as shown in Figure 5.

Figure 5

Semantic feature categorization as either static or dynamic, where static refers to the object, the person, and/or the background, and dynamic corresponds to the action taking place within the input video. Sample video frames and reference captions were taken from the Microsoft Video Description (MSVD) dataset

Systems that combine multiple independently trained models from different domains in a pipeline fashion, focusing only on the input and output and skipping all the intermediate steps, are called end-to-end systems. In video description, we have visual and language models for vision and language processing; if we train them independently and then plug them into a pipeline, they form an end-to-end system. The first end-to-end trainable deep RNN (Zhang et al. 2017) proposed a description model employing the Caffe CNN (Jia et al. 2014), a variant of AlexNet, fused with a two-layered LSTM accompanied by transfer learning, forming the ED to describe videos in an efficient fashion. The model is trained on the popular ImageNet dataset, and the trained weights are utilized for initialization of the LSTM-based language model, boosting training speed. Feature extraction, aggregation, and caption generation are all steps in the process that require memory for computation and evaluation. Limitations associated with memory requirements while generating captions are addressed in EtENet-IRv2 (Olivastri 2019), also an end-to-end trainable ED architecture, which proposes a gradient-accumulation strategy employing Inception-ResNet-v2 (Szegedy et al. 2017) and GoogLeNet (Szegedy et al. 2015) with two-stage training for encoding. Evaluation on benchmark datasets (Rafiq et al. 2021) showed significant improvement, but with a limitation on the computational resources required for end-to-end training.

Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA) (Pan et al. 2017) emphasizes the fusion of jointly exploited semantic attributes for both images and video, along with the significance of injecting them into the extracted visual features for automatic sentence generation. A transfer unit modeling the jointly associated attributes extracted from images and videos was proposed for integrating semantic attributes into sequence learning. The visual model output, accompanied by semantic attributes mined from both images and video, is fed into an LSTM for caption generation. Similarly, ResNet50 and VGG-16 CNN architectures coupled with the LSTM structure were exploited in Rivera-Soto and Ordóñez (2013) for sequence-to-sequence video description models. Three types of models were proposed: mean pool, a single-layer ED, and a stacked ED. Extensive experimentation on the Microsoft Video Description (MSVD) dataset showed that a single-layer ED network performs best for machine translation but complicates network convergence for video description, whereas two stacked LSTM networks concentrate efficiently on both visual encoding and natural language decoding.

Both global and local features play roles while captioning a video. Object-aware aggregation with a bidirectional temporal graph-based (OA-BTG) description model (Zhang et al. 2019a ) captures in-depth temporal dynamics for significant objects in a video, and learns particular spatio-temporal representations by performing object-aware local feature aggregation on the detected object-aware regions and frames. A bi-directional graph is designed to capture both forward and backward temporal trajectories of a specific object. For learning certain representations, the global frame sequence and object spatio-temporal trajectories are aggregated. The influence of objects at a particular time is differentiated using a hierarchical attention mechanism. Understanding the global contents of a video, as well as the in-depth object information, is essential for the generation of flawless and fine-grained automatic captions. Likewise, RecNet (Wang et al. 2018a ) (a novel ED-reconstructor architecture) also exploits the phenomenon of the global and local structure of the video by employing two types of reconstructors and bi-directional flow. The relationship between video frames and generated natural sentences is established and enhanced by incorporating a reconstruction network for video captioning. Global structure is captured by mean pooling, while the attention mechanism is included in the local part of the model to exploit local temporal dynamics for the reconstruction of each frame.

CVC (Yan et al. 2010) proposed a system using the ED approach to describe numerous characteristics of an off-site audience's crowd, such as the number of people in the crowd, the movement conditions, and the flow direction. The model employs a 2D/3D CNN for crowd feature extraction from video, which then feeds into an LSTM-GRU-based language model for captioning. The authors created their own crowd captioning dataset based on WorldExpo10. Built on the famous S2VT model, the CVC model showed improvement owing to the small dataset and simple captions. To deal with the uncertainties of inappropriate data-driven static fusion methods employed in video description systems, TDDF (Zhang et al. 2017) established a task-driven dynamic fusion method. VGG-19 and the GoogLeNet CNN were employed for extraction of appearance features, whereas C3D was utilized for motion feature extraction. The proposed method achieved the best METEOR and CIDEr scores when evaluated on the MSVD and Microsoft Research Video to Text (MSR-VTT) datasets, compared to single-feature systems.

One of the significant characteristics required in a generated description is its diversity. Lexical-FCN (Shen et al. 2017 ) was proposed for generation of multiple diverse and expressive captions based on weak video-level sentence annotations. Although the model is trained with a weakly supervised signal, it produces multiple diverse and meaningful captions with the sequence-to-sequence language model. A convolution-based lexical FCN forms the visual part of the model, whereas the language model follows the state-of-the-art S2VT (Venugopalan et al. 2015 ) mechanism with a bi-directional LSTM to improve the quality of automatically generated captions. Diversity, coherence, and informativeness of the generated captions ensure the supremacy of the proposed model.

3.1.2 RNN–RNN

In early research, employing an RNN for both encoding and decoding in neural machine translation demonstrated very efficient performance, and researchers explored the horizons of video description by exploiting the RNN for both feature extraction and language modeling. Long-term recurrent convolutional networks (LRCNs) (Donahue et al. 2017) were proposed with an ED architecture for long sequences with time-varying input and output. Video description is carried out using three variants of the architecture: an LSTM encoder and decoder with a conditional random field (CRF) max, an LSTM decoder with a CRF max, and an LSTM decoder with CRF probabilities. For broader scope, the research focuses on activity recognition, image captioning, and video description.

A state-of-the-art sequence-to-sequence video-to-text generator, S2VT (Venugopalan et al. 2015), following the ED architecture, uses a stacked two-layer LSTM ED model that takes a sequence of RGB frames as input and produces a sequence of words corresponding to the input sequence. The encoding and decoding of the frame and word representations are learned jointly from a parallel corpus. To model the temporal aspects of activities typically shown in videos, optical flow (Brox et al. 2014) between pairs of consecutive frames is computed. The flow images pass through a CNN and are provided as input to the encoding LSTM. Employing a single LSTM for both encoding and decoding allows parameter sharing between the two stages. Sequential processing at both stages is incorporated because input and output are of variable, potentially different, lengths. Loss is computed on the decoding side for optimization of the video description system. The model was taken as a basis by many researchers; for example, S2VT with knowledge (S2VTK) (Wang and Song 2017) follows a detect, fetch, and combine approach. It first detects an object in the video, fetches object-related information from the DBpedia knowledge base, and creates a vector using Doc2Vec. Both elements, i.e., the extracted visual features and the related information regarding the detected object, are then input to the LSTM-based language model for caption generation. Another model based on S2VT (Venugopalan et al. 2015), a meaning-guided system (Babariya and Tamaki 2020), was proposed in connection with the object detection module YOLOv3 (Redmon and Farhadi 2018) to generate correct captions having a similar meaning. The proposed model picks the object having the highest objectness score in the YOLO detector and, after detection, searches for the nearest string describing the detected object. Word2Vec (Demeester et al. 2016), pre-trained on part of the Google News dataset, is used for string embedding. Semantic similarity, or caption meaning, is considered for optimization of the training instead of the conventional word-by-word loss. Following the object detection approach, tube features for video description were proposed in Zhao et al. (2018). Trajectories of objects in input videos are captured employing a Faster R-CNN (Wallach 2017) to extract region proposals; afterwards, the regions from different frames (but belonging to the same objects) are associated as tubes. A similarity graph is created among the detected bounding boxes, and a similarity score is assigned to each pair of bounding boxes in adjacent frames. A bi-directional LSTM encoder encodes both forward and backward dynamic information of the tubes and converts each tube into a fixed-size visual vector, whereas a single LSTM decoder, with an attention mechanism to monitor the most correlated tubes, generates the captions.
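The encode-then-decode scheduling of an S2VT-style stacked LSTM can be sketched as an input layout over a single shared sequence: during encoding the first layer sees frames and the second sees padding, and during decoding the first layer sees padding while the second sees words. The helper name and pad convention below are illustrative assumptions, not the original implementation.

```python
import numpy as np  # stdlib-style import kept for consistency with other sketches

def s2vt_io_layout(n_frames, caption_ids, pad_id=0):
    """Build the per-time-step input layout used by S2VT-style models:
    one stacked LSTM first processes frames (no words), then words (no frames),
    so encoding and decoding share the same parameters."""
    total_steps = n_frames + len(caption_ids)
    # Layer 1: frames for the first n_frames steps, then padding.
    frame_slots = [t < n_frames for t in range(total_steps)]
    # Layer 2: padding while frames are read, then the caption tokens.
    word_slots = [pad_id] * n_frames + list(caption_ids)
    return frame_slots, word_slots
```

The payoff of this layout is that the loss is computed only over the word-producing steps, while the same two-layer LSTM weights serve both phases.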

Dealing with multiple and diverse caption generation, the Diverse Captioning Model (DCM) (Xiao and Shi 2019) is a conditional Generative Adversarial Network (GAN) with an ED model to describe video content with multiple descriptions. It can describe video content with great accuracy, and can capture both forward and backward temporal relationships to encode the extracted visual features. For a given video, the intermediate latent variables of the conventional encoder–decoder process are utilized as input to the conditional GAN (CGAN) to generate diverse sentences. Generators comprising different CNNs produce diverse descriptions, while the discriminator assesses the quality of the generated captions. Combining the reasonableness of and differences between the generated sentences, a diverse captioning evaluation (DCE) metric was also proposed.

Feature extraction from pre-trained models, and its sensible arrangement, can considerably affect the quality of generated captions. These extracted features, or modalities, were identified and their effects discussed in detail in Hammad et al. (2019), building on S2VT (Venugopalan et al. 2015). The different video modalities can be recognized as the frame or image, the scene, the action, and the audio (Ramanishka et al. 2016). Each modality has its own significance in generating the description; the inclusion of essential features, accompanied by a decoder with an attention mechanism, helps the model extract the most pertinent information related to the scene and can substantially improve the quality of the generated description. A human-like ability to extract the most relevant information from a scene can be incorporated through intelligent selection and accurate concatenation of features.

3.1.3 CNN–CNN

TDConvED (Chen et al. 2019b) was the first and (so far) the only ED approach fully employing CNNs for both visual and language modeling. To address the limitations of vanishing/exploding gradients, as well as the recurrent dependency of the RNN that prevents parallelization during sequence training, a system with convolutions in both the encoder and the decoder was proposed. Feed-forward convolutional networks are free from recurrent functions, and previous-step computations are not required at the next step, so parallelization of sequence training can be achieved. The proposed model also exploits a temporal attention mechanism for sentence generation. In the encoder, the convolutional block is provided with temporal deformable convolutions to capture dynamics in the temporal extents of actions or scenes. The significant contribution of this research is the use of convolutions for sequence-to-sequence learning, enhancing the quality of video captioning.
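The parallelism argument can be illustrated with a causal 1-D convolution: the output at position t depends only on inputs up to t, not on previous outputs, so all training positions can be computed independently. The NumPy sketch below is illustrative only; it is not TDConvED's actual temporal deformable block.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution over a sequence x of shape (T, D_in) with
    kernel w of shape (k, D_in, D_out). The output at position t depends only
    on inputs at positions <= t, so, unlike an RNN recurrence, every position
    can be computed in parallel during training."""
    k = w.shape[0]
    T = x.shape[0]
    pad = np.zeros((k - 1, x.shape[1]))
    xp = np.vstack([pad, x])  # left-pad so no future information leaks in
    # Each output row contracts a k-long input window against the kernel.
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(T)])
    return out
```

The loop here is only for clarity; a real implementation would use a framework's batched convolution so that all T positions are computed in one parallel call.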

3.2 Discussion: ED-based approaches

The famous Encoder–Decoder structure for video description configures two neural networks, one for visual information extraction and the other for generating the textual narration corresponding to the visual content. These compositions involve CNNs, RNNs, LSTMs, GRUs, and transformers as encoding and decoding modules. CNNs are proficient in the automatic identification of relevant features without human intervention (Zhang et al. 2019b). According to Goodfellow et al. (2016), the key characteristics of CNNs are sparse interactions, equivariant representations, and parameter sharing. Scanning regions instead of the whole image results in fewer parameters, a simplified and faster training process, and enhanced generalization capability that helps avoid overfitting (Alzubaidi et al. 2021). RNNs are applied mostly in speech and language processing contexts. They use sequential data to convey information while respecting the order of the sequence, offering recurrent connections to memory blocks in the network, with the flow of information controlled through gated units. The sensitivity of this algorithm to exploding and vanishing gradients is its main limitation when dealing with long-range dependencies. In comparison with RNNs, CNNs are considered more powerful, owing to the lower feature compatibility of RNNs (Alzubaidi et al. 2021). The RNN variants LSTM and GRU are further enhanced to use fewer training parameters and less memory, with more accuracy and faster execution. Researchers employing these deep learning model compositions, i.e., CNN-RNN, RNN-RNN, and CNN-CNN, for the video description task demonstrated their findings. A thorough empirical analysis by Aafaq et al. (2019c) concluded that C3D is the most commonly employed model for visual feature extraction from images and short clips, while Inception-ResNet-v2 and temporal encoding achieved comparable results in feature transformation.
Towards language modelling, the depth (number of layers) of the decoder module, the internal state size, the number of processed frames, and the word embedding with dropout regularization are crucial selections for high-quality description generation. If these modules (visual and language) are trained independently and then plugged into a pipeline, they are end-to-end systems. Such systems are pre-trained on large-scale datasets and then fine-tuned on video description datasets for the downstream task of video description. The use of deep learning to caption video has been extensively researched, but numerous challenges remain to be resolved, including accurate identification of objects and their interactions, generating improved event proposals for dense captioning, and utilization of task-specific transformers for accurate vision and language comprehension.

3.3 Attention mechanism

An attention mechanism can be characterized as an act of cautiously focusing on the directed, relevant, and important parts in an image, frame, or scene, i.e., considering only the salient contents to be described, while ignoring others. The general structure of the video captioning model supported by the attention mechanism is grounded on various types of cues from the video. These cues are integrated into the basic framework of the ED to get the decoding process to concentrate on specific parts of the video at each time step to generate an appropriate description.

Figure 6

Distribution of various attentions for video descriptions based on the literature explored for this research work

Before the establishment of the attention mechanism in the standard ED architecture, the encoder block of the employed model converted image or frame features into a single context vector, which was then fed to the decoder unit for word-by-word caption generation. For images loaded with multiple or complicated objects, one intermediate vector cannot adequately convey all the image features, causing the loss of important information and substandard caption generation. The fusion of the attention mechanism empowers the encoder to concentrate on the various essential parts of the frame with distinct intensity, generating multiple context vectors and resulting in enhanced quality of the generated natural language sentences.

Let us suppose the video description system takes a video and generates caption Y for that video such that

where K is the size of the vocabulary, and C is the number of words in the caption. Using a 2D/3D CNN to extract the features from each frame/clip of the video, we have an annotation vector as a collection of all the intermediate context vectors or feature vectors, expressed as

where L is the number of feature vectors, each of which is a D-dimensional representation corresponding to the relevant part of the frame/clip of the video (Xu et al. 2015 ; Bahdanau et al. 2015 ). The attention mechanism permits more direct dependence between the states of the model at different points in time (Raffel and Ellis 2015 ). A model produces an intermediate context vector, or hidden state \(CV_t\) , at time step t . Attention-based models compute a single context vector at time t , \(SCV_t\) , as the weighted mean of the state sequence \(CV_i\) , expressed in ( 6 ) and simplified as ( 7 ):

Each \(CV_i\) contains information about the whole input sequence, with a strong focus on the parts surrounding the i-th word of the input sequence; the main essence of the attention mechanism is to find mappings between an input element and its corresponding output. The attention weight computed at each time step t for each feature vector \(CV_i\) using Softmax is

\(SCORE_{ti}\) in (9) is the attention scoring function, which indicates how well the input around the i-th position matches the output at time t. The score is computed from hidden state \(CV_i\) and the decoder's output at the previous time step, \(W_{t-1}\). \(SCV_t\) is then concatenated with the word output from the decoder's previous time step, producing a concatenated context vector whose weighted feature information conveys where to focus more attention while generating the word at this particular position. This process continues until the decoder outputs the \(\langle END \rangle\) token. The generic review network in Yang et al. (2016) likewise proposed review steps as the concatenation of feature vectors with attention weights, producing a thought vector after each review for input to the decoder attention mechanism. Figure 7 depicts the attention process carried out inside the ED architecture.
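A minimal sketch of this soft-attention computation follows, assuming a generic scoring function passed in by the caller; in real systems the score is conditioned on the decoder's previous state, as in (9), rather than on the feature vector alone.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of scores.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_attention(cv, score_fn):
    """cv: (L, D) array of feature vectors CV_i.
    score_fn: callable giving a relevance score for each CV_i (illustrative).
    Returns the weighted-mean context vector SCV_t and the weights alpha."""
    scores = np.array([score_fn(cv_i) for cv_i in cv])
    alpha = softmax(scores)       # attention weights, sum to 1
    scv = alpha @ cv              # weighted mean of the feature vectors
    return scv, alpha
```

The weights `alpha` make the model's focus inspectable: features whose scores dominate contribute most to the context vector fed to the decoder.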

Strategies for model optimization during training include the teacher forcing technique (Williams and Zipser 1989), curriculum learning (Bengio et al. 2009), and RL-based optimization techniques. Teacher forcing is a simple way to train RNN-based models while constructing the concatenated context vector: a word from the reference annotation, instead of the word actually generated at the previous time step, is fed back to guide word generation. This improves the model's learning capability and produces better results in the testing phase. Later, Huszár (2015) proved the biased learning tendency of teacher forcing and curriculum learning, and professor forcing (Goyal et al. 2016) was proposed for RNN optimization, adopting an adversarial domain method to align the behavior of the RNN during the training and testing phases (Chen et al. 2019a).
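Teacher forcing can be sketched as a switch on what is fed back to the decoder at each step. Here `step_fn` is an assumed one-step decoder interface (state and previous token in; logits and new state out), not the API of any cited model.

```python
import numpy as np

def decode_training_step(step_fn, h0, reference, teacher_forcing=True):
    """Run a decoder over a reference caption.
    With teacher forcing, the ground-truth token from the reference (not the
    model's own prediction) is fed back at each step; without it, the model
    runs free on its own predictions."""
    h, prev_tok, preds = h0, reference[0], []
    for t in range(1, len(reference)):
        logits, h = step_fn(h, prev_tok)
        pred = int(np.argmax(logits))
        preds.append(pred)
        # The teacher-forcing switch: feed truth or feed the model's own output.
        prev_tok = reference[t] if teacher_forcing else pred
    return preds
```

The free-running mode (`teacher_forcing=False`) is what the model faces at test time; the mismatch between the two regimes is exactly the bias that professor forcing targets.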

Figure 7

The attention mechanism in the Encoder–Decoder architecture at time t

Different types of attention can be applied depending on the nature of the problem and situation. Figure 6 shows the adoption of different types of attention mechanisms for video captioning. In Gella et al. (2020), the authors proposed two categories of temporal attention mechanisms (local and global temporal structures) for the task of video description. The local temporal structure captures fine-grained, in-depth details, like picking up a spoon or lying on a bed, whereas the global temporal structure refers to the sequence of events, objects, shots, and persons in the video. For a video description system to be state of the art, it must selectively concentrate on the most prominent features of the sequence in the video, exploiting both global and local temporal information.

To generate high-quality captions, the model needs to integrate the fine-grained visual clues from the image/frame. Lu et al. ( 2017 ) proposed a novel adaptive attention model with a visual sentinel. The spatial and adaptive attention-based model was capable of automatically making decisions on when to count on visual signals and which part of the image to focus on at a particular time, and vice versa. The combination of spatial and adaptive attention with the employed LSTM produced an additional visual sentinel providing a fallback option to the decoder. The sentinel gate helps the decoder get the required information from the image. Supporting that idea, researchers in Gao et al. ( 2019 ) and Song et al. ( 2017 ) suggested a system of hierarchical LSTM (hLSTMat) based on adaptive and temporal attention to enrich the representation ability of the LSTM. The model’s capability to adapt to low-level visual or high-level language information at a certain time step demonstrated robust description for videos.

Two design strategies commonly found in the captioning literature are top-down and bottom-up. The top-down (or modern) strategy starts from the essence, gist, or central idea of the image/frame and then transforms that gist into appropriate words, whereas the bottom-up (or classical) approach first abstracts words for the various dynamics of the frame and then combines those words in a coherent manner. Both approaches suffer from certain limitations: top-down is unable to attend to fine-grained details, and end-to-end training is not possible for the bottom-up approach. To get the benefits of both strategies, You et al. (2016) developed a model combining the two approaches through semantic attention. The proposed model is capable of selectively attending to semantic ideas and regions, guiding when and where to pay more attention. The fusion of attention with the employed RNN structure leads to more efficient and robust performance. Likewise, to tackle the correlation between caption semantics and visual content, Gao et al. (2017) proposed an end-to-end attention LSTM with semantic consistency to automatically generate captions with rich semantics. Attention weights computed from video spatial dynamics are fed into the LSTM decoder, and finally, to bridge the semantic gap between visual content and generated captions, a multi-word embedding methodology is integrated into the system.

The roles of spatial and temporal attention in video captioning are very important. Temporal attention refers to the specific case of visual attention that involves focusing on a particular instant in time, whereas spatial attention involves a specific location in space. Most recent models have adopted spatio-temporal attention to improve accuracy. The studies in Lowell et al. (2014) and Laokulrat et al. (2016) presented early approaches exploiting temporal attention for sequence-to-sequence learning. To attend to both spatial and temporal information present in video frames, Chen et al. (2018b) presented a visual framework based on saliency spatio-temporal attention (SSTA) to better extract the informative visual content and then transform it into natural sentences using an LSTM decoder. The designed spatial mechanism facilitates capturing the dominant visual notions from the salient regions, and the semantic context from the non-salient regions, of the video frame. Experimentation on spatial attention demonstrated that employing residual learning for spatial attention feature generation can improve performance. Models, with their approaches and visual and language components, are summarized in Table 4 for convenience.

Temporal attention commonly captures global features, whereas spatial attention captures local features. Xu et al. (2020) proposed channel attention along with spatial and temporal attention to ensure consistency in the visual features when generating natural language descriptions. Channel features refer to the several feature maps generated by each CNN layer. Spatial (S), temporal (T), and channel (C) attention weights are used to compute the fused features for decoding and caption generation. Eight different combinations of the three attentions were investigated, and S-C-T was the best-performing combination, defining the sequence in which the attentions should be applied while capturing features. For end-to-end learning, Motion Guided Spatial Attention (MGSA) (Chen and Jiang 2019), a spatial attention system exploiting motion between video frames, was developed with a Gated Attention Recurrent Unit (GARU).

Considering attention while incorporating external linguistic knowledge into a captioning system, Zhang et al. ( 2020 ) proposed combining an object-relational graph (ORG) model with teacher-recommended learning (TRL) by Williams and Zipser ( 1989 ). The explored external language model (ELM) produces semantically more analogous captions for long sequences. Appearance, motion, and object features are extracted by employing 2D and 3D CNNs, reflecting the temporal and spatial dynamics of the given video. The STAT captioning system decoder (Yan et al. 2020 ) automatically selects important regions for word prediction depending on the extracted local, global, and motion features, exploiting the spatial and temporal structures in the video. The end-to-end semantic-temporal attention (STA-FG) model (Gao et al. 2020 ) integrated global semantic visual features of a video into the attention network to enhance the quality of the generated captions. Its hierarchical decoder comprises a semantic-based GRU, a semantic-temporal attention block, and a multi-modal decoder for word-by-word generation of semantically rich and accurate captions. SibNet (Liu et al. 2020 ) employs a dual-branch structure for video encoding, where the first branch deals with visual content encoding and the second branch captures semantic information in the video, exploiting visual-semantic joint embeddings. The two branches were designed using temporal convolution blocks (TCBs) and fused employing soft attention for caption generation. OmniNet (Pramanik et al. 2019 ), employing transformer and spatio-temporal cache mechanisms, supports multiple input modalities and can perform part-of-speech tagging, video activity recognition, captioning, and visual question answering.
Because the self-attention mechanism in the transformer architecture efficiently captures global temporal dependencies in sequential data, simultaneous shared learning from multiple input domains is possible, yielding accurate and superior performance.

The intrinsically multi-modal nature of video (i.e., static or appearance features, motion features, and audio features) contributes to caption generation. Learning most of these features increases a model’s ability to understand and interpret the visuals, thus improving overall captioning quality. The video description systems proposed in Wang et al. ( 2018c ), Li et al. ( 2017 ), Xu et al. ( 2017 ), and Hori et al. ( 2017 ) exploit a multi-modal attention mechanism for automatic natural language sentence generation.

More recently, dense video captioning (sports-related) was proposed in Yan et al. ( 2019 ) to segment distinct events in time and then describe them in a series of coherent sentences, focusing in particular on multiple fine-grained details of teams. The model auto-narrates inter-team, intra-team, and individual actions, plus group interactions and all interactive actions, in a progressive manner. The incorporated dense multi-granular attention block exploits spatio-temporal granular feature selection to generate a description. The authors also developed a Sports Video Narrative (SVN) dataset comprising 6k sports videos from YouTube.com, and designed an evaluation metric, Fine-grained Captioning Evaluation (FCE), to measure the accuracy of the generated linguistic description, capturing fine-grained action details along with the complete spatio-temporal interactional structure for dense caption generation.

3.4 Discussion—attention based approaches

The attention mechanism, a general notion of memory, was first implemented to improve the performance of Encoder–Decoder based models in the machine translation domain (Bahdanau et al. 2015 ). Its key concept is to combine all the encoded input vectors in a weighted manner, with the most salient vectors given the highest weights. The attention mechanism forms a direct connection with each time step and enables the decoder to utilize the most relevant parts of the input sequence in a flexible manner. The crucial limitation imposed by the ED’s fixed-length encoding vector for long and complex sequences is its inability to retain long sequences, which hinders system performance; the attention mechanism was created primarily to address this bottleneck in handling long-range dependencies. In implicit attention, the system tends to ignore some parts of the input while concentrating on others. In contrast, explicit attention weighs each part of the input based on previous inputs and concentrates accordingly. The various types of proposed attention include soft (Vaswani et al. 2017 ; Liu et al. 2018 ), hard, self (Pramanik et al. 2019 ), adaptive, semantic, temporal, spatial (Chen and Jiang 2019 ), spatio-temporal (Chen et al. 2018b ; Yan et al. 2020 ; Zhang et al. 2020 ), semantic-temporal (Gao et al. 2020 ), residual (Li et al. 2019b ), global, and local (Peng et al. 2021 ) attention. The attention mechanism mitigates the vanishing gradient problem by providing a direct connection between the visual and language modules. The memory in the attention mechanism is encapsulated in attention scores computed over time. The attention score acts as a magnifier, directing where to focus in the input for accurate output generation.
Several optimization techniques, i.e., teacher forcing, curriculum learning, and reinforcement learning, have also been combined with the attention mechanism in the ED structure to further boost system performance. Despite the easy-to-understand nature of the attention mechanism, more theoretical studies are needed to contribute to an understanding of how attention behaves in complex scenarios.
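The weighted-combination idea behind soft attention can be sketched in a few lines of NumPy. This is a generic illustration using simple dot-product scoring (a simplification of the learned scoring functions in the cited models), not any specific paper's implementation:

```python
import numpy as np

def soft_attention(encoder_states, decoder_state):
    """Soft attention over encoded input vectors.

    encoder_states: (T, d) array, one encoded vector per time step.
    decoder_state:  (d,) current decoder hidden state.
    Returns the context vector (d,) and the attention weights (T,).
    """
    scores = encoder_states @ decoder_state          # (T,) relevance scores
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
    context = weights @ encoder_states               # weighted sum of inputs
    return context, weights
```

The most relevant encoder state receives nearly all the weight, and the context vector is the direct connection the decoder uses at each output step.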

3.5 Transformer mechanism

The transformer (the first sequence transduction model), which has promptly become the model of choice in language processing, is a deep learning architecture introduced in 2017. It transforms one sequence into another following the ED architecture with an attention mechanism, but it differs from the previously explained ED mechanism in that it does not employ any recurrent networks, i.e., an RNN, a GRU, or an LSTM. Transformers are designed to handle ordered sequences of data. However, unlike RNNs, they do not require ordered processing of the data, which allows effective and efficient parallelization during training compared to recurrent architectures. Having become a fundamental building block of most natural-language-related tasks, the transformer facilitates greater parallelization during training, along with training on huge amounts of data. Table  5 lists some transformer-based approaches.

Self-attention, or intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence. The transformer revolves around self-attention, following the Encoder–Decoder architecture but using self-attention instead of a recurrent network to encode each position. Its multiple layers of self-attention address the prohibitive sequential nature, computational complexity, and memory utilization of CNNs and RNNs; allow parallelization, resulting in accelerated training; handle global/long-range dependency learning while requiring minimal inductive bias (prior knowledge); and facilitate domain-agnostic processing of multiple modalities, i.e., text, images, and videos. Because of these performance-boosting characteristics, the transformer has become the model of choice in NLP, CV, and the cross-modal tasks combining these two leading fields. By contrast, self-attention in recurrent architectures relies on sequential processing of the input at the encoding step, resulting in computational inefficiency because the processing cannot be parallelized (Vaswani et al. 2017 ). Undoubtedly, self-attention-based transformers are a considerable improvement over recurrence-based sequential modeling.
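As an illustration of the scaled dot-product self-attention the transformer is built on, the following NumPy sketch computes one unmasked attention pass; the projection matrices `Wq`, `Wk`, `Wv` stand in for learned parameters:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence (no mask).

    X: (T, d_model) input sequence; Wq/Wk/Wv: (d_model, d_k) projections.
    Every position attends to every other position in one matrix product,
    which is what makes the computation parallelizable.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (T, T) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)              # row-wise softmax
    return A @ V                                    # (T, d_k)
```

Note that no position is processed "before" another: the whole (T, T) affinity matrix is computed at once, unlike an RNN's step-by-step encoding.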

The strong motive behind the development of the transformer was to overcome the problem of learning long-range dependencies in sequences, and to allow for more parallelization by eliminating convolution and recurrence. The transformer does not rely on the prohibitively sequential processing of input data the way CNNs and RNNs do. Ordered sequential processing in those earlier deep models is a considerable obstacle to parallelization. If the sequences are too long, it is difficult either to remember the content of distant positions in the sequence or to ensure correctness. Although CNNs are much less sequential than RNNs, the number of steps required to collect information from far-off positions still grows with the sequence, increasing the computational cost and causing the long-range dependency issue.

Figure 8: The standard/vanilla transformer architecture (Vaswani et al. 2017 )

Similar to the ED structure, a transformer also has two key components: an encoder and a decoder, each containing a stack of identical units. Each encoder unit consists of two layers: a self-attention layer and a feed-forward neural network layer. The self-attention layer helps the encoder relate a specific part of the input sequence to the other parts. Embedding is performed only in the bottom-most encoder, since only that encoder receives a raw input vector; every other encoder takes its input from the previous one, i.e., the output of one encoder becomes the input of the next. After the parts of the input sequence are embedded, each of them flows through the two layers of the encoder, allowing parallel execution. Both the attention and feed-forward layers have a residual connection around them and are followed by a normalization layer. The decoder contains the same layers, plus an additional ED attention layer that helps the decoder focus on the relevant parts of the input sequence while generating captions.
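The residual-plus-normalization wiring of one encoder unit can be sketched as follows. This is a toy NumPy version: `attn_fn` stands in for the self-attention sub-layer, and the FFN weight matrices are hypothetical parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(X, attn_fn, W1, b1, W2, b2):
    """One encoder unit: self-attention and a feed-forward layer, each
    wrapped in a residual connection followed by layer normalization."""
    X = layer_norm(X + attn_fn(X))                 # attention sub-layer
    ffn = np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # two linear maps with ReLU
    return layer_norm(X + ffn)                     # feed-forward sub-layer
```

Stacking several such units, with the output of one feeding the next, reproduces the encoder stack described above.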

Multi-head attention (a refinement of self-attention) improves the performance of the attention layer by extending the model’s ability to focus on different positions of the input sequence, giving the attention layer multiple representation sub-spaces. In order to encode the position or distance of each word in the input sequence, the transformer adds a vector to each input embedding, i.e., a positional encoding. Because this vector follows a specific fixed pattern, it facilitates efficient learning in the model, and the positional encoding can scale to unseen input lengths.
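A minimal sketch of the head split behind multi-head attention follows. The per-head projection matrices are omitted for brevity, so this only illustrates how the model dimension is divided into sub-spaces, attended independently, and re-concatenated:

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Toy multi-head self-attention: split the model dimension into
    num_heads sub-spaces, attend in each independently, concatenate."""
    T, d = X.shape
    assert d % num_heads == 0
    d_h = d // num_heads
    heads = []
    for h in range(num_heads):
        Xh = X[:, h * d_h:(h + 1) * d_h]            # this head's sub-space
        scores = Xh @ Xh.T / np.sqrt(d_h)           # attention within the head
        scores -= scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A /= A.sum(axis=-1, keepdims=True)
        heads.append(A @ Xh)
    return np.concatenate(heads, axis=-1)           # back to (T, d)
```

Each head can attend to a different position pattern, which is the "multiple representation sub-spaces" benefit described above.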

The output of the top encoder is transformed into attention vectors, K (Keys) and V (Values), and is fed into each decoder’s ED attention layer. The attention layer in the decoder can only attend to earlier positions in the output sequence before the Softmax calculation. The working mechanism of the ED attention layer in the decoder is the same as that of the multi-head attention layer, except that it creates its Queries vector from the layer below it and accepts the Keys and Values vectors from the top encoder. Logit and Softmax layers at the end of the decoder choose the word with the highest probability.

To solve the issues related to the computational complexity, memory utilization, and long-term dependencies of sequence-to-sequence modeling, several variants of transformers have been proposed in the literature over time. Since video description is a sequence-to-sequence modeling task, these updated versions of transformers can be utilized for reduced complexity and superior performance.

3.5.1 Standard/Vanilla transformer

The standard transformer is a simple transduction architecture for sequence modeling based entirely on an attention mechanism (Vaswani et al. 2017 ), with the objectives of parallelization (the ability to process all input symbols simultaneously) and a reduction in sequential computation, i.e., a constant number of operations is required to determine the dependency between two input symbols, regardless of their positional distance in the sequence. The commonly used recurrent layers in the ED architecture are replaced with multi-head self-attention layers, where self-attention computes the sequence representation by relating its different positions or parts.

Since no recurrence or convolution is involved, a positional encoding vector, which determines context based on the position of words in the sentence, is added to the input embeddings in both the encoder and decoder stacks; it serves to encode a word’s relative or absolute position in the sequence. The input embedding and the positional encoding have the same dimensionality. Sinusoids (sine and cosine functions of different frequencies) are used to compute these positional encodings. Although learned positional encodings (Gehring et al. 2017 ) and sinusoidal positional encodings produce nearly identical results, sinusoids are preferred because they extrapolate to longer sequence lengths. Along with the multi-head attention layers, each layer in the encoder and decoder sections contains a feed-forward neural network (FFN). This FFN consists of two linear transformations with a rectified linear unit (ReLU) activation in between.
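The sinusoidal positional encodings can be computed directly from the formulas in Vaswani et al. (2017), PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal positional encodings: alternating sine and cosine
    columns whose frequencies decrease geometrically with dimension.
    Returns a (T, d_model) array (d_model assumed even)."""
    pos = np.arange(T)[:, None]                     # (T, 1) positions
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)   # (T, d_model / 2)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Because each column is a fixed-frequency sinusoid, encodings for positions beyond those seen during training are well defined, which is the extrapolation property mentioned above.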

Three main requirements motivated the use of self-attention for mapping input and output sequences: the total computational complexity per layer, the amount of computation that can be parallelized, and the path length between long-range dependencies in the network. The number of operations is fixed while computing the representation. Moreover, self-attention layers are faster than recurrent layers and are capable of producing models with increased interpretability. Figure  8 shows the architecture of the standard/vanilla transformer.

3.5.2 Universal transformer

The Universal Transformer (UT) (Uszkoreit and Kaiser 2019 ), proposed in 2019 by Google, introduced recurrence into the transformer to address the standard/vanilla transformer’s lack of computational universality. The UT, a generalized form of the standard transformer, is a parallel-in-time recurrent self-attentive sequence model based on the ED architecture, and recurrently refines the representation of every position in both the input and output sequences. The recurrence is over depth, not over position in the sequence. These representations are revised in parallel in two steps: first, a self-attention mechanism is used for information exchange; second, a transition function is applied to the output of the self-attention. For the standard transformer and RNNs, the depth (the number of sequential steps in the computation) is fixed because of the fixed number of layers, whereas there is no limit to the number of transition-function applications in a UT, making its depth variable. This depth is the main difference between the standard transformer and the UT.

In the encoder section of the UT, representations are computed by applying multi-head soft attention at each time step for all positions in parallel, followed by a transition function; residual connections, dropout, and layer normalization are applied. The transition function can be either a separable convolution or a fully connected neural network. A dynamic per-position halting mechanism based on Adaptive Computation Time (ACT) is also incorporated to select the number of computational steps required for the refinement of each symbol, resulting in enhanced accuracy on many structured algorithmic as well as linguistic tasks. Bilkhu et al. ( 2019 ) employed a UT for single, as well as dense, video captioning tasks, utilizing a 3D CNN for video feature extraction, and reported promising results.
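The per-position halting idea of ACT can be sketched as follows. This is a simplified version that only counts refinement steps from given per-step halting probabilities; the full mechanism also combines the intermediate states weighted by those probabilities:

```python
import numpy as np

def act_steps(halt_probs, threshold=0.99):
    """Adaptive Computation Time halting (simplified, per position).

    halt_probs: (max_steps, T) per-step halting probabilities for each of
    T positions. A position halts once its cumulative halting probability
    crosses the threshold; returns the number of steps each position uses.
    """
    cum = np.cumsum(halt_probs, axis=0)                # running halt mass
    steps = np.argmax(cum >= threshold, axis=0) + 1    # first crossing step
    steps[cum[-1] < threshold] = halt_probs.shape[0]   # never halted: use all
    return steps
```

"Easy" positions halt after one refinement while harder ones keep being revised, which is exactly the variable depth the UT adds over the standard transformer.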

3.5.3 Masked transformer

Dense video captioning is about detecting and describing temporally localized events in a video. An end-to-end masked transformer model was proposed in Zhou et al. ( 2018 ) for dense video captioning. The proposed model consists of three parts. The video encoder is composed of multiple self-attention layers: since events involve long-range dependencies, self-attention is used instead of RNNs for more effective learning. A proposal decoder following ProcNets (Zhou and Corso 2016 ) (an automatic procedure segmentation method) decodes the start and end times of events with a confidence score. A captioning decoder takes input from both the video encoder and the proposal decoder, converting each event proposal into a differentiable mask that restricts attention to the proposed event. Both decoders adjust to each other during training for the best caption generation; the proposed differentiable masking scheme ensures training stability between the proposal and captioning decoders. A standard transformer is employed for both the encoder and the decoders because of its fast self-attention mechanism.
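The effect of the proposal mask on the captioning decoder's attention can be illustrated with a hard window mask. This is a deliberate simplification: the paper's mask is differentiable, precisely so that the proposal and captioning decoders can be trained jointly:

```python
import numpy as np

def event_masked_weights(scores, start, end):
    """Restrict frame-level attention scores to a proposed event window
    [start, end). Frames outside the window get -inf before the softmax,
    so the caption is generated only from the proposed event's frames."""
    T = scores.shape[-1]
    keep = (np.arange(T) >= start) & (np.arange(T) < end)
    masked = np.where(keep, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)
```

Replacing the hard 0/1 window with a smooth, parameterized gating function recovers the differentiable behavior the model needs for end-to-end training.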

3.5.4 Two-view transformer

Two-view Transformer (TvT) is a video captioning technique derived from the standard transformer, with two fusion blocks in the decoder layer to combine different modalities effectively. Parallelization, the primary quality of a transformer, leads to efficient and robust training, and instead of simple concatenation, two types of fusion blocks are proposed to explore information from frame features, motion features, and previously generated words.

TvT (Chen et al. 2018 ) contains two views of visual representations extracted by the encoder block: a frame representation obtained by applying a 2D CNN (ResNet-152 and NASNet pre-trained on ImageNet) to every frame individually, and a motion representation obtained by applying a 3D CNN (I3D pre-trained on Kinetics) to consecutive frames.

The decoder block contains two types of fusion block: add-fusion and attentive-fusion. The add-fusion block simply combines the frame and motion representations with a fixed weight between 0 and 1. The attentive-fusion block combines the two representations in a learnable way such that the two representations, together with previously generated words, can jointly guide the model to generate an accurate description.
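The two fusion blocks can be sketched as follows. This is a toy NumPy version; the projection matrices in the attentive variant are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

def add_fusion(frame_feat, motion_feat, alpha=0.5):
    """Add-fusion: fixed convex combination of the two views."""
    return alpha * frame_feat + (1.0 - alpha) * motion_feat

def attentive_fusion(frame_feat, motion_feat, word_state, Wf, Wm):
    """Attentive-fusion: the language state (previously generated words)
    scores each view; a softmax over the two scores gives learnable
    weights. Wf, Wm are hypothetical learned projection matrices."""
    sf = word_state @ Wf @ frame_feat        # score for the frame view
    sm = word_state @ Wm @ motion_feat       # score for the motion view
    m = max(sf, sm)                          # numerical stability
    ef, em = np.exp(sf - m), np.exp(sm - m)
    wf = ef / (ef + em)
    return wf * frame_feat + (1.0 - wf) * motion_feat
```

In the add-fusion block the trade-off between appearance and motion is fixed ahead of time, while in the attentive variant it shifts word by word as the caption is generated.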

3.5.5 Bidirectional transformer

Bidirectional Encoder Representations from Transformers (BERT) (Kenton et al. 2019 ; Sun et al. 2019b ), a conceptually simple yet powerful fine-tuning-based bidirectional language representation model, is the state of the art for several NLP-specific tasks. BERT uses a bidirectional self-attention mechanism to carry out the tasks of masked language modeling and next-sentence prediction. VideoBERT (Sun et al. 2019b ) (based on BERT) was proposed primarily for text-to-video generation or future prediction, and can be utilized for automatic illustration of instructional videos, such as recipes. VideoBERT has also been applied to the task of video captioning, following the masked transformer (Zhou et al. 2018 ) with a transformer ED, but with the inputs to the encoder replaced by features extracted by VideoBERT. VideoBERT reliably outperforms the S3D baseline (Xie et al. 2018 ), particularly on the CIDEr score. Furthermore, by combining VideoBERT and S3D, the proposed model demonstrated outstanding performance on all metrics. VideoBERT is capable of learning high-level semantic representations and hence achieved substantially better results on the YouCookII dataset. Vision & Language BERT (ViLBERT) (Lu et al. 2019 ) extended BERT to jointly represent text and images, and consists of two parallel streams (visual processing and linguistic processing) interacting through co-attentional transformer layers. ViLBERT with co-attentional transformer blocks outperformed its ablations and surpassed state-of-the-art models when transferred to multiple established vision-and-language tasks, e.g., visual question answering (VQA) (Antol et al. 2015 ), visual commonsense reasoning (VCR) (Zellers et al. 2019 ), grounding referring expressions (Kazemzadeh et al. 2014 ), and caption-based image retrieval (Young et al. 2014 ).

3.5.6 Sparse transformer

Even with transformers, the processing of lengthy sequences demands more time and memory, resulting in poor performance and inefficient systems. The Sparse Transformer (Child et al. 2019 ) introduced several sparse factorizations of the attention matrix, as well as restructured residual blocks and weight initialization to improve the training of deeper networks, and reduced memory usage. Unlike the standard transformer, where training with many layers is difficult, the Sparse Transformer supports hundreds of layers by using pre-activation residual blocks. Instead of positional encoding, learned embeddings proved useful and efficient. Gradient checkpointing is incorporated to effectively reduce the memory required to train deep neural networks. Dropout is applied once, at the end of the residual attention, instead of within the residual block. Experiments with the Sparse Transformer demonstrated better performance on long-sequence modeling, with lower computational complexity.
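The sparse factorization idea can be illustrated by the attention pattern it induces. The sketch below is a simplified, single-pattern version of the strided factorization in Child et al. (2019), which in full splits the pattern across two heads; the point is that each position attends to O(sqrt(T))-style subsets rather than all previous positions:

```python
import numpy as np

def strided_sparse_mask(T, stride):
    """Strided sparse attention pattern (simplified): position i attends
    to the previous `stride` positions and to every stride-th "summary"
    position before that, instead of to all i earlier positions."""
    mask = np.zeros((T, T), dtype=bool)
    for i in range(T):
        for j in range(i + 1):                        # causal: j <= i only
            local = (i - j) < stride                  # recent positions
            summary = (j % stride) == (stride - 1)    # strided summary cells
            mask[i, j] = local or summary
    return mask
```

Applying such a mask before the softmax reduces the number of score entries that must be computed and stored, which is where the memory and time savings on long sequences come from.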

3.5.7 Reformer (the efficient transformer)

To improve the efficiency of the transformer on long sequences, the Reformer (Kitaev et al. 2020 ) was proposed, with reduced complexity and reversible residual layers (Gomez et al. 2017 ) that store activations only once during training. Inside the FFN layer, the activations are split and processed in chunks to save memory. The inclusion of locality-sensitive hashing (LSH) in attention strongly influences training, depending on the total number of hashes employed. It was observed that regular attention becomes slower for lengthy sequences, whereas LSH attention speed remains steady. Experiments on text- and image-generation tasks produced the same results as the standard transformer, but with greater speed and more efficient memory usage.

3.5.8 Transformer-XL

Transformer-XL (Dai et al. 2020 ) is based on the standard transformer architecture and targets better learning of long-range dependencies. Its key technical contributions are the introduction of recurrence into a fully self-attentive model and a novel positional encoding scheme. It introduces a simple but effective relative positional encoding design that generalizes to attention lengths longer than those observed during training. For both character-level and word-level modeling, Transformer-XL is the first self-attention model to achieve substantially better results than RNNs. In speed, compared to the 64-layer standard transformer proposed in Al-Rfou et al. ( 2019 ), Transformer-XL is up to 1,874 times faster.

3.6 Discussion—transformer based approaches

In the wake of Vaswani et al.’s ( 2017 ) successful implementation of the transformer in natural language processing, the transformer has become increasingly popular in a wide range of fields, including computer vision and speech analysis. Several variants have since improved on the vanilla model in terms of generalization, parallelization, adaptation, and efficiency. Its first application, in NLP for translation (Vaswani et al. 2017 ), initiated this journey, and with its robust representation competence the transformer is now proving its worth in the computer vision domain. Specific to video description, Zhou et al. ( 2018 ) introduced the first video paragraph captioning model using a masked transformer. Despite the sequential nature of the captioning task, unlike RNNs, which unroll the sequence one step at a time, transformers can process the entire sequence in parallel at both ends, resulting in efficient and accurate captioning. A transformer enhanced with an external memory block further facilitates maintaining a history of the visual and language information and augmenting the current segment. Dependency among different sequence segments is learned through the self-attention mechanism inside the transformer. Long-range dependencies, hard for RNNs to resolve in the case of longer sequences, are no longer an issue with transformers. The vision transformer (ViT) approach of Hussain et al. ( 2022 ) recognized human activities in surveillance videos, adopting a CNN-free design that captures long-range dependencies in time to accurately encode relative spatial information. Likewise, the video vision transformer (ViViT) (Arnab et al. 2021 ) factorized the spatial and temporal dimensions of the input video to handle the long token sequences encountered in video.
Models employing modern transformers have demonstrated comparable results in handling long-range dependencies on the video description task, yet developing efficient transformer models for computer vision tasks remains an open problem. Transformer models are usually huge and computationally expensive. In spite of their success in various applications, they require large amounts of computing and memory resources, which limits their use on resource-constrained devices such as mobile phones (Han et al. 2022 ). To cater to resource-limited devices, research into designing efficient transformer models for visio-linguistic tasks needs attention.

3.7 Deep reinforcement learning (DRL)

Trial and error, or acting and learning from experience, is the core of reinforcement learning (RL). It is all about taking appropriate actions in a given environment and accommodating the reward/penalty by following a policy. Deep RL approaches have shown efficient performance in real-world games. In particular, in 2013, Google DeepMind (Mnih et al. 2013 , 2015 ) took the initiative and demonstrated that a single architecture could successfully learn control policies in a range of different environments with minimal prior knowledge, showing successful integration of RL with deep network architectures. Although DRL models face many adversities compared to conventional learning, DRL has shown extraordinarily proficient performance in captioning. When employing DRL for description, the evaluation metrics chosen for the reward function are optimized to increase the readability of the generated captions and to maintain training stability and system convergence. Figure  9 shows the RL agent-environment interaction for video description, and some well-known DRL approaches and their components (agent, action, environment, reward, and goal) are summarized in Table 6 for a quick view.

An efficacious combination of RL (He et al. 2019 ) with supervised learning was presented in a multi-task learning framework. The goal of the system is to learn a policy to correctly ground specific descriptions in the video. As a reward, the model encourages the agent to match clips more precisely over time; the agent obtains precise information from the environment and maximizes the reward by exploring or exploiting the whole environment, forming a sequential decision process. The actor-critic algorithm (Montague 1999 ) is employed to generate the policy and take appropriate actions. The agent is responsible for iteratively adjusting the temporal boundaries until specified conditions are met. After an action is completed, the environment (a combination of video, description, and temporally grounded boundaries) is modified accordingly. State vectors combine a description with global, local, and location features, which are then fed into a GRU-FC-based actor-critic module for policy and state-value learning. A penalty mechanism is also defined to keep computational costs within limits. As the name indicates, the agent (model) reads the description, watches the video and localization, and then iteratively moves the temporal grounding boundaries for the best clip match according to the description.

The ED architecture intrinsically obstructs end-to-end training because of the lengthy sequences in both the model’s input and output. Therefore, a multi-task RL model (Li and Qiu 2020 ) was proposed for end-to-end training while avoiding over-fitting. The primary job of the model is to mine as many tasks as possible from human-annotated videos, which can regulate the search space of the ED network; end-to-end training for video captioning is then carried out. The auxiliary assignment of the model is to predict the attributes mined from the reference captions and, based on these predictions, maximize the reward defined in the RL system. Specific to RL, the objective is to train an agent to accomplish tasks in an environment by performing a sequence of actions. For video captioning, the model aims to automatically generate a precise and meaningful sentence after processing the provided video. The agent’s action is to predict the next word in the sequence at each time step. The model’s reward is the evaluation metric used in the test phase; the CIDEr score functions as the reward signal. Finally, evaluation of the multi-task training revealed that domain-specific video representation is more influential than generic image features.

Sequence-to-sequence models optimize word-level cross-entropy loss during training, whereas the video captioning model proposed in Pasunuru and Bansal ( 2017 ) optimizes sentence-level, task-based metrics using policy gradients and a mixed-loss method for RL. Moreover, an entailment-enhanced reward, CIDEnt, was proposed that adjusts phrase-matching-based metrics, penalizing the phrase-matching (CIDEr-based) reward when the entailment score is low. An automatically generated caption gets a high entailment score only when it logically matches the ground-truth annotation, instead of merely matching words.

Cross-entropy loss and reward-based loss are combined as a mixed loss to maintain output fluency and resolve the exposure bias issue. At first, the CIDEr reward demonstrated significant improvement, and after that, the CIDEnt reward further enhanced system performance.

Most captioning systems are trained by maximizing the maximum likelihood estimate (MLE), i.e., the similarity between the generated and reference captions, or by minimizing the cross-entropy (XE) loss. However, the MLE/XE approach suffers from two inadequacies: objective mismatch and exposure bias. Recent research demonstrated that, for the captioning task, evaluation metrics can be optimized directly using RL, keeping in mind the associated computational cost and designing the reward for system convergence. The Self-Consensus Baseline (SCB) model proposed in Phan et al. ( 2017 ) trains concurrently on multiple descriptions of the same video, and employs human-annotated captions as a baseline for reward calculation instead of creating a new baseline for each generated caption. Following the ED approach, the encoder uses ResNet for static image features, C3D for short-term motion, and MFCC for acoustic features; GloVe provides the word embeddings, and an LSTM is the decoder employed for language generation. Taking the LSTM language model as an agent in an environment of video features and words, the action is to predict the next word, attaining the dual goals of producing an accurate textual equivalent of the given video and minimizing the model’s negative expected reward. Compared to the MIXER approach (Ranzato et al. 2016 ), where RL training is gradually mixed into XE training to stabilize learning, SCB trains with both RL and XE simultaneously. This research establishes a connection between RL and XE training, utilizing consensus among multiple reference captions to improve training and eliminate objective mismatch and exposure bias.
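The reward-with-baseline training signal common to these RL captioning methods, and the mixed loss that blends it with cross-entropy, can be sketched as follows. This is a scalar toy version; in practice the reward is a sentence-level metric such as the CIDEr score of the sampled caption, and the baseline is, e.g., the consensus score of the reference captions:

```python
import numpy as np

def reinforce_loss(log_probs, reward, baseline):
    """REINFORCE with a baseline: the advantage (reward - baseline)
    weights the negative log-likelihood of the sampled caption's words.
    Subtracting the baseline reduces the variance of the gradient."""
    advantage = reward - baseline
    return -advantage * np.sum(log_probs)

def mixed_loss(xe_loss, rl_loss, gamma=0.9):
    """Mixed loss: a convex combination of cross-entropy and RL loss,
    used to keep the output fluent while optimizing the metric reward."""
    return (1.0 - gamma) * xe_loss + gamma * rl_loss
```

A caption scoring above the baseline is reinforced (its log-likelihood is pushed up), while one scoring below it is suppressed, which directly addresses the objective-mismatch problem of pure XE training.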

Figure 9: DRL agent-environment interaction for video description

A fully differentiable deep neural network comprising a higher- and a lower-level sequence model was proposed in Wang et al. ( 2018b ) for video description. Employing a hierarchical RL (HRL) framework, the agent, environment, action, reward, and goal are defined to efficiently learn semantic dynamics within the ED architecture. Each video is sampled at 3 fps, and ResNet-152 (Zhang et al. 2017 ) features are extracted from the sampled frames. The extracted features are fed into a two-stage encoder, i.e., a low-level bidirectional LSTM (Schuster and Paliwal 1997 ) and a high-level LSTM (Cascade-correlation and Chunking 1997 ). In the decoding phase, the HRL agent serves as the decoder. To better capture the temporal dynamics, an attention strategy is employed. The HRL agent is composed of three components: (1) a low-level worker that selects actions at each time step to achieve a goal, (2) a high-level manager that sets the goals, and (3) an internal critic (an RNN structure) that determines whether a goal has been accomplished and informs the manager accordingly. Both the worker and the manager are equipped with the attention mechanism. A strongly convergent policy is a challenging area in RL implementation, and the proposed HRL model achieved high convergence by applying cross-entropy loss optimization. The model is able to capture in-depth details of the video content and can generate more detailed and accurate descriptions.

To avoid redundant visual processing and to lower computational costs, a plug-and-play PickNet model (Chen et al. 2018a ) was proposed to perform informative frame selection. The solution comprises two parts: PickNet for efficient frame selection, and a standard encoder (LSTM) and decoder (GRU) architecture for caption generation. The RL-based PickNet selects informative frames without full knowledge of the environment, i.e., it decides to pick or drop a frame only on the basis of the current state and the history. The agent selects a subset of frames retaining the maximum visual content (six to eight frames, on average, per video), while other models commonly need up to 40 frames for analysis. Pursuing flexibility, efficiency, and effectiveness, the selected keyframes increase visual diversity and decrease textual inconsistency. Visual-diversity and language rewards are defined, along with a negative reward that discourages selecting too many (or too few) frames. Model training is performed in three phases: first, a supervision phase, where the ED is pre-trained; second, a reinforcement phase, where PickNet is trained with RL; and third, an adaptation phase in which PickNet and the ED are jointly trained.
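A composite reward of this kind can be sketched as follows. The function name, the equal weighting of the two reward terms, and the 6–8 frame window are assumptions for illustration, not the paper's exact formulation:

```python
def picknet_style_reward(n_picked, visual_diversity, language_reward,
                         min_frames=6, max_frames=8, penalty=1.0):
    """Combine a visual-diversity reward and a language reward, and subtract
    a fixed negative reward when too many or too few frames are picked.
    Illustrative sketch only; weights and window size are assumed."""
    reward = visual_diversity + language_reward
    if not (min_frames <= n_picked <= max_frames):
        reward -= penalty  # discourage over- or under-selection of frames
    return reward
```

With this shape, an agent that picks 30 frames earns strictly less than one that picks 7 frames with the same diversity and language scores, which is exactly the pressure toward compact keyframe subsets described above.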

3.8 Discussion—deep reinforcement learning (DRL)

In recent years, the encoder–decoder structure has demonstrated promising results when fused with attention and transformer mechanisms. However, due to long-range dependencies and the semantic gap between the visual and language domains, the generated descriptions still contain numerous inaccuracies. These errors can be handled by optimization through deep reinforcement learning. The polishing network (Xu et al. 2021 ) follows the human proofreading mechanism by evaluating and gradually improving the generated captions to revise word and grammatical errors. The choice of evaluation metric as the reward function also plays a role in robust performance; the CIDEr score is the choice in most articles. A deep reinforcement learning framework comprises an environment, an agent, actions, rewards, and a goal. For video captioning, the goal is to generate an accurate description aligned with the visual information of the video. The generative language model acts as the agent and takes the action of predicting the next word. The provided video and the ground-truth descriptions play the role of the environment, rewarding the agent with the selected evaluation metric's score on successful word generation, or penalizing that score otherwise. The environment updates the state of attention weights or hidden states depending on the employed mechanism. This cycle of agent actions and environment state/reward updates continues, gradually improving the generated description, as shown in Figure 9 . There has been growing interest in DRL and hierarchical RL-based methods in recent years, which have shown comparable results in video description.
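The agent-environment cycle of Figure 9 can be sketched as a single generation episode. Here `policy` is a stand-in for the LSTM language model, and all names are illustrative, not from any specific system:

```python
def caption_episode(policy, video_features, max_len=20, eos="<eos>"):
    """One episode of the captioning agent: the state is (video features,
    words generated so far); each action appends the next word; the episode
    ends at <eos> or max_len. A sentence-level metric such as CIDEr would
    then be computed on the finished caption to produce the reward."""
    words = []
    for _ in range(max_len):
        action = policy(video_features, words)  # agent picks the next word
        words.append(action)                    # environment updates the state
        if action == eos:
            break
    return words
```

A stub policy that emits a fixed word stream shows the loop terminating at the end-of-sentence token rather than running to `max_len`.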

4 Results comparison & discussion

The benchmark results generated by various models in the recent past are discussed in this section. Techniques are segregated by dataset and further categorized in chronological order according to the approach/mechanism adopted for experimentation.

Video description models are mostly evaluated on MSVD (Chen and Dolan 2011 ) and MSR-VTT (Xu et al. 2016 ) datasets because of the wide-ranging and diverse nature of the videos, the availability of multiple ground truth captions for model training and evaluation, and most importantly, task specificity. For models having multiple variants during experimentation, the best performing variant is reported here. Scores shown in bold were the best performing.

4.1 Evaluation metrics

Most of the metrics commonly used to evaluate automatically generated captions, namely BLEU@(1,2,3,4), METEOR, ROUGE, and WMD, come from NLP domains such as neural machine translation (NMT) and document summarization. CIDEr and SPICE evolved as a result of the increased demand for task-specific (captioning) metrics. It is essential for the description to possess the qualities of acceptability, consistency, and expression fluency, particularly when considering evaluations made by humans (Sharif et al. 2018 ). An evaluation metric is considered best when it exhibits a significant correlation with human scores (Zhang and Vogel 2010 ). A short description of the metrics most used to evaluate automatically generated descriptions is given below. For detailed computational concepts along with their limitations, please refer to Rafiq et al. ( 2021 ).

4.1.1 BLEU

Bi-Lingual Evaluation Understudy This evaluation metric (Doddington 2002 ) measures the numerical proximity between generated captions and their referenced counterparts. It computes the unigram (single-word overlap) or n-gram (overlap of n adjacent words) matches between the two texts, i.e., the reference and the generated caption. Multiple reference annotations for a single video help to ensure a good BLEU score. The metric is based on the precision measure, which is also its main limitation: the work of Lavie et al. ( 2004 ) demonstrated that considerably higher correlation with human judgment can be achieved by emphasizing recall over precision.
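The clipped-precision core of BLEU can be sketched as follows, simplified to a single reference and a single n-gram order, without the brevity penalty of the full metric:

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Each candidate n-gram's count is clipped by its count in the
    reference, so repeating a correct word cannot inflate the score.
    Full BLEU combines several n-gram orders and a brevity penalty."""
    grams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)
```

The clipping is what distinguishes this from plain precision: the degenerate caption "the the the" scores only 1/3 against the reference "the cat", because the candidate's three occurrences of "the" are clipped to the single occurrence in the reference.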

4.1.2 METEOR

Metric for Evaluation of Translation with Explicit ORdering This metric (Lavie and Agarwal 2007 ) requires an explicit word-to-word match between the predicted translation and one or more reference annotations. It supports the matching of identical words, synonyms, and words with identical stems, and also accounts for word order in the reference and predicted sentences. The computation is based on the harmonic mean of the precision and recall of unigram matches between the sentences (Kilickaya et al. 2017 ). Moreover, the METEOR score correlates more closely with human judgment (Elliott and Keller 2014 ).
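METEOR's recall-weighted harmonic mean (its F-mean, with alpha = 0.9 as in the original metric) can be written directly; the chunk-based fragmentation penalty of the full metric is omitted from this sketch:

```python
def meteor_fmean(precision, recall, alpha=0.9):
    """Harmonic-mean core of METEOR: unigram precision and recall combined
    with recall weighted much more heavily (alpha = 0.9). The full METEOR
    score also multiplies in a fragmentation penalty, omitted here."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```

Because alpha = 0.9 puts nine times more weight on recall than on precision, a hypothesis with high recall and modest precision outscores one with the reverse profile.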

4.1.3 ROUGE

Recall-Oriented Understudy for Gisting Evaluation This metric (Lin 2004 ) belongs to the family of NLP (document summarization) evaluation metrics. Multiple ROUGE variants are used to determine how closely the generated and reference summaries are comparable. Among these variants, ROUGE-N (n-gram co-occurrence) and ROUGE-L (longest common sub-sequence) are relevant to image and video captioning evaluation. ROUGE-N is the n-gram recall between the predicted summary and one or more reference summaries. In contrast, ROUGE-L uses a similarity score based on the recall and precision of the longest common sub-sequence between the generated and reference sentences.
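ROUGE-L's LCS-based F-score can be sketched with the standard dynamic program. The beta = 1.2 default below, which weights recall more heavily, follows the recall-oriented spirit of Lin (2004) but the exact value is an assumption of this sketch:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """F-score built from LCS-based precision and recall; beta > 1 weights
    recall more heavily than precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

Unlike n-gram overlap, the LCS rewards in-order matches without requiring them to be contiguous, so "the cat quietly sat" still shares a length-3 subsequence with "the cat sat".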

4.1.4 CIDEr

Consensus-Based Image Description Evaluation CIDEr (Vedantam et al. 2015 ) is an image description evaluation metric based on human consensus. When comparing a generated sentence against the set of reference human annotations provided for an image, CIDEr captures the underlying notions of prominence, accuracy, and linguistic quality. Computationally, it measures the cosine similarity between TF-IDF-weighted n-gram vectors of the reference and generated captions. The CIDEr-D variant is widely used for image and video description evaluation; it removes the stemming of the basic CIDEr metric to ensure that the correct verb forms are used, and it exhibits a high Spearman's rank correlation with the original CIDEr score.
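The cosine step of CIDEr can be sketched on raw n-gram counts; in the full metric the vectors are TF-IDF weighted over the whole corpus and the similarity is averaged over all references and over n = 1..4, so this is only the single-pair core:

```python
import math
from collections import Counter

def ngram_cosine(candidate, reference, n=1):
    """Cosine similarity between the n-gram count vectors of two token
    lists. CIDEr applies this to TF-IDF-weighted vectors for n = 1..4 and
    averages over all human references; this shows the single-pair core."""
    grams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    u, v = grams(candidate), grams(reference)
    dot = sum(c * v.get(g, 0) for g, c in u.items())
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```

The TF-IDF weighting omitted here is what encodes "consensus": n-grams that appear in many reference captions across the corpus are down-weighted, so matching rare, image-specific phrases counts for more.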

All these evaluation metrics follow the strategy of the higher, the better : higher scores are considered better for BLEU, METEOR, ROUGE, and CIDEr. For models computing BLEU@1, BLEU@2, BLEU@3, and BLEU@4, only BLEU@4 is reported here because its characteristics are most analogous to human annotations.

Figure 10: Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the standard encoder–decoder approach

4.2 Datasets for evaluation

A dataset, defined as a collection of video clips with their respective annotations or descriptions, creates the basis for training, validating, and testing a model. Domain-specific datasets cover areas such as cooking, movies, social media, and human and in-the-wild actions. In contrast, a wide variety of videos can be found in open-domain datasets. The following is a brief description of the benchmark datasets most widely used in recent video description research.

4.2.1 MSVD—the microsoft video description dataset

Table  7 summarizes the results from popular models using the MSVD dataset. MSVD (Chen and Dolan 2011 ) is one of the earliest available corpora, frequently used by the research community around the globe. It is a collection of 1,970 YouTube video clips provided with human annotations. The clips were collected by requesting them from Amazon Mechanical Turk (AMT) workers, who were guided to pick short snippets depicting a single activity and were asked to mute the audio. Each video clip is 10 to 25 seconds long, on average. Afterward, these snippets were labeled with multilingual, single-sentence captions provided by annotators. Frequently used splits of the dataset for training, validation, and testing comprise 1,200, 100, and 670 video clips, respectively. Figure 10 shows histograms of BLEU, ROUGE-L, METEOR, and CIDEr scores for standard encoder–decoder structures evaluated on the MSVD and MSR-VTT datasets. Figure 11 demonstrates the performance evaluation of transformer-based models on MSVD and MSR-VTT. Figure  12 graphically presents the performance evaluation of DRL-based methods employing MSVD and MSR-VTT. Figure  13 depicts the results obtained by attention-based approaches evaluated on the MSVD and MSR-VTT datasets.

Figure 11: Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the transformer mechanism based approach

Considering the standard ED mechanism, benchmarking with MSVD revealed that SeFLA (Lee and Kim 2018 ), a semantic feature learning-based caption generation model, showed the best BLEU score, and VNS-GRU (Chen et al. 2020 ) achieved the best METEOR, ROUGE, and CIDEr scores. Advancements in neural machine learning have demonstrated encouraging improvements on the video description task, but models trained using word-level losses do not correlate well with sentence-level metrics, although all the evaluation metrics are sentence-level. So, metric optimization is critically needed for high-quality caption generation, and deep reinforcement learning is employed for the optimization of many techniques. Among DRL approaches evaluated on the MSVD dataset, the best performance came from Pasunuru and Bansal ( 2017 ) in all metrics (BLEU, METEOR, ROUGE, and CIDEr).

Figure 12: Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the reinforcement learning based approach

Models employing a transformer mechanism are progressing at a good pace. Among the transformer-based models, Two-view Transformer (Chen et al. 2018 ) performed the best in BLEU scoring, whereas Non-Autoregressive Video Captioning with Iterative Refinement (Yang et al. 2019 ) performed excellently under METEOR scoring, and the recently proposed SBAT (Jin et al. 2020 ) outperformed all previous models based on the ROUGE and CIDEr metrics. For attention-based approaches, SemSynAN (Perez-Martin et al. 2021b ) outperformed the existing methods based on BLEU@4, METEOR, ROUGE, and CIDEr scores with the MSVD dataset.

Figure 13: Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the attention mechanism based approach

For the overall performance evaluation from all four mechanisms on the MSVD dataset, SeFLA (Lee and Kim 2018 ), a semantic feature learning-based caption generation model, demonstrated an excellent BLEU score; SemSynAN (Perez-Martin et al. 2021b ) produced the top METEOR and ROUGE scores, and VNS-GRU (Chen et al. 2020 ) achieved the best CIDEr score. It is clear from Table  7 that for short video clips comprising a single activity, the standard ED mechanism and the attention-based mechanism achieved top results.

4.2.2 MSR-VTT—microsoft research video to text

Table  8 demonstrates the results reported for the MSR-VTT dataset (Xu et al. 2016 ), an open-domain, large-scale benchmark with 20 broad categories and diverse video content bridging vision and language. It comprises 10,000 clips originating from 7,180 videos. Being open-domain, it includes videos from categories like music, people, gaming, sports, news, education, vehicles, beauty, and advertisement. The duration of each clip, on average, is 10–30 seconds, resulting in a total of 41.2 h of video. To provide good semantics for the clips, 1,327 AMT workers were engaged to annotate each one with 20 natural sentences. The data split suggested in Xu et al. ( 2016 ) uses 6,513 videos for training, 497 for validation, and 2,990 for testing.

Considering the standard ED mechanism, benchmarking on the MSR-VTT dataset demonstrated that VNS-GRU (Chen et al. 2020 ), a variational-normalized semantic GRU-based caption generation model, showed the best BLEU and CIDEr scores; DCM (Xiao and Shi 2019z ), a diverse captioning model with a conditional GAN, achieved the best performance on the METEOR and ROUGE metrics. Among the DRL-based methods, Consensus-based Sequence Training (CST) (Phan et al. 2017 ) was trained concurrently on multiple descriptions of the same video. It employed human-annotated captions as a baseline for reward calculation, instead of creating a new baseline for each generated caption, thereby directly optimizing the evaluation metrics; this DRL approach performed well on the BLEU, METEOR, ROUGE, and CIDEr metrics with MSR-VTT. Among the approaches based on a transformer mechanism, the recently proposed SBAT (Jin et al. 2020 ) outperformed all previous models for all four metrics. In the attention-based approaches with the MSR-VTT dataset, the recently proposed SemSynAN (Perez-Martin et al. 2021b ) outperformed the existing methods on the METEOR and ROUGE metrics, whereas MSAN (Sun et al. 2019b ), a multi-modal semantic attention network, performed excellently on BLEU and CIDEr.

Considering the overall performance evaluations from all four mechanisms with the MSR-VTT dataset, MSAN (Sun et al. 2019b ) demonstrated an excellent BLEU score, DCM (Xiao and Shi 2019z ) (a diverse captioning model with a conditional GAN) achieved the best results for METEOR and ROUGE metrics, and the DRL-based CST model (Phan et al. 2017 ) achieved the best score from CIDEr.

Figure 14: Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the ActivityNet Captions dataset for the transformer mechanism & standard encoder–decoder based approaches

4.2.3 ActivityNet Captions

Results reported from the ActivityNet Captions dataset are presented in Table  9 . ActivityNet Captions (Krishna et al. 2017 ) is a dataset specific to dense event captioning. It covers a wide range of categories and comprises 20k videos taken from ActivityNet, centered around human activities, with a total duration of 849 hours and 100k descriptions. Overlapping events occurring in a video are provided, and each description uniquely describes a dedicated segment of the video, so events are described over time. Each video is annotated with temporally localized descriptions, on average 3.65 sentences and 40 words per video. Event detection is demonstrated in short clips as well as in long video sequences.

Considering the standard ED mechanism, benchmarking on the ActivityNet Captions dataset demonstrated that Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning (JSRL-VCT) (Hou et al. 2019 ) produced the top METEOR, ROUGE, and CIDEr scores, whereas Video Captioning of Future Frames (VC-FF) (Hosseinzadeh et al. 2021 ) achieved the best results on the BLEU metric. Among the transformer-based approaches, the recently proposed COOT (Ging et al. 2020 ), a cooperative hierarchical transformer model, outperformed all previous models for all four metrics. Although the universal transformer approach (Bilkhu et al. 2019 ) demonstrated the highest BLEU score, evaluation based on a single metric alone cannot guarantee whole-system performance. For attention-based approaches, the creators of the ActivityNet Captions dataset (Krishna et al. 2017 ) reported scores for all four metrics. For dense video captioning, in the overall performance evaluation of all four mechanisms on ActivityNet Captions, COOT (Ging et al. 2020 ) outperformed all mechanisms for all four metrics. Figure 14 graphically illustrates the results obtained by standard encoder–decoder and transformer-based models on the ActivityNet Captions dataset.

4.2.4 YouCookII

Table  10 presents results reported with the YouCookII (Zhou and Corso 2016 ) dataset, another dataset mostly utilized to evaluate dense video captioning systems. This dataset comprises 2k YouTube videos that are almost uniformly distributed over 89 recipes from major cuisines all over the world, using a wide variety of cooking styles, components, instructions, and appliances. Each video in the dataset contains 3–16 temporally localized segments annotated in English. There are 7.7 segments per video, on average. About 2,600 words are used while describing the recipes. The data split was 67% videos for training, 23% for validation, and 10% for testing purposes.

Only transformer-based model evaluation results are reported for the YouCookII dataset, showing that COOT (Ging et al. 2020 ) again produced excellent scores on all four metrics.

4.2.5 TVC - TV show caption

Results reported from using the TV show Caption dataset are presented in Table  11 . The TVC dataset (Lei et al. 2020b ) is a multi-modal captioning dataset with 262k captions, extended from the TV show Retrieval (TVR) dataset by collecting additional descriptions for every annotated video clip or moment. It utilizes both video and subtitles for gathering the required information and generating appropriate descriptions. The TVC dataset contains 108k video clips paired with 262k descriptions; on average, there are two to four descriptions per video clip. Human annotators were engaged to write descriptions for video only, and for video+subtitle where subtitles existed. The transformer-based MMT model (Lei et al. 2020b ), evaluated on TVC with both video and subtitle modalities, outperformed the models using only one of the modalities. This establishes that videos and subtitles are equally valuable for concise and appropriate description generation. Unlike previous video description datasets focusing on captions that illustrate visual content, the TVC dataset also targets captions that describe the subtitles.

The creators of the TVC dataset reported comparable results with MMT (Lei et al. 2020b ); however, HERO (Li et al. 2020 ) demonstrated the highest BLEU, METEOR, ROUGE, and CIDEr scores, with the MMT model providing tough competition.

4.2.6 VATEX - video and TEXt

Table  12 shows the results reported from using the VATEX dataset in both English and Chinese.

VATEX (Wang et al. 2019b ) is a large, complex, and diverse multilingual dataset for video description. It contains 41,269 unique videos covering 600 human activities from Kinetics-600 (Kay et al. 2017 ). Every clip has 10 English and 10 Chinese captions, with at least 10 words per English caption and 15 words per Chinese caption. In total, VATEX comprises 413k English and 413k Chinese captions for 41.3k unique videos. The Chinese descriptions for each video clip are divided into two parts: half of the descriptions directly describe the video content, while the other half are paired translations of the English captions for the same clip (produced with Google, Microsoft, and a self-developed translation system).

For VATEX English and Chinese evaluations of the transformer model, only the X-linear+transformer model is reported, considering it had the highest scores for all metrics. For attention-based systems, Multi-modal Feature Fusion with Feature Attention (FAtt) (Lin et al. 2020 ) outperformed the baseline with a significant gap, and recorded the highest results for both English and Chinese captioning. However, if we consider the overall performance comparison, the X-linear+transformer model achieved the highest scores for both English and Chinese captioning based on all four metrics.

In Table  13 , demonstrating results for miscellaneous datasets, none of the results is highlighted because they were all evaluated on different datasets with different diversities and complexities, so we cannot compare them directly.

From all the above results, we conclude that for simple, single-sentence caption generation, the standard ED & attention mechanisms provide excellent performance, whereas for dense video captioning, the transformer mechanism outperformed the others. For the models to better correlate with sentence-level losses, DRL-based metric optimization is critically needed for high-quality caption generation.

5 Conclusions

Vision and language are the two fundamental systems of human representations, and combining these two into one intelligent and smart system has long been a dream of artificial intelligence.

This survey investigated in detail the four main approaches to video description systems. These deep learning techniques, primarily employing the ED architecture, further accommodate the attention mechanism, the transformer mechanism, and DRL for efficient and accurate output. Owing to the diverse and complex intrinsic structure of video, capturing all the fine-grained detail and complicated spatio-temporal information present in the video context has not yet been achieved. Following the success of image captioning, a lot of research is in progress across the globe on the task of creating video descriptions, yet further improvement is still required in diverse visual information extraction and accurate description generation.

Deep learning for video description mostly revolves around recurrence for sequential data processing, but the main bottleneck, long-term dependencies, remains. As an alternative to recurrence, the transformer mechanism is capable of parallel processing, accelerated training, and handling long-term dependencies; it is space-efficient, much faster than purely recurrent methods, and is the model of choice on current advanced hardware. Researchers worldwide have put effort into improving generated video descriptions using different state-of-the-art methodologies, but still, even the best-performing method cannot match human-generated descriptions. Despite tremendous improvements, generated descriptions are not yet analogous to human interpretations. We can therefore say that the upper bound is still far away, and there is a lot more room for research in this area.

There is a need to incorporate commonsense reasoning and external knowledge in the models to improve the generated captions.

The intrinsic multi-modal nature of video contributes to generating captions. Learning multiple features, like visuals, audio, and subtitles (if available in the video), increases the model’s ability to better understand and interpret (Ramanishka et al. 2016 ; Iashin and Rahtu 2020 ; Wang et al. 2018c ; Xu et al. 2017 ; Hori et al. 2017 ), thus, improving the overall captioning quality. There is a need to explore this research direction further.

The design and development of diversity measuring evaluation metrics to facilitate diverse, efficient, and accurate caption generation is indispensable.

For optimization of video captioning systems, extensive exploration of DRL is required.

The unprecedented breakthrough of data-hungry deep learning in various challenging tasks is due to a large number of publicly annotated datasets. The currently available video description datasets lack the visual diversity and language intricacies required to generate human-analogous captions. In particular, for dense video captioning, task-specific dataset creation for improved performance is indispensable. Since the acquisition of high-quality annotations is costly, as an alternative to passive learning (training on a massive labeled dataset), active learning (attempts to maximize a model’s performance while annotating the fewest samples possible) can be explored.

We hope this paper will not only facilitate better understanding of video description techniques, but will also accommodate scientists in future research and developments in this specific area.

Aafaq N, Akhtar N, Liu W, Mian A (2019a) Empirical autopsy of deep video captioning frameworks. arXiv:1911.09345

Aafaq N, Akhtar N, Liu W, Mian A (2019b) Empirical autopsy of deep video captioning frameworks. arXiv:1911.09345

Aafaq N, Mian A, Liu W, Gilani SZ, Sha M (2019c) Video description: a survey of methods, datasets, and evaluation metrics 52(6). https://doi.org/10.1145/3355390

Aafaq N, Mian AS, Akhtar N, Liu W, Shah M (2022) Dense video captioning with early linguistic information fusion. IEEE Trans Multimedia. https://doi.org/10.1109/TMM.2022.3146005

Agyeman R, Rafiq M, Shin HK, Rinner B, Choi GS (2021) Optimizing spatiotemporal feature learning in 3D convolutional neural networks with pooling blocks. IEEE Access 9:70797–70805. https://doi.org/10.1109/access.2021.3078295

Al-Rfou R, Choe D, Constant N, Guo M, Jones L (2019) Character-level language modeling with deeper self-attention. Proc AAAI Conf Artif Intell 33 , 3159–3166. https://doi.org/10.1609/aaai.v33i01.33013159 arxiv.org/abs/1808.04444

Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions 8(1). https://doi.org/10.1186/s40537-021-00444-8

Amaresh M, Chitrakala S (2019) Video captioning using deep learning: an overview of methods, datasets and metrics. Proceedings of the 2019 IEEE international conference on communication and signal processing, ICCSP 2019 (pp. 656–661). https://doi.org/10.1109/ICCSP.2019.8698097

Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. Proc IEEE Int Conf Comput Vis. https://doi.org/10.1109/ICCV.2015.279

Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C (2021) ViViT: a video vision transformer. Proceedings of the IEEE international conference on computer vision, 6816–6826. https://doi.org/10.1109/ICCV48922.2021.00676 arXiv:2103.15691

Babariya RJ, Tamaki T (2020) Meaning guided video captioning. In: Pattern Recognition: 5th Asian Conference, ACPR 2019, Auckland, New Zealand, November 26–29, 2019, Revised Selected Papers, Part II 5, pp 478–488. Springer International Publishing

Bahdanau D, Cho KH, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015 -Conference Track Proceedings, 1–15. arXiv:1409.0473

Barbu A, Bridge A, Burchill Z, Coroian D, Dickinson S, Fidler S, Zhang Z (2012) Video in sentences out. Uncertainty Artif Intell–Proc 28th Conf–UAI 2012:102–112 arXiv:1204.2742

Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. ACM Int Conf Proc Ser. https://doi.org/10.1145/1553374.1553380

Bhatt S, Patwa F, Sandhu R (2017) Natural language processing (almost) from scratch. Proc IEEE 3rd Int Conf Collaboration Internet Comput CIC 2017 2017:328–338. https://doi.org/10.1109/CIC.2017.00050

Bilkhu M, Wang S, Dobhal T (2019) Attention is all you need for videos: self-attention based video summarization using universal Transformers. arXiv:1906.02792

Bin Y, Yang Y, Shen F, Xie N, Shen HT, Li X (2019) Describing video with attention-based bidirectional LSTM. IEEE Trans Cybern 49(7):2631–2641. https://doi.org/10.1109/TCYB.2018.2831447

Blohm M, Jagfeld G, Sood E, Yu X, Vu NT (2018) Comparing attention-based convolutional and recurrent neural networks: success and limitations in machine reading comprehension. CoNLL 2018–22nd Conference on Computational Natural Language Learning, Proceedings, 108–118. https://doi.org/10.18653/v1/k18-1011 arXiv:1808.08744

Brox T, Papenberg N, Weickert J (2014) High accuracy optical flow estimation based on warping–presentation. Lecture Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3024(May):25–36

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Chen DL, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. Aclhlt 2011–Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies 1 (pp. 190–200)

Chen DZ, Gholami A, Niesner M, Chang AX (2021) Scan2Cap: context-aware dense captioning in RGB-D scans. 3192–3202. https://doi.org/10.1109/cvpr46437.2021.00321 arXiv:2012.02206

Chen H, Li J, Hu X (2020) Delving deeper into the decoder for video captioning. arXiv:2001.05614

Chen H, Lin K, Maye A, Li J, Hu X (2019a) A semantics-assisted video captioning model trained with scheduled sampling. https://zhuanzhi.ai/paper/f88d29f09d1a56a1b1cf719dfc55ea61 arXiv:1909.00121

Chen J, Pan Y, Li Y, Yao T, Chao H, Mei T (2019b) Temporal deformable convolutional encoder–decoder networks for video captioning. Proc AAAI Conf Artif Intell 33 , 8167–8174. https://doi.org/10.1609/aaai.v33i01.33018167 arXiv:1905.01077

Chen M, Li Y, Zhang Z, Huang S (2018) TVT: two-view transformer network for video captioning. Proc Mach Learn Res 95(1997):847–862


Acknowledgements

This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant NRF-2019R1A2C1006159 and Grant NRF-2021R1A6A1A03039493, and in part by the 2022 Yeungnam University Research Grant.

Author information

Authors and affiliations

Department of Information & Communication Engineering, Yeungnam University, Gyeongsan-si, 38541, South Korea

Ghazala Rafiq & Gyu Sang Choi

Department of Game & Mobile Engineering, Keimyung University, 1095 Dalgubeol-daero, Dalseo-gu, Daegu, 42601, South Korea

Muhammad Rafiq


Corresponding authors

Correspondence to Muhammad Rafiq or Gyu Sang Choi.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Rafiq, G., Rafiq, M. & Choi, G.S. Video description: A comprehensive survey of deep learning approaches. Artif Intell Rev 56 , 13293–13372 (2023). https://doi.org/10.1007/s10462-023-10414-6

Download citation

Published : 11 April 2023

Issue Date : November 2023

DOI : https://doi.org/10.1007/s10462-023-10414-6

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Deep learning
  • Encoder–Decoder architecture
  • Text description
  • Video captioning techniques
  • Video description approaches
  • Video captioning
  • Vision to text
  • Find a journal
  • Publish with us
  • Track your research
