The Seattle Report on Database Research


Key Insights

  • The data science and database research communities must work together closely to enable the data-to-insights pipeline.
  • Data governance is an increasingly important societal challenge in today's data-rich world.
  • Architectures for cloud data services need rethinking to take into consideration hardware trends, disaggregation, and new consumption models.

From the inception of the field, academic database research has strongly influenced the database industry and vice versa. The database community, both research and industry, has grown substantially over the years. The relational database market alone has revenue upwards of $50B. On the academic front, database researchers continue to be recognized with significant awards. With Michael Stonebraker's Turing Award in 2014, the community can now boast of four Turing Awards and three ACM Systems Software Awards.

Over the last decade, our research community pioneered the use of columnar storage, which is now used in all commercial data analytic platforms. Database systems offered as cloud services have witnessed explosive growth. Hybrid transactional/analytical processing (HTAP) systems are now an important segment of the industry. Furthermore, memory-optimized data structures, modern compilation, and code generation have significantly enhanced the performance of traditional database engines. All data platforms have embraced SQL-style APIs as the predominant way to query and retrieve data. Database researchers have played an important part in influencing the evolution of streaming data platforms as well as distributed key-value stores. A new generation of data cleaning and data wrangling technology is being actively explored.

These achievements demonstrate that our community is strong. Yet, in technology, the only constant is change. Today’s society is a data-driven one, where decisions are increasingly based on insights from data analysis. This societal transformation places us squarely in the center of technology disruptions. It has caused the field to become broader and exposed many new challenges and opportunities for data management research.

In the fall of 2018, the authors of this report met in Seattle to identify especially promising research directions for our field. There is a long tradition of such meetings, which have been held every five years since 1988 [1, 3, 4, 7, 8, 11, 12, 13]. This report summarizes findings from the Seattle meeting [2, 9] and subsequent discussions, including panels at ACM SIGMOD 2020 [6] and VLDB 2020 [5]. We begin by reviewing the key technology trends that impact our field the most. The central part of the report covers research themes and specific examples of research challenges that meeting participants believe are important for database researchers to pursue and where their unique technical expertise is especially relevant, such as cleaning and transforming data to support data science pipelines, and disaggregated engine architectures to support multitenant cloud data services. We close by discussing steps the community can take for impact beyond solving technical research challenges.

Unlike database conference proceedings such as ACM SIGMOD and VLDB, this report does not attempt to provide a comprehensive summary of the wide breadth of technical challenges being pursued by database researchers or the many innovations introduced by the industry, for example, confidential computing, cloud security, blockchain technology, or graph databases.

The last report identified big data as our field's central challenge [1]. However, in the last five years, the transformation has accelerated well beyond our projections, in part due to technological breakthroughs in machine learning (ML) and artificial intelligence (AI). The barrier to writing ML-based applications has been sharply lowered by widely available programming frameworks such as TensorFlow and PyTorch, by architectural innovations in neural networks leading to BERT and GPT-3, and by specialized hardware in private and public clouds. The database community has a lot to offer ML users given our expertise in data discovery, versioning, cleaning, and integration; these technologies are critical for machine learning to derive meaningful insights from data. Given that most of the valuable data assets of enterprises are governed by database systems, it has become imperative to explore how SQL querying functionality can be seamlessly integrated with ML. The community is also actively pursuing how ML can be leveraged to improve the database platform itself.

A related development has been the rise of data science as a discipline that combines elements of data cleaning and transformation, statistical analysis, data visualization, and ML techniques. Today's world of data science is quite different from the previous generation of statistical and data integration tools. Notebooks have become by far the most popular interactive environment. Our expertise in declarative query languages can enrich the world of data science by making it more accessible to domain experts, especially those without a traditional computer science background.

As personal data is increasingly valuable for customizing the behavior of applications, society has become more concerned about the state of data governance as well as ethical and fair use of data. This concern impacts all fields of computer science but is especially important for data platforms, which must enforce such policies as custodians of data. Data governance has also led to the rise of confidential cloud computing, whose goal is to enable customers to leverage the cloud for computation while keeping their data encrypted in the cloud.

Usage of managed cloud data systems, in contrast to simply using virtual machines in the cloud, has grown tremendously since our last report observed that "cloud computing has become mainstream" [2]. The industry now offers on-demand resources that provide extremely flexible elasticity, popularly referred to as serverless. For cloud analytics, the industry has converged on a data lake architecture, which uses on-demand elastic compute services to analyze data stored in cloud storage. The elastic compute could be extract-transform-load (ETL) jobs on a big data system such as Apache Spark, a traditional SQL data warehousing query engine, or an ML workflow; it operates on cloud storage with the network in between. This architecture disaggregates compute and storage, enabling each to scale independently. These changes have profound implications for how we design future data systems.

Industrial Internet-of-Things (IoT), focusing on domains such as manufacturing, retail, and healthcare, has greatly accelerated in the last five years, aided by cheaper sensors, versatile connectivity, cloud data services, and data analytics infrastructure. IoT has further stress-tested our ability to process data efficiently at the edge, ingest data quickly from edge devices into cloud data infrastructure, and support data analytics with minimal delay for real-time scenarios such as monitoring.

Finally, there are significant changes in hardware. With the end of Dennard scaling [10] and the rise of compute-intensive workloads such as deep neural networks (DNNs), a new generation of powerful accelerators leveraging FPGAs, GPUs, and ASICs is now available. The memory hierarchy continues to evolve with the advent of faster SSDs and low-latency NVRAM. Improvements in network bandwidth and latency have been remarkable. These developments point to the need to rethink the hardware-software co-design of the next generation of database engines.

The changes noted here present new research opportunities, and while we have made progress on key challenges identified in the last report [2], many of those problems demand more research. Here, we summarize these two sets of research challenges, organized into four subsections. The first addresses data science, where our community can play a major role. The next focuses on data governance. The last two cover cloud data services and the closely related topic of database engines. Advances in ML have influenced the database community's research agenda across the board, while industrial IoT and hardware innovations have influenced cloud architectures and database engines. Thus, ML, IoT, and hardware are cross-cutting themes that feature in multiple places in the rest of this section.

Data science. The NSF CISE Advisory Council defines data science as "the processes and systems that enable the extraction of knowledge or insights from data in various forms, either structured or unstructured." Over the past decade, it has emerged as a major interdisciplinary field, and its use drives important decisions in enterprises and discoveries in science.

From a technical standpoint, data science is about the pipeline from raw input data to insights that requires use of data cleaning and transformation, data analytic techniques, and data visualization. In enterprise database systems, there are well-developed tools to move data from OLTP databases to data warehouses and to extract insights from their curated data warehouses by using complex SQL queries, online analytical processing (OLAP), data mining techniques, and statistical software suites. Although many of the challenges in data science are closely related to problems that arise in enterprise data systems, modern data scientists work in a different environment. They heavily use Data Science Notebooks, such as Jupyter, Spark, and Zeppelin, despite their weaknesses in versioning, IDE integration, and support for asynchronous tasks. Data scientists rely on a rich ecosystem of open source libraries such as Pandas for sophisticated analysis, including the latest ML frameworks. They also work with data lakes that hold datasets with varying levels of data quality—a significant departure from carefully curated data warehouses. These characteristics have created new requirements for the database community to address, in collaboration with the researchers and engineers in machine learning, statistics, and data visualization.

Data to insights pipeline. Data science pipelines are often complex with several stages, each with many participants. One team prepares the data, sourced from heterogeneous data sources in data lakes. Another team builds models on the data. Finally, end users access the data and models through interactive dashboards. The database community needs to develop simple and efficient tools that support building and maintaining data pipelines. Data scientists repeatedly say that data cleaning, integration, and transformation together consume 80%-90% of their time. These are problems the database community has experienced in the context of enterprise data for decades. However, much of our past efforts focused on solving algorithmic challenges for important “point problems,” such as schema mapping and entity resolution. Moving forward, we must adapt our community’s expertise in data cleaning, integration, and transformation to aid the iterative end-to-end development of the data-to-insights pipeline.
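As a toy illustration of these stages, the sketch below walks one record set from raw input to an aggregate "insight." The records, field names, and stage names are hypothetical; real pipelines would use frameworks such as Spark or Pandas, but plain Python keeps the stages visible.

```python
# A minimal data-to-insights pipeline sketch: clean -> aggregate.
# All data and names here are illustrative only.

raw = [
    {"region": "West", "sales": "1200"},
    {"region": "West", "sales": None},     # missing measure: dropped in cleaning
    {"region": " east ", "sales": "800"},  # inconsistent formatting: normalized
]

def clean(records):
    """Drop rows with missing measures; normalize categorical values and types."""
    return [
        {"region": r["region"].strip().lower(), "sales": int(r["sales"])}
        for r in records
        if r["sales"] is not None
    ]

def aggregate(records):
    """Derive a simple 'insight': total sales per region."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]
    return totals

insights = aggregate(clean(raw))
print(insights)  # {'west': 1200, 'east': 800}
```

Even in this toy form, the cleaning stage dominates the code, mirroring the 80%-90% figure data scientists report for real pipelines.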

Data context and provenance. Unlike applications built atop curated data warehouses, today’s data scientists tap into data sources of varying quality for which correctness, completeness, freshness, and trustworthiness of data cannot be taken for granted. Data scientists need to understand and assess these properties of their data and to reason about their impact on the results of their data analysis. This requires understanding the context of the incoming data and the processes working on it. This is a data provenance problem, which is an active area of research for the database community. It involves tracking data, as it moves across repositories, integrating and analyzing the metadata as well as the data content. Beyond explaining results, data provenance enables reproducibility, which is key to data science, but is difficult, especially when data has a limited retention policy. Our community has made progress, but much more needs to be done to develop scalable techniques for data provenance.
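A coarse-grained version of this bookkeeping can be sketched as follows. The class and field names are hypothetical, and real provenance systems track far richer metadata (timestamps, code versions, parameters), but the core idea is that every derived dataset carries a record of the operation and inputs that produced it.

```python
# Sketch: each derived dataset records its producing operation and inputs,
# so any result can be traced back to its original sources.

class Dataset:
    def __init__(self, name, rows, provenance=None):
        self.name = name
        self.rows = rows
        # Leaf datasets are their own provenance root.
        self.provenance = provenance or {"source": name, "derived_from": []}

def derive(name, op_name, op, *inputs):
    """Apply op to the inputs and attach a provenance record to the result."""
    rows = op(*[d.rows for d in inputs])
    return Dataset(name, rows, {
        "source": name,
        "operation": op_name,
        "derived_from": [d.provenance for d in inputs],
    })

raw = Dataset("sensor_feed", [3, 1, None, 2])
cleaned = derive("cleaned", "drop_nulls",
                 lambda rs: [r for r in rs if r is not None], raw)
sorted_ds = derive("sorted", "sort", sorted, cleaned)
```

Walking `sorted_ds.provenance` back through its `derived_from` chain reaches `sensor_feed`, which is exactly the lineage question a provenance system must answer at scale.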

Data exploration at scale. As the volume and variety of data continue to increase, our community must develop more effective techniques for discovery, search, understanding, and summarization of data distributed across multiple repositories. For example, for a given dataset, a user might want to search for public and enterprise-specific structured data that is joinable, after suitable transformations, with this dataset. The joined data may then provide additional context and enrichment for the original dataset. Furthermore, users need systems that support interactive exploratory analyses that can scale to large datasets, since high latency reduces the rate at which users can make observations, draw generalizations, and generate hypotheses. To support these requirements, the system stack for data exploration needs to be further optimized using both algorithmic and systems techniques. Specifically, data profiling, which provides a statistical characterization of data, must be efficient and scale to large data repositories, and it should be able to generate approximate profiles for large datasets at low latency to support interactive data discovery. To get a data scientist from a large volume of raw data to insights through data transformation and analysis, low-latency and scalable data visualization techniques are also needed. Scalable data exploration is also key to addressing challenges that arise in data lakes (see "Database Engines").
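As a sketch of the kind of summary a profiler computes, the one-pass function below gathers basic column statistics. The values are synthetic, and production profilers would run over samples and add approximate structures (for example, HyperLogLog sketches for distinct counts) to meet the low-latency requirement on very large columns.

```python
# Sketch of a one-pass column profile: basic statistics that support
# interactive data discovery. Input values are synthetic.

def profile(values):
    stats = {"count": 0, "nulls": 0, "min": None, "max": None, "distinct": set()}
    for v in values:
        stats["count"] += 1
        if v is None:
            stats["nulls"] += 1          # track missing values separately
            continue
        stats["min"] = v if stats["min"] is None else min(stats["min"], v)
        stats["max"] = v if stats["max"] is None else max(stats["max"], v)
        stats["distinct"].add(v)
    stats["distinct"] = len(stats["distinct"])  # exact here; sketched at scale
    return stats

stats_example = profile([3, None, 1, 3])
```

Such a profile is cheap relative to the dataset size, which is what makes profiling a plausible building block for interactive, repository-wide exploration.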

Declarative programming. Even though popular data science libraries such as Pandas support a tabular view of data using the DataFrame abstraction, their programming paradigms differ in important ways from SQL. The success of declarative query languages in boosting programmer productivity in relational databases as well as big data systems points to an opportunity to investigate language abstractions that bring the full power of declarative programming to all stages of the data-to-insights pipeline, including data discovery, data preparation, and ML model training and inference.
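The contrast can be made concrete with the same aggregation written both ways. The imperative version spells out *how* to group; the declarative SQL version states *what* is wanted and leaves the strategy to the engine. The table and data below are illustrative, using Python's standard-library SQLite binding.

```python
import sqlite3

# The same analysis, imperatively (hand-written loop) and declaratively (SQL).
orders = [("alice", 30), ("bob", 20), ("alice", 25)]

# Imperative: the programmer chooses the grouping strategy.
totals = {}
for customer, amount in orders:
    totals[customer] = totals.get(customer, 0) + amount

# Declarative: the query states the result; the engine picks the plan.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount INT)")
con.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_totals = dict(con.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))

assert totals == sql_totals  # both compute {'alice': 55, 'bob': 20}
```

Extending this "state the what" style beyond querying, to discovery, preparation, and model training, is the opportunity the paragraph above describes.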

Metadata management. Our community can advance the state of the art in tracking and managing metadata related to data science experiments and ML models. This includes automated labeling and annotation of data, such as identification of data types. Metadata annotations as well as provenance need to be searchable to support experimentation with different models and model versioning. Data provenance could also help determine when to retrain models. Another metadata challenge is minimizing the cost of modifying applications as a schema evolves, an old problem where better solutions continue to be needed: the existing academic solutions to schema evolution are hardly used in practice.

Data governance. Consumers and enterprises are generating data at an unprecedented rate. Our homes have smart devices, our medical records are digitized, and social media content is publicly available. All data producers (consumers and enterprises) have an interest in constraining how their data is used by applications while maximizing its utility, including controlled sharing of data. For instance, a set of users might allow the use of their personal health records for medical research, but not for military applications. Data governance is a suite of technologies that supports such specifications and their enforcement. We now discuss three key facets of data governance that participants in the Seattle meeting thought deserve more attention. Much like data science, the database community needs to work with other communities that share an interest in these important concerns to bring about transformative changes.

Data use policy. The European Union's General Data Protection Regulation (GDPR) is a prime example of such a directive. To implement GDPR and similar data use policies, metadata annotations and provenance must accompany data items as data is shared, moved, or copied. Another essential element of data governance is auditing to ensure data is used by the right people for the right purpose, per the data use policy. Since data volumes continue to rise sharply, scalability of such auditing techniques is critically important. Much work is also needed to develop a framework for data collection, retention, and disposal that supports policy constraints and enables research on the trade-off between the utility of data and limits on data gathering. Such a framework can also help answer when data may be safely discarded given a set of data usage goals.
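In its simplest form, purpose-based policy enforcement attaches allowed purposes to each dataset as metadata and checks every access against the purpose it declares. The vocabulary and dataset names below are illustrative only; a real system would also handle policy propagation as data is copied and transformed, which is where the provenance machinery above comes in.

```python
# Sketch: datasets carry machine-readable use policies; each access declares
# a purpose and is checked against them. Names are illustrative.

POLICIES = {
    # dataset -> set of purposes its producers have consented to
    "health_records": {"medical_research"},
    "purchase_history": {"recommendations", "fraud_detection"},
}

def may_use(dataset, purpose):
    """Allow access only for purposes the data producers consented to."""
    return purpose in POLICIES.get(dataset, set())

allowed = may_use("health_records", "medical_research")  # permitted use
denied = may_use("health_records", "military")           # forbidden use
```

The hard research problems are not this check itself but keeping such annotations attached and consistent as data moves at scale, and auditing that every access actually went through it.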

Data privacy. A very important pillar of data governance is data privacy. In addition to cryptographic techniques for keeping data private, data privacy includes the challenge of ensuring that aggregation and other analytic techniques can be applied effectively to a dataset without revealing information about any individual member of the dataset. Although models such as differential privacy and local differential privacy address these challenges, more work is needed to understand how best to take advantage of these models in database platforms without significantly restricting the class of query expressions. Likewise, efficient multiparty computation that enables data sharing across organizations without sacrificing privacy is an important challenge.
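For intuition, here is a minimal sketch of the Laplace mechanism, the textbook way to answer a count query with differential privacy. The dataset and epsilon are illustrative; production systems must additionally manage a privacy budget across many queries.

```python
import math
import random

# Sketch of the Laplace mechanism: add noise with scale sensitivity/epsilon
# to a count, so no single individual's presence is revealed. A count query
# has sensitivity 1 (one person changes it by at most 1).

def private_count(records, predicate, epsilon, rng):
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon  # sensitivity 1 divided by the privacy parameter
    # Sample Laplace(0, scale) noise by inverting its CDF on a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

ages = [23, 45, 67, 34, 52, 41, 29, 58]
estimate = private_count(ages, lambda a: a > 40, epsilon=0.5,
                         rng=random.Random(42))
```

Smaller epsilon means stronger privacy but noisier answers; the open database question is how to offer this trade-off over rich SQL, not just single counts.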

Ethical data science. Challenges in countering bias and discrimination in leveraging data science techniques, especially for ML, have gained traction in research and practice. The bias often comes from the input data itself such as when insufficiently representative data is used to train models. We need to work with other research communities to help mitigate this challenge. Responsible data management has emerged recently as a new research direction for the community and contributes to the interdisciplinary research in the broader area of Fairness, Accountability, Transparency, and Ethics (FATE).

Cloud services. The movement of workloads to the cloud has led to explosive growth for cloud database services, which in turn has led to substantial innovation as well as new research challenges, some of which are discussed below.

Serverless data services. In contrast to Infrastructure-as-a-Service (IaaS), which is akin to renting servers, serverless cloud database services support a consumption model with usage-based pricing and on-demand auto-scaling of compute and storage resources. Although the first generation of serverless cloud database services is already available and increasingly popular, research innovations are needed to solve some of the fundamental challenges of this consumption model. Specifically, in serverless data services, users pay not only for the resources they consume but also for how quickly those resources can be allocated to their workloads. However, today's cloud database systems do not tell users how quickly they will be able to auto-scale (up and down); in other words, there is a lack of transparency on the service-level agreement (SLA) that captures the trade-off between the cost of and the delay in autoscaling resources. Conversely, the architectural changes in cloud data services that will best address the requirements of autoscaling and pay-as-you-go need to be understood from the ground up. The first example of a serverless pay-as-you-go approach available today is the Function-as-a-Service (FaaS) model. The database community has made significant contributions toward developing the next generation of serverless data services, and this remains an active research area.

Disaggregation. Commodity hardware used by cloud services is subject to hardware and software failures. Cloud data services therefore treat directly attached storage as ephemeral and instead rely on cloud storage services that support durability, scalability, and high availability. The disaggregation of storage and compute also provides the ability to scale each independently. However, to ensure low latency, such disaggregated architectures must use caching across multiple levels of the memory hierarchy inexpensively, and they can benefit from limited compute within the storage service to reduce data movement (see "Database Engines"). Database researchers need to develop principled solutions for OLTP and analytics workloads that suit a disaggregated architecture. Finally, disaggregating memory from compute is a problem that is still wide open; it would allow compute and memory to scale independently and make more efficient use of memory among compute nodes.

Multitenancy. The cloud offers an opportunity to rethink databases in a world with an abundance of resources that can be pooled together. However, it is critical to efficiently support multitenancy and do careful capacity management to control costs and optimize utilization. The research community can lead by rethinking the resource-management aspects of database systems in light of multitenancy. The range of required innovation here spans reimagining database systems as composite microservices, developing mechanisms for agile response when demand causes local resource-pressure spikes, and dynamically reorganizing resources among active tenants, all while isolating tenants from noisy neighbors.

Edge and cloud. IoT has resulted in a skyrocketing number of computing devices connected to the cloud, in some cases only intermittently. The limited capabilities of these devices, the diverse characteristics of their connectivity (for example, often disconnected, limited bandwidth for offshore devices, or ample bandwidth for 5G-connected devices), and their data profiles will lead to new optimization challenges for distributed data processing and analytics.

Hybrid cloud and multi-cloud. There is a pressing need to identify architectural approaches that enable on-premises data infrastructure and cloud systems to take advantage of each other instead of relying on "cloud only" or "on-premises only." In an ideal world, on-premises data platforms would seamlessly draw upon compute and storage resources available in the cloud on demand. We are far from that vision today, even though a single control plane for data split across on-premises and cloud systems is beginning to emerge. The need to take advantage of services available on only one cloud, to avoid being locked into the "walled garden" of a single infrastructure cloud, and to increase resilience to failures has led enterprise customers to spread their data estate across multiple public clouds. Recently, we have seen the emergence of data clouds: multi-cloud data services that not only support movement of data across infrastructure clouds but also operate over data split across them. Understanding the novel optimization challenges, and selectively leveraging past research on heterogeneous and federated databases, deserves our attention.

Auto-tuning. While auto-tuning has always been desirable, it has become critically important for cloud data services. Studies of cloud workloads indicate that many cloud database applications do not use appropriate configuration settings, schema designs, or access structures. Furthermore, as discussed earlier, cloud databases need to support a diverse set of time-varying multitenant workloads, and no single configuration or resource allocation works well universally. A predictive model that helps guide configuration settings and resource reallocation is desirable. Fortunately, telemetry logs are plentiful for cloud services and present a great opportunity to improve auto-tuning functionality through advanced analytics. However, since the cloud provider is not allowed access to the tenant's data objects, such telemetry analysis must be done in an "eyes off" mode, that is, inside the tenant's compliance boundary. Last but not least, cloud services provide a unique opportunity to experiment with changes to data services and measure their effectiveness, much as Internet search engines leveraged query logs and experimented with changes to ranking algorithms.

SaaS cloud database applications. All tenants of Software-as-a-Service (SaaS) database applications share the same application code and have approximately (or exactly) the same database schema but no shared data. For cost effectiveness, such SaaS database applications must be multitenant. One way to support them is to have all tenants share one database instance, with the logic for multitenancy pushed into the application stack. While this is simple to support from a database platform perspective, it makes customization (for example, schema evolution), query optimization, and resource sharing among tenants harder. The other extreme is to spawn a separate database instance for each tenant. While this approach is flexible and offers isolation from other tenants, it fails to take advantage of the commonality among tenants and thus may incur higher cost. Yet another approach is to pack tenants into shards, with large tenants placed in shards of their own. Although these architectural alternatives are known, principled trade-offs among them, as well as additional support at the database services layer that may benefit SaaS database applications, deserve in-depth study.
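The shard-based middle ground can be sketched as a simple packing problem. Capacity units and tenant sizes below are illustrative, and real placement must also weigh load patterns, geography, and isolation requirements rather than size alone.

```python
# Sketch: pack tenants into shards of fixed capacity. Large tenants get
# dedicated shards; small tenants are packed together (first-fit on a
# descending-size order). Sizes and capacity are in illustrative units.

def pack_tenants(tenant_sizes, shard_capacity):
    shards, current, used = [], [], 0
    for tenant, size in sorted(tenant_sizes.items(), key=lambda kv: -kv[1]):
        if size >= shard_capacity:
            shards.append([tenant])          # large tenant: shard of its own
        elif used + size <= shard_capacity:
            current.append(tenant)           # fits in the open shared shard
            used += size
        else:
            shards.append(current)           # close the shard, open a new one
            current, used = [tenant], size
    if current:
        shards.append(current)
    return shards

layout = pack_tenants({"a": 120, "b": 40, "c": 30, "d": 50}, shard_capacity=100)
```

Here tenant `a` exceeds a shard's capacity and is isolated, while the small tenants share shards; the open research question is how to make such trade-offs principled under changing tenant workloads.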


Database engines. Cloud platforms and hardware innovations are leading to the exploration of new architectures for database systems. We now discuss some of the key themes that have emerged for research on database engines:

Heterogeneous computation. We see an inevitable trend toward heterogeneous computation with the end of Dennard scaling and the advent of new accelerators for offloading compute. GPUs and FPGAs are available today, with the software stack for GPUs much better developed than that for FPGAs. Progress in networking technology, including the adoption of RDMA, is also receiving the attention of the database community; these developments offer database engines the opportunity to take advantage of stack bypass. The memory and storage hierarchy is more heterogeneous than ever before. The advent of high-speed SSDs has altered the traditional trade-offs between in-memory and disk-based database engines, and engines built on the new generation of SSDs are destined to erode some of the key benefits of in-memory systems. Furthermore, the availability of NVRAM may have significant impact on database engines, given its support for persistence and low latency. Re-architecting database engines with the right abstractions to explore hardware-software co-design in this changed landscape, including disaggregation in the cloud context, has great potential.

Distributed transactions. Cloud data management systems are increasingly geo-distributed both within a region (across multiple availability zones) and across multiple geographic regions. This has renewed interest in industry and academia on the challenges of processing distributed transactions. The increased complexity and variability of failure scenarios, combined with increased communication latency and performance variability in distributed architectures has resulted in a wide array of trade-offs between consistency, isolation level, availability, latency, throughput under contention, elasticity, and scalability. There is an ongoing debate between two schools of thought: (a) Distributed transactions are hard to process at scale with high throughput and availability and low latency without giving up some traditional transactional guarantees. Therefore, consistency and isolation guarantees are reduced at the expense of increased developer complexity. (b) The complexity of implementing a bug-free application is extremely high unless the system guarantees strong consistency and isolation. Therefore, the system should offer the best throughput, availability, and low-latency service it can, without sacrificing correctness guarantees. This debate will likely not be fully resolved anytime soon, and industry will offer systems consistent with each school of thought. However, it is critical that application bugs and limitations in practice that result from weaker system guarantees be better identified and quantified, and tools be built to help application developers using both types of system achieve their correctness and performance goals.

Data lakes. There is an increasing need to consume data from a variety of sources, structured, semi-structured, and unstructured, and to transform and analyze it flexibly. This has led to a transition from the classical data warehouse to a data lake architecture for analytics. In the traditional setting, data is ingested into an OLTP store and then swept into a curated data warehouse through an ETL process, perhaps powered by a big data framework such as Apache Spark. The data lake, in contrast, is a flexible storage repository: a variety of compute engines can operate on data of varying quality, curate it or execute complex SQL queries over it, and store the results back in the data lake or ingest them into an operational system. Thus, data lakes exemplify a disaggregated architecture with the separation of compute and storage. An important challenge for data lakes is finding relevant data for a given task efficiently; solutions to the open problems in scalable data exploration and metadata management, discussed in the data science section, are therefore important here. While the flexibility of data lakes is attractive, it is vital that the guard rails of data governance are firmly adhered to, and we refer the reader to that section of the report for details. To ensure consistency and high data quality, so that analytics results are as accurate as possible, support for transactions, enforcement of schema constraints, and data validation are central concerns. Enabling scalable querying over heterogeneous collections of data demands caching solutions that trade off performance, scale, and cost.


Approximation in query answering. As the volume of data continues to explode, we must seek techniques that reduce latency or increase throughput of query processing. For example, leveraging approximation for fast progressive visualization of answers to queries over data lakes can help exploratory data analysis to unlock insights in data. Data sketches are already mainstream and are classic examples of effective approximations. Sampling is another tool used to reduce the cost of query processing. However, support for sampling in today’s big data systems is quite limited and does not cater to the richness of query languages such as SQL. Our community has done much foundational work in approximate query processing, but we need a better way to expose it in a programmer-friendly manner with clear semantics.
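A sampling-based estimator of the kind described above can be sketched as follows: estimate a SUM from a uniform sample and report a rough confidence interval. The data is synthetic, and real engines must additionally push sampling into storage and handle joins, group-bys, and skew, which is where most of the open problems lie.

```python
import random
import statistics

# Sketch: approximate SUM(values) from a uniform sample, with a rough 95%
# confidence half-width based on the central limit theorem.

def approximate_sum(values, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    scale = len(values) / sample_size       # scale the sample to the population
    estimate = scale * sum(sample)
    # Std. error of the scaled sum is N * stdev(sample) / sqrt(n).
    stderr = len(values) * statistics.stdev(sample) / (sample_size ** 0.5)
    return estimate, 1.96 * stderr          # estimate and ~95% half-width

values = list(range(100_000))
estimate, half_width = approximate_sum(values, sample_size=1_000)
```

Returning the half-width alongside the estimate is the "clear semantics" the paragraph calls for: the user sees not just a fast answer but how much to trust it.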

Machine learning workloads. Modern data management workloads include ML, which adds an important new requirement for database engines. While ML workloads include both training and inferencing, supporting the latter efficiently is an immediate need. Today, "in-database" inferencing is typically supported through database extensibility mechanisms. Looking forward, the ML models invoked during inferencing must be treated as first-class citizens inside databases: models should be browsable and queryable as database objects, and database systems need to support popular ML programming frameworks. While today's database systems can support inferencing over relatively simple models, the increasing popularity and effectiveness of extremely large models such as BERT and GPT-3 requires database engine developers to leverage heterogeneous hardware and to work with the architects responsible for building ML infrastructure using FPGAs, GPUs, and specialized ASICs.

Machine learning for reimagining data platform components. Recent advances in ML have inspired our community to reflect on how data engine components could use ML to significantly advance the state of the art. The most obvious opportunity is auto-tuning: database systems can systematically replace "magic numbers" and thresholds with ML models that tune system configurations automatically. The availability of ample training data also creates opportunities to explore ML-based approaches to query optimization and multidimensional index structures, especially as state-of-the-art solutions to these problems have seen only modest improvements in the last two decades. ML-model-driven engine components must demonstrate significant benefits as well as robustness when test data or test queries deviate from the training data and training queries; to handle such deviations, the ML models need guardrails so that the system degrades gracefully. Furthermore, a well-thought-out software engineering pipeline to support the life cycle of an ML-model-driven component will be important.
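As a deliberately simplified illustration of both the opportunity and the guardrail requirement, the sketch below replaces a tree-style lookup with a linear model that predicts a key's position in a sorted array. The maximum error observed at build time bounds a local search window, so lookups remain correct even when the model mispredicts. This is a toy rendition of the learned-index idea, not a production design.

```python
import bisect

class LearnedIndexSketch:
    """Toy learned index: a linear model predicts a key's position;
    a bounded local search is the guardrail that keeps lookups correct."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Closed-form least-squares fit of position ~ a*key + b.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p)
                  for i, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys)
        self.a = cov / var if var else 0.0
        self.b = mean_p - self.a * mean_k
        # Guardrail: the worst prediction error bounds the search window.
        self.err = max(abs(self._predict(k) - i)
                       for i, k in enumerate(self.keys))

    def _predict(self, key):
        return min(max(int(self.a * key + self.b), 0), len(self.keys) - 1)

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = lo + bisect.bisect_left(self.keys[lo:hi], key)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

When the model fits the key distribution well, `err` is small and a lookup touches only a few positions; when it fits badly, the system degrades gracefully toward a plain binary search rather than returning wrong answers.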

Benchmarking and reproducibility. Benchmarks have tremendously helped move the database industry and the database research community forward. We now need benchmarks for new application scenarios and database engine architectures. Existing benchmarks (for example, TPC-E, TPC-DS, and TPC-H) are very useful but do not capture the full breadth of our field, for example, streaming scenarios and analytics on new types of data such as video. Moreover, without appropriate benchmarks and data sets, a fair comparison between traditional database architectures and ML-inspired modifications to engine components will not be feasible. Benchmarking in the cloud also presents unique challenges, since differences in infrastructure across cloud providers make apples-to-apples comparisons more difficult. A closely related issue is the reproducibility of performance results in database publications. Fortunately, since 2008, ACM SIGMOD and VLDB have encouraged reproducibility of the results in accepted papers. A focus on reproducibility also increases rigor in the selection of workloads, databases, and experimental parameters, and in how results are aggregated and reported.
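Rigor in how results are aggregated and reported can be baked into even the smallest experiment. The helper below is our own sketch, not a standard harness: it warms up, repeats the measurement, and reports the median together with the interquartile range, so that a single noisy run never becomes the headline number.

```python
import statistics
import time

def bench(fn, *, repeats=7, warmup=2):
    """Micro-benchmark sketch: warm up, repeat, report median and IQR."""
    for _ in range(warmup):        # warm caches, JITs, allocators
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    q1, _, q3 = statistics.quantiles(times, n=4)
    return {"median_s": statistics.median(times), "iqr_s": q3 - q1}
```

Publishing the spread alongside the central tendency, plus the exact workload and parameters, is a minimal step toward the reproducibility the conferences now encourage.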

In addition to technical challenges, the meeting participants discussed steps the community of database researchers can take to enhance our ability to contribute to and learn from the emerging data challenges.

We will continue the rich tradition of learning from the users of our systems and of using database conferences as meeting places for both users and system innovators. The industry tracks of our conferences foster such interaction by discussing industry challenges and innovations in practice. This is all the more important given today's rapidly changing data management challenges. We must redouble our efforts to learn from application developers and SaaS solution providers in industry verticals.

As our community develops new systems, releasing them as part of existing popular ecosystems of open source tools, or as easy-to-use cloud services, will greatly enhance our ability to receive feedback and iterate. Recent examples of systems that benefited from significant input from the database community include Apache Spark, Apache Flink, and Apache Kafka. In addition, as a community, we should take every opportunity to get closer to application developers and other users of database technology to learn about their unique data challenges.

The database community must do a better job of integrating database research with the data science ecosystem. Database techniques for data integration, data cleaning, data processing, and data visualization should be easy to call from Python scripts.
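For instance, a data-cleaning primitive such as near-duplicate detection should be a one-import call from a Python script. The toy routine below (the names and the threshold are illustrative; real systems use far stronger similarity measures and blocking strategies) greedily clusters near-duplicate strings by normalized edit similarity using only the standard library.

```python
import difflib

def dedupe(names, threshold=0.85):
    """Toy data-cleaning routine: greedily cluster near-duplicate
    strings by normalized edit similarity after light normalization."""
    clusters = []
    for name in names:
        key = name.strip().lower()
        for cluster in clusters:
            # Compare against the cluster's representative string.
            if difflib.SequenceMatcher(None, key,
                                       cluster[0]).ratio() >= threshold:
                cluster.append(key)
                break
        else:
            clusters.append([key])
    return clusters
```

Packaging such primitives so that they compose naturally with pandas DataFrames and other data science tools is exactly the kind of integration the community should pursue.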

We see many exciting research directions in today's data-driven world around data science, machine learning, data governance, new architectures for cloud systems, and next-generation data platforms. This report has summarized results from the Seattle Database meeting and subsequent community discussions,5,6 which identified a few of the important challenges and opportunities for the database community to continue its tradition of strong impact on research and industry. Supplementary materials from the meeting are available on the event website.9

Acknowledgments. The Seattle Database meeting was supported financially by donations from Google, Megagon Labs, and Microsoft Corp. Thanks to Yannis Ioannidis, Christian Konig, Vivek Narasayya, and the anonymous reviewers for their feedback on earlier drafts.


Copyright held by authors/owners. Request permission to (re)publish from the owner/author. The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.

Communications of the ACM, August 2022 Issue, Vol. 65, No. 8, Pages 72-79. Published: August 1, 2022.



Approximately every five years, a group of database researchers meets for a self-assessment of our community, including reflections on our impact on the industry as well as challenges facing our research community. This report summarizes the discussion and conclusions of the ninth such meeting, held October 9-10, 2018, in Seattle.

Participants

Daniel Abadi, Anastasia Ailamaki, David Andersen, Peter Bailis, Magdalena Balazinska, Philip Bernstein, Peter Boncz, Surajit Chaudhuri, Alvin Cheung, AnHai Doan, Luna Dong, Michael J. Franklin, Juliana Freire, Alon Halevy, Joseph M. Hellerstein, Stratos Idreos, Donald Kossmann, Tim Kraska, Sailesh Krishnamurthy, Volker Markl, Sergey Melnik, Tova Milo, C. Mohan, Thomas Neumann, Beng Chin Ooi, Fatma Ozcan, Jignesh Patel, Andrew Pavlo, Raluca Popa, Raghu Ramakrishnan, Christopher Ré, Michael Stonebraker, Dan Suciu, and 13 others.

DOI: 10.1145/3385658.3385668



A dataset for measuring the impact of research data and their curation

  • Libby Hemphill   ORCID: orcid.org/0000-0002-3793-7281 1 , 2 ,
  • Andrea Thomer 3 ,
  • Sara Lafia 1 ,
  • Lizhou Fan 2 ,
  • David Bleckley   ORCID: orcid.org/0000-0001-7715-4348 1 &
  • Elizabeth Moss 1  

Scientific Data volume  11 , Article number:  442 ( 2024 ) Cite this article

583 Accesses

8 Altmetric

Metrics details

  • Research data
  • Social sciences

Science funders, publishers, and data archives make decisions about how to responsibly allocate resources to maximize the reuse potential of research data. This paper introduces a dataset developed to measure the impact of archival and data curation decisions on data reuse. The dataset describes 10,605 social science research datasets, their curation histories, and reuse contexts in 94,755 publications that cover 59 years from 1963 to 2022. The dataset was constructed from study-level metadata, citing publications, and curation records available through the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. The dataset includes information about study-level attributes (e.g., PIs, funders, subject terms); usage statistics (e.g., downloads, citations); archiving decisions (e.g., curation activities, data transformations); and bibliometric attributes (e.g., journals, authors) for citing publications. This dataset provides information on factors that contribute to long-term data reuse, which can inform the design of effective evidence-based recommendations to support high-impact research data curation decisions.

Similar content being viewed by others

report on database research

SciSciNet: A large-scale open data lake for the science of science research

report on database research

Data, measurement and empirical methods in the science of science

report on database research

Interdisciplinarity revisited: evidence for research impact and dynamism

Background & summary.

Recent policy changes in funding agencies and academic journals have increased data sharing among researchers and between researchers and the public. Data sharing advances science and provides the transparency necessary for evaluating, replicating, and verifying results. However, many data-sharing policies do not explain what constitutes an appropriate dataset for archiving or how to determine the value of datasets to secondary users 1 , 2 , 3 . Questions about how to allocate data-sharing resources efficiently and responsibly have gone unanswered 4 , 5 , 6 . For instance, data-sharing policies recognize that not all data should be curated and preserved, but they do not articulate metrics or guidelines for determining what data are most worthy of investment.

Despite the potential for innovation and advancement that data sharing holds, the best strategies to prioritize datasets for preparation and archiving are often unclear. Some datasets are likely to have more downstream potential than others, and data curation policies and workflows should prioritize high-value data instead of being one-size-fits-all. Though prior research in library and information science has shown that the “analytic potential” of a dataset is key to its reuse value 7 , work is needed to implement conceptual data reuse frameworks 8 , 9 , 10 , 11 , 12 , 13 , 14 . In addition, publishers and data archives need guidance to develop metrics and evaluation strategies to assess the impact of datasets.

Several existing resources have been compiled to study the relationship between the reuse of scholarly products, such as datasets (Table  1 ); however, none of these resources include explicit information on how curation processes are applied to data to increase their value, maximize their accessibility, and ensure their long-term preservation. The CCex (Curation Costs Exchange) provides models of curation services along with cost-related datasets shared by contributors but does not make explicit connections between them or include reuse information 15 . Analyses on platforms such as DataCite 16 have focused on metadata completeness and record usage, but have not included related curation-level information. Analyses of GenBank 17 and FigShare 18 , 19 citation networks do not include curation information. Related studies of Github repository reuse 20 and Softcite software citation 21 reveal significant factors that impact the reuse of secondary research products but do not focus on research data. RD-Switchboard 22 and DSKG 23 are scholarly knowledge graphs linking research data to articles, patents, and grants, but largely omit social science research data and do not include curation-level factors. To our knowledge, other studies of curation work in organizations similar to ICPSR – such as GESIS 24 , Dataverse 25 , and DANS 26 – have not made their underlying data available for analysis.

This paper describes a dataset 27 compiled for the MICA project (Measuring the Impact of Curation Actions) led by investigators at ICPSR, a large social science data archive at the University of Michigan. The dataset was originally developed to study the impacts of data curation and archiving on data reuse. The MICA dataset has supported several previous publications investigating the intensity of data curation actions 28 , the relationship between data curation actions and data reuse 29 , and the structures of research communities in a data citation network 30 . Collectively, these studies help explain the return on various types of curatorial investments. The dataset that we introduce in this paper, which we refer to as the MICA dataset, has the potential to address research questions in the areas of science (e.g., knowledge production), library and information science (e.g., scholarly communication), and data archiving (e.g., reproducible workflows).

We constructed the MICA dataset 27 using records available at ICPSR, a large social science data archive at the University of Michigan. Data set creation involved: collecting and enriching metadata for articles indexed in the ICPSR Bibliography of Data-related Literature against the Dimensions AI bibliometric database; gathering usage statistics for studies from ICPSR’s administrative database; processing data curation work logs from ICPSR’s project tracking platform, Jira; and linking data in social science studies and series to citing analysis papers (Fig.  1 ).

figure 1

Steps to prepare MICA dataset for analysis - external sources are red, primary internal sources are blue, and internal linked sources are green.

Enrich paper metadata

The ICPSR Bibliography of Data-related Literature is a growing database of literature in which data from ICPSR studies have been used. Its creation was funded by the National Science Foundation (Award 9977984), and for the past 20 years it has been supported by ICPSR membership and multiple US federally-funded and foundation-funded topical archives at ICPSR. The Bibliography was originally launched in the year 2000 to aid in data discovery by providing a searchable database linking publications to the study data used in them. The Bibliography collects the universe of output based on the data shared in each study through, which is made available through each ICPSR study’s webpage. The Bibliography contains both peer-reviewed and grey literature, which provides evidence for measuring the impact of research data. For an item to be included in the ICPSR Bibliography, it must contain an analysis of data archived by ICPSR or contain a discussion or critique of the data collection process, study design, or methodology 31 . The Bibliography is manually curated by a team of librarians and information specialists at ICPSR who enter and validate entries. Some publications are supplied to the Bibliography by data depositors, and some citations are submitted to the Bibliography by authors who abide by ICPSR’s terms of use requiring them to submit citations to works in which they analyzed data retrieved from ICPSR. Most of the Bibliography is populated by Bibliography team members, who create custom queries for ICPSR studies performed across numerous sources, including Google Scholar, ProQuest, SSRN, and others. Each record in the Bibliography is one publication that has used one or more ICPSR studies. The version we used was captured on 2021-11-16 and included 94,755 publications.

To expand the coverage of the ICPSR Bibliography, we searched exhaustively for all ICPSR study names, unique numbers assigned to ICPSR studies, and DOIs 32 using a full-text index available through the Dimensions AI database 33 . We accessed Dimensions through a license agreement with the University of Michigan. ICPSR Bibliography librarians and information specialists manually reviewed and validated new entries that matched one or more search criteria. We then used Dimensions to gather enriched metadata and full-text links for items in the Bibliography with DOIs. We matched 43% of the items in the Bibliography to enriched Dimensions metadata including abstracts, field of research codes, concepts, and authors’ institutional information; we also obtained links to full text for 16% of Bibliography items. Based on licensing agreements, we included Dimensions identifiers and links to full text so that users with valid publisher and database access can construct an enriched publication dataset.

Gather study usage data

ICPSR maintains a relational administrative database, DBInfo, that organizes study-level metadata and information on data reuse across separate tables. Studies at ICPSR consist of one or more files collected at a single time or for a single purpose; studies in which the same variables are observed over time are grouped into series. Each study at ICPSR is assigned a DOI, and its metadata are stored in DBInfo. Study metadata follows the Data Documentation Initiative (DDI) Codebook 2.5 standard. DDI elements included in our dataset are title, ICPSR study identification number, DOI, authoring entities, description (abstract), funding agencies, subject terms assigned to the study during curation, and geographic coverage. We also created variables based on DDI elements: total variable count, the presence of survey question text in the metadata, the number of author entities, and whether an author entity was an institution. We gathered metadata for ICPSR’s 10,605 unrestricted public-use studies available as of 2021-11-16 ( https://www.icpsr.umich.edu/web/pages/membership/or/metadata/oai.html ).

To link study usage data with study-level metadata records, we joined study metadata from DBinfo on study usage information, which included total study downloads (data and documentation), individual data file downloads, and cumulative citations from the ICPSR Bibliography. We also gathered descriptive metadata for each study and its variables, which allowed us to summarize and append recoded fields onto the study-level metadata such as curation level, number and type of principle investigators, total variable count, and binary variables indicating whether the study data were made available for online analysis, whether survey question text was made searchable online, and whether the study variables were indexed for search. These characteristics describe aspects of the discoverability of the data to compare with other characteristics of the study. We used the study and series numbers included in the ICPSR Bibliography as unique identifiers to link papers to metadata and analyze the community structure of dataset co-citations in the ICPSR Bibliography 32 .

Process curation work logs

Researchers deposit data at ICPSR for curation and long-term preservation. Between 2016 and 2020, more than 3,000 research studies were deposited with ICPSR. Since 2017, ICPSR has organized curation work into a central unit that provides varied levels of curation that vary in the intensity and complexity of data enhancement that they provide. While the levels of curation are standardized as to effort (level one = less effort, level three = most effort), the specific curatorial actions undertaken for each dataset vary. The specific curation actions are captured in Jira, a work tracking program, which data curators at ICPSR use to collaborate and communicate their progress through tickets. We obtained access to a corpus of 669 completed Jira tickets corresponding to the curation of 566 unique studies between February 2017 and December 2019 28 .

To process the tickets, we focused only on their work log portions, which contained free text descriptions of work that data curators had performed on a deposited study, along with the curators’ identifiers, and timestamps. To protect the confidentiality of the data curators and the processing steps they performed, we collaborated with ICPSR’s curation unit to propose a classification scheme, which we used to train a Naive Bayes classifier and label curation actions in each work log sentence. The eight curation action labels we proposed 28 were: (1) initial review and planning, (2) data transformation, (3) metadata, (4) documentation, (5) quality checks, (6) communication, (7) other, and (8) non-curation work. We note that these categories of curation work are very specific to the curatorial processes and types of data stored at ICPSR, and may not match the curation activities at other repositories. After applying the classifier to the work log sentences, we obtained summary-level curation actions for a subset of all ICPSR studies (5%), along with the total number of hours spent on data curation for each study, and the proportion of time associated with each action during curation.

Data Records

The MICA dataset 27 connects records for each of ICPSR’s archived research studies to the research publications that use them and related curation activities available for a subset of studies (Fig.  2 ). Each of the three tables published in the dataset is available as a study archived at ICPSR. The data tables are distributed as statistical files available for use in SAS, SPSS, Stata, and R as well as delimited and ASCII text files. The dataset is organized around studies and papers as primary entities. The studies table lists ICPSR studies, their metadata attributes, and usage information; the papers table was constructed using the ICPSR Bibliography and Dimensions database; and the curation logs table summarizes the data curation steps performed on a subset of ICPSR studies.

Studies (“ICPSR_STUDIES”): 10,605 social science research datasets available through ICPSR up to 2021-11-16 with variables for ICPSR study number, digital object identifier, study name, series number, series title, authoring entities, full-text description, release date, funding agency, geographic coverage, subject terms, topical archive, curation level, single principal investigator (PI), institutional PI, the total number of PIs, total variables in data files, question text availability, study variable indexing, level of restriction, total unique users downloading study data files and codebooks, total unique users downloading data only, and total unique papers citing data through November 2021. Studies map to the papers and curation logs table through ICPSR study numbers as “STUDY”. However, not every study in this table will have records in the papers and curation logs tables.

Papers (“ICPSR_PAPERS”): 94,755 publications collected from 2000-08-11 to 2021-11-16 in the ICPSR Bibliography and enriched with metadata from the Dimensions database with variables for paper number, identifier, title, authors, publication venue, item type, publication date, input date, ICPSR series numbers used in the paper, ICPSR study numbers used in the paper, the Dimension identifier, and the Dimensions link to the publication’s full text. Papers map to the studies table through ICPSR study numbers in the “STUDY_NUMS” field. Each record represents a single publication, and because a researcher can use multiple datasets when creating a publication, each record may list multiple studies or series.

Curation logs (“ICPSR_CURATION_LOGS”): 649 curation logs for 563 ICPSR studies (although most studies in the subset had one curation log, some studies were associated with multiple logs, with a maximum of 10) curated between February 2017 and December 2019 with variables for study number, action labels assigned to work description sentences using a classifier trained on ICPSR curation logs, hours of work associated with a single log entry, and total hours of work logged for the curation ticket. Curation logs map to the study and paper tables through ICPSR study numbers as “STUDY”. Each record represents a single logged action, and future users may wish to aggregate actions to the study level before joining tables.

figure 2

Entity-relation diagram.

Technical Validation

We report on the reliability of the dataset’s metadata in the following subsections. To support future reuse of the dataset, curation services provided through ICPSR improved data quality by checking for missing values, adding variable labels, and creating a codebook.

All 10,605 studies available through ICPSR have a DOI and a full-text description summarizing what the study is about, the purpose of the study, the main topics covered, and the questions the PIs attempted to answer when they conducted the study. Personal names (i.e., principal investigators) and organizational names (i.e., funding agencies) are standardized against an authority list maintained by ICPSR; geographic names and subject terms are also standardized and hierarchically indexed in the ICPSR Thesaurus 34 . Many of ICPSR’s studies (63%) are in a series and are distributed through the ICPSR General Archive (56%), a non-topical archive that accepts any social or behavioral science data. While study data have been available through ICPSR since 1962, the earliest digital release date recorded for a study was 1984-03-18, when ICPSR’s database was first employed, and the most recent date is 2021-10-28 when the dataset was collected.

Curation level information was recorded starting in 2017 and is available for 1,125 studies (11%); approximately 80% of studies with assigned curation levels received curation services, equally distributed between Levels 1 (least intensive), 2 (moderately intensive), and 3 (most intensive) (Fig. 3). Detailed descriptions of ICPSR’s curation levels are available online [35]. Additional metadata are available for a subset of 421 studies (4%), including whether the study has a single PI, has an institutional PI, the total number of PIs involved, the total number of variables recorded, and whether the study is available for online analysis, has searchable question text, has variables indexed for search, contains one or more restricted files, and is completely restricted. We provided additional metadata for this subset of ICPSR studies because they were released within the past five years and detailed curation and usage information were available for them. Usage statistics, including total downloads and data file downloads, are available for this subset of studies as well; citation statistics are available for 8,030 studies (76%). Most ICPSR studies have fewer than 500 users, as indicated by total downloads, or citations (Fig. 4).

Figure 3. ICPSR study curation levels.

Figure 4. ICPSR study usage.

A subset of 43,102 publications (45%) available in the ICPSR Bibliography had a DOI. Author metadata were entered as free text, meaning that variations may exist and require additional normalization and pre-processing prior to analysis. While author information is standardized for each publication, individual names may appear in different sort orders (e.g., “Earls, Felton J.” and “Stephen W. Raudenbush”). Most of the items in the ICPSR Bibliography as of 2021-11-16 were journal articles (59%), reports (14%), conference presentations (9%), or theses (8%) (Fig. 5). The number of publications collected in the Bibliography has increased each decade since the inception of ICPSR in 1962 (Fig. 6). Most ICPSR studies (76%) have one or more citations in a publication.
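A first normalization pass over free-text author names might reconcile the two sort orders mentioned above. This is a minimal sketch under simplifying assumptions, not ICPSR's actual pre-processing; real bibliographic name cleaning must also handle suffixes, multiple commas, and corporate authors:

```python
def normalize_author(name: str) -> str:
    """Convert 'Last, First M.' to 'First M. Last'; pass other forms through.

    Assumes a single comma separates surname from given names, which is a
    simplification for illustration only.
    """
    if "," in name:
        last, _, first = name.partition(",")
        return f"{first.strip()} {last.strip()}"
    return name.strip()

print(normalize_author("Earls, Felton J."))        # Felton J. Earls
print(normalize_author("Stephen W. Raudenbush"))   # Stephen W. Raudenbush
```

Applying one such function to every author field before grouping or deduplicating reduces spurious splits between the two orderings.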

Figure 5. ICPSR Bibliography citation types.

Figure 6. ICPSR citations by decade.

Usage Notes

The dataset consists of three tables that can be joined using the “STUDY” key as shown in Fig. 2. The “ICPSR_PAPERS” table contains one row per paper with one or more cited studies in the “STUDY_NUMS” column. We manipulated and analyzed the tables as CSV files with the Pandas library [36] in Python and the Tidyverse packages [37] in R.
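A join like the one described above can be sketched in pandas. The toy frames and the ";" delimiter inside "STUDY_NUMS" are assumptions for illustration, not the published file format:

```python
import pandas as pd

# Toy stand-ins for the MICA tables; the real data are distributed as CSVs.
studies = pd.DataFrame({"STUDY": [100, 200], "TITLE": ["Study A", "Study B"]})
papers = pd.DataFrame({
    "PAPER_ID": [1, 2],
    "STUDY_NUMS": ["100;200", "200"],  # delimiter is an assumption
})

# Expand to one row per (paper, cited study), then join on the STUDY key.
links = (
    papers.assign(STUDY=papers["STUDY_NUMS"].str.split(";"))
          .explode("STUDY")
          .astype({"STUDY": int})
)
merged = links.merge(studies, on="STUDY", how="left")
print(merged[["PAPER_ID", "STUDY", "TITLE"]])
```

A left join keeps every citation link even if a study record were missing, which makes gaps easy to spot with `merged["TITLE"].isna()`.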

The present MICA dataset can be used independently to study the relationship between curation decisions and data reuse. Evidence of reuse for specific studies is available in several forms: usage information, including downloads and citation counts; and citation contexts within papers that cite data. Analysis may also be performed on the citation network formed between datasets and papers that use them. Finally, curation actions can be associated with properties of studies and usage histories.
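As one example of mining the citation network formed between datasets and papers, counting each study's citing papers reduces to a degree count over a bipartite edge list. The pairs below are hypothetical:

```python
from collections import Counter

# Hypothetical (paper_id, study_number) citation pairs; in practice these
# would come from exploding the STUDY_NUMS column of the papers table.
edges = [(1, 100), (1, 200), (2, 200), (3, 200)]

# The degree of each study node is its count of citing papers.
citations_per_study = Counter(study for _, study in edges)
print(citations_per_study.most_common(1))  # [(200, 3)]
```

The same edge list can seed richer analyses, such as community detection, by loading it into a dedicated graph library.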

This dataset has several limitations of which users should be aware. First, Jira tickets can only be used to represent the intensiveness of curation for activities undertaken since 2017, when ICPSR started using both Curation Levels and Jira. Studies published before 2017 were all curated, but documentation of the extent of that curation was not standardized and therefore could not be included in these analyses. Second, the measure of publications relies upon the authors’ clarity of data citation and the ICPSR Bibliography staff’s ability to discover citations with varying formality and clarity. Thus, there is always a chance that some secondary-data-citing publications have been left out of the bibliography. Finally, there may be some cases in which a paper in the ICPSR Bibliography did not actually obtain data from ICPSR. For example, PIs have often written about or even distributed their data prior to archiving them at ICPSR. Such publications would not have cited ICPSR, but they are still collected in the Bibliography as being directly related to the data that were eventually deposited at ICPSR.

In summary, the MICA dataset contains relationships between two main types of entities, papers and studies, which can be mined. The tables in the MICA dataset have supported network analysis (community structure and clique detection) [30]; natural language processing (NER for dataset reference detection) [32]; visualizing citation networks (to search for datasets) [38]; and regression analysis (on curation decisions and data downloads) [29]. The data are currently being used to develop research metrics and recommendation systems for research data. Given that DOIs are provided for ICPSR studies and articles in the ICPSR Bibliography, the MICA dataset can also be used with other bibliometric databases, including DataCite, Crossref, OpenAlex, and related indexes. Subscription-based services, such as Dimensions AI, are also compatible with the MICA dataset. In some cases, these services provide abstracts or full text for papers from which data citation contexts can be extracted for semantic content analysis.
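Because studies and papers carry DOIs, their records can be looked up in open bibliometric indexes such as OpenAlex. A sketch that builds an OpenAlex works-API URL from a DOI follows; the endpoint shape is taken from OpenAlex's public documentation and should be verified before relying on it:

```python
def openalex_work_url(doi: str) -> str:
    """Build an OpenAlex API URL for a work identified by its DOI.

    OpenAlex documents DOI lookup via /works/https://doi.org/<doi>;
    confirm against the current API docs before production use.
    """
    return f"https://api.openalex.org/works/https://doi.org/{doi.strip()}"

print(openalex_work_url("10.1038/s41597-024-03303-2"))
```

Fetching that URL with an HTTP client would return the work's metadata as JSON, from which citation counts or abstracts can be extracted.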

Code availability

The code [27] used to produce the MICA project dataset is available on GitHub at https://github.com/ICPSR/mica-data-descriptor and through Zenodo with the identifier https://doi.org/10.5281/zenodo.8432666. Data manipulation and pre-processing were performed in Python. Data curation for distribution was performed in SPSS.

References

1. He, L. & Han, Z. Do usage counts of scientific data make sense? An investigation of the Dryad repository. Library Hi Tech 35, 332–342 (2017).
2. Brickley, D., Burgess, M. & Noy, N. Google dataset search: Building a search engine for datasets in an open web ecosystem. In The World Wide Web Conference - WWW ’19, 1365–1375 (ACM Press, San Francisco, CA, USA, 2019).
3. Buneman, P., Dosso, D., Lissandrini, M. & Silvello, G. Data citation and the citation graph. Quantitative Science Studies 2, 1399–1422 (2022).
4. Chao, T. C. Disciplinary reach: Investigating the impact of dataset reuse in the earth sciences. Proceedings of the American Society for Information Science and Technology 48, 1–8 (2011).
5. Parr, C. et al. A discussion of value metrics for data repositories in earth and environmental sciences. Data Science Journal 18, 58 (2019).
6. Eschenfelder, K. R., Shankar, K. & Downey, G. The financial maintenance of social science data archives: Four case studies of long-term infrastructure work. J. Assoc. Inf. Sci. Technol. 73, 1723–1740 (2022).
7. Palmer, C. L., Weber, N. M. & Cragin, M. H. The analytic potential of scientific data: Understanding re-use value. Proceedings of the American Society for Information Science and Technology 48, 1–10 (2011).
8. Zimmerman, A. S. New knowledge from old data: The role of standards in the sharing and reuse of ecological data. Sci. Technol. Human Values 33, 631–652 (2008).
9. Cragin, M. H., Palmer, C. L., Carlson, J. R. & Witt, M. Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368, 4023–4038 (2010).
10. Fear, K. M. Measuring and Anticipating the Impact of Data Reuse. Ph.D. thesis, University of Michigan (2013).
11. Borgman, C. L., Van de Sompel, H., Scharnhorst, A., van den Berg, H. & Treloar, A. Who uses the digital data archive? An exploratory study of DANS. Proceedings of the Association for Information Science and Technology 52, 1–4 (2015).
12. Pasquetto, I. V., Borgman, C. L. & Wofford, M. F. Uses and reuses of scientific data: The data creators’ advantage. Harvard Data Science Review 1 (2019).
13. Gregory, K., Groth, P., Scharnhorst, A. & Wyatt, S. Lost or found? Discovering data needed for research. Harvard Data Science Review (2020).
14. York, J. Seeking Equilibrium in Data Reuse: A Study of Knowledge Satisficing. Ph.D. thesis, University of Michigan (2022).
15. Kilbride, W. & Norris, S. Collaborating to clarify the cost of curation. New Review of Information Networking 19, 44–48 (2014).
16. Robinson-Garcia, N., Mongeon, P., Jeng, W. & Costas, R. DataCite as a novel bibliometric source: Coverage, strengths and limitations. Journal of Informetrics 11, 841–854 (2017).
17. Qin, J., Hemsley, J. & Bratt, S. E. The structural shift and collaboration capacity in GenBank networks: A longitudinal study. Quantitative Science Studies 3, 174–193 (2022).
18. Acuna, D. E., Yi, Z., Liang, L. & Zhuang, H. Predicting the usage of scientific datasets based on article, author, institution, and journal bibliometrics. In Smits, M. (ed.) Information for a Better World: Shaping the Global Future. iConference 2022, 42–52 (Springer International Publishing, Cham, 2022).
19. Zeng, T., Wu, L., Bratt, S. & Acuna, D. E. Assigning credit to scientific datasets using article citation networks. Journal of Informetrics 14, 101013 (2020).
20. Koesten, L., Vougiouklis, P., Simperl, E. & Groth, P. Dataset reuse: Toward translating principles to practice. Patterns 1, 100136 (2020).
21. Du, C., Cohoon, J., Lopez, P. & Howison, J. Softcite dataset: A dataset of software mentions in biomedical and economic research publications. J. Assoc. Inf. Sci. Technol. 72, 870–884 (2021).
22. Aryani, A. et al. A research graph dataset for connecting research data repositories using RD-Switchboard. Sci Data 5, 180099 (2018).
23. Färber, M. & Lamprecht, D. The data set knowledge graph: Creating a linked open data source for data sets. Quantitative Science Studies 2, 1324–1355 (2021).
24. Perry, A. & Netscher, S. Measuring the time spent on data curation. Journal of Documentation 78, 282–304 (2022).
25. Trisovic, A. et al. Advancing computational reproducibility in the Dataverse data repository platform. In Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems, P-RECS ’20, 15–20, https://doi.org/10.1145/3391800.3398173 (Association for Computing Machinery, New York, NY, USA, 2020).
26. Borgman, C. L., Scharnhorst, A. & Golshan, M. S. Digital data archives as knowledge infrastructures: Mediating data sharing and reuse. Journal of the Association for Information Science and Technology 70, 888–904, https://doi.org/10.1002/asi.24172 (2019).
27. Lafia, S. et al. MICA Data Descriptor. Zenodo https://doi.org/10.5281/zenodo.8432666 (2023).
28. Lafia, S., Thomer, A., Bleckley, D., Akmon, D. & Hemphill, L. Leveraging machine learning to detect data curation activities. In 2021 IEEE 17th International Conference on eScience (eScience), 149–158, https://doi.org/10.1109/eScience51609.2021.00025 (2021).
29. Hemphill, L., Pienta, A., Lafia, S., Akmon, D. & Bleckley, D. How do properties of data, their curation, and their funding relate to reuse? J. Assoc. Inf. Sci. Technol. 73, 1432–44, https://doi.org/10.1002/asi.24646 (2021).
30. Lafia, S., Fan, L., Thomer, A. & Hemphill, L. Subdivisions and crossroads: Identifying hidden community structures in a data archive’s citation network. Quantitative Science Studies 3, 694–714, https://doi.org/10.1162/qss_a_00209 (2022).
31. ICPSR. ICPSR Bibliography of Data-related Literature: Collection Criteria. https://www.icpsr.umich.edu/web/pages/ICPSR/citations/collection-criteria.html (2023).
32. Lafia, S., Fan, L. & Hemphill, L. A natural language processing pipeline for detecting informal data references in academic literature. Proc. Assoc. Inf. Sci. Technol. 59, 169–178, https://doi.org/10.1002/pra2.614 (2022).
33. Hook, D. W., Porter, S. J. & Herzog, C. Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics 3, 23, https://doi.org/10.3389/frma.2018.00023 (2018).
34. ICPSR. ICPSR Thesaurus. https://www.icpsr.umich.edu/web/ICPSR/thesaurus (2002).
35. ICPSR. ICPSR Curation Levels. https://www.icpsr.umich.edu/files/datamanagement/icpsr-curation-levels.pdf (2020).
36. McKinney, W. Data Structures for Statistical Computing in Python. In van der Walt, S. & Millman, J. (eds.) Proceedings of the 9th Python in Science Conference, 56–61 (2010).
37. Wickham, H. et al. Welcome to the Tidyverse. Journal of Open Source Software 4, 1686 (2019).
38. Fan, L., Lafia, S., Li, L., Yang, F. & Hemphill, L. DataChat: Prototyping a conversational agent for dataset search and visualization. Proc. Assoc. Inf. Sci. Technol. 60, 586–591 (2023).


Acknowledgements

We thank the ICPSR Bibliography staff, the ICPSR Data Curation Unit, and the ICPSR Data Stewardship Committee for their support of this research. This material is based upon work supported by the National Science Foundation under grant 1930645. This project was made possible in part by the Institute of Museum and Library Services LG-37-19-0134-19.

Author information

Authors and Affiliations

Inter-university Consortium for Political and Social Research, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill, Sara Lafia, David Bleckley & Elizabeth Moss

School of Information, University of Michigan, Ann Arbor, MI, 48104, USA

Libby Hemphill & Lizhou Fan

School of Information, University of Arizona, Tucson, AZ, 85721, USA

Andrea Thomer


Contributions

L.H. and A.T. conceptualized the study design; D.B., E.M., and S.L. prepared the data; S.L., L.F., and L.H. analyzed the data; and D.B. validated the data. All authors reviewed and edited the manuscript.

Corresponding author

Correspondence to Libby Hemphill .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Hemphill, L., Thomer, A., Lafia, S. et al. A dataset for measuring the impact of research data and their curation. Sci Data 11 , 442 (2024). https://doi.org/10.1038/s41597-024-03303-2


Received : 16 November 2023

Accepted : 24 April 2024

Published : 03 May 2024

DOI : https://doi.org/10.1038/s41597-024-03303-2



Research Report – Example, Writing Guide and Types


Research Report

Definition:

A research report is a written document that presents the results of a research project or study, including the research question, methodology, results, and conclusions, in a clear and objective manner.

The purpose of a research report is to communicate the findings of the research to the intended audience, which could be other researchers, stakeholders, or the general public.

Components of Research Report

Components of Research Report are as follows:

Introduction

The introduction sets the stage for the research report and provides a brief overview of the research question or problem being investigated. It should include a clear statement of the purpose of the study and its significance or relevance to the field of research. It may also provide background information or a literature review to help contextualize the research.

Literature Review

The literature review provides a critical analysis and synthesis of the existing research and scholarship relevant to the research question or problem. It should identify the gaps, inconsistencies, and contradictions in the literature and show how the current study addresses these issues. The literature review also establishes the theoretical framework or conceptual model that guides the research.

Methodology

The methodology section describes the research design, methods, and procedures used to collect and analyze data. It should include information on the sample or participants, data collection instruments, data collection procedures, and data analysis techniques. The methodology should be clear and detailed enough to allow other researchers to replicate the study.

Results

The results section presents the findings of the study in a clear and objective manner. It should provide a detailed description of the data and statistics used to answer the research question or test the hypothesis. Tables, graphs, and figures may be included to help visualize the data and illustrate the key findings.

Discussion

The discussion section interprets the results of the study and explains their significance or relevance to the research question or problem. It should also compare the current findings with those of previous studies and identify the implications for future research or practice. The discussion should be based on the results presented in the previous section and should avoid speculation or unfounded conclusions.

Conclusion

The conclusion summarizes the key findings of the study and restates the main argument or thesis presented in the introduction. It should also provide a brief overview of the contributions of the study to the field of research and the implications for practice or policy.

References

The references section lists all the sources cited in the research report, following a specific citation style, such as APA or MLA.

Appendices

The appendices section includes any additional material, such as data tables, figures, or instruments used in the study, that could not be included in the main text due to space limitations.

Types of Research Report

Types of Research Report are as follows:

Thesis

A thesis is a long-form research document that presents the findings and conclusions of an original research study conducted by a student as part of a graduate or postgraduate program. It is typically written by a student pursuing a higher degree, such as a Master’s or Doctoral degree, although it can also be written by researchers or scholars in other fields.

Research Paper

A research paper is a document that presents the results of a research study or investigation. Research papers can be written in a variety of fields, including science, social science, humanities, and business. They typically follow a standard format that includes an introduction, literature review, methodology, results, discussion, and conclusion sections.

Technical Report

A technical report is a detailed report that provides information about a specific technical or scientific problem or project. Technical reports are often used in engineering, science, and other technical fields to document research and development work.

Progress Report

A progress report provides an update on the progress of a research project or program over a specific period of time. Progress reports are typically used to communicate the status of a project to stakeholders, funders, or project managers.

Feasibility Report

A feasibility report assesses the feasibility of a proposed project or plan, providing an analysis of the potential risks, benefits, and costs associated with the project. Feasibility reports are often used in business, engineering, and other fields to determine the viability of a project before it is undertaken.

Field Report

A field report documents observations and findings from fieldwork, which is research conducted in the natural environment or setting. Field reports are often used in anthropology, ecology, and other social and natural sciences.

Experimental Report

An experimental report documents the results of a scientific experiment, including the hypothesis, methods, results, and conclusions. Experimental reports are often used in biology, chemistry, and other sciences to communicate the results of laboratory experiments.

Case Study Report

A case study report provides an in-depth analysis of a specific case or situation, often used in psychology, social work, and other fields to document and understand complex cases or phenomena.

Literature Review Report

A literature review report synthesizes and summarizes existing research on a specific topic, providing an overview of the current state of knowledge on the subject. Literature review reports are often used in social sciences, education, and other fields to identify gaps in the literature and guide future research.

Research Report Example

Following is a Research Report Example sample for Students:

Title: The Impact of Social Media on Academic Performance among High School Students

Abstract:

This study aims to investigate the relationship between social media use and academic performance among high school students. The study utilized a quantitative research design, which involved a survey questionnaire administered to a sample of 200 high school students. The findings indicate that there is a negative correlation between social media use and academic performance, suggesting that excessive social media use can lead to poor academic performance among high school students. The results of this study have important implications for educators, parents, and policymakers, as they highlight the need for strategies that can help students balance their social media use and academic responsibilities.

Introduction:

Social media has become an integral part of the lives of high school students. With the widespread use of social media platforms such as Facebook, Twitter, Instagram, and Snapchat, students can connect with friends, share photos and videos, and engage in discussions on a range of topics. While social media offers many benefits, concerns have been raised about its impact on academic performance. Many studies have found a negative correlation between social media use and academic performance among high school students (Kirschner & Karpinski, 2010; Paul, Baker, & Cochran, 2012).

Given the growing importance of social media in the lives of high school students, it is important to investigate its impact on academic performance. This study aims to address this gap by examining the relationship between social media use and academic performance among high school students.

Methodology:

The study utilized a quantitative research design, which involved a survey questionnaire administered to a sample of 200 high school students. The questionnaire was developed based on previous studies and was designed to measure the frequency and duration of social media use, as well as academic performance.

The participants were selected using a convenience sampling technique, and the survey questionnaire was distributed in the classroom during regular school hours. The data collected were analyzed using descriptive statistics and correlation analysis.
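The correlation analysis mentioned above can be illustrated in a few lines of Python. The numbers below are synthetic, invented for illustration only, and are not data from the survey:

```python
from math import sqrt

# Synthetic illustration: daily social media hours vs. grade point average.
hours = [1, 2, 3, 4, 5, 6]
gpa = [3.9, 3.7, 3.4, 3.1, 2.9, 2.6]

def pearson_r(x, y):
    """Pearson correlation coefficient: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(hours, gpa)
print(round(r, 3))  # strongly negative: more social media time, lower GPA
```

A coefficient near -1 indicates a strong negative linear relationship, which is the pattern the example study reports.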

Results:

The findings indicate that the majority of high school students use social media platforms on a daily basis, with Facebook being the most popular platform. The results also show a negative correlation between social media use and academic performance, suggesting that excessive social media use can lead to poor academic performance among high school students.

Discussion:

The results of this study have important implications for educators, parents, and policymakers. The negative correlation between social media use and academic performance suggests that strategies should be put in place to help students balance their social media use and academic responsibilities. For example, educators could incorporate social media into their teaching strategies to engage students and enhance learning. Parents could limit their children’s social media use and encourage them to prioritize their academic responsibilities. Policymakers could develop guidelines and policies to regulate social media use among high school students.

Conclusion:

In conclusion, this study provides evidence of the negative impact of social media on academic performance among high school students. The findings highlight the need for strategies that can help students balance their social media use and academic responsibilities. Further research is needed to explore the specific mechanisms by which social media use affects academic performance and to develop effective strategies for addressing this issue.

Limitations:

One limitation of this study is the use of convenience sampling, which limits the generalizability of the findings to other populations. Future studies should use random sampling techniques to increase the representativeness of the sample. Another limitation is the use of self-reported measures, which may be subject to social desirability bias. Future studies could use objective measures of social media use and academic performance, such as tracking software and school records.

Implications:

The findings of this study have important implications for educators, parents, and policymakers. Educators could incorporate social media into their teaching strategies to engage students and enhance learning. For example, teachers could use social media platforms to share relevant educational resources and facilitate online discussions. Parents could limit their children’s social media use and encourage them to prioritize their academic responsibilities. They could also engage in open communication with their children to understand their social media use and its impact on their academic performance. Policymakers could develop guidelines and policies to regulate social media use among high school students. For example, schools could implement social media policies that restrict access during class time and encourage responsible use.

References:

  • Kirschner, P. A., & Karpinski, A. C. (2010). Facebook® and academic performance. Computers in Human Behavior, 26(6), 1237-1245.
  • Paul, J. A., Baker, H. M., & Cochran, J. D. (2012). Effect of online social networking on student academic performance. Journal of the Research Center for Educational Technology, 8(1), 1-19.
  • Pantic, I. (2014). Online social networking and mental health. Cyberpsychology, Behavior, and Social Networking, 17(10), 652-657.
  • Rosen, L. D., Carrier, L. M., & Cheever, N. A. (2013). Facebook and texting made me do it: Media-induced task-switching while studying. Computers in Human Behavior, 29(3), 948-958.

Note: The example above is only a sample to guide students. Do not copy and paste it directly as your college or university assignment; do your own research and write your own report.

Applications of Research Report

Research reports have many applications, including:

  • Communicating research findings: The primary application of a research report is to communicate the results of a study to other researchers, stakeholders, or the general public. The report serves as a way to share new knowledge, insights, and discoveries with others in the field.
  • Informing policy and practice: Research reports can inform policy and practice by providing evidence-based recommendations for decision-makers. For example, a research report on the effectiveness of a new drug could inform regulatory agencies in their decision-making process.
  • Supporting further research: Research reports can provide a foundation for further research in a particular area. Other researchers may use the findings and methodology of a report to develop new research questions or to build on existing research.
  • Evaluating programs and interventions: Research reports can be used to evaluate the effectiveness of programs and interventions in achieving their intended outcomes. For example, a research report on a new educational program could provide evidence of its impact on student performance.
  • Demonstrating impact: Research reports can be used to demonstrate the impact of research funding or to evaluate the success of research projects. By presenting the findings and outcomes of a study, research reports can show the value of research to funders and stakeholders.
  • Enhancing professional development: Research reports can be used to enhance professional development by providing a source of information and learning for researchers and practitioners in a particular field. For example, a research report on a new teaching methodology could provide insights and ideas for educators to incorporate into their own practice.

How to write Research Report

Here are some steps you can follow to write a research report:

  • Identify the research question: The first step in writing a research report is to identify your research question. This will help you focus your research and organize your findings.
  • Conduct research: Once you have identified your research question, you will need to conduct research to gather relevant data and information. This can involve conducting experiments, reviewing literature, or analyzing data.
  • Organize your findings: Once you have gathered all of your data, you will need to organize your findings in a way that is clear and understandable. This can involve creating tables, graphs, or charts to illustrate your results.
  • Write the report: Once you have organized your findings, you can begin writing the report. Start with an introduction that provides background information and explains the purpose of your research. Next, provide a detailed description of your research methods and findings. Finally, summarize your results and draw conclusions based on your findings.
  • Proofread and edit: After you have written your report, be sure to proofread and edit it carefully. Check for grammar and spelling errors, and make sure that your report is well-organized and easy to read.
  • Include a reference list: Be sure to include a list of references that you used in your research. This will give credit to your sources and allow readers to further explore the topic if they choose.
  • Format your report: Finally, format your report according to the guidelines provided by your instructor or organization. This may include formatting requirements for headings, margins, fonts, and spacing.

Purpose of Research Report

The purpose of a research report is to communicate the results of a research study to a specific audience, such as peers in the same field, stakeholders, or the general public. The report provides a detailed description of the research methods, findings, and conclusions.

Some common purposes of a research report include:

  • Sharing knowledge: A research report allows researchers to share their findings and knowledge with others in their field. This helps to advance the field and improve the understanding of a particular topic.
  • Identifying trends: A research report can identify trends and patterns in data, which can help guide future research and inform decision-making.
  • Addressing problems: A research report can provide insights into problems or issues and suggest solutions or recommendations for addressing them.
  • Evaluating programs or interventions: A research report can evaluate the effectiveness of programs or interventions, which can inform decision-making about whether to continue, modify, or discontinue them.
  • Meeting regulatory requirements: In some fields, research reports are required to meet regulatory requirements, such as in the case of drug trials or environmental impact studies.

When to Write Research Report

A research report should be written after completing the research study. This includes collecting data, analyzing the results, and drawing conclusions based on the findings. Once the research is complete, the report should be written in a timely manner while the information is still fresh in the researcher’s mind.

In academic settings, research reports are often required as part of coursework or as part of a thesis or dissertation. In this case, the report should be written according to the guidelines provided by the instructor or institution.

In other settings, such as in industry or government, research reports may be required to inform decision-making or to comply with regulatory requirements. In these cases, the report should be written as soon as possible after the research is completed in order to inform decision-making in a timely manner.

Overall, the timing of when to write a research report depends on the purpose of the research, the expectations of the audience, and any regulatory requirements that need to be met. However, it is important to complete the report in a timely manner while the information is still fresh in the researcher’s mind.

Characteristics of Research Report

There are several characteristics of a research report that distinguish it from other types of writing. These characteristics include:

  • Objective: A research report should be written in an objective and unbiased manner. It should present the facts and findings of the research study without any personal opinions or biases.
  • Systematic: A research report should be written in a systematic manner. It should follow a clear and logical structure, and the information should be presented in a way that is easy to understand and follow.
  • Detailed: A research report should be detailed and comprehensive. It should provide a thorough description of the research methods, results, and conclusions.
  • Accurate: A research report should be accurate and based on sound research methods. The findings and conclusions should be supported by data and evidence.
  • Organized: A research report should be well-organized. It should include headings and subheadings to help the reader navigate the report and understand the main points.
  • Clear and concise: A research report should be written in clear and concise language. The information should be presented in a way that is easy to understand, and unnecessary jargon should be avoided.
  • Citations and references: A research report should include citations and references to support the findings and conclusions. This helps to give credit to other researchers and to provide readers with the opportunity to further explore the topic.

Advantages of Research Report

Research reports have several advantages, including:

  • Communicating research findings: Research reports allow researchers to communicate their findings to a wider audience, including other researchers, stakeholders, and the general public. This helps to disseminate knowledge and advance the understanding of a particular topic.
  • Providing evidence for decision-making: Research reports can provide evidence to inform decision-making, such as in the case of policy-making, program planning, or product development. The findings and conclusions can help guide decisions and improve outcomes.
  • Supporting further research: Research reports can provide a foundation for further research on a particular topic. Other researchers can build on the findings and conclusions of the report, which can lead to further discoveries and advancements in the field.
  • Demonstrating expertise: Research reports can demonstrate the expertise of the researchers and their ability to conduct rigorous and high-quality research. This can be important for securing funding, promotions, and other professional opportunities.
  • Meeting regulatory requirements: In some fields, research reports are required to meet regulatory requirements, such as in the case of drug trials or environmental impact studies. Producing a high-quality research report can help ensure compliance with these requirements.

Limitations of Research Report

Despite their advantages, research reports also have some limitations, including:

  • Time-consuming: Conducting research and writing a report can be a time-consuming process, particularly for large-scale studies. This can limit the frequency and speed of producing research reports.
  • Expensive: Conducting research and producing a report can be expensive, particularly for studies that require specialized equipment, personnel, or data. This can limit the scope and feasibility of some research studies.
  • Limited generalizability: Research studies often focus on a specific population or context, which can limit the generalizability of the findings to other populations or contexts.
  • Potential bias: Researchers may have biases or conflicts of interest that can influence the findings and conclusions of the research study. Additionally, participants may also have biases or may not be representative of the larger population, which can limit the validity and reliability of the findings.
  • Accessibility: Research reports may be written in technical or academic language, which can limit their accessibility to a wider audience. Additionally, some research may be behind paywalls or require specialized access, which can limit the ability of others to read and use the findings.


Broad Public Support for Legal Abortion Persists 2 Years After Dobbs

By more than 2 to 1, Americans say medication abortion should be legal

Table of contents

  • Other abortion attitudes
  • Overall attitudes about abortion
  • Americans’ views on medication abortion in their states
  • How statements about abortion resonate with Americans
  • Acknowledgments
  • The American Trends Panel survey methodology

Pew Research Center conducted this study to understand Americans’ views on the legality of abortion, as well as their perceptions of abortion access. For this analysis, we surveyed 8,709 adults from April 8 to 14, 2024. Everyone who took part in this survey is a member of the Center’s American Trends Panel (ATP), an online survey panel recruited through national, random sampling of residential addresses, so that nearly all U.S. adults have a chance of selection. The survey is weighted to be representative of the U.S. adult population by gender, race, ethnicity, partisan affiliation, education and other categories. Read more about the ATP’s methodology.
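The weighting step described above can be sketched in a few lines of code. This is a hypothetical one-variable post-stratification example, not Pew's actual procedure: real ATP weighting adjusts across many variables at once (via raking), and the function name and toy numbers below are purely illustrative.

```python
# Hypothetical sketch of one-variable post-stratification weighting.
# Each respondent is weighted so that the weighted sample matches
# known population shares for a single demographic category.
from collections import Counter

def poststratify(sample, population_shares):
    """Return one weight per respondent: population share / sample share."""
    counts = Counter(sample)
    n = len(sample)
    return [population_shares[g] / (counts[g] / n) for g in sample]

# Toy example: the sample over-represents group "A" (75% vs. 50%).
sample = ["A", "A", "A", "B"]
population = {"A": 0.5, "B": 0.5}
weights = poststratify(sample, population)
# Each "A" respondent gets weight 0.5/0.75 (about 0.667); "B" gets 0.5/0.25 = 2.0,
# so the weighted sample is 50% A and 50% B, matching the population.
```

After weighting, survey estimates (such as the share saying abortion should be legal) are computed as weighted averages rather than raw proportions.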

Here are the questions used for the report and its methodology.

Nearly two years after the Supreme Court overturned the 1973 Roe v. Wade decision guaranteeing a national right to abortion, a majority of Americans continue to express support for abortion access.

Chart shows Majority of Americans say abortion should be legal in all or most cases

About six-in-ten (63%) say abortion should be legal in all or most cases. This share has grown 4 percentage points since 2021 – the year prior to the 2022 decision in Dobbs v. Jackson Women’s Health Organization that overturned Roe.

The new Pew Research Center survey, conducted April 8-14, 2024, among 8,709 adults, surfaces ongoing – and often partisan – divides over abortion attitudes:

  • Democrats and Democratic-leaning independents (85%) overwhelmingly say abortion should be legal in all or most cases, with near unanimous support among liberal Democrats.
  • By comparison, Republicans and Republican leaners (41%) are far less likely to say abortion should be legal in all or most cases. However, two-thirds of moderate and liberal Republicans still say it should be.

Chart shows Partisan divide over abortion has widened over the past decade

Since before Roe was overturned, both parties have seen a modest uptick in the share who say abortion should be legal.

As in the past, relatively few Americans (25%) say abortion should be legal in all cases, while even fewer (8%) say it should be illegal in all cases. About two-thirds of Americans do not take an absolutist view: 38% say it should be legal in most cases, and 28% say it should be illegal in most cases.

Related: Americans overwhelmingly say access to IVF is a good thing

Women’s abortion decisions

Chart shows A majority of Americans say the decision to have an abortion should belong solely to the pregnant woman; about a third say embryos are people with rights

A narrow majority of Americans (54%) say the statement “the decision about whether to have an abortion should belong solely to the pregnant woman” describes their views extremely or very well. Another 19% say it describes their views somewhat well, and 26% say it does not describe their views well.

Views on an embryo’s rights

About a third of Americans (35%) say the statement “human life begins at conception, so an embryo is a person with rights” describes their views extremely or very well, while 45% say it does not describe their views well.

But many Americans are cross-pressured in their views: 32% of Americans say both statements about women’s decisions and embryos’ rights describe their views at least somewhat well.

Abortion access

About six-in-ten Americans in both parties say getting an abortion in the area where they live would be at least somewhat easy, compared with four-in-ten or fewer who say it would be difficult.

Chart shows About 6 in 10 Americans say it would be easy to get an abortion in their area

However, U.S. adults are divided over whether getting an abortion should be easier or harder:

  • 31% say it should be easier for someone to get an abortion in their area, while 25% say it should be harder. Four-in-ten say the ease of access should be about what it is now.
  • 48% of Democrats say that obtaining an abortion should be easier than it is now, while just 15% of Republicans say this. Instead, 40% of Republicans say it should be harder (just 11% of Democrats say this).

As was the case last year, views about abortion access vary widely between those who live in states where abortion is legal and those who live in states where it is not allowed.

For instance, 20% of adults in states where abortion is legal say it would be difficult to get an abortion where they live, but this share rises to 71% among adults in states where abortion is prohibited.

Medication abortion

Americans say medication abortion should be legal rather than illegal by a margin of more than two-to-one (54% vs. 20%). A quarter say they are not sure.

Chart shows Most Democrats say medication abortion should be legal; Republicans are divided

Like opinions on the legality of abortion overall, partisans differ greatly in their views of medication abortion:

  • Republicans are closely split but are slightly more likely to say it should be legal (37%) than illegal (32%). Another 30% aren’t sure.
  • Democrats (73%) overwhelmingly say medication abortion should be legal. Just 8% say it should be illegal, while 19% are not sure.

Across most other demographic groups, Americans are generally more supportive than not of medication abortion.

Chart shows Younger Americans are more likely than older adults to say abortion should be legal in all or most cases

Across demographic groups, support for abortion access has changed little since this time last year.

Today, roughly six-in-ten (63%) say abortion should be legal in all (25%) or most (38%) cases. And 36% say it should be illegal in all (8%) or most (28%) cases.

While differences are only modest by gender, other groups vary more widely in their views.

Race and ethnicity

Support for legal abortion is higher among Black (73%) and Asian (76%) adults compared with White (60%) and Hispanic (59%) adults.

Age

Compared with older Americans, adults under 30 are particularly likely to say abortion should be legal: 76% say this, versus about six-in-ten among other age groups.

Education

Those with higher levels of formal education express greater support for legal abortion than those with lower levels of educational attainment.

About two-thirds of Americans with a bachelor’s degree or more education (68%) say abortion should be legal in all or most cases, compared with six-in-ten among those without a degree.

Religion

White evangelical Protestants are about three times as likely to say abortion should be illegal (73%) as they are to say it should be legal (25%).

By contrast, majorities of White nonevangelical Protestants (64%), Black Protestants (71%) and Catholics (59%) say abortion should be legal. And religiously unaffiliated Americans are especially likely to say abortion should be legal (86% say this).

Partisanship and ideology

Democrats (85%) are about twice as likely as Republicans (41%) to say abortion should be legal in all or most cases.

But while conservative Republicans are far more likely to say abortion should be illegal (76%) than legal (27%), the reverse is true for moderate and liberal Republicans (67% say legal, 31% say illegal).

By comparison, a clear majority of conservative and moderate Democrats (76%) say abortion should be legal, with liberal Democrats (96%) overwhelmingly saying this.

Views of abortion access by state

About six-in-ten Americans (58%) say it would be easy for someone to get an abortion in the area where they live, while 39% say it would be difficult.

Chart shows Americans vary widely in their views over how easy it would be to get an abortion based on where they live

This marks a slight shift since last year, when 54% said obtaining an abortion would be easy. But Americans are still less likely than before the Dobbs decision to say obtaining an abortion would be easy.

Still, Americans’ views vary widely depending on whether they live in a state that has banned or restricted abortion.

In states that prohibit abortion, Americans are about three times as likely to say it would be difficult to obtain an abortion where they live as they are to say it would be easy (71% vs. 25%). The share saying it would be difficult has risen 19 points since 2019.

In states where abortion is restricted or subject to legal challenges, 51% say it would be difficult to get an abortion where they live. This is similar to the share who said so last year (55%), but higher than the share who said this before the Dobbs decision (38%).

By comparison, just 20% of adults in states where abortion is legal say it would be difficult to get one. This is little changed over the past five years.

Americans’ attitudes about whether it should be easier or harder to get an abortion in the area where they live also vary by geography.

Chart shows Americans living in states with abortion bans or restrictions are more likely to say it should be easier than it currently is to obtain an abortion

Overall, a decreasing share of Americans say it should be harder to obtain an abortion: 33% said this in 2019, compared with 25% today.

This is particularly true of those in states where abortion is now prohibited or restricted.

In both types of states, the shares of Americans saying it should be easier to obtain an abortion have risen 12 points since before Roe was overturned, as the shares saying it should be harder have gradually declined.

By comparison, changes in views among those living in states where abortion is legal have been more modest.

While Americans overall are more supportive than not of medication abortion (54% say it should be legal, 20% say illegal), there are modest differences in support across groups:

Chart shows Across most groups, more say medication abortion should be legal than illegal in their states

  • Younger Americans are somewhat more likely to say medication abortion should be legal than older Americans. While 59% of adults ages 18 to 49 say it should be legal, 48% of those 50 and older say the same.
  • Asian adults (66%) are particularly likely to say medication abortion should be legal compared with White (55%), Black (51%) and Hispanic (47%) adults.
  • White evangelical Protestants oppose medication abortion by about two-to-one (45% vs. 23%), with White nonevangelicals, Black Protestants, Catholics and religiously unaffiliated adults all being more likely than not to say medication abortion should be legal.
  • Republicans are closely divided over medication abortion: 37% say it should be legal while 32% say it should be illegal. But similar to views on abortion access overall, conservative Republicans are more opposed (43% illegal, 27% legal), while moderates and liberals are more supportive (55% legal, 14% illegal).

Just over half of Americans (54%) say “the decision about whether to have an abortion should belong solely to the pregnant woman” describes their views extremely or very well, compared with 19% who say somewhat well and 26% who say not too or not at all well.

Chart shows Wide partisan divides over whether pregnant women should be the sole deciders of abortion decisions and whether an embryo is a person with rights

Democrats (76%) overwhelmingly say this statement describes their views extremely or very well, with just 8% saying it does not describe their views well.

Republicans are more divided: 44% say it does not describe their views well while 33% say it describes them extremely or very well. Another 22% say it describes them somewhat well.

Fewer Americans (35%) say the statement “human life begins at conception, so an embryo is a person with rights” describes their views extremely or very well. Another 19% say it describes their views somewhat well while 45% say it describes them not too or not at all well.

(The survey asks separately whether “a fetus is a person with rights.” The results are roughly similar: 37% say that statement describes their views extremely or very well.)

Republicans are about three times as likely as Democrats to say “an embryo is a person with rights” describes their views extremely or very well (53% vs. 18%). In turn, Democrats (66%) are far more likely than Republicans (25%) to say it describes their views not too or not at all well.

Some Americans are cross-pressured about abortion

Chart shows Nearly a third of U.S. adults say embryos are people with rights and pregnant women should be the ones to make abortion decisions

When results on the two statements are combined, 41% of Americans say the statement about a pregnant woman’s right to choose describes their views at least somewhat well, but not the statement about an embryo being a person with rights. About two-in-ten (21%) say the reverse.

But for nearly a third of U.S. adults (32%), both statements describe their views at least somewhat well.

Just 4% of Americans say neither statement describes their views well.



ABOUT PEW RESEARCH CENTER  Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping the world. It conducts public opinion polling, demographic research, media content analysis and other empirical social science research. Pew Research Center does not take policy positions. It is a subsidiary of  The Pew Charitable Trusts .

Copyright 2024 Pew Research Center

