Data science: a game changer for science and innovation

  • Regular Paper
  • Open access
  • Published: 19 April 2021
  • Volume 11, pages 263–278 (2021)


  • Valerio Grossi 1,
  • Fosca Giannotti 1,
  • Dino Pedreschi 2,
  • Paolo Manghi 3,
  • Pasquale Pagano 3 &
  • Massimiliano Assante 3


This paper shows the potential of data science for disruptive innovation in science, industry, policy, and people's lives. We discuss how data science will impact science and society at large in the coming years, including the ethical problems raised by managing data on human behavior and the quantitative expectations of data science's economic impact. We introduce concepts such as open science and e-infrastructures as useful tools for supporting ethical data science and for training new generations of data scientists. Finally, this work outlines the SoBigData Research Infrastructure as an easy-to-access platform for executing complex data science processes. The services offered by SoBigData are aimed at using data science to understand the complexity of our contemporary, globally interconnected society.



1 Introduction: from data to knowledge

Data science is an interdisciplinary and pervasive paradigm in which different theories and models are combined to transform data into knowledge (and value). Experiments and analyses over massive datasets serve not only to validate existing theories and models but also to enable the data-driven discovery of patterns emerging from the data, which can help scientists design better theories and models and yield a deeper understanding of the complexity of social, economic, biological, technological, cultural, and natural phenomena. The products of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. All these aspects are producing a change in the scientific method, in research, and in the way our society makes decisions [ 2 ].

Data science emerges from three concurring facts: (i) the advent of big data, which provides the critical mass of actual examples to learn from; (ii) the advances in data analysis and learning techniques that can produce predictive models and behavioral patterns from big data; and (iii) the advances in high-performance computing infrastructures that make it possible to ingest and manage big data and perform complex analyses [ 16 ].

Paper organization Section 2 discusses how data science will impact science and society at large in the coming years. Section 3 outlines the main ethical issues raised by studying human behavior through data science. In Sect. 4, we show how concepts such as open science and e-infrastructures are effective tools for supporting and disseminating ethical uses of data and for training new generations of data scientists. We illustrate the importance of open data science with examples provided later in the paper. Finally, we show some use cases of data science through thematic environments that bind datasets with social mining methods.

2 Data science for society, science, industry and business

Fig. 1 Data science as an ecosystem: on the left, the figure shows the main components enabling data science (data, analytical methods, and infrastructures); on the right, the impact of data science on society, science, and business. All data science activities should be carried out under strict ethical principles

The quality of business decision making, government administration, and scientific research can potentially be improved by analyzing data. Data science offers important insights into many complicated issues, in many instances with remarkable accuracy and timeliness.

Fig. 2 The data science pipeline starts with raw data and transforms them into data suitable for analytics; the next step transforms these data into knowledge through analytical methods, providing results and evaluation measures

As shown in Fig. 1, data science is an ecosystem where the following scientific, technological, and socioeconomic factors interact:

Data: availability of data and access to data sources;

Analytics & computing infrastructures: availability of high-performance analytical processing and open-source analytics;

Skills: availability of highly and rightly skilled data scientists and engineers;

Ethical & legal aspects: availability of regulatory environments for data ownership and usage, data protection and privacy, security, liability, cybercrime, and intellectual property rights;

Applications: business- and market-ready applications;

Social aspects: focus on major societal global challenges.

Data science, envisioned as the intersection of data mining, big data analytics, artificial intelligence, statistical modeling, and complex systems, is capable of transparently monitoring data quality and the results of analytical processes. If we want data science to face global challenges and become a determinant factor of sustainable development, it is necessary to push towards an open global ecosystem for scientific, industrial, and societal innovation [ 48 ]. We need to build an ecosystem of socioeconomic activities in which each new idea, product, and service creates opportunities for further purposes and products. An open data strategy, innovation, interoperability, and suitable intellectual property rights can catalyze such an ecosystem and boost economic growth and sustainable development. This strategy also requires “networked thinking” and a participatory, inclusive approach.

Data are relevant in almost all scientific disciplines, and a data-dominated science could lead to the solution of problems currently considered hard or impossible to tackle. It is impossible to cover all the scientific sectors where a data-driven revolution is ongoing; here, we provide just a few examples.

The Sloan Digital Sky Survey Footnote 1 has become a central resource for astronomers all over the world. Astronomy is being transformed from a discipline where taking pictures of the sky was a large part of an astronomer’s job to one where the images are already in a database, and the astronomer’s task is to find interesting objects and phenomena in that database. In the biological sciences, data are stored in public repositories, and an entire discipline of bioinformatics is devoted to the analysis of such data. Footnote 2 Data-centric approaches based on personal behaviors can also support medical applications analyzing data both at the level of human behavior and at the lower molecular level: for example, integrating genome data on medical reactions with the habits of users enables a computational drug science for high-precision personalized medicine. In humans, as in other organisms, most cellular components exert their functions through interactions with other cellular components. The totality of these interactions (representing the human “interactome”) is a network with hundreds of thousands of nodes and a much larger number of links. A disease is rarely a consequence of an abnormality in a single gene; instead, the disease phenotype reflects various pathological processes that interact in a complex network. Network-based approaches can have multiple biological and clinical applications, especially in revealing the mechanisms behind complex diseases [ 6 ].

We now illustrate the typical data science pipeline [ 50 ]. People, machines, systems, factories, organizations, communities, and societies produce data. Data are collected in every aspect of our life: when we submit a tax declaration; when a customer orders an item online; when a social media user posts a comment; when an X-ray machine is used to take a picture; when a traveler sends a review on a restaurant; when a sensor in a supply chain sends an alert; or when a scientist conducts an experiment. This huge and heterogeneous quantity of data needs to be extracted, loaded, understood, transformed, and, in many cases, anonymized before it can be used for analysis. Analysis results include routines, automated decisions, predictions, and recommendations, and these outcomes need to be interpreted to produce actions and feedback. Furthermore, this scenario must also consider the ethical problems of managing social data. Figure 2 depicts the data science pipeline. Footnote 3 Ethical aspects are important in the application of data science in several sectors, and they are addressed in Sect. 3.
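
To make the pipeline concrete, the following minimal sketch walks through its stages in Python; the input file, column names, and prediction task are purely illustrative and not part of any specific SoBigData workflow.

```python
# A minimal, illustrative sketch of the pipeline in Fig. 2:
# raw data -> cleaned/anonymized data -> analytical method -> evaluated results.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Extract and load raw data (file name and columns are hypothetical).
raw = pd.read_csv("transactions.csv")

# 2. Understand and transform: drop direct identifiers (a crude form of
#    anonymization) and remove incomplete records.
data = raw.drop(columns=["customer_id"]).dropna()

X = data.drop(columns=["churned"])
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 3. Analytical method: a simple, interpretable classifier.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# 4. Results and evaluation measures, to be interpreted into actions and feedback.
print(classification_report(y_test, model.predict(X_test)))
```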

2.1 Impact on society

Data science is an opportunity for improving our society and boosting social progress. It can support policymaking; it offers novel ways to produce high-quality and high-precision statistical information and to empower citizens with self-awareness tools. Furthermore, it can help to promote ethical uses of big data.

Modern cities are perfect environments densely traversed by large data flows. Using traffic monitoring systems, environmental sensors, GPS individual traces, and social information, we can manage cities as a collective sharing of resources that need to be optimized, continuously monitored, and promptly adjusted when needed. It is easy to understand the potential of data science by introducing terms such as urban planning, public transportation, reduction of energy consumption, ecological sustainability, safety, and management of mass events. These terms represent only the front line of topics that can benefit from the awareness that big data might provide to city stakeholders [ 22 , 27 , 29 ]. Several methods for human mobility analysis and prediction are available in the literature. MyWay [ 47 ] exploits individual systematic behaviors to predict future human movements by combining individual and collective learned models. Carpooling [ 22 ] is based on mobility data from travelers in a given territory and constructs a network of potential carpooling users by exploiting topological properties, highlighting sub-populations with higher chances of creating a carpooling community and the propensity of users to be either drivers or passengers in a shared car. Event attendance prediction [ 13 ] analyzes users’ call habits and classifies people into behavioral categories, dividing them among residents, commuters, and visitors; it allows us to observe the variety of behaviors of city users and the attendance at big events in cities.

Electric mobility is expected to gain importance worldwide. The impact of a complete switch to electric mobility is still under investigation, and what appears to be critical is the intensity of flows due to charging (and fast recharging) systems that may challenge the stability of the power network. To avoid instabilities in the charging infrastructure, an accurate prediction of the power flows associated with mobility is needed. Personal mobility data can be used to estimate mobility flows and to simulate the impact of different charging behavioral patterns, in order to predict power flows and optimize the placement of charging infrastructure [ 25 , 49 ]. Lorini et al. [ 26 ] present an example of urban flood prediction that integrates data provided by the CEM system Footnote 4 and Twitter data. Twitter data are processed using massive multilingual approaches for classification. The model is supervised and requires careful data collection and validation of the ground truth about confirmed floods from multiple sources.

Another example of data science for society can be found in the development of applications with functions aimed directly at the individual. In this context, concepts such as personal data stores and personal data analytics aim to implement a new deal on personal data, providing a user-centric view in which data are collected, integrated, and analyzed at the individual level, giving users better awareness of their own behavioral, health, and consumer profiles. Within this user-centric perspective, there is room for an even broader market of business applications, such as high-precision real-time targeted marketing, e.g., self-organizing decision making that preserves desired global properties and the sustainability of the transportation or healthcare system. Such contexts emphasize two essential aspects of data science: the need for creativity to exploit and combine the several data sources in novel ways, and the need to give awareness and control of personal data to the users who generate them, to sustain a transparent, trust-based, crowd-sourced data ecosystem [ 19 ].

The impact of online social networks on our society has changed the mechanisms behind information spreading and news production. The transformation of media ecosystems and news consumption is having consequences in several fields. A relevant example is the impact of misinformation on society, as in the Brexit referendum, where the massive diffusion of fake news has been considered one of the most relevant factors in the outcome of this political event. Examples of achievements are provided by the results regarding the influence of external news media on polarization in online social networks. These achievements indicate that users are highly polarized towards news sources, i.e., they tend to cite sources that they identify as ideologically similar to them. Other results regard echo chambers and the role of social media users: there is a strong correlation between the orientation of the content produced and consumed. In other words, an opinion “echoes” back to the user when others are sharing it in the “chamber” (i.e., the social network around the user) [ 36 ]. Other results worth mentioning regard efforts devoted to uncovering spam and bot activities in stock microblogs on Twitter: taking inspiration from biological DNA, the idea is to model online users’ behavior through strings of characters representing sequences of online users’ actions. Following this approach, [ 11 , 12 ] report that 71% of suspicious users were classified as bots; furthermore, 37% of them were suspended by Twitter a few months after the investigation. Several approaches can be found in the literature; however, they generally display some limitations. Some of them address only certain features of the diffusion of misinformation (bot detection, segregation of users due to their opinions, or other social analyses), or they lack a comprehensive framework for interpreting results. While the former limitation is somewhat explained by the novelty of the research field, the latter showcases a more fundamental need: without strict statistical validation, it is hard to state which are the crucial elements that permit a well-grounded description of a system. To counter the diffusion of fake news, building a comprehensive fake news dataset providing all information about publishers, shared contents, and the engagements of users over space and time, together with their profile histories, can help the development of innovative and effective learning models. Unsupervised and supervised methods will work together to identify misleading information. Multidisciplinary teams made up of journalists, linguists, behavioral scientists, and similar professionals will be needed to identify what amounts to information warfare campaigns. Cyberwarfare and information warfare will be two of the biggest threats the world will face in the 21st century.
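
The “digital DNA” idea mentioned above can be illustrated with a small, hypothetical sketch: each user’s timeline is encoded as a string of action codes, and pairs of users sharing unusually long common behavioral substrings are flagged. The alphabet, accounts, and threshold below are invented for illustration and are not the actual method of [ 11 , 12 ].

```python
# Hedged sketch of the "digital DNA" idea: encode each user's timeline as a
# string of action codes and flag pairs of users sharing suspiciously long
# common substrings (alphabet, timelines, and threshold are illustrative).
from difflib import SequenceMatcher

ACTION_CODES = {"tweet": "A", "retweet": "C", "reply": "T"}  # illustrative alphabet

def dna(actions):
    """Turn a sequence of action labels into a behavioral string."""
    return "".join(ACTION_CODES[a] for a in actions)

def longest_common_substring(s1, s2):
    match = SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
    return match.size

# Hypothetical timelines: the first two users behave almost identically.
users = {
    "u1": dna(["retweet"] * 20 + ["tweet"]),
    "u2": dna(["retweet"] * 20 + ["reply"]),
    "u3": dna(["tweet", "reply", "retweet", "tweet", "reply", "tweet"]),
}

THRESHOLD = 15  # minimum shared behavior length considered suspicious (arbitrary)
for a in users:
    for b in users:
        if a < b and longest_common_substring(users[a], users[b]) >= THRESHOLD:
            print(f"{a} and {b} share a long behavioral pattern: possible coordinated bots")
```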

Social sensing methods collect data produced by digital citizens through either opportunistic or participatory crowd-sensing, depending on users’ awareness of their involvement. These approaches present a variety of technological and ethical challenges. An example is Twitter Monitor [ 10 ], a crowd-sensing tool designed to access Twitter streams through the Twitter Streaming API. It allows parallel listening processes to be launched for collecting different sets of data. Twitter Monitor is a tool for creating listening campaigns regarding relevant events such as political elections, natural and human-made disasters, popular national events, etc. [ 11 ]. These campaigns can be carried out by specifying keywords, accounts, and geographical areas of interest.

Nowcasting Footnote 5 financial and economic indicators shows the potential of data science as a proxy for well-being and socioeconomic applications. The development of innovative research methods has demonstrated that poverty indicators can be approximated by social and behavioral mobility metrics extracted from mobile phone data and GPS data [ 34 ], and that the Gross Domestic Product can be accurately nowcasted by using retail supermarket data [ 18 ]. Furthermore, nowcasting demographic aspects of a territory based on Twitter data [ 1 ] can support official statistics through the estimation of location, occupation, and semantics. Networks are a convenient way to represent the complex interactions among the elements of a large system. In economics, networks are gaining increasing attention because the underlying topology of a networked system affects the aggregate output, the propagation of shocks, or financial distress, and because the topology allows us to learn something about a node by looking at the properties of its neighbors. Among the most investigated financial and economic networks, we cite a work that analyzes interbank systems, payment networks between firms, bank-firm bipartite networks, and trading networks between investors [ 37 ]. Another interesting phenomenon is the advent of blockchain technology and the associated Bitcoin crypto-currency [ 31 ].
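
The nowcasting idea can be sketched in a few lines: an official indicator released with a delay is regressed on high-frequency proxy features that are already available. The sketch below uses synthetic data and a plain linear regression; it is only meant to convey the principle, not the methods of [ 18 ] or [ 34 ].

```python
# Minimal sketch of nowcasting: regress an official indicator on high-frequency
# retail features available before the official figure is released.
# All data here are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Quarterly supermarket features (e.g., total sales, basket diversity) and the
# official GDP growth figure, which is published with a delay.
retail_features = rng.normal(size=(40, 2))
gdp_growth = (0.8 * retail_features[:, 0]
              - 0.3 * retail_features[:, 1]
              + rng.normal(scale=0.1, size=40))

# Fit on the quarters for which the official figure is already known...
model = LinearRegression().fit(retail_features[:-1], gdp_growth[:-1])

# ...and nowcast the current quarter, whose official figure is not yet out.
print("nowcast for current quarter:", model.predict(retail_features[-1:]).round(3))
```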

Data science is an excellent opportunity for policy, data journalism, and marketing. The online media arena is now available as a real-time experimental society for understanding social mechanisms such as harassment, discrimination, hate, and fake news. In our vision, the use of data science approaches is necessary for better governance. These new approaches complement and change official statistics, representing a cheaper and more timely way of computing them. The impact of data-science-driven applications can be particularly significant when the applications help to build new infrastructures or new services for the population.

The availability of massive data portraying soccer performance has facilitated recent advances in soccer analytics. Rossi et al. [ 42 ] proposed an innovative machine learning approach to forecasting non-contact injuries for professional soccer players. In [ 3 ], we can find the definition of quantitative measures of pressing in defensive phases in soccer. Pappalardo et al. [ 33 ] outlined the automatic, data-driven evaluation of performance in soccer and a ranking system for soccer teams. Sports data science is attracting much interest and is now leading to the release of large public datasets of sports events.

Finally, data science has unveiled a shift from population statistics to statistics of interlinked entities connected by mutual interactions. This change of perspective reveals universal patterns underlying complex social, economic, technological, and biological systems. It is helpful for understanding the dynamics of how opinions, epidemics, or innovations spread in our society, as well as the mechanisms behind complex systemic diseases, such as cancer and metabolic disorders, revealing hidden relationships between them. Regarding diffusive models and dynamic networks, NDlib [ 40 ] is a Python package for the description, simulation, and observation of diffusion processes in complex networks. It collects diffusive models from epidemics and opinion dynamics and allows a scientist to compare simulations over synthetic systems. For community discovery, two tools are available for studying the structure of a community and understanding its habits: Demon [ 9 ] extracts ego networks (i.e., the set of nodes connected to an ego node) and identifies the real communities by adopting a democratic, bottom-up merging approach over such structures, while Tiles [ 41 ] is dedicated to dynamic network data and extracts overlapping communities, tracking their evolution in time following an online iterative procedure.
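
As an example of the kind of diffusion simulation NDlib supports, the following sketch follows the package’s documented usage to run an SIR epidemic on a synthetic network; the graph size and epidemic parameters are illustrative.

```python
# A minimal SIR simulation with NDlib, following the package's documented usage.
import networkx as nx
import ndlib.models.ModelConfig as mc
import ndlib.models.epidemics as ep

g = nx.erdos_renyi_graph(1000, 0.01)          # synthetic contact network

model = ep.SIRModel(g)
config = mc.Configuration()
config.add_model_parameter("beta", 0.01)      # infection probability (illustrative)
config.add_model_parameter("gamma", 0.005)    # recovery probability (illustrative)
config.add_model_parameter("fraction_infected", 0.05)
model.set_initial_status(config)

iterations = model.iteration_bunch(200)       # run 200 simulation steps
trends = model.build_trends(iterations)       # aggregate node counts over time
```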

2.2 Impact on industry and business

Data science can create an ecosystem of novel data-driven business opportunities. As a general trend across all sectors, massive quantities of data will be made accessible to everybody, allowing entrepreneurs to recognize and rank shortcomings in business processes and to spot potential threats and win-win situations. Ideally, every citizen could build new business ideas from these patterns. Co-creation enables data scientists to design innovative products and services. By sharing data of various nature and provenance, the value of joining different datasets becomes much larger than the sum of the values of the separate datasets.

The gains from data science are expected across all sectors, from industry and production to services and retail. In this context, we cite several macro-areas where data science applications are especially promising. In energy and environment, the digitization of energy systems (from production to distribution) enables the acquisition of real-time, high-resolution data. Coupled with other data sources, such as weather data, usage patterns, and market data, and accompanied by advanced analytics, efficiency levels can be increased immensely. The positive impact on the environment is also enhanced by geospatial data that help us understand how our planet and its climate are changing and confront major issues such as global warming, the preservation of species, and the role and effects of human activities.

The manufacturing and production sector, with its growing investments in Industry 4.0 and smart factories equipped with sensor-laden machinery that is both intelligent and networked (see Internet of Things and cyber-physical systems), will be one of the major producers of data in the world. The application of data science to this sector will bring efficiency gains and predictive maintenance. Entirely new business models are expected, since the mass production of individualized products becomes possible and consumers may gain direct influence and control over production.

As already stated in Sect. 2.1, data science will contribute to increasing efficiency in public administration processes and healthcare. Security will be enhanced in both the physical and the cyber domain. From financial fraud to public security, data science will contribute to establishing a framework that enables a safe and secure digital economy. Big data exploitation will open up opportunities for innovative, self-organizing ways of managing logistical business processes. Deliveries could be based on predictive monitoring, using data from stores, semantic product memories, internet forums, and weather forecasts, leading to both economic and environmental savings. Let us also consider the impact of personalized services for creating real experiences for tourists. The analysis of real-time and context-aware data (with the help of historical and cultural heritage data) will provide customized information to each tourist and will contribute to better and more efficient management of the whole tourism value chain.

3 Data science ethics

Data science creates great opportunities but also new risks. The use of advanced tools for data analysis could expose sensitive knowledge about individual persons and could invade their privacy. Data science approaches require access to digital records of personal activities that contain potentially sensitive information. Personal information can be used to discriminate against people based on their presumed characteristics. Data-driven algorithms yield classification and prediction models of behavioral traits of individuals, such as credit score, insurance risk, health status, personal preferences, and religious, ethnic, or political orientation, based on personal data disseminated in the digital environment by users (with, or often without, their awareness). The achievements of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. For example, mobile phone call records are initially collected by telecom operators for billing and operational aims, but they can be used for accurate and timely demography and human mobility analysis at a country or regional scale. This re-purposing of data clearly shows the importance of legal compliance and of data ethics technologies and safeguards that protect privacy and anonymity, secure data, engage users, avoid discrimination and misuse, and account for transparency, with the purpose of seizing the opportunities of data science while controlling the associated risks.

Several aspects should be considered to avoid harming individual privacy. Ethical elements include: (i) monitoring the compliance of experiments, research protocols, and applications with ethical and juridical standards; (ii) developing big data analytics and social mining tools with value-sensitive design and privacy-by-design methodologies; (iii) boosting the excellence and international competitiveness of Europe’s big data research in the safe and fair use of big data for research. It is essential to highlight that data scientists using personal and social data, also through infrastructures, have the responsibility to get acquainted with the fundamental ethical aspects of becoming a “data controller.” This aspect has to be considered when defining courses for informing and training data scientists about the responsibilities, the possibilities, and the boundaries they have in data manipulation.

Recalling Fig. 2, it is crucial to inject into the data science pipeline the ethical values of fairness: how to avoid unfair and discriminatory decisions; accuracy: how to provide reliable information; confidentiality: how to protect the privacy of the involved people; and transparency: how to make models and decisions comprehensible to all stakeholders. This value-sensitive design must aim at boosting widespread social acceptance of data science without inhibiting its power. Finally, it is essential to consider also the impact of the General Data Protection Regulation (GDPR) on (i) companies’ duties and how European companies should comply with the limits on data manipulation the Regulation requires; and (ii) researchers’ duties, highlighting articles and recitals that specifically mention and explain how research is intended in the GDPR’s legal system.

Fig. 3 The relationship between big and open data and how they relate to the broad concept of open government

We complete this section with another important aspect related to open data, i.e., accessible public data that people, companies, and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems. All definitions of open data include two features: (i) the data must be publicly available for anyone to use, and (ii) the data must be licensed in a way that allows reuse. All over the world, initiatives to make data open are being launched by government agencies and public organizations; listing them all is impossible, but one UN initiative must be mentioned: Global Pulse, Footnote 6 which is meant to implement a vision of a future in which big data is harnessed safely and responsibly as a public good.

Figure 3 shows the relationships between open data and big data. Currently, the problem is not only that government agencies (and some business companies) are collecting personal data about us, but also that we do not know what data are being collected and do not have access to the information about ourselves. As reported by the World Economic Forum in 2013, it is crucial to understand the value of personal data so that users can make informed decisions. A new branch of philosophy and ethics is emerging to handle issues related to personal data. On the one hand, in all cases where the data might be used for the social good (e.g., medical research, improvement of public transport, contrasting epidemics), understanding the value of personal data means correctly evaluating the balance between public benefit and personal loss of protection. On the other hand, when data are to be used for commercial purposes, this value might instead translate into a simple price for the personal information that the user might sell to a company for its business. In this context, discrimination discovery consists of searching for a-priori unknown contexts of suspected discrimination against protected-by-law social groups by analyzing datasets of historical decision records. Machine learning and data mining approaches may be affected by discriminatory rules, and these rules may be deeply hidden within obscure artificial intelligence models. Thus, discrimination discovery consists of understanding whether a predictive model makes direct or indirect discrimination. DCube [ 43 ] is a tool for data-driven discrimination discovery, a library of methods for fairness analysis.
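
As a flavor of what discrimination-discovery tooling automates, the following minimal sketch computes a basic demographic parity difference over a toy decision log; the data, groups, and threshold are invented for illustration and are far simpler than what DCube provides.

```python
# A minimal sketch of one basic fairness check (demographic parity difference),
# of the kind that discrimination-discovery toolkits automate in far more
# sophisticated ways. Data and threshold are illustrative.
import pandas as pd

decisions = pd.DataFrame({
    "group":   ["A", "A", "A", "B", "B", "B", "B", "A"],
    "granted": [1,    1,   0,   0,   0,   1,   0,   1],
})

rates = decisions.groupby("group")["granted"].mean()   # acceptance rate per group
disparity = rates.max() - rates.min()
print(rates)
print(f"demographic parity difference: {disparity:.2f}")

if disparity > 0.2:  # arbitrary threshold for illustration
    print("warning: decision rates differ substantially across groups")
```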

It is important to evaluate how a mining model or algorithm takes its decisions. The growing field of explainable machine learning provides, and continuously expands, a set of comprehensive tool-kits [ 21 ]. For example, X-Lib is a library containing state-of-the-art explanation methods organized within a hierarchical structure and wrapped in a uniform fashion so that they can be easily accessed and used by different users. The library provides support for explaining classification on tabular data and images and for explaining the logic of complex decision systems. X-Lib collects, among others, the following explanation methods: LIME [ 38 ], Anchor [ 39 ], and DeepExplain, which includes saliency maps [ 44 ], Gradient * Input, Integrated Gradients, and DeepLIFT [ 46 ]. The saliency package contains code for SmoothGrad [ 45 ], as well as implementations of several other saliency techniques: Vanilla Gradients, Guided Backpropagation, and Grad-CAM. Another improvement in this context is the use of robotics and AI in data preparation and curation, and in detecting bias in data, information, and knowledge, as well as the misuse and abuse of these assets when it comes to legal, privacy, and ethical issues, transparency, and trust. We cannot rely on human beings alone to do these tasks; we need to exploit the power of robotics and AI to help provide the protections required. Data and information lawyers will play a key role in legal and privacy issues, the ethical use of these assets, and the problem of bias in both the algorithms and the data, information, and knowledge used to develop analytics solutions. Finally, we can state that data science can help to fill the gap between legislators and technology.
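
As an example of one of the wrapped methods, the sketch below uses LIME’s public API to explain a single prediction of a black-box classifier on a standard tabular dataset; the dataset and model are illustrative and unrelated to X-Lib’s internals.

```python
# Hedged sketch: explaining one prediction of a black-box classifier with LIME.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
black_box = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    mode="classification",
)

# Local explanation: which features drive the prediction for one instance?
explanation = explainer.explain_instance(
    data.data[0], black_box.predict_proba, num_features=5)
print(explanation.as_list())
```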

4 Big data ecosystem: the role of research infrastructures

Research infrastructures (RIs) play a crucial role in the advent and development of data science. A social mining experiment exploits the main components of data science depicted in Fig. 1 (i.e., data, infrastructures, analytical methods) to enable multidisciplinary scientists and innovators to extract knowledge and to make the experiment reusable by the scientific community and by innovators, thus providing an impact on science and society.

Resources such as data and methods help domain and data scientists to transform a research or innovation question into a responsible data-driven analytical process. This process is executed on the platform, thus supporting experiments that yield scientific output, policy recommendations, or innovative proofs-of-concept. Furthermore, the stewardship of an operational ethical board is a critical factor in the success of an RI.

An infrastructure typically offers easy-to-use means to define complex analytical processes and workflows, thus bridging the gap between domain experts and analytical technology. In many instances, domain experts may become a reference for their scientific communities, thus facilitating the engagement of new users within the RI activities. As a collateral feedback effect, experiments generate new relevant data, methods, and workflows that can be integrated into the platform by data scientists, contributing to the expansion of the RI’s resources. An experiment designed in a node of the RI and executed on the platform returns its results to the entire RI community.

Well-defined thematic environments amplify the achievements of new experiments towards the vertical scientific communities (and potential stakeholders) by activating appropriate dissemination channels.

4.1 The SoBigData Research Infrastructure

The SoBigData Research Infrastructure Footnote 7 is an ecosystem of human and digital resources, comprising data scientists, analytics, and processes. As shown in Fig. 4, SoBigData is designed to enable multidisciplinary scientists and innovators to carry out social mining experiments and to make them reusable by the scientific communities. All the components have been introduced to implement data science from raw data management to knowledge extraction, with particular attention to legal and ethical aspects, as reported in Fig. 1. SoBigData supports data science serving a cross-disciplinary community of data scientists studying all the elements of societal complexity from a data- and model-driven perspective.

Currently, SoBigData includes scientific, industrial, and other stakeholders. In particular, our stakeholders are data analysts and researchers (35.6%), followed by companies (33.3%) and policy and lawmakers (20%). The following sections provide a short but comprehensive overview of the services provided by the SoBigData RI, with special attention to supporting ethical and open data science [ 15 , 16 ].

4.1.1 Resources, facilities, and access opportunities

Over the past decade, Europe has developed world-leading expertise in building and operating e-infrastructures. These are large-scale, federated, and distributed online research environments through which researchers can share access to scientific resources (including data, instruments, computing, and communications), regardless of their location. They are meant to support unprecedented scales of international collaboration in science, both within and across disciplines, investing in economies of scale and common behavior, policies, best practices, and standards. They shape a common environment where scientists can create, validate, assess, compare, and share the digital results of science, such as research data and research methods, using a common “digital laboratory” consisting of agreed-upon services and tools.

Fig. 4 The SoBigData Research Infrastructure: an ecosystem of human and digital resources, comprising data scientists, analytical methods, and processes. SoBigData enables multidisciplinary scientists and innovators to carry out experiments and to make them reusable by the community

However, the implementation of workflows, possibly following the Open Science principles of reproducibility and transparency, is hindered by a multitude of real-world problems. One of the most prominent is that the e-infrastructures available to research communities today are far from being well-designed and consistent digital laboratories, neatly designed to share and reuse resources according to common policies, data models, standards, language platforms, and APIs. They are instead “patchworks of systems,” assembling online tools, services, and data sources and evolving to match the requirements of the scientific process and to include new solutions. This degree of heterogeneity excludes the adoption of uniform workflow management systems, standard service-oriented approaches, and routine monitoring and accounting methods. Scientific workflows are typically realized by writing ad hoc code, manipulating data on desktops, alternating the execution of online web services, and sharing software libraries implementing research methods in different languages, desktop tools, and web-accessible execution engines (e.g., Taverna, Knime, Galaxy).

The SoBigData e-infrastructure is based on D4Science services, which provide researchers and practitioners with a working environment where open science practices are transparently promoted and data science practices can be implemented while minimizing the technological integration costs highlighted above.

D4Science is a deployed instance of the gCube Footnote 8 technology [ 4 ], a software system conceived to facilitate the integration of web services, code, and applications as resources of different types in a common framework, which in turn enables the construction of Virtual Research Environments (VREs) [ 7 ] as combinations of such resources (Fig. 5). As there is no common framework that is trusted and sustained enough to convince resource providers that converging to it would be a worthwhile effort, D4Science implements a “system of systems.” In such a framework, resources are integrated at minimal cost, to gain in scalability, performance, accounting, provenance tracking, seamless integration with other resources, and visibility to all scientists. The principle is that the cost of “participation” in the framework falls on the infrastructure rather than on the resource providers. The infrastructure provides the necessary bridges to include and combine resources that would otherwise be incompatible.

Fig. 5 D4Science: resources from external systems, virtual research environments, and communities

More specifically, via D4Science, SoBigData scientists can integrate and share resources such as datasets, research methods, web services via APIs, and web applications via portlets. Resources can then be integrated, combined, and accessed via VREs, intended as web-based working environments tailored to support the needs of their designated communities, each working on a research question. Research methods are integrated as executable code implementing WPS APIs in different programming languages (e.g., Java, Python, R, Knime, Galaxy), which can be executed via the Data Miner analytics platform in parallel, transparently to the users, over powerful and extensible clusters, and via simple VRE user interfaces. Scientists using Data Miner in the context of a VRE can select and execute the available methods and share the results with other scientists, who can repeat or reproduce the experiment with a simple click.

D4Science VREs are equipped with core services supporting data analysis and collaboration among their users: (i) a shared workspace to store and organize any version of a research artifact; (ii) a social networking area to discuss any topic (including working versions and released artifacts) and be informed of happenings; (iii) a Data Miner analytics platform to execute processing tasks (research methods), either natively provided by VRE users or borrowed from other VREs, to be applied to VRE users’ cases and datasets; and (iv) a catalogue-based publishing platform to make the existence of a certain artifact public and disseminated. Scientists operating within VREs use such facilities continuously and transparently track the record of their research activities (actions, authorship, provenance), as well as the products and links between them (lineage) resulting from every phase of the research life cycle, thus facilitating the publishing of science according to the Open Science principles of transparency and reproducibility [ 5 ].

Today, SoBigData integrates the resources listed in Table 1. By means of such resources, SoBigData scientists have created VREs to deliver the so-called SoBigData exploratories: Explainable Machine Learning, Sports Data Science, Migration Studies, Societal Debates, Well-being & Economy, and City of Citizens. Each exploratory includes the resources required to perform data science workflows in a controlled and shared environment. Resources range from data to methods, described in more detail in the following, together with their exploitation within the exploratories.

All the resources and instruments integrated into the SoBigData RI are structured in such a way as to operate within the confines of current data protection law, with a focus on the General Data Protection Regulation (GDPR), and of the ethical analysis of the fundamental values involved in social mining and AI. Each item in the catalogue has specific fields for managing ethical issues (e.g., whether a dataset contains personal information) and fields for describing and managing intellectual property.

4.1.2 Data resources: social mining and big data ecosystem

The SoBigData RI defines policies supporting users in the collection, description, preservation, and sharing of their data sets. It implements data science by making such data available for collaborative research through various strategies, ranging from sharing open data sets with the scientific community at large to sharing data with disclosure restrictions that allow data access only within secure environments.

Several big data sets are available through the SoBigData RI, including network graphs from mobile phone call data; networks crawled from many online social networks, including Facebook and Flickr; transaction micro-data from diverse retailers; query logs both from search engines and e-commerce; society-wide mobile phone call data records; GPS tracks from personal navigation devices; survey data about customer satisfaction or market research; extensive web archives; billions of tweets; and data from location-aware social networks.

4.1.3 Data science through SoBigData exploratories

Exploratories are thematic environments built on top of the SoBigData RI. An exploratory binds datasets with social mining methods, providing the research context for supporting specific data science applications by: (i) providing the scientific context for performing the application, which can be considered a container binding specific methods, applications, services, and datasets; (ii) stimulating communities on the effectiveness of the analytical process related to the analysis, promoting scientific dissemination, result sharing, and reproducibility. The use of exploratories promotes the effectiveness of data science through research infrastructure services. The following sections report a short description of the six SoBigData exploratories. Figure 6 shows the main thematic areas covered by each exploratory. Due to its nature, the Explainable Machine Learning exploratory can be applied to every sector where a black-box machine learning approach is used. The list of exploratories (and the data and methods inside them) is updated continuously and continues to grow over time. Footnote 9

Fig. 6 SoBigData covers six thematic areas, listed horizontally. Each exploratory covers more than one thematic area

City of citizens. This exploratory aims to collect data science applications and methods related to geo-referenced data describing the movements of citizens in a city, a territory, or an entire region. The scientific literature offers several studies and methods that employ a wide variety of data sources to build models of people’s mobility and city characteristics [ 30 , 32 ]. Like ecosystems, cities are open systems that live and develop by means of flows of energy, matter, and information. What distinguishes a city from a colony is the human component (i.e., the process of transformation by cultural and technological evolution). Through this combination, cities are evolutionary systems that develop and co-evolve continuously with their inhabitants [ 24 ]. Cities are kaleidoscopes of information generated by a myriad of digital devices woven into the urban fabric. The inclusion of tracking technologies in personal devices has enabled the analysis of large sets of mobility data such as GPS traces and call detail records.

Data science applied to human mobility is one of the critical topics investigated in SoBigData, thanks to the decade-long experience of its partners in European projects. The study of human mobility has led to the integration into SoBigData of unique Global Positioning System (GPS) and call detail record (CDR) datasets of people and vehicle movements, and of geo-referenced social network data, as well as several mobility services: O/D (origin-destination) matrix computation, Urban Mobility Atlas Footnote 10 (a visual interface to city mobility patterns), GeoTopics Footnote 11 (for exploring patterns of urban activity from Foursquare), and predictive models such as MyWay Footnote 12 (trajectory prediction) and TripBuilder Footnote 13 (which helps tourists build personalized tours of a city). In human mobility, research questions come from geographers, urbanists, complexity scientists, data scientists, policymakers, and big data providers, as well as innovators aiming to provide applications for any service in the smart city ecosystem. A further goal is to investigate the impact of political events on the well-being of citizens. This exploratory supports the development of “happiness” and “peace” indicators through a text mining/opinion mining pipeline applied to repositories of online news. These indicators reveal that the level of crime in a territory can be well approximated by analyzing the news related to that territory. More generally, we study the impact of the economy on well-being and vice versa, e.g., considering that the propagation of shocks of financial distress in an economic or financial system crucially depends on the topology of the network interconnecting the different elements.
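
Among the mobility services listed above, O/D matrix computation is the simplest to illustrate: the sketch below counts trips between pairs of zones from a toy trip table; the zoning and column names are hypothetical.

```python
# A minimal sketch of an origin-destination (O/D) matrix computation from trip
# records; input columns and zones are hypothetical.
import pandas as pd

trips = pd.DataFrame({
    "user":        ["u1", "u1", "u2", "u3", "u3"],
    "origin_zone": ["Z1", "Z2", "Z1", "Z3", "Z1"],
    "dest_zone":   ["Z2", "Z1", "Z3", "Z1", "Z2"],
})

# Count trips between every pair of zones to obtain the O/D matrix.
od_matrix = (trips.groupby(["origin_zone", "dest_zone"])
                  .size()
                  .unstack(fill_value=0))
print(od_matrix)
```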

Well-being and economy. This exploratory tests the hypothesis that well-being is correlated with the business performance of companies. The idea is to combine statistical methods and traditional economic data (typically at low frequency) with high-frequency data from non-traditional sources, such as the web and supermarkets, for nowcasting economic, socioeconomic, and well-being indicators. These indicators allow us to study and measure real-life costs by studying price variation and socioeconomic status inference. Furthermore, this activity supports studies on the correlation between people’s well-being and their social and mobility data. In this context, some basic hypotheses can be summarized as follows: (i) there are curves of age- and gender-based segregation distribution in company boards that are characteristic of the mean credit risk of companies in a region; (ii) a low mean credit risk of companies in a region correlates positively with well-being; (iii) systemic risk correlates highly with well-being indices at a national level. The final aim is to provide national governments with a set of guidelines, methods, and indices for decision making on regulations affecting companies, in order to improve well-being in the country, also considering effective policies to reduce operational risks such as credit risk and external threats to companies [ 17 ].

Big data, analyzed through the lenses of data science, provides means to understand our complex socioeconomic and financial systems. On the one hand, this offers new opportunities to measure the patterns of well-being and poverty at a local and global scale, empowering governments and policymakers with the unprecedented opportunity to nowcast relevant economic quantities and compare different countries, regions, and cities. On the other hand, this allows us to investigate the networks underlying the complex systems of economy and finance and how they affect aggregate output, the propagation of shocks or financial distress, and systemic risk.

Societal debates. This exploratory employs data science approaches to answer research questions such as: who is participating in public debates? What is the “big picture” response from citizens to a policy, election, referendum, or other political event? This kind of analysis allows scientists, policymakers, and citizens to understand the online discussion surrounding polarized debates [ 14 ]. The personal perception of online discussions on social media is often biased by the so-called filter bubble, in which automatic curation of content and relationships between users negatively affects the diversity of opinions available to them. A complete analysis of online polarized debates enables citizens to be better informed and prepared for political outcomes. By analyzing content and conversations on social media and newspaper articles, data scientists study public debates and also assess public sentiment around debated topics, opinion diffusion dynamics, echo chamber formation and polarized discussions, fake news, and propaganda bots. Misinformation is often the result of a distorted perception of concepts that, although unrelated, suddenly appear together in the same narrative. Understanding the details of this process at an early stage may help to prevent the birth and diffusion of fake news. The fight against misinformation includes the development of dynamical models of misinformation diffusion (possibly in contrast to the spread of mainstream news) as well as models of how attention cycles are accelerated and amplified by the infrastructures of online media.

Another important topic covered by this exploratory concerns the analysis of how social bot activity affects fake news diffusion. Determining whether a user account is controlled by a human or a bot is a complex task. To the best of our knowledge, the only openly accessible solution to detect social bots is Botometer, an API that allows us to interact with an underlying machine learning system. Although Botometer has proven to be largely accurate in detecting social bots, it has limitations due to the features of the Twitter API; hence, an algorithm overcoming the barriers of current recipes is needed.

The resources related to the Societal Debates exploratory, especially in the domain of media ecology and the fight against misinformation online, provide easy-to-use services to public bodies, media outlets, and social/political scientists. Furthermore, SoBigData supports new simulation models and experimental processes to validate in vivo the algorithms for fighting misinformation, curbing the pathological acceleration and amplification of online attention cycles, breaking the bubbles, and exploring alternative media and information ecosystems.

Migration studies. Data science is also useful to understand the migration phenomenon. Knowledge about the number of immigrants living in a particular region is crucial to devise policies that maximize the benefits for both locals and immigrants. These numbers can vary rapidly in space and time, especially in periods of crisis such as wars or natural disasters.

This exploratory provides a set of data and tools for trying to answer some questions about migration flows. Through this exploratory, a data scientist can study economic models of migration and observe how migrants choose their destination countries. A scientist can discover what “opportunities” a country provides to migrants and whether there are correlations between the number of incoming migrants and the opportunities in the host countries [ 8 ]. Furthermore, this exploratory tries to understand how the public perception of migration is changing, using opinion mining analysis. For example, social network analysis enables us to analyze a migrant’s social network and discover the structure of the social network of people who decided to start a new life in a different country [ 28 ].

Finally, we can also evaluate current integration indices based on official statistics and survey data, which can be complemented by big data sources. This exploratory aims to build combined integration indexes that take into account multiple data sources to evaluate integration on various levels. Such integration includes mobile phone data to understand patterns of communication between immigrants and natives; social network data to assess sentiment towards immigrants and immigration; professional network data (such as LinkedIn) to understand labor market integration; and local data to understand to what extent moving across borders is associated with a change in the cultural norms of the migrants. These indexes are fundamental to evaluate the overall social and economic effects of immigration. The new integration indexes can be applied at various space and time resolutions (small area methods) to obtain a complete picture of integration and complement official indexes.

Sports data science. The proliferation of new sensing technologies providing high-fidelity data streams from every game is changing the way scientists, fans, and practitioners conceive of sports performance. The combination of these (big) data with the tools of data science makes it possible to unveil the complex models underlying sports performance and enables many challenging tasks: from automatic tactical analysis to data-driven performance ranking, game outcome prediction, and injury forecasting. The idea is to foster research on sports data science in several directions. The application of explainable AI and deep learning techniques can be hugely beneficial to sports data science. For example, by using adversarial learning, we can modify the training plans of players associated with high injury risk and develop training plans that maximize the fitness of players (minimizing their injury risk). Gaming, simulation, and modeling are another set of tools that coaching staff can use to test tactics against a competitor. Furthermore, by using deep learning on time series, we can forecast the evolution of players’ performance and search for young talents.
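
The injury-forecasting task can be sketched as a supervised classification problem on workload features; the code below uses synthetic data and a generic classifier, and is not the method of Rossi et al. [ 42 ].

```python
# Hedged sketch of the general idea behind injury forecasting: classify upcoming
# injury risk from recent training-workload features. Data are synthetic and
# class imbalance is handled naively via class weights.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
n = 500
# Features: e.g., weekly distance, high-speed running, acute:chronic load ratio.
X = rng.normal(size=(n, 3))
# Synthetic rule: a high acute:chronic ratio raises injury probability.
y = (X[:, 2] + rng.normal(scale=0.5, size=n) > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
clf = RandomForestClassifier(class_weight="balanced", random_state=1)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```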

This exploratory examines the factors influencing sports success and how to build simulation tools for boosting both individual and collective performance. Furthermore, it describes performances by means of data, statistics, and models, allowing coaches, fans, and practitioners to understand (and boost) sports performance [ 42 ].

Explainable machine learning. Artificial intelligence, increasingly based on big data analytics, is a disruptive technology of our times. This exploratory provides a forum for studying the effects of AI on the future society. In this context, SoBigData studies the future of labor and the workforce, also through data- and model-driven analysis, simulations, and the development of methods that construct human-understandable explanations of AI black-box models [ 20 ].

Black-box systems for automated decision making map a user’s features into a class that predicts the behavioral traits of individuals, such as credit risk or health status, without exposing the reasons why. Most of the time, the internal reasoning of these algorithms is obscure even to their developers. For this reason, the last decade has witnessed the rise of a black-box society. This exploratory is developing a set of techniques and tools that allow data analysts to understand why an algorithm produces a given decision. These approaches are designed not only for discovering a lack of transparency but also for discovering possible biases inherited by the algorithms from human prejudices and artefacts hidden in the training data (which may lead to unfair or wrong decisions) [ 35 ].

5 Conclusions: individual and collective intelligence

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [ 23 ]. Since 2012, 2.5 exabytes (\(2.5 \times 10^{18}\) bytes) of data have been created every day; as of 2014, 2.3 zettabytes (\(2.3 \times 10^{21}\) bytes) of data were generated every day by super-power high-tech corporations worldwide. Soon zettabytes of useful public and private data will be widely and openly available. In the coming years, smart applications such as smart grids, smart logistics, smart factories, and smart cities will be widely deployed across the continent and beyond. Ubiquitous broadband access, mobile technology, social media, services, and the Internet of Things on billions of devices will have contributed to the explosion of generated data, to a total global estimate of 40 zettabytes.

In this work, we have introduced data science as a new challenge and opportunity for the coming years. In this context, we have tried to summarize concisely several aspects of data science applications and their impacts on society, considering both the new services available and the new job perspectives. We have also introduced the issues involved in managing data that represent human behavior and shown how difficult it is to preserve personal information and privacy. With the introduction of the SoBigData RI and its exploratories, we have provided virtual environments where it is possible to understand the potential of data science in different research contexts.

Concluding, we can state that social dilemmas occur when there is a conflict between individual and public interest. Such problems also appear in the ecosystem of distributed AI systems (based on data science tools) and humans, with additional difficulties due, on the one hand, to the relative rigidity of trained AI systems and the necessity of achieving social benefit and, on the other hand, to the necessity of keeping individuals interested. What are the principles and solutions for individual versus social optimization using AI, and how can an optimal balance be achieved? The answer is still open, but these complex systems have to work toward fulfilling collective goals and requirements, with the challenge that human needs change over time and move from one context to another. Every AI system should operate within an ethical and social framework in an understandable, verifiable, and justifiable way. Such systems must, in any case, work within the bounds of the rule of law, incorporating the protection of fundamental rights into the AI infrastructure. In other words, the challenge is to develop mechanisms that lead the system to converge to an equilibrium that complies with European values and social objectives (e.g., social inclusion) but without unnecessary losses of efficiency.

Interestingly, data science can play a vital role in enhancing desirable behaviors in the system, e.g., by supporting coordination and cooperation, which is, more often than not, crucial to achieving any meaningful improvement. Our ultimate goal is to build the blueprint of a sociotechnical system in which AI not only cooperates with humans but, if necessary, helps them to learn how to collaborate, as well as other desirable behaviors. In this context, it is also essential to understand how to achieve robustness of human and AI ecosystems with respect to various types of malicious behavior, such as abuse of power and exploitation of AI technical weaknesses.

We conclude by paraphrasing Stephen Hawking in his Brief Answers to the Big Questions: the availability of data on its own will not take humanity to the future, but its intelligent and creative use will.

http://www.sdss3.org/collaboration/ .

e.g., https://www.nature.com/sdata/policies/repositories .

Responsible Data Science program: https://redasci.org/ .

https://emergency.copernicus.eu/ .

Nowcasting in economics is the prediction of the present, the very near future, and the very recent past state of an economic indicator.

https://www.unglobalpulse.org/ .

http://sobigdata.eu .

https://www.gcube-system.org/ .

https://sobigdata.d4science.org/catalogue-sobigdata .

http://www.sobigdata.eu/content/urban-mobility-atlas .

http://data.d4science.org/ctlg/ResourceCatalogue/geotopics_-_a_method_and_system_to_explore_urban_activity .

http://data.d4science.org/ctlg/ResourceCatalogue/myway_-_trajectory_prediction .

http://data.d4science.org/ctlg/ResourceCatalogue/tripbuilder .

Abitbol, J.L., Fleury, E., Karsai, M.: Optimal proxy selection for socioeconomic status inference on Twitter. Complexity 2019, Article ID 6059673 (2019). https://doi.org/10.1155/2019/6059673


Amato, G., Candela, L., Castelli, D., Esuli, A., Falchi, F., Gennaro, C., Giannotti, F., Monreale, A., Nanni, M., Pagano, P., Pappalardo, L., Pedreschi, D., Pratesi, F., Rabitti, F., Rinzivillo, S., Rossetti, G., Ruggieri, S., Sebastiani, F., Tesconi, M.: How data mining and machine learning evolved from relational data base to data science. In: Flesca, S., Greco, S., Masciari, E., Saccà, D. (eds.) A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, Studies in Big Data, vol. 31, pp. 287–306. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-61893-7_17


Andrienko, G.L., Andrienko, N.V., Budziak, G., Dykes, J., Fuchs, G., von Landesberger, T., Weber, H.: Visual analysis of pressure in football. Data Min. Knowl. Discov. 31 (6), 1793–1839 (2017). https://doi.org/10.1007/s10618-017-0513-2


Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Marioli, V., Pagano, P., Panichi, G., Perciante, C., Sinibaldi, F.: The gcube system: delivering virtual research environments as-a-service. Future Gener. Comput. Syst. 95 , 445–453 (2019). https://doi.org/10.1016/j.future.2018.10.035

Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Pagano, P., Panichi, G., Sinibaldi, F.: Enacting open science by d4science. Future Gener. Comput. Syst. (2019). https://doi.org/10.1016/j.future.2019.05.063

Barabasi, A.L., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nature reviews. Genetics 12 , 56–68 (2011). https://doi.org/10.1038/nrg2918

Candela, L., Castelli, D., Pagano, P.: Virtual research environments: an overview and a research agenda. Data Sci. J. 12 , GRDI75–GRDI81 (2013). https://doi.org/10.2481/dsj.GRDI-013

Coletto, M., Esuli, A., Lucchese, C., Muntean, C.I., Nardini, F.M., Perego, R., Renso, C.: Sentiment-enhanced multidimensional analysis of online social networks: perception of the mediterranean refugees crisis. In: Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM’16, pp. 1270–1277. IEEE Press, Piscataway, NJ, USA (2016). http://dl.acm.org/citation.cfm?id=3192424.3192657

Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Uncovering hierarchical and overlapping communities with a local-first approach. TKDD 9 (1), 6:1–6:27 (2014). https://doi.org/10.1145/2629511

Cresci, S., Minutoli, S., Nizzoli, L., Tardelli, S., Tesconi, M.: Enriching digital libraries with crowdsensed data. In: P. Manghi, L. Candela, G. Silvello (eds.) Digital Libraries: Supporting Open Science—15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, 31 Jan–1 Feb 2019, Proceedings, Communications in Computer and Information Science, vol. 988, pp. 144–158. Springer (2019). https://doi.org/10.1007/978-3-030-11226-4_12

Cresci, S., Petrocchi, M., Spognardi, A., Tognazzi, S.: Better safe than sorry: an adversarial approach to improve social bot detection. In: P. Boldi, B.F. Welles, K. Kinder-Kurlanda, C. Wilson, I. Peters, W.M. Jr. (eds.) Proceedings of the 11th ACM Conference on Web Science, WebSci 2019, Boston, MA, USA, June 30–July 03, 2019, pp. 47–56. ACM (2019). https://doi.org/10.1145/3292522.3326030

Cresci, S., Pietro, R.D., Petrocchi, M., Spognardi, A., Tesconi, M.: Social fingerprinting: detection of spambot groups through dna-inspired behavioral modeling. IEEE Trans. Dependable Sec. Comput. 15 (4), 561–576 (2018). https://doi.org/10.1109/TDSC.2017.2681672

Furletti, B., Trasarti, R., Cintia, P., Gabrielli, L.: Discovering and understanding city events with big data: the case of rome. Information 8 (3), 74 (2017). https://doi.org/10.3390/info8030074

Garimella, K., De Francisci Morales, G., Gionis, A., Mathioudakis, M.: Reducing controversy by connecting opposing views. In: Proceedings of the 10th ACM International Conference on Web Search and Data Mining, WSDM’17, pp. 81–90. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018703

Giannotti, F., Trasarti, R., Bontcheva, K., Grossi, V.: Sobigdata: social mining & big data ecosystem. In: P. Champin, F.L. Gandon, M. Lalmas, P.G. Ipeirotis (eds.) Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon , France, April 23–27, 2018, pp. 437–438. ACM (2018). https://doi.org/10.1145/3184558.3186205

Grossi, V., Rapisarda, B., Giannotti, F., Pedreschi, D.: Data science at sobigdata: the european research infrastructure for social mining and big data analytics. I. J. Data Sci. Anal. 6 (3), 205–216 (2018). https://doi.org/10.1007/s41060-018-0126-x

Grossi, V., Romei, A., Ruggieri, S.: A case study in sequential pattern mining for it-operational risk. In: W. Daelemans, B. Goethals, K. Morik (eds.) Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Antwerp, Belgium, 15–19 Sept 2008, Proceedings, Part I, Lecture Notes in Computer Science, vol. 5211, pp. 424–439. Springer (2008). https://doi.org/10.1007/978-3-540-87479-9_46

Guidotti, R., Coscia, M., Pedreschi, D., Pennacchioli, D.: Going beyond GDP to nowcast well-being using retail market data. In: A. Wierzbicki, U. Brandes, F. Schweitzer, D. Pedreschi (eds.) Advances in Network Science—12th International Conference and School, NetSci-X 2016, Wroclaw, Poland, 11–13 Jan 2016, Proceedings, Lecture Notes in Computer Science, vol. 9564, pp. 29–42. Springer (2016). https://doi.org/10.1007/978-3-319-28361-6_3

Guidotti, R., Monreale, A., Nanni, M., Giannotti, F., Pedreschi, D.: Clustering individual transactional data for masses of users. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 Aug 2017, pp. 195–204. ACM (2017). https://doi.org/10.1145/3097983.3098034

Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. 51 (5), 93:1–93:42 (2019). https://doi.org/10.1145/3236009

Guidotti, R., Monreale, A., Turini, F., Pedreschi, D., Giannotti, F.: A survey of methods for explaining black box models. CoRR abs/1802.01933 (2018). arxiv: 1802.01933

Guidotti, R., Nanni, M., Rinzivillo, S., Pedreschi, D., Giannotti, F.: Never drive alone: boosting carpooling with network analysis. Inf. Syst. 64 , 237–257 (2017). https://doi.org/10.1016/j.is.2016.03.006

Hilbert, M., Lopez, P.: The world’s technological capacity to store, communicate, and compute information. Science 332 (6025), 60–65 (2011)

Kennedy, C.A., Stewart, I., Facchini, A., Cersosimo, I., Mele, R., Chen, B., Uda, M., Kansal, A., Chiu, A., Kim, K.g., Dubeux, C., Lebre La Rovere, E., Cunha, B., Pincetl, S., Keirstead, J., Barles, S., Pusaka, S., Gunawan, J., Adegbile, M., Nazariha, M., Hoque, S., Marcotullio, P.J., González Otharán, F., Genena, T., Ibrahim, N., Farooqui, R., Cervantes, G., Sahin, A.D., : Energy and material flows of megacities. Proc. Nat. Acad. Sci. 112 (19), 5985–5990 (2015). https://doi.org/10.1073/pnas.1504315112

Korjani, S., Damiano, A., Mureddu, M., Facchini, A., Caldarelli, G.: Optimal positioning of storage systems in microgrids based on complex networks centrality measures. Sci. Rep. (2018). https://doi.org/10.1038/s41598-018-35128-6

Lorini, V., Castillo, C., Dottori, F., Kalas, M., Nappo, D., Salamon, P.: Integrating social media into a pan-european flood awareness system: a multilingual approach. In: Z. Franco, J.J. González, J.H. Canós (eds.) Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management, València, Spain, 19–22 May 2019. ISCRAM Association (2019). http://idl.iscram.org/files/valeriolorini/2019/1854-_ValerioLorini_etal2019.pdf

Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., Ricci, L.: Scalable and flexible clustering solutions for mobile phone-based population indicators. Int. J. Data Sci. Anal. 4 (4), 285–299 (2017). https://doi.org/10.1007/s41060-017-0065-y

Moise, I., Gaere, E., Merz, R., Koch, S., Pournaras, E.: Tracking language mobility in the twitter landscape. In: C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R.A. Baeza-Yates, Z. Zhou, X. Wu (eds.) IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, 12–15 Dec 2016, Barcelona, Spain., pp. 663–670. IEEE Computer Society (2016). https://doi.org/10.1109/ICDMW.2016.0099

Nanni, M.: Advancements in mobility data analysis. In: F. Leuzzi, S. Ferilli (eds.) Traffic Mining Applied to Police Activities—Proceedings of the 1st Italian Conference for the Traffic Police (TRAP-2017), Rome, Italy, 25–26 Oct 2017, Advances in Intelligent Systems and Computing, vol. 728, pp. 11–16. Springer (2017). https://doi.org/10.1007/978-3-319-75608-0_2

Nanni, M., Trasarti, R., Monreale, A., Grossi, V., Pedreschi, D.: Driving profiles computation and monitoring for car insurance crm. ACM Trans. Intell. Syst. Technol. 8 (1), 14:1–14:26 (2016). https://doi.org/10.1145/2912148

Pappalardo, G., di Matteo, T., Caldarelli, G., Aste, T.: Blockchain inefficiency in the bitcoin peers network. EPJ Data Sci. 7 (1), 30 (2018). https://doi.org/10.1140/epjds/s13688-018-0159-3

Pappalardo, L., Barlacchi, G., Pellungrini, R., Simini, F.: Human mobility from theory to practice: Data, models and applications. In: S. Amer-Yahia, M. Mahdian, A. Goel, G. Houben, K. Lerman, J.J. McAuley, R.A. Baeza-Yates, L. Zia (eds.) Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019., pp. 1311–1312. ACM (2019). https://doi.org/10.1145/3308560.3320099

Pappalardo, L., Cintia, P., Ferragina, P., Massucco, E., Pedreschi, D., Giannotti, F.: Playerank: data-driven performance evaluation and player ranking in soccer via a machine learning approach. ACM TIST 10 (5), 59:1–59:27 (2019). https://doi.org/10.1145/3343172

Pappalardo, L., Vanhoof, M., Gabrielli, L., Smoreda, Z., Pedreschi, D., Giannotti, F.: An analytical framework to nowcast well-being using mobile phone data. CoRR abs/1606.06279 (2016). arxiv: 1606.06279

Pasquale, F.: The Black Box Society: The Secret Algorithms That Control Money and Information. Harvard University Press, Cambridge (2015)


Piškorec, M., Antulov-Fantulin, N., Miholić, I., Šmuc, T., Šikić, M.: Modeling peer and external influence in online social networks: Case of 2013 referendum in croatia. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds.) Complex Networks & Their Applications VI. Springer, Cham (2018)


Ranco, G., Aleksovski, D., Caldarelli, G., Mozetic, I.: Investigating the relations between twitter sentiment and stock prices. CoRR abs/1506.02431 (2015). arxiv: 1506.02431

Ribeiro, M.T., Singh, S., Guestrin, C.: “why should I trust you?”: Explaining the predictions of any classifier. In: B. Krishnapuram, M. Shah, A.J. Smola, C.C. Aggarwal, D. Shen, R. Rastogi (eds.) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 Aug 2016, pp. 1135–1144. ACM (2016). https://doi.org/10.1145/2939672.2939778

Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: High-precision model-agnostic explanations. In: S.A. McIlraith, K.Q. Weinberger (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, 2–7 Feb 2018, pp. 1527–1535. AAAI Press (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/-paper/view/16982

Rossetti, G., Milli, L., Rinzivillo, S., Sîrbu, A., Pedreschi, D., Giannotti, F.: Ndlib: a python library to model and analyze diffusion processes over complex networks. Int. J. Data Sci. Anal. 5 (1), 61–79 (2018). https://doi.org/10.1007/s41060-017-0086-6

Rossetti, G., Pappalardo, L., Pedreschi, D., Giannotti, F.: Tiles: an online algorithm for community discovery in dynamic social networks. Mach. Learn. 106 (8), 1213–1241 (2017). https://doi.org/10.1007/s10994-016-5582-8

Rossi, A., Pappalardo, L., Cintia, P., Fernández, J., Iaia, M.F., Medina, D.: Who is going to get hurt? predicting injuries in professional soccer. In: J. Davis, M. Kaytoue, A. Zimmermann (eds.) Proceedings of the 4th Workshop on Machine Learning and Data Mining for Sports Analytics co-located with 2017 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2017), Skopje, Macedonia, 18 Sept 2017., CEUR Workshop Proceedings, vol. 1971, pp. 21–30. CEUR-WS.org (2017). http://ceur-ws.org/Vol-1971/paper-04.pdf

Ruggieri, S., Pedreschi, D., Turini, F.: DCUBE: discrimination discovery in databases. In: A.K. Elmagarmid, D. Agrawal (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, 6–10 June 2010, pp. 1127–1130. ACM (2010). https://doi.org/10.1145/1807167.1807298

Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1312.html#SimonyanVZ13

Smilkov, D., Thorat, N., Kim, B., Viégas, F.B., Wattenberg, M.: Smoothgrad: removing noise by adding noise. CoRR abs/1706.03825 (2017). arxiv: 1706.03825

Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 3319–3328. PMLR, International Convention Centre, Sydney, Australia (2017). http://proceedings.mlr.press/v70/sundararajan17a.html

Trasarti, R., Guidotti, R., Monreale, A., Giannotti, F.: Myway: location prediction via mobility profiling. Inf. Syst. 64 , 350–367 (2017). https://doi.org/10.1016/j.is.2015.11.002

Traub, J., Quiané-Ruiz, J., Kaoudi, Z., Markl, V.: Agora: Towards an open ecosystem for democratizing data science & artificial intelligence. CoRR abs/1909.03026 (2019). arxiv: 1909.03026

Vazifeh, M.M., Zhang, H., Santi, P., Ratti, C.: Optimizing the deployment of electric vehicle charging stations using pervasive mobility data. Transp Res A Policy Practice 121 (C), 75–91 (2019). https://doi.org/10.1016/j.tra.2019.01.002

Vermeulen, A.F.: Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets, 1st edn. Apress, New York (2018)


Acknowledgements

This work is supported by the European Community's H2020 Program under the scheme 'INFRAIA-1-2014-2015: Research Infrastructures', grant agreement #654024 'SoBigData: Social Mining and Big Data Ecosystem', and the scheme 'INFRAIA-01-2018-2019: Research and Innovation action', grant agreement #871042 'SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics'.

Open access funding provided by Università di Pisa within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, KDDLab, Pisa, Italy

Valerio Grossi & Fosca Giannotti

Department of Computer Science, University of Pisa, Pisa, Italy

Dino Pedreschi

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, NeMIS, Pisa, Italy

Paolo Manghi, Pasquale Pagano & Massimiliano Assante


Corresponding author

Correspondence to Dino Pedreschi .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Grossi, V., Giannotti, F., Pedreschi, D. et al. Data science: a game changer for science and innovation. Int J Data Sci Anal 11 , 263–278 (2021). https://doi.org/10.1007/s41060-020-00240-2

Download citation

Received : 13 July 2019

Accepted : 15 December 2020

Published : 19 April 2021

Issue Date : May 2021

DOI : https://doi.org/10.1007/s41060-020-00240-2

Keywords

  • Responsible data science
  • Research infrastructure
  • Social mining

Defining Research Data

One definition of research data is: "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings" (OMB Circular A-110).

Research data covers a broad range of types of information (see examples below), and digital data can be structured and stored in a variety of file formats.

Note that properly managing data (and records) does not necessarily equate to sharing or publishing that data.

Examples of Research Data

Some examples of research data:

  • Documents (text, Word), spreadsheets
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audiotapes, videotapes
  • Photographs, films
  • Protein or genetic sequences
  • Test responses
  • Slides, artifacts, specimens, samples
  • Collection of digital objects acquired and generated during the process of research
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts
  • Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • Methodologies and workflows
  • Standard operating procedures and protocols

Exclusions from Sharing

In addition to the other records to manage (below), some kinds of data may not be sharable due to the nature of the records themselves, or to ethical and privacy concerns. As defined by the OMB, such excluded material includes:

  • preliminary analyses,
  • drafts of scientific papers,
  • plans for future research,
  • peer reviews, or
  • communications with colleagues

Research data also do not include:

  • Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and
  • Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

Some types of data, particularly software, may require a special license to share. In those cases, contact the Office of Technology Transfer to review considerations for software generated in your research.

Other Records to Manage

Although they might not be addressed in an NSF data management plan, the following research records may also be important to manage during and beyond the life of a project.

  • Correspondence (electronic mail and paper-based correspondence)
  • Project files
  • Grant applications
  • Ethics applications
  • Technical reports
  • Research reports
  • Signed consent forms

Adapted from Defining Research Data by the University of Oregon Libraries.



Data Collection Methods | Step-by-Step Guide & Examples

Published on 4 May 2022 by Pritha Bhandari.

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem .

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The  aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

  • Step 1: Define the aim of your research
  • Step 2: Choose your data collection method
  • Step 3: Plan your data collection procedures
  • Step 4: Collect the data
  • Frequently asked questions about data collection

Step 1: Define the aim of your research

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement: what is the practical or scientific issue that you want to address, and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data :

  • Quantitative data is expressed in numbers and graphs and is analysed through statistical methods .
  • Qualitative data is expressed in words and analysed through interpretations and categorisations.

If your aim is to test a hypothesis , measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data.

If you have several aims, you can use a mixed methods approach that collects both types of data.

For example, a study of management practices might have two aims:

  • Your first aim is to assess whether there are significant differences in perceptions of managers across different departments and office locations.
  • Your second aim is to gather meaningful feedback from employees to explore new ideas for how managers can improve.


Step 2: Choose your data collection method

Based on the data you want to collect, decide which method is best suited for your research.

  • Experimental research is primarily a quantitative method.
  • Interviews , focus groups , and ethnographies are qualitative methods.
  • Surveys , observations, archival research, and secondary data collection can be quantitative or qualitative methods.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

Step 3: Plan your data collection procedures

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design .

Operationalisation

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalisation means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

For example, to operationalise the concept of leadership quality:

  • You ask managers to rate their own leadership skills on 5-point scales assessing the ability to delegate, decisiveness, and dependability.
  • You ask their direct employees to provide anonymous feedback on the managers regarding the same topics.

You may need to develop a sampling plan to obtain data systematically. This involves defining a population , the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and time frame of the data collection.
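For instance, a probability sample can be drawn reproducibly from a sampling frame with a few lines of code; the sketch below is illustrative, and the file and column layout are hypothetical.

```python
# Draw a simple random sample of 200 employees from a sampling frame stored as CSV.
import pandas as pd

frame = pd.read_csv("employee_frame.csv")        # population: one row per employee
sample = frame.sample(n=200, random_state=42)    # fixed seed makes the draw reproducible
sample.to_csv("survey_sample.csv", index=False)
print(f"Sampled {len(sample)} of {len(frame)} employees")
```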

Standardising procedures

If multiple researchers are involved, write a detailed manual to standardise data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way – for example, by conducting experiments under the same conditions and using objective criteria to record and categorise observations.

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.

Creating a data management plan

Before beginning data collection, you should also decide how you will organise and store your data.

  • If you are collecting data from people, you will likely need to anonymise and safeguard the data to prevent leaks of sensitive information (e.g. names or identity numbers); a minimal sketch of this step follows this list.
  • If you are collecting data via interviews or pencil-and-paper formats, you will need to perform transcriptions or data entry in systematic ways to minimise distortion.
  • You can prevent loss of data by having an organisation system that is routinely backed up.
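As an illustration of the anonymisation point above, direct identifiers can be replaced with salted hashes before storage. This is a generic sketch (not Scribbr's own tooling), and the field names are hypothetical.

```python
# Pseudonymise survey responses: replace the direct identifier with a salted hash.
import hashlib
import pandas as pd

SALT = "project-specific-secret"                 # store separately from the data

def pseudonymise(identifier: str) -> str:
    """Return a stable, non-reversible code for a participant identifier."""
    return hashlib.sha256((SALT + identifier).encode()).hexdigest()[:12]

responses = pd.DataFrame({
    "employee_id": ["E001", "E002"],
    "manager_rating": [4, 2],
})
responses["participant_code"] = responses["employee_id"].map(pseudonymise)
responses = responses.drop(columns=["employee_id"])  # drop the direct identifier
print(responses)
```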

Step 4: Collect the data

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

For example, closed-ended survey questions might ask participants to rate their manager's leadership skills on scales from 1 to 5. The data produced is numerical and can be statistically analysed for averages and patterns.

To ensure that high-quality data is recorded in a systematic way, here are some best practices:

  • Record all relevant information as and when you obtain data. For example, note down whether or how lab equipment is recalibrated during an experimental study.
  • Double-check manual data entry for errors.
  • If you collect quantitative data, you can assess the reliability and validity to get an indication of your data quality.

Frequently asked questions about data collection

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organisations.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g., understanding the needs of your consumers or user testing your website).
  • You can control and standardise the process for high reliability and validity (e.g., choosing appropriate measurements and sampling methods ).

However, there are also some drawbacks: data collection can be time-consuming, labour-intensive, and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to test a hypothesis by systematically collecting and analysing data, while qualitative methods allow you to explore ideas and experiences in depth.

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity refers to the accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research , you also have to consider the internal and external validity of your experiment.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

Operationalisation means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioural avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalise the variables that you want to measure.

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the ‘Cite this Scribbr article’ button to automatically add the citation to our free Reference Generator.

Bhandari, P. (2022, May 04). Data Collection Methods | Step-by-Step Guide & Examples. Scribbr. Retrieved 15 April 2024, from https://www.scribbr.co.uk/research-methods/data-collection-guide/


Privacy Protection and Secondary Use of Health Data: Strategies and Methods

Dingyi Xiang

1 Internet Rule of Law Institute, East China University of Political Science and Law, Shanghai, China

2 Humanities and Law School, Northeast Forest University, Harbin, Heilongjiang, China

3 Beidahuang Information Company, Harbin, Heilongjiang, China

Health big data has become the most important category of big data, owing both to serious privacy disclosure concerns and to the huge potential value of secondary use. Measures must be taken to balance these two competing challenges. A holistic solution or strategy is regarded as the preferred direction, in which the risk of reidentification from records is kept as low as possible and data are shared according to the principle of minimum necessary. In this article, we present a comprehensive review of privacy protection for health data from four aspects: health data, related regulations, three strategies for data sharing, and three types of methods with progressive levels. Finally, we summarize this review and identify future research directions.

1. Introduction

The rapid development and application of multiple health information technologies has enabled medical organizations to store, share, and analyze large amounts of personal medical/health and biomedical data, the majority of which are electronic health records (EHR) and genomic data. Meanwhile, emerging technologies such as smartphones and wearable devices have enabled third-party firms to provide many kinds of complementary mHealth services and to collect vast amounts of consumer health data. Health big data has thus become the most important category of big data, owing both to serious privacy disclosure concerns and to the huge potential value of secondary use.

Health big data has stimulated the development of personalized or precision medicine. Empowered by health informatics and analytic techniques, secondary use of health data can support clinical decision making; extract knowledge about diseases, genetics, and medicine; improve patients' healthcare experiences; reduce healthcare costs; and support public health policies [ 1 – 3 ]. On the other side of the coin, health data contains much personal privacy and confidential information. For the guidance of protecting health-related privacy, the Health Insurance Portability and Accountability Act (HIPAA) of the US specifies 18 categories of protected health information (PHI) [ 4 ]. Heavy concerns about privacy disclosure greatly hinder the secondary use of health big data. Many efforts have tried to balance privacy management and secondary use of health data from both the legislative side [ 5 ] and the technological side [ 6 , 7 ]. In most circumstances, however, a perfect balance is difficult to achieve; instead, a certain tradeoff or compromise must be made. Recently, COVID-19 has perfectly illustrated the conundrum between protecting health information and ensuring its availability to meet the challenges posed by a significant global pandemic. In this ongoing battle, China and South Korea have mandated public use of contact tracing technologies, with few privacy controls; other countries are also adopting contact tracing technologies [ 7 ].

The direct and most important strategy for balancing both issues is to reuse health data under the premise of protecting privacy. The most basic idea is to share deidentified health data by removing the 18 specified PHI categories. Based on deidentified health data, machine learning and data mining can be used for knowledge extraction or for building a learning health system for the purpose of analyzing and improving care, whereby treatment is tailored to the clinical or genetic features of the patient [ 8 ]. However, transforming data or anonymizing individuals may reduce the utility of the transferred data and lead to inaccurate knowledge [ 9 ]. This tradeoff between privacy and utility (and accuracy) is the central issue in the secondary use of sensitive data [ 10 ]. Deidentification refers to a collection of techniques devised for removing or transforming identifiable information into nonidentifiable information and for introducing random noise into the dataset. Through deidentification, privacy protection is strengthened, but the outcome of an analysis may no longer be exact, only an approximation. To reconcile this conflict, the privacy loss parameter, also called the privacy budget, was proposed to tune the tradeoff between privacy and accuracy: a smaller value of this parameter yields more privacy and less accuracy, and a larger value the opposite [ 11 ]. Furthermore, deidentified data may become reidentifiable through triangulation with other datasets, which means that the privacy harms of big health data arise not merely from the collection of data but from their eventual use [ 12 ]. Deidentification alone is therefore far from sufficient. Instead, a holistic solution is the right direction, in which the risk of reidentification from records is kept as low as possible and data are shared according to the principle of minimum necessary [ 13 ]. For the minimum necessary, user-controlled access [ 6 , 14 ] and secure network architecture [ 15 ] can be practical implementations. For effectively reusing health data while reducing the risk of reidentification, attempts in three areas offer applicable references: risk-mitigation methods, privacy-preserving data mining, and distributed data mining without sharing the data.
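To make the role of the privacy budget concrete, the sketch below perturbs an aggregate count with the standard Laplace mechanism of differential privacy; the dataset and query are synthetic, and epsilon plays the role of the privacy loss parameter discussed above.

```python
# A differentially private count query via the Laplace mechanism.
# Smaller epsilon -> more noise -> more privacy but less accuracy.
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=10_000)         # synthetic patient ages

def dp_count(condition_mask, epsilon):
    """Return a noisy count; the sensitivity of a counting query is 1."""
    true_count = int(condition_mask.sum())
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

mask = ages >= 65
for eps in (0.01, 0.1, 1.0):
    print(f"epsilon={eps:>4}: noisy count = {dp_count(mask, eps):.1f} (true = {mask.sum()})")
```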

The remainder of this paper is organized as follows. Section 2 describes the scope of health data and its categories. Section 3 summarizes regulations on the privacy protection of health data in several countries. Section 4 concisely reviews three strategies for privacy protection and secondary use of health data. Section 5 reviews three types of methods for privacy-preserving use of health data in the primary tasks of data mining. Section 6 concludes this study.

2. Health Data and Its Category

Generally speaking, any data associated with users' health conditions can be viewed as health data. The most important health data are clinical data, especially electronic medical records (EMR), produced by hospitals at different levels. With the development of health information technology and the popularization of wearable health devices, vast amounts of health-relevant data, such as monitored physiological data and diet or exercise data, are collected from individuals and entities elsewhere, both passively and actively. According to the review article by Deven McGraw and Kenneth D. Mandl, health-relevant data can be classified into four categories [ 7 ]. In this research, we focus on the first two categories, which are directly related to users' health and privacy.

Category 1. Health data generated by the healthcare system. This type of data is clinical data, recorded by clinical professionals or medical equipment when a patient receives healthcare services in a hospital or clinic. Clinical data includes EMR, prescriptions, laboratory data, pathology images, radiography, and payor claims data. Patients' historical and current conditions are recorded for treatment purposes. To provide better health services for patients, it is important to track patients' lifelong clinical data and to enable clinical data sharing among different healthcare providers. The personal health record (PHR) was proposed to integrate patients' cross-institution and lifelong clinical data [ 16 ]. This type of health data is generated and collected routinely in the process of healthcare, with the explicit aim that the data be used for analyzing and improving care. Because it serves clinical treatment, and also because of consumers' firm trust in healthcare experts and institutions, clinical data contains a high degree of health-related privacy. Therefore, the majority of health privacy laws mainly cover the privacy protection of clinical data [ 7 ]. Under the constraints of health privacy laws, tons of clinical data have been restricted to internal use within medical institutions. Meanwhile, clinical data is also extremely valuable for secondary use, since it is created by professional experts and is a direct description of consumers' health conditions. The tradeoff between the utility and privacy of this type of health data has become one of the most important issues in the age of medical big data.

Category 2. Health data generated by the consumer health and wellness industry. This type of health data is an important complement to clinical data. With the widespread application of new-generation information technologies, such as IoT, mHealth, smartphones, and wearable devices, consumers' attitude toward health has greatly changed from passive treatment to active health management. Consumers' health data can be generated through wearable fitness tracking devices, medical wearables such as insulin pumps and pacemakers, medical or health monitoring apps, and online health services. These health data can include breath, heart rate, blood pressure, blood glucose, walking, weight, diet preference, position, and online health consultations. Such products, services, and health data play an important role in consumers' daily health management, especially for patients with chronic diseases. This area has gained increasing attention from industry and academia, with consumer health informatics as the representative direction [ 17 ]. This type of nontraditional health-relevant data, often equally revealing of health status, is in widespread commercial use, yet, in the hands of commercial companies, it is often less accessible to providers, patients, and public health for improving individual and population health [ 18 ]. These big health data are scattered across institutions and intentionally isolated to protect patient privacy. For this type of health data, integration and linking at the individual level are an extra challenge in addition to the utility-privacy tradeoff.

Table 1 summarizes the two categories of health data and their comparative features.

Summarization of clinical data and consumer health data.

3. Regulations about Privacy Protection of Health Data

Personal information and health-relevant data must be recorded in order to provide regular health services. Meanwhile, personal information and health-relevant data are closely associated with user privacy and confidential information. Therefore, several important privacy-related regulations or acts have been published to guide health data protection and reuse. Modern data protection law is built on "fair information practice principles" (FIPPs) [ 19 ].

The most referenced regulation is the Health Insurance Portability and Accountability Act (HIPAA) [ 4 ]. HIPAA was created primarily to modernize the flow of healthcare information, stipulate how personally identifiable information maintained by the healthcare and health insurance industries should be protected from fraud and theft, and address limitations on healthcare insurance coverage. The HIPAA Safe Harbor (SH) rule specifies 18 categories of explicitly or potentially identifying attributes, called protected health information (PHI), that must be removed before health data is released to a third party. HIPAA also covers electronic PHI (ePHI), which includes medical scans and electronic health records. A full list of PHI elements is provided in Table 2. The PHI elements in Table 2 cover only identity information and do not include any sensitive attributes. That is, HIPAA does not provide guidelines on how to protect sensitive attribute data; instead, the basic idea of the HIPAA SH rule is to protect privacy by preventing identity disclosure. However, other attributes may still combine uniquely into a quasi-identifier (QI), which can allow data recipients to reidentify the individuals to whom the data refer. A strict implementation of the SH rule may therefore be inadequate for protecting privacy or for preserving data quality. Recognizing this limitation, HIPAA also provides alternative guidelines that enable a statistical assessment of privacy disclosure risk to determine whether the data are appropriate for release [ 20 ].

Protected health information defined by HIPAA.
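As a minimal illustration of Safe Harbor-style deidentification (a sketch only, with hypothetical field names; the actual rule covers all 18 PHI categories and further constraints), a record can be stripped of direct identifiers and have its quasi-identifiers coarsened before release:

```python
# Illustrative Safe Harbor-style scrubbing of a single record
# (not a complete implementation of the 18 PHI categories).
PHI_FIELDS = {"name", "address", "phone", "email", "ssn", "mrn"}

def deidentify(record: dict) -> dict:
    """Drop direct identifiers and coarsen quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in PHI_FIELDS}
    if "birth_date" in out:                      # keep only the year of birth
        out["birth_year"] = out.pop("birth_date")[:4]
    if "zip" in out:                             # truncate the ZIP code to 3 digits
        out["zip3"] = out.pop("zip")[:3]
    return out

record = {"name": "Jane Doe", "mrn": "12345", "birth_date": "1980-07-02",
          "zip": "02139", "diagnosis": "E11.9"}
print(deidentify(record))
```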

The Health Information Technology for Economic and Clinical Health (HITECH) Act [ 21 ] was enacted as part of the American Recovery and Reinvestment Act of 2009 to promote the adoption and meaningful use of health information technology. Subtitle D of the HITECH Act addresses the privacy and security concerns associated with the electronic transmission of health information, in part through several provisions that strengthen the civil and criminal enforcement of the HIPAA rules. It is complementary to HIPAA and strengthens HIPAA's privacy regulations. HITECH has also widened the scope of HIPAA through the Omnibus Rule, which extends the privacy and security reach of HIPAA/HITECH to business associates. Under HIPAA and the HITECH Act, much of the data beyond category 1 in Table 1 falls outside the scope of comprehensive health privacy laws in the U.S.

The Consumer Data Right (CDR) [ 22 ] is coregulated by the Office of the Australian Information Commissioner (OAIC) and the Australian Competition and Consumer Commission (ACCC). The "My Health Record System" is run to track citizens' medical conditions, test results, and so on. The OAIC sets out controls on how health information in a My Health Record can be collected, used, and disclosed, which corresponds to PHR integration. The Personal Information Protection and Electronic Documents Act (PIPEDA) [ 23 ] of Canada applies to all personal health data. PIPEDA is stringent and, although it has many commonalities with HIPAA, goes beyond HIPAA requirements in several areas. One such area is the protection of data generated by mobile health apps, which is not strictly covered by HIPAA. PIPEDA thus serves to protect consumer health data. Under PIPEDA, organizations can seek implied or explicit consent, based on the sensitivity of the personal information collected and the reasonable data processing consent expectations of the data subject. The General Data Protection Regulation (GDPR) is a wide-ranging data protection regulation in the EU, which covers health data as well as all other personal data, even when they contain sensitive attributes. GDPR also has data consent and breach notification expectations and contains several key provisions, including notification, the right to access, the right to be forgotten, and portability. Under GDPR, organizations are required to gain explicit consent from data subjects, and individuals have the right to restriction of processing and the right not to be subject to automated decision-making.

China has no specific regulation for health data privacy protection. Several restriction rules prohibiting privacy disclosure are scattered across the China Civil Code (CCC), the Medical Practitioners Act of the PRC (MPAPRC), and the Regulations on Medical Records Management in Medical Institutions (RMRMMMI), which place privacy disclosure restrictions on individuals, medical practitioners, and medical institutions, respectively. The CCC specifies 9 categories of personal information to be protected, including name, birthday, ID number, biometric information, living address, phone number, email address, health condition information, and position tracking information. The RMRMMMI approves the reuse of health data only for medical care, teaching, and academic research. Recently, the Personal Information Protection Law of the PRC (PIPILRC) [ 24 ] was released and will come into force on November 1, 2021. This is the first complete and comprehensive regulation on personal information protection in China. In this regulation, the definitions of sensitive personal information and automated decision making both involve health data, so the regulation is applicable to the privacy protection of health data. According to this regulation, secondary use of deidentified or anonymized health data for automated decision making is permitted, and data processing consent from consumers is also required. This regulation, as far as can be foreseen, will greatly stimulate the exploitation and exploration of health big data.

According to the comparison of these data privacy regulations, shown in Table 3, PIPEDA, GDPR, and the newly released PIPILRC cover both clinical data and consumer health data, while the others pay the majority of their attention to clinical data. Health data need to be reused for multiple important purposes. In fact, health data processing and reuse are never absolutely prohibited in the regulations mentioned above, as long as privacy protection is achieved as an important prerequisite. In this respect, HIPAA sets Safe Harbor rules to make sure PHI is removed before health data is released to a third party. Furthermore, PIPEDA and GDPR require consumers' consent for data processing. Regulations from China also encourage health data to be reused in certain restricted areas. As the newcomer, the PIPILRC presents a more complete and comprehensive guide to protecting and processing health data.

Regulations and corresponding data category.

4. Strategies and Framework

The exploitation of health data can provide tremendous benefits for clinical research, but methods to protect patient privacy while using these data face many challenges. Some of these challenges arise from a misunderstanding that the problem should be solved by a foolproof solution. There exists a paradox: well-deidentified and scrubbed data may lose much meaningful information, resulting in low quality, while retaining much PHI carries a high risk of privacy breach. Therefore, a holistic solution, or rather a unified strategy, is needed. Three strategies are summarized in this section. The first is for clinical data and provides a practical user access rating system; the second is mainly for genomic data and designs a network architecture that addresses both access security and the potential risk of privacy disclosure and reidentification. From a more practical starting point, the third tries to share a model without exposing any data. The three strategies present solutions from different perspectives and can therefore complement each other.

4.1. Strategies for Clinical Data

As for clinical data, Murphy et al. proposed an effective strategy to build a clinical data sharing platform while protecting patient privacy [ 6 ]. The proposed approach to balancing privacy management and secondary data use is to match the level of data deidentification with the trustworthiness of the data recipients: the more identified the data, the more "trustworthy" the recipients are required to be, and vice versa. The level of trust in a data recipient becomes a critical factor in determining what data may be seen by that person. This type of hierarchical access rating is similar to film rating, which accommodates the requirements and appetites of different types of audiences. Murphy et al.'s strategy sets up five patient privacy levels based on three aspects: availability of the data, trust in the researcher and the research, and the security of the technical platforms. Corresponding to the privacy levels are five user role levels.

The lowest level of user is the "obfuscated data user." For this user, data are obfuscated as they are served to a client machine with possibly low technical security. Obfuscation methods add a random number to the aggregated counts instead of providing the exact result [ 25 , 26 ]. The second level of user is the "aggregated data user," to whom exact numbers from aggregate query results are permissible. The third is the "LDS data user," who is granted access to the HIPAA-defined limited dataset (LDS) and structured patient data from which PHI has been removed. The fourth is the "notes-enabled LDS data user," who is additionally allowed to view PHI-scrubbed text notes (such as discharge summaries). The final level of user is the "PHI-viewable data user," who has access to all patient data.

These access level categories are summarized in Table 4 .

Health data access level categories.
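The sketch below captures the spirit of the two lowest access levels described above (a simplified assumption, not the actual i2b2 logic): aggregate counts are returned exactly to aggregated data users and perturbed with random noise for obfuscated data users.

```python
# Role-dependent aggregate query results: obfuscated users get a blurred count,
# aggregated users get the exact count. Higher-privilege roles are out of scope here.
import random

def cohort_count(true_count: int, role: str, noise_range: int = 3) -> int:
    if role == "obfuscated":
        return max(0, true_count + random.randint(-noise_range, noise_range))
    if role == "aggregated":
        return true_count
    raise ValueError(f"unknown or higher-privilege role: {role}")

random.seed(1)
print(cohort_count(128, "obfuscated"))   # a count near, but usually not equal to, 128
print(cohort_count(128, "aggregated"))   # exactly 128
```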

With the guidance of these health data access level categories, Murphy et al. implemented five cases in clinical research. In a realistic project, multiple user roles or different access privileges must be combined to reconcile different data access requirements. Murphy et al. also provided three exemplar projects and their possible distributions of privacy-level users. This strategy offers a complete reference for data-sensitive projects and has also been implemented as a holistic approach to patient privacy in the Informatics for Integrating Biology and the Bedside (i2b2) research framework [ 27 ]. The i2b2 framework is the most widespread open-source framework for exploring clinical research data warehouses; it was jointly developed by Harvard Medical School and the Massachusetts Institute of Technology to enable clinical researchers to use existing deidentified clinical data and only IRB-approved genomic data for research aims. Yet, i2b2 does not provide any specific protection mechanism for genomic data.

4.2. Strategies for Genomic Data

As for genomic data, the two main privacy threats are (i) loss of confidentiality of patients' health data due to illegitimate data access and (ii) patient reidentification, with the resulting disclosure of sensitive attributes, from legitimate data access. Building on the i2b2 framework, Raisaro et al. [ 15 ] proposed applying homomorphic encryption [ 28 ] to the first threat and differential privacy [ 29 ] to the second. Furthermore, Raisaro et al. designed a system model consisting of two physically separated networks. The network architecture, shown in Figure 1, isolates the data used for clinical/medical care from the data used for research activities by a few trusted and authorized individuals.

Figure 1: Network architecture of privacy protection for health data, including genomic data.

The clinical network supports the hospital's daily clinical activities and contains the clinical and genomic data of patients. This network is tightly controlled and protected by a firewall that blocks all incoming network traffic; only authorized users are permitted to log in.

The research network hosts the i2b2 service used by researchers in their research activities. The i2b2 service is composed of an i2b2 server and a proxy server, on which a homomorphic encryption method and a differential privacy method are implemented and deployed. The i2b2 server receives deidentified clinical data and encrypted genomic data from the clinical network and performs secure data queries and computation. The proxy server supports the decryption phase and stores the partial decryption keys for homomorphic encryption. Through the research network, researchers obtain authorized data via the query execution module in five sequential steps: query generation, query processing, result perturbation, result partial decryption, and final result decryption on the user's client side.
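
To give a concrete sense of the additive homomorphic aggregation underlying such a workflow, here is a toy sketch using the open-source python-paillier (phe) package, which is assumed to be installed; it deliberately omits the proxy server, the partial decryption keys, and the result perturbation of the deployed system.

```python
# Toy sketch of additively homomorphic aggregation (Paillier), using the
# open-source python-paillier library (pip install phe). This is NOT the
# deployed i2b2 extension: distributed key management, proxy-based partial
# decryption, and differentially private perturbation are all omitted.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Per-patient genomic indicator values (e.g., variant present = 1) are
# encrypted before leaving the clinical network.
encrypted_indicators = [public_key.encrypt(v) for v in [1, 0, 1, 1, 0]]

# The research-side server can sum the ciphertexts without seeing the values.
encrypted_count = encrypted_indicators[0]
for c in encrypted_indicators[1:]:
    encrypted_count = encrypted_count + c

# Only the authorized decryption step reveals the aggregate count.
print(private_key.decrypt(encrypted_count))  # -> 3
```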

This network architecture and its privacy-preserving solution have been deployed and tested at Lausanne University Hospital, where they are used for exploring genomic cohorts in a real operational scenario. The application is also a practicable demonstration for similar scenarios, and it is not a unique instance. Azencott reviewed how breaches of patient privacy can occur and, surveying recent developments in computational data protection, proposed a similar secure framework for genomic data sharing built around three aspects: algorithmic solutions for deidentification, database security, and trustworthy user access [ 3 ].

4.3. Strategies for Sharing Not Data but Models

The new machine learning paradigm of federated learning (FL) was first introduced in 2016 [ 30 ] and has since developed rapidly into a hot research topic in artificial intelligence. Its core idea is to train machine learning models on separate datasets distributed across different devices or parties, which preserves local data privacy to a certain extent. This development mainly benefits from three facts [ 31 ]: (1) the widespread successful application of machine learning technologies, (2) the explosive growth of big data, and (3) the legal regulations for data privacy protection enacted worldwide.

The idea of federated learning is to share only the model parameters instead of the original data. In this way, many initiatives can be based on federated models in which the actual data never leave the institution of origin, allowing researchers to share models without sharing patient data. Federated learning has thus inspired an important strategy for developing smart healthcare from the sensitive and private medical records held in isolated medical centers and hospitals. As shown in Figure 2, federated learning offers a framework for jointly training a global model using datasets stored on separate clients.

Figure 2: Architecture of a federated learning system.

Model building of this kind has been used in real-world applications where user privacy is crucial, e.g., for hospital data or text prediction on mobile devices. Model updates are considered to contain less information than the original data, and through the aggregation of updates from multiple data points, the original data are considered practically impossible to recover. Federated learning emphasizes protecting the data owner's privacy during the model training process. Effective measures to protect data privacy will help cope with the increasingly stringent data privacy and data security regulations expected in the future [ 32 ].
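
A minimal federated-averaging sketch on a toy linear model, written with NumPy, illustrates the parameter-only exchange; the clients, data, and hyperparameters are invented for illustration and do not correspond to any specific healthcare deployment.

```python
# Minimal federated averaging (FedAvg-style) sketch on a toy linear model.
# Clients, data, and hyperparameters are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few full-batch gradient steps on its private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
        w -= lr * grad
    return w

# Three "hospitals" with private datasets that never leave the client.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(20):
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients])
    # The server aggregates only parameters, weighted by local dataset size.
    global_w = np.average(local_ws, axis=0, weights=sizes)

print(global_w)  # approaches [2, -1] without pooling any raw data
```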

5. Tasks and Methods

Under these health data protection strategies, specific privacy and data processing tasks and methods can be deployed. They can be viewed at three progressive levels. Methods at the first level aim to mitigate the risk of privacy disclosure, from four angles. Methods at the second level target data mining or knowledge extraction from deidentified or anonymized health data. Methods at the third level, which do not require sharing health data at all, build a learning model or extract knowledge in a distributed manner and then share the model or the knowledge.

5.1. Risk-Mitigation Methods

There are two widely recognized types of privacy disclosure [ 33 ]: identity disclosure (or reidentification) and attribute disclosure. The former occurs when illegitimate data users match a record in a dataset to an individual; the latter occurs when illegitimate data users predict the sensitive value(s) of an individual record. According to Malin et al. [ 34 ], methods for mitigating these two types of privacy disclosure risk can be divided into four classes: suppression, generalization, randomization, and synthetization. This categorization summarizes recent research on risk-mitigation methods well.

5.1.1. Suppression Methods

Suppression methods aim to scrub (remove or mask) the 18 PHI elements defined in HIPAA and are the most important deidentification methods. Before PHI can be scrubbed, the main task is to identify it in the health data. For structured data, PHI identification is straightforward given the data schema. For narrative data or free text, such as discharge summaries or progress notes, natural language processing (NLP) is the preferred technology for PHI identification. Specifically, named entity recognition (NER) is the mainstream technique used on clinical data for deidentification and medical knowledge extraction. The 18 PHI elements are treated as predefined entity types; machine learning is employed to annotate type tags for each word in a sentence, the tags are then merged, and finally the position and type of each PHI element are identified. Conditional random fields (CRFs) are the classic sequential tagging model for NER and are often applied to deidentification [ 35 ]. Meystre et al. produced a systematic review of deidentification methods [ 36 ], while Uzuner et al. [ 37 ] and Deleger et al. [ 38 ] both evaluated methods on human-annotated datasets. The identified PHI values are then simply removed from, or replaced with a constant value in, the released text documents, which may be inadequate for protecting privacy or preserving data quality. Li and Qin proposed a systematic approach that integrates methods developed in both the data privacy and health informatics fields. Its key novel elements are a recursive partitioning method to cluster medical text records and a value enumeration method to anonymize potentially identifying information in the text, essentially masking the original values, so as to improve both privacy protection and data utility [ 20 ].
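
For illustration, a minimal rule-based scrubber is sketched below; it relies on a few regular expressions rather than the trained CRF/NER models discussed above, and its patterns and placeholder tags cover only a handful of the 18 HIPAA PHI categories.

```python
# Minimal rule-based PHI scrubbing sketch (illustrative only). Real
# deidentification systems use trained NER models (e.g., CRFs) and must
# cover all 18 HIPAA PHI categories; the patterns below handle only a few.
import re

PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[MRN]"),
]

def scrub(note: str) -> str:
    """Replace matched PHI spans with constant placeholder tags."""
    for pattern, tag in PHI_PATTERNS:
        note = pattern.sub(tag, note)
    return note

example = "Pt seen 03/14/2021, MRN: 88231, call 541-737-3331 or jane.doe@example.org."
print(scrub(example))
# -> "Pt seen [DATE], [MRN], call [PHONE] or [EMAIL]."
```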

For genomic data, homomorphic encryption [ 28 ] can be applied to encrypt the data, after which the encrypted data can be shared for secondary use. Raisaro et al. employed homomorphic encryption to build a data warehouse for genomic data [ 15 ]. Kamm et al. [ 39 ] proposed a framework for computing aggregate statistics on genomic data using secure multiparty computation based on homomorphic secret sharing. Several other works [ 28 , 40 , 41 ] proposed using homomorphic encryption to protect genomic information so that researchers can compute some statistics directly on the encrypted data and decrypt only the final result.

5.1.2. Generalization Methods

These methods transform data into more abstract representations. The simplest implementation is coarsening: for instance, a patient's age may be generalized from 1-year to 5-year age groups. Based on this type of generalization, sensitive attributes can be generalized into subgroups and anonymized to some extent, which is the basic idea behind k -anonymity and its variations. k -anonymity seeks to prevent reidentification by stripping enough information from the released data that any individual record becomes indistinguishable from at least ( k − 1) other records [ 42 ]. The idea is to modify the values of the quasi-identifier (QI) attributes so that an attacker finds it difficult to unravel the identity of persons in the dataset while the released data remain as useful as possible. This modification is a form of generalization, in which stored values are replaced with semantically consistent but less precise alternatives [ 43 ]. For example, consider a dataset in which age is a quasi-identifier. While the three records {age = 30, gender = male}, {age = 35, gender = male}, and {age = 31, gender = female} are all distinct, releasing them as {age = 3∗, gender = male}, {age = 3∗, gender = male}, and {age = 3∗, gender = female} ensures they all fall in the same age category, achieving 3-anonymity with respect to age. Building on k -anonymity, l -diversity [ 44 , 45 ] was proposed to address the further disclosure of sensitive attributes.
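
A minimal sketch of age generalization and a k-anonymity check over chosen quasi-identifiers is shown below; the records and the bucket width are invented for illustration.

```python
# Sketch: generalize a quasi-identifier (age) into buckets and check
# k-anonymity over the chosen quasi-identifiers. Records are invented.
from collections import Counter

records = [
    {"age": 30, "gender": "male",   "diagnosis": "A"},
    {"age": 35, "gender": "male",   "diagnosis": "B"},
    {"age": 31, "gender": "female", "diagnosis": "A"},
]

def generalize_age(age, width=10):
    """Replace an exact age with a coarse bucket, e.g. 30-39."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

released = [{**r, "age": generalize_age(r["age"])} for r in records]

print(k_anonymity(released, ["age"]))            # -> 3 (age is the only QI)
print(k_anonymity(released, ["age", "gender"]))  # -> 1 (gender still distinguishes)
```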

5.1.3. Randomization Methods

Randomization can be applied at the attribute level. In this case, original sensitive values are replaced, with a certain probability, by similar but different values. For example, a patient's name may be masked by a randomly selected made-up name. This basic approach may degrade data quality; Li and Qin proposed obtaining replacement values via a clustering method instead [ 20 ].

Randomization can also be applied to aggregation operations; obfuscation is one such randomization. Numerous repetitions of a query by a single user must be detected and interrupted, because repeated results converge on the true patient count, which makes proper user identification absolutely necessary for these methods to function properly [ 6 ]. To deidentify aggregated data, obfuscation methods add to the patient counts a random number drawn from a Gaussian distribution. Obfuscation is applied to the aggregate patient counts reported as results of ad hoc queries on the client machine [ 26 ]. Another protection model for preventing reidentification is differential privacy [ 10 , 46 ], in which reidentification is prevented by adding noise to the data. The model rests on the observation that auxiliary information will always make it easier to identify an individual in a dataset, even an anonymized one. Differential privacy instead seeks to guarantee that the information released when querying a dataset is nearly the same whether or not a specific person is included [ 46 ]. Unlike other methods, differential privacy provides formal statistical privacy guarantees.
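
The sketch below contrasts the two ideas on a single count query: Gaussian obfuscation of the aggregate count and a Laplace mechanism calibrated to the query's sensitivity. The noise scale and epsilon are arbitrary illustrative choices rather than parameters of the cited systems.

```python
# Sketch: obfuscating a patient count with Gaussian noise vs. releasing it
# under epsilon-differential privacy with the Laplace mechanism.
# Noise scale and epsilon are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(42)
true_count = 137          # exact number of matching patients

# (1) Obfuscation: add zero-mean Gaussian noise to the aggregate count.
obfuscated = true_count + rng.normal(loc=0.0, scale=3.0)

# (2) Laplace mechanism: for a counting query the sensitivity is 1
# (adding or removing one patient changes the count by at most 1).
epsilon = 0.5
laplace_noisy = true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(round(obfuscated), round(laplace_noisy))
```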

5.1.4. Synthetization Methods

Synthetization is compelling for two main reasons: it preserves confidentiality while still supporting valid inference for various estimates [ 47 ]. In this case, the original data are never shared. Instead, general aggregate statistics about the data are computed, and new synthetic records are generated from those statistics to create fake but realistic-looking data. Exploiting clinical data to build intelligent systems is one scenario: developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public because of privacy and security concerns. To address this challenge, Li et al. proposed methods to generate synthetic clinical notes and evaluated their utility on real clinical natural language processing tasks. Thanks to the development of deep learning, recent advances in text generation make it possible to produce synthetic clinical notes that are useful for training NER models for information extraction from real clinical notes, thus lowering privacy concerns and increasing data availability [ 48 ].
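
As a minimal numeric illustration of the idea, the sketch below fits simple aggregate statistics (mean and covariance) and samples synthetic records from them; it is not the text-generation approach of Li et al., and the attributes are invented.

```python
# Sketch: generate synthetic numeric records from aggregate statistics only.
# This is a toy Gaussian model over invented attributes, not the deep
# text-generation approach used for synthetic clinical notes.
import numpy as np

rng = np.random.default_rng(7)

# Original (private) records: columns = age, systolic BP, cholesterol.
original = rng.normal(loc=[55, 130, 200], scale=[12, 15, 30], size=(500, 3))

# Only aggregates leave the data holder.
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Synthetic records are drawn from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print(mean.round(1))
print(synthetic.mean(axis=0).round(1))  # close to the original means
```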

5.2. Privacy-Preserving Data Mining

Data mining is also called knowledge discovery from data (KDD), a name that highlights the goal of the mining process. To obtain useful knowledge from data, the mining process can be divided into four iterative steps: data preprocessing, data transformation, data mining, and pattern evaluation and presentation. Based on this stage division of the KDD process, Xu et al. developed a user-role-based methodology and identified four types of users in a typical data mining scenario: data provider, data collector, data miner, and decision maker. By differentiating these user roles, privacy-preserving data mining (PPDM) can be explored in a principled way, in which all users care about the security of sensitive information but each role views the security issue from its own perspective [ 49 ]. In this review, PPDM is examined from the perspective of the data miner, that is, from the data mining stage of KDD.

Privacy-preserving data mining aims to mine or extract information, via machine learning models, from privacy-preserving data in which the values of individual records have been perturbed or masked [ 50 ]. The key challenge is that the privacy-preserving data look very different from the original records, and their value distribution also differs from the original distribution. Research on this problem started early: Agrawal and Srikant proposed a reconstruction procedure to estimate the distribution of the original data values and then built a decision-tree classifier on it [ 50 ]. Recent studies of PPDM include privacy-preserving association rule mining, privacy-preserving classification, and privacy-preserving clustering.
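
To give a flavor of mining from perturbed values, the sketch below uses classic randomized response to estimate the prevalence of a sensitive binary attribute from randomized individual reports; this is a simpler mechanism than the reconstruction procedure of Agrawal and Srikant, and the numbers are invented.

```python
# Sketch: estimating a population statistic from randomized (perturbed)
# individual responses, in the spirit of privacy-preserving data mining.
# Classic randomized response: each patient reports truthfully with
# probability p, otherwise reports the opposite value.
import numpy as np

rng = np.random.default_rng(1)
n, true_prevalence, p = 10_000, 0.23, 0.75

truth = rng.random(n) < true_prevalence      # private attribute (never collected)
flip = rng.random(n) > p                     # lie with probability 1 - p
reported = np.where(flip, ~truth, truth)     # only this perturbed vector is collected

# Unbiased estimate: E[reported] = p*pi + (1-p)*(1-pi), solved for pi.
observed = reported.mean()
estimated = (observed - (1 - p)) / (2 * p - 1)

print(round(true_prevalence, 3), round(estimated, 3))
```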

Association rule mining aims to find interesting associations and correlations among large sets of data items. In PPDM, some of the rules may be considered sensitive. To hide such rules, the original data are modified to generate a sanitized dataset from which the sensitive rules cannot be mined while the nonsensitive ones can still be discovered [ 51 ]. Classification is a data analysis task that learns models to automatically classify data into defined categories. Privacy-preserving classification has been developed for decision trees, Bayesian models, support vector machines, and neural classifiers. The strategies for adapting classification to a privacy-preserving setting fall into two groups. The first learns the classification model on transformed data, since the transformed data are difficult to recover [ 52 , 53 ]. The second learns the classification model via secure multiparty computation (SMC) [ 54 ], in which multiple parties collaborate to build a classifier from vertically or horizontally partitioned data without any party disclosing its data to the others [ 55 , 56 ]. Cluster analysis is the process of grouping a set of records into multiple groups or clusters so that objects within a cluster are highly similar to each other and very dissimilar to objects in other clusters; it runs in an unsupervised manner. As with classification, current research on privacy-preserving clustering can be roughly categorized into approaches based on data transformation [ 57 , 58 ] and approaches based on secure multiparty computation [ 59 , 60 ].

5.3. Federated Privacy-Preserving Data Mining

For distributed or isolated data, distributed data mining is the relevant research topic. It can be further categorized into data mining over horizontally partitioned data and data mining over vertically partitioned data, and it has attracted much attention. To overcome the difficulty of data integration and promote efficient information exchange without sharing sensitive raw data, Que et al. developed a Distributed Privacy-Preserving Support Vector Machine (DPP-SVM), which enables privacy-preserving collaborative learning in which a trusted server integrates "privacy-insensitive" intermediary results [ 61 ]. In the medical domain, much raw data can hardly leave the institution of origin. Instead of bringing the data to a central repository for computation, Wu et al. proposed the Grid Binary LOgistic REgression (GLORE) algorithm, which fits a logistic regression model in a distributed fashion using information from locally hosted databases that contain different observations sharing the same attributes [ 62 ].

It is worth noting that learning (classification or clustering) over secure multiparty computation is an important distributed learning strategy that greatly reduces privacy disclosure concerns, since the data need not be shared; this line of research likely inspired federated machine learning [ 30 , 32 ]. Today's AI still faces two major challenges: data exist in the form of isolated islands, and data privacy and security requirements keep strengthening. Both challenges are even more severe in the healthcare domain. Federated machine learning aims to build a learning model from decentralized data [ 30 ]. Based on how the data are distributed among the parties in feature space and sample ID space, federated learning can be classified into horizontal federated learning, vertical federated learning, and federated transfer learning [ 32 ]. Horizontal (sample-based) federated learning applies when the datasets share the same feature space but differ in samples; at the end of learning, the universal model and all model parameters are exposed to all participants. Vertical (feature-based) federated learning applies when two datasets share the same sample ID space but differ in feature space; at the end of learning, each party holds only the model parameters associated with its own features, so at inference time the parties must also collaborate to generate the output. Federated transfer learning (FTL) applies when the two datasets differ in both samples and feature space; FTL is an important extension of existing federated learning systems and is most similar to vertical federated learning. The challenge of protecting data privacy while maintaining data utility through machine learning still remains. For a comprehensive introduction to federated privacy-preserving data mining, please refer to the survey based on the proposed 5W-scenario-based taxonomy [ 31 ].
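
The distinction between horizontal and vertical partitioning can be made concrete with a small NumPy sketch; the toy matrix, the party names, and the split points are invented for illustration.

```python
# Sketch: horizontal vs. vertical partitioning of the same toy data matrix.
# Rows = patients (sample IDs), columns = features; parties are invented.
import numpy as np

data = np.arange(24).reshape(6, 4)   # 6 patients x 4 features

# Horizontal partitioning: same feature space, different patients per party
# (e.g., two hospitals collecting the same measurements on disjoint cohorts).
hospital_a = data[:3, :]             # patients 0-2, all 4 features
hospital_b = data[3:, :]             # patients 3-5, all 4 features

# Vertical partitioning: same patients, different features per party
# (e.g., a hospital holds clinical features, a lab holds genomic features).
clinic = data[:, :2]                 # all 6 patients, features 0-1
genomics_lab = data[:, 2:]           # all 6 patients, features 2-3

print(hospital_a.shape, hospital_b.shape)   # (3, 4) (3, 4)
print(clinic.shape, genomics_lab.shape)     # (6, 2) (6, 2)
```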

5.4. Summary: Privacy vs. Accuracy

Privacy protection is an indispensable prerequisite for the secondary use of health data. As discussed above, risk-mitigation methods anonymize private or sensitive information so as to reduce the risk of reidentification, while privacy-preserving data mining methods process the privacy-scrubbed data to extract knowledge and even build AI systems. If absolute privacy safety is pursued, the scrubbed data become practically useless because their quality is severely degraded, and the accuracy and effectiveness of any downstream use suffer accordingly. Therefore, in practice, a tradeoff between privacy and accuracy must always be made: it can be tuned toward more privacy at the cost of accuracy, or vice versa, according to the required privacy and utility levels. Federated privacy-preserving data mining sheds light on a new way to balance privacy and accuracy: without sharing data, it first processes the original health data within each institution and then conducts federated mining or learning. This type of method is expected to reconcile privacy and accuracy in a more elegant and acceptable way.

6. Conclusions

Clinical data, genomic data, and consumer health data make up the majority of health big data, and their protection and reuse remain heavily researched topics. In this review article, the types and scope of health data were first discussed, followed by the related regulations for privacy protection. Strategies for user-controlled access and secure network architectures were then presented; sharing a trained model without the original data ever leaving its source is an important new strategy that is gaining more and more attention. According to different data reuse scenarios, tasks and methods at three different levels were summarized. These strategies and methods can be combined to form a holistic solution.

With the rapid development of health information technology and artificial intelligence, limited privacy protection capability may impede the urgent demand for reusing health data. Potential research directions include (1) applying modern machine learning to the deidentification and anonymization of multimodal health data while preserving data quality; (2) constructing learning models and extracting knowledge from anonymized data to leverage the secondary use of health data; (3) federated learning on isolated health data, which can both protect privacy and improve the efficiency of data transfer and processing, and which deserves more attention; and (4) mitigating the reidentification risk, such as linkage or inference attacks, posed by a trained model.

Acknowledgments

This study was funded by the China Postdoctoral Science Foundation Grant (2020M671059) and the Fundamental Research Funds for the Central Universities (2572020BN02).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Dataset results


The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The images are labelled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). There are 6000 images per class with 5000 training and 1000 testing images per class.

14,047 PAPERS • 100 BENCHMARKS


The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.

13,388 PAPERS • 40 BENCHMARKS


The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

10,098 PAPERS • 92 BENCHMARKS


The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. There are 600 images per class. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are 500 training images and 100 testing images per class.

7,615 PAPERS • 52 BENCHMARKS


The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio; the resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.

6,970 PAPERS • 52 BENCHMARKS


Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 fine annotated images and 20000 coarse annotated ones. Data was captured in 50 cities during several months, daytimes, and good weather conditions. It was originally recorded as video so the frames were manually selected to have the following features: large number of dynamic objects, varying scene layout, and varying background.

3,315 PAPERS • 54 BENCHMARKS


KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odometry challenge).

3,207 PAPERS • 141 BENCHMARKS


Street View House Numbers (SVHN) is a digit classification benchmark dataset that contains 600,000 32×32 RGB images of printed digits (from 0 to 9) cropped from pictures of house number plates. The cropped images are centered in the digit of interest, but nearby digits and other distractors are kept in the image. SVHN has three sets: training, testing sets and an extra set with 530,000 images that are less difficult and can be used for helping with the training process.

3,073 PAPERS • 12 BENCHMARKS


CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels indicating facial attributes like hair color, gender and age.

3,072 PAPERS • 21 BENCHMARKS


Fashion-MNIST is a dataset comprising 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format, and training/testing split structure with the original MNIST.

2,773 PAPERS • 18 BENCHMARKS


General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including single-sentence tasks CoLA and SST-2, similarity and paraphrasing tasks MRPC, STS-B and QQP, and natural language inference tasks MNLI, QNLI, RTE and WNLI.

2,695 PAPERS • 45 BENCHMARKS

Neural Radiance Fields (NeRF) is a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. The dataset contains three parts with the first 2 being synthetic renderings of objects called Diffuse Synthetic 360◦ and Realistic Synthetic 360◦ while the third is real images of complex scenes. Diffuse Synthetic 360◦ consists of four Lambertian objects with simple geometry. Each object is rendered at 512x512 pixels from viewpoints sampled on the upper hemisphere. Realistic Synthetic 360◦ consists of eight objects of complicated geometry and realistic non-Lambertian materials. Six of them are rendered from viewpoints sampled on the upper hemisphere and the two left are from viewpoints sampled on a full sphere with all of them at 800x800 pixels. The real images of complex scenes consist of 8 forward-facing scenes captured with a cellphone at a size of 1008x756 pixels.

2,590 PAPERS • 1 BENCHMARK


The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

2,012 PAPERS • 13 BENCHMARKS


The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely used dataset for the fine-grained visual categorization task. It contains 11,788 images of 200 bird subcategories, 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes, and 1 bounding box. The textual information comes from Reed et al., who expanded the CUB-200-2011 dataset by collecting fine-grained natural language descriptions. Ten single-sentence descriptions were collected for each image through the Amazon Mechanical Turk (AMT) platform; each is required to contain at least 10 words and no information about subcategories or actions.

1,953 PAPERS • 44 BENCHMARKS


The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobooks that are part of the LibriVox project. Most of the audiobooks come from Project Gutenberg. The training data is split into three partitions of 100 hr, 360 hr, and 500 hr, while the dev and test data are split into 'clean' and 'other' categories, depending on how challenging they are for Automatic Speech Recognition systems. Each of the dev and test sets is around 5 hr in audio length. The corpus also provides n-gram language models and the corresponding texts excerpted from Project Gutenberg books, which contain 803M tokens and 977K unique words.

1,951 PAPERS • 12 BENCHMARKS


The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answers of questions can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 un-answerable questions written adversarially by crowdworkers in forms that are similar to the answerable ones.

1,913 PAPERS • 16 BENCHMARKS


ShapeNet is a large-scale repository of 3D CAD models developed by researchers from Stanford University, Princeton University, and the Toyota Technological Institute at Chicago, USA. The repository contains over 3 million models, 220,000 of which are classified into 3,135 classes arranged using WordNet hypernym-hyponym relationships. The ShapeNet Parts subset contains 31,693 meshes categorized into 16 common object classes (i.e., table, chair, plane, etc.). Each shape's ground truth contains 2-5 parts (with a total of 50 part classes).

1,681 PAPERS • 13 BENCHMARKS


The Multi-Genre Natural Language Inference (MultiNLI) dataset has 433K sentence pairs. Its size and mode of collection are modeled closely on SNLI. MultiNLI offers ten distinct genres (Face-to-face, Telephone, 9/11, Travel, Letters, Oxford University Press, Slate, Verbatim, Government, and Fiction) of written and spoken English data. There are matched dev/test sets, which are derived from the same sources as those in the training set, and mismatched sets, which do not closely resemble any seen at training time.

1,665 PAPERS • 8 BENCHMARKS


UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These 101 categories can be classified into 5 types (Body motion, Human-human interactions, Human-object interactions, Playing musical instruments and Sports). The total length of these video clips is over 27 hours. All the videos are collected from YouTube and have a fixed frame rate of 25 FPS with the resolution of 320 × 240.

1,607 PAPERS • 22 BENCHMARKS


The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset contains additional unlabeled data.

1,571 PAPERS • 11 BENCHMARKS


Visual Question Answering (VQA) is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. The first version of the dataset was released in October 2015. VQA v2.0 was released in April 2017.

1,544 PAPERS • NO BENCHMARKS YET


The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in Boston and Singapore. Each scene is 20 seconds long and annotated at 2Hz. This results in a total of 28130 samples for training, 6019 samples for validation and 6008 samples for testing. The dataset has the full autonomous vehicle data suite: 32-beam LiDAR, 6 cameras and radars with complete 360° coverage. The 3D object detection challenge evaluates the performance on 10 classes: cars, trucks, buses, trailers, construction vehicles, pedestrians, motorcycles, bicycles, traffic cones and barriers.

1,533 PAPERS • 20 BENCHMARKS



MuJoCo (multi-joint dynamics with contact) is a physics engine used to implement environments to benchmark Reinforcement Learning methods.

1,378 PAPERS • 2 BENCHMARKS

mini-Imagenet was proposed in Matching Networks for One Shot Learning (NeurIPS 2016). The dataset consists of 50,000 training images and 10,000 testing images, evenly distributed across 100 classes.

1,249 PAPERS • 19 BENCHMARKS


The ModelNet40 dataset contains synthetic object point clouds. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular because of its varied categories, clean shapes, and well-constructed dataset. The original ModelNet40 consists of 12,311 CAD-generated meshes in 40 categories (such as airplane, car, plant, lamp), of which 9,843 are used for training while the remaining 2,468 are reserved for testing. The corresponding point cloud data points are uniformly sampled from the mesh surfaces and then further preprocessed by moving them to the origin and scaling them into a unit sphere.

1,232 PAPERS • 16 BENCHMARKS


ScanNet is an instance-level indoor RGB-D dataset that includes both 2D and 3D data. It is a collection of labeled voxels rather than points or objects. Up to now, ScanNet v2, the newest version of ScanNet, has collected 1513 annotated scans with an approximate 90% surface coverage. In the semantic segmentation task, this dataset is marked in 20 classes of annotated 3D voxelized objects.

1,229 PAPERS • 19 BENCHMARKS


The SNLI dataset (Stanford Natural Language Inference) consists of 570k sentence-pairs manually labeled as entailment, contradiction, and neutral. Premises are image captions from Flickr30k, while hypotheses were generated by crowd-sourced annotators who were shown a premise and asked to generate entailing, contradicting, and neutral sentences. Annotators were instructed to judge the relation between sentences given that they describe the same event. Each pair is labeled as “entailment”, “neutral”, “contradiction” or “-”, where “-” indicates that an agreement could not be reached.

1,218 PAPERS • 3 BENCHMARKS


Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc. The images were crawled from Flickr, thus inheriting all the biases of that website, and automatically aligned and cropped using dlib. Only images under permissive licenses were collected. Various automatic filters were used to prune the set, and finally Amazon Mechanical Turk was used to remove the occasional statues, paintings, or photos of photos.

1,215 PAPERS • 16 BENCHMARKS

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It includes environments such as Algorithmic, Atari, Box2D, Classic Control, MuJoCo, Robotics, and Toy Text.

1,204 PAPERS • 3 BENCHMARKS


The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.

1,176 PAPERS • 28 BENCHMARKS


Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 million QA pairs, 17 questions per image on average. Compared to the Visual Question Answering dataset, Visual Genome represents a more balanced distribution over 6 question types: What, Where, When, Who, Why and How. The Visual Genome dataset also presents 108K images with densely annotated objects, attributes and relationships.

1,137 PAPERS • 19 BENCHMARKS


The MovieLens datasets, first released in 1998, describe people's expressed preferences for movies. These preferences take the form of tuples, each the result of a person expressing a preference (a 0-5 star rating) for a movie at a particular time. The preferences were entered via the MovieLens web site, a recommender system that asks its users to give movie ratings in order to receive personalized movie recommendations.

1,090 PAPERS • 16 BENCHMARKS

The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.

1,059 PAPERS • 26 BENCHMARKS

CARLA (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. It provides sensors in the form of RGB cameras (with customizable positions), ground-truth depth maps, ground-truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk, and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation).

1,053 PAPERS • 3 BENCHMARKS


The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 (SQuAD). SQuAD v1.1 consists of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The dataset was converted into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue. The QNLI dataset is part of GLUE benchmark.

1,047 PAPERS • 5 BENCHMARKS


Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen are those commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images.

1,037 PAPERS • 14 BENCHMARKS


The Places dataset is proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene categories with more than 5,000 images per category.

1,023 PAPERS • 4 BENCHMARKS


The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories in total, including stuff classes such as sky, road, and grass, and discrete objects such as person, car, and bed.

991 PAPERS • 25 BENCHMARKS


The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example is comprised of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question and one or more short spans from the annotated passage containing the actual answer. The long and the short answer annotations can however be empty. If they are both empty, then there is no answer on the page at all. If the long answer annotation is non-empty, but the short answer annotation is empty, then the annotated passage answers the question but no explicit short answer could be found. Finally 1% of the documents have a passage annotated with a short answer that is “yes” or “no”, instead of a list of short spans.

989 PAPERS • 8 BENCHMARKS


The English Penn Treebank (PTB) corpus, and in particular the section of the corpus corresponding to the articles of Wall Street Journal (WSJ), is one of the most known and used corpus for the evaluation of models for sequence labelling. The task consists of annotating each word with its Part-of-Speech tag. In the most common split of this corpus, sections from 0 to 18 are used for training (38 219 sentences, 912 344 tokens), sections from 19 to 21 are used for validation (5 527 sentences, 131 768 tokens), and sections from 22 to 24 are used for testing (5 462 sentences, 129 654 tokens). The corpus is also commonly used for character-level and word-level Language Modelling.

975 PAPERS • 10 BENCHMARKS


The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or self-taught learning. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes (such as birds, cats, trucks), among which 5,000 images are partitioned for training while the remaining 8,000 images for testing. All the images are color images with 96×96 pixels in size.

956 PAPERS • 17 BENCHMARKS


Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images and 50 test images.

936 PAPERS • 8 BENCHMARKS


Office-Home is a benchmark dataset for domain adaptation which contains 4 domains where each domain consists of 65 categories. The four domains are: Art – artistic images in the form of sketches, paintings, ornamentation, etc.; Clipart – collection of clipart images; Product – images of objects without a background and Real-World – images of objects captured with a regular camera. It contains 15,500 images, with an average of around 70 images per class and a maximum of 99 images in a class.

932 PAPERS • 11 BENCHMARKS


The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical reports records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned to 7.6 codes, on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.

887 PAPERS • 8 BENCHMARKS


The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both the RGB and Depth cameras from the Microsoft Kinect. It features:

838 PAPERS • 20 BENCHMARKS


The MS MARCO (Microsoft MAchine Reading Comprehension) is a collection of datasets focused on deep learning in search. The first dataset was a question answering dataset featuring 100,000 real Bing questions and a human-generated answer. Over time the collection was extended with a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

818 PAPERS • 7 BENCHMARKS


Market-1501 is a large-scale public benchmark dataset for person re-identification. It contains 1501 identities which are captured by six different cameras, and 32,668 pedestrian image bounding-boxes obtained using the Deformable Part Models pedestrian detector. Each person has 3.6 images on average at each viewpoint. The dataset is split into two parts: 750 identities are utilized for training and the remaining 751 identities are used for testing. In the official testing protocol 3,368 query images are selected as probe set to find the correct match across 19,732 reference gallery images.

811 PAPERS • 9 BENCHMARKS

Data Papers & Data Journals


The rise of the "data paper"

Datasets are increasingly being recognized as scholarly products in their own right and, as such, are now being submitted for standalone publication. In many cases, the greatest value of a dataset lies in sharing it, not necessarily in providing interpretation or analysis. One example is a paper presenting a global database of the abundance, biomass, and nitrogen fixation rates of marine diazotrophs: a benchmark dataset that will continue to evolve over time and that has intrinsic value as a standalone research product. Under traditional publication models, such a dataset would not be considered "publishable" because it does not present novel research or interpretation of results. Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited in a repository (which may have very minimal metadata requirements).

What is a data paper?

Data papers thoroughly describe datasets and usually do not include any interpretation or discussion (an exception may be a discussion of the different methods used to collect the data). Some data papers are published in a distinct "Data Papers" section of a well-established journal (the journal Ecology, for example). It is becoming more common, however, to see journals that focus exclusively on the publication of datasets. The purpose of a data journal is to provide quick access to high-quality datasets that are of broad interest to the scientific community. Data papers are intended to facilitate reuse of the dataset, which increases its original value and impact and speeds the pace of research by avoiding unintentional duplication of effort.

Are data papers peer-reviewed?

Data papers typically go through a peer review process in the same manner as articles, but because they are new to scientific practice, the quality and scope of the review process vary across publishers. A good example of a peer-reviewed data journal is Earth System Science Data (ESSD). Its review guidelines are well described and are not all that different from the manuscript review guidelines we are already familiar with.

You might wonder: what is the difference between a 'data paper' and a 'regular article plus a dataset published in a public repository'? The answer is not always clear. Some data papers require just as much preparation as, and are of equal quality to, 'typical' journal articles. Others are brief and present only enough metadata and descriptive content to make the dataset understandable and reusable. In most cases, however, the datasets or databases presented in data papers include much more description than datasets deposited in a repository, even when those datasets were deposited to support a manuscript. Common practices and standards are still evolving in the realm of data papers and data journals; for now, they are the Wild West of data sharing.

Where do the data from data papers live?

Data preservation is a corollary of data papers, not their main purpose. Most data journals do not archive data in-house; instead, they generally require that authors submit the dataset to a repository. These repositories archive the data, provide persistent access, and assign the dataset a unique identifier (DOI). Repositories do not always require that the dataset be linked to a publication (data paper or 'typical' paper; Dryad does require one), but if you are going to the trouble of submitting a dataset to a repository, consider exploring the option of publishing a data paper to support it.

How can I find data journals?

The article by Walters (2020) lists data journals in its appendix and differentiates between "pure" data journals and journals that publish data reports but are devoted mainly to other types of contributions. It also updates previous lists of data journals (Candela et al., 2015).

Walters, William H. 2020. "Data Journals: Incentivizing Data Access and Documentation Within the Scholarly Communication System". Insights 33 (1): 18. DOI: http://doi.org/10.1629/uksg.510

Candela, L., Castelli, D., Manghi, P., & Tani, A. (2015). Data journals: A survey. Journal of the Association for Information Science and Technology, 66(9), 1747–1762. https://doi.org/10.1002/asi.23358

A 2014 blog post by Katherine Akers also has a long list of existing data journals.


Data Collection – Methods Types and Examples


Data Collection

Definition:

Data collection is the process of gathering and collecting information from various sources to analyze and make informed decisions based on the data collected. This can involve various methods, such as surveys, interviews, experiments, and observation.

In order for data collection to be effective, it is important to have a clear understanding of what data is needed and what the purpose of the data collection is. This can involve identifying the population or sample being studied, determining the variables to be measured, and selecting appropriate methods for collecting and recording data.

Types of Data Collection

Types of Data Collection are as follows:

Primary Data Collection

Primary data collection is the process of gathering original and firsthand information directly from the source or target population. This type of data collection involves collecting data that has not been previously gathered, recorded, or published. Primary data can be collected through various methods such as surveys, interviews, observations, experiments, and focus groups. The data collected is usually specific to the research question or objective and can provide valuable insights that cannot be obtained from secondary data sources. Primary data collection is often used in market research, social research, and scientific research.

Secondary Data Collection

Secondary data collection is the process of gathering information from existing sources that have already been collected and analyzed by someone else, rather than conducting new research to collect primary data. Secondary data can be collected from various sources, such as published reports, books, journals, newspapers, websites, government publications, and other documents.

Qualitative Data Collection

Qualitative data collection is used to gather non-numerical data such as opinions, experiences, perceptions, and feelings, through techniques such as interviews, focus groups, observations, and document analysis. It seeks to understand the deeper meaning and context of a phenomenon or situation and is often used in social sciences, psychology, and humanities. Qualitative data collection methods allow for a more in-depth and holistic exploration of research questions and can provide rich and nuanced insights into human behavior and experiences.

Quantitative Data Collection

Quantitative data collection is used to gather numerical data that can be analyzed using statistical methods. This data is typically collected through surveys, experiments, and other structured data collection methods. Quantitative data collection seeks to quantify and measure variables, such as behaviors, attitudes, and opinions, in a systematic and objective way. This data is often used to test hypotheses, identify patterns, and establish correlations between variables. Quantitative data collection methods allow for precise measurement and generalization of findings to a larger population. It is commonly used in fields such as economics, psychology, and the natural sciences.
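
For instance, once numerical responses have been collected, a few lines of Python can produce the descriptive statistics and correlations mentioned above. The variables and values in this sketch are hypothetical, purely to illustrate the kind of analysis quantitative data feeds into.

```python
# A minimal sketch of summarizing quantitative data; the variables are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
hours_studied = rng.normal(10, 3, size=100)                        # hypothetical survey variable
exam_score = 50 + 2.5 * hours_studied + rng.normal(0, 5, size=100)

# Descriptive statistics quantify and summarize the measured variables.
print("Mean score:", round(exam_score.mean(), 1), "SD:", round(exam_score.std(ddof=1), 1))

# A correlation test measures the strength of association between two variables.
r, p_value = stats.pearsonr(hours_studied, exam_score)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")
```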

Data Collection Methods

Data Collection Methods are as follows:

Surveys

Surveys involve asking questions to a sample of individuals or organizations to collect data. Surveys can be conducted in person, over the phone, or online.

Interviews

Interviews involve a one-on-one conversation between the interviewer and the respondent. Interviews can be structured or unstructured and can be conducted in person or over the phone.

Focus Groups

Focus groups are group discussions that are moderated by a facilitator. Focus groups are used to collect qualitative data on a specific topic.

Observation

Observation involves watching and recording the behavior of people, objects, or events in their natural setting. Observation can be done overtly or covertly, depending on the research question.

Experiments

Experiments involve manipulating one or more variables and observing the effect on another variable. Experiments are commonly used in scientific research.

Case Studies

Case studies involve in-depth analysis of a single individual, organization, or event. Case studies are used to gain detailed information about a specific phenomenon.

Secondary Data Analysis

Secondary data analysis involves using existing data that was collected for another purpose. Secondary data can come from various sources, such as government agencies, academic institutions, or private companies.

How to Collect Data

The following are some steps to consider when collecting data:

  • Define the objective : Before you start collecting data, you need to define the objective of the study. This will help you determine what data you need to collect and how to collect it.
  • Identify the data sources : Identify the sources of data that will help you achieve your objective. These sources can be primary sources, such as surveys, interviews, and observations, or secondary sources, such as books, articles, and databases.
  • Determine the data collection method : Once you have identified the data sources, you need to determine the data collection method. This could be through online surveys, phone interviews, or face-to-face meetings.
  • Develop a data collection plan : Develop a plan that outlines the steps you will take to collect the data. This plan should include the timeline, the tools and equipment needed, and the personnel involved.
  • Test the data collection process: Before you start collecting data, test the data collection process to ensure that it is effective and efficient.
  • Collect the data: Collect the data according to the plan you developed in step 4. Make sure you record the data accurately and consistently.
  • Analyze the data: Once you have collected the data, analyze it to draw conclusions and make recommendations.
  • Report the findings: Report the findings of your data analysis to the relevant stakeholders. This could be in the form of a report, a presentation, or a publication.
  • Monitor and evaluate the data collection process: After the data collection process is complete, monitor and evaluate the process to identify areas for improvement in future data collection efforts.
  • Ensure data quality: Ensure that the collected data is of high quality and free from errors. This can be achieved by validating the data for accuracy, completeness, and consistency (see the sketch after this list).
  • Maintain data security: Ensure that the collected data is secure and protected from unauthorized access or disclosure. This can be achieved by implementing data security protocols and using secure storage and transmission methods.
  • Follow ethical considerations: Follow ethical considerations when collecting data, such as obtaining informed consent from participants, protecting their privacy and confidentiality, and ensuring that the research does not cause harm to participants.
  • Use appropriate data analysis methods : Use appropriate data analysis methods based on the type of data collected and the research objectives. This could include statistical analysis, qualitative analysis, or a combination of both.
  • Record and store data properly: Record and store the collected data properly, in a structured and organized format. This will make it easier to retrieve and use the data in future research or analysis.
  • Collaborate with other stakeholders : Collaborate with other stakeholders, such as colleagues, experts, or community members, to ensure that the data collected is relevant and useful for the intended purpose.
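
As noted in the data-quality step above, a short script can check collected records for accuracy, completeness, and consistency before analysis. The following is a minimal sketch with hypothetical column names and validation rules, not a prescribed schema.

```python
# A minimal data-quality check on collected records; columns and rules are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "respondent_id": [1, 2, 2, 4],
    "age": [34, -5, 29, None],      # -5 is inaccurate, None is incomplete
    "satisfaction": [4, 5, 5, 7],   # the scale is supposed to run from 1 to 5
})

issues = {
    "duplicate_ids": int(records["respondent_id"].duplicated().sum()),
    "missing_values": int(records.isna().sum().sum()),
    "age_out_of_range": int((~records["age"].dropna().between(0, 120)).sum()),
    "satisfaction_out_of_range": int((~records["satisfaction"].between(1, 5)).sum()),
}
print(issues)  # review and resolve these issues before moving on to analysis
```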

Applications of Data Collection

Data collection methods are widely used in different fields, including social sciences, healthcare, business, education, and more. Here are some examples of how data collection methods are used in different fields:

  • Social sciences : Social scientists often use surveys, questionnaires, and interviews to collect data from individuals or groups. They may also use observation to collect data on social behaviors and interactions. This data is often used to study topics such as human behavior, attitudes, and beliefs.
  • Healthcare : Data collection methods are used in healthcare to monitor patient health and track treatment outcomes. Electronic health records and medical charts are commonly used to collect data on patients’ medical history, diagnoses, and treatments. Researchers may also use clinical trials and surveys to collect data on the effectiveness of different treatments.
  • Business : Businesses use data collection methods to gather information on consumer behavior, market trends, and competitor activity. They may collect data through customer surveys, sales reports, and market research studies. This data is used to inform business decisions, develop marketing strategies, and improve products and services.
  • Education : In education, data collection methods are used to assess student performance and measure the effectiveness of teaching methods. Standardized tests, quizzes, and exams are commonly used to collect data on student learning outcomes. Teachers may also use classroom observation and student feedback to gather data on teaching effectiveness.
  • Agriculture : Farmers use data collection methods to monitor crop growth and health. Sensors and remote sensing technology can be used to collect data on soil moisture, temperature, and nutrient levels. This data is used to optimize crop yields and minimize waste.
  • Environmental sciences : Environmental scientists use data collection methods to monitor air and water quality, track climate patterns, and measure the impact of human activity on the environment. They may use sensors, satellite imagery, and laboratory analysis to collect data on environmental factors.
  • Transportation : Transportation companies use data collection methods to track vehicle performance, optimize routes, and improve safety. GPS systems, on-board sensors, and other tracking technologies are used to collect data on vehicle speed, fuel consumption, and driver behavior.

Examples of Data Collection

Examples of Data Collection are as follows:

  • Traffic Monitoring: Cities collect real-time data on traffic patterns and congestion through sensors on roads and cameras at intersections. This information can be used to optimize traffic flow and improve safety.
  • Social Media Monitoring : Companies can collect real-time data on social media platforms such as Twitter and Facebook to monitor their brand reputation, track customer sentiment, and respond to customer inquiries and complaints in real-time.
  • Weather Monitoring: Weather agencies collect real-time data on temperature, humidity, air pressure, and precipitation through weather stations and satellites. This information is used to provide accurate weather forecasts and warnings.
  • Stock Market Monitoring : Financial institutions collect real-time data on stock prices, trading volumes, and other market indicators to make informed investment decisions and respond to market fluctuations in real-time.
  • Health Monitoring : Medical devices such as wearable fitness trackers and smartwatches can collect real-time data on a person’s heart rate, blood pressure, and other vital signs. This information can be used to monitor health conditions and detect early warning signs of health issues.

Purpose of Data Collection

The purpose of data collection can vary depending on the context and goals of the study, but generally, it serves to:

  • Provide information: Data collection provides information about a particular phenomenon or behavior that can be used to better understand it.
  • Measure progress : Data collection can be used to measure the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Support decision-making : Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions.
  • Identify trends : Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Monitor and evaluate : Data collection can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.

When to use Data Collection

Data collection is used when there is a need to gather information or data on a specific topic or phenomenon. It is typically used in research, evaluation, and monitoring and is important for making informed decisions and improving outcomes.

Data collection is particularly useful in the following scenarios:

  • Research : When conducting research, data collection is used to gather information on variables of interest to answer research questions and test hypotheses.
  • Evaluation : Data collection is used in program evaluation to assess the effectiveness of programs or interventions, and to identify areas for improvement.
  • Monitoring : Data collection is used in monitoring to track progress towards achieving goals or targets, and to identify any areas that require attention.
  • Decision-making: Data collection is used to provide decision-makers with information that can be used to inform policies, strategies, and actions.
  • Quality improvement : Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Characteristics of Data Collection

Data collection can be characterized by several important characteristics that help to ensure the quality and accuracy of the data gathered. These characteristics include:

  • Validity : Validity refers to the accuracy and relevance of the data collected in relation to the research question or objective.
  • Reliability : Reliability refers to the consistency and stability of the data collection process, ensuring that the results obtained are consistent over time and across different contexts.
  • Objectivity : Objectivity refers to the impartiality of the data collection process, ensuring that the data collected is not influenced by the biases or personal opinions of the data collector.
  • Precision : Precision refers to the degree of accuracy and detail in the data collected, ensuring that the data is specific and accurate enough to answer the research question or objective.
  • Timeliness : Timeliness refers to the efficiency and speed with which the data is collected, ensuring that the data is collected in a timely manner to meet the needs of the research or evaluation.
  • Ethical considerations : Ethical considerations refer to the ethical principles that must be followed when collecting data, such as ensuring confidentiality and obtaining informed consent from participants.

Advantages of Data Collection

There are several advantages of data collection that make it an important process in research, evaluation, and monitoring. These advantages include:

  • Better decision-making : Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions, leading to better decision-making.
  • Improved understanding: Data collection helps to improve our understanding of a particular phenomenon or behavior by providing empirical evidence that can be analyzed and interpreted.
  • Evaluation of interventions: Data collection is essential in evaluating the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Identifying trends and patterns: Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Increased accountability: Data collection increases accountability by providing evidence that can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.
  • Validation of theories: Data collection can be used to test hypotheses and validate theories, leading to a better understanding of the phenomenon being studied.
  • Improved quality: Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Limitations of Data Collection

While data collection has several advantages, it also has some limitations that must be considered. These limitations include:

  • Bias : Data collection can be influenced by the biases and personal opinions of the data collector, which can lead to inaccurate or misleading results.
  • Sampling bias : Data collection may not be representative of the entire population, resulting in sampling bias and inaccurate results.
  • Cost : Data collection can be expensive and time-consuming, particularly for large-scale studies.
  • Limited scope: Data collection is limited to the variables being measured, which may not capture the entire picture or context of the phenomenon being studied.
  • Ethical considerations : Data collection must follow ethical principles to protect the rights and confidentiality of the participants, which can limit the type of data that can be collected.
  • Data quality issues: Data collection may result in data quality issues such as missing or incomplete data, measurement errors, and inconsistencies.
  • Limited generalizability : Data collection may not be generalizable to other contexts or populations, limiting the generalizability of the findings.


Data Analysis: Recently Published Documents


Introduce a Survival Model with Spatial Skew Gaussian Random Effects and its Application in Covid-19 Data Analysis

Futuristic Prediction of Missing Value Imputation Methods Using Extended ANN

Missing data is a universal problem across most research fields and introduces uncertainty into data analysis. Values can be missing for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simply gaps in the study. The nutrition field is no exception to the problem of missing data. Most frequently, the problem is handled by imputing means or medians from the existing dataset, an approach that needs improvement. The paper proposes a hybrid scheme of MICE and ANN, known as extended ANN, to search for and analyze missing values and perform imputation in a given dataset. The proposed mechanism is able to identify blank entries and fill them by examining neighboring records, in order to improve the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against several recent algorithms and mechanisms to assess the efficiency and accuracy of its results.
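
As a rough illustration of the MICE-plus-neural-network idea described in this abstract (not the authors' extended ANN), scikit-learn's IterativeImputer can be combined with a small MLP regressor. The data, network size, and missingness rate below are assumptions made for the sketch.

```python
# MICE-style iterative imputation with a neural-network estimator (illustrative only).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # correlated feature

# Randomly blank out ~10% of entries to simulate missing data.
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Each feature with missing values is modeled from the others, iteratively (MICE-style),
# here using a small neural network as the per-feature regressor.
imputer = IterativeImputer(
    estimator=MLPRegressor(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

print("Mean absolute imputation error:", np.abs(X_imputed[mask] - X[mask]).mean())
```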

Applications of multivariate data analysis in shelf life studies of edible vegetal oils – A review of the few past years

Hypothesis Formalization: Empirical Findings, Software Limitations, and Design Implications

Data analysis requires translating higher level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization . In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation and discuss implications for future tools.

The Complexity and Expressive Power of Limit Datalog

Motivated by applications in declarative data analysis, in this article, we study Datalog_Z, an extension of Datalog with stratified negation and arithmetic functions over integers. This language is known to be undecidable, so we present the fragment of limit Datalog_Z programs, which is powerful enough to naturally capture many important data analysis tasks. In limit Datalog_Z, all intensional predicates with a numeric argument are limit predicates that keep maximal or minimal bounds on numeric values. We show that reasoning in limit Datalog_Z is decidable if a linearity condition restricting the use of multiplication is satisfied. In particular, limit-linear Datalog_Z is complete for Δ_2^EXP and captures Δ_2^P over ordered datasets in the sense of descriptive complexity. We also provide a comprehensive study of several fragments of limit-linear Datalog_Z. We show that semi-positive limit-linear programs (i.e., programs where negation is allowed only in front of extensional atoms) capture coNP over ordered datasets; furthermore, reasoning becomes coNEXP-complete in combined and coNP-complete in data complexity, where the lower bounds hold already for negation-free programs. In order to satisfy the requirements of data-intensive applications, we also propose an additional stability requirement, which causes the complexity of reasoning to drop to EXP in combined and to P in data complexity, thus obtaining the same bounds as for usual Datalog. Finally, we compare our formalisms with the languages underpinning existing Datalog-based approaches for data analysis and show that core fragments of these languages can be encoded as limit programs; this allows us to transfer decidability and complexity upper bounds from limit programs to other formalisms. Therefore, our article provides a unified logical framework for declarative data analysis which can be used as a basis for understanding the impact on expressive power and computational complexity of the key constructs available in existing languages.

An Empirical Study on Cross-Border E-commerce Talent Cultivation – Based on Skill Gap Theory and Big Data Analysis

To resolve the mismatch between the increasing demand for cross-border e-commerce talent and students' skill levels, industry-university-research cooperation, an essential pillar of the interdisciplinary talent cultivation model adopted by colleges and universities, draws on the synergy of the relevant parties and bridges knowledge and practice. Nevertheless, industry-university-research cooperation has emerged only recently in the cross-border e-commerce field and still faces several problems, such as unstable collaboration relationships and vague training plans.

The Effects of Cross-border e-Commerce Platforms on Transnational Digital Entrepreneurship

This research examines the concept of transnational digital entrepreneurship (TDE). The paper integrates the host- and home-country entrepreneurial ecosystems with the digital ecosystem into a framework for the transnational digital entrepreneurial ecosystem. The authors argue that cross-border e-commerce platforms provide critical foundations of the digital entrepreneurial ecosystem, and entrepreneurs who rely on this ecosystem are defined as transnational digital entrepreneurs. Interview data from twelve Chinese immigrant entrepreneurs living in Australia and New Zealand were analyzed as case studies. The analysis reveals that cross-border entrepreneurs do indeed rely on the framework of the transnational digital ecosystem. Cross-border e-commerce platforms not only play a bridging role between home- and host-country ecosystems but also provide the entrepreneurial capital that the digital ecosystem promises.

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources

A Trajectory Evaluator by Sub-tracks for Detecting VOT-Based Anomalous Trajectory

With the popularization of visual object tracking (VOT), more and more trajectory data are being collected and have begun to attract widespread attention in fields such as mobile robotics and intelligent video surveillance. How to clean the anomalous trajectories hidden in this massive data has become a research hotspot, since anomalous trajectories should be detected and removed before the trajectory data can be used effectively. In this article, a Trajectory Evaluator by Sub-tracks (TES) for detecting VOT-based anomalous trajectories is proposed. A Feature of Anomalousness is defined and used as the eigenvector of a classifier to filter tracklet and identity-switch anomalous trajectories; it comprises a Feature of Anomalous Pose and a Feature of Anomalous Sub-tracks (FAS). In comparative experiments, TES achieves better results across different scenes than state-of-the-art methods, and FAS performs better than point flow, least-squares fitting, and Chebyshev polynomial fitting. The results verify that TES is more accurate and effective and supports sub-track trajectory data analysis.
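
One of the fitting baselines mentioned in this abstract can be illustrated with a few lines of NumPy: fit a smooth Chebyshev polynomial to a track and flag points with large residuals as anomalous. This is a hypothetical sketch of that baseline idea, not the TES method; the trajectory, polynomial degree, and threshold are assumptions.

```python
# Chebyshev-fit baseline for spotting anomalous points in a trajectory (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * t) + rng.normal(scale=0.02, size=t.size)  # smooth track plus noise
x[120:125] += 0.6  # inject a short anomalous jump (e.g., an identity switch)

# Fit a low-degree Chebyshev polynomial to the observed coordinate over time.
fit = np.polynomial.Chebyshev.fit(t, x, deg=6)
residuals = np.abs(x - fit(t))

# Flag points whose residual exceeds a multiple of the median absolute residual.
threshold = 5.0 * np.median(residuals)
anomalous_idx = np.flatnonzero(residuals > threshold)
print("Anomalous frames:", anomalous_idx)
```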

15 April 2024

Revealed: the ten research papers that policy documents cite most

Dalmeet Singh Chawla

Dalmeet Singh Chawla is a freelance science journalist based in London.


G7 leaders gather for a photo at the Itsukushima Shrine during the G7 Summit in Hiroshima, Japan in 2023

Policymakers often work behind closed doors — but the documents they produce offer clues about the research that influences them. Credit: Stefan Rousseau/Getty

When David Autor co-wrote a paper on how computerization affects job skill demands more than 20 years ago, a journal took 18 months to consider it — only to reject it after review. He went on to submit it to The Quarterly Journal of Economics , which eventually published the work 1 in November 2003.

Autor’s paper is now the third most cited in policy documents worldwide, according to an analysis of data provided exclusively to Nature . It has accumulated around 1,100 citations in policy documents, show figures from the London-based firm Overton (see ‘The most-cited papers in policy’), which maintains a database of more than 12 million policy documents, think-tank papers, white papers and guidelines.

“I thought it was destined to be quite an obscure paper,” recalls Autor, a public-policy scholar and economist at the Massachusetts Institute of Technology in Cambridge. “I’m excited that a lot of people are citing it.”

The most-cited papers in policy

Economics papers dominate the top ten papers that policy documents reference most.

Data from Sage Policy Profiles as of 15 April 2024

The top ten most cited papers in policy documents are dominated by economics research; the number one most referenced study has around 1,300 citations. When economics studies are excluded, a 1997 Nature paper 2 about Earth’s ecosystem services and natural capital is second on the list, with more than 900 policy citations. The paper has also garnered more than 32,000 references from other studies, according to Google Scholar. Other highly cited non-economics studies include works on planetary boundaries, sustainable foods and the future of employment (see ‘Most-cited papers — excluding economics research’).

These lists provide insight into the types of research that politicians pay attention to, but policy citations don’t necessarily imply impact or influence, and Overton’s database has a bias towards documents published in English.

Interdisciplinary impact

Overton usually charges a licence fee to access its citation data. But last year, the firm worked with the London-based publisher Sage to release a free web-based tool that allows any researcher to find out how many times policy documents have cited their papers or mention their names. Overton and Sage said they created the tool, called Sage Policy Profiles, to help researchers to demonstrate the impact or influence their work might be having on policy. This can be useful for researchers during promotion or tenure interviews and in grant applications.

Autor thinks his study stands out because his paper was different from what other economists were writing at the time. It suggested that ‘middle-skill’ work, typically done in offices or factories by people who haven’t attended university, was going to be largely automated, leaving workers with either highly skilled jobs or manual work. “It has stood the test of time,” he says, “and it got people to focus on what I think is the right problem.” That topic is just as relevant today, Autor says, especially with the rise of artificial intelligence.

Most-cited papers — excluding economics research

When economics studies are excluded, the research papers that policy documents most commonly reference cover topics including climate change and nutrition.

Walter Willett, an epidemiologist and food scientist at the Harvard T.H. Chan School of Public Health in Boston, Massachusetts, thinks that interdisciplinary teams are most likely to gain a lot of policy citations. He co-authored a paper on the list of most cited non-economics studies: a 2019 work 3 that was part of a Lancet commission to investigate how to feed the global population a healthy and environmentally sustainable diet by 2050 and has accumulated more than 600 policy citations.

“I think it had an impact because it was clearly a multidisciplinary effort,” says Willett. The work was co-authored by 37 scientists from 17 countries. The team included researchers from disciplines including food science, health metrics, climate change, ecology and evolution and bioethics. “None of us could have done this on our own. It really did require working with people outside our fields.”

Sverker Sörlin, an environmental historian at the KTH Royal Institute of Technology in Stockholm, agrees that papers with a diverse set of authors often attract more policy citations. “It’s the combined effect that is often the key to getting more influence,” he says.


Sörlin co-authored two papers in the list of top ten non-economics papers. One of those is a 2015 Science paper 4 on planetary boundaries — a concept defining the environmental limits in which humanity can develop and thrive — which has attracted more than 750 policy citations. Sörlin thinks one reason it has been popular is that it’s a sequel to a 2009 Nature paper 5 he co-authored on the same topic, which has been cited by policy documents 575 times.

Although policy citations don’t necessarily imply influence, Willett has seen evidence that his paper is prompting changes in policy. He points to Denmark as an example, noting that the nation is reformatting its dietary guidelines in line with the study’s recommendations. “I certainly can’t say that this document is the only thing that’s changing their guidelines,” he says. But “this gave it the support and credibility that allowed them to go forward”.

Broad brush

Peter Gluckman, who was the chief science adviser to the prime minister of New Zealand between 2009 and 2018, is not surprised by the lists. He expects policymakers to refer to broad-brush papers rather than those reporting on incremental advances in a field.

Gluckman, a paediatrician and biomedical scientist at the University of Auckland in New Zealand, notes that it’s important to consider the context in which papers are being cited, because studies reporting controversial findings sometimes attract many citations. He also warns that the list is probably not comprehensive: many policy papers are not easily accessible to tools such as Overton, which uses text mining to compile data, and so will not be included in the database.


“The thing that worries me most is the age of the papers that are involved,” Gluckman says. “Does that tell us something about just the way the analysis is done or that relatively few papers get heavily used in policymaking?”

Gluckman says it’s strange that some recent work on climate change, food security, social cohesion and similar areas hasn’t made it to the non-economics list. “Maybe it’s just because they’re not being referred to,” he says, or perhaps that work is cited, in turn, in the broad-scope papers that are most heavily referenced in policy documents.

As for Sage Policy Profiles, Gluckman says it’s always useful to get an idea of which studies are attracting attention from policymakers, but he notes that studies often take years to influence policy. “Yet the average academic is trying to make a claim here and now that their current work is having an impact,” he adds. “So there’s a disconnect there.”

Willett thinks policy citations are probably more important than scholarly citations in other papers. “In the end, we don’t want this to just sit on an academic shelf.”

doi: https://doi.org/10.1038/d41586-024-00660-1

References

1. Autor, D. H., Levy, F. & Murnane, R. J. Q. J. Econ. 118, 1279–1333 (2003).
2. Costanza, R. et al. Nature 387, 253–260 (1997).
3. Willett, W. et al. Lancet 393, 447–492 (2019).
4. Steffen, W. et al. Science 347, 1259855 (2015).
5. Rockström, J. et al. Nature 461, 472–475 (2009).


Computer Science > Computation and Language

Title: ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models

Abstract: Scientific Research, vital for improving human life, is hindered by its inherent complexity, slow pace, and the need for specialized experts. To enhance its productivity, we propose a ResearchAgent, a large language model-powered research idea writing agent, which automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature. Specifically, starting with a core paper as the primary focus to generate ideas, our ResearchAgent is augmented not only with relevant publications through connecting information over an academic graph but also entities retrieved from an entity-centric knowledge store based on their underlying concepts, mined and shared across numerous papers. In addition, mirroring the human approach to iteratively improving ideas with peer discussions, we leverage multiple ReviewingAgents that provide reviews and feedback iteratively. Further, they are instantiated with human preference-aligned large language models whose criteria for evaluation are derived from actual human judgments. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines, showcasing its effectiveness in generating novel, clear, and valid research ideas based on human and model-based evaluation results.
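
The abstract describes an iterative generate-review-refine architecture. Below is a minimal, hypothetical sketch of such a loop in Python; call_llm is a placeholder for whatever model client you use, and the prompts, review criteria, and number of iterations are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of an iterative idea-generation loop with reviewing agents.
# `call_llm` is a placeholder: swap in your own LLM client. This is not the paper's code.
def call_llm(prompt: str) -> str:
    """Placeholder for a large-language-model call (assumption)."""
    return f"[model output for: {prompt[:50]}...]"

def generate_idea(core_paper: str, related: list[str], entities: list[str]) -> dict:
    """Draft a research idea (problem, method, experiment design) from a core paper plus context."""
    context = f"Core paper: {core_paper}\nRelated papers: {related}\nEntities: {entities}"
    return {part: call_llm(f"Propose the {part} of a research idea.\n{context}")
            for part in ("problem", "method", "experiment design")}

def review_idea(idea: dict, criterion: str) -> str:
    """One 'reviewing agent' critiques the idea against a single criterion."""
    return call_llm(f"Review this idea for {criterion}: {idea}")

def refine_idea(idea: dict, feedback: list[str]) -> dict:
    """Revise each part of the idea in light of all reviewers' feedback."""
    return {part: call_llm(f"Revise the {part} '{text}' given feedback: {feedback}")
            for part, text in idea.items()}

idea = generate_idea("core paper abstract...", ["related paper A"], ["concept: knowledge graph"])
for _ in range(3):  # iterative refinement, mirroring peer discussion
    feedback = [review_idea(idea, c) for c in ("novelty", "clarity", "validity")]
    idea = refine_idea(idea, feedback)
print(idea)
```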

What the data says about abortion in the U.S.

Pew Research Center has conducted many surveys about abortion over the years, providing a lens into Americans’ views on whether the procedure should be legal, among a host of other questions.

In a  Center survey  conducted nearly a year after the Supreme Court’s June 2022 decision that  ended the constitutional right to abortion , 62% of U.S. adults said the practice should be legal in all or most cases, while 36% said it should be illegal in all or most cases. Another survey conducted a few months before the decision showed that relatively few Americans take an absolutist view on the issue .

Find answers to common questions about abortion in America, based on data from the Centers for Disease Control and Prevention (CDC) and the Guttmacher Institute, which have tracked these patterns for several decades:

  • How many abortions are there in the U.S. each year?
  • How has the number of abortions in the U.S. changed over time?
  • What is the abortion rate among women in the U.S.? How has it changed over time?
  • What are the most common types of abortion?
  • How many abortion providers are there in the U.S., and how has that number changed?
  • What percentage of abortions are for women who live in a different state from the abortion provider?
  • What are the demographics of women who have had abortions?
  • When during pregnancy do most abortions occur?
  • How often are there medical complications from abortion?

This compilation of data on abortion in the United States draws mainly from two sources: the Centers for Disease Control and Prevention (CDC) and the Guttmacher Institute, both of which have regularly compiled national abortion data for approximately half a century, and which collect their data in different ways.

The CDC data that is highlighted in this post comes from the agency’s “abortion surveillance” reports, which have been published annually since 1974 (and which have included data from 1969). Its figures from 1973 through 1996 include data from all 50 states, the District of Columbia and New York City – 52 “reporting areas” in all. Since 1997, the CDC’s totals have lacked data from some states (most notably California) for the years that those states did not report data to the agency. The four reporting areas that did not submit data to the CDC in 2021 – California, Maryland, New Hampshire and New Jersey – accounted for approximately 25% of all legal induced abortions in the U.S. in 2020, according to Guttmacher’s data. Most states, though,  do  have data in the reports, and the figures for the vast majority of them came from each state’s central health agency, while for some states, the figures came from hospitals and other medical facilities.

Discussion of CDC abortion data involving women’s state of residence, marital status, race, ethnicity, age, abortion history and the number of previous live births excludes the low share of abortions where that information was not supplied. Read the methodology for the CDC’s latest abortion surveillance report , which includes data from 2021, for more details. Previous reports can be found at  stacks.cdc.gov  by entering “abortion surveillance” into the search box.

For the numbers of deaths caused by induced abortions in 1963 and 1965, this analysis looks at reports by the then-U.S. Department of Health, Education and Welfare, a precursor to the Department of Health and Human Services. In computing those figures, we excluded abortions listed in the report under the categories “spontaneous or unspecified” or as “other.” (“Spontaneous abortion” is another way of referring to miscarriages.)

Guttmacher data in this post comes from national surveys of abortion providers that Guttmacher has conducted 19 times since 1973. Guttmacher compiles its figures after contacting every known provider of abortions – clinics, hospitals and physicians’ offices – in the country. It uses questionnaires and health department data, and it provides estimates for abortion providers that don’t respond to its inquiries. (In 2020, the last year for which it has released data on the number of abortions in the U.S., it used estimates for 12% of abortions.) For most of the 2000s, Guttmacher has conducted these national surveys every three years, each time getting abortion data for the prior two years. For each interim year, Guttmacher has calculated estimates based on trends from its own figures and from other data.

The latest full summary of Guttmacher data came in the institute’s report titled “Abortion Incidence and Service Availability in the United States, 2020.” It includes figures for 2020 and 2019 and estimates for 2018. The report includes a methods section.

In addition, this post uses data from StatPearls, an online health care resource, on complications from abortion.

An exact answer is hard to come by. The CDC and the Guttmacher Institute have each tried to measure this for around half a century, but they use different methods and publish different figures.

The last year for which the CDC reported a yearly national total for abortions is 2021. It found there were 625,978 abortions in the District of Columbia and the 46 states with available data that year, up from 597,355 in those states and D.C. in 2020. The corresponding figure for 2019 was 607,720.

The last year for which Guttmacher reported a yearly national total was 2020. It said there were 930,160 abortions that year in all 50 states and the District of Columbia, compared with 916,460 in 2019.

  • How the CDC gets its data: It compiles figures that are voluntarily reported by states’ central health agencies, including separate figures for New York City and the District of Columbia. Its latest totals do not include figures from California, Maryland, New Hampshire or New Jersey, which did not report data to the CDC. ( Read the methodology from the latest CDC report .)
  • How Guttmacher gets its data: It compiles its figures after contacting every known abortion provider – clinics, hospitals and physicians’ offices – in the country. It uses questionnaires and health department data, then provides estimates for abortion providers that don’t respond. Guttmacher’s figures are higher than the CDC’s in part because they include data (and in some instances, estimates) from all 50 states. ( Read the institute’s latest full report and methodology .)

While the Guttmacher Institute supports abortion rights, its empirical data on abortions in the U.S. has been widely cited by  groups  and  publications  across the political spectrum, including by a  number of those  that  disagree with its positions .

These estimates from Guttmacher and the CDC are results of multiyear efforts to collect data on abortion across the U.S. Last year, Guttmacher also began publishing less precise estimates every few months , based on a much smaller sample of providers.

The figures reported by these organizations include only legal induced abortions conducted by clinics, hospitals or physicians’ offices, or those that make use of abortion pills dispensed from certified facilities such as clinics or physicians’ offices. They do not account for the use of abortion pills that were obtained  outside of clinical settings .


A line chart showing the changing number of legal abortions in the U.S. since the 1970s.

The annual number of U.S. abortions rose for years after Roe v. Wade legalized the procedure in 1973, reaching its highest levels around the late 1980s and early 1990s, according to both the CDC and Guttmacher. Since then, abortions have generally decreased at what a CDC analysis called  “a slow yet steady pace.”

Guttmacher says the number of abortions occurring in the U.S. in 2020 was 40% lower than it was in 1991. According to the CDC, the number was 36% lower in 2021 than in 1991, looking just at the District of Columbia and the 46 states that reported both of those years.

(The corresponding line graph shows the long-term trend in the number of legal abortions reported by both organizations. To allow for consistent comparisons over time, the CDC figures in the chart have been adjusted to ensure that the same states are counted from one year to the next. Using that approach, the CDC figure for 2021 is 622,108 legal abortions.)

There have been occasional breaks in this long-term pattern of decline – during the middle of the first decade of the 2000s, and then again in the late 2010s. The CDC reported modest 1% and 2% increases in abortions in 2018 and 2019, and then, after a 2% decrease in 2020, a 5% increase in 2021. Guttmacher reported an 8% increase over the three-year period from 2017 to 2020.

As noted above, these figures do not include abortions that use pills obtained outside of clinical settings.

Guttmacher says that in 2020 there were 14.4 abortions in the U.S. per 1,000 women ages 15 to 44. Its data shows that the rate of abortions among women has generally been declining in the U.S. since 1981, when it reported there were 29.3 abortions per 1,000 women in that age range.

The CDC says that in 2021, there were 11.6 abortions in the U.S. per 1,000 women ages 15 to 44. (That figure excludes data from California, the District of Columbia, Maryland, New Hampshire and New Jersey.) Like Guttmacher’s data, the CDC’s figures also suggest a general decline in the abortion rate over time. In 1980, when the CDC reported on all 50 states and D.C., it said there were 25 abortions per 1,000 women ages 15 to 44.

That said, both Guttmacher and the CDC say there were slight increases in the rate of abortions during the late 2010s and early 2020s. Guttmacher says the abortion rate per 1,000 women ages 15 to 44 rose from 13.5 in 2017 to 14.4 in 2020. The CDC says it rose from 11.2 per 1,000 in 2017 to 11.4 in 2019, before falling back to 11.1 in 2020 and then rising again to 11.6 in 2021. (The CDC’s figures for those years exclude data from California, D.C., Maryland, New Hampshire and New Jersey.)
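
For reference, the "per 1,000 women ages 15 to 44" figures quoted above are simple ratios. The short calculation below back-solves an illustrative population size from Guttmacher's published 2020 count and rate; the derived population figure is only the approximation implied by those two numbers, not an official estimate.

```python
# Worked example of the "abortions per 1,000 women ages 15-44" rate (illustrative only).
abortions_2020 = 930_160   # Guttmacher's 2020 national total, as quoted above
rate_per_1000 = 14.4       # Guttmacher's reported 2020 rate, as quoted above

# The population size implied by those two figures (an approximation, not official data).
women_15_44 = abortions_2020 / rate_per_1000 * 1000
print(f"Implied number of women ages 15-44: {women_15_44:,.0f}")  # roughly 64.6 million

def rate_per_1000_women(n_abortions: float, n_women: float) -> float:
    """Rate = abortions divided by women in the age range, scaled to 1,000 women."""
    return n_abortions / n_women * 1000

print(round(rate_per_1000_women(abortions_2020, women_15_44), 1))  # recovers 14.4
```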

The CDC broadly divides abortions into two categories: surgical abortions and medication abortions, which involve pills. Since the Food and Drug Administration first approved abortion pills in 2000, their use has increased over time as a share of abortions nationally, according to both the CDC and Guttmacher.

The majority of abortions in the U.S. now involve pills, according to both the CDC and Guttmacher. The CDC says 56% of U.S. abortions in 2021 involved pills, up from 53% in 2020 and 44% in 2019. Its figures for 2021 include the District of Columbia and 44 states that provided this data; its figures for 2020 include D.C. and 44 states (though not all of the same states as in 2021), and its figures for 2019 include D.C. and 45 states.

Guttmacher, which measures this every three years, says 53% of U.S. abortions involved pills in 2020, up from 39% in 2017.

Two pills commonly used together for medication abortions are mifepristone, which, taken first, blocks hormones that support a pregnancy, and misoprostol, which then causes the uterus to empty. According to the FDA, medication abortions are safe  until 10 weeks into pregnancy.

Surgical abortions conducted  during the first trimester  of pregnancy typically use a suction process, while the relatively few surgical abortions that occur  during the second trimester  of a pregnancy typically use a process called dilation and evacuation, according to the UCLA School of Medicine.

In 2020, there were 1,603 facilities in the U.S. that provided abortions,  according to Guttmacher . This included 807 clinics, 530 hospitals and 266 physicians’ offices.

A horizontal stacked bar chart showing the total number of abortion providers down since 1982.

While clinics make up half of the facilities that provide abortions, they are the sites where the vast majority (96%) of abortions are administered, either through procedures or the distribution of pills, according to Guttmacher’s 2020 data. (This includes 54% of abortions that are administered at specialized abortion clinics and 43% at nonspecialized clinics.) Hospitals made up 33% of the facilities that provided abortions in 2020 but accounted for only 3% of abortions that year, while just 1% of abortions were conducted by physicians’ offices.

Looking just at clinics – that is, the total number of specialized abortion clinics and nonspecialized clinics in the U.S. – Guttmacher found the total virtually unchanged between 2017 (808 clinics) and 2020 (807 clinics). However, there were regional differences. In the Midwest, the number of clinics that provide abortions increased by 11% during those years, and in the West by 6%. The number of clinics  decreased  during those years by 9% in the Northeast and 3% in the South.

The total number of abortion providers has declined dramatically since the 1980s. In 1982, according to Guttmacher, there were 2,908 facilities providing abortions in the U.S., including 789 clinics, 1,405 hospitals and 714 physicians’ offices.

The CDC does not track the number of abortion providers.

In the District of Columbia and the 46 states that provided abortion and residency information to the CDC in 2021, 10.9% of all abortions were performed on women known to live outside the state where the abortion occurred – slightly higher than the percentage in 2020 (9.7%). That year, D.C. and 46 states (though not the same ones as in 2021) reported abortion and residency data. (The total number of abortions used in these calculations included figures for women with both known and unknown residential status.)

The share of reported abortions performed on women outside their state of residence was much higher before the 1973 Roe decision that stopped states from banning abortion. In 1972, 41% of all abortions in D.C. and the 20 states that provided this information to the CDC that year were performed on women outside their state of residence. In 1973, the corresponding figure was 21% in the District of Columbia and the 41 states that provided this information, and in 1974 it was 11% in D.C. and the 43 states that provided data.

In the District of Columbia and the 46 states that reported age data to  the CDC in 2021, the majority of women who had abortions (57%) were in their 20s, while about three-in-ten (31%) were in their 30s. Teens ages 13 to 19 accounted for 8% of those who had abortions, while women ages 40 to 44 accounted for about 4%.

The vast majority of women who had abortions in 2021 were unmarried (87%), while married women accounted for 13%, according to  the CDC , which had data on this from 37 states.

A pie chart showing that, in 2021, majority of abortions were for women who had never had one before.

In the District of Columbia, New York City (but not the rest of New York) and the 31 states that reported racial and ethnic data on abortion to  the CDC , 42% of all women who had abortions in 2021 were non-Hispanic Black, while 30% were non-Hispanic White, 22% were Hispanic and 6% were of other races.

Looking at abortion rates among those ages 15 to 44, there were 28.6 abortions per 1,000 non-Hispanic Black women in 2021; 12.3 abortions per 1,000 Hispanic women; 6.4 abortions per 1,000 non-Hispanic White women; and 9.2 abortions per 1,000 women of other races, the  CDC reported  from those same 31 states, D.C. and New York City.

For 57% of U.S. women who had induced abortions in 2021, it was the first time they had ever had one,  according to the CDC.  For nearly a quarter (24%), it was their second abortion. For 11% of women who had an abortion that year, it was their third, and for 8% it was their fourth or more. These CDC figures include data from 41 states and New York City, but not the rest of New York.

A bar chart showing that most U.S. abortions in 2021 were for women who had previously given birth.

Nearly four-in-ten women who had abortions in 2021 (39%) had no previous live births at the time they had an abortion,  according to the CDC . Almost a quarter (24%) of women who had abortions in 2021 had one previous live birth, 20% had two previous live births, 10% had three, and 7% had four or more previous live births. These CDC figures include data from 41 states and New York City, but not the rest of New York.

The vast majority of abortions occur during the first trimester of a pregnancy. In 2021, 93% of abortions occurred during the first trimester – that is, at or before 13 weeks of gestation,  according to the CDC . An additional 6% occurred between 14 and 20 weeks of pregnancy, and about 1% were performed at 21 weeks or more of gestation. These CDC figures include data from 40 states and New York City, but not the rest of New York.

About 2% of all abortions in the U.S. involve some type of complication for the woman , according to an article in StatPearls, an online health care resource. “Most complications are considered minor such as pain, bleeding, infection and post-anesthesia complications,” according to the article.

The CDC calculates  case-fatality rates for women from induced abortions – that is, how many women die from abortion-related complications, for every 100,000 legal abortions that occur in the U.S .  The rate was lowest during the most recent period examined by the agency (2013 to 2020), when there were 0.45 deaths to women per 100,000 legal induced abortions. The case-fatality rate reported by the CDC was highest during the first period examined by the agency (1973 to 1977), when it was 2.09 deaths to women per 100,000 legal induced abortions. During the five-year periods in between, the figure ranged from 0.52 (from 1993 to 1997) to 0.78 (from 1978 to 1982).

The CDC calculates death rates by five-year and seven-year periods because of year-to-year fluctuation in the numbers and due to the relatively low number of women who die from legal induced abortions.

In 2020, the last year for which the CDC has information , six women in the U.S. died due to complications from induced abortions. Four women died in this way in 2019, two in 2018, and three in 2017. (These deaths all followed legal abortions.) Since 1990, the annual number of deaths among women due to legal induced abortion has ranged from two to 12.

The annual number of reported deaths from induced abortions (legal and illegal) tended to be higher in the 1980s, when it ranged from nine to 16, and from 1972 to 1979, when it ranged from 13 to 63. One driver of the decline was the drop in deaths from illegal abortions. There were 39 deaths from illegal abortions in 1972, the last full year before Roe v. Wade. The total fell to 19 in 1973 and to single digits or zero every year after that. (The number of deaths from legal abortions has also declined since then, though with some slight variation over time.)

The number of deaths from induced abortions was considerably higher in the 1960s than afterward. For instance, there were 119 deaths from induced abortions in  1963  and 99 in  1965 , according to reports by the then-U.S. Department of Health, Education and Welfare, a precursor to the Department of Health and Human Services. The CDC is a division of Health and Human Services.

Note: This is an update of a post originally published May 27, 2022, and first updated June 24, 2022.


April 18, 2024


A third of China's urban population at risk of city sinking, new satellite data shows

by University of East Anglia


Land subsidence is overlooked as a hazard in cities, according to scientists from the University of East Anglia (UEA) and Virginia Tech. Writing in the journal Science, Prof Robert Nicholls of the Tyndall Centre for Climate Change Research at UEA and Prof Manoochehr Shirzaei of Virginia Tech and the United Nations University for Water, Environment and Health, Ontario, highlight the importance of a new research paper analyzing satellite data that accurately and consistently maps land movement across China.

In their comment article, they say that consistently measuring subsidence is a great achievement, but they argue it is only the start of finding solutions. Predicting future subsidence requires models that consider all drivers, including human activities and climate change, and how they might change with time.

The research paper, published in the same issue, considers 82 cities with a collective population of nearly 700 million people. The results show that 45% of the urban areas that were analyzed are sinking, with 16% falling at a rate of 10mm a year or more.

Nationally, roughly 270 million urban residents are estimated to be affected, with nearly 70 million experiencing rapid subsidence of 10mm a year or more. Hotspots include Beijing and Tianjin.
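Headline shares like these are essentially a thresholding-and-weighting exercise over the satellite-derived velocity data. The sketch below illustrates that kind of aggregation with invented city names, rates, and populations; it uses population shares only and is not a reconstruction of the study's area-based analysis.

```python
# Minimal sketch of aggregating satellite-derived subsidence rates into
# national shares. All figures below are invented for illustration and
# are not taken from the Science study.

cities = [
    # (city, mean subsidence rate in mm/yr, urban population in millions)
    ("City A", 12.0, 21.0),
    ("City B", 11.0, 14.0),
    ("City C",  4.0, 25.0),
    ("City D", -1.0,  9.0),   # negative rate = uplift
]

total_pop   = sum(pop for _, _, pop in cities)
sinking_pop = sum(pop for _, rate, pop in cities if rate > 0)
rapid_pop   = sum(pop for _, rate, pop in cities if rate >= 10)

print(f"Residents on sinking land: {100 * sinking_pop / total_pop:.0f}%")
print(f"Residents on rapidly sinking land (>= 10 mm/yr): {100 * rapid_pop / total_pop:.0f}%")
```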

Coastal cities such as Tianjin are especially affected because sinking land compounds the effects of climate change and sea-level rise. The sinking of sea defenses is one reason why Hurricane Katrina's flooding caused such devastation and loss of life in New Orleans in 2005.

Shanghai—China's biggest city—has subsided up to 3m over the past century and continues to subside today. When subsidence is combined with sea-level rise, the urban area in China below sea level could triple in size by 2120, affecting 55 to 128 million residents. This could be catastrophic without a strong societal response.
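To see how subsidence and sea-level rise compound, consider a minimal sketch with assumed, constant rates; the rates and the starting elevation are illustrative and not taken from the study.

```python
# Minimal sketch: relative elevation change when land subsidence and
# sea-level rise act together. The rates are assumed for illustration,
# not figures from the Science study.

def relative_drop_mm(subsidence_mm_yr, slr_mm_yr, years):
    """Drop of the land surface relative to mean sea level (mm),
    assuming both rates stay constant over the period."""
    return (subsidence_mm_yr + slr_mm_yr) * years

# A coastal district sinking at 10 mm/yr with ~4 mm/yr of sea-level rise
drop = relative_drop_mm(subsidence_mm_yr=10, slr_mm_yr=4, years=100)
print(f"Relative drop over a century: {drop / 1000:.1f} m")  # 1.4 m

# A district starting 1.0 m above today's sea level would end up ~0.4 m
# below it, which is how combined subsidence and sea-level rise can
# expand the urban area lying below sea level.
```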

"Subsidence jeopardizes the structural integrity of buildings and critical infrastructure and exacerbates the impacts of climate change in terms of flooding, particularly in coastal cities where it reinforces sea-level rise," said Prof Nicholls, who was not involved in the study, but whose research focuses on sea-level rise, coastal erosion and flooding, and how communities can adapt to these changes.

The subsidence is mainly caused by human activity in the cities. Groundwater withdrawal, which lowers the water table, is considered the most important driver of subsidence, compounded by local geology and the weight of buildings.

In Osaka and Tokyo, groundwater withdrawal was stopped in the 1970s and city subsidence has since ceased or been greatly reduced, showing that this is an effective mitigation strategy. Traffic vibration and tunneling can also contribute locally: Beijing is sinking by 45mm a year near subways and highways. Natural upward or downward land movement also occurs, but it is generally much smaller than human-induced changes.

While human-induced subsidence was known in China before this study, Profs Nicholls and Shirzaei say the new results reinforce the need for a national response. The problem is not confined to China: susceptible cities across the world face it as well.

They call for the research community to move from measurement to understanding the implications and supporting responses. The new satellite measurements deliver detailed subsidence data, but the methods for turning this information into something city planners can act on need much more development. Affected coastal cities, in China and more widely, need particular attention.

"Many cities and areas worldwide are developing strategies for managing the risks of climate change and sea-level rise ," said Prof Nicholls. "We need to learn from this experience to also address the threat of subsidence which is more common than currently recognized."

Zurui Ao et al., A national-scale assessment of land subsidence in China's major cities, Science (2024). DOI: 10.1126/science.adl4366, www.science.org/doi/10.1126/science.adl4366

Journal information: Science

Provided by University of East Anglia

Explore further

Feedback to editors

research paper and data

Comprehensive model unravels quantum-mechanical effects behind photoluminescence in thin gold films

19 minutes ago

research paper and data

Cosmic rays streamed through Earth's atmosphere 41,000 years ago: New findings on the Laschamps excursion

24 minutes ago

research paper and data

Study suggests Io's volcanoes have been active for 4.5 billion years

29 minutes ago

research paper and data

Ghost particle on the scales: Research offers more precise determination of neutrino mass

3 hours ago

research paper and data

Light show in living cells: New method allows simultaneous fluorescent labeling of many proteins

research paper and data

Warming of Antarctic deep-sea waters contribute to sea level rise in North Atlantic, study finds

research paper and data

Unraveling water mysteries beyond Earth: Ground-penetrating radar will seek bodies of water on Jupiter

4 hours ago

research paper and data

Baby white sharks prefer being closer to shore, scientists find

8 hours ago

research paper and data

Key protein regulates immune response to viruses in mammal cells

12 hours ago

research paper and data

Unraveling the mysteries of consecutive atmospheric river events

15 hours ago

Relevant PhysicsForums posts

Unlocking the secrets of prof. verschure's rosetta stones.

6 hours ago

Iceland warming up again - quakes swarming

19 hours ago

Tidal friction and global warming

Apr 18, 2024

Large eruption at Ruang volcano, Indonesia

Apr 17, 2024

M 4.8 - Whitehouse Station, New Jersey, US

Apr 6, 2024

Major Earthquakes - 7.4 (7.2) Mag and 6.4 Mag near Hualien, Taiwan

Apr 5, 2024

More from Earth Sciences

Related Stories

research paper and data

From New York to Jakarta, land in many coastal cities is sinking faster than sea levels are rising

Jan 25, 2024

research paper and data

Sinking US cities more exposed to rising seas: Study

Mar 9, 2024

research paper and data

Sea level rise up to four times global average for coastal communities

Mar 8, 2021

research paper and data

Study: From NYC to DC and beyond, cities on the East Coast are sinking

Jan 2, 2024

research paper and data

Asian coastal cities sinking fast: study

Sep 25, 2022

research paper and data

Study suggests sinking land increases risk for thousands of coastal residents by 2050

Mar 6, 2024

Recommended for you

research paper and data

Scientists reveal hydroclimatic changes on multiple timescales in Central Asia over the past 7,800 years

16 hours ago

research paper and data

Toxic fireproof chemicals can be absorbed through touch, 3D-printed skin model shows

17 hours ago

research paper and data

Drawing a line back to the origin of life: Graphitization could provide simplicity scientists are looking for

20 hours ago

research paper and data

Dense network of seismometers reveals how the underground ruptures

Let us know if there is a problem with our content.

Use this form if you have come across a typo, inaccuracy or would like to send an edit request for the content on this page. For general inquiries, please use our contact form . For general feedback, use the public comments section below (please adhere to guidelines ).

Please select the most appropriate category to facilitate processing of your request

Thank you for taking time to provide your feedback to the editors.

Your feedback is important to us. However, we do not guarantee individual replies due to the high volume of messages.

E-mail the story

Your email address is used only to let the recipient know who sent the email. Neither your address nor the recipient's address will be used for any other purpose. The information you enter will appear in your e-mail message and is not retained by Phys.org in any form.

Newsletter sign up

Get weekly and/or daily updates delivered to your inbox. You can unsubscribe at any time and we'll never share your details to third parties.

More information Privacy policy

Donate and enjoy an ad-free experience

We keep our content available to everyone. Consider supporting Science X's mission by getting a premium account.

E-mail newsletter

IMAGES

  1. Tables in Research Paper

    research paper and data

  2. Parts of a Research Paper

    research paper and data

  3. Sample Research Paper

    research paper and data

  4. (PDF) Statistical Analysis of Data in Research Methodology

    research paper and data

  5. 😀 Research paper format. The Basics of a Research Paper Format. 2019-02-10

    research paper and data

  6. Tips For How To Write A Scientific Research Paper

    research paper and data

VIDEO

  1. F# Tutorial: Using the Array.collect function

  2. Challenges and Opportunities for Educational Data Mining ! Research Paper review

  3. F# Tutorial: Introducing the actor model

  4. Overloaded methods in F#

  5. Claudio Paganini: No Events on Closed Causal Curves

  6. F# Tutorial: Understanding the need for computation expressions

COMMENTS

  1. Data Collection

    Data Collection | Definition, Methods & Examples. Published on June 5, 2020 by Pritha Bhandari.Revised on June 21, 2023. Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.

  2. Google Scholar

    Google Scholar provides a simple way to broadly search for scholarly literature. Search across a wide variety of disciplines and sources: articles, theses, books, abstracts and court opinions.

  3. Harvard Data Science Review

    As an open access platform of the Harvard Data Science Initiative, Harvard Data Science Review (HDSR) features foundational thinking, research milestones, educational innovations, and major applications, with a primary emphasis on reproducibility, replicability, and readability.We aim to publish content that helps define and shape data science as a scientifically rigorous and globally ...

  4. Learning to Do Qualitative Data Analysis: A Starting Point

    For many researchers unfamiliar with qualitative research, determining how to conduct qualitative analyses is often quite challenging. Part of this challenge is due to the seemingly limitless approaches that a qualitative researcher might leverage, as well as simply learning to think like a qualitative researcher when analyzing data. From framework analysis (Ritchie & Spencer, 1994) to content ...

  5. A Practical Guide to Writing Quantitative and Qualitative Research

    A research question is what a study aims to answer after data analysis and interpretation. The answer is written in length in the discussion section of the paper. Thus, the research question gives a preview of the different parts and variables of the study meant to address the problem posed in the research question.1 An excellent research ...

  6. Data Science and Analytics: An Overview from Data-Driven Smart

    This research contributes to the creation of a research vector on the role of data science in central banking. In , the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in provide a thorough understanding of computational optimal transport with application to data science.

  7. Data science: a game changer for science and innovation

    This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We present how data science impacts science and society at large in the coming years, including ethical problems in managing human behavior data and considering the quantitative expectations of data science economic impact. We introduce concepts such as open science and e ...

  8. Defining Research Data

    Defining Research Data. One definition of research data is: "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings." ( OMB Circular 110 ). Research data covers a broad range of types of information (see examples below), and digital data can be structured and stored in a variety of ...

  9. Research Data

    Analysis Methods. Some common research data analysis methods include: Descriptive statistics: Descriptive statistics involve summarizing and describing the main features of a dataset, such as the mean, median, and standard deviation. Descriptive statistics are often used to provide an initial overview of the data.

  10. Research Paper

    A research paper is a piece of academic writing that provides analysis, interpretation, and argument based on in-depth independent research. About us; Disclaimer; ... Actual research papers may have different structures, contents, and formats depending on the field of study, research question, data collection and analysis methods, and other ...

  11. The Ultimate Guide to Writing a Research Paper

    What is a research paper? A research paper is a type of academic writing that provides an in-depth analysis, evaluation, or interpretation of a single topic, based on empirical evidence. Research papers are similar to analytical essays, except that research papers emphasize the use of statistical data and preexisting research, along with a strict code for citations.

  12. (PDF) Data Analytics and Techniques: A Review

    This study provides an in-depth examination of space launch data over the long-time frame from 1957 to 2023. By combining data from a Kaggle dataset with web-scraped data for 2023, the research ...

  13. JSTOR Home

    Harness the power of visual materials—explore more than 3 million images now on JSTOR. Enhance your scholarly research with underground newspapers, magazines, and journals. Explore collections in the arts, sciences, and literature from the world's leading museums, archives, and scholars. JSTOR is a digital library of academic journals ...

  14. Big Data Research

    About the journal. The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in dealing with big data, as ...

  15. Data Collection Methods

    Table of contents. Step 1: Define the aim of your research. Step 2: Choose your data collection method. Step 3: Plan your data collection procedures. Step 4: Collect the data. Frequently asked questions about data collection.

  16. Privacy Protection and Secondary Use of Health Data: Strategies and

    In this research, we focus on the first two categories of data, which are directly related to users' health and privacy. Category 1. Health data generated by healthcare system. This type of data is clinical data and is recorded by clinical professionals or medical equipment when a patient gets healthcare service in a hospital or clinic.

  17. Machine Learning Datasets

    9722 datasets • 125271 papers with code. Browse State-of-the-Art ... Datasets 9,722 machine learning datasets Subscribe to the PwC Newsletter ×. Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. ... and void). The dataset consists of around 5000 fine annotated images and 20000 ...

  18. LibGuides: Research Data Services: Data Papers & Journals

    Data preservation is a corollary of data papers, not their main purpose. Most data journals do not archive data in-house. Instead, they generally require that authors submit the dataset to a repository. These repositories archive the data, provide persistent access, and assign the dataset a unique identifier (DOI).

  19. Writing a Research Paper Introduction

    Table of contents. Step 1: Introduce your topic. Step 2: Describe the background. Step 3: Establish your research problem. Step 4: Specify your objective (s) Step 5: Map out your paper. Research paper introduction examples. Frequently asked questions about the research paper introduction.

  20. Data Collection

    Data collection is the process of gathering and collecting information from various sources to analyze and make informed decisions based on the data collected. This can involve various methods, such as surveys, interviews, experiments, and observation. In order for data collection to be effective, it is important to have a clear understanding ...

  21. data analysis Latest Research Papers

    Data Missing . The Given. Missing data is universal complexity for most part of the research fields which introduces the part of uncertainty into data analysis. We can take place due to many types of motives such as samples mishandling, unable to collect an observation, measurement errors, aberrant value deleted, or merely be short of study.

  22. Revealed: the ten research papers that policy documents cite most

    The top ten most cited papers in policy documents are dominated by economics research. When economics studies are excluded, a 1997 Nature paper 2 about Earth's ecosystem services and natural ...

  23. [2404.07738] ResearchAgent: Iterative Research Idea Generation over

    Scientific Research, vital for improving human life, is hindered by its inherent complexity, slow pace, and the need for specialized experts. To enhance its productivity, we propose a ResearchAgent, a large language model-powered research idea writing agent, which automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature ...

  24. What the data says about abortion in the U.S.

    The CDC data that is highlighted in this post comes from the agency's "abortion surveillance" reports, which have been published annually since 1974 (and which have included data from 1969). Its figures from 1973 through 1996 include data from all 50 states, the District of Columbia and New York City - 52 "reporting areas" in all.

  25. A third of China's urban population at risk of city sinking, new

    The research paper, published in the same issue, considers 82 cities with a collective population of nearly 700 million people. The results show that 45% of the urban areas that were analyzed are ...

  26. Dana-Farber retracts Science paper as part of data integrity review

    A n ongoing investigation into data integrity at Dana-Farber Cancer Institute has resulted in a string of retractions, the latest of which is a 2006 Science paper co-authored by institute ...