Data science: a game changer for science and innovation

  • Regular Paper
  • Open access
  • Published: 19 April 2021
  • Volume 11, pages 263–278 (2021)


  • Valerio Grossi 1 ,
  • Fosca Giannotti 1 ,
  • Dino Pedreschi 2 ,
  • Paolo Manghi 3 ,
  • Pasquale Pagano 3 &
  • Massimiliano Assante 3  


This paper shows data science’s potential for disruptive innovation in science, industry, policy, and people’s lives. We discuss how data science will impact science and society at large in the coming years, including the ethical problems raised by managing data about human behavior, and we consider the expected economic impact of data science. We introduce concepts such as open science and e-infrastructures as useful tools for supporting ethical data science and for training new generations of data scientists. Finally, this work outlines the SoBigData Research Infrastructure, an easy-to-access platform for executing complex data science processes. The services offered by SoBigData are aimed at using data science to understand the complexity of our contemporary, globally interconnected society.


1 Introduction: from data to knowledge

Data science is an interdisciplinary and pervasive paradigm in which different theories and models are combined to transform data into knowledge (and value). Experiments and analyses over massive datasets serve not only to validate existing theories and models but also to enable the data-driven discovery of patterns emerging from the data, which can help scientists design better theories and models and yield a deeper understanding of the complexity of social, economic, biological, technological, cultural, and natural phenomena. The products of data science result from re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. All these aspects are producing a change in the scientific method, in research, and in the way our society makes decisions [ 2 ].

Data science emerges from three concurring factors: (i) the advent of big data, which provides the critical mass of actual examples to learn from; (ii) advances in data analysis and learning techniques that can produce predictive models and behavioral patterns from big data; and (iii) advances in high-performance computing infrastructures that make it possible to ingest and manage big data and perform complex analyses [ 16 ].

Paper organization Section 2 discusses how data science will impact our science and society at large in the coming years. Section 3 outlines the main ethical issues that data science introduces in studying human behavior. In Sect. 4, we show how concepts such as open science and e-infrastructures are effective tools for supporting and disseminating ethical uses of data and for training new generations of data scientists. We illustrate the importance of open data science with examples provided later in the paper. Finally, we present some use cases of data science through thematic environments that bind datasets with social mining methods.

2 Data science for society, science, industry and business

Fig. 1 Data science as an ecosystem: on the left, the main components enabling data science (data, analytical methods, and infrastructures); on the right, the impact of data science on society, science, and business. All data science activities should be carried out under rigid ethical principles

Analyzing data can potentially improve the quality of business decision making, government administration, and scientific research. Data science offers important insights into many complicated issues, in many instances with remarkable accuracy and timeliness.

Fig. 2 The data science pipeline: starting from raw data, the pipeline transforms them into data ready for analytics, then into knowledge through analytical methods, and finally provides results and evaluation measures

As shown in Fig.  1 , data science is an ecosystem where the following scientific, technological, and socioeconomic factors interact:

Data: availability of data and access to data sources;

Analytics & computing infrastructures: availability of high-performance analytical processing and open-source analytics;

Skills: availability of highly and appropriately skilled data scientists and engineers;

Ethical & legal aspects: availability of regulatory environments for data ownership and usage, data protection and privacy, security, liability, cybercrime, and intellectual property rights;

Applications: business- and market-ready applications;

Social aspects: focus on major societal global challenges.

Data science, envisioned as the intersection of data mining, big data analytics, artificial intelligence, statistical modeling, and complex systems, is capable of transparently monitoring data quality and the results of analytical processes. If we want data science to face the global challenges and become a determinant factor of sustainable development, it is necessary to push towards an open global ecosystem for scientific, industrial, and societal innovation [ 48 ]. We need to build an ecosystem of socioeconomic activities where each new idea, product, and service creates opportunities for further purposes and products. An open data strategy, innovation, interoperability, and suitable intellectual property rights can catalyze such an ecosystem and boost economic growth and sustainable development. This strategy also requires “networked thinking” and a participatory, inclusive approach.

Data are relevant in almost all scientific disciplines, and a data-dominated science could lead to the solution of problems currently considered hard or impossible to tackle. It is impossible to cover all the scientific sectors where a data-driven revolution is ongoing; here, we provide just a few examples.

The Sloan Digital Sky Survey Footnote 1 has become a central resource for astronomers all over the world. Astronomy is being transformed from a discipline where taking pictures of the sky was a large part of an astronomer’s job to one where the images are already in a database, and the astronomer’s task is to find interesting objects and phenomena in that database. In the biological sciences, data are stored in public repositories, and an entire discipline, bioinformatics, is devoted to the analysis of such data. Footnote 2 Data-centric approaches based on personal behaviors can also support medical applications that analyze data both at the level of human behavior and at lower, molecular levels; for example, integrating genome data on drug reactions with the habits of users enables a computational drug science for high-precision personalized medicine. In humans, as in other organisms, most cellular components exert their functions through interactions with other cellular components. The totality of these interactions (the human “interactome”) is a network with hundreds of thousands of nodes and a much larger number of links. A disease is rarely a consequence of an abnormality in a single gene; instead, the disease phenotype reflects various pathological processes that interact in a complex network. Network-based approaches can have multiple biological and clinical applications, especially in revealing the mechanisms behind complex diseases [ 6 ].

Now, we illustrate the typical data science pipeline [ 50 ]. People, machines, systems, factories, organizations, communities, and societies produce data. Data are collected in every aspect of our life: when we submit a tax declaration; when a customer orders an item online; when a social media user posts a comment; when an X-ray machine takes a picture; when a traveler posts a restaurant review; when a sensor in a supply chain sends an alert; or when a scientist conducts an experiment. This huge and heterogeneous quantity of data needs to be extracted, loaded, understood, transformed, and, in many cases, anonymized before it can be used for analysis. Analysis results include routines, automated decisions, predictions, and recommendations, and the outcomes need to be interpreted to produce actions and feedback. Furthermore, this scenario must also consider the ethical problems involved in managing social data. Figure 2 depicts the data science pipeline. Footnote 3 Ethical aspects are important in the application of data science in several sectors, and they are addressed in Sect. 3.
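To make the pipeline concrete, the following minimal Python sketch walks through the extract, anonymize, transform, and analyze stages on two toy records; the field names and the salted-hash pseudonymization are illustrative assumptions, and a production pipeline would rely on dedicated ETL and anonymization tooling.

```python
import hashlib

# Hypothetical raw records, e.g., exported from an operational system.
raw_records = [
    {"user_id": "alice", "city": "Pisa", "amount": "12.50"},
    {"user_id": "bob", "city": "Pisa", "amount": "7.00"},
]

def anonymize(record):
    """Replace the direct identifier with a salted hash (pseudonymization)."""
    salt = "project-specific-secret"  # illustrative; keep secret in practice
    digest = hashlib.sha256((salt + record["user_id"]).encode()).hexdigest()
    return {**record, "user_id": digest[:12]}

def transform(record):
    """Cast fields to the types the analysis expects."""
    return {**record, "amount": float(record["amount"])}

# Extract -> anonymize -> transform -> analyze.
clean = [transform(anonymize(r)) for r in raw_records]
total_by_city = {}
for r in clean:
    total_by_city[r["city"]] = total_by_city.get(r["city"], 0.0) + r["amount"]
print(total_by_city)  # e.g., {'Pisa': 19.5}
```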

2.1 Impact on society

Data science is an opportunity for improving our society and boosting social progress. It can support policymaking; it offers novel ways to produce high-quality and high-precision statistical information; and it can empower citizens with self-awareness tools. Furthermore, it can help to promote ethical uses of big data.

Modern cities are perfect environments densely traversed by large data flows. Using traffic monitoring systems, environmental sensors, GPS individual traces, and social information, we can organize cities as a collective sharing of resources that need to be optimized, continuously monitored, and promptly adjusted when needed. The potential of data science here is easy to grasp through terms such as urban planning, public transportation, reduction of energy consumption, ecological sustainability, safety, and management of mass events. These terms represent only the front line of topics that can benefit from the awareness that big data might provide to city stakeholders [ 22 , 27 , 29 ]. Several methods for analyzing and predicting human mobility are available in the literature. MyWay [ 47 ] exploits individual systematic behaviors to predict future human movements by combining individually and collectively learned models. Carpooling [ 22 ] starts from the mobility data of travelers in a given territory and constructs a network of potential carpooling users by exploiting topological properties, highlighting sub-populations with higher chances of creating a carpooling community and the propensity of users to be either drivers or passengers in a shared car. Event attendance prediction [ 13 ] analyzes users’ call habits and classifies people into behavioral categories, dividing them among residents, commuters, and visitors; this makes it possible to observe the variety of behaviors of city users and the attendance at big events in cities.

Electric mobility is expected to gain importance worldwide. The impact of a complete switch to electric mobility is still under investigation, and what appears to be critical is the intensity of flows due to charging (and fast recharging) systems, which may challenge the stability of the power network. To avoid instabilities in the charging infrastructure, an accurate prediction of the power flows associated with mobility is needed. Personal mobility data can be used to estimate mobility flows and to simulate the impact of different charging behavioral patterns, in order to predict power flows and optimize the placement of charging infrastructure [ 25 , 49 ]. Lorini et al. [ 26 ] provide an example of urban flood prediction that integrates data provided by the CEM system Footnote 4 with Twitter data. The Twitter data are processed using massive multilingual approaches for classification. The model is supervised and requires careful data collection and validation of the ground truth about confirmed floods from multiple sources.

Another example of data science for society can be found in the development of applications with functions aimed directly at the individual. In this context, concepts such as personal data stores and personal data analytics aim to implement a new deal on personal data, providing a user-centric view in which data are collected, integrated, and analyzed at the individual level, giving users better awareness of their own behavioral, health, and consumer profiles. Within this user-centric perspective, there is room for an even broader market of business applications, such as high-precision real-time targeted marketing, e.g., self-organizing decision making that preserves desired global properties and the sustainability of the transportation or healthcare system. Such contexts emphasize two essential aspects of data science: the need for creativity in exploiting and combining several data sources in novel ways, and the need to give awareness and control of personal data to the users who generate them, in order to sustain a transparent, trust-based, crowd-sourced data ecosystem [ 19 ].

The impact of online social networks on our society has changed the mechanisms behind information spreading and news production. The transformation of media ecosystems and news consumption has consequences in several fields. A relevant example is the impact of misinformation on society, as in the Brexit referendum, where the massive diffusion of fake news has been considered one of the most relevant factors in the outcome of this political event. Notable results concern the influence of external news media on polarization in online social networks: users are highly polarized towards news sources, i.e., they tend to cite sources that they identify as ideologically similar to themselves. Other results regard echo chambers and the role of social media users: there is a strong correlation between the orientation of the content produced and that of the content consumed. In other words, an opinion “echoes” back to the user when others are sharing it in the “chamber” (i.e., the social network around the user) [ 36 ]. Other results worth mentioning concern efforts to uncover spam and bot activities in stock microblogs on Twitter: taking inspiration from biological DNA, the idea is to model online users’ behavior through strings of characters representing sequences of their online actions. Following this approach, [ 11 , 12 ] report that 71% of suspicious users were classified as bots; furthermore, 37% of them were suspended by Twitter a few months after the investigation. Several approaches can be found in the literature, but they generally display some limitations: some address only certain features of misinformation diffusion (bot detection, segregation of users due to their opinions, or other social analyses), and comprehensive frameworks for interpreting results are lacking. While the former limitation is somehow due to the novelty of the research field and is understandable, the latter reveals a more fundamental need: without strict statistical validation, it is hard to state which crucial elements permit a well-grounded description of a system. To counter the diffusion of fake news, building a comprehensive fake news dataset providing all information about publishers, shared contents, and the engagement of users over space and time, together with their profile histories, can support the development of innovative and effective learning models. Unsupervised and supervised methods will work together to identify misleading information. Multidisciplinary teams made up of journalists, linguists, behavioral scientists, and similar experts will be needed to identify what amounts to information warfare campaigns. Cyberwarfare and information warfare will be two of the biggest threats the world will face in the 21st century.
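As a toy illustration of the digital-DNA idea described above (not the exact method of [ 11 , 12 ]), one can encode each account’s timeline as a string over a small alphabet of actions and flag pairs of accounts whose behavioral sequences share suspiciously long common substrings; the alphabet and threshold below are assumptions.

```python
from difflib import SequenceMatcher

# Illustrative alphabet: A = tweet, C = retweet, T = reply.
ACTION_CODE = {"tweet": "A", "retweet": "C", "reply": "T"}

def dna(actions):
    """Encode a sequence of user actions as a behavioral string."""
    return "".join(ACTION_CODE[a] for a in actions)

users = {
    "u1": dna(["tweet", "retweet", "retweet", "reply", "tweet"]),
    "u2": dna(["tweet", "retweet", "retweet", "reply", "tweet"]),
    "u3": dna(["reply", "tweet", "tweet", "retweet", "reply"]),
}

# Accounts sharing long common behavioral substrings may be automated.
for a in users:
    for b in users:
        if a < b:
            sm = SequenceMatcher(None, users[a], users[b])
            match = sm.find_longest_match(0, len(users[a]), 0, len(users[b]))
            if match.size >= 4:  # illustrative threshold
                print(a, b, "share behavior:", users[a][match.a:match.a + match.size])
```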

Social sensing methods collect data produced by digital citizens through either opportunistic or participatory crowd-sensing, depending on users’ awareness of their involvement. These approaches present a variety of technological and ethical challenges. An example is Twitter Monitor [ 10 ], a crowd-sensing tool designed to access Twitter streams through the Twitter Streaming API. It allows launching parallel listening processes for collecting different sets of data. Twitter Monitor is a tool for creating listening campaigns for relevant events such as political elections, natural and human-made disasters, popular national events, etc. [ 11 ]. A campaign is carried out by specifying keywords, accounts, and geographical areas of interest.
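A keyword-based listening campaign of this kind can be sketched with tweepy’s classic (pre-v4) Stream interface to the Twitter Streaming API; the credentials, keywords, and account IDs below are placeholders, and Twitter’s APIs and access rules have since changed, so this is only an illustration of the campaign mechanics, not Twitter Monitor itself.

```python
import tweepy

# Placeholder credentials: a real campaign needs registered API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class CampaignListener(tweepy.StreamListener):
    """Handle every status matching the campaign definition."""
    def on_status(self, status):
        print(status.id_str, status.text[:80])  # store for later analysis

    def on_error(self, status_code):
        return status_code != 420  # stop on rate-limit disconnects

stream = tweepy.Stream(auth=auth, listener=CampaignListener())
# A campaign is defined by keywords and accounts of interest.
stream.filter(track=["election", "#vote"], follow=["123456"], is_async=True)
```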

Nowcasting Footnote 5 financial and economic indicators highlights the potential of data science as a proxy for well-being and socioeconomic applications. The development of innovative research methods has demonstrated that poverty indicators can be approximated by social and behavioral mobility metrics extracted from mobile phone data and GPS data [ 34 ], and that Gross Domestic Product can be accurately nowcasted using retail supermarket data [ 18 ]. Furthermore, nowcasting demographic aspects of a territory based on Twitter data [ 1 ] can support official statistics through the estimation of location, occupation, and semantics. Networks are a convenient way to represent the complex interactions among the elements of a large system. In economics, networks are gaining increasing attention because the underlying topology of a networked system affects the aggregate output, the propagation of shocks, or financial distress, and because the topology allows us to learn something about a node by looking at the properties of its neighbors. Among the most investigated financial and economic networks, we cite work that analyzes interbank systems, payment networks between firms, bank-firm bipartite networks, and trading networks between investors [ 37 ]. Another interesting phenomenon is the advent of blockchain technology, which has led to innovations such as the Bitcoin cryptocurrency [ 31 ].
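As a stylized illustration of nowcasting (not the actual method of [ 18 ]), one can regress a slowly published indicator on a timelier proxy series and use the latest proxy value to estimate the current figure; the synthetic data below stand in for supermarket sales and GDP growth.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic series: a timely retail-sales proxy and a delayed indicator.
sales = rng.normal(2.0, 0.5, size=40)              # observed promptly
gdp = 0.8 * sales + rng.normal(0.0, 0.1, size=40)  # published with delay

# Fit on the quarters for which the official figure is already out.
model = LinearRegression().fit(sales[:-1].reshape(-1, 1), gdp[:-1])

# The latest sales figure is available before the GDP release.
print("nowcast:", model.predict(sales[-1:].reshape(1, -1))[0])
```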

Data science is an excellent opportunity for policy, data journalism, and marketing. The online media arena is now available as a real-time experimenting society for understanding social mechanisms, like harassment, discrimination, hate, and fake news. In our vision, the use of data science approaches is necessary for better governance. These new approaches complement and change official statistics, representing a cheaper and more timely manner of computing them. The impact of data-science-driven applications can be particularly significant when the applications help to build new infrastructures or new services for the population.

The availability of massive data portraying soccer performance has facilitated recent advances in soccer analytics. Rossi et al. [ 42 ] proposed an innovative machine learning approach to forecasting non-contact injuries for professional soccer players. In [ 3 ], we can find the definition of quantitative measures of pressing in defensive phases in soccer. Pappalardo et al. [ 33 ] outlined the automatic, data-driven evaluation of performance in soccer and a ranking system for soccer teams. Sports data science is attracting much interest and is now leading to the release of large public datasets of sports events.

Finally, data science has unveiled a shift from population statistics to statistics on interlinked entities connected by mutual interactions. This change of perspective reveals universal patterns underlying complex social, economic, technological, and biological systems. It is helpful for understanding the dynamics of how opinions, epidemics, or innovations spread in our society, as well as the mechanisms behind complex systemic diseases, such as cancer and metabolic disorders, revealing hidden relationships between them. Regarding diffusive models and dynamic networks, NDlib [ 40 ] is a Python package for the description, simulation, and observation of diffusion processes in complex networks. It collects diffusive models from epidemics and opinion dynamics and allows a scientist to compare simulations over synthetic systems. For community discovery, two tools are available for studying the structure of a community and understanding its habits: Demon [ 9 ] extracts ego networks (i.e., the sets of nodes connected to an ego node) and identifies the real communities by adopting a democratic, bottom-up merging of such structures; Tiles [ 41 ] is dedicated to dynamic network data and extracts overlapping communities, tracking their evolution in time by following an online iterative procedure.
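For example, NDlib’s documented interface makes a basic epidemic simulation a matter of a few lines; the sketch below assumes the package’s standard SIR model and parameter names, run over an arbitrary synthetic network.

```python
import networkx as nx
import ndlib.models.ModelConfig as mc
import ndlib.models.epidemics as ep

# Synthetic contact network.
g = nx.erdos_renyi_graph(1000, 0.01)

# SIR model with infection (beta) and recovery (gamma) rates.
model = ep.SIRModel(g)
config = mc.Configuration()
config.add_model_parameter("beta", 0.01)
config.add_model_parameter("gamma", 0.005)
config.add_model_parameter("fraction_infected", 0.05)
model.set_initial_status(config)

# Run 200 iterations and inspect the final S/I/R node counts.
iterations = model.iteration_bunch(200)
print(iterations[-1]["node_count"])
```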

2.2 Impact on industry and business

Data science can create an ecosystem of novel data-driven business opportunities. As a general trend across all sectors, massive quantities of data will be made accessible to everybody, allowing entrepreneurs to recognize and rank shortcomings in business processes and to spot potential threats and win-win situations. Ideally, every citizen could derive new business ideas from these patterns. Co-creation enables data scientists to design innovative products and services. By sharing data of various nature and provenance, the value of joined datasets becomes much larger than the sum of the values of the separate datasets.

Gains from data science are expected across all sectors, from industry and production to services and retail. In this context, we cite several macro-areas where data science applications are especially promising. In energy and environment, the digitization of energy systems (from production to distribution) enables the acquisition of real-time, high-resolution data. Coupled with other sources, such as weather data, usage patterns, and market data (accompanied by advanced analytics), efficiency levels can be increased immensely. The positive impact on the environment is also enhanced by geospatial data that help us understand how our planet and its climate are changing and confront major issues such as global warming, preservation of species, and the role and effects of human activities.

The manufacturing and production sector, with growing investments in Industry 4.0 and smart factories whose sensor-equipped machinery is both intelligent and networked (see the Internet of Things and cyber-physical systems), will be one of the major producers of data in the world. The application of data science in this sector will bring efficiency gains and predictive maintenance. Entirely new business models are expected, since the mass production of individualized products becomes possible and consumers may have direct access to influence and control.

As already stated in Sect. 2.1, data science will contribute to increasing efficiency in public administration processes and healthcare. Security will be enhanced in both the physical and the cyber-domain. From financial fraud to public security, data science will contribute to establishing a framework that enables a safe and secure digital economy. Big data exploitation will open up opportunities for innovative, self-organizing ways of managing logistical business processes. Deliveries could be based on predictive monitoring, using data from stores, semantic product memories, internet forums, and weather forecasts, leading to both economic and environmental savings. Let us also consider the impact of personalized services for creating real experiences for tourists. The analysis of real-time and context-aware data (with the help of historical and cultural heritage data) will provide customized information to each tourist and contribute to better, more efficient management of the whole tourism value chain.

3 Data science ethics

Data science creates great opportunities but also new risks. The use of advanced tools for data analysis could expose sensitive knowledge about individual persons and could invade their privacy. Data science approaches require access to digital records of personal activities that contain potentially sensitive information. Personal information can be used to discriminate against people based on their presumed characteristics. Data-driven algorithms yield classification and prediction models of behavioral traits of individuals, such as credit score, insurance risk, health status, personal preferences, and religious, ethnic, or political orientation, based on personal data disseminated in the digital environment by users (with, or often without, their awareness). The achievements of data science result from re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. For example, mobile phone call records are initially collected by telecom operators for billing and operational purposes, but they can be used for accurate and timely demography and human mobility analysis at a country or regional scale. This re-purposing of data clearly shows the importance of legal compliance and of data ethics technologies and safeguards that protect privacy and anonymity, secure data, engage users, avoid discrimination and misuse, and account for transparency, with the purpose of seizing the opportunities of data science while controlling the associated risks.

Several aspects should be considered to avoid harming individual privacy. Ethical elements should include: (i) monitoring the compliance of experiments, research protocols, and applications with ethical and juridical standards; (ii) developing big data analytics and social mining tools with value-sensitive design and privacy-by-design methodologies; (iii) boosting the excellence and international competitiveness of Europe’s big data research in the safe and fair use of big data for research. It is essential to highlight that data scientists who use personal and social data, including through infrastructures, have the responsibility to get acquainted with the fundamental ethical aspects of becoming a “data controller.” This aspect has to be considered in defining courses that inform and train data scientists about the responsibilities, the possibilities, and the boundaries they have in data manipulation.

Recalling Fig. 2, it is crucial to inject into the data science pipeline the ethical values of fairness: how to avoid unfair and discriminatory decisions; accuracy: how to provide reliable information; confidentiality: how to protect the privacy of the people involved; and transparency: how to make models and decisions comprehensible to all stakeholders. This value-sensitive design has to be aimed at boosting widespread social acceptance of data science without inhibiting its power. Finally, it is essential to consider the impact of the General Data Protection Regulation (GDPR) on (i) companies’ duties, and how European companies should comply with the limits on data manipulation that the Regulation requires; and (ii) researchers’ duties, highlighting the articles and recitals that specifically mention and explain how research is intended in the GDPR’s legal system.

Fig. 3 The relationship between big and open data and how they relate to the broad concept of open government

We complete this section with another important aspect, open data: accessible public data that people, companies, and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems. All definitions of open data include two features: (i) the data must be publicly available for anyone to use, and (ii) the data must be licensed in a way that allows their reuse. All over the world, government agencies and public organizations are launching initiatives to make data open; listing them all is impossible, but one UN initiative has to be mentioned: Global Pulse, Footnote 6 meant to implement the vision of a future in which big data is harnessed safely and responsibly as a public good.

Figure 3 shows the relationships between open data and big data. Currently, the problem is not only that government agencies (and some business companies) are collecting personal data about us, but also that we do not know what data are being collected and do not have access to the information about ourselves. As reported by the World Economic Forum in 2013, it is crucial to understand the value of personal data to let users make informed decisions. A new branch of philosophy and ethics is emerging to handle issues related to personal data. On the one hand, in all cases where the data might be used for the social good (e.g., medical research, improvement of public transport, contrasting epidemics), understanding the value of personal data means correctly evaluating the balance between public benefits and personal loss of protection. On the other hand, when data are to be used for commercial purposes, this value might instead translate into a simple price for the personal information that the user might sell to a company for its business. In this context, discrimination discovery consists of searching for a priori unknown contexts of suspected discrimination against protected-by-law social groups by analyzing datasets of historical decision records. Machine learning and data mining approaches may be affected by discriminatory rules, and these rules may be deeply hidden within obscure artificial intelligence models. Thus, discrimination discovery consists of understanding whether a predictive model makes direct or indirect discrimination. DCube [ 43 ] is a tool for data-driven discrimination discovery, a library of methods for fairness analysis.
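A minimal, illustrative check in this spirit (not DCube itself) is the disparate impact ratio: the rate of positive decisions for a protected group divided by the rate for everyone else, with values well below 1 signaling possible indirect discrimination. The toy decision records below are assumptions.

```python
def disparate_impact(decisions, protected):
    """decisions, protected: parallel lists of booleans."""
    pos_prot = [d for d, p in zip(decisions, protected) if p]
    pos_rest = [d for d, p in zip(decisions, protected) if not p]
    rate_prot = sum(pos_prot) / len(pos_prot)  # positive rate, protected group
    rate_rest = sum(pos_rest) / len(pos_rest)  # positive rate, everyone else
    return rate_prot / rate_rest

# Toy historical credit decisions (True = credit granted).
decisions = [True, False, False, True, True, False, True, True]
protected = [True, True, True, False, False, False, False, False]
print(round(disparate_impact(decisions, protected), 2))  # ~0.42 here
```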

It is important to evaluate how a mining model or algorithm takes its decisions. The growing field of explainable machine learning provides and continuously expands a set of comprehensive toolkits [ 21 ]. For example, X-Lib is a library containing state-of-the-art explanation methods organized within a hierarchical structure and wrapped in a uniform fashion so that they can be easily accessed and used by different users. The library provides support for explaining classification on tabular data and images and for explaining the logic of complex decision systems. X-Lib collects, among others, the following explanation methods: LIME [ 38 ], Anchor [ 39 ], and DeepExplain, which includes saliency maps [ 44 ], Gradient * Input, Integrated Gradients, and DeepLIFT [ 46 ]. The saliency library contains code for SmoothGrad [ 45 ], as well as implementations of several other saliency techniques: Vanilla Gradients, Guided Backpropagation, and Grad-CAM. Another improvement in this context is the use of robotics and AI in data preparation and curation, and in detecting bias in data, information, and knowledge, as well as in the misuse and abuse of these assets when it comes to legal, privacy, and ethical issues, and to transparency and trust. We cannot rely on human beings alone to do these tasks; we need to exploit the power of robotics and AI to help provide the protections required. Data and information lawyers will play a key role in legal and privacy issues, in the ethical use of these assets, and in the problem of bias in both algorithms and the data, information, and knowledge used to develop analytics solutions. Finally, we can state that data science can help to fill the gap between legislators and technology.
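X-Lib itself is not shown here, but as a flavor of what such toolkits wrap, the sketch below calls the LIME package [ 38 ] directly to explain a single prediction of a black-box classifier on tabular data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_iris()
black_box = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    discretize_continuous=True,
)

# Local explanation: which features drive the prediction for this instance?
explanation = explainer.explain_instance(
    data.data[0], black_box.predict_proba, num_features=4
)
print(explanation.as_list())  # (feature condition, weight) pairs
```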

4 Big data ecosystem: the role of research infrastructures

Research infrastructures (RIs) play a crucial role in the advent and development of data science. A social mining experiment exploits the main components of data science depicted in Fig. 1 (i.e., data, infrastructures, analytical methods) to enable multidisciplinary scientists and innovators to extract knowledge and to make the experiment reusable by the scientific community and by innovators, providing an impact on science and society.

Resources such as data and methods help domain and data scientists transform a research or innovation question into a responsible, data-driven analytical process. This process is executed on the platform, thus supporting experiments that yield scientific output, policy recommendations, or innovative proofs-of-concept. Furthermore, the stewardship of an operational ethical board is a critical success factor for a RI.

An infrastructure typically offers easy-to-use means to define complex analytical processes and workflows, thus bridging the gap between domain experts and analytical technology. In many instances, domain experts may become a reference point for their scientific communities, thus facilitating the engagement of new users in the RI activities. As a collateral feedback effect, experiments generate new relevant data, methods, and workflows that can be integrated into the platform by data scientists, contributing to the expansion of the RI’s resources. An experiment designed in a node of the RI and executed on the platform returns its results to the entire RI community.

Well-defined thematic environments amplify the achievements of new experiments towards vertical scientific communities (and potential stakeholders) by activating appropriate dissemination channels.

4.1 The SoBigData Research Infrastructure

The SoBigData Research Infrastructure Footnote 7 is an ecosystem of human and digital resources, comprising data scientists, analytics, and processes. As shown in Fig. 4, SoBigData is designed to enable multidisciplinary scientists and innovators to carry out social mining experiments and to make them reusable by the scientific communities. All the components have been introduced to implement data science from raw data management to knowledge extraction, with particular attention to legal and ethical aspects, as reported in Fig. 1. SoBigData supports data science serving a cross-disciplinary community of data scientists who study all the elements of societal complexity from a data- and model-driven perspective.

Currently, SoBigData includes scientific, industrial, and other stakeholders. In particular, our stakeholders are data analysts and researchers (35.6%), followed by companies (33.3%) and policy and lawmakers (20%). The following sections provide a short but comprehensive overview of the services provided by the SoBigData RI, with special attention to supporting ethical and open data science [ 15 , 16 ].

4.1.1 Resources, facilities, and access opportunities

Over the past decade, Europe has developed world-leading expertise in building and operating e-infrastructures: large-scale, federated, and distributed online research environments through which researchers can share access to scientific resources (including data, instruments, computing, and communications), regardless of their location. They are meant to support unprecedented scales of international collaboration in science, both within and across disciplines, investing in economies of scale and common behavior, policies, best practices, and standards. They shape a common environment where scientists can create, validate, assess, compare, and share their digital results of science, such as research data and research methods, by using a common “digital laboratory” consisting of agreed-on services and tools.

Fig. 4 The SoBigData Research Infrastructure: an ecosystem of human and digital resources, comprising data scientists, analytical methods, and processes. SoBigData enables multidisciplinary scientists and innovators to carry out experiments and to make them reusable by the community

However, the implementation of workflows, possibly following the Open Science principles of reproducibility and transparency, is hindered by a multitude of real-world problems. One of the most prominent is that the e-infrastructures available to research communities today are far from being well-designed and consistent digital laboratories, neatly designed to share and reuse resources according to common policies, data models, standards, language platforms, and APIs. They are instead “patchworks of systems” that assemble online tools, services, and data sources and evolve to match the requirements of the scientific process and to include new solutions. This degree of heterogeneity excludes the adoption of uniform workflow management systems, standard service-oriented approaches, and routine monitoring and accounting methods. Scientific workflows are typically realized by writing ad hoc code, manipulating data on desktops, alternating the execution of online web services, and sharing software libraries that implement research methods in different languages, desktop tools, and web-accessible execution engines (e.g., Taverna, Knime, Galaxy).

The SoBigData e-infrastructure is based on D4Science services, which provide researchers and practitioners with a working environment where open science practices are transparently promoted and data science practices can be implemented while minimizing the technological integration costs highlighted above.

D4Science is a deployed instance of the gCube Footnote 8 technology [ 4 ], a software framework conceived to facilitate the integration of web services, code, and applications as resources of different types in a common framework, which in turn enables the construction of Virtual Research Environments (VREs) [ 7 ] as combinations of such resources (Fig. 5). As there is no common framework trusted and sustained enough to convince resource providers that converging to it would be a worthwhile effort, D4Science implements a “system of systems.” In such a framework, resources are integrated at minimal cost in order to gain in scalability, performance, accounting, provenance tracking, seamless integration with other resources, and visibility to all scientists. The principle is that the cost of “participation” in the framework falls on the infrastructure rather than on resource providers. The infrastructure provides the necessary bridges to include and combine resources that would otherwise be incompatible.

Fig. 5 D4Science: resources from external systems, virtual research environments, and communities

More specifically, via D4Science, SoBigData scientists can integrate and share resources such as datasets, research methods, web services (via APIs), and web applications (via Portlets). Resources can then be integrated, combined, and accessed via VREs, intended as web-based working environments tailored to support the needs of their designated communities, each working on a research question. Research methods are integrated as executable code implementing WPS APIs in different programming languages (e.g., Java, Python, R, Knime, Galaxy), which can be executed via the Data Miner analytics platform in parallel, transparently to the users, over powerful and extensible clusters, and via simple VRE user interfaces. Scientists using Data Miner in the context of a VRE can select and execute the available methods and share the results with other scientists, who can repeat or reproduce the experiment with a simple click.
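Because methods are exposed through standard WPS APIs, they can in principle be invoked with any standards-compliant client; the sketch below uses the OWSLib Python client against a hypothetical endpoint, and the URL, process identifier, and input name are placeholders rather than actual SoBigData resources.

```python
from owslib.wps import WebProcessingService, monitorExecution

# Hypothetical WPS endpoint; real use requires a valid URL and credentials.
wps = WebProcessingService("https://example.org/wps/WebProcessingService")

# Discover the research methods (processes) the service exposes.
for process in wps.processes:
    print(process.identifier, "-", process.title)

# Execute a hypothetical method with one literal input.
execution = wps.execute("org.example.od_matrix", inputs=[("dataset_id", "demo")])
monitorExecution(execution)  # poll until the remote job completes
print(execution.status)
```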

D4Science VREs are equipped with core services supporting data analysis and collaboration among their users: (i) a shared workspace to store and organize any version of a research artifact; (ii) a social networking area to hold discussions on any topic (including working versions and released artifacts) and be informed of happenings; (iii) a Data Miner analytics platform to execute processing tasks (research methods), either natively provided by VRE users or borrowed from other VREs, to be applied to VRE users’ cases and datasets; and (iv) a catalogue-based publishing platform to make the existence of an artifact public and disseminated. Scientists operating within VREs use such facilities continuously, transparently tracking the record of their research activities (actions, authorship, provenance), as well as the products and the links between them (lineage) resulting from every phase of the research life cycle, thus facilitating the publishing of science according to the Open Science principles of transparency and reproducibility [ 5 ].

Today, SoBigData integrates the resources in Table 1. By means of such resources, SoBigData scientists have created VREs to deliver the so-called SoBigData exploratories: Explainable Machine Learning, Sports Data Science, Migration Studies, Societal Debates, Well-being & Economy, and City of Citizens. Each exploratory includes the resources required to perform data science workflows in a controlled and shared environment. Resources range from data to methods, described in more detail in the following, together with their exploitation within the exploratories.

All the resources and instruments integrated into the SoBigData RI are structured so as to operate within the confines of current data protection law, with a focus on the General Data Protection Regulation (GDPR), and of an ethical analysis of the fundamental values involved in social mining and AI. Each item in the catalogue has specific fields for managing ethical issues (e.g., whether a dataset contains personal information) and fields for describing and managing intellectual property.

4.1.2 Data resources: social mining and big data ecosystem

The SoBigData RI defines policies supporting users in the collection, description, preservation, and sharing of their datasets. It implements data science by making such data available for collaborative research, adopting various strategies that range from sharing open datasets with the scientific community at large to sharing data with disclosure restrictions that allow access only within secure environments.

Several big datasets are available through the SoBigData RI, including network graphs from mobile phone call data; networks crawled from many online social networks, including Facebook and Flickr; transaction micro-data from diverse retailers; query logs from both search engines and e-commerce; society-wide mobile phone call data records; GPS tracks from personal navigation devices; survey data about customer satisfaction or market research; extensive web archives; billions of tweets; and data from location-aware social networks.

4.1.3 Data science through SoBigData exploratories

Exploratories are thematic environments built on top of the SoBigData RI. An exploratory binds datasets with social mining methods, providing the research context for supporting specific data science applications by: (i) providing the scientific context for performing the application, which can be considered a container binding specific methods, applications, services, and datasets; (ii) stimulating communities on the effectiveness of the analytical process, promoting scientific dissemination, result sharing, and reproducibility. The use of exploratories promotes the effectiveness of data science through research infrastructure services. The following sections give a short description of the six SoBigData exploratories. Figure 6 shows the main thematic areas covered by each exploratory. Due to its nature, the Explainable Machine Learning exploratory can be applied to every sector where a black-box machine learning approach is used. The list of exploratories (and the data and methods inside them) is updated continuously and continues to grow over time. Footnote 9

Fig. 6 SoBigData covers six thematic areas, listed horizontally; each exploratory covers more than one thematic area

City of citizens. This exploratory collects data science applications and methods related to geo-referenced data describing the movements of citizens in a city, a territory, or an entire region. The scientific literature contains several studies and methods that employ a wide variety of data sources to build models of people’s mobility and city characteristics [ 30 , 32 ]. Like ecosystems, cities are open systems that live and develop by means of flows of energy, matter, and information. What distinguishes a city from a colony is the human component (i.e., the process of transformation by cultural and technological evolution). Through this combination, cities are evolutionary systems that develop and co-evolve continuously with their inhabitants [ 24 ]. Cities are kaleidoscopes of information generated by a myriad of digital devices woven into the urban fabric. The inclusion of tracking technologies in personal devices has enabled the analysis of large sets of mobility data, like GPS traces and call detail records.

Data science applied to human mobility is one of the critical topics investigated in SoBigData, thanks to the partners’ decade-long experience in European projects. The study of human mobility has led to the integration into SoBigData of unique Global Positioning System (GPS) and call detail record (CDR) datasets of people and vehicle movements, and of geo-referenced social network data, as well as several mobility services: O/D (origin-destination) matrix computation, Urban Mobility Atlas Footnote 10 (a visual interface to city mobility patterns), GeoTopics Footnote 11 (for exploring patterns of urban activity from Foursquare), and predictive models such as MyWay Footnote 12 (trajectory prediction) and TripBuilder Footnote 13 (supporting tourists in building personalized tours of a city). In human mobility, research questions come from geographers, urbanists, complexity scientists, data scientists, policymakers, and big data providers, as well as from innovators aiming to provide applications for any service in the smart city ecosystem. A further idea is to investigate the impact of political events on the well-being of citizens. This exploratory supports the development of “happiness” and “peace” indicators through a text mining/opinion mining pipeline applied to repositories of online news. These indicators reveal that the level of crime of a territory can be well approximated by analyzing the news related to that territory. More generally, we study the impact of the economy on well-being and vice versa, considering, for example, that the propagation of shocks of financial distress in an economic or financial system crucially depends on the topology of the network interconnecting the different elements.
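As an example of the simplest of these services, the core of an O/D matrix computation is a plain aggregation over trip records; the pandas sketch below uses hypothetical zone labels.

```python
import pandas as pd

# Hypothetical trip records: one row per trip with origin/destination zones.
trips = pd.DataFrame({
    "origin_zone": ["A", "A", "B", "C", "B"],
    "destination_zone": ["B", "C", "A", "A", "B"],
})

# O/D matrix: trip counts between each pair of zones.
od_matrix = pd.crosstab(trips["origin_zone"], trips["destination_zone"])
print(od_matrix)
```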

Well-being and economy. This exploratory tests the hypothesis that well-being is correlated with the business performance of companies. The idea is to combine statistical methods and traditional economic data (typically low-frequency) with high-frequency data from non-traditional sources, such as the web and supermarkets, for nowcasting economic, socioeconomic, and well-being indicators. These indicators allow us to study and measure real-life costs by studying price variation and socioeconomic status inference. Furthermore, this activity supports studies on the correlation between people’s well-being and their social and mobility data. In this context, some basic hypotheses can be summarized as follows: (i) there are curves of age- and gender-based segregation distribution in the boards of companies that are characteristic of the mean credit risk of companies in a region; (ii) a low mean credit risk of companies in a region correlates positively with well-being; (iii) systemic risk correlates highly with well-being indices at a national level. The final aim is to provide national governments with a set of guidelines, methods, and indices for decision making on regulations affecting companies to improve well-being in the country, also considering effective policies to reduce operational risks such as credit risk and external threats to companies [ 17 ].

Big data, analyzed through the lenses of data science, provides means to understand our complex socioeconomic and financial systems. On the one hand, this offers new opportunities to measure the patterns of well-being and poverty at local and global scales, empowering governments and policymakers with the unprecedented opportunity to nowcast relevant economic quantities and compare different countries, regions, and cities. On the other hand, it allows us to investigate the networks underlying the complex systems of economy and finance and how they affect aggregate output, the propagation of shocks or financial distress, and systemic risk.

Societal debates. This exploratory employs data science approaches to answer research questions such as: who participates in public debates? What is the “big picture” response from citizens to a policy, election, referendum, or other political event? This kind of analysis allows scientists, policymakers, and citizens to understand the online discussion surrounding polarized debates [ 14 ]. The personal perception of online discussions on social media is often biased by the so-called filter bubble, in which the automatic curation of content and of relationships between users negatively affects the diversity of opinions available to them. A complete analysis of online polarized debates enables citizens to be better informed and prepared for political outcomes. By analyzing content and conversations on social media and in newspaper articles, data scientists study public debates and also assess public sentiment around debated topics, opinion diffusion dynamics, echo chamber formation, polarized discussions, fake news, and propaganda bots. Misinformation is often the result of a distorted perception of concepts that, although unrelated, suddenly appear together in the same narrative. Understanding the details of this process at an early stage may help to prevent the birth and diffusion of fake news. The fight against misinformation includes the development of dynamical models of misinformation diffusion (possibly in contrast to the spread of mainstream news) as well as models of how attention cycles are accelerated and amplified by the infrastructures of online media.

Another important topic covered by this exploratory concerns the analysis of how social bot activity affects fake news diffusion. Determining whether a user account is controlled by a human or a bot is a complex task. To the best of our knowledge, the only openly accessible solution for detecting social bots is Botometer, an API that allows us to interact with an underlying machine learning system. Although Botometer has proven to be highly accurate in detecting social bots, it has limitations due to the features of the Twitter API; hence, an algorithm overcoming the barriers of current recipes is needed.
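Botometer is accessible programmatically through its companion Python package; the sketch below follows the package’s documented usage, with placeholder credentials (a RapidAPI key and Twitter application keys are required).

```python
import botometer

rapidapi_key = "RAPIDAPI_KEY"  # placeholder
twitter_app_auth = {
    "consumer_key": "CONSUMER_KEY",        # placeholder
    "consumer_secret": "CONSUMER_SECRET",  # placeholder
}

bom = botometer.Botometer(
    wait_on_ratelimit=True,
    rapidapi_key=rapidapi_key,
    **twitter_app_auth,
)

# Scores close to the top of the scale suggest automated behavior.
result = bom.check_account("@example_account")
print(result["display_scores"])
```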

The resources related to the Societal Debates exploratory, especially in the domain of media ecology and the fight against misinformation online, provide easy-to-use services to public bodies, media outlets, and social/political scientists. Furthermore, SoBigData supports new simulation models and experimental processes to validate in vivo the algorithms for fighting misinformation, curbing the pathological acceleration and amplification of online attention cycles, breaking the bubbles, and exploring alternative media and information ecosystems.

Migration studies. Data science is also useful to understand the migration phenomenon. Knowledge about the number of immigrants living in a particular region is crucial to devise policies that maximize the benefits for both locals and immigrants. These numbers can vary rapidly in space and time, especially in periods of crisis such as wars or natural disasters.

This exploratory provides a set of data and tools for trying to answer questions about migration flows. Through this exploratory, a data scientist can study economic models of migration and observe how migrants choose their destination countries. A scientist can discover the meaning of the “opportunities” that a country provides to migrants, and whether there are correlations between the number of incoming migrants and the opportunities in the host countries [ 8 ]. Furthermore, this exploratory tries to understand how the public perception of migration is changing, using opinion mining analysis. For example, social network analysis enables us to analyze migrants’ social networks and discover the structure of the social network for people who have decided to start a new life in a different country [ 28 ].

Finally, we can also evaluate current integration indices based on official statistics and survey data, which can be complemented by big data sources. This exploratory aims to build combined integration indexes that take multiple data sources into account to evaluate integration on various levels. Such integration includes mobile phone data to understand patterns of communication between immigrants and natives; social network data to assess sentiment towards immigrants and immigration; professional network data (such as LinkedIn) to understand labor market integration; and local data to understand to what extent moving across borders is associated with a change in the cultural norms of the migrants. These indexes are fundamental to evaluating the overall social and economic effects of immigration. The new integration indexes can be applied at various space and time resolutions (small area methods) to obtain a complete image of integration and to complement official indexes.

Sports data science. The proliferation of new sensing technologies that provide high-fidelity data streams extracted from every game is changing the way scientists, fans, and practitioners conceive of sports performance. The combination of these (big) data with the tools of data science makes it possible to unveil the complex models underlying sports performance and enables many challenging tasks: from automatic tactical analysis to data-driven performance ranking, game outcome prediction, and injury forecasting. The idea is to foster research on sports data science in several directions. The application of explainable AI and deep learning techniques can be hugely beneficial to sports data science. For example, by using adversarial learning, we can modify the training plans of players associated with high injury risk and develop training plans that maximize players’ fitness (minimizing their injury risk). Gaming, simulation, and modeling form another set of tools that coaching staff can use to test tactics to employ against a competitor. Furthermore, by using deep learning on time series, we can forecast the evolution of players’ performance and search for young talents.

This exploratory examines the factors influencing sports success and how to build simulation tools for boosting both individual and collective performance. Furthermore, it describes performances by means of data, statistics, and models, allowing coaches, fans, and practitioners to understand (and boost) sports performance [ 42 ].
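In the spirit of the injury-forecasting work [ 42 ], though not its exact features or model, the task can be framed as supervised classification over training-workload features; the sketch below uses synthetic player-week data and an off-the-shelf classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic workload features per player-week:
# [total distance (km), high-speed running (km), accelerations (count)]
X = rng.normal(loc=[30.0, 2.0, 400.0], scale=[5.0, 0.5, 60.0], size=(500, 3))

# Synthetic label: injuries more likely at high workloads (toy rule).
risk = 0.02 * (X[:, 0] - 30) + 0.5 * (X[:, 1] - 2) + 0.002 * (X[:, 2] - 400)
y = (risk + rng.normal(0, 0.3, 500) > 0.4).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("injury-risk accuracy:", round(clf.score(X_te, y_te), 2))
```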

Explainable machine learning. Artificial intelligence, increasingly based on big data analytics, is a disruptive technology of our times. This exploratory provides a forum for studying the effects of AI on the future society. In this context, SoBigData studies the future of labor and the workforce, also through data- and model-driven analysis, simulations, and the development of methods that construct human-understandable explanations of AI black-box models [ 20 ].

Black-box systems for automated decision making map a user’s features into a class that predicts the behavioral traits of individuals, such as credit risk or health status, without exposing the reasons why. Most of the time, the internal reasoning of these algorithms is obscure even to their developers. For this reason, the last decade has witnessed the rise of a black-box society. This exploratory is developing a set of techniques and tools that allow data analysts to understand why an algorithm produces a given decision. These approaches are designed not only to counter the lack of transparency but also to discover possible biases inherited by algorithms from human prejudices and artifacts hidden in the training data (which may lead to unfair or wrong decisions) [ 35 ].

5 Conclusions: individual and collective intelligence

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [ 23 ]. Since 2012, 2.5 exabytes (2.5 × 10^18 bytes) of data have been created every day; as of 2014, 2.3 zettabytes (2.3 × 10^21 bytes) of data were generated every day by large high-tech corporations worldwide. Soon zettabytes of useful public and private data will be widely and openly available. In the next years, smart applications such as smart grids, smart logistics, smart factories, and smart cities will be widely deployed across the continent and beyond. Ubiquitous broadband access, mobile technology, social media, services, and the Internet of Things on billions of devices will have contributed to the explosion of generated data to a total global estimate of 40 zettabytes.

In this work, we have introduced data science as a challenge and an opportunity for the coming years. In this context, we have concisely summarized several aspects of data science applications and their impact on society, considering both the new services available and the new job perspectives. We have also introduced the issues involved in managing data that represent human behavior and have shown how difficult it is to preserve personal information and privacy. With the introduction of the SoBigData RI and its exploratories, we have provided virtual environments where it is possible to understand the potential of data science in different research contexts.

Concluding, we can state that social dilemmas occur when there is a conflict between individual and public interest. Such problems also appear in the ecosystem of distributed AI systems (based on data science tools) and humans, with additional difficulties due, on the one hand, to the relative rigidity of trained AI systems and the necessity of achieving social benefit and, on the other hand, to the necessity of keeping individuals interested. What are the principles and solutions for individual versus social optimization using AI, and how can an optimal balance be achieved? The answer is still open, but these complex systems have to work toward fulfilling collective goals and requirements, with the challenge that human needs change over time and move from one context to another. Every AI system should operate within an ethical and social framework in an understandable, verifiable, and justifiable way. Such systems must, in any case, work within the bounds of the rule of law, incorporating the protection of fundamental rights into the AI infrastructure. In other words, the challenge is to develop mechanisms that lead the system to converge to an equilibrium that complies with European values and social objectives (e.g., social inclusion) but without unnecessary losses of efficiency.

Interestingly, data science can play a vital role in enhancing desirable behaviors in the system, e.g., by supporting coordination and cooperation, which is, more often than not, crucial to achieving any meaningful improvement. Our ultimate goal is to build the blueprint of a sociotechnical system in which AI not only cooperates with humans but, if necessary, helps them learn how to collaborate, as well as other desirable behaviors. In this context, it is also essential to understand how to achieve robustness of human and AI ecosystems with respect to various types of malicious behavior, such as abuse of power and exploitation of AI technical weaknesses.

We conclude by paraphrasing Stephen Hawking in his Brief Answers to the Big Questions: the availability of data on its own will not take humanity to the future, but its intelligent and creative use will.

http://www.sdss3.org/collaboration/ .

e.g., https://www.nature.com/sdata/policies/repositories .

Responsible Data Science program: https://redasci.org/ .

https://emergency.copernicus.eu/ .

Nowcasting in economics is the prediction of the present, the very near future, and the very recent past state of an economic indicator.

https://www.unglobalpulse.org/ .

http://sobigdata.eu .

https://www.gcube-system.org/ .

https://sobigdata.d4science.org/catalogue-sobigdata .

http://www.sobigdata.eu/content/urban-mobility-atlas .

http://data.d4science.org/ctlg/ResourceCatalogue/geotopics_-_a_method_and_system_to_explore_urban_activity .

http://data.d4science.org/ctlg/ResourceCatalogue/myway_-_trajectory_prediction .

http://data.d4science.org/ctlg/ResourceCatalogue/tripbuilder .

Abitbol, J.L., Fleury, E., Karsai, M.: Optimal proxy selection for socioeconomic status inference on twitter. Complexity 2019, 6059673:1–6059673:15 (2019). https://doi.org/10.1155/2019/6059673


Amato, G., Candela, L., Castelli, D., Esuli, A., Falchi, F., Gennaro, C., Giannotti, F., Monreale, A., Nanni, M., Pagano, P., Pappalardo, L., Pedreschi, D., Pratesi, F., Rabitti, F., Rinzivillo, S., Rossetti, G., Ruggieri, S., Sebastiani, F., Tesconi, M.: How data mining and machine learning evolved from relational data base to data science. In: Flesca, S., Greco, S., Masciari, E., Saccà, D. (eds.) A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, Studies in Big Data, vol. 31, pp. 287–306. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-61893-7_17


Andrienko, G.L., Andrienko, N.V., Budziak, G., Dykes, J., Fuchs, G., von Landesberger, T., Weber, H.: Visual analysis of pressure in football. Data Min. Knowl. Discov. 31 (6), 1793–1839 (2017). https://doi.org/10.1007/s10618-017-0513-2


Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Marioli, V., Pagano, P., Panichi, G., Perciante, C., Sinibaldi, F.: The gcube system: delivering virtual research environments as-a-service. Future Gener. Comput. Syst. 95 , 445–453 (2019). https://doi.org/10.1016/j.future.2018.10.035

Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Pagano, P., Panichi, G., Sinibaldi, F.: Enacting open science by d4science. Future Gener. Comput. Syst. (2019). https://doi.org/10.1016/j.future.2019.05.063

Barabasi, A.L., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nature reviews. Genetics 12 , 56–68 (2011). https://doi.org/10.1038/nrg2918

Candela, L., Castelli, D., Pagano, P.: Virtual research environments: an overview and a research agenda. Data Sci. J. 12 , GRDI75–GRDI81 (2013). https://doi.org/10.2481/dsj.GRDI-013

Coletto, M., Esuli, A., Lucchese, C., Muntean, C.I., Nardini, F.M., Perego, R., Renso, C.: Sentiment-enhanced multidimensional analysis of online social networks: perception of the mediterranean refugees crisis. In: Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM’16, pp. 1270–1277. IEEE Press, Piscataway, NJ, USA (2016). http://dl.acm.org/citation.cfm?id=3192424.3192657

Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Uncovering hierarchical and overlapping communities with a local-first approach. TKDD 9 (1), 6:1–6:27 (2014). https://doi.org/10.1145/2629511

Cresci, S., Minutoli, S., Nizzoli, L., Tardelli, S., Tesconi, M.: Enriching digital libraries with crowdsensed data. In: P. Manghi, L. Candela, G. Silvello (eds.) Digital Libraries: Supporting Open Science—15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, 31 Jan–1 Feb 2019, Proceedings, Communications in Computer and Information Science, vol. 988, pp. 144–158. Springer (2019). https://doi.org/10.1007/978-3-030-11226-4_12

Cresci, S., Petrocchi, M., Spognardi, A., Tognazzi, S.: Better safe than sorry: an adversarial approach to improve social bot detection. In: P. Boldi, B.F. Welles, K. Kinder-Kurlanda, C. Wilson, I. Peters, W.M. Jr. (eds.) Proceedings of the 11th ACM Conference on Web Science, WebSci 2019, Boston, MA, USA, June 30–July 03, 2019, pp. 47–56. ACM (2019). https://doi.org/10.1145/3292522.3326030

Cresci, S., Pietro, R.D., Petrocchi, M., Spognardi, A., Tesconi, M.: Social fingerprinting: detection of spambot groups through dna-inspired behavioral modeling. IEEE Trans. Dependable Sec. Comput. 15 (4), 561–576 (2018). https://doi.org/10.1109/TDSC.2017.2681672

Furletti, B., Trasarti, R., Cintia, P., Gabrielli, L.: Discovering and understanding city events with big data: the case of rome. Information 8 (3), 74 (2017). https://doi.org/10.3390/info8030074

Garimella, K., De Francisci Morales, G., Gionis, A., Mathioudakis, M.: Reducing controversy by connecting opposing views. In: Proceedings of the 10th ACM International Conference on Web Search and Data Mining, WSDM’17, pp. 81–90. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018703

Giannotti, F., Trasarti, R., Bontcheva, K., Grossi, V.: Sobigdata: social mining & big data ecosystem. In: P. Champin, F.L. Gandon, M. Lalmas, P.G. Ipeirotis (eds.) Companion Proceedings of The Web Conference 2018, WWW 2018, Lyon, France, April 23–27, 2018, pp. 437–438. ACM (2018). https://doi.org/10.1145/3184558.3186205

Grossi, V., Rapisarda, B., Giannotti, F., Pedreschi, D.: Data science at sobigdata: the european research infrastructure for social mining and big data analytics. I. J. Data Sci. Anal. 6 (3), 205–216 (2018). https://doi.org/10.1007/s41060-018-0126-x

Grossi, V., Romei, A., Ruggieri, S.: A case study in sequential pattern mining for it-operational risk. In: W. Daelemans, B. Goethals, K. Morik (eds.) Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Antwerp, Belgium, 15–19 Sept 2008, Proceedings, Part I, Lecture Notes in Computer Science, vol. 5211, pp. 424–439. Springer (2008). https://doi.org/10.1007/978-3-540-87479-9_46

Guidotti, R., Coscia, M., Pedreschi, D., Pennacchioli, D.: Going beyond GDP to nowcast well-being using retail market data. In: A. Wierzbicki, U. Brandes, F. Schweitzer, D. Pedreschi (eds.) Advances in Network Science—12th International Conference and School, NetSci-X 2016, Wroclaw, Poland, 11–13 Jan 2016, Proceedings, Lecture Notes in Computer Science, vol. 9564, pp. 29–42. Springer (2016). https://doi.org/10.1007/978-3-319-28361-6_3

Guidotti, R., Monreale, A., Nanni, M., Giannotti, F., Pedreschi, D.: Clustering individual transactional data for masses of users. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 Aug 2017, pp. 195–204. ACM (2017). https://doi.org/10.1145/3097983.3098034

Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. 51 (5), 93:1–93:42 (2019). https://doi.org/10.1145/3236009

Guidotti, R., Monreale, A., Turini, F., Pedreschi, D., Giannotti, F.: A survey of methods for explaining black box models. CoRR abs/1802.01933 (2018). arxiv: 1802.01933

Guidotti, R., Nanni, M., Rinzivillo, S., Pedreschi, D., Giannotti, F.: Never drive alone: boosting carpooling with network analysis. Inf. Syst. 64 , 237–257 (2017). https://doi.org/10.1016/j.is.2016.03.006

Hilbert, M., Lopez, P.: The world’s technological capacity to store, communicate, and compute information. Science 332 (6025), 60–65 (2011)

Kennedy, C.A., Stewart, I., Facchini, A., Cersosimo, I., Mele, R., Chen, B., Uda, M., Kansal, A., Chiu, A., Kim, K.-g., Dubeux, C., Lebre La Rovere, E., Cunha, B., Pincetl, S., Keirstead, J., Barles, S., Pusaka, S., Gunawan, J., Adegbile, M., Nazariha, M., Hoque, S., Marcotullio, P.J., González Otharán, F., Genena, T., Ibrahim, N., Farooqui, R., Cervantes, G., Sahin, A.D.: Energy and material flows of megacities. Proc. Nat. Acad. Sci. 112 (19), 5985–5990 (2015). https://doi.org/10.1073/pnas.1504315112

Korjani, S., Damiano, A., Mureddu, M., Facchini, A., Caldarelli, G.: Optimal positioning of storage systems in microgrids based on complex networks centrality measures. Sci. Rep. (2018). https://doi.org/10.1038/s41598-018-35128-6

Lorini, V., Castillo, C., Dottori, F., Kalas, M., Nappo, D., Salamon, P.: Integrating social media into a pan-european flood awareness system: a multilingual approach. In: Z. Franco, J.J. González, J.H. Canós (eds.) Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management, València, Spain, 19–22 May 2019. ISCRAM Association (2019). http://idl.iscram.org/files/valeriolorini/2019/1854-_ValerioLorini_etal2019.pdf

Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., Ricci, L.: Scalable and flexible clustering solutions for mobile phone-based population indicators. Int. J. Data Sci. Anal. 4 (4), 285–299 (2017). https://doi.org/10.1007/s41060-017-0065-y

Moise, I., Gaere, E., Merz, R., Koch, S., Pournaras, E.: Tracking language mobility in the twitter landscape. In: C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R.A. Baeza-Yates, Z. Zhou, X. Wu (eds.) IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, 12–15 Dec 2016, Barcelona, Spain., pp. 663–670. IEEE Computer Society (2016). https://doi.org/10.1109/ICDMW.2016.0099

Nanni, M.: Advancements in mobility data analysis. In: F. Leuzzi, S. Ferilli (eds.) Traffic Mining Applied to Police Activities—Proceedings of the 1st Italian Conference for the Traffic Police (TRAP-2017), Rome, Italy, 25–26 Oct 2017, Advances in Intelligent Systems and Computing, vol. 728, pp. 11–16. Springer (2017). https://doi.org/10.1007/978-3-319-75608-0_2

Nanni, M., Trasarti, R., Monreale, A., Grossi, V., Pedreschi, D.: Driving profiles computation and monitoring for car insurance crm. ACM Trans. Intell. Syst. Technol. 8 (1), 14:1–14:26 (2016). https://doi.org/10.1145/2912148

Pappalardo, G., di Matteo, T., Caldarelli, G., Aste, T.: Blockchain inefficiency in the bitcoin peers network. EPJ Data Sci. 7 (1), 30 (2018). https://doi.org/10.1140/epjds/s13688-018-0159-3

Pappalardo, L., Barlacchi, G., Pellungrini, R., Simini, F.: Human mobility from theory to practice: Data, models and applications. In: S. Amer-Yahia, M. Mahdian, A. Goel, G. Houben, K. Lerman, J.J. McAuley, R.A. Baeza-Yates, L. Zia (eds.) Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019., pp. 1311–1312. ACM (2019). https://doi.org/10.1145/3308560.3320099

Pappalardo, L., Cintia, P., Ferragina, P., Massucco, E., Pedreschi, D., Giannotti, F.: Playerank: data-driven performance evaluation and player ranking in soccer via a machine learning approach. ACM TIST 10 (5), 59:1–59:27 (2019). https://doi.org/10.1145/3343172

Pappalardo, L., Vanhoof, M., Gabrielli, L., Smoreda, Z., Pedreschi, D., Giannotti, F.: An analytical framework to nowcast well-being using mobile phone data. CoRR abs/1606.06279 (2016). arxiv: 1606.06279

Pasquale, F.: The Black Box Society: The Secret Algorithms That Control Money and Information. Harvard University Press, Cambridge (2015)


Piškorec, M., Antulov-Fantulin, N., Miholić, I., Šmuc, T., Šikić, M.: Modeling peer and external influence in online social networks: Case of 2013 referendum in croatia. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds.) Complex Networks & Their Applications VI. Springer, Cham (2018)


Ranco, G., Aleksovski, D., Caldarelli, G., Mozetic, I.: Investigating the relations between twitter sentiment and stock prices. CoRR abs/1506.02431 (2015). arxiv: 1506.02431

Ribeiro, M.T., Singh, S., Guestrin, C.: “why should I trust you?”: Explaining the predictions of any classifier. In: B. Krishnapuram, M. Shah, A.J. Smola, C.C. Aggarwal, D. Shen, R. Rastogi (eds.) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 Aug 2016, pp. 1135–1144. ACM (2016). https://doi.org/10.1145/2939672.2939778

Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: High-precision model-agnostic explanations. In: S.A. McIlraith, K.Q. Weinberger (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, 2–7 Feb 2018, pp. 1527–1535. AAAI Press (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16982

Rossetti, G., Milli, L., Rinzivillo, S., Sîrbu, A., Pedreschi, D., Giannotti, F.: Ndlib: a python library to model and analyze diffusion processes over complex networks. Int. J. Data Sci. Anal. 5 (1), 61–79 (2018). https://doi.org/10.1007/s41060-017-0086-6

Rossetti, G., Pappalardo, L., Pedreschi, D., Giannotti, F.: Tiles: an online algorithm for community discovery in dynamic social networks. Mach. Learn. 106 (8), 1213–1241 (2017). https://doi.org/10.1007/s10994-016-5582-8

Rossi, A., Pappalardo, L., Cintia, P., Fernández, J., Iaia, M.F., Medina, D.: Who is going to get hurt? Predicting injuries in professional soccer. In: J. Davis, M. Kaytoue, A. Zimmermann (eds.) Proceedings of the 4th Workshop on Machine Learning and Data Mining for Sports Analytics co-located with 2017 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2017), Skopje, Macedonia, 18 Sept 2017, CEUR Workshop Proceedings, vol. 1971, pp. 21–30. CEUR-WS.org (2017). http://ceur-ws.org/Vol-1971/paper-04.pdf

Ruggieri, S., Pedreschi, D., Turini, F.: DCUBE: discrimination discovery in databases. In: A.K. Elmagarmid, D. Agrawal (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, 6–10 June 2010, pp. 1127–1130. ACM (2010). https://doi.org/10.1145/1807167.1807298

Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1312.html#SimonyanVZ13

Smilkov, D., Thorat, N., Kim, B., Viégas, F.B., Wattenberg, M.: Smoothgrad: removing noise by adding noise. CoRR abs/1706.03825 (2017). arxiv: 1706.03825

Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 3319–3328. PMLR, International Convention Centre, Sydney, Australia (2017). http://proceedings.mlr.press/v70/sundararajan17a.html

Trasarti, R., Guidotti, R., Monreale, A., Giannotti, F.: Myway: location prediction via mobility profiling. Inf. Syst. 64 , 350–367 (2017). https://doi.org/10.1016/j.is.2015.11.002

Traub, J., Quiané-Ruiz, J., Kaoudi, Z., Markl, V.: Agora: Towards an open ecosystem for democratizing data science & artificial intelligence. CoRR abs/1909.03026 (2019). arxiv: 1909.03026

Vazifeh, M.M., Zhang, H., Santi, P., Ratti, C.: Optimizing the deployment of electric vehicle charging stations using pervasive mobility data. Transp Res A Policy Practice 121 (C), 75–91 (2019). https://doi.org/10.1016/j.tra.2019.01.002

Vermeulen, A.F.: Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets, 1st edn. Apress, New York (2018)


Acknowledgements

This work is supported by the European Community's H2020 Program under the scheme 'INFRAIA-1-2014-2015: Research Infrastructures', grant agreement #654024 'SoBigData: Social Mining and Big Data Ecosystem', and the scheme 'INFRAIA-01-2018-2019: Research and Innovation action', grant agreement #871042 'SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics'.

Open access funding provided by Università di Pisa within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, KDDLab, Pisa, Italy

Valerio Grossi & Fosca Giannotti

Department of Computer Science, University of Pisa, Pisa, Italy

Dino Pedreschi

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, NeMIS, Pisa, Italy

Paolo Manghi, Pasquale Pagano & Massimiliano Assante


Corresponding author

Correspondence to Dino Pedreschi .

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Grossi, V., Giannotti, F., Pedreschi, D. et al. Data science: a game changer for science and innovation. Int J Data Sci Anal 11 , 263–278 (2021). https://doi.org/10.1007/s41060-020-00240-2

Download citation

Received : 13 July 2019

Accepted : 15 December 2020

Published : 19 April 2021

Issue Date : May 2021

DOI : https://doi.org/10.1007/s41060-020-00240-2


Keywords:

  • Responsible data science
  • Research infrastructure
  • Social mining

Grad Coach

How To Write A Research Paper

Step-By-Step Tutorial With Examples + FREE Template

By: Derek Jansen (MBA) | Expert Reviewer: Dr Eunice Rautenbach | March 2024

For many students, crafting a strong research paper from scratch can feel like a daunting task – and rightly so! In this post, we'll unpack what a research paper is, what it needs to do, and how to write one – in three easy steps. 🙂

Overview: Writing A Research Paper

What (exactly) is a research paper?

  • How to write a research paper
  • Stage 1: Topic & literature search
  • Stage 2: Structure & outline
  • Stage 3: Iterative writing
  • Key takeaways

Let's start by asking the most important question: "What is a research paper?".

Simply put, a research paper is a scholarly written work where the writer (that's you!) answers a specific question (this is called a research question) through evidence-based arguments. Evidence-based is the keyword here. In other words, a research paper is different from an essay or other writing assignments that draw from the writer's personal opinions or experiences. With a research paper, it's all about building your arguments based on evidence (we'll talk more about that evidence a little later).

Now, it's worth noting that there are many different types of research papers, including analytical papers (the type I just described), argumentative papers, and interpretative papers. Here, we'll focus on analytical papers, as these are some of the most common – but if you're keen to learn about other types of research papers, be sure to check out the rest of the blog.

With that basic foundation laid, let's get down to business and look at how to write a research paper.


Overview: The 3-Stage Process

While there are, of course, many potential approaches you can take to write a research paper, there are typically three stages to the writing process. So, in this tutorial, we’ll present a straightforward three-step process that we use when working with students at Grad Coach.

These three steps are:

  • Finding a research topic and reviewing the existing literature
  • Developing a provisional structure and outline for your paper, and
  • Writing up your initial draft and then refining it iteratively

Let’s dig into each of these.


Step 1: Find a topic and review the literature

As we mentioned earlier, in a research paper, you, as the researcher, will try to answer a question. More specifically, that's called a research question, and it sets the direction of your entire paper. What's important to understand, though, is that you'll need to answer that research question with the help of high-quality sources – for example, journal articles, government reports, case studies, and so on. We'll circle back to this in a minute.

The first stage of the research process is deciding on what your research question will be and then reviewing the existing literature (in other words, past studies and papers) to see what they say about that specific research question. In some cases, your professor may provide you with a predetermined research question (or set of questions). However, in many cases, you’ll need to find your own research question within a certain topic area.

Finding a strong research question hinges on identifying a meaningful research gap – in other words, an area that’s lacking in existing research. There’s a lot to unpack here, so if you wanna learn more, check out the plain-language explainer video below.

Once you've figured out which question (or questions) you'll attempt to answer in your research paper, you'll need to do a deep dive into the existing literature – this is called a "literature search". Again, there are many ways to go about this, but your most likely starting point will be Google Scholar.

If you’re new to Google Scholar, think of it as Google for the academic world. You can start by simply entering a few different keywords that are relevant to your research question and it will then present a host of articles for you to review. What you want to pay close attention to here is the number of citations for each paper – the more citations a paper has, the more credible it is (generally speaking – there are some exceptions, of course).


Ideally, what you're looking for are well-cited papers that are highly relevant to your topic. That said, keep in mind that citations are a cumulative metric, so older papers will often have more citations than newer papers – just because they've been around for longer. So, don't fixate on this metric in isolation – relevance and recency are also very important.

Beyond Google Scholar, you’ll also definitely want to check out academic databases and aggregators such as Science Direct, PubMed, JStor and so on. These will often overlap with the results that you find in Google Scholar, but they can also reveal some hidden gems – so, be sure to check them out.

Once you’ve worked your way through all the literature, you’ll want to catalogue all this information in some sort of spreadsheet so that you can easily recall who said what, when and within what context. If you’d like, we’ve got a free literature spreadsheet that helps you do exactly that.
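By the way, if you're comfortable with a little Python, you can keep that catalogue as a small pandas table instead of a spreadsheet. Here's a minimal sketch – the column names and entries are made up purely for illustration, so adapt them to your own project:

```python
# A tiny literature catalogue: who said what, when, and in what context.
# The entries below are invented examples.
import pandas as pd

catalogue = pd.DataFrame([
    {"author": "Smith & Lee", "year": 2021, "citations": 150,
     "key_finding": "X correlates with Y in large samples",
     "relevance": "high"},
    {"author": "Garcia", "year": 2023, "citations": 12,
     "key_finding": "Challenges Smith & Lee's measurement of X",
     "relevance": "medium"},
])

# Sort by relevance first, then recency - not by citation count alone.
print(catalogue.sort_values(["relevance", "year"], ascending=[True, False]))
catalogue.to_csv("literature_catalogue.csv", index=False)
```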

Don’t fixate on an article’s citation count in isolation - relevance (to your research question) and recency are also very important.

Step 2: Develop a structure and outline

With your research question pinned down and your literature digested and catalogued, it's time to move on to planning your actual research paper.

It might sound obvious, but it's really important to have some sort of rough outline in place before you start writing your paper. So often, we see students eagerly rushing into the writing phase, only to land up with a disjointed research paper that rambles on in multiple directions.

Now, the secret here is to not get caught up in the fine details. Realistically, all you need at this stage is a bullet-point list that describes (in broad strokes) what you'll discuss and in what order. It's also useful to remember that you're not glued to this outline – in all likelihood, you'll chop and change some sections once you start writing, and that's perfectly okay. What's important is that you have some sort of roadmap in place from the start.

You need to have a rough outline in place before you start writing your paper - or you’ll end up with a disjointed research paper that rambles on.

At this stage you might be wondering, "But how should I structure my research paper?". Well, there's no one-size-fits-all solution here, but in general, a research paper will consist of a few relatively standardised components:

  • Introduction
  • Literature review
  • Methodology
  • Results and discussion
  • Conclusion

Let’s take a look at each of these.

First up is the introduction section. As the name suggests, the purpose of the introduction is to set the scene for your research paper. There are usually (at least) four ingredients that go into this section – these are the background to the topic, the research problem, the resultant research question, and the justification or rationale. If you're interested, the video below unpacks the introduction section in more detail.

The next section of your research paper will typically be your literature review. Remember all that literature you worked through earlier? Well, this is where you'll present your interpretation of all that content. You'll do this by writing about recent trends, developments, and arguments within the literature – but more specifically, those that are relevant to your research question. The literature review can oftentimes seem a little daunting, even to seasoned researchers, so be sure to check out our extensive collection of literature review content here.

With the introduction and lit review out of the way, the next section of your paper is the research methodology. In a nutshell, the methodology section should describe to your reader what you did (beyond just reviewing the existing literature) to answer your research question. For example, what data did you collect, how did you collect that data, how did you analyse that data, and so on? For each choice, you'll also need to justify why you chose to do it that way, and what the strengths and weaknesses of your approach were.

Now, it's worth mentioning that for some research papers, this aspect of the project may be a lot simpler. For example, you may only need to draw on secondary sources (in other words, existing data sets). In some cases, you may just be asked to draw your conclusions from the literature search itself (in other words, there may be no data analysis at all). But, if you are required to collect and analyse data, you'll need to pay a lot of attention to the methodology section. The video below provides an example of what the methodology section might look like.

By this stage of your paper, you will have explained what your research question is, what the existing literature has to say about that question, and how you analysed additional data to try to answer your question. So, the natural next step is to present your analysis of that data. This section is usually called the "results" or "analysis" section, and this is where you'll showcase your findings.

Depending on your school’s requirements, you may need to present and interpret the data in one section – or you might split the presentation and the interpretation into two sections. In the latter case, your “results” section will just describe the data, and the “discussion” is where you’ll interpret that data and explicitly link your analysis back to your research question. If you’re not sure which approach to take, check in with your professor or take a look at past papers to see what the norms are for your programme.

Alright – once you've presented and discussed your results, it's time to wrap it up. This usually takes the form of the "conclusion" section. In the conclusion, you'll need to highlight the key takeaways from your study and close the loop by explicitly answering your research question. Again, the exact requirements here will vary depending on your programme (and you may not even need a conclusion section at all) – so be sure to check with your professor if you're unsure.

Step 3: Write and refine

Finally, it’s time to get writing. All too often though, students hit a brick wall right about here… So, how do you avoid this happening to you?

Well, there’s a lot to be said when it comes to writing a research paper (or any sort of academic piece), but we’ll share three practical tips to help you get started.

First and foremost, it's essential to approach your writing as an iterative process. In other words, you need to start with a really messy first draft and then polish it over multiple rounds of editing. Don't waste your time trying to write a perfect research paper in one go. Instead, take the pressure off yourself by adopting an iterative approach.

Secondly, it's important to always lean towards critical writing, rather than descriptive writing. What does this mean? Well, at the simplest level, descriptive writing focuses on the "what", while critical writing digs into the "so what" – in other words, the implications. If you're not familiar with these two types of writing, don't worry! You can find a plain-language explanation here.

Last but not least, you’ll need to get your referencing right. Specifically, you’ll need to provide credible, correctly formatted citations for the statements you make. We see students making referencing mistakes all the time and it costs them dearly. The good news is that you can easily avoid this by using a simple reference manager . If you don’t have one, check out our video about Mendeley, an easy (and free) reference management tool that you can start using today.

Recap: Key Takeaways

We’ve covered a lot of ground here. To recap, the three steps to writing a high-quality research paper are:

  • To choose a research question and review the literature
  • To plan your paper structure and draft an outline
  • To take an iterative approach to writing, focusing on critical writing and strong referencing

Remember, this is just a big-picture overview of the research paper development process and there's a lot more nuance to unpack. So, be sure to grab a copy of our free research paper template to learn more about how to write a research paper.


When you choose to publish with PLOS, your research makes an impact. Make your work accessible to all, without restrictions, and accelerate scientific discovery with options like preprints and published peer review that make your work more Open.


Open Data is a strategy for incorporating research data into the permanent scientific record by releasing it under an Open Access license. Whether data is deposited in a purpose-built repository or published as Supporting Information alongside a research article, Open Data practices ensure that data remains accessible and discoverable for verification, replication, reuse, and enhanced understanding of research.

Benefits of Open Data

Readers rely on raw scientific data to enhance their understanding of published research, for purposes of verification, replication and reanalysis, and to inform future investigations.

Ensure reproducibility: Proactively sharing data ensures that your work remains reproducible over the long term.

Inspire trust: Sharing data demonstrates rigor and signals to the community that the work has integrity.

Receive credit: Making data public opens opportunities to get academic credit for collecting and curating data during the research process.

Make a contribution: Access to data accelerates progress. According to the 2019 State of Open Data report, more than 70% of researchers use open datasets to inform their future research.

Preserve the scientific record: Posting datasets in a repository or uploading them as Supporting Information prevents data loss.

Why do researchers choose to make their data public?

Watch the short video that explores the top benefits of data sharing, what types of research data you should share, and how you can get it ready to help ensure more impact for your research.

PLOS Open Data policy

Publishing in a PLOS journal carries with it a commitment to make the data underlying the conclusions in your research article publicly available upon publication.

Our data policy underscores the rigor of the research we publish, and gives readers a fuller understanding of each study.

Read more about Open Data


Data repositories

All methods of data sharing facilitate reproduction, improve trust in science, ensure appropriate credit, and prevent data loss. When you choose to deposit your data in a repository, those benefits are magnified and extended.

Data posted in a repository is…

…more discoverable.

Detailed metadata and bidirectional linking to and from related articles help to make data in public repositories easily findable.

…more reusable.

Machine-readable data formatting allows research in a repository to be incorporated into future systematic reviews or meta-analyses more easily.

…easier to cite.

Repositories assign data its own unique DOI, distinct from that of related research articles, so datasets can accumulate citations in their own right, illustrating the importance and lasting relevance of the data itself (see the short sketch below for retrieving such metadata programmatically).

…more likely to earn citations.

A 2020 study of more than 500,000 published research articles found that articles linking to data in a public repository had, on average, a 25% higher citation rate than articles whose data were available only on request or as Supporting Information.
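Because every deposited dataset gets its own DOI, its metadata is also machine-retrievable. As a rough illustration, the following Python sketch queries the public DataCite REST API for a dataset DOI; the DOI shown is a placeholder, and the exact response fields may vary by record:

```python
# Fetch machine-readable metadata for a dataset DOI from the DataCite API.
# The DOI below is a hypothetical placeholder; substitute a real dataset DOI.
import requests

doi = "10.5061/dryad.example"
resp = requests.get(f"https://api.datacite.org/dois/{doi}",
                    headers={"Accept": "application/vnd.api+json"})
if resp.ok:
    attrs = resp.json()["data"]["attributes"]
    print(attrs.get("titles"), attrs.get("publicationYear"))
else:
    print("DOI not found:", resp.status_code)
```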

Open Data is more discoverable and accessible than ever

Deposit your data in a repository and earn an accessible data icon.


You already know depositing research data in a repository yields benefits like improved reproducibility, discoverability, and more attention and citations for your research.

PLOS helps to magnify these benefits even further with our Accessible Data icon. When you link to select, popular data repositories, your article earns an eye-catching graphic with a link to the associated dataset, so it’s more visible to readers.

Participating data repositories include: 

  • Open Science Framework (OSF)
  • Gene Expression Omnibus
  • NCBI Bioproject
  • NCBI Sequence Read Archive
  • Demographic and Health Surveys

We aim to add more repositories to the list in the future.

The PLOS Open Science Toolbox

The future is open

The PLOS Open Science Toolbox is your source for sci-comm tips and best-practice. Learn practical strategies and hands-on tips to improve reproducibility, increase trust, and maximize the impact of your research through Open Science.

Sign up to have new issues delivered to your inbox every week.

Learn more about the benefits of Open Science.


Research Data – Types, Methods and Examples

Research Data

Research data refers to any information or evidence gathered through systematic investigation or experimentation to support or refute a hypothesis or answer a research question.

It includes both primary and secondary data, and can be in various formats such as numerical, textual, audiovisual, or visual. Research data plays a critical role in scientific inquiry and is often subject to rigorous analysis, interpretation, and dissemination to advance knowledge and inform decision-making.

Types of Research Data

There are generally four types of research data:

Quantitative Data

This type of data involves the collection and analysis of numerical data. It is often gathered through surveys, experiments, or other types of structured data collection methods. Quantitative data can be analyzed using statistical techniques to identify patterns or relationships in the data.

Qualitative Data

This type of data is non-numerical and often involves the collection and analysis of words, images, or sounds. It is often gathered through methods such as interviews, focus groups, or observation. Qualitative data can be analyzed using techniques such as content analysis, thematic analysis, or discourse analysis.

Primary Data

This type of data is collected by the researcher directly from the source. It can include data gathered through surveys, experiments, interviews, or observation. Primary data is often used to answer specific research questions or to test hypotheses.

Secondary Data

This type of data is collected by someone other than the researcher. It can include data from sources such as government reports, academic journals, or industry publications. Secondary data is often used to supplement or support primary data or to provide context for a research project.

Research Data Formats

There are several formats in which research data can be collected and stored. Some common formats include:

  • Text: This format includes any type of written data, such as interview transcripts, survey responses, or open-ended questionnaire answers.
  • Numeric: This format includes any data that can be expressed as numerical values, such as measurements or counts.
  • Audio: This format includes any recorded data in an audio form, such as interviews or focus group discussions.
  • Video: This format includes any recorded data in a video form, such as observations of behavior or experimental procedures.
  • Images: This format includes any visual data, such as photographs, drawings, or scans of documents.
  • Mixed media: This format includes any combination of the above formats, such as a survey response that includes both text and numeric data, or an observation study that includes both video and audio recordings.
  • Sensor Data: This format includes data collected from various sensors or devices, such as GPS, accelerometers, or heart rate monitors.
  • Social Media Data: This format includes data collected from social media platforms, such as tweets, posts, or comments.
  • Geographic Information System (GIS) Data: This format includes data with a spatial component, such as maps or satellite imagery.
  • Machine-Readable Data: This format includes data that can be read and processed by machines, such as data in XML or JSON format (see the short example after this list).
  • Metadata: This format includes data that describes other data, such as information about the source, format, or content of a dataset.
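As a small illustration of the last two formats above, the following Python sketch stores a hypothetical survey record as machine-readable JSON together with its metadata; all field names and values are invented:

```python
# A machine-readable survey record with accompanying metadata, stored as JSON.
import json

record = {
    "metadata": {                      # data about the data
        "source": "2024 campus wellbeing survey (hypothetical)",
        "format": "JSON",
        "collected": "2024-03-15",
        "license": "CC-BY-4.0",
    },
    "data": [                          # the data itself: text + numeric fields
        {"respondent_id": 1, "age": 21, "response": "Mostly satisfied"},
        {"respondent_id": 2, "age": 34, "response": "Needs more study space"},
    ],
}

with open("survey_record.json", "w") as f:
    json.dump(record, f, indent=2)
print(json.dumps(record["metadata"], indent=2))
```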

Data Collection Methods

Some common research data collection methods include:

  • Surveys: Surveys involve asking participants to answer a series of questions about a particular topic. Surveys can be conducted online, over the phone, or in person.
  • Interviews: Interviews involve asking participants a series of open-ended questions in order to gather detailed information about their experiences or perspectives. Interviews can be conducted in person, over the phone, or via video conferencing.
  • Focus groups: Focus groups involve bringing together a small group of participants to discuss a particular topic or issue in depth. The group is typically led by a moderator who asks questions and encourages discussion among the participants.
  • Observations: Observations involve watching and recording behaviors or events as they naturally occur. Observations can be conducted in person or through the use of video or audio recordings.
  • Experiments: Experiments involve manipulating one or more variables in order to measure the effect on an outcome of interest. Experiments can be conducted in a laboratory or in the field.
  • Case studies: Case studies involve conducting an in-depth analysis of a particular individual, group, or organization. Case studies typically involve gathering data from multiple sources, including interviews, observations, and document analysis.
  • Secondary data analysis: Secondary data analysis involves analyzing existing data that was collected for another purpose. Examples of secondary data sources include government records, academic research studies, and market research reports.

Analysis Methods

Some common research data analysis methods include:

  • Descriptive statistics: Descriptive statistics involve summarizing and describing the main features of a dataset, such as the mean, median, and standard deviation. Descriptive statistics are often used to provide an initial overview of the data (see the sketch after this list).
  • Inferential statistics: Inferential statistics involve using statistical techniques to draw conclusions about a population based on a sample of data. Inferential statistics are often used to test hypotheses and determine the statistical significance of relationships between variables (also illustrated in the sketch after this list).
  • Content analysis: Content analysis involves analyzing the content of text, audio, or video data to identify patterns, themes, or other meaningful features. Content analysis is often used in qualitative research to analyze open-ended survey responses, interviews, or other types of text data.
  • Discourse analysis: Discourse analysis involves analyzing the language used in text, audio, or video data to understand how meaning is constructed and communicated. Discourse analysis is often used in qualitative research to analyze interviews, focus group discussions, or other types of text data.
  • Grounded theory: Grounded theory involves developing a theory or model based on an analysis of qualitative data. Grounded theory is often used in exploratory research to generate new insights and hypotheses.
  • Network analysis: Network analysis involves analyzing the relationships between entities, such as individuals or organizations, in a network. Network analysis is often used in social network analysis to understand the structure and dynamics of social networks.
  • Structural equation modeling: Structural equation modeling involves using statistical techniques to test complex models that include multiple variables and relationships. Structural equation modeling is often used in social science research to test theories about the relationships between variables.
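As a minimal illustration of the first two methods above, the following Python sketch computes descriptive summaries for two synthetic groups and then applies an inferential test (an independent-samples t-test); the data and group labels are invented for demonstration:

```python
# Descriptive vs. inferential statistics on a synthetic two-group dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=50, scale=10, size=120)   # e.g., scores, method A
group_b = rng.normal(loc=54, scale=10, size=120)   # e.g., scores, method B

# Descriptive: summarize each group.
for name, g in [("A", group_a), ("B", group_b)]:
    print(f"group {name}: mean={g.mean():.1f}, "
          f"median={np.median(g):.1f}, sd={g.std(ddof=1):.1f}")

# Inferential: is the difference in group means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```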

Purpose of Research Data

Research data serves several important purposes, including:

  • Supporting scientific discoveries: Research data provides the basis for scientific discoveries and innovations. Researchers use data to test hypotheses, develop new theories, and advance scientific knowledge in their field.
  • Validating research findings: Research data provides the evidence necessary to validate research findings. By analyzing and interpreting data, researchers can determine the statistical significance of relationships between variables and draw conclusions about the research question.
  • Informing policy decisions: Research data can be used to inform policy decisions by providing evidence about the effectiveness of different policies or interventions. Policymakers can use data to make informed decisions about how to allocate resources and address social or economic challenges.
  • Promoting transparency and accountability: Research data promotes transparency and accountability by allowing other researchers to verify and replicate research findings. Data sharing also promotes transparency by allowing others to examine the methods used to collect and analyze data.
  • Supporting education and training: Research data can be used to support education and training by providing examples of research methods, data analysis techniques, and research findings. Students and researchers can use data to learn new research skills and to develop their own research projects.

Applications of Research Data

Research data has numerous applications across various fields, including social sciences, natural sciences, engineering, and health sciences. The applications of research data can be broadly classified into the following categories:

  • Academic research: Research data is widely used in academic research to test hypotheses, develop new theories, and advance scientific knowledge. Researchers use data to explore complex relationships between variables, identify patterns, and make predictions.
  • Business and industry: Research data is used in business and industry to make informed decisions about product development, marketing, and customer engagement. Data analysis techniques such as market research, customer analytics, and financial analysis are widely used to gain insights and inform strategic decision-making.
  • Healthcare: Research data is used in healthcare to improve patient outcomes, develop new treatments, and identify health risks. Researchers use data to analyze health trends, track disease outbreaks, and develop evidence-based treatment protocols.
  • Education: Research data is used in education to improve teaching and learning outcomes. Data analysis techniques such as assessments, surveys, and evaluations are used to measure student progress, evaluate program effectiveness, and inform policy decisions.
  • Government and public policy: Research data is used in government and public policy to inform decision-making and policy development. Data analysis techniques such as demographic analysis, cost-benefit analysis, and impact evaluation are widely used to evaluate policy effectiveness, identify social or economic challenges, and develop evidence-based policy solutions.
  • Environmental management: Research data is used in environmental management to monitor environmental conditions, track changes, and identify emerging threats. Data analysis techniques such as spatial analysis, remote sensing, and modeling are used to map environmental features, monitor ecosystem health, and inform policy decisions.

Advantages of Research Data

Research data has numerous advantages, including:

  • Empirical evidence: Research data provides empirical evidence that can be used to support or refute theories, test hypotheses, and inform decision-making. This evidence-based approach helps to ensure that decisions are based on objective, measurable data rather than subjective opinions or assumptions.
  • Accuracy and reliability: Research data is typically collected using rigorous scientific methods and protocols, which helps to ensure its accuracy and reliability. Data can be validated and verified using statistical methods, which further enhances its credibility.
  • Replicability: Research data can be replicated and validated by other researchers, which helps to promote transparency and accountability in research. By making data available for others to analyze and interpret, researchers can ensure that their findings are robust and reliable.
  • Insights and discoveries: Research data can provide insights into complex relationships between variables, identify patterns and trends, and reveal new discoveries. These insights can lead to the development of new theories, treatments, and interventions that can improve outcomes in various fields.
  • Informed decision-making: Research data can inform decision-making in a range of fields, including healthcare, business, education, and public policy. Data analysis techniques can be used to identify trends, evaluate the effectiveness of interventions, and inform policy decisions.
  • Efficiency and cost-effectiveness: Research data can help to improve efficiency and cost-effectiveness by identifying areas where resources can be directed most effectively. By using data to identify the most promising approaches or interventions, researchers can optimize the use of resources and improve outcomes.

Limitations of Research Data

Research data has several limitations that researchers should be aware of, including:

  • Bias and subjectivity: Research data can be influenced by biases and subjectivity, which can affect the accuracy and reliability of the data. Researchers must take steps to minimize bias and subjectivity in data collection and analysis.
  • Incomplete data: Research data can be incomplete or missing, which can affect the validity of the findings. Researchers must ensure that data is complete and representative to ensure that their findings are reliable.
  • Limited scope: Research data may be limited in scope, which can limit the generalizability of the findings. Researchers must carefully consider the scope of their research and ensure that their findings are applicable to the broader population.
  • Data quality: Research data can be affected by issues such as measurement error, data entry errors, and missing data, which can affect the quality of the data. Researchers must ensure that data is collected and analyzed using rigorous methods to minimize these issues.
  • Ethical concerns: Research data can raise ethical concerns, particularly when it involves human subjects. Researchers must ensure that their research complies with ethical standards and protects the rights and privacy of human subjects.
  • Data security: Research data must be protected to prevent unauthorized access or use. Researchers must ensure that data is stored and transmitted securely to protect the confidentiality and integrity of the data.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer



Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data enables smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, making the computing process automatic and smart. In this paper, we present a comprehensive view of "data science", including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, urban and rural data science, and so on, taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of “data science and advanced analytics”, where almost everything in our daily lives is digitally recorded as data [17]. Thus the current electronic world holds a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc. [112]. The data can be structured, semi-structured, or unstructured, and its volume increases day by day [105]. Data science is typically a “concept to unify statistics, data analysis, and their related methods” to understand and analyze actual phenomena with data. According to Cao et al. [17], “data science is the science of data” or “data science is the study of data”, where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of “Data science” is increasing day by day, as shown in Fig. 1 according to Google Trends data over the last 5 years [36]. In addition to data science, the figure also shows the popularity trends of the related areas “Data analytics”, “Data mining”, “Big data”, and “Machine learning”. According to Fig. 1, the popularity indication values for these data-driven domains, particularly “Data science” and “Machine learning”, are increasing day by day. This statistical information, and the applicability of data-driven smart decision-making in various real-world application areas, motivate us to briefly study “Data science” and machine-learning-based “Advanced analytics” in this paper.

Fig. 1: The worldwide popularity score of data science compared with relevant areas, in a range of 0 (min) to 100 (max) over time; the x-axis represents the timestamp and the y-axis the corresponding score.

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. The emphasis of advanced analytics is on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of the data, while advanced analytics goes a step further, offering a deeper understanding of the data and helping to analyze it at a granular level. In the field of data science, several types of analytics are popular: “Descriptive analytics”, which answers the question of what happened; “Diagnostic analytics”, which answers the question of why it happened; “Predictive analytics”, which predicts what will happen in the future; and “Prescriptive analytics”, which prescribes what action should be taken, discussed briefly in “Advanced analytics methods and smart computing”. Such advanced analytics and decision-making based on machine learning techniques [105], a major part of artificial intelligence (AI) [102], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to their learning capability for smart computing as well as automation [121].

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data and thus can be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from the artificial neural network, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in “Advanced analytics methods and smart computing”. It is thus important to understand the principles of the various advanced analytics methods mentioned above and their applicability in various real-world application areas. For instance, in our earlier paper, Sarker et al. [114], we discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account data science application areas and real-world problems in ten potential domains, including business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in “Real-world application domains”.

Based on the importance of machine learning modeling for extracting useful insights from data, and of data-driven smart decision-making, in this paper we present a comprehensive view of “Data Science”, including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application. The key contribution of this study is thus understanding data science modeling, explaining different analytics methods from a solution perspective, and discussing their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is to provide a basic guide or reference for academics and industry professionals who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision-making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in real-world settings. We also briefly discuss the concept of data science modeling, from business problems to data products and automation, to understand its applicability and provide intelligent services in real-world scenarios.
  • To provide a comprehensive view of data science, including advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily lives, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are highly related and easily confused. In the following, we define these terms and differentiate them from the term “Data Science” according to our goal.

The term “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [17]. The term “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insights [17]. Statistical and mathematical analysis of the data is the major concern in this process. “Data mining” is another popular term over the last decade, which has a similar meaning to several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [38], it should have been more appropriately named “knowledge mining from data”. Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [38]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through a system. “Big data” is another popular term nowadays, which may change statistical and data analysis approaches, as it has the unique features of being “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous” [74]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [129]. Several unique features, including volume, velocity, variety, veracity, value (5Vs), and complexity, are used to understand and describe big data [69].

In terms of analytics, basic analytics provides a summary of data, whereas “Advanced Analytics” takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights and predict or generate recommendations, typically beyond traditional business intelligence or analytics. “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics and can automate analytical model building [112]. It is built on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement [38, 115]. “Deep learning” is a subfield of machine learning concerned with algorithms inspired by the structure and function of the human brain, called artificial neural networks [38, 139].

Unlike the above data-related terms, “Data science” is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [17], Cao et al. define data science from the disciplinary perspective as “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. In “Understanding data science modeling”, we briefly discuss data science modeling from a practical perspective, starting from business problems to data products, which can assist data scientists in thinking and working in a particular real-world problem domain within the area of data science and analytics.

Related Work

Several papers in the area have reviewed data science and its significance. For example, the authors in [19] identify the evolving field of data science and its importance in the broader knowledge environment, along with some issues that differentiate data science and informatics from conventional approaches in the information sciences. Donoho [27] presents 50 years of data science, including recent commentary on data science in the mass media and on how, or whether, data science differs from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [53] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [17], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [20], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [61] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [62], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [87] provide a thorough understanding of computational optimal transport with application to data science. In [97], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling starting from business problems to data product and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is key [17, 112, 114]. The data can be of different types, such as (i) Structured—data that has a well-defined structure and follows a standard order; examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—data that has no pre-defined format or organization; examples are sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—data that has elements of both structured and unstructured data, containing certain organizational properties; examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—data about the data; examples are author, file type, file size, creation date and time, last modification date and time, etc. [38, 105].

In the area of data science, researchers use various widely used datasets for different purposes. These include, for example, cybersecurity datasets such as NSL-KDD [127], UNSW-NB15 [79], Bot-IoT [59], ISCX’12 [15], CIC-DDoS2019 [22], etc.; smartphone datasets such as phone call logs [88, 110], mobile application usage logs [124, 149], SMS logs [28], mobile phone notification logs [77], etc.; IoT data [56, 11, 64]; health data such as heart disease [99], diabetes mellitus [86, 147], COVID-19 [41, 78], etc.; agriculture and e-commerce data [128, 150]; and many more in various application domains. In “Real-world application domains”, we discuss ten potential real-world application domains of data science and analytics, taking into account data-driven smart computing and decision-making, which can help data scientists and application developers explore various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in “Background and related work”. In this section, we briefly discuss how data science can play a significant role in the real-world business process. Figure 2 shows an example of data science modeling starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

  • Understanding business problems: This involves getting a clear understanding of the problem to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, is the behavior unrealistic/abnormal, which option should be taken, what action, etc., could be relevant questions depending on the nature of the problem. This helps to get a better idea of what the business needs and what should be extracted from the data. Such business knowledge can enable organizations to enhance their decision-making process, a capability known as “Business Intelligence” [65]. Identifying the relevant data sources that can help to answer the formulated questions, and what kinds of actions should be taken from the trends that the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve it.
  • Understanding data: Data science is largely driven by the availability of data [114]. Thus, a sound understanding of the data is needed for a data-driven model or system. The reason is that real-world datasets are often noisy, contain missing values and inconsistencies, or have other data issues, which need to be handled effectively [101]. To gain actionable insights, the appropriate data must be sourced and cleansed, which is fundamental to any data science engagement. For this, a data assessment that evaluates what data is available and how it aligns with the business problem could be the first step in data understanding. Several aspects, such as data type/format, whether the quantity of data is sufficient to extract useful knowledge, data relevance, authorized access to the data, feature or attribute importance, combining multiple data sources, and important metrics for reporting the data, need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would be best needed and the best ways to acquire it.
  • Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [135]. It examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner in order to construct meaningful summaries of the data. Thus, data exploration is typically used to figure out the gist of the data and to develop a first assessment of its quality, quantity, and characteristics. A statistical model may or may not be used, but primarily this stage offers tools for creating hypotheses by visualizing and interpreting the data through graphical representations such as charts, plots, histograms, etc. [72, 91]. Before the data is ready for modeling, it is necessary to use data summarization and visualization to audit its quality and provide the information needed to process it. To ensure data quality, the data pre-processing technique, which is typically the process of cleaning and transforming raw data [107] before processing and analysis, is important. It also involves reformatting information, making data corrections, and merging datasets to enrich the data. Thus, several aspects such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, searching for outliers or anomalies in the data and dealing with them, and ensuring data quality could be the key considerations in this step.
  • Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building depends on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in “Advanced analytics methods and smart computing”. To best fit the data according to the type of analytics, different types of data-driven or machine learning models, summarized in our earlier paper Sarker et al. [105], can be built to achieve the goal. Data scientists typically separate the given dataset into training and test subsets, usually in an 80:20 ratio, or use the popular k-fold data splitting method [38]. This is done to observe whether the model performs well on held-out data and to maximize the model's performance (a minimal code sketch follows this list). Various model validation and assessment metrics, such as error rate, accuracy, true positives, false positives, true negatives, false negatives, precision, recall, f-score, ROC (receiver operating characteristic) curve analysis, applicability analysis, etc. [38, 115], are used to measure model performance and can guide data scientists in choosing or designing the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced techniques, such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms, to improve the ultimate data-driven model that solves a particular business problem through smart decision-making.
  • Data product and automation: A data product is typically the output of any data science activity [17]. A data product, in general terms, is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in “Real-world application domains”, where various data products can play a significant role in relevant business problems to make them smart and automated.
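
As referenced in the modeling-and-evaluation step above, the following is a minimal sketch of that step using scikit-learn. The bundled breast-cancer dataset, the logistic regression model, and the specific metric choices are illustrative assumptions, not part of the original text.

```python
# A minimal sketch of the 80:20 train/test split and evaluation step
# described above; the dataset and model choice are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)

# 80:20 train/test split, as mentioned in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)  # higher max_iter aids convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# A few of the validation metrics listed in the text.
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
```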

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. The interesting part of the data science process lies in having a deeper understanding of the business problem to solve; without that, it would be much harder to gather the right data and extract the most useful information from it for decision-making. In terms of role, “Data Scientists” typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms focused on advanced analytics, which can make today's computing processes smarter and more intelligent, as discussed briefly in the following section.

Fig. 2: An example of data science modeling from real-world data to a data-driven system and decision making.

Advanced Analytics Methods and Smart Computing

As mentioned earlier in “Background and related work”, basic analytics provides a summary of data, whereas advanced analytics takes a step forward in offering a deeper understanding of data and helps in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, “advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following, we first summarize the various types of analytics and outcomes that are needed to solve associated business problems, and then we briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions, such as “What happened?”, “Why did it happen?”, “What will happen in the future?”, and “What action should be taken?”, are common and important. Based on these questions, in this paper, we categorize the analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

  • Descriptive analytics: This is the interpretation of historical data to better understand the changes that have occurred in a business. Thus, descriptive analytics answers the question “what happened in the past?” by summarizing past data such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, LinkedIn or Facebook, etc. For instance, by analyzing trends, patterns, and anomalies in customers' historical shopping data, descriptive analytics can characterize purchasing behavior, which in turn can feed the prediction of a customer's probability of purchasing a product. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business and eventually adopt more effective management strategies and business decisions.
  • Diagnostic analytics: It is a form of advanced analytics that examines data or content to answer the question, “why did it happen?” The goal of diagnostic analytics is to help to find the root cause of the problem. For example, the human resource management department of a business organization may use these diagnostic analytics to find the best applicant for a position, select them, and compare them to other similar positions to see how well they perform. In a healthcare example, it might help to figure out whether the patients’ symptoms such as high fever, dry cough, headache, fatigue, etc. are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
  • Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes, such as assessing business risks, anticipating potential market patterns, and deciding when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question, “what will happen in the future?” Thus, the primary goal of predictive analytics is to identify and typically answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. For example, companies can use predictive analytics to minimize costs by better anticipating future demand and adjusting output and inventory; banks and other financial institutions can reduce fraud and risks by predicting suspicious activity; medical specialists can make effective decisions by predicting which patients are at risk of disease; retailers can increase sales and customer satisfaction by understanding and predicting customer preferences; manufacturers can optimize production capacity by predicting maintenance requirements; and many more. Thus, predictive analytics can be considered the core analytical method within the area of data science.
  • Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, typically answering the question, “what action should be taken?” In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations make more informed decisions to produce results that drive the most successful business decisions.

In summary, both descriptive and diagnostic analytics look at the past, to clarify what happened and why it happened. Predictive and prescriptive analytics use historical data to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1, we summarize these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes and improvements in business processes. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from data.

Table 1: Various types of analytical methods with examples

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of a machine learning-based predictive model covering both the training and testing phases. In the following, we discuss a wide range of methods, such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on, within the scope of our study.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases.

Regression Analysis

In data science, one of the most common statistical approaches used for predictive modeling and data mining tasks is regression [38]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict continuous-valued output [105, 117]. Equations 1, 2, and 3 [85, 105] represent simple, multiple (multivariate), and polynomial regression, respectively, where x represents an independent variable and y is the predicted/target output.
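
In standard textbook notation (the intercept $a$ and coefficients $b_i$ below are assumed; the rendered equations are reconstructed from their descriptions), these take the forms:

$$y = a + b x \tag{1}$$

$$y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n \tag{2}$$

$$y = a + b_1 x + b_2 x^2 + \cdots + b_n x^n \tag{3}$$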

Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable, i.e., finding the causal relationship between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem. In that case, polynomial regression performs better; however, it increases model complexity. Regularization techniques such as Ridge, Lasso, Elastic-Net, etc. [85, 105] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression techniques [85, 105] can be used for building effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc. are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
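
As a brief, hedged illustration of these points (not from the original paper), the following scikit-learn sketch fits a simple linear model and a ridge-regularized polynomial model to assumed synthetic quadratic data:

```python
# A minimal sketch of linear vs. polynomial regression with scikit-learn;
# the synthetic quadratic data is assumed purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.3, size=200)

# Simple linear regression (Eq. 1) underfits the quadratic trend.
linear = LinearRegression().fit(x, y)

# Polynomial regression (Eq. 3): expand features, then fit a linear model.
# Ridge regularization, mentioned in the text, controls model complexity.
poly = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0)).fit(x, y)

print("linear R^2:    ", linear.score(x, y))
print("polynomial R^2:", poly.score(x, y))
```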

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning that refers to a predictive modeling problem in which a class label is predicted for a given example [38]. Spam identification, with classes such as ‘spam’ and ‘not spam’ in email service providers, is an example of a classification problem. Several forms of classification analysis are available in the area: binary classification, which refers to predicting one of two classes; multi-class classification, which involves predicting one of more than two classes; and multi-label classification, a generalization of multi-class classification in which each example may be assigned more than one class label [105].

Several popular classification techniques exist to solve classification problems, such as k-nearest neighbors [5], support vector machines [55], naive Bayes [49], adaptive boosting [32], extreme gradient boosting [85], logistic regression [66], decision trees ID3 [92] and C4.5 [93], and random forests [13]. Tree-based classification techniques, e.g., random forests built from multiple decision trees, often perform better than others on real-world problems due to their capability of producing logic rules [103, 115]. Figure 4 shows an example of a random forest structure built from multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [109], and IntrudTree [106] can be used for building effective classification or prediction models in the relevant tasks within the domain of data science and analytics.

Fig. 4: An example of a random forest structure considering multiple decision trees.
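
To make the random forest idea concrete, the sketch below trains and cross-validates a random forest classifier with scikit-learn; the Iris dataset and the parameter choices are assumptions made for illustration only.

```python
# A minimal sketch of random forest classification (cf. Fig. 4);
# the Iris dataset stands in for a real classification problem.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# An ensemble of decision trees; each tree votes on the predicted class.
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy, one of the evaluation options
# mentioned in the modeling section.
scores = cross_val_score(clf, X, y, cv=5)
print("mean accuracy:", scores.mean())
```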

Cluster Analysis

Clustering is a form of unsupervised machine learning and is well known in many data science application areas for statistical data analysis [38]. Usually, clustering techniques search for structures inside a dataset and, if a classification is not previously available, identify homogeneous groups of cases. This means that data points are similar to each other within a cluster and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort data points into groups (or clusters) that are homogeneous internally and heterogeneous externally [105]. Clustering is often used to gain insight into how data is distributed in a given dataset or as a preprocessing phase for other algorithms. Data clustering, for example, assists with understanding customer shopping behavior, sales campaigns, and customer retention for retail businesses, as well as anomaly detection.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [98, 138, 141]. In our earlier paper, Sarker et al. [105], we summarized these based on several perspectives, such as partitioning methods, density-based methods, hierarchical methods, model-based methods, etc. In the literature, the popular K-means [75], K-medoids [84], CLARA [54], etc. are known as partitioning methods; DBSCAN [30], OPTICS [8], etc. are known as density-based methods; and single linkage [122], complete linkage [123], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [134], CLIQUE [2], etc.; model-based clustering methods, such as neural network learning [141], GMM [94], SOM [18, 104], etc.; and constraint-based methods, such as COP K-means [131], CMWK-Means [25], etc., are used in the area. Recently, Sarker et al. [111] proposed a hierarchical clustering method, BOTS [111], based on a bottom-up agglomerative technique for capturing users' similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas in data science.
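
The following is a minimal sketch contrasting a partitioning method (K-means) with a bottom-up agglomerative hierarchical method in scikit-learn; the synthetic blob data is an assumption standing in for a real dataset.

```python
# A minimal sketch of partitioning (K-means) vs. bottom-up agglomerative
# clustering; the synthetic blob data is assumed for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Partitioning method: assigns each point to the nearest of k centroids.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

# Hierarchical method: repeatedly merges the two closest clusters.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print("K-means cluster sizes:      ", [int((kmeans_labels == i).sum()) for i in range(3)])
print("Agglomerative cluster sizes:", [int((agglo_labels == i).sum()) for i in range(3)])
```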

Association Rule Analysis

Association rule learning is a rule-based machine learning approach; it is an unsupervised learning method typically used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets for discovering interesting relationships or patterns. The association learning technique's main strength is its comprehensiveness, as it produces all associations that meet user-specified constraints, including minimum support and confidence values [138].

Association rules allow a data scientist to identify trends, associations, and co-occurrences inside large data collections. In a supermarket, for example, associations infer knowledge about the buying behavior of consumers for different items, which helps to adjust the marketing and sales plan. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in the data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent pattern based [4, 47, 73], logic-based [31], tree-based [39], fuzzy-rules [126], and belief rule [148]. Rule learning techniques such as AIS [3], Apriori [4], Apriori-TID and Apriori-Hybrid [4], FP-Tree [39], Eclat [144], and RARM [24] exist to solve the relevant business problems. Among these, Apriori [4] is the most commonly used algorithm for discovering association rules from a given dataset [145]. The recent association rule-learning technique ABC-RuleMiner, proposed in our earlier paper by Sarker et al. [113], can give significant results in terms of generating non-redundant rules that can be used for smart decision-making according to human preferences within the area of data science applications.
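
To make the support/confidence machinery concrete, the pure-Python sketch below enumerates itemsets by brute force and derives rules; it illustrates the quantities that algorithms such as Apriori compute far more efficiently, and the toy transactions are assumed for illustration.

```python
# A brute-force sketch of the support/confidence computation underlying
# association rule mining; real systems use Apriori, FP-Tree, etc.
from itertools import combinations

# Toy market-basket transactions, assumed for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.7
n = len(transactions)
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Frequent itemsets: all candidates meeting the minimum support constraint.
frequent = [frozenset(c)
            for size in (1, 2, 3)
            for c in combinations(items, size)
            if support(frozenset(c)) >= MIN_SUPPORT]

# Rules A -> B with confidence = support(A union B) / support(A).
for itemset in (f for f in frequent if len(f) > 1):
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, size)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= MIN_CONFIDENCE:
                print(set(antecedent), "->", set(consequent),
                      f"support={support(itemset):.2f} confidence={conf:.2f}")
```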

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [111]. Depending on the frequency, a time series can be of different types, such as annual (e.g., annual budget), quarterly (e.g., expenditure), monthly (e.g., air traffic), weekly (e.g., sales quantity), daily (e.g., weather), hourly (e.g., stock price), minute-wise (e.g., inbound calls in a call center), and even second-wise (e.g., web traffic), in the relevant domains.

A mathematical method for dealing with such time-series data, or the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [130] learns the behavioral trends or patterns of past data. Moving average (MA) [40] is another simple and common form of smoothing used in time-series analysis and forecasting, which uses past forecast errors in a regression-like model to elaborate an averaged trend across the data. The autoregressive moving average (ARMA) model [12, 120] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [12, 120]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model also covers the non-stationary case. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average model with exogenous inputs (ARMAX) are also used as time-series models [120].

In addition to stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [111] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users [111]. The authors in [118] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for forecasting time series, outperforming traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any applied science and engineering domain with temporal measurements. Thus, it covers a wide range of application areas in data science.

Fig. 5: An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics.
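
To illustrate the ARIMA workflow described above, the following sketch fits an ARIMA(1, 1, 1) model with the statsmodels library (assumed installed); the synthetic non-stationary series and the order parameters are illustrative assumptions.

```python
# A hedged sketch of ARIMA forecasting with statsmodels; the synthetic
# trend-plus-noise series stands in for real time-series data.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(0.5, 1.0, size=120))  # non-stationary series

# ARIMA(p, d, q): d=1 differences the series once to handle the
# non-stationarity discussed in the text; p and q are illustrative choices.
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

# Forecast the next 10 time steps.
print(fitted.forecast(steps=10))
```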

Opinion Mining and Sentiment Analysis

Sentiment analysis or opinion mining is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, and topics, and their attributes [71]. There are three basic kinds of sentiment: positive, negative, and neutral, along with more specific feelings such as angry, happy, and sad, or interested versus not interested, etc. More refined sentiments for evaluating the feelings of individuals in various situations can also be defined according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain the opinions of the public or its customers about its products and services, in order to refine business policy and make better business decisions. It can thus benefit a business to understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think of a service or product before they use or purchase it. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [45].

Several popular techniques, such as lexicon-based methods (including dictionary-based and corpus-based methods), machine learning (including supervised and unsupervised learning), deep learning, and hybrid methods, are used in sentiment analysis-related tasks [70]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, sentiment analysis incorporates the use of statistics, natural language processing (NLP), machine learning, and deep learning methods. It is widely applied to reviews and survey data, web and social media content, and healthcare content, in areas ranging from marketing and customer support to clinical practice. Thus, sentiment analysis has a big influence in many data science applications where public sentiment is involved in various real-world issues.
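
As a minimal supervised sketch of the machine learning approach to sentiment analysis, the following trains a TF-IDF plus logistic regression pipeline on a tiny assumed corpus; a realistic system would of course use a much larger labeled dataset.

```python
# A minimal sketch of supervised sentiment classification; the tiny
# labeled corpus is assumed purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I love this product, excellent quality",
    "great service and fast delivery",
    "terrible experience, would not recommend",
    "the item broke after one day, awful",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF turns raw text into weighted term vectors for the classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["awful quality, terrible product"]))  # expected: negative
```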

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more areas [112]. Behavioral analysis aims to understand how and why consumers or users behave as they do, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness, uses the large quantities of raw user event information gathered during sessions in which people use apps, games, or websites. In our earlier papers, Sarker et al. [101, 111, 113], we discussed how to extract users' phone usage behavioral patterns from real-life phone log data for various purposes.

In real-world scenarios, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications to find opportunities for optimization toward particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given dataset (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques, such as behavioral data clustering [111], behavioral decision tree classification [109], and behavioral association rules [113], can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [108], which takes into account recent behavioral patterns, can be effective when analyzing behavioral data, as behavior may not be static and changes over time in the real world.

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or observations that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, and exceptions [63, 114]. Anomaly detection techniques may flag new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraudulent or irregular transactions in finance is an example of anomaly detection.

It is often used in preprocessing tasks to remove anomalous or inconsistent entries from real-world data collected from various sources, including user logs, devices, networks, and servers. For anomaly detection, several machine learning techniques can be used, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [105]. The exclusion of anomalous data from a dataset can also result in a statistically significant improvement in accuracy during supervised learning [101]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior or irregularities, the sparse occurrence of abnormal events, environmental variations, etc. can be challenging in the process of anomaly detection. Anomaly detection is applicable in a variety of domains, such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more. Anomaly detection can thus be considered a significant task for building effective systems with higher accuracy within the area of data science.
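
The following is a minimal sketch of one of the techniques named above, an isolation forest in scikit-learn; the synthetic data with injected outliers is assumed for illustration.

```python
# A minimal sketch of anomaly detection with an isolation forest;
# the injected outliers in the synthetic data are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
normal = rng.normal(0, 1, size=(200, 2))   # regular behavior
outliers = rng.uniform(6, 8, size=(5, 2))  # deviating points
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies in the data.
detector = IsolationForest(contamination=0.03, random_state=2).fit(X)
labels = detector.predict(X)               # -1 = anomaly, 1 = normal

print("detected anomalies:", int((labels == -1).sum()))
```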

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [23]. It is usually used to organize variables into a small number of clusters based on their common variance, using mathematical or statistical procedures. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, calculate the degree to which each variable is associated with the factors, and learn more about the nature of the factors by examining which factors contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [143].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex patterns by exploring the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [143]. Factor analysis is one of the unsupervised machine learning algorithms used for dimensionality reduction. The most common methods for factor analysis are principal components analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [48]. Correlation analysis methods such as Pearson correlation and canonical correlation may also be useful in the field, as they can quantify the statistical relationship, or association, between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and can thus be considered another significant analytical method within the area of data science.
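
As a brief sketch of factor analysis and PCA for dimensionality reduction, the following uses scikit-learn on the Iris measurements, which serve as assumed stand-in data:

```python
# A minimal sketch of unsupervised dimensionality reduction with factor
# analysis and PCA; the Iris measurements are assumed stand-in data.
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis, PCA

X, _ = load_iris(return_X_y=True)

# Describe the four correlated measurements with two latent factors.
fa = FactorAnalysis(n_components=2, random_state=0)
X_fa = fa.fit_transform(X)

# PCA, the most common related method, for comparison.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("factor loadings shape: ", fa.components_.shape)  # (2, 4)
print("PCA explained variance:", pca.explained_variance_ratio_)
```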

Log Analysis

Logs are commonly used in system management, as they are often the only data available that record detailed system runtime activities or behaviors in production [44]. Log analysis can thus be considered the method of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [88, 110], SMS logs [28], mobile app usage logs [124, 149], notification logs [77], game logs [82], context logs [16, 149], web logs [37], smartphone life logs [95], etc. are some examples of log data for smartphone devices. The main characteristic of these log data is that they contain users' actual behavioral activities with their devices. Other similar log data include search logs [50, 133], application logs [26], server logs [33], network logs [57], event logs [83], and network and security logs [142].

Several techniques, such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, machine learning modeling, etc. [105], can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by supporting the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors, and Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take into account machine learning modeling can play a significant role in extracting insightful patterns from these log data, which can be used for building automated and smart applications; log analysis can thus be considered a key working area in data science.
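
As a minimal illustration of log parsing and pattern extraction, the following pure-Python sketch counts HTTP status codes; the Apache-style access log format and the sample lines are assumptions made for demonstration.

```python
# A minimal sketch of server log analysis: parse entries with a regular
# expression and count status codes. Log format and lines are assumed.
import re
from collections import Counter

log_lines = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2023:13:55:40] "GET /missing.html HTTP/1.1" 404 512',
    '10.0.0.1 - - [10/Oct/2023:13:56:02] "POST /login HTTP/1.1" 200 134',
]

pattern = re.compile(r'^(\S+) .* "(\S+) (\S+) [^"]*" (\d{3}) (\d+)$')

status_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:
        # ip, method, and path are parsed but unused in this tiny sketch.
        ip, method, path, status, size = match.groups()
        status_counts[status] += 1

print(status_counts)  # e.g., Counter({'200': 2, '404': 1})
```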

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [38]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [114, 140].

The most common deep learning algorithms are the multi-layer perceptron (MLP) [85], the convolutional neural network (CNN or ConvNet) [67], and the long short-term memory recurrent neural network (LSTM-RNN) [34]. Figure 6 shows the structure of an artificial neural network model with multiple processing layers. The backpropagation technique [38] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [67] improve on the design of traditional artificial neural networks (ANNs) by including convolutional layers, pooling layers, and fully connected layers. They are commonly used in a variety of fields, including natural language processing, speech recognition, and image processing, and on other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [60], Xception [21], Inception [125], Visual Geometry Group (VGG) [42], ResNet [43], and other advanced deep learning models based on CNNs are also used in the field.

Fig. 6: A structure of an artificial neural network model with multiple processing layers.

In addition to CNNs, the recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well suited for analyzing and learning from sequential data, for tasks such as classifying, sorting, and predicting based on time-series data. Therefore, when the data is in a sequential format, such as time series or sentences, LSTM can be used, and it is widely applied in time-series analysis, natural language processing, speech recognition, and so on.

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [104] exist in the field for various purposes. The self-organizing map (SOM) [58], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, thereby reducing dimensionality. Another learning technique commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [10]. Restricted Boltzmann machines (RBMs) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling [46]. A deep belief network (DBN) is usually made up of a backpropagation neural network (BPNN) and unsupervised networks like restricted Boltzmann machines (RBMs) or autoencoders [136]. A generative adversarial network (GAN) [35] is a deep learning network that can produce data with characteristics similar to the input data. Transfer learning, which is usually the re-use of a pre-trained model on a new problem, is now widespread because it can train deep neural networks with a small amount of data [137]. These deep learning methods can perform well, particularly when learning from large-scale datasets [105, 140]. In our previous article, Sarker et al. [104], we summarized the various artificial neural network (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.
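
To make the multi-layer structure concrete, the following is a hedged sketch of a small multi-layer perceptron in Keras (assuming TensorFlow is installed; this is not a library prescribed by the paper). The synthetic data, layer sizes, and training settings are illustrative assumptions.

```python
# A minimal sketch of an MLP with input, hidden, and output layers,
# trained by backpropagation as described in the text.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))                 # assumed feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype("float32")  # assumed binary target

# Input layer, two hidden layers, and an output layer.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Backpropagation adjusts the weight values internally during training.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print("training accuracy:", model.evaluate(X, y, verbose=0)[1])
```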

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus "Data Science", including advanced analytics with machine learning modeling, can be applied in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policy, and virtually every industry where data are generated. In the following, we discuss the ten most popular application areas of data science and analytics.

  • Business or financial data science: In general, business data science is the study of business or e-commerce data to obtain insights that typically lead to smarter decision-making and higher-quality actions [ 90 ]. Data scientists develop algorithms or data-driven models that predict customer behavior and identify patterns and trends in historical business data, helping companies reduce costs, improve service delivery, and generate recommendations for better decisions (a minimal churn-prediction sketch follows this list). Ultimately, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, in which advanced analytics methods and machine learning modeling based on the collected data are key. Many online retailers, such as Amazon [ 76 ], use predictive modeling based on machine learning techniques [ 105 ] to improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing. In finance, institutions use historical data to make high-stakes business decisions, mostly for risk management, fraud prevention, credit allocation, customer analytics, personalized services, and algorithmic trading. Overall, data science methodologies can play a key role in the next-generation business and finance industry, particularly in business automation, intelligence, and smart decision-making and systems.
  • Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through several industrial revolutions [ 14 ]. The latest, the fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Industrial data science, the study of industrial data to obtain insights that can optimize industrial applications, can play a vital role in this revolution. Manufacturing industries generate large amounts of data from sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale device data, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. These data need to be processed, analyzed, and secured to improve a system's efficiency, safety, and scalability. Data science modeling can thus be used to maximize production, reduce costs, and raise profits in manufacturing industries.
  • Medical or health data science: Healthcare is one of the most notable fields in which data science is driving major improvements. Health data science involves extracting actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be drawn from several sources, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys. In practice, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, avoid preventable diseases, and generally improve quality of life [ 81 , 119 ]. Across the global population, the average human lifespan is growing, presenting new challenges to today's methods of care delivery. Health data science modeling can therefore play a role in analyzing current and historical data to predict trends, improve services, and better monitor the spread of diseases, eventually leading to new approaches to improving patient care, clinical expertise, diagnosis, and management.
  • IoT data science: The Internet of Things (IoT) [ 9 ] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses prior data to identify patterns and generate models that help predict future behavior and events [ 112 ]. One of the IoT's main fields of application is the smart city, which uses technology to improve city services and citizens' living experiences. For example, given the relevant data, data science methods can be used for traffic prediction in smart cities or to estimate citizens' total energy usage over a particular period. Deep learning-based models in data science can be built on large-scale IoT datasets [ 7 , 104 ]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, and industry, among many others.
  • Cybersecurity data science: Cybersecurity, the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [ 114 , 121 ]. Data science techniques, particularly machine learning, have become crucial cybersecurity technology: models continually learn to identify trends by analyzing data, better detecting malware in encrypted traffic, finding insider threats, predicting where risky online "neighborhoods" are, keeping people safe while browsing, and protecting information in the cloud by uncovering suspicious user activity [ 114 ]. For instance, machine learning and deep learning-based security modeling can effectively detect various types of cyberattacks and anomalies [ 103 , 106 ], and association rule learning can play a significant role in building rule-based systems that generate security policy rules [ 102 ]. Deep learning-based security models can perform better when utilizing large-scale security datasets [ 140 ]. Data science modeling can thus enable cybersecurity professionals to be more proactive in preventing threats and to react in real time to active attacks by extracting actionable insights from security datasets.
  • Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on Internet-connected devices such as PCs, tablets, or smartphones [ 112 ]. Websites, mobile applications, marketing automation systems, call centers, help desks, and billing systems are all common sources of behavioral data. Unlike static data, behavioral data changes continuously over time [ 108 ]. Advanced analytics of these data, including machine learning modeling, can help in several areas: predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences for future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups for a more focused understanding of their behavior in cohort analysis; and detecting compromised credentials and insider threats by locating anomalous behavior. Overall, behavioral data science modeling enables making the right offers to the right consumers at the right time on common platforms such as e-commerce sites, online games, web and mobile applications, and the IoT. In a social context, analyzing human behavioral data with advanced analytics methods, and using the insights extracted from social data for data-driven intelligent social services, can be considered social data science.
  • Mobile data science: Today's smart mobile phones are considered "next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity" [ 146 ]. In an earlier paper [ 112 ], we showed that users' interest in "Mobile Phones" has grown faster in recent years than interest in other platforms such as "Desktop Computer", "Laptop Computer", or "Tablet Computer". People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, LinkedIn, and Twitter, and various IoT services such as smart city, health, and transportation services. Intelligent apps are built on insights extracted from relevant datasets according to the apps' characteristics, such as being action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-aware, and cross-platform [ 112 ]. As a result, mobile data science, which involves gathering large amounts of mobile data from various sources and analyzing it with machine learning techniques to discover useful insights and data-driven trends, can play an important role in developing intelligent smartphone applications.
  • Multimedia data science: Over the last few years, the rapid and widespread use of multimedia data, such as images, audio, video, and text, together with the ease of access to and availability of multimedia sources, has produced a big data revolution in multimedia management systems. Multimedia sharing websites such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are currently considered valuable sources of multimedia big data [ 89 ]. People, particularly the younger generations, spend a lot of time on the Internet and social networks connecting with others, exchanging information, and creating multimedia data, thanks to new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with effectively and efficiently manipulating, handling, mining, interpreting, and visualizing these various forms of data to solve real-world problems. Text analysis, image and video processing, computer vision, audio and speech processing, and database management are among the available solutions for a range of applications, including healthcare, education, entertainment, and mobile devices.
  • Smart cities or urban data science: Today, more than half of the world's population lives in urban areas or cities [ 80 ], which are considered drivers and hubs of economic growth, wealth creation, well-being, and social activity [ 96 , 116 ]. In addition to cities, "urban area" can refer to surrounding areas such as towns, conurbations, or suburbs. A large amount of data documenting the daily events, perceptions, thoughts, and emotions of citizens is therefore recorded, loosely categorized into personal data (e.g., household, education, employment, health, immigration, crime), proprietary data (e.g., banking, retail, online platform data), government data (e.g., citywide crime statistics or data from government institutions), open and public data (e.g., data.gov, Ordnance Survey), and organic and crowdsourced data (e.g., user-generated web data, social media, Wikipedia) [ 29 ]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective by extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [ 105 ] can support the efficient management of urban areas, including real-time management (e.g., traffic flow management), evidence-based planning decisions for the longer-term strategic role of forecasting in urban planning (e.g., crime prevention, public safety, and security), and framing the future (e.g., political decision-making) [ 29 ]. Overall, it can contribute to government and public planning, as well as to relevant sectors such as retail, financial services, mobility, health, policing, and utilities within a data-rich urban environment through data-driven smart decision-making and policies, leading to smart cities and improving the quality of human life.
  • Smart villages or rural data science: Rural areas, or the countryside, are the opposite of urban areas and include villages, hamlets, and agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions, including protecting public safety, providing critical health services, supporting agriculture, and fostering economic development, from a data-driven perspective, by extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data, including machine learning modeling [ 105 ], can give rural communities new opportunities to build the insights and capacity to meet current needs and prepare for the future. For instance, machine learning modeling [ 105 ] can help farmers make better decisions and adopt sustainable agriculture by utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT) and mobile technologies and devices [ 1 , 51 , 52 ]. Rural data science can thus play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, and other services, leading to smarter villages.
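As referenced in the business or financial data science item above, the following is a minimal, hypothetical churn-prediction sketch with scikit-learn; the CSV file and feature names are invented for illustration and do not come from the text.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("customers.csv")          # hypothetical historical business data
X = df[["tenure_months", "monthly_spend", "support_tickets"]]
y = df["churned"]                          # 1 if the customer left, else 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Rank customers by predicted churn risk, e.g., for a retention campaign
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```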

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector of real-world life wherever the relevant data is available to analyze. Gathering the right data, and extracting useful knowledge and actionable insights from it for smart decision-making, is the key to data science modeling in any application domain. Based on our discussion of the ten real-world application domains above, taking into account data-driven smart computing and decision-making, the prospects of data science and the role of data scientists in the future world are substantial. Data scientists typically analyze information from multiple sources to better understand the data and the business problem, and develop machine learning-based analytical models, algorithms, data-driven tools, and solutions focused on advanced analytics, making today's computing processes smarter, more automated, and more intelligent.

Challenges and Research Directions

Our study of data science and analytics, particularly data science modeling in " Understanding data science modeling ", advanced analytics methods and smart computing in " Advanced analytics methods and smart computing ", and real-world application areas in " Real-world application domains ", opens several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions for building data-driven products.

  • Understanding the real-world business problem and the nature of the associated data (e.g., its form, type, size, and labels) is the first challenge in data science modeling, discussed briefly in " Understanding data science modeling ". This means identifying, specifying, representing, and quantifying the domain-specific business problem and data according to the requirements. For a data-driven, effective business solution, there must be a well-defined workflow before the actual data analysis work begins. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic; collecting different forms of real-world data, structured or unstructured, related to a specific business issue, with legal access that varies from application to application, is challenging. Moreover, data annotation, typically the process of categorizing, tagging, or labeling raw data for the purpose of building data-driven models, is another challenging issue, so a primary task is a more in-depth study of data collection and dynamic annotation methods. Understanding the business problem, and integrating and managing the raw data gathered for efficient analysis, may therefore be one of the most challenging aspects of working in the field of data science and analytics.
  • The next challenge is extracting relevant and accurate information from the collected data. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless entries [ 101 ]. The advanced analytics methods, including machine and deep learning modeling, discussed in " Advanced analytics methods and smart computing ", are highly sensitive to the quality and availability of the data. It is therefore important to understand the real-world business scenario and its data, determine whether, how, and why they are insufficient, missing, or problematic, and then extend or redevelop existing methods, such as large-scale hypothesis testing and learning under inconsistency and uncertainty, to address the complexities in the data and the business problem. Developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, is thus another challenging task (see the pre-processing sketch after this list).
  • Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making on a particular business problem is a central issue in data science. The emphasis of advanced analytics is on anticipating the use of data to detect patterns and determine what is likely to occur in the future; basic analytics offers a general description of the data, while advanced analytics is a step forward, offering a deeper understanding and more granular analysis. Understanding advanced analytics methods, especially machine and deep learning-based modeling, is therefore key. The traditional learning techniques mentioned in " Advanced analytics methods and smart computing " may not be directly applicable in many cases. For instance, in a rule-based system, the traditional association rule learning technique [ 4 ] may produce redundant rules that make the decision-making process complex and ineffective [ 113 ]. A scientific understanding of the learning algorithms, their mathematical properties, and how robust or fragile they are to input data is therefore needed. Consequently, developing a deeper understanding of the strengths and drawbacks of existing machine and deep learning methods [ 38 , 105 ] for a particular business problem, so as to improve or optimize the learning algorithms according to the data characteristics or to propose new algorithms and techniques with higher accuracy, becomes a significant challenge for the next generation of data scientists.
  • Traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, recent trends are more interesting and useful for modeling and predicting the future than older ones, for example, in smartphone user behavior modeling, IoT services, stock market forecasting, health and transport services, and job market analysis, where time series and actual human interests or preferences evolve over time. Rather than relying on traditional data analysis alone, the concept of RecencyMiner, i.e., insight or knowledge extracted from recent patterns, proposed in our earlier paper, Sarker et al. [ 108 ], might be effective. Proposing new techniques that take recent data patterns into account, and consequently building recency-based data-driven models for solving real-world problems, is therefore another significant challenge in the area.
  • The most crucial task for a data-driven smart system is to create a framework that supports the data science modeling discussed in " Understanding data science modeling ". Advanced analytical methods based on machine learning or deep learning techniques can be incorporated into such a system to make the framework capable of resolving the issues at hand. In addition, contextual information such as temporal, spatial, social, and environmental context [ 100 ] can be used to build an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. A well-designed data-driven framework, together with experimental evaluation, is thus a very important direction for effectively solving a business problem in a particular domain, as well as a big challenge for data scientists.
  • In several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, decisions made by models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [ 104 ]. If we can explain a result in a meaningful way, the model can be better trusted by the end user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could therefore be another challenging issue in the area.
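The pre-processing sketch referenced in the data-quality item above might look as follows with pandas; the file and column handling are illustrative assumptions rather than a prescribed pipeline.

```python
import pandas as pd

df = pd.read_csv("raw_business_data.csv")      # hypothetical raw data

# Drop exact duplicates and fill missing numeric values with column medians
df = df.drop_duplicates()
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Flag values outside 1.5 * IQR as candidate outliers for later inspection
q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1
outliers = (df[num_cols] < q1 - 1.5 * iqr) | (df[num_cols] > q3 + 1.5 * iqr)
```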

In the above, we have summarized and discussed several challenges and potential research opportunities and directions within the scope of our study in the area of data science and advanced analytics. Data scientists in academia and industry, and researchers in related areas, have the opportunity to contribute to each issue identified above and to build effective data-driven models and systems that support smart decisions in the corresponding business domains.

In this paper, we have presented a comprehensive view of data science, including the various types of advanced analytical methods that can be applied to enhance the intelligence and capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from related terms used in the area, to position this paper. We have presented a thorough study of data science modeling and the various processing modules needed to extract actionable insights from data for a particular business problem and the eventual data product. Accordingly, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process, and summarized the various types of advanced analytical methods, outcomes, and machine learning modeling needed to solve the associated business problems. This study's key contribution is thus the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, and urban and rural data science, taking into account data-driven smart computing and decision-making.

Finally, within the scope of our study, we have outlined and discussed the challenges we face, as well as possible research opportunities and future directions. The challenges identified provide promising research opportunities that can be explored with effective solutions to improve data-driven models and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods points in a positive direction and can serve as a reference guide for future research and applications in the field of data science and its real-world applications by both academia and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Defining Research Data

One definition of research data is: "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings" (OMB Circular A-110).

Research data covers a broad range of types of information (see examples below), and digital data can be structured and stored in a variety of file formats.

Note that properly managing data (and records) does not necessarily equate to sharing or publishing that data.

Examples of Research Data

Some examples of research data:

  • Documents (text, Word), spreadsheets
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audiotapes, videotapes
  • Photographs, films
  • Protein or genetic sequences
  • Test responses
  • Slides, artifacts, specimens, samples
  • Collection of digital objects acquired and generated during the process of research
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts
  • Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • Methodologies and workflows
  • Standard operating procedures and protocols

Exclusions from Sharing

In addition to the other records to manage (below), some kinds of data may not be sharable due to the nature of the records themselves or to ethical and privacy concerns. As defined by the OMB, the following are excluded from research data:

  • preliminary analyses,
  • drafts of scientific papers,
  • plans for future research,
  • peer reviews, or
  • communications with colleagues

Research data also do not include:

  • Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and
  • Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

Some types of data, particularly software, may require a special license to share. In those cases, contact the Office of Technology Transfer to review considerations for software generated in your research.

Other Records to Manage

Although they might not be addressed in an NSF data management plan, the following research records may also be important to manage during and beyond the life of a project.

  • Correspondence (electronic mail and paper-based correspondence)
  • Project files
  • Grant applications
  • Ethics applications
  • Technical reports
  • Research reports
  • Signed consent forms

Adapted from Defining Research Data by the University of Oregon Libraries.


Data Analysis: Recently Published Documents


Introduce a Survival Model with Spatial Skew Gaussian Random Effects and its Application in Covid-19 Data Analysis

Futuristic prediction of missing value imputation methods using extended ANN

Missing data is a universal problem in most research fields, introducing uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simply a lack of study. The nourishment area is no exception to the problem of missing data. Most frequently, the problem is addressed by computing means or medians from the existing datasets, an approach that needs improvement. The paper proposes a hybrid scheme of MICE and ANN, known as extended ANN, to search for and analyze missing values and perform imputations in a given dataset. The proposed mechanism efficiently analyzes the blank entries and fills them by examining their neighboring records, improving the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against various recent algorithms and mechanisms to assess the efficiency and accuracy of the results.
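The paper's exact extended-ANN implementation is not reproduced here; as a hedged illustration of the MICE-style step it builds on, scikit-learn's IterativeImputer fills each blank by modeling a column from the remaining ones.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Each feature with missing entries is regressed on the other features,
# iterating until the imputed values stabilize (MICE-style round robin).
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```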

Applications of multivariate data analysis in shelf life studies of edible vegetal oils – A review of the past few years

Hypothesis formalization: empirical findings, software limitations, and design implications.

Data analysis requires translating higher-level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization. In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation, and discuss implications for future tools.

The Complexity and Expressive Power of Limit Datalog

Motivated by applications in declarative data analysis, in this article, we study Datalog_Z, an extension of Datalog with stratified negation and arithmetic functions over integers. This language is known to be undecidable, so we present the fragment of limit Datalog_Z programs, which is powerful enough to naturally capture many important data analysis tasks. In limit Datalog_Z, all intensional predicates with a numeric argument are limit predicates that keep maximal or minimal bounds on numeric values. We show that reasoning in limit Datalog_Z is decidable if a linearity condition restricting the use of multiplication is satisfied. In particular, limit-linear Datalog_Z is complete for Δ_2^EXP and captures Δ_2^P over ordered datasets in the sense of descriptive complexity. We also provide a comprehensive study of several fragments of limit-linear Datalog_Z. We show that semi-positive limit-linear programs (i.e., programs where negation is allowed only in front of extensional atoms) capture coNP over ordered datasets; furthermore, reasoning becomes coNEXP-complete in combined and coNP-complete in data complexity, where the lower bounds hold already for negation-free programs. In order to satisfy the requirements of data-intensive applications, we also propose an additional stability requirement, which causes the complexity of reasoning to drop to EXP in combined and to P in data complexity, thus obtaining the same bounds as for usual Datalog. Finally, we compare our formalisms with the languages underpinning existing Datalog-based approaches for data analysis and show that core fragments of these languages can be encoded as limit programs; this allows us to transfer decidability and complexity upper bounds from limit programs to other formalisms. Therefore, our article provides a unified logical framework for declarative data analysis which can be used as a basis for understanding the impact on expressive power and computational complexity of the key constructs available in existing languages.

An empirical study on cross-border e-commerce talent cultivation, based on skill gap theory and big data analysis

To resolve the dilemma between the increasing demand for cross-border e-commerce talent and students' incompatible skill levels, industry-university-research cooperation, an essential pillar of the interdisciplinary talent cultivation model adopted by colleges and universities, brings out synergy from the relevant parties and builds a bridge between knowledge and practice. Nevertheless, industry-university-research cooperation developed late in the cross-border e-commerce field and faces several problems, such as unstable collaboration relationships and vague training plans.

The Effects of Cross-border e-Commerce Platforms on Transnational Digital Entrepreneurship

This research examines the important concept of transnational digital entrepreneurship (TDE). The paper integrates the host- and home-country entrepreneurial ecosystems with the digital ecosystem into the framework of the transnational digital entrepreneurial ecosystem. The authors argue that cross-border e-commerce platforms provide critical foundations in the digital entrepreneurial ecosystem, and entrepreneurs who count on this ecosystem are defined as transnational digital entrepreneurs. Interview data were analyzed as case studies to develop an understanding of twelve Chinese immigrant entrepreneurs living in Australia and New Zealand. The results of the data analysis reveal that cross-border entrepreneurs do in fact rely on the framework of the transnational digital ecosystem. Cross-border e-commerce platforms not only play a bridging role between home- and host-country ecosystems but also provide the entrepreneurial capital promised by the digital ecosystem.

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources

A trajectory evaluator by sub-tracks for detecting VOT-based anomalous trajectory

With the popularization of visual object tracking (VOT), more and more trajectory data are being obtained and have begun to gain widespread attention in fields such as mobile robots and intelligent video surveillance. How to clean the anomalous trajectories hidden in massive data has become a research hotspot: anomalous trajectories must be detected and cleaned before the trajectory data can be used effectively. In this article, a Trajectory Evaluator by Sub-tracks (TES) for detecting VOT-based anomalous trajectories is proposed. A Feature of Anomalousness is defined and used as the eigenvector of a classifier to filter track-let anomalous trajectories and identity-switch anomalous trajectories; it includes a Feature of Anomalous Pose and a Feature of Anomalous Sub-tracks (FAS). In comparative experiments, TES achieves better results across different scenes than state-of-the-art methods, and FAS outperforms point flow, least-squares fitting, and Chebyshev polynomial fitting. This verifies that TES is more accurate and effective and is conducive to sub-track trajectory data analysis.
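TES itself is not reproduced here; as a hedged sketch of the least-squares polynomial-fitting baseline the abstract compares against, a sub-track whose points deviate strongly from a low-order fit can be flagged as anomalous. The function name and simulated tracks are hypothetical.

```python
import numpy as np

def anomaly_score(track, degree=2):
    """Mean absolute residual of a 1-D track around a polynomial fit."""
    t = np.arange(len(track))
    coeffs = np.polyfit(t, track, degree)        # least-squares fit
    residuals = track - np.polyval(coeffs, t)
    return float(np.mean(np.abs(residuals)))

smooth = np.linspace(0, 10, 50) ** 2 / 10        # a plausible smooth track
jumpy = smooth.copy()
jumpy[25] += 8.0                                 # simulated identity-switch jump
print(anomaly_score(smooth), anomaly_score(jumpy))  # the jumpy track scores higher
```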

Data Papers & Data Journals


The rise of the "data paper"

Datasets are increasingly being recognized as scholarly products in their own right and, as such, are now being submitted for standalone publication. In many cases, the greatest value of a dataset lies in sharing it, not necessarily in providing interpretation or analysis. One example is a data paper presenting a global database of the abundance, biomass, and nitrogen fixation rates of marine diazotrophs: this benchmark dataset, which will continue to evolve over time, is a valuable standalone research product with intrinsic value. Under traditional publication models, such a dataset would not be considered "publishable" because it does not present novel research or interpretation of results. Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited in a repository (which may have very minimal metadata requirements).

What is a data paper?

Data papers thoroughly describe datasets and do not usually include interpretation or discussion (one exception may be discussion of different methods used to collect the data). Some data papers are published in a distinct "Data Papers" section of a well-established journal (see this article in Ecology, for example). It is becoming more common, however, to see journals that focus exclusively on publishing datasets. The purpose of a data journal is to provide quick access to high-quality datasets that are of broad interest to the scientific community. Data journals are intended to facilitate reuse of datasets, which increases their original value and impact and speeds the pace of research by avoiding unintentional duplication of effort.

Are data papers peer-reviewed?

Data papers typically go through a peer review process in the same manner as articles, but because they are new to scientific practice, the quality and scope of the process varies across publishers. A good example of a peer-reviewed data journal is Earth System Science Data ( ESSD ). Its review guidelines are well described and are not all that different from the manuscript review guidelines we are already familiar with.

You might wonder: what is the difference between a 'data paper' and a 'regular article plus a dataset published in a public repository'? The answer isn't always clear. Some data papers require just as much preparation as, and are of equal quality to, 'typical' journal articles. Others are brief and present only enough metadata and descriptive content to make the dataset understandable and reusable. In most cases, however, the datasets or databases presented in data papers include much more description than datasets deposited in a repository, even when those datasets were deposited to support a manuscript. Common practices and standards are evolving in the realm of data papers and data journals, but for now, they are the Wild West of data sharing.

Where do the data from data papers live?

Data preservation is a corollary of data papers, not their main purpose. Most data journals do not archive data in-house. Instead, they generally require that authors submit the dataset to a repository. These repositories archive the data, provide persistent access, and assign the dataset a unique identifier (a DOI). Repositories do not always require that the dataset be linked to a publication (whether a data paper or a 'typical' paper; Dryad does require one), but if you are going to the trouble of submitting a dataset to a repository, consider exploring the option of publishing a data paper to support it.

How can I find data journals?

The article by Walters (2020) has a list of data journals in its appendix and differentiates between "pure" data journals and journals that publish data reports but are devoted mainly to other types of contributions. It also updates previous lists of data journals (Candela et al., 2015).

Walters, William H. 2020. "Data Journals: Incentivizing Data Access and Documentation Within the Scholarly Communication System". Insights 33(1): 18. DOI: http://doi.org/10.1629/uksg.510

Candela, L., Castelli, D., Manghi, P., & Tani, A. (2015). Data journals: A survey. Journal of the Association for Information Science and Technology, 66(9), 1747–1762. https://doi.org/10.1002/asi.23358

This blog post by Katherine Akers, from 2014, also has a long list of existing data journals.




  • Open access
  • Published: 17 October 2023

The impact of founder personalities on startup success

  • Paul X. McCarthy 1 , 2 ,
  • Xian Gong 3 ,
  • Fabian Braesemann 4 , 5 ,
  • Fabian Stephany 4 , 5 ,
  • Marian-Andrei Rizoiu 3 &
  • Margaret L. Kern 6  

Scientific Reports volume 13, Article number: 17200 (2023)

59k Accesses

2 Citations

305 Altmetric

Metrics details

  • Human behaviour
  • Information technology

An Author Correction to this article was published on 07 May 2024

This article has been updated

Startup companies solve many of today’s most challenging problems, such as the decarbonisation of the economy or the development of novel life-saving vaccines. Startups are a vital source of innovation, yet the most innovative are also the least likely to survive. The probability of success of startups has been shown to relate to several firm-level factors such as industry, location and the economy of the day. Still, attention has increasingly turned to internal factors relating to the firm’s founding team, including their previous experiences and failures, their centrality in a global network of other founders and investors, and the team’s size. The effects of founders’ personalities on the success of new ventures are, however, largely unknown. Here, we show that founder personality traits are a significant feature of a firm’s ultimate success. We draw upon detailed data about the success of a large-scale global sample of startups (n = 21,187). We find that the Big Five personality traits of startup founders, across 30 dimensions, significantly differ from those of the population at large. Key personality facets that distinguish successful entrepreneurs include a preference for variety, novelty and starting new things (openness to adventure), a liking for being the centre of attention (lower levels of modesty) and exuberance (higher activity levels). We do not find one ’Founder-type’ personality; instead, six different personality types appear. Our results also demonstrate the benefits of larger, personality-diverse teams in startups, which show an increased likelihood of success. The findings emphasise the role of personality-type diversity as a novel dimension of team diversity that influences performance and success.

Similar content being viewed by others

Predicting success in the worldwide start-up network

The personality traits of self-made and inherited millionaires

The nexus of top executives’ attributes, firm strategies, and outcomes: Large firms versus SMEs

Introduction

The success of startups is vital to economic growth and renewal, with a small number of young, high-growth firms creating a disproportionately large share of all new jobs 1 , 2 . Startups create jobs and drive economic growth, and they are also an essential vehicle for solving some of society’s most pressing challenges.

As a poignant example, six centuries ago, the German city of Mainz was abuzz as the birthplace of the world’s first moveable-type press created by Johannes Gutenberg. However, in the early part of this century, it faced several economic challenges, including rising unemployment and a significant and growing municipal debt. Then in 2008, two Turkish immigrants formed the company BioNTech in Mainz with another university research colleague. Together they pioneered new mRNA-based technologies. In 2020, BioNTech partnered with US pharmaceutical giant Pfizer to create one of only a handful of vaccines worldwide for Covid-19, saving an estimated six million lives 3 . The economic benefit to Europe and, in particular, the German city where the vaccine was developed has been significant, with windfall tax receipts to the government clearing Mainz’s €1.3bn debt and enabling tax rates to be reduced, attracting other businesses to the region as well as inspiring a whole new generation of startups 4 .

While stories such as the success of BioNTech are often retold and remembered, their success is the exception rather than the rule. The overwhelming majority of startups ultimately fail. One study of 775 startups in Canada that successfully attracted external investment found only 35% were still operating seven years later 5 .

But what determines the success of these ‘lucky few’? When assessing the success factors of startups, especially in the early-stage unproven phase, venture capitalists and other investors offer valuable insights. Three different schools of thought characterise their perspectives. First, supply-side or product investors: those who prioritise investing in firms they consider to have novel and superior products and services, investing in companies with intellectual property such as patents and trademarks. Second, demand-side or market-based investors: those who prioritise investing in areas of highest market interest, such as hot areas of technology like quantum computing, or recurrent or emerging large-scale social and economic challenges such as the decarbonisation of the economy. Third, talent investors: those who prioritise the founding team above the startup’s initial products or the industry or problem it is looking to address.

Investors who adopt the third perspective and prioritise talent often recognise that a good team can overcome many challenges in the lead-up to product-market fit. And while the initial products of a startup may or may not work, a successful and well-functioning team has the potential to pivot to new markets and new products, even if the initial ones prove untenable. Not surprisingly, an industry ‘autopsy’ into 101 tech startup failures found 23% were due to not having the right team, the third most common cause of failure, ahead of running out of cash or not having a product that meets the market need 6 .

Accordingly, early entrepreneurship research was focused on the personality of founders, but the focus shifted away in the mid-1980s onwards towards more environmental factors such as venture capital financing 7 , 8 , 9 , networks 10 , location 11 and due to a range of issues and challenges identified with the early entrepreneurship personality research 12 , 13 . At the turn of the 21st century, some scholars began exploring ways to combine context and personality and reconcile entrepreneurs’ individual traits with features of their environment. In her influential work ’The Sociology of Entrepreneurship’, Patricia H. Thornton 14 discusses two perspectives on entrepreneurship: the supply-side perspective (personality theory) and the demand-side perspective (environmental approach). The supply-side perspective focuses on the individual traits of entrepreneurs. In contrast, the demand-side perspective focuses on the context in which entrepreneurship occurs, with factors such as finance, industry and geography each playing their part. In the past two decades, there has been a revival of interest and research that explores how entrepreneurs’ personality relates to the success of their ventures. This new and growing body of research includes several reviews and meta-studies, which show that personality traits play an important role in both career success and entrepreneurship 15 , 16 , 17 , 18 , 19 , that there is heterogeneity in definitions and samples used in research on entrepreneurship 16 , 18 , and that founder personality plays an important role in overall startup outcomes 17 , 19 .

Motivated by the pivotal role of the personality of founders on startup success outlined in these recent contributions, we investigate two main research questions:

Which personality features characterise founders?

Do their personalities, particularly the diversity of personality types in founder teams, play a role in startup success?

We aim to understand whether certain founder personalities and their combinations relate to startup success, defined as whether their company has been acquired, acquired another company or listed on a public stock exchange. For the quantitative analysis, we draw on a previously published methodology 20 , which matches people to their ‘ideal’ jobs based on social media-inferred personality traits.

We find that personality traits matter for startup success. In addition to firm-level factors of location, industry and company age, we show that founders’ specific Big Five personality traits, such as adventurousness and openness, are significantly more widespread among successful startups. As we find that companies with multi-founder teams are more likely to succeed, we cluster founders into six different and distinct personality groups to underline the relevance of complementary personality traits among founder teams. Startups with diverse and specific combinations of founder types (e.g., an adventurous ‘Leader’, a conscientious ‘Accomplisher’, and an extroverted ‘Developer’) have significantly higher odds of success.
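A hedged sketch of the kind of clustering step described here, assuming a founders-by-facets score matrix: the data below is simulated, and the authors' actual pipeline may differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
facets = rng.normal(size=(21187, 30))   # founders x 30 Big Five facets (simulated)

# Standardize the facet scores, then group founders into six personality types
X = StandardScaler().fit_transform(facets)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
```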

We organise the rest of this paper as follows. In the Section "Results", we introduce the data used and the methods applied to relate founders’ psychological traits with their startups’ success. We introduce the natural language processing method to derive individual and team personality characteristics and the clustering technique to identify personality groups. Then, we present the results of a multivariate regression analysis that allows us to relate firm success to external and personality features. Subsequently, the Section "Discussion" mentions limitations and opportunities for future research in this domain. In the Section "Methods", we describe the data, the variables in use, and the clustering in greater detail. Robustness checks and additional analyses can be found in the Supplementary Information.

Our analysis relies on two datasets. We infer individual personality facets via a previously published methodology 20 from Twitter user profiles. Here, we restrict our analysis to founders with a Crunchbase profile. Crunchbase is the world’s largest directory on startups. It provides information about more than one million companies, primarily focused on funding and investors. A company’s public Crunchbase profile can be considered a digital business card of an early-stage venture. As such, the founding teams tend to provide information about themselves, including their educational background or a link to their Twitter account.

We infer the personality profiles of the founding teams of early-stage ventures from their publicly available Twitter profiles, using the methodology described by Kern et al. 20 . Then, we correlate this information to data from Crunchbase to determine whether particular combinations of personality traits correspond to the success of early-stage ventures. The final dataset used in the success prediction model contains n = 21,187 startup companies (for more details on the data see the Methods section and SI section  A.5 ).

Reviews of Crunchbase as a data source for investigations at the firm and industry level confirm the platform to be a useful and valuable source for startup research. Comparisons with other sources at the micro level, e.g., VentureXpert or PwC, suggest that the platform’s coverage is very comprehensive, especially for start-ups located in the United States 21 . Moreover, aggregate statistics on funding rounds by country and year are quite similar to those produced with other established sources, validating the use of Crunchbase as a reliable source in terms of coverage of funded ventures. For instance, Crunchbase covers about the same number of investment rounds in the analogous sectors as collected by the National Venture Capital Association 22 . However, we acknowledge that the data source might suffer from registration latency (a delay between the foundation of a company and its registration on Crunchbase) and success bias in company status (the likelihood that failed companies delete their profiles from the database).

The definition of startup success

The success of startups is uncertain, dependent on many factors and can be measured in various ways. Due to the likelihood of failure in startups, some large-scale studies have looked at which features predict startup survival rates 23 , and others focus on fundraising from external investors at various stages 24 . Success for startups can be measured in multiple ways, such as the amount of external investment attracted, the number of new products shipped or the annual growth in revenue. But sometimes external investments are misguided, revenue growth can be short-lived, and new products may fail to find traction.

Success in a startup is typically staged and can appear in different forms and times. For example, a startup may be seen to be successful when it finds a clear solution to a widely recognised problem, such as developing a successful vaccine. On the other hand, it could be achieving some measure of commercial success, such as rapidly accelerating sales or becoming profitable or at least cash positive. Or it could be reaching an exit for foundation investors via a trade sale, acquisition or listing of its shares for sale on a public stock exchange via an Initial Public Offering (IPO).

For our study, we focused on the startup’s extrinsic success rather than the founders’ intrinsic success per se, as it is more visible, objective and measurable. A frequently considered measure of success is the attraction of external investment by venture capitalists 25 . However, this is not in and of itself a good measure of clear, incontrovertible success, particularly for early-stage ventures, because it reflects investors’ expectations of a startup’s success potential rather than actual business success. Similarly, we considered other measures like revenue growth 26 , liquidity events 27 , 28 , 29 , profitability 30 and social impact 31 , all of which have benefits as they capture incremental success, but each also comes with operational measurement challenges.

Therefore, we apply the success definition initially introduced by Bonaventura et al. 32 , namely that a startup is acquired, acquires another company or has an initial public offering (IPO). We consider any of these major capital liquidation events as a clear threshold signal that the company has matured beyond an early-stage venture and has become, or is well on its way to becoming, a mature company with clear and often significant business growth prospects. Together, these three major liquidity events capture the primary forms of exit for external investors (an acquisition or trade sale and an IPO). For companies with a longer autonomous growth runway, acquiring another company marks a similar milestone of scale, maturity and capability.

Using multifactor analysis and a binary classification model of startup success, we examined many variables together and their relative influence on the probability of startup success. We looked at seven categories of factors through three lenses. Firm-level factors: (1) location, (2) industry and (3) age of the startup; founder-level factors: (4) number of founders, (5) gender of founders and (6) personality characteristics of founders; and, lastly, team-level factors: (7) founder-team personality combinations. The model performance and the relative impact of each of these categories of factors on the probability of startup success are illustrated in more detail in section A.6 of the Supplementary Information (in particular Extended Data Fig. 19 and Extended Data Fig. 20 ). In total, we considered over three hundred variables (n = 323) and their significant associations with success.

The personality of founders

Besides product-market, industry and firm-level factors (see SI section A.1 ), research suggests that the personalities of founders play a crucial role in startup success 19 . Therefore, we examine the personality characteristics of individual startup founders and teams of founders in relation to their firm's success, applying the success definition used by Bonaventura et al. 32 .

Employing established methods 33 , 34 , 35 , we inferred the personality traits across 30 dimensions (Big Five facets) of a large global sample of startup founders. The founder cohort was created from a subset of founders in the global startup industry directory Crunchbase who are also active on the social media platform Twitter.

To measure the personality of the founders, we used the Big Five, a popular model of personality comprising five core traits: Openness to Experience, Conscientiousness, Extraversion, Agreeableness and Emotional Stability. Each of these traits can be further broken down into thirty distinct facets. Studies have found that the Big Five predict meaningful life outcomes, such as physical and mental health, longevity, social relationships, health-related behaviours, antisocial behaviour and social contribution, at levels on par with intelligence and socioeconomic status 36 . Using machine learning to infer personality traits from language use and activity on social media has been shown to be more accurate than the predictions of coworkers, friends and family, and similar in accuracy to the judgement of spouses 37 . Further, consistent with prior research, we assume that personality traits remain stable in adulthood even through significant life events 38 , 39 , 40 . Personality traits have been shown to emerge continuously from those already evident in adolescence 41 and are not significantly influenced by external life events such as becoming divorced or unemployed 42 . This suggests that the direction of any measurable effect goes from founder personalities to startup success and not vice versa.

As a first investigation of the extent to which personality traits relate to entrepreneurship, we used the personality characteristics of individuals to predict whether they were entrepreneurs or employees. We trained and tested a machine-learning random forest classifier to distinguish entrepreneurs from employees using inferred personality vectors alone. As a result, we could correctly predict entrepreneurs with 77% accuracy and employees with 88% accuracy (Fig. 1 A). Thus, based on personality information alone, we correctly classify unseen new samples with 82.5% accuracy (see SI section A.2 for more details on this analysis, the classification modelling and the prediction accuracy).
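As a minimal illustration of this classification step, the sketch below trains a random forest on a person-by-facet matrix. The data here are synthetic placeholders and all names are our own assumptions, not the study's actual pipeline; with the real 30-facet vectors and labels, the per-class recall would correspond to the 77% and 88% figures above.

    # Sketch of the entrepreneur-vs-employee classifier (assumed setup).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    rng = np.random.default_rng(42)
    n, n_facets = 2000, 30
    X = rng.normal(size=(n, n_facets))   # placeholder Big Five facet vectors
    y = rng.integers(0, 2, size=n)       # 1 = entrepreneur, 0 = employee

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_tr, y_tr)

    # Per-class recall plays the role of the paper's per-class accuracies.
    print(classification_report(y_te, clf.predict(X_te),
                                target_names=["employee", "entrepreneur"]))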

We explored in greater detail which personality features are most prominent among entrepreneurs. We found that the facet of Adventurousness within the Big Five domain of Openness was significant and had the largest effect size; the facet of Modesty within Agreeableness and the facet of Activity Level within Extraversion had the next largest effects (Fig. 1 B). Adventurousness in the Big Five framework is defined as the preference for variety, novelty and starting new things, which is consistent with the role of a startup founder, whose job, especially in the early life of the company, is to explore things that do not scale easily 43 and to develop and test new products, services and business models with the market.

Once we had derived and tested the Big Five personality features for each entrepreneur in our dataset, we examined whether startup founders naturally cluster according to their personality features using a Hopkins test (see Extended Data Fig. 6 ). We discovered clear clustering tendencies in the data compared with renowned reference datasets known to contain clusters. Having established that the founder data cluster, we applied agglomerative hierarchical clustering. This 'bottom-up' clustering technique initially treats each observation as an individual cluster and then merges them to create a hierarchy of possible cluster schemes with differing numbers of groups (see Extended Data Fig. 7 ). Lastly, we identified the optimal number of clusters based on the outcome of four different clustering performance measurements: the Davies-Bouldin Index, Silhouette coefficients, the Calinski-Harabasz Index and the Dunn Index (see Extended Data Fig. 8 ). We find that the optimal number of clusters of startup founders based on their personality features is six (labelled #0 through #5), as shown in Fig. 1 C.

To better understand the context of the different founder types, we positioned each of the six types within an occupation-personality matrix established in previous research 44 , which showed, using a substantial sample of employees across various jobs, that 'each job has its own personality'. Utilising the methodology of that study, we assigned to clusters #0 to #5 the labels of the occupation tribes that best describe the personality facets represented by each cluster (see Extended Data Fig. 9 for an overview of these tribes, as identified by McCarthy et al. 44 ).

Utilising this approach, we identify three 'purebred' clusters: #0, #2 and #5, whose members are dominated by a single tribe (more than 60% of the individuals in each cluster are characterised by one tribe). These clusters thus represent and share the personality attributes of these previously identified occupation-personality tribes 44 , which have the following known distinctive personality attributes (see also Table 1 ):

Accomplishers (#0) —Organised and outgoing: confident, down-to-earth, content, accommodating, mild-tempered and self-assured.

Leaders (#2) —Adventurous, persistent, dispassionate, assertive, self-controlled, calm under pressure, philosophical, excitement-seeking & confident.

Fighters (#5) —Spontaneous and impulsive, tough, sceptical, and uncompromising.

We labelled these clusters with the tribe names, acknowledging that the labels are somewhat arbitrary and based on our best interpretation of the data (see SI section A.3 for more details).

The remaining three clusters, #1, #3 and #4, are 'hybrids', meaning that the founders within them come from a mix of different tribes, with no single tribe representing more than 50% of the members of that cluster. The tribes with the largest shares were Experts/Engineers in #1, Fighters in #3 and Operators in #4.

To label these three hybrid clusters, we examined the closest occupations to the median personality features of each cluster. We selected a name that reflected the common themes of these occupations, namely:

Experts/Engineers (#1), as the closest roles included Materials Engineers and Chemical Engineers. This is consistent with the cluster's personality footprint, which is highest in Openness, in the facets of Imagination and Intellect.

Developers (#3), as the closest roles include Application Developers and related technology roles such as Business Systems Analysts and Product Managers.

Operators (#4), as the closest roles include service, maintenance and operations functions, including Bicycle Mechanic, Mechanic and Service Manager. This is also consistent with key personality traits of founders in this cluster: high Conscientiousness in the facet of Orderliness and high Agreeableness in the facet of Humility.

Figure 1: Founder-Level Factors of Startup Success. (A) Successful entrepreneurs differ from successful employees; they can be accurately distinguished by a classifier using personality information alone. (B) Successful entrepreneurs have different Big Five facet distributions, especially on Adventurousness, Modesty and Activity Level. (C) Founders come in six different types: Fighters, Operators, Accomplishers, Leaders, Engineers and Developers (FOALED). (D) Each founder personality type has its own distinct facet footprint.

Together, these six different types of startup founders (Fig. 1 C) represent a framework we call the FOALED model of founder types, an acronym of Fighters, Operators, Accomplishers, Leaders, Engineers and Developers.

Each founder personality type has its own distinct facet footprint (for more details, see Extended Data Fig. 10 in SI section A.3 ). We also observe a central core of correlated features that are high for all types of entrepreneurs, including Intellect, Adventurousness and Activity Level (Fig. 1 D). To test the robustness of the clustering of the personality facets, we compared the mean scores of the individual facets per cluster under a 20-fold resampling of the data and found that the clusters are, overall, largely robust against resampling (see Extended Data Fig. 11 in SI section A.3 for more details).

We also find that the clusters accord with the distribution of founders’ roles in their startups. For example, Accomplishers are often Chief Executive Officers, Chief Financial Officers, or Chief Operating Officers, while Fighters tend to be Chief Technical Officers, Chief Product Officers, or Chief Commercial Officers (see Extended Data Fig.  12 in SI section  A.4 for more details).

The ensemble theory of success

While founders' individual personality traits, such as Adventurousness or Openness, are related to their firms' success, we also hypothesise that the combination, or ensemble, of the personality characteristics of a founding team affects the chances of success. The logic behind this reasoning is complementarity, as proposed by contemporary research on the functional roles of founder teams; clear functional roles of this kind have evolved in established industries such as film and television, construction, and advertising 45 . When we subsequently explored the combinations of personality types among founders and their relationship to the probability of startup success, adjusted for a range of other factors in a multi-factorial analysis, we found significantly increased chances of success for mixed founding teams.

First, we find that firms with multiple founders are more likely to succeed, as illustrated in Fig. 2 A, which shows that firms with three or more founders are more than twice as likely to succeed as solo-founded startups. This finding is consistent with investors' advice to founders and with previous studies 46 . We also note that some founder personality types increase the probability of success more than others, as shown in SI section A.6 (Extended Data Figs. 16 and 17 ). Gender differences also play out in the distribution of personality facets: successful female founders and successful male founders show facet scores that are more similar to each other than are those of non-successful female and male founders (see Extended Data Fig. 18 ).

Figure 2: The Ensemble Theory of Team-Level Factors of Startup Success. (A) Having a larger founder team elevates the chances of success. This can be due to multiple reasons, e.g., a more extensive network or knowledge base, but also personality diversity. (B) Joint personality combinations of founders are significantly related to higher chances of success, because it takes more than one founder to cover all the beneficial personality traits that 'breed' success. (C) In our multifactor model, firms with diverse and specific combinations of founder types have significantly higher odds of success.

Access to more extensive networks and capital could explain the benefits of having more founders. Still, as we find here, a larger team also offers a greater diversity of combined personalities, naturally providing a broader range of trait maxima. For example, one founder may be more open and adventurous while another is highly agreeable and trustworthy, each potentially complementing the other's particular strengths associated with startup success.

The benefits of larger and more personality-diverse founding teams can be seen in the apparent differences between successful and unsuccessful firms based on their combined Big Five personality team footprints, as illustrated in Fig. 2 B. Here, the maximum values for each Big Five trait among a startup's co-founders are mapped, stratified by successful and non-successful companies. Founder teams of successful startups tend to score higher on Openness, Conscientiousness, Extraversion and Agreeableness.

When examining combinations of founders with different personality types, we found that some ensembles of personalities were significantly correlated with greater chances of startup success, while controlling for other variables in the model, as shown in Fig. 2 C (for more details on the modelling, the predictive performance and the coefficient estimates of the final model, see Extended Data Figs. 19 , 20 and 21 in SI section A.6 ).

Three combinations of trio-founder companies were more than twice as likely to succeed as other combinations, namely teams with (1) a Leader and two Developers , (2) an Operator and two Developers , and (3) an Expert/Engineer , a Leader and a Developer . To illustrate the potential mechanisms by which personality traits might influence the success of startups, we provide some examples of well-known, successful startup founders and their characteristic personality traits in Extended Data Fig. 22 .

Startups are one of the key mechanisms by which brilliant ideas become solutions to some of the world's most challenging economic and social problems. Examples include the Google search algorithm, disability technology startup FingerWorks' touchscreen technology that became the basis of the Apple iPhone, and the BioNTech mRNA technology that powered Pfizer's COVID-19 vaccine.

We have shown that founders’ personalities and the combination of personalities in the founding team of a startup have a material and significant impact on its likelihood of success. We have also shown that successful startup founders’ personality traits are significantly different from those of successful employees—so much so that a simple predictor can be trained to distinguish between employees and entrepreneurs with more than 80% accuracy using personality trait data alone.

Just as occupation-personality maps derived from data can provide career guidance tools, so too can data on successful entrepreneurs’ personality traits help people decide whether becoming a founder may be a good choice for them.

We have learnt through this research that there is not one type of ideal ’entrepreneurial’ personality but six different types. Many successful startups have multiple co-founders with a combination of these different personality types.

To a large extent, founding a startup is a team sport; the diversity and complementarity of personalities in the founding team therefore matter and have an outsized impact on the company's likelihood of success. While all startups are high risk, the risk becomes lower with more founders, particularly if they have distinct personality traits.

Our work demonstrates the benefits of personality diversity among the founding team of startups. Greater awareness of this novel form of diversity may help create more resilient startups capable of more significant innovation and impact.

The data-driven research approach presented here comes with certain methodological limitations. The principal data sources of this study, Crunchbase and Twitter, are extensive and comprehensive, but they are characterised by some known and likely sample biases.

Crunchbase is the principal public chronicle of venture capital funding, so there is some likely sample bias toward: (1) startup companies that are funded externally, since self-funded or bootstrapped companies are less likely to be represented in Crunchbase; (2) technology companies, reflecting Crunchbase's roots; (3) multi-founder companies; (4) male founders, since, while the representation of female founders is now double that of the mid-2000s, women still represent less than 25% of the sample; and (5) companies that succeed, since companies that fail, especially those that fail early, are likely to be under-represented in the data.

Samples were also limited to founders who are active on Twitter, which introduces additional selection biases. For example, Twitter users are typically younger, more educated and have a higher median income 47 . Another limitation of our approach is the potentially biased presentation of a person's digital identity on social media, which is the basis for inferring personality traits. For example, recent research suggests that the language and emotional tone used by entrepreneurs on social media can be affected by events such as business failure 48 , which might complicate the inference of personality traits.

In addition to sampling biases within the data, there are also significant historical biases in startup culture. For many aspects of the entrepreneurship ecosystem, women, for example, are at a disadvantage 49 . Male-founded companies have historically dominated most startup ecosystems worldwide, representing the majority of founders and the overwhelming majority of venture capital investors. As a result, startups with women have historically attracted significantly fewer funds 50 , in part due to the male bias among venture investors, although this is now changing, albeit slowly 51 .

The research presented here provides quantitative evidence for the relevance of personality types and the diversity of personalities in startups. At the same time, it raises further questions about how personality traits relate to other factors associated with success, such as the following:

Will the recent growing focus on promoting and investing in female founders change the nature, composition and dynamics of startups and their personalities leading to a more diverse personality landscape in startups?

Will the growth of startups outside of the United States change what success looks like to investors, and hence the role of different personality traits and their association with diverse success metrics?

Many of today’s most renowned entrepreneurs are either Baby Boomers (such as Gates, Branson, Bloomberg) or Generation Xers (such as Benioff, Cannon-Brookes, Musk). However, as we can see, personality is both a predictor and driver of success in entrepreneurship. Will generation-wide differences in personality and outlook affect startups and their success?

Moreover, the findings shown here have natural extensions and applications beyond startups, such as for new projects within large established companies. While not technically startups, many large enterprises and industries such as construction, engineering and the film industry rely on forming new project-based, cross-functional teams that are often new ventures and share many characteristics of startups.

There is also potential for extending this research in other settings in government, NGOs, and within the research community. In scientific research, for example, team diversity in terms of age, ethnicity and gender has been shown to be predictive of impact, and personality diversity may be another critical dimension 52 .

Another extension of the study could investigate the development of the language used by startup founders on social media over time. Such an extension could investigate whether the language (and inferred psychological characteristics) change as the entrepreneurs’ ventures go through major business events such as foundation, funding, or exit.

Overall, this study demonstrates, first, that startup founders have significantly different personalities from employees. Second, besides firm-level factors known to influence firm success, we show that a range of founder-level factors, notably the character traits of the founders, significantly affect a startup's likelihood of success. Lastly, at the team level, we found in a multifactor analysis that personality-diverse teams have the greatest impact on the probability of a startup's success, underlining the importance of personality diversity as a relevant factor in team performance and success.

Data sources

Entrepreneurs dataset

Data about the founders of startups were collected from Crunchbase (Table 2 ), an open reference platform for business information about private and public companies, primarily early-stage startups. It is one of the largest and most comprehensive datasets of its kind and has been used in over 100 peer-reviewed articles on economic and managerial research.

Crunchbase contains data on over two million companies, mainly startup companies and the companies that partner with them, acquire them and invest in them, as well as profiles of well over one million individuals active in the entrepreneurial ecosystem worldwide, from over 200 countries. Crunchbase started in the technology startup space and now covers all sectors, with a specific focus on entrepreneurship, investment and high-growth companies.

While Crunchbase contains data on over one million individuals in the entrepreneurial ecosystem, some are not entrepreneurs or startup founders but play other roles, such as investors, lawyers or executives at companies that acquire startups. To create a subset of only entrepreneurs, we selected 32,732 individuals who self-identify as founders or co-founders (by job title) and who are also publicly active on the social media platform Twitter. We also removed those who are also venture capitalists, to distinguish between investors and founders.

We selected founders active on Twitter so that we could use natural language processing to infer their Big Five personality features via an open-vocabulary approach shown in previous research to be accurate when analysing users' unstructured text, such as Twitter posts in our case. For this project, as in previous research 20 , we employed a commercial service, IBM Watson Personality Insights, to infer personality facets. This service provides raw scores and percentile scores for the Big Five domains (Openness, Conscientiousness, Extraversion, Agreeableness and Emotional Stability) and the corresponding 30 subdomains or facets. The public content of Twitter posts was collected, and 32,732 profiles each had enough Twitter posts (more than 150 words) to obtain relatively accurate personality scores (average mean absolute error below 12.7%).

The entrepreneurs’ dataset is analysed in combination with other data about the companies they founded to explore questions about the nature and patterns of personality traits of entrepreneurs and the relationships between these patterns and company success.

For the multifactor analysis, we further filtered the data in several preparatory steps for the success prediction modelling (for more details, see SI section A.5 ). In particular, we removed data points with missing values (Extended Data Fig. 13 ) and kept only companies founded from 1990 onward to ensure consistency with previous research 32 (see Extended Data Fig. 14 ). After cleaning, filtering and pre-processing, the data comprised 25,214 founders who founded 21,187 startup companies for use in the multifactor analysis. Of those, 3,442 startups were successful, 2,362 of them within the first seven years after founding (see Extended Data Fig. 15 for more details).
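A hedged sketch of these filtering steps in pandas is shown below; the file and column names are illustrative assumptions, not the study's actual schema.

    # Assumed pre-processing sketch; the schema here is hypothetical.
    import pandas as pd

    df = pd.read_csv("crunchbase_founders.csv")        # placeholder file name

    # Drop records with missing values in the modelling variables.
    df = df.dropna(subset=["founded_year", "industry", "country"])

    # Keep companies founded from 1990 onward, as in prior work.
    df = df[df["founded_year"] >= 1990]

    # Success = acquired, acquires another company, or IPO (see above).
    df["success"] = df[["was_acquired", "made_acquisition", "had_ipo"]].any(axis=1)
    print(df["success"].mean())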

Entrepreneurs and employees dataset

To investigate whether startup founders show personality traits that are similar to or different from those of the population at large (i.e., the entrepreneurs-vs-employees sub-analysis shown in Fig. 1 A and B), we filtered the entrepreneurs' data further: we reduced the sample to founders of companies that attracted more than US$100k in investment, creating a reference set of successful entrepreneurs (n = 4,400).

To create a control group of employees who are not, and are very unlikely to have been, entrepreneurs, we leveraged the fact that while some occupational titles, such as CEO, CTO and Public Speaker, are commonly shared by founders and co-founders, others, such as Cashier , Zoologist and Detective , very rarely co-occur with founder or co-founder roles. To illustrate, many company founders also adopt regular occupation titles: many are 'Founder and CEO' or 'Co-founder and CTO'. But while founders are often CEOs or CTOs, the reverse is not necessarily true, as many CEOs are professional executives who were not involved in the establishment or ownership of the firm.

Using data from LinkedIn, we created an Entrepreneurial Occupation Index (EOI) based on the ratio of entrepreneurs in each of the 624 occupations used in a previous study of occupation-personality fit 44 . It was calculated as the percentage of all people working in an occupation on LinkedIn who shared the title Founder or Co-founder (see SI section A.2 for more details). A reference set of employees (n = 6,685) was then selected across the 112 occupations with the lowest propensity for entrepreneurship (less than 0.5% EOI) from a large corpus of Twitter users with known occupations, also drawn from the previous occupation-personality fit study 44 .
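The EOI computation reduces to a simple ratio per occupation. The sketch below uses made-up counts purely to illustrate the threshold logic; the study derived the actual counts from LinkedIn profiles.

    # Illustrative EOI computation with invented counts.
    occupation_totals = {"CEO": 120_000, "Cashier": 80_000, "Zoologist": 5_000}
    founder_counts    = {"CEO":  18_000, "Cashier":     90, "Zoologist":     8}

    eoi = {occ: 100.0 * founder_counts[occ] / occupation_totals[occ]
           for occ in occupation_totals}

    # Occupations below the 0.5% threshold feed the employee control group.
    control_occupations = [occ for occ, share in eoi.items() if share < 0.5]
    print(eoi)                  # {'CEO': 15.0, 'Cashier': 0.1125, 'Zoologist': 0.16}
    print(control_occupations)  # ['Cashier', 'Zoologist']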

These two data sets were used to test whether it may be possible to distinguish successful entrepreneurs from successful employees based on the different patterns of personality traits alone.

Hierarchical clustering

We applied several clustering techniques and tests to the personality vectors of the entrepreneurs' dataset to determine whether there are natural clusters and, if so, the optimal number of clusters.

First, to determine whether there is a natural typology to founder personalities, we applied the Hopkins statistic, a test of whether the entrepreneurs' dataset contains inherent clusters. It measures the clustering tendency based on the ratio of two quantities: the sum of distances from randomly placed artificial points, drawn from a simulated uniform distribution, to their nearest neighbours in the real entrepreneurs' dataset, and the sum of distances from real points within a sample of the entrepreneurs' dataset to their nearest neighbours. This ratio captures the difference between the entrepreneurs' data distribution and a uniform distribution, thereby testing the randomness of the data. The Hopkins statistic ranges from 0 to 1: scores close to 0, 0.5 and 1 indicate, respectively, a uniformly distributed, a randomly distributed and a highly clustered dataset.
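A sketch of the Hopkins statistic as described above, assuming the facet scores sit in a NumPy array; it uses sums of raw nearest-neighbour distances, matching the description in the text.

    # Hopkins statistic sketch: ~0.5 for random data, near 1 for clustered data.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def hopkins(X, sample_size=200, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        m = min(sample_size, n - 1)
        nn = NearestNeighbors(n_neighbors=2).fit(X)

        # u: distances from uniform artificial points to nearest real points
        uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
        u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

        # w: distances from sampled real points to their nearest other point
        idx = rng.choice(n, size=m, replace=False)
        w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]  # column 0 is self

        return u.sum() / (u.sum() + w.sum())

    X = np.random.default_rng(1).normal(size=(500, 30))  # placeholder facets
    print(hopkins(X))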

To cluster the founders by personality facets, we used agglomerative hierarchical clustering (AHC), a bottom-up approach that treats each individual data point as a singleton cluster and then iteratively merges pairs of clusters until all data points are included in a single cluster. Ward's linkage method is used to choose the pair of clusters whose merger minimises the increase in within-cluster variance. AHC is widely applied in clustering analysis because its tree-hierarchy output is more informative and interpretable than that of K-means. Dendrograms were used to visualise the hierarchy and provide perspective on the optimal number of clusters; the heights in the dendrogram represent the distance between groups, with lower heights representing more similar groups of observations. A horizontal line drawn through the dendrogram separates significantly different clusters at greater heights. However, as the optimal number of clusters cannot be determined from the dendrogram alone, we applied additional clustering performance metrics to identify the optimal number of groups.
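A minimal sketch of this step with SciPy, reusing a placeholder facet matrix; Ward linkage builds the hierarchy, and a fixed cut yields the six clusters reported above.

    # Agglomerative hierarchical clustering with Ward's linkage.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    import matplotlib.pyplot as plt

    X = np.random.default_rng(2).normal(size=(300, 30))  # placeholder facets

    Z = linkage(X, method="ward")     # bottom-up merges minimising variance
    dendrogram(Z, truncate_mode="lastp", p=12)  # condensed view of hierarchy
    plt.ylabel("merge distance")
    plt.show()

    labels = fcluster(Z, t=6, criterion="maxclust")  # cut into six clusters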

A range of clustering performance metrics was used to help determine the optimal number of clusters in the dataset once a clear clustering tendency had been confirmed. The following metrics were used to comprehensively evaluate the differences between within-cluster and between-cluster distances: the Dunn Index, the Calinski-Harabasz Index, the Davies-Bouldin Index and the Silhouette Index. The Dunn Index is the ratio of the minimum inter-cluster separation to the maximum intra-cluster diameter. The Calinski-Harabasz Index refines this idea by using the ratio of the average between-cluster dispersion to the average within-cluster dispersion. The Davies-Bouldin Index treats each cluster individually: for pairs of clusters, it compares the average distances of data points to their own cluster centres with the distance between the two cluster centres. Finally, the Silhouette Index is the overall average of the silhouette coefficients of all samples, where each coefficient measures the similarity of a data point to its own cluster compared with other clusters. Higher scores on the Dunn, Calinski-Harabasz and Silhouette Indices, and a lower score on the Davies-Bouldin Index, indicate a better clustering configuration.
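The metric-driven selection of the cluster count can be sketched as follows; the Dunn Index has no scikit-learn implementation, so only the three bundled indices appear here.

    # Compare cluster counts with three of the four indices named above.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                                 davies_bouldin_score)

    X = np.random.default_rng(3).normal(size=(300, 30))  # placeholder facets

    for k in range(2, 11):
        labels = AgglomerativeClustering(n_clusters=k,
                                         linkage="ward").fit_predict(X)
        print(k,
              round(silhouette_score(X, labels), 3),         # higher = better
              round(calinski_harabasz_score(X, labels), 1),  # higher = better
              round(davies_bouldin_score(X, labels), 3))     # lower = better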

Classification modelling

Classification algorithms

To obtain comprehensive and robust conclusions when predicting whether a given set of personality traits corresponds to an entrepreneur or an employee, we explored the following classifiers: Naïve Bayes, Elastic Net regularisation, Support Vector Machine, Random Forest, Gradient Boosting and a Stacked Ensemble. The Naïve Bayes classifier is a probabilistic algorithm based on Bayes' theorem, with assumptions of independent features and equiprobable classes. Compared with more complex classifiers, it saves computing time on large datasets and performs well when its assumptions hold; in the real world, however, those assumptions are generally violated. Elastic Net regularisation combines the penalties of Lasso and Ridge to regularise a logistic classifier: it overcomes the multicollinearity limitation of the Lasso method and improves on the limited feature selection of the Ridge method. Although Elastic Net is as simple as the Naïve Bayes classifier, it is more time-consuming. The Support Vector Machine (SVM) aims to find the ideal line or hyperplane separating successful entrepreneurs from employees; the decision boundary can be non-linear if a non-linear kernel, such as the Radial Basis Function kernel, is used. SVMs therefore perform well on high-dimensional data, although the 'right' kernel must be selected and tuned. Random Forest (RF) and Gradient Boosting Trees (GBT) are ensembles of decision trees: in RF, all trees are trained independently and simultaneously, while in GBT a new tree is trained at each step to correct the errors of previously trained trees. RF is the more robust and straightforward model, since it has fewer hyperparameters to tune; GBT directly optimises the objective function and can learn a more accurate model thanks to its successive learning-and-correction process. The Stacked Ensemble combines all of the above classifiers through a logistic regression. Unlike bagging, which only reduces variance, and boosting, which only reduces bias, stacking leverages model diversity to lower both. All the above classification algorithms distinguish successful entrepreneurs from employees based on the personality matrix.
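A sketch of this classifier suite in scikit-learn, with a logistic-regression meta-learner on top; the hyperparameters shown are illustrative, not the study's tuned values.

    # Stacked ensemble over the base learners named above.
    from sklearn.ensemble import (RandomForestClassifier,
                                  GradientBoostingClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    base_learners = [
        ("nb",   GaussianNB()),
        ("enet", LogisticRegression(penalty="elasticnet", solver="saga",
                                    l1_ratio=0.5, max_iter=5000)),
        ("svm",  SVC(kernel="rbf", probability=True)),
        ("rf",   RandomForestClassifier(n_estimators=300)),
        ("gbt",  GradientBoostingClassifier()),
    ]
    stack = StackingClassifier(estimators=base_learners,
                               final_estimator=LogisticRegression())
    # stack.fit(X_tr, y_tr) with the personality matrix and labels from above.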

Evaluation metrics

A range of evaluation metrics comprehensively describes the performance of a classification model. The most straightforward metric is accuracy, which measures the overall proportion of correct predictions but can be misleading on an imbalanced dataset. The F1 score improves on accuracy by combining precision and recall, thereby accounting for false negatives and false positives. Specificity measures the true negative rate, i.e., the proportion of employees correctly identified, while the Positive Predictive Value (PPV) is the probability that a predicted entrepreneur is indeed a successful entrepreneur. The Area Under the Receiver Operating Characteristic Curve (AUROC) measures the capability of the algorithm to distinguish between successful entrepreneurs and employees; a higher value means the classifier is better at separating the classes.
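These metrics can be computed from predictions and scores as sketched below, treating 'entrepreneur' as the positive class.

    # Evaluation metrics named above, via scikit-learn.
    from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                                 recall_score, roc_auc_score)

    def evaluate(y_true, y_pred, y_score):
        return {
            "accuracy":    accuracy_score(y_true, y_pred),
            "f1":          f1_score(y_true, y_pred),
            "specificity": recall_score(y_true, y_pred, pos_label=0),
            "ppv":         precision_score(y_true, y_pred),
            "auroc":       roc_auc_score(y_true, y_score),  # needs P(class=1)
        }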

Feature importance

To further understand and interpret the classifier, it is critical to identify the variables with the most predictive power on the target. For tree-based models, feature importance is measured with Gini importance scores, which quantify how much each predictor contributes, across all splits in which it is used, to reducing class impurity. These measurements reflect interactions among features, but they do not provide insight into the direction of an effect, since importance only indicates a feature's ability to distinguish between classes.
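Extracting and ranking Gini importances from a fitted forest looks roughly like this; the data and facet names are placeholders.

    # Rank features by Gini importance (mean decrease in impurity).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 30))          # placeholder facet matrix
    y = rng.integers(0, 2, size=500)        # placeholder labels

    forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    facet_names = [f"facet_{i}" for i in range(X.shape[1])]

    ranked = sorted(zip(facet_names, forest.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for name, score in ranked[:5]:
        print(f"{name}: {score:.4f}")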

Statistical analysis

The t-test, Cohen's d and the two-sample Kolmogorov-Smirnov test are used to explore how the mean values and distributions of personality facets differ between entrepreneurs and employees. The t-test determines whether the means of a personality facet in the two groups are significantly different from one another; the facets with significant differences are critical for separating the two groups. Cohen's d measures the effect size of the corresponding t-test result as the ratio of the mean difference to the pooled standard deviation; a larger Cohen's d indicates that the mean difference is large relative to the variability of the whole sample. Finally, the two-sample Kolmogorov-Smirnov test checks whether the two groups' facet scores are drawn from the same probability distribution; it makes no assumptions about the distributions but is more sensitive to deviations near the centre of the distribution than at the tails.
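A sketch of these three tests with SciPy, on placeholder facet scores for the two groups; Cohen's d uses the pooled standard deviation as described.

    # Facet-level group comparison: t-test, Cohen's d, two-sample KS test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    founders  = rng.normal(0.55, 0.10, 4400)   # placeholder facet scores
    employees = rng.normal(0.50, 0.10, 6685)

    t_stat, t_p = stats.ttest_ind(founders, employees, equal_var=False)

    pooled_sd = np.sqrt(((founders.size - 1) * founders.var(ddof=1) +
                         (employees.size - 1) * employees.var(ddof=1)) /
                        (founders.size + employees.size - 2))
    cohens_d = (founders.mean() - employees.mean()) / pooled_sd

    ks_stat, ks_p = stats.ks_2samp(founders, employees)
    print(t_stat, t_p, cohens_d, ks_stat, ks_p)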

Privacy and ethics

The focus of this research is on providing high-level insights about groups of startups, founders and types of founder teams rather than about specific individuals or companies. While we used unit-record data from the publicly available company profiles on Crunchbase , we removed all identifiers from the underlying data on individual companies and founders and generated aggregate results, which formed the basis for our analysis and conclusions.

Data availability

A dataset that includes only aggregated statistics about the success of startups and the factors that influence it is released as part of this research. The underlying data for all figures and the code to reproduce them are available on GitHub: https://github.com/Braesemann/FounderPersonalities . Please contact Fabian Braesemann ( [email protected] ) if you have any further questions.

Change history

07 May 2024

A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-61082-7

References

1. Henrekson, M. & Johansson, D. Gazelles as job creators: A survey and interpretation of the evidence. Small Bus. Econ. 35, 227–244 (2010).

2. Davila, A., Foster, G., He, X. & Shimizu, C. The rise and fall of startups: Creation and destruction of revenue and jobs by young companies. Aust. J. Manag. 40, 6–35 (2015).

3. Which vaccine saved the most lives in 2021? The Economist (Online) (2022).

4. Oltermann, P. Pfizer/BioNTech tax windfall brings Mainz an early Christmas present. The Guardian (2021).

5. Grant, K. A., Croteau, M. & Aziz, O. The survival rate of startups funded by angel investors. i-INC White Paper Series, 1–21 (2019).

6. Top 20 reasons start-ups fail: CB Insights version. Newstex (2019).

7. Hochberg, Y. V., Ljungqvist, A. & Lu, Y. Whom you know matters: Venture capital networks and investment performance. J. Financ. 62, 251–301 (2007).

8. Fracassi, C., Garmaise, M. J., Kogan, S. & Natividad, G. Business microloans for US subprime borrowers. J. Financ. Quant. Anal. 51, 55–83 (2016).

9. Davila, A., Foster, G. & Gupta, M. Venture capital financing and the growth of startup firms. J. Bus. Ventur. 18, 689–708 (2003).

10. Nann, S. et al. Comparing the structure of virtual entrepreneur networks with business effectiveness. Proc. Soc. Behav. Sci. 2, 6483–6496 (2010).

11. Guzman, J. & Stern, S. Where is Silicon Valley? Science 347, 606–609 (2015).

12. Aldrich, H. E. & Wiedenmayer, G. From traits to rates: An ecological perspective on organizational foundings. 61–97 (2019).

13. Gartner, W. B. "Who is an entrepreneur?" is the wrong question. Am. J. Small Bus. 12, 11–32 (1988).

14. Thornton, P. H. The sociology of entrepreneurship. Ann. Rev. Sociol. 25, 19–46 (1999).

15. Eikelboom, M. E., Gelderman, C. & Semeijn, J. Sustainable innovation in public procurement: The decisive role of the individual. J. Public Procure. 18, 190–201 (2018).

16. Kerr, S. P. et al. Personality traits of entrepreneurs: A review of recent literature. Found. Trends Entrep. 14, 279–356 (2018).

17. Hamilton, B. H., Papageorge, N. W. & Pande, N. The right stuff? Personality and entrepreneurship. Quant. Econ. 10, 643–691 (2019).

18. Salmony, F. U. & Kanbach, D. K. Personality trait differences across types of entrepreneurs: A systematic literature review. RMS 16, 713–749 (2022).

19. Freiberg, B. & Matz, S. C. Founder personality and entrepreneurial outcomes: A large-scale field study of technology startups. Proc. Natl. Acad. Sci. 120, e2215829120 (2023).

20. Kern, M. L., McCarthy, P. X., Chakrabarty, D. & Rizoiu, M.-A. Social media-predicted personality traits and values can help match people to their ideal jobs. Proc. Natl. Acad. Sci. 116, 26459–26464 (2019).

21. Dalle, J.-M., Den Besten, M. & Menon, C. Using Crunchbase for economic and managerial research. (2017).

22. Block, J. & Sandner, P. What is the effect of the financial crisis on venture capital financing? Empirical evidence from US internet start-ups. Ventur. Cap. 11, 295–309 (2009).

23. Antretter, T., Blohm, I. & Grichnik, D. Predicting startup survival from digital traces: Towards a procedure for early stage investors (2018).

24. Dworak, D. Analysis of founder background as a predictor for start-up success in achieving successive fundraising rounds. (2022).

25. Hsu, D. H. Venture capitalists and cooperative start-up commercialization strategy. Manage. Sci. 52, 204–219 (2006).

26. Blank, S. Why the lean start-up changes everything (2018).

27. Kaplan, S. N. & Lerner, J. It ain't broke: The past, present, and future of venture capital. J. Appl. Corp. Financ. 22, 36–47 (2010).

28. Hallen, B. L. & Eisenhardt, K. M. Catalyzing strategies and efficient tie formation: How entrepreneurial firms obtain investment ties. Acad. Manag. J. 55, 35–70 (2012).

29. Gompers, P. A. & Lerner, J. The Venture Capital Cycle (MIT Press, 2004).

30. Shane, S. & Venkataraman, S. The promise of entrepreneurship as a field of research. Acad. Manag. Rev. 25, 217–226 (2000).

31. Zahra, S. A. & Wright, M. Understanding the social role of entrepreneurship. J. Manage. Stud. 53, 610–629 (2016).

32. Bonaventura, M. et al. Predicting success in the worldwide start-up network. Sci. Rep. 10, 1–6 (2020).

33. Schwartz, H. A. et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE 8, e73791 (2013).

34. Plank, B. & Hovy, D. Personality traits on Twitter-or-how to get 1,500 personality tests in a week. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 92–98 (2015).

35. Arnoux, P.-H. et al. 25 tweets to know you: A new model to predict personality with social media. In Eleventh International AAAI Conference on Web and Social Media (2017).

36. Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A. & Goldberg, L. R. The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspect. Psychol. Sci. 2, 313–345 (2007).

37. Youyou, W., Kosinski, M. & Stillwell, D. Computer-based personality judgments are more accurate than those made by humans. Proc. Natl. Acad. Sci. 112, 1036–1040 (2015).

38. Soldz, S. & Vaillant, G. E. The big five personality traits and the life course: A 45-year longitudinal study. J. Res. Pers. 33, 208–232 (1999).

39. Damian, R. I., Spengler, M., Sutu, A. & Roberts, B. W. Sixteen going on sixty-six: A longitudinal study of personality stability and change across 50 years. J. Pers. Soc. Psychol. 117, 674 (2019).

40. Rantanen, J., Metsäpelto, R.-L., Feldt, T., Pulkkinen, L. & Kokko, K. Long-term stability in the big five personality traits in adulthood. Scand. J. Psychol. 48, 511–518 (2007).

41. Roberts, B. W., Caspi, A. & Moffitt, T. E. The kids are alright: Growth and stability in personality development from adolescence to adulthood. J. Pers. Soc. Psychol. 81, 670 (2001).

42. Cobb-Clark, D. A. & Schurer, S. The stability of big-five personality traits. Econ. Lett. 115, 11–15 (2012).

43. Graham, P. Do Things that Don't Scale (Paul Graham, 2013).

44. McCarthy, P. X., Kern, M. L., Gong, X., Parker, M. & Rizoiu, M.-A. Occupation-personality fit is associated with higher employee engagement and happiness. (2022).

45. Pratt, A. C. Advertising and creativity, a governance approach: A case study of creative agencies in London. Environ. Plan A 38, 1883–1899 (2006).

46. Klotz, A. C., Hmieleski, K. M., Bradley, B. H. & Busenitz, L. W. New venture teams: A review of the literature and roadmap for future research. J. Manag. 40, 226–255 (2014).

47. Duggan, M., Ellison, N. B., Lampe, C., Lenhart, A. & Madden, M. Demographics of key social networking platforms. Pew Res. Center 9 (2015).

48. Fisch, C. & Block, J. H. How does entrepreneurial failure change an entrepreneur's digital identity? Evidence from Twitter data. J. Bus. Ventur. 36, 106015 (2021).

49. Brush, C., Edelman, L. F., Manolova, T. & Welter, F. A gendered look at entrepreneurship ecosystems. Small Bus. Econ. 53, 393–408 (2019).

50. Kanze, D., Huang, L., Conley, M. A. & Higgins, E. T. We ask men to win and women not to lose: Closing the gender gap in startup funding. Acad. Manag. J. 61, 586–614 (2018).

51. Fan, J. S. Startup biases. UC Davis Law Review (2022).

52. AlShebli, B. K., Rahwan, T. & Woon, W. L. The preeminence of ethnic diversity in scientific collaboration. Nat. Commun. 9, 1–10 (2018).

53. Żbikowski, K. & Antosiuk, P. A machine learning, bias-free approach for predicting business success using Crunchbase data. Inf. Process. Manag. 58, 102555 (2021).

54. Corea, F., Bertinetti, G. & Cervellati, E. M. Hacking the venture industry: An early-stage startups investment framework for data-driven investors. Mach. Learn. Appl. 5, 100062 (2021).

55. Chapman, G. & Hottenrott, H. Founder personality and start-up subsidies. (2021).

56. Antoncic, B., Bratkovic Regar, T., Singh, G. & DeNoble, A. F. The big five personality-entrepreneurship relationship: Evidence from Slovenia. J. Small Bus. Manage. 53, 819–841 (2015).


Acknowledgements

We thank Gary Brewer from BuiltWith , Leni Mayo from Influx , Rachel Slattery from TeamSlatts and Daniel Petre from AirTree Ventures for their ongoing generosity and insights about startups, founders and venture investments. We also thank Tim Li from Crunchbase for advice and liaison regarding data on startups, and Richard Slatter for advice and referrals on Twitter .

Author information

Authors and affiliations

The Data Science Institute, University of Technology Sydney, Sydney, NSW, Australia

Paul X. McCarthy

School of Computer Science and Engineering, UNSW Sydney, Sydney, NSW, Australia

Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia

Xian Gong & Marian-Andrei Rizoiu

Oxford Internet Institute, University of Oxford, Oxford, UK

Fabian Braesemann & Fabian Stephany

DWG Datenwissenschaftliche Gesellschaft Berlin, Berlin, Germany

Melbourne Graduate School of Education, The University of Melbourne, Parkville, VIC, Australia

Margaret L. Kern


Contributions

All authors designed research; All authors analysed data and undertook investigation; F.B. and F.S. led multi-factor analysis; P.M., X.G. and M.A.R. led the founder/employee prediction; M.L.K. led personality insights; X.G. collected and tabulated the data; X.G., F.B., and F.S. created figures; X.G. created final art, and all authors wrote the paper.

Corresponding author

Correspondence to Fabian Braesemann.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this Article was revised: the Data Availability section in the original version of this Article was incomplete; the link to the GitHub repository was omitted. Full information regarding the corrections made can be found in the correction for this Article.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

McCarthy, P.X., Gong, X., Braesemann, F. et al. The impact of founder personalities on startup success. Sci Rep 13 , 17200 (2023). https://doi.org/10.1038/s41598-023-41980-y


Received: 15 February 2023

Accepted: 04 September 2023

Published: 17 October 2023

DOI: https://doi.org/10.1038/s41598-023-41980-y




Computer Science > Artificial Intelligence

Title: DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

Abstract: Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. This approach involves translating natural language problems into formal statements, filtering out low-quality statements, and generating proofs to create synthetic data. After fine-tuning the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs, our model achieved whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test, surpassing the baseline GPT-4 at 23.0% with 64 samples and a tree search reinforcement learning method at 41.0%. Additionally, our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any. These results demonstrate the potential of leveraging large-scale synthetic data to enhance theorem-proving capabilities in LLMs. Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.



    Qualitative research involves collecting and analyzing non-numerical data (e.g., text, video, or audio) to understand concepts, opinions, or experiences. It can be used to gather in-depth insights into a problem or generate new ideas for research. Qualitative research is the opposite of quantitative research, which involves collecting and ...

  23. The impact of founder personalities on startup success

    Revisions of Crunchbase as a data source for investigations on a firm and industry level confirm the platform to be a useful and valuable source of data for startups research, as comparisons with ...

  24. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale

    Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data. To address this issue, we introduce an approach to generate extensive Lean 4 proof data derived from high-school ...

  25. Writing a Research Paper Introduction

    Table of contents. Step 1: Introduce your topic. Step 2: Describe the background. Step 3: Establish your research problem. Step 4: Specify your objective (s) Step 5: Map out your paper. Research paper introduction examples. Frequently asked questions about the research paper introduction.

  26. Applied Sciences

    Environmental pollution is a significant problem and is increasing gradually as more and more harmful pollutants are being released into water bodies and the environment. Water pollutants are dangerous and pose a threat to all living organisms and the ecosystem. Paper waste is one of the most widespread and largest wastes in the world. This research aims to address two important problems ...