Big Data: Challenges and Future Research Directions

The big data movement is creating opportunities for the chemical process industries to improve their operations. Challenges, however, lie ahead.

The big data movement is gaining momentum, with companies increasingly receptive to engaging in big data projects. Their expectations are that, with massive data and distributed computing, they will be able to answer all of their questions — from questions related to plant operations to those on market demand. With answers in hand, companies hope to pave new and innovative paths toward process improvements and economic growth.

An article in Wired magazine, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (1), describes a new era in which abundant data and mathematics will replace theory. Massive data is making the hypothesize-model-test approach to science obsolete, the article states. In the past, scientists had to rely on sample testing and statistical analysis to understand a process. Today, computer scientists have access to the entire population and therefore do not need statistical tools or theoretical models. Why is theory needed if the entire “real thing” is now within reach?

Although big data is at the center of many success stories, unexpected failures can occur when a blind trust is placed in the sheer amount of data available — highlighting the importance of theory and fundamental understanding.

A classic example of such failures is actually quite dated. In 1936, the renowned magazine Literary Digest conducted an extensive survey before the presidential election between Franklin D. Roosevelt and Alfred Landon, who was then governor of Kansas. The magazine sent out 10 million postcards — considered a massive amount of data at that time — to gain insight into the voting tendencies of the populace. The Digest collected data from 2.4 million voters, and after triple-checking and verifying the data, forecast a Landon victory over Roosevelt by a margin of 57% to 43%. The final result, however, was a landslide victory by Roosevelt of 61% versus Landon’s 37% (the remaining votes were for a third candidate). Based on a much smaller sample of approximately 3,000 interviews, George Gallup correctly predicted a clear victory for Roosevelt.

Literary Digest learned the hard way that, when it comes to data, size is not the only thing that matters. Statistical theory shows that sample size affects sampling error, and the sampling error was indeed much lower in the Digest poll. But sample bias must also be considered — and this is especially critical in election polls. (The Digest sample was taken from lists of automobile registrations and telephone directories, creating a strong selection bias toward middle- and upper-class voters.)
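To make the statistical point concrete, here is a rough back-of-the-envelope sketch (our own illustration, using approximate figures from the example above, not a calculation from the original article): under simple random sampling, the margin of error shrinks with sample size, but a biased sampling frame is not improved by collecting more responses.

```python
import math

def sampling_error(p: float, n: int) -> float:
    """95% margin of error for a sample proportion under simple random sampling."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Illustrative figures only: assume a true Roosevelt share of about 61%.
p_true = 0.61
for poll, n in [("Literary Digest", 2_400_000), ("Gallup", 3_000)]:
    print(f"{poll}: n = {n:,}, margin of error ~ +/-{sampling_error(p_true, n):.2%}")

# Prints roughly +/-0.06% for the Digest and +/-1.75% for Gallup. The Digest
# nonetheless missed the outcome by about 18 percentage points, because a
# biased sampling frame is not corrected by gathering more responses.
```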

Another example that demonstrates the danger of putting excessive confidence in the analysis of big data sets concerns the mathematical models for predicting loan defaults developed by Lehman Brothers. Based on a very large database of historical data on past defaults, Lehman Brothers developed, and tested for several years, models for forecasting the probability of companies defaulting on their loans. Yet those models built over such an extensive database were not able to predict the largest bankruptcy in history — Lehman Brothers’ own.

These cases illustrate two common flaws that undermine big data analysis:

  • the sample, no matter how big, may not accurately reflect the actual target population or process
  • the population/process evolves in time (i.e., it is nonstationary) and data collected over the years may not accurately reflect the current situation to which analytics are applied.

These two cases and other well-known blunders show that domain knowledge is, of course, needed to handle real problems even when massive data are available. Industrial big data can benefit from past experiences, but challenges lie ahead.


▲ Figure 1. The big data movement stems from the availability of data, high-power computer technology, and analytics to handle data characterized by the four Vs — volume, variety, veracity, and velocity.

Like any new, promising field, big data must be viewed in terms of its capabilities as well as its limitations. Some of these limitations are merely challenges that can be addressed — enabling companies to make the most out of new opportunities created by data, technology, and analytics (Figure 1).

This article outlines ten critical challenges regarding big data in industrial contexts that need to be addressed, and suggests some emerging research paths related to them. The challenges are discussed in terms of the four Vs that define the context of big data: volume, variety, veracity, and velocity.


Big data is, first of all, about handling massive amounts of data. However, in industrial processes, the first thing to realize is that not all data are created equal. Several challenges arise from this point.

Meaningful data. Most industrial big data projects rely on happenstance data, i.e., data passively collected from processes operating under normal operating conditions most of the time. Thus, a large amount of data is indeed available, but those data span a relatively narrow range of operating conditions encountered during regular production situations.

Data sets collected under those circumstances may be suitable for process monitoring and fault detection activities (2), which rely on a good description of the normal operating conditions (NOC) as a reference to detect any assignable or significant deviation from such behavior. However, their value is limited for predictive activities, and even more so for control and optimization tasks. Prediction can only be carried out under the same conditions found in the data used to construct the models. As a corollary, only when all the NOC correlations linking the input variables are respected can the model be used for prediction.
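As a rough illustration of how NOC data can serve as a reference for monitoring, the sketch below fits a principal component analysis (PCA) model to normal-operation data and flags new observations whose Hotelling's T² statistic exceeds an empirical control limit. This is a minimal sketch under our own assumptions (synthetic data, three components, a 99th-percentile limit); it is not the specific procedure described in the article.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for plant data (illustrative assumption, not real process data).
rng = np.random.default_rng(0)
noc_data = rng.normal(size=(1000, 8))   # historical observations under normal operating conditions
new_data = rng.normal(size=(50, 8))     # incoming observations to be monitored

# Fit the PCA reference model on scaled NOC data only.
mean, std = noc_data.mean(axis=0), noc_data.std(axis=0)
pca = PCA(n_components=3).fit((noc_data - mean) / std)

def hotelling_t2(x):
    """Hotelling's T^2 of each observation in the reduced PCA score space."""
    scores = pca.transform((x - mean) / std)
    return np.sum(scores**2 / pca.explained_variance_, axis=1)

# Use the empirical 99th percentile of T^2 on NOC data as the control limit;
# observations above it are flagged as deviations from normal operation.
limit = np.percentile(hotelling_t2(noc_data), 99)
alarms = hotelling_t2(new_data) > limit
print(f"{alarms.sum()} of {len(new_data)} new observations flagged as outside NOC")
```

A model of this kind can signal that the process has drifted away from the conditions represented in the NOC data, but, for the reasons discussed above, it cannot by itself predict outputs or support optimization outside that envelope.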

For process control and optimization activities, the process description must capture the actual influence of each manipulated input variable on the process outputs. Its construction requires experimentation — i.e. , the active collection of process data via a design of experiments (DOE) program for process optimization or via system identification (SI)...



Open access | Published: 04 August 2020

Moving back to the future of big data-driven research: reflecting on the social in genomics

  • Melanie Goisauf (ORCID: orcid.org/0000-0002-3909-8071),
  • Kaya Akyüz (ORCID: orcid.org/0000-0002-2444-2095) &
  • Gillian M. Martin (ORCID: orcid.org/0000-0002-5281-8117)

Humanities and Social Sciences Communications, volume 7, Article number: 55 (2020)


With the advance of genomics, specific individual conditions have received increased attention in the generation of scientific knowledge. This spans the extremes of the aim of curing genetic diseases and identifying the biological basis of social behaviour. In this development, the ways knowledge is produced have gained significant relevance, as the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory. This article argues that an in-depth discussion and critical reflection on the social configurations that are inscribed in, and reproduced by, genomic data-intensive research is urgently needed. This is illustrated by debating a recent case: a large-scale genome-wide association study (GWAS) on sexual orientation that suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). This case is analysed from three angles: (1) the demonstration of how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) the exploration of the ways that the (big) data-driven research is constituted by increasingly moving away from theory and methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) the demonstration of how the assumption of ‘free from theory’ in this case does not mean free of choices made, which are themselves restricted by the data that are available. In questioning how key sociological categories are incorporated in a wider scientific debate on genetic conditions and knowledge production, the article shows how underlying classifications and categorizations, which are inherently social in their production, can have wide-ranging implications. The conclusion cautions against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.


Introduction

With the advance of genomic research, specific individual conditions received increased attention in scientific knowledge generation. While understanding the genetic foundations of diseases has become an important driver for the advancement of personalized medicine, the focus of interest has also expanded from disease to social behaviour. These developments are embedded in a wider discourse in science and society about the opportunities and limits of genomic research and intervention. With the emergence of the genome as a key concept for ‘life itself’, understandings of health and disease, responsibility and risk, and the relation between present conditions and future health outcomes have shifted, impacting also the ways in which identities are conceptualized under new genetic conditions (Novas and Rose 2000 ). At the same time, the growing literature of postgenomics points to evolving understandings of what ‘gene’ and ‘environment’ are (Landecker and Panofsky 2013 ; Fox Keller 2014 ; Meloni 2016 ). The postgenomic genome is no longer understood as merely directional and static, but rather as a complex and dynamic system that responds to its environment (Fox Keller 2015 ), where the social as part of the environment becomes a signal for activation or silencing of genes (Landecker 2016 ). At the same time, genetic engineering, prominently known as the gene-editing technology CRISPR/Cas9, has received considerable attention, but also caused concerns regarding its ethical, legal and societal implications (ELSI) and governance (Howard et al. 2018 ; Jasanoff and Hurlbut 2018 ). Taking these developments together, the big question of nature vs. nurture has taken on a new significance.

Studies which aim to reveal how biology and culture are being put in relation to each other appear frequently and pursue a genomic re-thinking of social outcomes and phenomena, such as educational attainment (Lee et al. 2018) or social stratification (Abdellaoui et al. 2019). Yet, we also witness very controversial applications of biotechnology, such as the first known case of human germline editing by He Jiankui in China, which has impacted the scientific community both as an impetus for wide protests and insecurity about the future of gene-editing and its use, and as an instigation of calls towards public consensus to (re-)set boundaries on what is editable (Morrison and de Saille 2019).

Against this background, we are going to debate in this article a particular case that appeared within the same timeframe as these developments: a large-scale genome-wide association study (GWAS) on sexual orientation Footnote 1, which suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). Some scientists have been claiming sexual orientation to be partly heritable and trying to identify a genetic basis for sexual orientation for years (Hamer et al. 1993); however, this was the first time that genetic variants were identified as statistically significant and replicated in an independent sample. We consider this GWAS not only by questioning the ways genes are associated with “the social” within this research, but also by exploring how the complexity of the social is reduced through specific data practices in research.

The sexual orientation study also constitutes an interesting case to reflect on how knowledge is produced at a time when the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory (Meloni 2014). Large amounts of genomic data are needed to identify genetic variations and for finding correlations with different biological and social factors. The rise of the genome corresponds to the rise of big data as the collection and sharing of genomic data gains power with the development of big data analytics (Parry and Greenhough 2017). A growing number of correlations, e.g. in the genomics of educational attainment (Lee et al. 2018; Okbay et al. 2016), are being found that link the genome to the social, increasingly blurring the established biological/social divide. These could open up new ways of understanding life, and underpin the importance of culture, while, paradoxically, they may also carry the risk of new genetic determinism and essentialism. The changing understanding of the now molecularised and datafied body also illustrates the changing significance of empirical research and sociology (Savage and Burrows 2007) in the era of postgenomics and ‘datafication’ (Ruckenstein and Schüll 2017). These developments are situated within methodological debates in which social sciences often appear through the perspective of ELSI.

As the field of genomics is progressing rapidly and the intervention in the human genome is no longer science fiction, we argue that it is important to discuss and reflect now on the social configurations that are inscribed in, and reproduced by genomic data-driven research. These may co-produce the conception of certain potentially editable conditions, i.e. create new, and reproduce existing classifications that are largely shaped by societal understandings of difference and order. Such definitions could have real consequences—as Thomas and Thomas (1929) remind us—for individuals and societies, and mark what has been described as an epistemic shift in biomedicine from the clinical gaze to the ‘molecular gaze’ where the processes of “medicalisation and biomedicalisation both legitimate and compel interventions that may produce transformations in individual, familial and other collective identities” (Clarke et al. 2013, p. 23). While Science and Technology Studies (STS) has demonstrated how science and society are co-produced in research (Jasanoff 2004), we want to use the momentum of the current discourse to critically reflect on these developments from three angles: (1) we demonstrate how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) we explore the ways that the (big) data-driven research is constituted by increasingly moving away from theory and methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a) and (3) using the GWAS case in focus, we show how the assumption of ‘free from theory’ (Kitchin 2014a) in this case does not mean free of choices made, choices which are themselves restricted by data that are available. We highlight Griffiths’ (2016) contention that the material nature of genes, their impacts on biological makeup of individuals and their socially and culturally situated behaviour are not deterministic, and need to be understood within the dynamic, culturally and temporally situated context within which knowledge claims are made. We conclude by making the important point that ignoring the social may lead to a distorted, datafied, genomised body which ignores the key fact that “genes are not stable but essentially malleable” (Prainsack 2015) and that this ‘malleability’ is rooted in the complex interplay between biological and social environments.

From this perspective, the body is understood through the lens of embodiment, considering humans ‘live’ their genome within their own lifeworld contexts (Rehmann-Sutter and Mahr 2016 ). We also consider this paper as an intervention into the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

In the following reflections, we proceed step by step: First, we introduce the case of the GWAS on same-sex sexual behaviour, as well as its limits, context and impact. Second, we recall key sociological theory on categorizations and their implications. Third, we discuss the emergence of a digital-datafication of scientific knowledge production. Finally, we conclude by cautioning against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.

Studying sexual orientation: The case of same-sex sexual behaviour

Currently, a number of studies at the intersection of genetic and social conditions appear on the horizon. Just as in the examples we have already mentioned, such as those on educational attainment (Lee et al. 2018 ), or social stratification (Abdellaoui et al. 2019 ), it is important to note that the limit to such studies is only the availability of the data itself. In other words, once the data is available, there is always the potential that it would eventually be used. This said, an analysis of the entirety of the genomic research on social outcomes and behaviour is beyond the scope of this article. Therefore, we want to exemplify our argument with reference to the research on the genetics of same-sex sexual behaviour.

Based on a sample of half a million individuals of European ancestry, the first large-scale GWAS of its kind claims five genetic variants to be contributing to the assessed “same-sex sexual behaviour” (Ganna et al. 2019b ). Among these variants, two are useful only for male–male sexual behaviour, one for female–female sexual behaviour, and the remaining two for both. The data that has led to this analysis was sourced from biobanks/cohorts with different methods of data collection. The authors conclude that these genetic variations are not predictive of sexual orientation; not only because genetics is supposedly only part of the picture, but also because the variations are only a small part (<1% of the variance in same-sex sexual behaviour, p. 4) of the approximated genetic basis (8–25% of the variance in same-sex sexual behaviour) that may be identified with large sample sizes (p. 1). The study is an example of how the ‘gay gene’ discourse that has been around for years, gets transformed with the available data accumulating in the biobanks and the consequent genomic analysis, offering only one facet of a complex social phenomenon: same-sex sexual behaviour.

The way the GWAS has been conducted was not novel in terms of data collection. Genome-wide studies of similar scale, e.g. on insomnia (Jansen et al. 2019 ) or blood pressure (Evangelou et al. 2018 ), often rely on already collected data in biobanks rather than trying to collect hundreds of thousands of individuals’ DNA from scratch. Furthermore, in line with wider developments, the study was preregistered Footnote 2 with an analysis plan for the data to be used by the researchers. Unlike other GWASes, however, the researchers partnered with an LGBTQIA+ advocacy group (GLAAD) and a science communication charity (Sense About Science), where individuals beyond the research team interpreted the findings and discussed how to convey the results Footnote 3 . Following these engagements, the researchers have produced a website Footnote 4 with potential frequently asked questions as well as a video about the study, highlighting what it does and what it does not claim.

Despite efforts to control the drifting away of the study into genetic deterministic and discriminatory interpretations, the study has been criticized by many Footnote 5 . Indeed, the controversial “How gay are you?” Footnote 6 app on the GenePlaza website utilized the findings of the study, which in turn raised the alarm bells and, ultimately, was taken down after much debate. The application, however, showed how rapidly such findings can translate into individualized systems of categorization, and consequently feed into and be fed by the public imaginary. One of the study authors demands continuation of research by noting “[s]cientists have a responsibility to describe the human condition in a more nuanced and deeper way” (Maxmen, 2019 , p. 610). Critics, however, note that the context of data collected from the individuals may have influence on the findings; for instance, past developments (i.e. decriminalization of homosexuality, the HIV/AIDS epidemic, and legalization of same-sex marriage) are relevant to understand the UK Biobank’s donor profile and if the GWAS were to be redone according to the birth year of the individuals, different findings could have come out of the study (Richardson et al. 2019 , p. 1461).

It has been pointed out that such research should be assessed by a competent ethical review board according to its potential risks and benefits (Maxmen 2019 , p. 610), in addition to the review and approval by the UK Biobank Access Sub-Committee (Ganna et al. 2019a , p. 1461). Another ethical issue of concern raised by critics is that the informed consent form of UK Biobank does not specify that it could be used for such research since “homosexuality has long been removed from disease classifications” and that the broad consent forms allow only “health-related research” (Holm and Ploug 2019 , p. 1460). We do not want to make a statement here for or against broad consent. However, we argue that discussions about informed consent showcase the complexities related to secondary use of data in research. Similarly, the ‘gay gene’ app developed in the wake of the sexual orientation study, revealed the difficulty of controlling how the produced knowledge may be used, including in ways that are openly denounced by the study authors.

To the best of our knowledge, there have not been similar genome-wide studies published on sexual orientation and, while we acknowledge the limitations associated with focusing on a single case in our discussion, we see this case as relevant to opening up the following question: How are certain social categorizations incorporated into the knowledge production practices? We want to answer this by first revisiting some of the fundamental sociological perspectives into categorizations and the social implications these may have.

Categorizing sex, gender, bodies, disease and knowledge

Sociological perspectives on categorizations.

Categorizations and classifications take a central role in the sociology of knowledge, social stratifications and data-based knowledge production. Categories like gender, race, sexuality and class (and their intersection, see Crenshaw 1989 ) have become key classifications for the study of societies and in understanding the reproduction of social order. One of the most influential theories about the intertwining of categories like gender and class with power relations was formulated by Bourdieu ( 2010 , 2001 ). He claimed that belonging to a certain class or gender is an embodied practice that ensures the reproduction of social structure which is shaped by power relations. The position of subjects within this structure reflects the acquired cultural capital, such as education. Incorporated dispositions, schemes of perception, appreciation, classification that make up the individual’s habitus are shaped by social structure, which actors reproduce in practices. One key mechanism of social categorization is gender classification. The gender order appears to be in the ‘nature of things’ of biologically different bodies, whereas it is in fact an incorporated social construction that reflects and constitutes power relations. Bourdieu’s theory links the function of structuring classifications with embodied knowledge and demonstrates that categories of understanding are pervaded by societal power relations.

In a similar vein Foucault ( 2003 , 2005 ) describes the intertwining of ordering classifications, bodies and power in his study of the clinic. Understandings of and knowledge about the body follow a specific way of looking at it—the ‘medical gaze’ of separating the patient’s body from identity and distinguishing healthy from the diseased, which, too, is a process pervaded by power differentials. Such classifications evolved historically. Foucault reminds us that all periods in history are characterized by specific epistemological assumptions that shape discourses and manifest in modalities of order that made certain kinds of knowledge, for instance scientific knowledge, possible. The unnoticed “order of things”, as well as the social order, is implemented in classifications. Such categorizations also evolved historically for the discourse about sexuality, or, in particular as he pointed out writing in the late 1970s, distinguishing sexuality of married couples from other forms, such as homosexuality (Foucault 1998 ).

Bourdieu and Foucault offer two influential approaches within the wider field of sociology of knowledge that provide a theoretical framework on how categorizations and classifications structure the world in conjunction with social practice and power relations. Their work demonstrates that such structuration is never free from theory, i.e. categorizations do not exist prediscursively, but are embedded within a certain temporal and spatial context that constitutes ‘situated knowledge’ (Haraway 1988). Consequently, classifications create (social) order that cannot be understood as ‘naturally’ given but as a result of relational social dynamics embedded in power differentials.

Feminist theory in the 1970s emphasized the inherently social dimension of male and female embodiment, which distinguished between biological sex and socially rooted gender. This distinction built the basis for a variety of approaches that examined gender as a social phenomenon, as something that is (re-)constructed in social interaction, impacted by collectively held beliefs and normative expectations. Consequently, the difference between men and women was no longer simply understood as a given biological fact, but as something that is, also, a result of socialization and relational exchanges within social contexts (see, e.g., Connell 2005 ; Lorber 1994 ). Belonging to a gender or sex is a complex practice of attribution, assignment, identification and, consequently, classification (Kessler and McKenna 1978 ). The influential concept of ‘doing gender’ emphasized that not only the gender, but also the assignment of sex is based on socially agreed-upon biological classification criteria, that form the basis of placing a person in a sex category , which needs to be practically sustained in everyday life. The analytical distinction between sex and gender became eventually implausible as it obscures the process in which the body itself is subject to social forces (West and Zimmerman 1991 ).

In a similar way, sexual behaviour and sexuality are also shaped by society, as societal expectations influence sexual attraction—in many societies within normative boundaries of gender binary and heteronormativity (Butler 1990 ). This also had consequences for a deviation from this norm, resulting for example in the medicalisation of homosexuality (Foucault 1998 ).

Reference to our illustrative case study on the recently published research into the genetic basis of sexuality brings the relevance of this theorization into focus. The study cautions against the ‘gay gene’ discourse, the use of the findings for prediction, and genetic determinism of sexual orientation, noting “the richness and diversity of human sexuality” and stressing that the results do not “make any conclusive statements about the degree to which ‘nature’ and ‘nurture’ influence sexual preference” (Ganna et al. 2019b , p. 6).

Coming back to categorizations, more recent approaches from STS are also based on the assumption that classifications are a “spatio-temporal segmentation of the world” (Bowker and Star 2000, p. 10), and that classification systems are, similar to concepts of gender theory (e.g. Garfinkel 1967), consistent, mutually exclusive and complete. The “International Classification of Diseases (ICD)”, a classification scheme of diseases based on their statistical significance, is an example of such a historically grown knowledge system. How the ICD is utilized in practice points to the ethical and social dimensions involved (Bowker and Star 2000). Such approaches help to unravel current epistemological shifts in medical research and intervention, including the removal of homosexuality from the disease classification half a century ago.

Re-classifying diseases in tandem with genetic conditions creates new forms of ‘genetic responsibilities’ (Novas and Rose 2000). For instance, this may result in a change of the ‘sick role’ (described early in Parsons 1951) in creating new obligations not only for the diseased but also for actually healthy persons in relation to potential futures. Such genetic knowledge is increasingly produced using large-scale genomic databases and creates new categories based on genetic risk, and consequently, may result in new categories of individuals that are ‘genetically at risk’ (Novas and Rose 2000). The question now is how these new categories will alter, structure or replace evolved categories, in terms of constructing the social world and medical practice.

While advancement in genomics is changing understandings of bodies and diseases, the meanings of certain social categories for medical research remain rather stable. Developments of personalized medicine go along with “the ‘re-inscription’ of traditional epidemiological categories into people’s DNA” and adherence to “old population categories while working out new taxonomies of individual difference” (Prainsack 2015 , pp. 28–29). This, again, highlights the fact that knowledge production draws on and is shaped by categories that have a political and cultural meaning within a social world that is pervaded by power relations.

From categorization to social implication and intervention

While categorizations are inherently social in their production, their use in knowledge production has wide-ranging implications. Such is the case of how geneticisation of sexual orientation has been an issue that troubled and comforted the LGBTQIA+ communities. Despite the absence of an identified gene, the ‘gay gene’ has been part of societal discourse. Such circulation disseminates an unequal emphasis on the biologized interpretations of sexual orientation, which may be portrayed differently in media and appeal to groups of opposing views in contrasting ways (Conrad and Markens 2001). Geneticisation, especially through media, moves sexual orientation to an oppositional framework between individual choice and biological consequence (Fausto-Sterling 2007), and there have been mixed opinions within LGBTQIA+ communities about whether this would resolve the moralization of sexual orientation or be a move back into its medicalisation (Nelkin and Lindee 2004). Thus, while some activists support geneticisation, others resist it and work against the potential medicalisation of homosexuality (Shostak et al. 2008). The ease of communicating to the general public a simple genetic basis for complex social outcomes, which are genetically more complex than reported, contributes to the geneticisation process, while the scientific failures of replicating ‘genetic basis’ claims do not get reported (Conrad 1999). In other words, while finding a genetic basis becomes entrenched as an idea in the public imaginary, research showing the opposite does not get an equal share in the media and societal discourse, neither of course does the social sciences’ critique of knowledge production that has been discussed for decades.

A widely, and often quantitatively, studied aspect of geneticisation of sexual orientation is how this plays out in the broader understanding of sexual orientation in society. While there are claims that geneticisation of sexual orientation can result in depoliticization of the identities (O’Riordan 2012), it may at the same time lead to polarization of society. According to social psychologists, genetic attributions to conditions are likely to lead to perceptions of immutability, specificity in aetiology, homogeneity and discreteness as well as naturalistic fallacy (Dar-Nimrod and Heine 2011). Despite the multitude of surveys suggesting that belief in a genetic basis of homosexuality correlates with acceptance, some studies suggest that learning about genetic attribution to homosexuality can be polarizing and confirmatory of previously held negative or positive attitudes (Boysen and Vogel 2007; Mitchell and Dezarn 2014). Such conclusions can be taken as a precaution that just as scientific knowledge production is social, its consequences are, too.

Looking beyond the case

We want to exemplify this argument by taking a detour to another case where the intersection between scientific practice, knowledge production and the social environment is of particular interest. While we have discussed the social implications of geneticisation with a focus on sexual orientation, recent developments in biomedical sciences and biotechnology also have the potential to reframe the old debates in entirely different ways. For instance, while ‘designer babies’ were only an imaginary concept until recently, the facility and affordability of processes, such as in vitro selection of baby’s genotype and germline genome editing, have potentially important impacts in this regard. When CRISPR/Cas9 technique was developed for rapid and easy gene editing, both the hopes and worries associated with its use were high. Martin and others ( 2020 , pp. 237–238) claim gene editing is causing both disruption within the postgenomic regime, specifically to its norms and practices, and the convergence of various biotechnologies such as sequencing and editing. Against this background, He Jiankui’s announcement in November 2018 through YouTube Footnote 7 that twins were born with edited genomes was an unwelcome surprise for many. This unexpected move may have hijacked the discussions on ethical, legal, societal implications of human germline genome-editing, but also rang the alarm bells across the globe for similar “rogue” scientists planning experimentation with the human germline (Morrison and de Saille 2019 ). The facility to conduct germline editing is, logically, only one step away from ‘correcting’ and if there is a correction, then that would mean a return to a normative state. He’s construction of HIV infection as a genetic risk can be read as a placeholder for numerous questions to human germline editing: What are the variations that are “valuable” enough for a change in germline? For instance, there are plans by Denis Rebrikov in Russia to genome edit embryos to ‘fix’ a mutation that causes congenital deafness (Cyranoski 2019 ). If legalized, what would be the limits applied and who would be able to afford such techniques? At a time when genomics research into human sociality is booming, would the currently produced knowledge in this field and others translate into ‘corrective’ genome-editing? Who would decide?

The science itself is still unclear at this stage: for many complex conditions, the effect of using gene editing to change one allele to another is often minuscule, considering that numerous alleles together may affect phenotypes, while at the same time a single allele may affect multiple phenotypes. In another GWAS case, social genomicists claim there are thousands of variations that are found to be influential for a particular social outcome such as educational attainment (Lee et al. 2018), with each having minimal effect. It has also been shown in the last few years that, as the same study is conducted with ever larger samples, more genomic variants are associated with the social outcome, i.e. 74 single nucleotide polymorphisms (SNPs) associated with the outcome in a sample size of 293,723 (Okbay et al. 2016) and 1271 SNPs in a sample size of 1.1 million individuals (Lee et al. 2018).

Applying this reasoning to the GWAS on same-sex sexual behaviour, it is highly probable that the findings will be superseded in the following years by similar studies with bigger data, increasing the number of associations.

A genomic re-thinking?

The examples outlined here have served to show how focusing the discussion on “genetic determinism” is fruitless considering the complexity of the knowledge production practices and how the produced knowledge could both mirror social dynamics and shape these further. Genomic rethinking of the social necessitates a new formulation of social equality, where genomes are also relevant. Within the work of social genomics researchers, there has been cautious optimism toward the contribution of findings from genomics research to understanding social outcomes of policy change (Conley and Fletcher 2018 ; Lehrer and Ding 2019 ). Two fundamental thoughts govern this thinking. First, genetic basis is not to be equalized with fate; in other words, ‘genetic predispositions’ make sense only within the broader social and physical environmental frame, which often allows room for intervention. Second, genetics often relates to heterogeneity of the individuals within a population, in ways that the same policy may be positive, neutral or negative for different individuals due to their genes. In this respect, knowledge gained via social genomics may be imagined as a basis for a more equal society in ‘uncovering’ invisible variables, while, paradoxically, it may also be a justification for exclusion of certain groups. For example, a case that has initially raised the possibility that policies affect individuals differently because of their genetic background was a genetic variant that was correlated to being unaffected by tax increases on tobacco (Fletcher 2012 ). The study suggested that raising the taxes may be an ineffective tool for lowering smoking rates below a certain level, since those who are continuing to smoke may be those who cannot easily stop due to their genetic predisposition to smoking. Similar ideas could also apply to a diverse array of knowledge produced in social genomics, where the policies may be under scrutiny according to how they are claimed to variably influence the members of a society due to their genetics.

Datafication of scientific knowledge production

From theory to data-driven science.

More than a decade has gone by since Savage and Burrows ( 2007 ) described a crisis in empirical research, where the well-developed methodologies for collecting data about the social world would become marginal as such data are being increasingly generated and collected as a by-product of daily virtual transactions. Today, sociological research faces a widely datafied world, where (big) data analytics are profoundly changing the paradigm of knowledge production, as Facebook, Twitter, Google and others produce large amounts of socially relevant data. A similar phenomenon is taking place through opportunities that public and private biobanks, such as UK Biobank or 23andMe, offer. Crossing the boundaries of social sciences and biological sciences is facilitated through mapping correlations between genomic data, and data on social behaviour or outcomes.

This shift from theory to data-driven science misleadingly implies a purely inductive knowledge production, neglecting the fact that data is not produced free of preceding theoretical framing, methodological decisions, technological conditions and the interpretation of correlations—i.e. an assemblage situated within a specific place, time, political regime and cultural context (Kitchin 2014a ). It glosses over the fact that data cannot simply be treated as raw materials, but rather as “inherently partial, selective and representative”, the collection of which has consequences (Kitchin 2014b , p. 3). How knowledge of the body is generated starts with how data is produced and how it is used and mobilized. Through sequencing, biological samples are translated into digital data that are circulated and merged and correlated with other data. With the translation from genes into data, their meaning also changes (Saukko 2017 ). The kind of knowledge that is produced is also not free of scientific and societal concepts.

Categorical variables assigned to individual genomes have become important for genomic research and are impacting the ways in which identities are conceptualized under (social) genomic conditions. These characteristics include those of social identity, such as gender, ethnicity, educational and socioeconomic status. They are often used for the study of human genetic variation and individual differences, with the aim of advancing personalized medicine, and are based on demographic and ascribed social characteristics.

The sexual orientation study that is central to this paper can be read as a case where such categories intersect with the mode of knowledge production. As the largest contributor of data to the study, UK Biobank’s data used in this research are revealing since they are based on the answer to the following question “Have you ever had sexual intercourse with someone of the same sex?” along with the statement “Sexual intercourse includes vaginal, oral or anal intercourse.” Footnote 8 .

Furthermore, the authors accept having made numerous reductive assumptions and that their study has methodological limitations. For instance, Ganna et al. ( 2019b ) acknowledge both within the article (p. 1) and an accompanying website Footnote 9 that the research is based on a binary ‘sex’ system with exclusions of non-complying groups as the authors report that they “dropped individuals from [the] study whose biological sex and self-identified sex/gender did not match” (p. 2). However, both categorizing sexual orientation mainly on practice rather than attraction or desire, and building it on normative assumptions about sexuality, i.e. gender binary and heteronormativity, are problematic, as sexual behaviour is diverse and does not necessarily correspond with such assumptions.

The variations found in the sexual orientation study, as is true for other genome-wide association studies, are often relevant for the populations studied and in this case, those mainly belong to certain age groups and European ancestry. While the study avoids critique in saying that their research is not genetics of sexual orientation, but rather of same-sex sexual behaviour, whether such a genomic study would be possible is also questionable. This example demonstrates that, despite the increasing influence of big data, a fundamental problem with the datafication of many social phenomena is whether or not they are amenable to measurement. In the case of sexual orientation, whether the answer to the sexual orientation questions corresponds to the “homosexuality” or “willingness to reveal homosexuality”/“stated sexual orientation” is debatable, considering the social pressure and stigma that may be an element in certain social contexts (Conley 2009 , p. 242).

While our aim is to bring a social scientific perspective, biologists have raised at least two different critical opinions on the knowledge production practice here in the case of the sexual orientation study, first on the implications of the produced knowledge Footnote 10 and second on the problems and flaws of the search for a genetic basis Footnote 11. In STS, however, genetic differences that were hypothesized to be relevant for health, especially under the category of race in the US, have been a major point of discussion within the genomic ‘inclusion’ debates of the 1990s (Reardon 2017, p. 49; Bliss 2015). In other words, a point of criticism towards the knowledge production was the focus on certain “racial” or racialized groups, such as Americans of European ancestry, which supposedly biased the findings and downstream development of therapies for ‘other’ groups. However, measuring health and medical conditions against the background of groups that are constituted based on social or cultural categories (e.g. age, gender, ethnicity) may also result in a reinscription/reconstitution of social inequalities attached to these categories (Prainsack 2015) and at the same time result in health justice being a topic seen through a postgenomics lens, where postgenomics is “a frontline weapon against inequality” (Bliss 2015, p. 175). Socio-economic factors may recede into the background, while data with their own often invisible politics are foregrounded.

Unlike what Savage and Burrows suggested in 2007, the coming crisis can not only be seen as a crisis of sociology, but of science in general. Just as the shift of focus in social sciences towards digital data is only one part of the picture, another part could be the developments in genomisation of the social. Considering that censuses and large-scale statistics are not new, the distinction of the current phenomenon is possibly the opportunity to individualize the data, while categories themselves are often unable to capture the complexity, despite producing knowledge more efficiently. In that sense, the above-mentioned survey questions do not do justice to the complexity of social behaviour. What is most important to flag within these transformations is the lack of reflexivity regarding how big data comes to represent the world and whether it adds and/or takes away from the ways of knowing before big data. These developments and directions of genetic-based research and big data go far beyond the struggle of a discipline, namely sociology, with a paradigm shift in empirical research. They could set the stage for real consequences for individuals and groups. Just as what is defined as an editable condition happens as a social process that relies on socio-political categories, the knowledge acquired from big data relies in similar way on the same kind of categories.

The data choices and restrictions: ‘Free from theory’ or freedom of choice

Data, broadly understood, have become a fundamental part of our lives, from accepting and granting different kinds of consent for our data to travel on the internet, to gaining the ‘right to be forgotten’ in certain countries, as well as being able to retrieve collected information about ourselves from states, websites, even supermarket chains. While becoming part of our lives, the data collected about individuals in the form of big data is transferred between academic and non-academic research, scientific and commercial enterprises. The associated changes in the knowledge production have important consequences for the ways in which we understand and live in the world (Jasanoff 2004 ). The co-productionist perspective in this sense does not relate to whether or how the social and the biological are co-produced, but rather it is pointing to how produced knowledge in science is both shaped by and shaping societies. Thus, the increasing impact and authority of big data in general, and within the sexual orientation study in focus here, opens up new avenues to claim as some suggest, that we have reached the end of theory.

The “end of theory” has actively been debated within and beyond science. Kitchin (2014a) locates the recent origin of this debate in a piece in Wired, where the author states “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson 2008). Others call this a paradigm shift towards data-intensive research leaving behind the empirical and theoretical stages (Gray 2009, p. xviii). While Google and others form the basis for this data-driven understanding in their predictive capacity or letting the data speak, the idea that knowledge production is ‘free from theory’ in this case seems to be, at best, an ignorance of any data infrastructure and how the categories are formed within it.

Taking a deeper look at the same-sex sexual behaviour study from this angle suggests that such research cannot be free from theory as it has to make an assumption regarding the role of genetics in the context of social dynamics. In other words, it has to move sexual orientation, at least partially in the form of same-sex sexual behaviour, out of the domain of the social towards the biological. In doing so, just as the study concludes the complexity of sexual orientation, the authors note in their informative video Footnote 12 on their website, that “they found that about a third of the differences between people in their sexual behaviour could be explained by inherited genetic factors. But the environment also plays a large role in shaping these differences.” While the study points to a minuscule component of the biological, it also frames biology as the basis on which the social, as part of the environment, acts upon.

Reconsidering how the biology and the social are represented in the study, three theoretical choices are made due to the limitation of the data. First of all, the biological is taken to be “the genome-wide data” in the biobanks that the study relies on. This means sexual orientation is assumed to be within the SNPs, or points on the genome that are common variations across a population, and not in other kinds of variations that are rare or not captured by the genotyped SNPs. These differences include, but are not limited to, large-scale to small-scale duplications and deletions of the genomic regions, rare variants or even common variants in the population that the SNP chips do not capture. Such ignored differences are very important for a number of conditions, from cancer to neurobiology. Similarly, the genomic focus leaves aside the epigenetic factors that could theoretically be the missing link between genomes and environments. In noting this, we do not suggest that the authors of the study are unaware or uninterested in epigenetics; however, regardless of their interest and/or knowledge, the availability of large-scale genome-wide data puts such data ahead of any other variation in the genome and epigenome. In other words, if the UK Biobank and 23andMe had similar amounts of epigenomic or whole genome data beyond the SNPs, the study would have most possibly relied on these other variations in the genome. The search for genetic basis within SNPs is a theoretical choice, and in this case this choice is pre-determined by the limitations of the data infrastructures.

The second choice that the authors make is to take three survey questions, i.e. in the case of UK Biobank data, as encompassing enough of the complexity of sexual orientation for their research. As partly discussed earlier, these questions are simply asking about sexual behaviour. Based on the UK Biobank’s definition of sexual intercourse as “vaginal, oral or anal intercourse” the answers to the following questions were relevant for the research: “Have you ever had sexual intercourse with someone of the same sex?” (Data-Field 2159), “How many sexual partners of the same sex have you had in your lifetime?” (Data-Field 3669), and, “About how many sexual partners have you had in your lifetime?” (Data-Field 2149). Answers to such questions do little justice to the complexity of the topic. Considering that they are not included in the biobank as data for the purpose of identifying a genetic basis to same-sex sexual behaviour, there is much to consider in what capacity they are useful for that. It is worth noting here that the UK Biobank is primarily focused on health-related research, and thus these three survey questions could not have been asked with a genomic exploration of ‘same-sex sexual behaviour’ or ‘sexual orientation’ in mind. The degree of success in the way they have been used to identify the genetic basis for complex social behaviours is questionable.

The authors of the study consider the UK Biobank sample to be composed of relatively old individuals and this to be a shortcoming Footnote 13. Similarly, the study authors claim that 23andMe samples may be biased because “[i]ndividuals who engage in same-sex sexual behaviour may be more likely to self-select the sexual orientation survey”, which then explains the high percentage of such individuals (18.9%) (Ganna et al. 2019b, p. 1). However, the authors do not problematize that there is at least a three-fold difference between the youngest and oldest generation in the UK Biobank sample in their response to the same-sex sexual behaviour question (Ganna et al. 2019b, p. 2). The study, thus, highlights the problematic issue of who should be regarded as the representative sample to be asked about their “same-sex sexual behaviour”. Still, this is a data choice that the authors make in concluding a universal explanation out of a very specific and socially constrained collection of self-reported data that encompasses only part of what the researchers are interested in.

The third choice is a choice unmade. The study data mainly came from UK Biobank, following a proposal by Brendan Zietsch with the title “Direct test whether genetic factors predisposing to homosexuality increase mating success in heterosexuals” Footnote 14. The original research plan frames “homosexuality” as a condition that heterosexuals can be “predisposed” to and, as this condition is not eliminated through evolution, scientists hypothesize that whatever genetic variation predisposes an individual to homosexuality may also be functional in increasing the individual’s reproductive capacity. Despite using such an evolutionary explanation as the theoretical basis for obtaining the data from the UK Biobank, the authors use evolution/evolutionary only three times in the article, whereas the concept “mating success” is totally missing. Unlike the expectation in the research proposal, the authors observe a lower number of offspring for individuals reporting same-sex sexual behaviour, and they conclude briefly “This reproductive deficit raises questions about the evolutionary maintenance of the trait, but we do not address these here” (Ganna et al. 2019b, p. 2). In other words, the hypothesis that allowed scientists to acquire the UK Biobank data becomes irrelevant for the researchers when they are reporting their findings.

In this section, we have analysed how data choices are made at different steps of the research and hinted at how these choices reflect certain understandings of how society functions. These are evident in the ways sexual behaviour is represented and categorized according to quantitative data, and in the considerations of whether certain samples are contemporary enough (UK Biobank) or too self-selecting (the share of same-sex sexual behaviour being too high in 23andMe). The study, however, does not problematize how the percentage of individuals reporting same-sex sexual behaviour steadily increases with year of birth, at least tripling for males and increasing more than five-fold for females between 1940 and 1970 (for the UK Biobank). Such details are among the data that the authors display as descriptive statistics in Fig. 1 (Ganna et al. 2019b , p. 2); however, they do not attract the kind of discussion that the genomic data receive. The study itself starts from the idea that genetic markers associated with same-sex sexual behaviour could have an evolutionary advantage and ends by saying that the behaviour is complex. Critics claim the “approach [of the study] implies that it is acceptable to issue claims of genetic drivers of behaviours and then lay the burden of proof on social scientists to perform post-hoc socio-cultural analysis” (Richardson et al. 2019 , p. 1461).

In this paper, we have ‘moved back to the future’, taking stock of the present-day accelerated impact of big data and of its potential and real consequences. Using the sexual orientation GWAS as a point of reference, we have shown that claims to be working under the premise of a ‘pure science’ of genomics are untenable, as the social is present by default: within the methodological choices made by the researchers, in the impact on/of the social imaginary, and in the epigenetic context.

By focusing on the contingency of knowledge production on social categories that are themselves reflections of the social in data practices, we have highlighted the relational processes at the root of knowledge production. We are experiencing a period in which the repertoire of what gets quantified continuously, and possibly exponentially, increases. This does not necessarily mean that our understanding of complexity increases at the same rate; rather, it may lead to unintended simplification, where meaningful levels of understanding of causality are lost in the “triumph of correlations” in big data (Mayer-Schönberger and Cukier 2013 ; cited in Leonelli 2014 ). While sociology has much to offer through its qualitative roots, we think it should do more than critique, especially considering that culturally and temporally specific understandings of the social are also linked to socio-material consequences.

We want to highlight that now is the time to think about the broader developments in science and society, not merely from an external perspective, but within a new framework. Clearly, our discussion of a single case cannot sustain suggestions for a comprehensive framework applicable to any study; however, we can flag the urgency of the need for one. We have shown that, in the context of the rapid developments within big data-driven and socio-genomic research, it is necessary to renew the argument for bringing the social, and its interrelatedness with the biological, clearly back into focus. We strongly believe that reemphasizing this argument is essential to underline the analytical strength of the social science perspective, and to avoid losing sight of the complexity of social phenomena, which risk being oversimplified in mainly statistical, data-driven science.

We can also identify three interrelated dimensions of scientific practice that the framework would valorize: (1) recognition of the contingency of choices made within the research process, and sensitivity to their consequent impact within the social context; (2) ethical responsibilities that move beyond procedural, contractual requirements towards sustaining a process rooted in a clear understanding of societal environments; and (3) interdisciplinarity in analytical practice that potentiates the impact of each perspectival lens.

Such a framework would facilitate moving out of the disciplinary or institutionalized silos of ELSI, STS, sociology, genetics, or even emerging social genomics. Rather than competing for authority on ‘the social’, the aim should be to critically complement each other and refract the produced knowledge with a multiplicity of lenses. Zooming ‘back to the future’ within the field of socio-biomedical science, we would flag the necessity of re-calibrating to a multi-perspectival endeavour—one that does justice to the complex interplay of social and biological processes within which knowledge is produced.

The GWAS primarily uses the term “same-sex sexual behaviour” as one facet of “sexual orientation”, where the former is the component that is directly associable with the genes and the latter the broader phenomenon of interest. Thus, while the article refers to “same-sex sexual behaviour” in its title, it is editorially presented in the same Science issue under the Human Genetics heading with the subheading “The genetics of sexual orientation” (p. 880) (see Funk 2019 ). Furthermore, the request for data from the UK Biobank by the corresponding author Brendan P. Zietsch (see footnote 14) refers only to sexual orientation and homosexuality and not to same-sex sexual behaviour. We therefore follow the same interchangeable use in this article.

Source: https://osf.io/xwfe8 (04.03.2020).

Source: https://www.wsj.com/articles/research-finds-genetic-links-to-same-sex-behavior-11567101661 (04.03.2020).

Source: https://geneticsexbehavior.info (04.03.2020).

In addition to footnotes 10 and 11, for a discussion please see: https://www.nytimes.com/2019/08/29/science/gay-gene-sex.html (04.03.2020).

Later “122 Shades of Grey”: https://www.geneplaza.com/app-store/72/preview (04.03.2020).

Source: https://www.youtube.com/watch?v=th0vnOmFltc (04.03.2020).

Source: http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2159 (04.03.2020).

Source: https://geneticsexbehavior.info/ (04.03.2020).

Source: https://www.broadinstitute.org/blog/opinion-big-data-scientists-must-be-ethicists-too (04.03.2020).

Source: https://medium.com/@cecilejanssens/study-finds-no-gay-gene-was-there-one-to-find-ce5321c87005 (03.03.2020).

Source: https://videos.files.wordpress.com/2AVNyj7B/gosb_subt-4_dvd.mp4 (04.03.2020).

Source: https://geneticsexbehavior.info/what-we-found/ (04.03.2020).

Source: https://www.ukbiobank.ac.uk/2017/04/direct-test-whether-genetic-factors-predisposing-to-homosexuality-increase-mating-success-in-heterosexuals/ (04.03.2020).

Abdellaoui A, Hugh-Jones D, Yengo L, Kemper KE, Nivard MG, Veul L, Holtz Y, Zietsch BP, Frayling TM, Wray NR (2019) Genetic correlates of social stratification in Great Britain. Nat Hum Behav 1–21. https://doi.org/10.1038/s41562-019-0757-5

Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete, Wired https://www.wired.com/2008/06/pb-theory/ . Accessed 31 Mar 2020

Bliss C (2015) Defining health justice in the postgenomic era. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 174–191

Bourdieu P (2001) Masculine domination. Stanford University Press, Stanford

Bourdieu P (2010) Distinction: a social critique of the judgement of taste. Routledge, London/New York

Bowker GC, Star SL (2000) Sorting things out: classification and its consequences. MIT Press, Cambridge/London

Boysen GA, Vogel DL (2007) Biased assimilation and attitude polarization in response to learning about biological explanations of homosexuality. Sex Roles 57(9–10):755–762. https://doi.org/10.1007/s11199-007-9256-7

Butler J (1990) Gender trouble. Feminism and the subversion of identity. Routledge, New York

Clarke AE, Shim JK, Shostak S, Nelson A (2013) Biomedicalising genetic health, diseases and identities. In: Atkinson P, Glasner P, Lock M (eds) Handbook of genetics and society: mapping the new genomic era. Routledge, Oxon, pp. 21–40

Conley D (2009) The promise and challenges of incorporating genetic data into longitudinal social science surveys and research. Biodemogr Soc Biol 55(2):238–251. https://doi.org/10.1080/19485560903415807

Conley D, Fletcher J (2018) The genome factor: what the social genomics revolution reveals about ourselves, our history, and the future. Princeton University Press, Princeton/Oxford

Connell RW (2005) Masculinities. Polity, Cambridge

Conrad P (1999) A mirage of genes. Sociol Health Illn 21(2):228–241. https://doi.org/10.1111/1467-9566.00151

Conrad P, Markens S (2001) Constructing the ‘gay gene’ in the news: optimism and skepticism in the US and British press. Health 5(3):373–400. https://doi.org/10.1177/136345930100500306

Crenshaw K (1989) Demarginalizing the intersection of race and sex: a black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, vol 1989(8). University of Chicago Legal Forum. http://chicagounbound.uchicago.edu/uclf/vol1989/iss1/8 . Accessed 1 Apr 2020

Cyranoski D (2019) Russian ‘CRISPR-baby’ scientist has started editing genes in human eggs with goal of altering deaf gene. Nature 574(7779):465–466. https://doi.org/10.1038/d41586-019-03018-0

Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of DNA. Psychol Bull 137(5):800–818. https://doi.org/10.1037/a0021860

Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, Ntritsos G, Dimou N, Cabrera CP, Karaman I (2018) Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet 50(10):1412–1425. https://doi.org/10.1038/s41588-018-0205-x

Fausto-Sterling A (2007) Frameworks of desire. Daedalus 136(2):47–57. https://doi.org/10.1162/daed.2007.136.2.47

Fletcher JM (2012) Why have tobacco control policies stalled? Using genetic moderation to examine policy impacts. PLoS ONE 7(12):e50576. https://doi.org/10.1371/journal.pone.0050576

Foucault M (1998) The history of sexuality 1: the will to knowledge. Penguin Books, London

Foucault M (2003) The birth of the clinic. Routledge, London/New York

Foucault M (2005) The order of things. Routledge, London/New York

Fox Keller E (2014) From gene action to reactive genomes. J Physiol 592(11):2423–2429. https://doi.org/10.1113/jphysiol.2014.270991

Fox Keller E (2015) The postgenomic genome. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 9–31

Funk M (2019) The genetics of sexual orientation. Science 365(6456):878–880. https://doi.org/10.1126/science.365.6456.878-k

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019a) Genome studies must account for history—response. Science 366(6472):1461–1462. https://doi.org/10.1126/science.aaz8941

Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019b) Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior. Science 365(6456):eaat7693. https://doi.org/10.1126/science.aat7693

Garfinkel H (1967) Studies in ethnomethodology. Polity Press, Cambridge

Gray J (2009) Jim Gray on eScience: a transformed scientific method. In: Hey T, Tansley S, Tolle KM (eds) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond, pp. xvii–xxxi

Griffiths DA (2016) Queer genes: realism, sexuality and science. J Crit Realism 15(5):511–529. https://doi.org/10.1080/14767430.2016.1210872

Hamer DH, Hu S, Magnuson VL, Hu N, Pattatucci AM (1993) A linkage between DNA markers on the X chromosome and male sexual orientation. Science 261(5119):321–327. https://doi.org/10.1126/science.8332896

Haraway D (1988) Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem Stud 14(3):575–599

Holm S, Ploug T (2019) Genome studies reveal flaws in broad consent. Science 366(6472):1460–1461. https://doi.org/10.1126/science.aaz3797

Howard HC, van El CG, Forzano F, Radojkovic D, Rial-Sebbag E, de Wert G, Borry P, Cornel MC (2018) One small edit for humans, one giant edit for humankind? Points and questions to consider for a responsible way forward for gene editing in humans. Eur J Hum Genet 26(1):1. https://doi.org/10.1038/s41431-017-0024-z

Jansen PR, Watanabe K, Stringer S, Skene N, Bryois J, Hammerschlag AR, de Leeuw CA, Benjamins JS, Muñoz-Manchado AB, Nagel M, Savage JE, Tiemeier H, White T, Agee M, Alipanahi B, Auton A, Bell RK, Bryc K, Elson SL, Fontanillas P, Furlotte NA, Hinds DA, Huber KE, Kleinman A, Litterman NK, McCreight JC, McIntyre MH, Mountain JL, Noblin ES, Northover CAM, Pitts SJ, Sathirapongsasuti JF, Sazonova OV, Shelton JF, Shringarpure S, Tian C, Wilson CH, Tung JY, Hinds DA, Vacic V, Wang X, Sullivan PF, van der Sluis S, Polderman TJC, Smit AB, Hjerling-Leffler J, Van Someren EJW, Posthuma D, and the 23andMe Research Team (2019) Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat Genet 51(3):394–403. https://doi.org/10.1038/s41588-018-0333-3

Jasanoff S (2004) The idiom of co-production. In: Jasanoff S (ed.) States of knowledge: the co-production of science and social order. Routledge, London, p 1–12

Jasanoff S, Hurlbut JB (2018) A global observatory for gene editing. Nature 555:435–437. https://doi.org/10.1038/d41586-018-03270-w

Kessler SJ, McKenna W (1978) Gender: an ethnomethodological approach. John Wiley & Sons, New York

Kitchin R (2014a) Big Data, new epistemologies and paradigm shifts. Big Data Soc. https://doi.org/10.1177/2053951714528481

Kitchin R (2014b) The data revolution. Big data, open data, data infrastructures and their consequences. Sage, London

Landecker H (2016) The social as signal in the body of chromatin. Sociol Rev 64(1_suppl):79–99. https://doi.org/10.1111/2059-7932.12014

Landecker H, Panofsky A (2013) From social structure to gene regulation, and back: a critical introduction to environmental epigenetics for sociology. Annu Rev Sociol 39:333–357. https://doi.org/10.1146/annurev-soc-071312-145707

Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK (2018) Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment. Nat Genet 50(8):1112. https://doi.org/10.1038/s41588-018-0147-3

Lehrer SF, Ding W (2019) Can social scientists use molecular genetic data to explain individual differences and inform public policy? In: Foster G (ed.) Biophysical measurement in experimental social science research. Academic Press, London/San Diego/Cambridge/Oxford, pp. 225–265

Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data Soc. https://doi.org/10.1177/2053951714534395

Lorber J (1994) Paradoxes of gender. Yale University Press, New Haven

Martin P, Morrison M, Turkmendag I, Nerlich B, McMahon A, de Saille S, Bartlett A (2020) Genome editing: the dynamics of continuity, convergence, and change in the engineering of life. New Genet Soc 39(2):219–242. https://doi.org/10.1080/14636778.2020.1730166

Maxmen A (2019) Controversial ‘gay gene’ app provokes fears of a genetic Wild West. Nature 574(7780):609–610. https://doi.org/10.1038/d41586-019-03282-0

Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, Boston/New York

Meloni M (2014) Biology without biologism: social theory in a postgenomic age. Sociology 48(4):731–746. https://doi.org/10.1177/0038038513501944

Meloni M (2016) Political biology: Science and social values in human heredity from eugenics to epigenetics. Palgrave Macmillan, n.p.

Mitchell RW, Dezarn L (2014) Does knowing why someone is gay influence tolerance? Genetic, environmental, choice, and “reparative” explanations. Sex Cult 18(4):994–1009. https://doi.org/10.1007/s12119-014-9233-6

Morrison M, de Saille S (2019) CRISPR in context: towards a socially responsible debate on embryo editing. Palgrave Commun 5(1):1–9. https://doi.org/10.1057/s41599-019-0319-5

Nelkin D, Lindee MS (2004) The DNA mystique: the gene as a cultural icon. University of Michigan Press, Ann Arbor

Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc 29(4):485–513. https://doi.org/10.1080/03085140050174750

O’Riordan K (2012) The life of the gay gene: from hypothetical genetic marker to social reality. J Sex Res 49(4):362–368. https://doi.org/10.1080/00224499.2012.663420

Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, Turley P, Chen G-B, Emilsson V, Meddens SFW (2016) Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533(7604):539–542. https://doi.org/10.1038/nature17671

Parry B, Greenhough B (2017) Bioinformation. Polity Press, Cambridge

Parsons T (1951) The social system. Free Press, New York

Prainsack B (2015) Is personalized medicine different? (Reinscription: the sequel) A response to Troy Duster. Br J Sociol 66(1):28–35. https://doi.org/10.1111/1468-4446.12117

Reardon J (2017) The postgenomic condition: ethics, justice, and knowledge after the genome. University of Chicago Press, Chicago/London

Rehmann-Sutter C, Mahr D (2016) The lived genome. In: Whitehead A, Woods A (eds) Edinburgh companion to the critical medical humanities. Edinburgh University Press, Edinburgh, pp. 87–103

Richardson SS, Borsa A, Boulicault M, Galka J, Ghosh N, Gompers A, Noll NE, Perret M, Reiches MW, Sandoval JCB (2019) Genome studies must account for history. Science 366(6472):1461. https://doi.org/10.1126/science.aaz6594

Ruckenstein M, Schüll ND (2017) The datafication of health. Annu Rev Anthropol 46:261–278. https://doi.org/10.1146/annurev-anthro-102116-041244

Saukko P (2017) Shifting metaphors in direct-to-consumer genetic testing: from genes as information to genes as big data. New Genet Soc 36(3):296–313. https://doi.org/10.1080/14636778.2017.1354691

Savage M, Burrows R (2007) The coming crisis of empirical sociology. Sociology 41(5):885–899. https://doi.org/10.1177/0038038507080443

Shostak S, Conrad P, Horwitz AV (2008) Sequencing and its consequences: path dependence and the relationships between genetics and medicalization. Am J Sociol 114(S1):S287–S316. https://doi.org/10.1086/595570

Thomas WJ, Thomas DS (1929) The child in America. Behavior problems and programs. Knopf, New York

West C, Zimmerman DH (1991) Doing gender. In: Lorber J, Farrell SA (eds) The social construction of gender. Sage, Newbury Park/London, pp. 13–37

Acknowledgements

Open access funding provided by University of Vienna. The authors thank Brígida Riso for contributing to a previous version of this article.

Author information

These authors contributed equally: Melanie Goisauf, Kaya Akyüz, Gillian M. Martin.

Authors and Affiliations

Department of Science and Technology Studies, University of Vienna, Vienna, Austria

Melanie Goisauf & Kaya Akyüz

BBMRI-ERIC, Graz, Austria

Department of Sociology, University of Malta, Msida, Malta

Gillian M. Martin

Corresponding authors

Correspondence to Melanie Goisauf or Kaya Akyüz .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Goisauf, M., Akyüz, K. & Martin, G.M. Moving back to the future of big data-driven research: reflecting on the social in genomics. Humanit Soc Sci Commun 7, 55 (2020). https://doi.org/10.1057/s41599-020-00544-5

Received : 15 November 2019

Accepted : 09 July 2020

Published : 04 August 2020

DOI : https://doi.org/10.1057/s41599-020-00544-5

Handling big data: research challenges and future directions

  • Published: 25 February 2016
  • Volume 72, pages 1494–1516 (2016)

  • I. Anagnostopoulos 1 ,
  • S. Zeadally 2 &
  • E. Exposito 3  

Today, an enormous amount of data is being continuously generated in all walks of life by all kinds of devices and systems every day. A significant portion of such data is being captured, stored, aggregated and analyzed in a systematic way without losing its “4V” (i.e., volume, velocity, variety, and veracity) characteristics. We review major drivers of big data today as well as the recent trends and established platforms that offer valuable perspectives on the information stored in large and heterogeneous data sets. Then, we present a classification of some of the most important challenges when handling big data. Based on this classification, we recommend solutions that could address the identified challenges, and in addition we highlight cross-disciplinary research directions that need further investigation in the future.

Acknowledgments

We express our gratitude to Emna Mezghani for her contributions to this work. The authors thank the anonymous reviewers for their valuable comments and suggestions.

Author information

Authors and Affiliations

University of Thessaly, Nea Ionia, Greece

I. Anagnostopoulos

University of Kentucky, Lexington, USA

S. Zeadally

University of Toulouse, Toulouse, France

E. Exposito

Corresponding author

Correspondence to S. Zeadally .

About this article

Anagnostopoulos, I., Zeadally, S. & Exposito, E. Handling big data: research challenges and future directions. J Supercomput 72 , 1494–1516 (2016). https://doi.org/10.1007/s11227-016-1677-z

Published : 25 February 2016

Issue Date : April 2016

DOI : https://doi.org/10.1007/s11227-016-1677-z

Keywords

  • Data curation
  • Data cleansing
  • Data analytics

REVIEW article

Challenges and future directions of big data and artificial intelligence in education.

Hui Luan

  • 1 Institute for Research Excellence in Learning Sciences, National Taiwan Normal University, Taipei, Taiwan
  • 2 National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan
  • 3 School of Dentistry, Faculty of Medicine & Dentistry, University of Alberta, Edmonton, AB, Canada
  • 4 Graduate School of Education, Rutgers – The State University of New Jersey, New Brunswick, NJ, United States
  • 5 Apprendis, LLC, Berlin, MA, United States
  • 6 Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Central University, Taoyuan City, Taiwan
  • 7 Graduate School of Informatics, Kyoto University, Kyoto, Japan
  • 8 Department of Electrical Engineering, College of Technology and Engineering, National Taiwan Normal University, Taipei, Taiwan
  • 9 Centro de Tecnologia, Universidade Federal de Santa Maria, Santa Maria, Brazil
  • 10 Department of Chinese and Bilingual Studies, Faculty of Humanities, The Hong Kong Polytechnic University, Kowloon, Hong Kong
  • 11 Program of Learning Sciences, National Taiwan Normal University, Taipei, Taiwan

We discuss the new challenges and directions facing the use of big data and artificial intelligence (AI) in education research, policy-making, and industry. In recent years, applications of big data and AI in education have made significant headway. This highlights a novel trend in leading-edge educational research. The convenience and embeddedness of data collection within educational technologies, paired with computational techniques, have made the analysis of big data a reality. We are moving beyond proof-of-concept demonstrations and applications of techniques, and are beginning to see substantial adoption in many areas of education. The key research trends in the domains of big data and AI are associated with assessment, individualized learning, and precision education. Model-driven data analytics approaches will grow quickly to guide the development, interpretation, and validation of the algorithms. However, conclusions from educational analytics should, of course, be applied with caution. At the education policy level, the government should be devoted to supporting lifelong learning, offering teacher education programs, and protecting personal data. With regard to the education industry, reciprocal and mutually beneficial relationships should be developed in order to enhance academia-industry collaboration. Furthermore, it is important to make sure that technologies are guided by relevant theoretical frameworks and are empirically tested. Lastly, in this paper we advocate an in-depth dialog between supporters of “cold” technology and “warm” humanity, so as to foster greater understanding among teachers and students of how technology, and specifically the big data explosion and AI revolution, can bring new opportunities (and challenges) that can best be leveraged for pedagogical practices and learning.

Introduction

The purpose of this position paper is to present the current status, opportunities, and challenges of big data and AI in education. The work originated from the opinions and panel discussion minutes of an international conference on big data and AI in education ( The International Learning Sciences Forum, 2019 ), where prominent researchers and experts from different disciplines such as education, psychology, data science, AI, and cognitive neuroscience exchanged their knowledge and ideas. This article is organized as follows: we start with an overview of recent progress of big data and AI in education. Then we present the major challenges and emerging trends. Finally, based on our discussions of big data and AI in education, we suggest conclusions and future directions.

Rapid advancements in big data and artificial intelligence (AI) technologies have had a profound impact on all areas of human society including the economy, politics, science, and education. Thanks in large part to these developments, we are able to continue many of our social activities under the COVID-19 pandemic. Digital tools, platforms, applications, and the communications among people have generated vast amounts of data (‘big data’) across disparate locations. Big data technologies aim at harnessing the power of extensive data in real-time or otherwise ( Daniel, 2019 ). The characteristic attributes of big data are often referred to as the four V’s. That is, volume (amount of data), variety (diversity of sources and types of data), velocity (speed of data transmission and generation), and veracity (the accuracy and trustworthiness of data) ( Laney, 2001 ; Schroeck et al., 2012 ; Geczy, 2014 ). Recently, a 5th V was added, namely value (i.e., that data could be monetized; Dijcks, 2013 ). Because of intrinsic big data characteristics (the five Vs), large and complex datasets are impossible to process and utilize by using traditional data management techniques. Hence, novel and innovative computational technologies are required for the acquisition, storage, distribution, analysis, and management of big data ( Lazer et al., 2014 ; Geczy, 2015 ). Big data analytics commonly encompasses the processes of gathering, analyzing, and evaluating large datasets. Extraction of actionable knowledge and viable patterns from data are often viewed as the core benefits of the big data revolution ( Mayer-Schönberger and Cukier, 2013 ; Jagadish et al., 2014 ). Big data analytics employ a variety of technologies and tools, such as statistical analysis, data mining, data visualization, text analytics, social network analysis, signal processing, and machine learning ( Chen and Zhang, 2014 ).

As a subset of AI, machine learning focuses on building computer systems that can learn from and adapt to data automatically without explicit programming ( Jordan and Mitchell, 2015 ). Machine learning algorithms can provide new insights, predictions, and solutions to customize the needs and circumstances of each individual. With the availability of large quantity and high-quality input training data, machine learning processes can achieve accurate results and facilitate informed decision making ( Manyika et al., 2011 ; Gobert et al., 2012 , 2013 ; Gobert and Sao Pedro, 2017 ). These data-intensive, machine learning methods are positioned at the intersection of big data and AI, and are capable of improving the services and productivity of education, as well as many other fields including commerce, science, and government.

Regarding education, our main area of interest here, the application of AI technologies can be traced back to approximately 50 years ago. The first Intelligent Tutoring System “SCHOLAR” was designed to support geography learning, and was capable of generating interactive responses to student statements ( Carbonell, 1970 ). While the amount of data was relatively small at that time, it was comparable to the amount of data collected in other traditional educational and psychological studies. Research on AI in education over the past few decades has been dedicated to advancing intelligent computing technologies such as intelligent tutoring systems ( Graesser et al., 2005 ; Gobert et al., 2013 ; Nye, 2015 ), robotic systems ( Toh et al., 2016 ; Anwar et al., 2019 ), and chatbots ( Smutny and Schreiberova, 2020 ). With the breakthroughs in information technologies in the last decade, educational psychologists have had greater access to big data. Concretely speaking, social media (e.g., Facebook, Twitter), online learning environments [e.g., Massive Open Online Courses (MOOCs)], intelligent tutoring systems (e.g., AutoTutor), learning management systems (LMSs), sensors, and mobile devices are generating ever-growing amounts of dynamic and complex data containing students’ personal records, physiological data, learning logs and activities, as well as their learning performance and outcomes ( Daniel, 2015 ). Learning analytics, described as “the measurement, collection, analysis, and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs” ( Long and Siemens, 2011 , p. 34), are often implemented to analyze these huge amounts of data ( Aldowah et al., 2019 ). Machine learning and AI techniques further expand the capabilities of learning analytics ( Zawacki-Richter et al., 2019 ). The essential information extracted from big data could be utilized to optimize learning, teaching, and administration ( Daniel, 2015 ). Hence, research on big data and AI is gaining increasing significance in education ( Johnson et al., 2011 ; Becker et al., 2017 ; Hwang et al., 2018 ) and psychology ( Harlow and Oswald, 2016 ; Yarkoni and Westfall, 2017 ; Adjerid and Kelley, 2018 ; Cheung and Jak, 2018 ). Recently, the adoption of big data and AI in the psychology of learning and teaching has been trending as a novel method in cutting-edge educational research ( Daniel, 2015 ; Starcic, 2019 ).
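
To make the notion of learning analytics more concrete, the sketch below shows, in Python with pandas, how raw clickstream events from an LMS might be aggregated into simple per-student indicators (activity volume, regularity of engagement, time on task). It is a minimal illustration under stated assumptions: the file name "clicks.csv" and its columns are hypothetical placeholders, not fields from any system cited above.

```python
# Minimal sketch: aggregate a raw LMS clickstream into per-student indicators.
# The file "clicks.csv" and its columns (student_id, timestamp, event_type,
# duration_sec) are hypothetical placeholders.
import pandas as pd

events = pd.read_csv("clicks.csv", parse_dates=["timestamp"])

features = (
    events.groupby("student_id")
    .agg(
        total_events=("event_type", "size"),                        # activity volume
        active_days=("timestamp", lambda s: s.dt.date.nunique()),   # regularity of engagement
        total_time_min=("duration_sec", lambda s: s.sum() / 60.0),  # time on task
        video_views=("event_type", lambda s: (s == "video_play").sum()),
    )
    .reset_index()
)

print(features.head())
```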

The Position Formulation

A growing body of literature has attempted to uncover the value of big data at different education levels, from preschool to higher education ( Chen N.-S. et al., 2020 ). Several journal articles and book chapters have presented retrospective descriptions and the latest advances in the rapidly expanding research area from different angles, including systematic literature review ( Zawacki-Richter et al., 2019 ; Quadir et al., 2020 ), bibliometric study ( Hinojo-Lucena et al., 2019 ), qualitative analysis ( Malik et al., 2019 ; Chen L. et al., 2020 ), and social network analysis ( Goksel and Bozkurt, 2019 ). More details can be found in the previously mentioned reviews. In this paper, we aim at presenting the current progress of the application of big data and AI in education. By and large, the research on the learner side is devoted to identifying students’ learning and affective behavior patterns and profiles, improving methods of assessment and evaluation, predicting individual students’ learning performance or dropouts, and providing adaptive systems for personalized support ( Papamitsiou and Economides, 2014 ; Zawacki-Richter et al., 2019 ). On the teacher side, numerous studies have attempted to enhance course planning and curriculum development, evaluation of teaching, and teaching support ( Zawacki-Richter et al., 2019 ; Quadir et al., 2020 ). Additionally, teacher dashboards, such as Inq-Blotter, driven by big data techniques are being used to inform teachers’ instruction in real time while students simultaneously work in Inq-ITS ( Gobert and Sao Pedro, 2017 ; Mislevy et al., 2020 ). Big data technologies employing learning analytics and machine learning have demonstrated high predictive accuracy of students’ academic performance ( Huang et al., 2020 ). Only a small number of studies have focused on the effectiveness of learning analytics programs and AI applications. However, recent findings have revealed encouraging results in terms of improving students’ academic performance and retention, as well as supporting teachers in learning design and teaching strategy refinement ( Viberg et al., 2018 ; Li et al., 2019 ; Sonderlund et al., 2019 ; Mislevy et al., 2020 ).
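
As a rough sketch of how such predictive modelling is typically set up (and not the method of any particular study cited above), the following Python/scikit-learn snippet trains a simple classifier on per-student indicators and reports cross-validated accuracy. The feature table "student_features.csv" and the "passed_course" label are hypothetical placeholders.

```python
# Minimal sketch: predict course outcome from learning-log features and report
# cross-validated accuracy. "student_features.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("student_features.csv")
X = data[["total_events", "active_days", "total_time_min", "video_views"]]
y = data["passed_course"]  # 1 = passed, 0 = failed (hypothetical label)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```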

Despite the growing number of reports and methods outlining implementations of big data and AI technologies in educational environments, we see a notable gap between contemporary technological capabilities and their utilization for education. The fast-growing education industry has developed numerous data processing techniques and AI applications, which may not be guided by current theoretical frameworks and research findings from psychology of learning and teaching. The rapid pace of technological progress and relatively slow educational adoption have contributed to the widening gap between technology readiness and its application in education ( Macfadyen, 2017 ). There is a pressing need to reduce this gap and stimulate technological adoption in education. This work presents varying viewpoints and their controversial issues, contemporary research, and prospective future developments in adoption of big data and AI in education. We advocate an interdisciplinary approach that encompasses educational, technological, and governmental spheres of influence. In the educational domain, there is a relative lack of knowledge and skills in AI and big data applications. On the technological side, few data scientists and AI developers are familiar with the advancements in education psychology, though this is changing with the advent of graduate programs at the intersection of Learning Sciences and Computer Science. Finally, in terms of government policies, the main challenges faced are the regulatory and ethical dilemmas between support of educational reforms and restrictions on adoptions of data-oriented technologies.

An Interdisciplinary Approach to Educational Adoption of Big Data and AI

In response to the new opportunities and challenges that the big data explosion and AI revolution are bringing, academics, educators, policy-makers, and professionals need to engage in productive collaboration. They must work together to cultivate our learners’ necessary competencies and essential skills important for the 21st century work, driven by the knowledge economy ( Bereiter, 2002 ). Collaboration across diverse disciplines and sectors is a demanding task—particularly when individual sides lack a clear vision of their mutually beneficial interests and the necessary knowledge and skills to realize that vision. We highlight several overlapping spheres of interest at the intersection of research, policy-making, and industry engagements. Researchers and the industry would benefit from targeted educational technology development and its efficient transfer to commercial products. Businesses and governments would benefit from legislature that stimulates technology markets while suitably protecting data and users’ privacy. Academics and policy makers would benefit from prioritizing educational reforms enabling greater adoption of technology-enhanced curricula. The recent developments and evolving future trends at intersections between researchers, policy-makers, and industry stakeholders arising from advancements and deployments of big data and AI technologies in education are illustrated in Figure 1 .

Figure 1. Contemporary developments and future trends at the intersections between research, policy, and industry driven by big data and AI advances in education.

The constructive domains among stakeholders progressively evolve along with scientific and technological developments. Therefore, it is important to reflect on longer-term projections and challenges. The following sections highlight the novel challenges and future directions of big data and AI technologies at the intersection of education research, policy-making, and industry.

Big Data and AI in Education: Research

An understanding of individual differences is critical for developing pedagogical tools to target specific students and to tailor education to individual needs at different stages. Intelligent educational systems employing big data and AI techniques are capable of collecting accurate and rich personal data. Data analytics can reveal students’ learning patterns and identify their specific needs ( Gobert and Sao Pedro, 2017 ; Mislevy et al., 2020 ). Hence, big data and AI have the potential to realize individualized learning to achieve precision education ( Lu et al., 2018 ). We see the following emerging trends, research gaps, and controversies in integrating big data and AI into education research so that there is a deep and rigorous understanding of individual differences that can be used to personalize learning in real time and at scale.

(1) Education is progressively moving from a one-size-fits-all approach to precision education or personalized learning ( Lu et al., 2018 ; Tsai et al., 2020 ). The one-size-fits-all approach was designed for average students, whereas precision education takes into consideration the individual differences of learners in their learning environments, along with their learning strategies. The main idea of precision education is analogous to “precision medicine,” where researchers harvest big data to identify patterns relevant to specific patients such that prevention and treatment can be customized. Based on the analysis of student learning profiles and patterns, precision education predicts students’ performance and provides timely interventions to optimize learning. The goal of precision education is to improve the diagnosis, prediction, treatment, and prevention of learning outcomes ( Lu et al., 2018 ). Contemporary research gaps related to adaptive tools and personalized educational experiences are impeding the transition to precision education. Adaptive educational tools and flexible learning systems are needed to accommodate individual learners’ interaction, pace, and learning progress, and to fit the specific needs of the individual learners, such as students with learning disabilities ( Xie et al., 2019 ; Zawacki-Richter et al., 2019 ). Hence, as personalized learning is customized for different people, researchers are able to focus on individualized learning that is adaptive to individual needs in real time ( Gobert and Sao Pedro, 2017 ; Lu et al., 2018 ).

(2) The research focus on deploying AI in education is gradually shifting from a computational focus that demonstrates use cases of new technology to a cognitive focus that incorporates cognition in its design, such as perception ( VanRullen, 2017 ), emotion ( Song et al., 2016 ), and cognitive thinking ( Bramley et al., 2017 ). Moreover, it is also shifting from a single domain (e.g., domain expertise, or expert systems) to a cross-disciplinary approach through collaboration ( Spikol et al., 2018 ; Krouska et al., 2019 ) and domain transfers ( L’heureux et al., 2017 ). These controversial shifts are facilitating transitions from the knowing of the unknown (gaining insights through reasoning) to the unknown of the unknown (figuring out hidden values and unknown results through algorithms) ( Abed Ibrahim and Fekete, 2019 ; Cutumisu and Guo, 2019 ). In other words, deterministic learning, aimed at deductive/inductive reasoning and inference engines, predominated in traditional expert systems and old AI, whereas today dynamic and stochastic learning, the outcome of which involves some randomness and uncertainty, is gradually becoming the trend in modern machine learning techniques.

(3) The format of machine-generated data and the purpose of machine learning algorithms should be carefully designed. There is a notable gap between theoretical design and its applicability. A theoretical model is needed to guide the development, interpretation, and validation of algorithms (Gobert et al., 2013; Hew et al., 2019). The outcomes of data analytics and algorithmically generated evidence must be shared with educators and applied with caution. For instance, efforts to algorithmically detect mental states such as boredom, frustration, and confusion (Baker et al., 2010) must be supported by operational definitions and constructs that have been prudently evaluated. Additionally, the affective data collected by AI systems should take into account cultural differences together with contextual factors, teachers’ observations, and students’ opinions (Yadegaridehkordi et al., 2019). Data need to be informatively and qualitatively balanced in order to avoid implicit biases that may propagate into algorithms trained on such data (Staats, 2016).

(4) There are ethical and algorithmic challenges in balancing human-provided learning and machine-assisted learning. The significant influence of AI and contemporary technologies is a double-edged sword (Khechine and Lakhal, 2018). On the one hand, it facilitates better usability and drives progress. On the other, it might lead to algorithmic bias and the loss of certain essential skills among students who rely extensively on technology. For instance, in creativity- or experience-based learning, technology may even become an obstacle to learning, since it may hinder students from attaining first-hand experiences and participating in the learning activities (Cuthbertson et al., 2004). Appropriately balancing technology adoption and human involvement in various educational contexts will be a challenge in the foreseeable future. Nonetheless, the convergence of human and machine learning has the potential for highly effective teaching and learning beyond the simple “sum of the parts of human and artificial intelligence” (Topol, 2019).

(5) Algorithmic bias is another controversial issue (Obermeyer et al., 2019). Since modern AI algorithms rely extensively on data, their performance is governed largely by the data they are given. Algorithms adapt to the inherent qualitative and quantitative characteristics of the data. For example, if the data are unbalanced and contain disproportionately better information on students from the general population than on minorities, the algorithms may produce systematic and repeatable errors that disadvantage minorities (a minimal illustration of this effect is sketched after this list). These issues need to be addressed before algorithms are widely implemented in educational practice, since every single student matters. More rigorous studies and validation in real learning environments are required, though work along these lines is being done (Sao Pedro et al., 2013).

(6) The fast expansion of technology and inequalities in learning opportunities have aroused great controversy. Due to the exponential nature of technological progress, particularly the big data and AI revolution, a fresh paradigm and a new learning landscape are on the horizon. For instance, the elite smartphone of 2010 was the BlackBerry. Today, 10 years later, even in sub-Saharan Africa, 75% of the population has mobile phones several generations more advanced (GSMA Intelligence, 2020). Hence, the entry barriers are shifting from technical requirements to the willingness and/or need to adopt. This has been clearly demonstrated during the COVID-19 pandemic. The need for social distancing while continuing education has led to online/e-learning deployments within months (United Nations, 2020), and a huge amount of learning data has been created as a result. The extraction of meaningful patterns and the discovery of knowledge from these data are expected to be carried out through learning analytics and AI techniques. Inevitably, current learning cultures, learning experiences, and classroom dynamics are changing as “we live algorithmic lives” (Bucher, 2018). Thus, there is a critical need to adopt proper learning theories from educational psychology and to encourage our learners to be active participants rather than passive recipients or merely tracked objects (Loftus and Madden, 2020). For example, under the constructionist framework (Tsai, 2000), technology-enhanced or AI-powered education may empower students to understand their learning activities and patterns, predict their possible learning outcomes, and strategically regulate their learning behavior (Koh et al., 2014; Loftus and Madden, 2020). On the other hand, in the era of information explosion and AI revolution, disadvantaged students and developing countries are facing a widening digital divide. To reduce the inequalities and bring more opportunities, cultivating young people’s competencies seems to be one of the most promising means (UNESCO, 2015). Meanwhile, support from international organizations such as the World Bank and UNESCO is imperative for developing countries in establishing their communications infrastructure (e.g., hardware, software, connectivity, electricity). Naturally, technology will not replace or hinder human learning; rather, a smart use of new technologies will facilitate the transfer and acquisition of knowledge (Azevedo et al., 2019).
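The following minimal sketch (synthetic data, invented feature and group labels, scikit-learn assumed to be available) illustrates the point made in (5): when one student subgroup is represented by fewer and noisier records, a model trained on the pooled data tends to make more errors on that subgroup. It illustrates the mechanism only and is not the implementation of any system cited above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_group(n, label_noise):
    """Synthetic 'engagement' feature; noisier labels mimic poorer data quality."""
    x = rng.normal(0.0, 1.0, size=(n, 1))
    y = (x[:, 0] + rng.normal(0.0, label_noise, size=n) > 0).astype(int)  # pass/fail
    return x, y

# Majority subgroup: many students, relatively clean labels.
# Minority subgroup: few students, noisier labels.
x_maj, y_maj = make_group(2000, label_noise=0.3)
x_min, y_min = make_group(100, label_noise=1.0)

X = np.vstack([x_maj, x_min])
y = np.concatenate([y_maj, y_min])
group = np.array([0] * len(y_maj) + [1] * len(y_min))

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.3, random_state=0, stratify=group)

model = LogisticRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

for g, name in [(0, "majority"), (1, "minority")]:
    mask = g_te == g
    error_rate = (pred[mask] != y_te[mask]).mean()
    print(f"{name} subgroup error rate: {error_rate:.2%}")
```

Because the model optimizes average accuracy over the pooled data, the noisier, under-represented subgroup typically ends up with the higher error rate, which is exactly the kind of systematic disadvantage the studies above warn about.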

An overarching theme from the above trends is that we need theories of cognitive and educational psychology to guide our understanding of the individual learner (and individual differences), in order to develop the best tools, algorithms, and practices for personalized learning. Take, for example, VR (virtual reality) or AR (augmented reality) as a fast-developing technology for education. The industry has developed many different types of VR/AR applications (e.g., Google Expeditions with over 100 virtual field trips), but these have typically been developed from the industry’s point of view (see further discussion below) and may not be informed by theories and data from educational psychology about how students actually learn. To make VR/AR effective learning tools, we must separate the technological features from the human experiences and abilities (e.g., the cognitive, linguistic, and spatial abilities of the learner; see Li et al., 2020). For example, VR provides a high-fidelity 3D real-life virtual environment, and the technological tools are built on the assumption that 3D realism enables the learner to gain ‘perceptual grounding’ during learning (e.g., having access to visual, auditory, and tactile experiences as in the real world). Following the ‘embodied cognition’ theory (Barsalou, 2008), we should expect VR learning to yield better learning outcomes than traditional classroom learning. However, empirical data suggest that there are significant individual differences, in that some students benefit more than others from VR learning. It may be that individuals with higher cognitive and perceptual abilities need no additional visuospatial information (provided in VR) to succeed in learning. In any case, we need to understand how embodied experiences (provided by the technology) interact with different learners’ inherent abilities (as well as their prior knowledge and background) for the best application of the relevant technology in education.

Big Data and AI in Education: Policy-Making

Following the revolution triggered by breakthroughs in big data and AI technology, policy-makers have attempted to formulate strategies and policies regarding how to incorporate AI and emerging technologies into primary, secondary, and tertiary education ( Pedró et al., 2019 ). Major challenges must be overcome in order to suitably integrate big data and AI into educational practice. The following three segments highlight pertinent policy-oriented challenges, gaps, and evolving trends.

(1) In digitally-driven knowledge economies, traditional formal education systems are undergoing drastic changes or even a paradigm shift (Peters, 2018). Lifelong learning is quickly being adopted and implemented through online or project-based learning schemes that incorporate multiple ways of teaching (Lenschow, 1998; Sharples, 2000; Field, 2001; Koper and Tattersall, 2004). This new concept of continual education will require micro-credits or micro-degrees to sustain learners’ efforts (Manuel Moreno-Marcos et al., 2019). The need to change the scope and role of education will become evident in the near future (Williams, 2019). For example, in the next few years, new instruction methods, engagement strategies, and assessments, built around micro-credits or micro-degrees, will need to be developed in formal education to support lifelong learning.

(2) Solutions for integrating cutting-edge research findings, innovative theory-driven curricula, and emerging technologies into students’ learning are evidently beneficial, and perhaps even ready for adoption. However, there is an apparent divergence between pre-service and in-service teachers in their willingness to support and adopt these emerging technologies (Pedró et al., 2019). Pre-service teachers have greater exposure to modern technologies and, in general, are more willing to adopt them. In-service teachers have greater practical experience and tend to rely more on it. To bridge the gap, effective teacher education and continuing education programs have to be developed and offered to support the adoption of these new technologies so that they can be implemented with fidelity (O’Donnell, 2008). This issue could become even more pressing in light of the extended period of the COVID-19 pandemic.

(3) A suitable legislative framework is needed to protect personal data from unscrupulous collection, unauthorized disclosure, commercial exploitation, and other abuses ( Boyd and Crawford, 2012 ; Pardo and Siemens, 2014 ). Education records and personal data are highly sensitive. There are significant risks associated with students’ educational profiles, records, and other personal data. Appropriate security measures must be adopted by educational institutions. Commercial educational system providers are actively exploiting both legislative gaps and concealed data acquisition channels. Increasing numbers of industry players are implementing data-oriented business models ( Geczy, 2018 ). There is a vital role to play for legislative, regulatory, and enforcing bodies at both the national and local levels. It is pertinent that governments enact, implement, and enforce privacy and personal data protection legislation and measures. In doing so, there is a need to strike a proper balance between desirable use of personal data for educational purposes and undesirable commercial monetization and abuse of personal data.

Big Data and AI in Education: Industry

As scientific and academic aspects of big data and AI in education have their unique challenges, so does the commercialization of educational tools and systems ( Renz et al., 2020 ). Numerous countries have attempted to stimulate innovation-based growth through enhancing technology transfer and fostering academia-industry collaboration ( Huggins and Thompson, 2015 ). In the United States, this was initiated by the Bayh-Dole Act ( Mowery et al., 2001 ). Building a reciprocal and sustained partnership is strongly encouraged. It facilitates technology transfers and strengthens the links between academia and the education industry. There are several points to be considered when approaching academia-industry collaboration. It is important that collaboration is mutually beneficial. The following points highlight the overlapping spheres of benefits for both educational commerce and academia. They also expose existing gaps and future prospects.

(1) Commercializing intelligent educational tools and systems that include the latest scientific and technological advances can provide educators with tools for developing more effective curricula, pedagogical frameworks, assessments, and programs. Timely release of educational research advances onto commercial platforms is desirable for vendors from development, marketing, and revenue perspectives (Renz and Hilbig, 2020). Implementing the latest research enables progressive development of commercial products and distinctive differentiation for marketing purposes. It could also help close the significant gap between what the industry knows and develops and what academic research says about student learning. Novel features may also be suitably monetized, expanding revenue streams. The gaps between the availability of the latest research and its practical adoption are slowing progress and negatively impacting commercial vendors. A viable solution is closer alignment and/or direct collaboration between academia and industry.

(2) A greater spectrum of commercially and freely available tools helps maintain healthy market competition. It also helps to avoid monopolies and oligopolies that stifle innovation, limit choices, and damage markets for educational tools. Some well-established or free-of-charge platforms (e.g., Moodle and other learning management systems) showed such oligopolistic potential during the COVID-19 pandemic. With more tools available on the market, educators and academics may explore novel avenues for improving education and research. New and more effective forms of education may be devised. For instance, multimodal virtual educational environments have high future potential. These are environments that would otherwise be impossible in conventional physical settings (see the previous discussion of VR/AR). Expanding educational markets and commerce should inevitably lead to expanding resources for research and development funding (Popenici and Kerr, 2017). Collaborative research projects sponsored by industry should provide support and opportunities for academics to advance educational research. Controversially, in numerous regions there is a decreasing trend in collaborative research. To reverse the trend, it is desirable that academic researchers and industry practitioners increase their engagement via mutual presentations, education, and even government initiatives. All three stakeholders (i.e., academia, industry, and government) should play more active roles.

(3) Vocational and practical education provides numerous opportunities for fruitful academia-industry collaboration. With the changing nature of work and growing technology adoption, there is an increasing demand for radical changes in vocational education, for both teachers and students (World Development and Report, 2019). Domain knowledge provided by teachers is beneficially supplemented by AI-assisted learning environments in academia. Practical skills are enhanced in industrial environments with hands-on experience and feedback from both trainers and technology tools. Hence, students benefit from acquiring domain knowledge and enhancing their skills via interactions with human teachers and trainers. Equally, they benefit from gaining practical skills via interactions with simulated and real-world technological environments. Effective vocational training demands teachers and trainers on the human-learning side, and AI environments and actual technology tools on the machine-learning side. Collaboration between academia and industry, as well as balanced human- and machine-learning approaches, is pertinent for vocational education.

Discussion and Conclusion

Big data and AI have enormous potential to realize highly effective learning and teaching. They stimulate new research questions and designs, exploit innovative technologies and tools in data collection and analysis, and ultimately become a mainstream research paradigm ( Daniel, 2019 ). Nonetheless, they are still fairly novel and unfamiliar to many researchers and educators. In this paper, we have described the general background, core concepts, and recent progress of this rapidly growing domain. Along with the arising opportunities, we have highlighted the crucial challenges and emerging trends of big data and AI in education, which are reflected in educational research, policy-making, and industry. Table 1 concisely summarizes the major challenges and possible solutions of big data and AI in education. In summary, future studies should be aimed at theory-based precision education, incorporating cross-disciplinary application, and appropriately using educational technologies. The government should be devoted to supporting lifelong learning, offering teacher education programs, and protecting personal data. With regard to the education industry, reciprocal and mutually beneficial relationships should be developed in order to enhance academia-industry collaboration.


Table 1. Major challenges and possible solutions for integrating big data and AI into education.

Regarding the future development of big data and AI, we advocate an in-depth dialog between the supporters of “cold” technology and “warm” humanity so that users of technology can benefit from its capacity and not see it as a threat to their livelihood. An equally important issue is that overreliance on technology may lead to an underestimation of the role of humans in education. Remember the fundamental role of schooling: the school is a great equalizer as well as a central socialization agent. We need to better understand the role of social and affective processing (e.g., emotion, motivation) in addition to cognitive processing in student learning successes (or failures). After all, human learning is a social behavior, and a number of key regions in our brains are wired to be socially engaged (see Li and Jeong, 2020 for a discussion).

It has been estimated that approximately half of current routine jobs might be automated in the near future (Frey and Osborne, 2017; World Development and Report, 2019). However, the teacher’s job cannot be replaced. The teacher-student relationship is indispensable to students’ learning and inspirational for students’ personal growth (Roorda et al., 2011; Cheng and Tsai, 2019). On the other hand, new developments in technologies will enable us to collect and analyze large-scale, multimodal, and continuous real-time data. Such data-intensive and technology-driven analysis of human behavior, in real-world and simulated environments, may assist teachers in identifying students’ learning trajectories and patterns, developing corresponding lesson plans, and adopting effective teaching strategies (Klašnja-Milicevic et al., 2017; Gierl and Lai, 2018). It may also support teachers in tackling students’ more complex problems and cultivating students’ higher-order thinking skills by freeing teachers from monotonous and routine tasks (Li, 2007; Belpaeme et al., 2018). Hence, it is now imperative for us to embrace AI and technology and prepare our teachers and students for the future of AI-enhanced and technology-supported education.

The adoption of big data and AI in learning and teaching is still in its infancy and limited by technological and mindset challenges for now; however, the convergence of developments in psychology, data science, and computer science shows great promise in revolutionizing educational research, practice, and industry. We hope that the latest achievements and future directions presented in this paper will advance our shared goal of helping learners and teachers pursue sustainable development.

Author Contributions

HLu wrote the initial draft of the manuscript. PG, HLa, JG, and PL revised the drafts and provided theoretical background. SY, HO, JB, and RG contributed content for the original draft preparation of the manuscript. C-CT provided theoretical focus, design, draft feedback, and supervised throughout the research. All authors contributed to the article and approved the submitted version.

Funding

This work was financially supported by the Institute for Research Excellence in Learning Sciences of National Taiwan Normal University (NTNU) from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.

Conflict of Interest

JG was employed by the company Apprendis, LLC, Berlin.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Abed Ibrahim, L., and Fekete, I. (2019). What machine learning can tell us about the role of language dominance in the diagnostic accuracy of German LITMUS non-word and sentence repetition tasks. Front. Psychol. 9:2757. doi: 10.3389/fpsyg.2018.02757


Adjerid, I., and Kelley, K. (2018). Big data in psychology: a framework for research advancement. Am. Psychol. 73, 899–917. doi: 10.1037/amp0000190


Aldowah, H., Al-Samarraie, H., and Fauzy, W. M. (2019). Educational data mining and learning analytics for 21st century higher education: a review and synthesis. Telemat. Inform. 37, 13–49. doi: 10.1016/j.tele.2019.01.007

Anwar, S., Bascou, N. A., Menekse, M., and Kardgar, A. (2019). A systematic review of studies on educational robotics. J. Pre-College Eng. Educ. Res. (J-PEER) 9, 19–42. doi: 10.7771/2157-9288.1223

Azevedo, J. P. W. D., Crawford, M. F., Nayar, R., Rogers, F. H., Barron Rodriguez, M. R., Ding, E. Y. Z., et al. (2019). Ending Learning Poverty: What Will It Take?. Washington, D.C: The World Bank.


Baker, R. S. J. D., D’Mello, S. K., Rodrigo, M. M. T., and Graesser, A. C. (2010). Better to be frustrated than bored: the incidence, persistence, and impact of learners’ cognitive-affective states during interactions with three different computer-based learning environments. Int. J. Human-Comp. Stud. 68, 223–241. doi: 10.1016/j.ijhcs.2009.12.003

Barsalou, L. W. (2008). “Grounding symbolic operations in the brain’s modal systems,” in Embodied Grounding: Social, Cognitive, Affective, and Neuroscientific Approaches , eds G. R. Semin and E. R. Smith (Cambridge: Cambridge University Press), 9–42. doi: 10.1017/cbo9780511805837.002

Becker, S. A., Cummins, M., Davis, A., Freeman, A., Hall, C. G., and Ananthanarayanan, V. (2017). NMC Horizon Report: 2017 Higher Education Edition. Austin, TX: The New Media Consortium.

Belpaeme, T., Kennedy, J., Ramachandran, A., Scassellati, B., and Tanaka, F. (2018). Social robots for education: a review. Sci. Robot. 3:eaat5954. doi: 10.1126/scirobotics.aat5954

Bereiter, C. (2002). Education and MIND in the Knowledge Age. Mahwah, NJ: LEA.

Boyd, D., and Crawford, K. (2012). Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inform. Commun. Soc. 15, 662–679. doi: 10.1080/1369118x.2012.678878

Bramley, N. R., Dayan, P., Griffiths, T. L., and Lagnado, D. A. (2017). Formalizing Neurath’s ship: approximate algorithms for online causal learning. Psychol. Rev. 124, 301–338. doi: 10.1037/rev0000061

Bucher, T. (2018). If Then: Algorithmic Power and Politics. New York, NY: Oxford University Press.

Carbonell, J. R. (1970). AI in CAI: an artificial-intelligence approach to computer-assisted instruction. IEEE Trans. Man-Machine Sys. 11, 190–202. doi: 10.1109/TMMS.1970.299942

Chen, C. P., and Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inform. Sci. 275, 314–347. doi: 10.1016/j.ins.2014.01.015

Chen, L., Chen, P., and Lin, Z. (2020). Artificial intelligence in education: a review. IEEE Access 8, 75264–75278. doi: 10.1109/ACCESS.2020.2988510

Chen, N.-S., Yin, C., Isaias, P., and Psotka, J. (2020). Educational big data: extracting meaning from data for smart education. Interact. Learn. Environ. 28, 142–147. doi: 10.1080/10494820.2019.1635395

Cheng, K.-H., and Tsai, C.-C. (2019). A case study of immersive virtual field trips in an elementary classroom: students’ learning experience and teacher-student interaction behaviors. Comp. Educ. 140:103600. doi: 10.1016/j.compedu.2019.103600

Cheung, M. W.-L., and Jak, S. (2018). Challenges of big data analyses and applications in psychology. Zeitschrift Fur Psychol. J. Psychol. 226, 209–211. doi: 10.1027/2151-2604/a000348

Cuthbertson, B., Socha, T. L., and Potter, T. G. (2004). The double-edged sword: critical reflections on traditional and modern technology in outdoor education. J. Adv. Educ. Outdoor Learn. 4, 133–144. doi: 10.1080/14729670485200491

Cutumisu, M., and Guo, Q. (2019). Using topic modeling to extract pre-service teachers’ understandings of computational thinking from their coding reflections. IEEE Trans. Educ. 62, 325–332. doi: 10.1109/te.2019.2925253

Daniel, B. (2015). Big data and analytics in higher education: opportunities and challenges. Br. J. Educ. Technol. 46, 904–920. doi: 10.1111/bjet.12230

Daniel, B. K. (2019). Big data and data science: a critical review of issues for educational research. Br. J. Educ. Technol. 50, 101–113. doi: 10.1111/bjet.12595

Dijcks, J. (2013). Oracle: Big data for the enterprise. Oracle White Paper . Redwood Shores, CA: Oracle Corporation.

Field, J. (2001). Lifelong education. Int. J. Lifelong Educ. 20, 3–15. doi: 10.1080/09638280010008291

Frey, C. B., and Osborne, M. A. (2017). The future of employment: how susceptible are jobs to computerisation? Technol. Forecast. Soc. Change 114, 254–280. doi: 10.1016/j.techfore.2016.08.019

Geczy, P. (2014). Big data characteristics. Macrotheme Rev. 3, 94–104.

Geczy, P. (2015). Big data management: relational framework. Rev. Bus. Finance Stud. 6, 21–30.

Geczy, P. (2018). Data-Oriented business models: gaining competitive advantage. Global J. Bus. Res. 12, 25–36.

Gierl, M. J., and Lai, H. (2018). Using automatic item generation to create solutions and rationales for computerized formative testing. Appl. Psychol. Measurement 42, 42–57. doi: 10.1177/0146621617726788

Gobert, J., Sao Pedro, M., Raziuddin, J., and Baker, R. S. (2013). From log files to assessment metrics for science inquiry using educational data mining. J. Learn. Sci. 22, 521–563. doi: 10.1080/10508406.2013.837391

Gobert, J. D., and Sao Pedro, M. A. (2017). “Digital assessment environments for scientific inquiry practices,” in The Wiley Handbook of Cognition and Assessment , eds A. A. Rupp and J. P. Leighton (West Sussex: Frameworks, Methodologies, and Applications), 508–534. doi: 10.1002/9781118956588.ch21

Gobert, J. D., Sao Pedro, M. A., Baker, R. S., Toto, E., and Montalvo, O. (2012). Leveraging educational data mining for real-time performance assessment of scientific inquiry skills within microworlds. J. Educ. Data Min. 4, 104–143. doi: 10.5281/zenodo.3554645

Goksel, N., and Bozkurt, A. (2019). “Artificial intelligence in education: current insights and future perspectives,” in Handbook of Research on Learning in the Age of Transhumanism , eds S. Sisman-Ugur and G. Kurubacak (Hershey, PA: IGI Global), 224–236 doi: 10.4018/978-1-5225-8431-5.ch014

Graesser, A. C., Chipman, P., Haynes, B. C., and Olney, A. (2005). AutoTutor: an intelligent tutoring system with mixed-initiative dialogue. IEEE Trans. Educ. 48, 612–618. doi: 10.1109/te.2005.856149

GSMA Intelligence (2020). The Mobile Economy 2020 . London: GSM Association.

Harlow, L. L., and Oswald, F. L. (2016). Big data in psychology: introduction to the special issue. Psychol. Methods 21, 447–457. doi: 10.1037/met0000120

Hew, K. F., Lan, M., Tang, Y., Jia, C., and Lo, C. K. (2019). Where is the “theory” within the field of educational technology research? Br. J. Educ. Technol. 50, 956–971. doi: 10.1111/bjet.12770

Hinojo-Lucena, F. J., Aznar-Díaz, I., Cáceres-Reche, M. P., and Romero-Rodríguez, J. M. (2019). Artificial intelligence in higher education: a bibliometric study on its impact in the scientific literature. Educ. Sci. 9:51. doi: 10.3390/educsci9010051

Huang, A. Y., Lu, O. H., Huang, J. C., Yin, C., and Yang, S. J. (2020). Predicting students’ academic performance by using educational big data and learning analytics: evaluation of classification methods and learning logs. Int. Learn. Environ. 28, 206–230. doi: 10.1080/10494820.2019.1636086

Huggins, R., and Thompson, P. (2015). Entrepreneurship, innovation and regional growth: a network theory. Small Bus. Econ. 45, 103–128. doi: 10.1007/s11187-015-9643-3

Hwang, G.-J., Spikol, D., and Li, K.-C. (2018). Guest editorial: trends and research issues of learning analytics and educational big data. Educ. Technol. Soc. 21, 134–136.

Jagadish, H. V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., et al. (2014). Big data and its technical challenges. Commun. ACM. 57, 86–94. doi: 10.1145/2611567

Johnson, L., Smith, R., Willis, H., Levine, A., and Haywood, K. (2011). The 2011 Horizon Report. Austin, TX: The New Media Consortium.

Jordan, M. I., and Mitchell, T. M. (2015). Machine learning: trends, perspectives, and prospects. Science 349, 255–260. doi: 10.1126/science.aaa8415

Khechine, H., and Lakhal, S. (2018). Technology as a double-edged sword: from behavior prediction with UTAUT to students’ outcomes considering personal characteristics. J. Inform. Technol. Educ. Res. 17, 63–102. doi: 10.28945/4022

Klašnja-Milicevic, A., Ivanovic, M., and Budimac, Z. (2017). Data science in education: big data and learning analytics. Comput. Applicat. Eng. Educ. 25, 1066–1078. doi: 10.1002/cae.21844

Koh, J. H. L., Chai, C. S., and Tsai, C. C. (2014). Demographic factors, TPACK constructs, and teachers’ perceptions of constructivist-oriented TPACK. J. Educ. Technol. Soc. 17, 185–196.

Koper, R., and Tattersall, C. (2004). New directions for lifelong learning using network technologies. Br. J. Educ. Technol. 35, 689–700. doi: 10.1111/j.1467-8535.2004.00427.x

Krouska, A., Troussas, C., and Virvou, M. (2019). SN-Learning: an exploratory study beyond e-learning and evaluation of its applications using EV-SNL framework. J. Comp. Ass. Learn. 35, 168–177. doi: 10.1111/jcal.12330

Laney, D. (2001). 3D data management: controlling data volume, velocity and variety. META Group Res. Note 6, 70–73.

Lazer, D., Kennedy, R., King, G., and Vespignani, A. (2014). The parable of Google Flu: traps in big data analysis. Science 343, 1203–1205. doi: 10.1126/science.1248506

Lenschow, R. J. (1998). From teaching to learning: a paradigm shift in engineering education and lifelong learning. Eur. J. Eng. Educ. 23, 155–161. doi: 10.1080/03043799808923494

L’heureux, A., Grolinger, K., Elyamany, H. F., and Capretz, M. A. (2017). Machine learning with big data: challenges and approaches. IEEE Access 5, 7776–7797. doi: 10.1109/ACCESS.2017.2696365

Li, H., Gobert, J., and Dickler, R. (2019). “Evaluating the transfer of scaffolded inquiry: what sticks and does it last?,” in Artificial Intelligence in Education , eds S. Isotani, E. Millán, A. Ogan, P. Hastings, B. McLaren, and R. Luckin (Cham: Springer), 163–168. doi: 10.1007/978-3-030-23207-8_31

Li, P., and Jeong, H. (2020). The social brain of language: grounding second language learning in social interaction. npj Sci. Learn. 5:8. doi: 10.1038/s41539-020-0068-7

Li, P., Legault, J., Klippel, A., and Zhao, J. (2020). Virtual reality for student learning: understanding individual differences. Hum. Behav. Brain 1, 28–36. doi: 10.37716/HBAB.2020010105

Li, X. (2007). Intelligent agent–supported online education. Dec. Sci. J. Innovat. Educ. 5, 311–331. doi: 10.1111/j.1540-4609.2007.00143.x

Loftus, M., and Madden, M. G. (2020). A pedagogy of data and Artificial intelligence for student subjectification. Teach. Higher Educ. 25, 456–475. doi: 10.1080/13562517.2020.1748593

Long, P., and Siemens, G. (2011). Penetrating the fog: analytics in learning and education. Educ. Rev. 46, 31–40. doi: 10.1007/978-3-319-38956-1_4

Lu, O. H. T., Huang, A. Y. Q., Huang, J. C. H., Lin, A. J. Q., Ogata, H., and Yang, S. J. H. (2018). Applying learning analytics for the early prediction of students’ academic performance in blended learning. Educ. Technol. Soc. 21, 220–232.

Macfadyen, L. P. (2017). Overcoming barriers to educational analytics: how systems thinking and pragmatism can help. Educ. Technol. 57, 31–39.

Malik, G., Tayal, D. K., and Vij, S. (2019). “An analysis of the role of artificial intelligence in education and teaching,” in Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing , eds P. Sa, S. Bakshi, I. Hatzilygeroudis, and M. Sahoo (Singapore: Springer), 407–417.

Manuel Moreno-Marcos, P., Alario-Hoyos, C., Munoz-Merino, P. J., and Delgado Kloos, C. (2019). Prediction in MOOCs: a review and future research directions. IEEE Trans. Learn. Technol. 12, 384–401. doi: 10.1109/TLT.2018.2856808

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., et al. (2011). Big data: The Next Frontier for Innovation, Competition and Productivity. New York, NY: McKinsey Global Institute.

Mayer-Schönberger, V., and Cukier, K. (2013). Big data: A Revolution That Will Transform How we live, Work, and Think. Boston, MA: Houghton Mifflin Harcourt.

Mislevy, R. J., Yan, D., Gobert, J., and Sao Pedro, M. (2020). “Automated scoring in intelligent tutoring systems,” in Handbook of Automated Scoring , eds D. Yan, A. A. Rupp, and P. W. Foltz (London: Chapman and Hall/CRC), 403–422. doi: 10.1201/9781351264808-22

Mowery, D. C., Nelson, R. R., Sampat, B. N., and Ziedonis, A. A. (2001). The growth of patenting and licensing by US universities: an assessment of the effects of the Bayh–Dole act of 1980. Res. Pol. 30, 99–119. doi: 10.1515/9780804796361-008

Nye, B. D. (2015). Intelligent tutoring systems by and for the developing world: a review of trends and approaches for educational technology in a global context. Int. J. Art. Intell. Educ. 25, 177–203. doi: 10.1007/s40593-014-0028-6

Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453. doi: 10.1126/science.aax2342

O’Donnell, C. (2008). Defining, conceptualizing, and measuring fidelity of implementation and its relationship to outcomes in K-12 curriculum intervention research. Rev. Educ. Res. 78, 33–84. doi: 10.3102/0034654307313793

Papamitsiou, Z., and Economides, A. A. (2014). Learning analytics and educational data mining in practice: a systematic literature review of empirical evidence. Educ. Technol. Soc. 17, 49–64.

Pardo, A., and Siemens, G. (2014). Ethical and privacy principles for learning analytics. Br. J. Educ. Technol. 45, 438–450. doi: 10.1111/bjet.12152

Pedró, F., Subosa, M., Rivas, A., and Valverde, P. (2019). Artificial Intelligence in Education: Challenges and Opportunities for Sustainable Development. Paris: UNESCO.

Peters, M. A. (2018). Deep learning, education and the final stage of automation. Educ. Phil. Theory 50, 549–553. doi: 10.1080/00131857.2017.1348928

Popenici, S. A., and Kerr, S. (2017). Exploring the impact of artificial intelligence on teaching and learning in higher education. Res. Pract. Technol. Enhanced Learn. 12:22. doi: 10.1186/s41039-017-0062-8

Quadir, B., Chen, N.-S., and Isaias, P. (2020). Analyzing the educational goals, problems and techniques used in educational big data research from 2010 to 2018. Int. Learn. Environ. 1–17. doi: 10.1080/10494820.2020.1712427

Renz, A., and Hilbig, R. (2020). Prerequisites for artificial intelligence in further education: identification of drivers, barriers, and business models of educational technology companies. Int. J. Educ. Technol. Higher Educ. 17:14. doi: 10.1186/s41239-020-00193-3

Renz, A., Krishnaraja, S., and Gronau, E. (2020). Demystification of artificial intelligence in education–how much ai is really in the educational technology? Int. J. Learn. Anal. Art. Intell. Educ. (IJAI). 2, 4–30. doi: 10.3991/ijai.v2i1.12675

Roorda, D. L., Koomen, H. M. Y., Spilt, J. L., and Oort, F. J. (2011). The influence of affective teacher-student relationships on students’ school engagement and achievement: a meta-analytic approach. Rev. Educ. Res. 81, 493–529. doi: 10.3102/0034654311421793

Sao Pedro, M., Baker, R., and Gobert, J. (2013). “What different kinds of stratification can reveal about the generalizability of data-mined skill assessment models,” in Proceedings of the 3rd Conference on Learning Analytics and Knowledge (Leuven), 190–194.

Schroeck, M., Shockley, R., Smart, J., Romero-Morales, D., and Tufano, P. (2012). Analytics: the real-world use of big data. IBM Global Bus. Serv. 12, 1–20. doi: 10.1002/9781119204183.ch1

Sharples, M. (2000). The design of personal mobile technologies for lifelong learning. Comp. Educ. 34, 177–193. doi: 10.1016/s0360-1315(99)00044-5

Smutny, P., and Schreiberova, P. (2020). Chatbots for learning: a review of educational chatbots for the facebook messenger. Comp. Educ. 151:103862. doi: 10.1016/j.compedu.2020.103862

Sonderlund, A. L., Hughes, E., and Smith, J. (2019). The efficacy of learning analytics interventions in higher education: a systematic review. Br. J. Educ. Technol. 50, 2594–2618. doi: 10.1111/bjet.12720

Song, Y., Dai, X.-Y., and Wang, J. (2016). Not all emotions are created equal: expressive behavior of the networked public on China’s social media site. Comp. Hum. Behav. 60, 525–533. doi: 10.1016/j.chb.2016.02.086

Spikol, D., Ruffaldi, E., Dabisias, G., and Cukurova, M. (2018). Supervised machine learning in multimodal learning analytics for estimating success in project-based learning. J. Comp. Ass. Learn. 34, 366–377. doi: 10.1111/jcal.12263

Staats, C. (2016). Understanding implicit bias: what educators should know. Am. Educ. 39, 29–33. doi: 10.2307/3396655

Starcic, A. I. (2019). Human learning and learning analytics in the age of artificial intelligence. Br. J. Educ. Technol. 50, 2974–2976. doi: 10.1111/bjet.12879

The International Learning Sciences Forum (2019). The International Learning Sciences Forum: International Trends for Ai and Big Data in Learning Sciences. Taipei: National Taiwan Normal University.

Toh, L. P. E., Causo, A., Tzuo, P. W., Chen, I. M., and Yeo, S. H. (2016). A review on the use of robots in education and young children. J. Educ. Technol. Soc. 19, 148–163.

Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56. doi: 10.1038/s41591-018-0300-7

Tsai, C. C. (2000). Relationships between student scientific epistemological beliefs and perceptions of constructivist learning environments. Educ. Res. 42, 193–205. doi: 10.1080/001318800363836

Tsai, S. C., Chen, C. H., Shiao, Y. T., Ciou, J. S., and Wu, T. N. (2020). Precision education with statistical learning and deep learning: a case study in Taiwan. Int. J. Educ. Technol. Higher Educ. 17, 1–13. doi: 10.1186/s41239-020-00186-2

UNESCO (2015). SDG4-Education 2030, Incheon Declaration (ID) and Framework for Action. For the Implementation of Sustainable Development Goal 4, Ensure Inclusive and Equitable Quality Education and Promote Lifelong Learning Opportunities for All, ED-2016/WS/28. London: UNESCO

United Nations (2020). Policy Brief: Education During Covid-19 and Beyond. New York, NY: United Nations

VanRullen, R. (2017). Perception science in the age of deep neural networks. Front. Psychol. 8:142. doi: 10.3389/fpsyg.2017.00142

Viberg, O., Hatakka, M., Bälter, O., and Mavroudi, A. (2018). The current landscape of learning analytics in higher education. Comput. Human Behav. 89, 98–110. doi: 10.1016/j.chb.2018.07.027

Williams, P. (2019). Does competency-based education with blockchain signal a new mission for universities? J. Higher Educ. Pol. Manag. 41, 104–117. doi: 10.1080/1360080x.2018.1520491

World Development and Report (2019). The Changing Nature of Work. Washington, DC: The World Bank/International Bank for Reconstruction and Development.

Xie, H., Chu, H.-C., Hwang, G.-J., and Wang, C.-C. (2019). Trends and development in technology-enhanced adaptive/personalized learning: a systematic review of journal publications from 2007 to 2017. Comp. Educ. 140:103599. doi: 10.1016/j.compedu.2019.103599

Yadegaridehkordi, E., Noor, N. F. B. M., Ayub, M. N. B., Affal, H. B., and Hussin, N. B. (2019). Affective computing in education: a systematic review and future research. Comp. Educ. 142:103649. doi: 10.1016/j.compedu.2019.103649

Yarkoni, T., and Westfall, J. (2017). Choosing prediction over explanation in psychology: lessons from machine learning. Perspect. Psychol. Sci. 12, 1100–1122. doi: 10.1177/1745691617693393

Zawacki-Richter, O., Marín, V. I., Bond, M., and Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education–where are the educators? Int. J. Educ. Technol. Higher Educ. 16:39. doi: 10.1186/s41239-019-0171-0

Keywords : big data, artificial intelligence, education, learning, teaching

Citation: Luan H, Geczy P, Lai H, Gobert J, Yang SJH, Ogata H, Baltes J, Guerra R, Li P and Tsai C-C (2020) Challenges and Future Directions of Big Data and Artificial Intelligence in Education. Front. Psychol. 11:580820. doi: 10.3389/fpsyg.2020.580820

Received: 07 July 2020; Accepted: 22 September 2020; Published: 19 October 2020.


Copyright © 2020 Luan, Geczy, Lai, Gobert, Yang, Ogata, Baltes, Guerra, Li and Tsai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chin-Chung Tsai, [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.


A review of big data and medical research

Universally, the volume of data has increased, with the collection rate doubling every 40 months since the 1980s. “Big data” is a term that was introduced in the 1990s to denote data sets too large to be handled with common software. Medicine is a major field predicted to increase its use of big data by 2025. Big data in medicine may be used by commercial, academic, government, and public sectors. It includes biologic, biometric, and electronic health data. Examples of biologic data include biobanks; biometric data may include individual wellness data from devices; electronic health data include the medical record; and other data include demographics and images. Big data has also contributed to changes in research methodology. Changes in the clinical research paradigm have been fueled by large-scale biological data harvesting (biobanks), which is developed, analyzed, and managed by cheaper computing technology (big data), supported by greater flexibility in study design (real-world data) and by the relationships between industry, government regulators, and academics. Cultural changes, along with easy access to information via the Internet, facilitate ease of participation by more people. Current needs demand quick answers, which may be supplied by big data, biobanks, and changes in the flexibility of study design. Big data can reveal health patterns and promises to provide solutions that have previously been out of society’s grasp; however, the murkiness of international laws, questions of data ownership, public ignorance, and privacy and security concerns are slowing down the progress that could otherwise be achieved through the use of big data. The goal of this descriptive review is to create awareness of the ramifications of big data and to encourage readers that this trend is positive and will likely lead to better clinical solutions, but caution must be exercised to reduce harm.

Introduction

What is big data?

“Big data” is a term that was introduced in the 1990s to include data sets too large to be used with common software. In 2016, it was defined as information assets characterized by high volume, velocity, and variety that required specific technology and analytic methods for its transformation into use. 1 In addition to the three attributes of volume, velocity, and variety, some have suggested that for big data to be effective, nuances including quality, veracity, and value need to be added as well. 2 , 3 Big data reveals health patterns, and promises to provide solutions that have previously been out of society’s grasp; however, the murkiness of international laws, questions of data ownership, public ignorance, and privacy and security concerns are slowing down the progress that could otherwise be achieved by the use of big data. In this descriptive review, we highlight the roles of big data, the changing research paradigm, and easy access to research participation via the Internet fueled by the need for quick answers.

Universally, data volume has increased, with the collection rate doubling every 40 months, ever since the 1980s. 4 The big data age, starting in 2002, has generated increasing amounts of alphanumeric data; in addition, social media has generated large amounts of data in the form of audio and images. The use of Internet-based devices, including smart phones and computers, wearable electronics, the Internet of things (IoT), electronic health records (EHRs), insurance websites, and mobile health, all generate terabytes of data. Less obvious sources include clickstream data, machine-to-machine data processing, geospatial data, audio and video inputs, and unstructured text. In general, the total volume of data generated can only be estimated. For example, the typical personal computer in the year 2000 held 10 gigabytes of storage; recently, Facebook analyzed more than 105 terabytes of data every 30 min, including shared items and likes, which allows optimization of product features for its advertising performance; additionally, in its first year Google images used up 13.7 petabytes of storage on users’ devices. 5 , 6 It is clear that all four domains of big data (acquisition, storage, analysis, and distribution) have grown over the data life cycle. 7

Besides being statistically powerful and complex, data need to be available in real time, which allows them to be analyzed and used immediately. Big data has immense volume, dynamic and diverse characteristics, and requires special management technologies including software, infrastructure, and skills. Big data shows trends in shopping, crime statistics, weather patterns, disease outbreaks, and so on. Recognizing the power of big data to effect change, the United Nations (UN) Statistical Commission created the UN Global Working Group on Big Data in 2014. Its vision was to use big data technologies in the UN global platform to create a global statistical community for data sharing and economic benefit. 8

We aimed to write a descriptive review to inform physicians about the use of big data (biological, biometric, and electronic health records) in both the commercial and research fields. PubMed-based searches were performed and, in addition, since many of the topics were outside the scope of this database, general Internet searches using the Google search engine were performed. Searching for “Big data and volume and velocity and variety” in the PubMed database returned 45 articles in English. Papers were deemed appropriate by the consensus of at least two authors. A PubMed search for “artificial intelligence in clinical decision support” returned two relevant review articles, and the addition of “randomized control trials” returned 11 randomized control studies, of which only one was relevant. For non-PubMed-indexed scholarly articles, two authors determined relevance by the frequency with which the paper was cited or accessed online. As some content was intended to be informative rather than conclusive, commercial websites, such as those dealing with DNA testing for ancestry, were accessed. The Food and Drug Administration (FDA) website was accessed when searching for the “oldest biobank,” which revealed the HIV registry. Landmark trials were selected for changes in research design and use of big data mining.

Big data in medicine

The major fields predicted to increasingly use big data by 2025 include astronomy, social media (Twitter, YouTube, etc.), and medicine (genomics), with volumes measured in zettabytes per year (zetta = 10^21). Big data in medicine includes biologic, biometric, and electronic health data (Figure 1).

Figure 1. Big data in medicine.

Biological banks, also called biobanks, may exist at the local, national, or international level. Examples include local academic institutions, the National Cancer Institute, the United Kingdom Biobank, the China Kadoorie Biobank, and the European Bioinformatics Institute, among others. 9 Non-profit organizations may collect biological data during health fairs, for example through screening of blood pressure or urine and blood tests. Commercial biobanks include those that provide services such as saliva testing for ancestry determination. 10

Before the data can be converted to digital form, biological specimens need to be processed and preserved. Biospecimen preservation standards in the past varied by organization. In 2005, in an effort to standardize biospecimen preservation, the National Cancer Institute contributed to the creation of the Office of Biobanking and Biospecimen Research (OBBR) and the annual Biospecimen Research Network Symposium. 11 In 2009, with international support, the first biobank-specific quality standard was published; it has since been applied to many biobanks. Biobanking has evolved with regulatory pressures and advances in medical and computational information technology, and it is a crucial enterprise for the biological sciences. One of the longest existing biobanks is the University of California at San Francisco AIDS specimen bank, which has functioned for the past 30 years. 12

One thing all biobanks have in common is the need for significant resources to manage, analyze, and use the information in a timely manner. 13 Commercial biobanks include multinational companies that collect biological specimens from subjects for verification of ancestry. Subjects pay for a DNA collection kit, collect their own sample, and mail it to the company, where it is analyzed and stored. The company can then sell the data to third parties for research, subject to legislation.

The shifting paradigm in medical research

The clinical research paradigm has changed to match an increasingly older population’s needs. This change has been fueled by large-scale biological data harvesting (biobanks), which is developed, analyzed, and managed by cheaper computing technology (big data), supported by greater flexibility in study design and by the relationships between industry, government regulators, and academics. With easy access to information via the Internet, citizen science has allowed many non-scientists to participate in research. 14 Biological specimens collected via Internet-based projects may be sold to third parties for research, either as data from healthy controls or as part of a specific medical condition.

Historical precedent and its difficulties

In the past, drug development may have started with serendipity. 15 After the Second World War, the therapeutic research approach became long and expensive. The initial step was the search for possible therapies, followed by in vitro and in vivo testing via multiple phases: the first phase for safety, the second for efficacy, and the third to compare the treatment to the existing standard of care. In addition, hurdles for new drugs included FDA approval, randomized control trials (RCTs), and finally post-release studies. In some unfortunate cases, once the drug was released on the market, rare but serious adverse events would bankrupt the company, and patients who needed the therapy would still not have effective treatment choices. This was particularly hard for patients suffering from rare diseases, where the small patient population required a large investment of money and time, making a repeat study less attractive to industry. For patients with limited life spans, the long process precluded access to beneficial therapies. Understanding this need, when there was an urgency for rapid treatments, the FDA worked to expedite the release of new drugs, such as the release of new medications to treat HIV during its epidemic. 16 , 17

In the case of oncology, the historical approach of research and development (R&D) of a new drug followed by the usual phases through RCTs has been expensive. In 2018, pharmaceutical companies invested approximately 50 billion dollars in R&D for a 3% probability of success for individual projects. A 3% probability of success, despite the investment of financial and human effort, is too low for patients who may not have any treatment options. 18

Changes in research

Changes in study design.

At present, a more purposeful and organized approach is used: the responsible cause is determined first and serves as the starting point for subsequent therapy.

After completion of the Human Genome Project, technology for pinpointing mutations advanced. 19 Broad sweeps of the human genome with more than 3000 genome-wide association studies (GWAS) have examined about 1800 diseases. 20 Following GWAS or quantitative trait locus (QTL) determination, microarray data allow identification of candidate genes of interest. 21 For allelic variants to be correlated with disease, large biobanks that contain both patient and control data are compared. If a variant allele occurs at a significantly higher frequency in those with the disease, that variant can be targeted for therapy.
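As a concrete illustration of that last step, the following minimal sketch (synthetic allele counts, hypothetical variant, SciPy assumed to be available) compares the frequency of a variant allele in cases versus controls with a chi-square test; it is illustrative only and not the analysis pipeline of any biobank cited here.

```python
from scipy.stats import chi2_contingency

# Allele counts for a hypothetical variant (each subject contributes two alleles).
#            [variant, reference]
cases    =   [320, 1680]   # 1,000 patients with the disease
controls =   [210, 1790]   # 1,000 control subjects

chi2, p_value, dof, expected = chi2_contingency([cases, controls])

case_freq = cases[0] / sum(cases)
control_freq = controls[0] / sum(controls)
print(f"Variant allele frequency: cases {case_freq:.3f}, controls {control_freq:.3f}")
print(f"Chi-square p-value: {p_value:.2e}")  # a small p-value flags the variant for follow-up
```

In a real GWAS, this comparison is repeated for millions of variants, so the significance threshold must be corrected for multiple testing before any variant is considered a therapeutic target.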

In a tumor, once a driver mutation that promotes abnormal growth is identified, therapy targeting the specific genetic alteration can be attempted. 22 In the presence of multiple mutations, driver mutations are differentiated from bystander or passenger mutations, as tumors may have a heterogeneous molecular signature.

Pharmacogenomics is the foundation for precision medicine, which is now being practiced clinically in oncology and is being adapted in other fields. The introduction of molecular pathological epidemiology (MPE) allows the identification of new biomarkers using big data to select therapy 23 , 24 (Table 1). Based on an individual’s cellular genetics, drugs that target the desired mutation can be studied and effective doses determined, which can result in safe and efficient treatments.

Table 1. Examples of big data and new research design trials. (AI: artificial intelligence.)

Big data technology allows large cohorts of biological specimens to be collected, and the data can be stored, managed, and analyzed. At the point of analysis, machine learning algorithms (a subset of artificial intelligence (AI)) can generate output data that may differ from the initial input data. AI can create knowledge from big data 25 , 26 (Table 1). For example, Beck et al., 25 using a computational pathology model on breast cancer specimens with AI, found previously unknown morphologic features to be predictive of negative outcomes.
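The following minimal sketch (synthetic data and invented feature names, scikit-learn assumed to be available) shows the general pattern behind such findings: a model is trained on measured specimen features, and inspecting the fitted model reveals which features carry predictive signal. It illustrates the mechanism only and is not Beck et al.’s actual model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
feature_names = ["nuclear_area", "stromal_texture", "mitotic_count", "cell_density"]
X = rng.normal(size=(n, len(feature_names)))

# Assume, for illustration, that outcome is driven mainly by two of the features.
risk = 1.2 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=n)
y = (risk > 0).astype(int)  # 1 = poor outcome

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Ranking the learned feature importances points to the informative features.
for name, importance in sorted(zip(feature_names, model.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name:16s} importance {importance:.2f}")
```

On real specimens the interesting outcome is the reverse direction: features the model ranks highly that pathologists had not previously considered prognostic become candidates for further validation.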

Rapid learning health care (RLHC) models using AI may discover data of varying quality, which need to be compared to validated data sets to be truly meaningful. 29 Subsequently, the information extracted can be processed into decision support systems (DSS), which are software applications that can eventually bring knowledge-driven health care into practice.

AI can be classified into knowledge-based or data-driven AI. Knowledge-based AI starts with information entered by humans to solve a query in a domain of expertise formalized by the software. Data-driven AI starts with large amounts of data generated by human activity to make a prediction. Data-driven AI needs big data and, with inexpensive computing, is a promising economic choice. 30 , 31
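
A minimal sketch of the contrast just described, using invented numbers: the knowledge-based version encodes a rule supplied by a human expert, while the data-driven version derives its decision threshold from data. Both the rule and the data are hypothetical illustrations, not tools described in the cited studies.

    # Knowledge-based AI: a clinician-supplied rule, hard-coded into the software.
    def flag_hyperglycemia_rule(glucose_mg_dl: float) -> bool:
        return glucose_mg_dl > 180.0   # expert-chosen cutoff

    # Data-driven AI: the cutoff is estimated from (hypothetical) labeled observations.
    observations = [(150, False), (165, False), (190, True),
                    (210, True), (175, False), (200, True)]

    def learn_threshold(data):
        flagged = [g for g, label in data if label]
        not_flagged = [g for g, label in data if not label]
        # Simple learned rule: midpoint between the two class means.
        return (sum(flagged) / len(flagged) + sum(not_flagged) / len(not_flagged)) / 2

    learned_cutoff = learn_threshold(observations)
    print(flag_hyperglycemia_rule(185), 185 > learned_cutoff)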

The combination of AI and DSS is a clinically powerful way to improve health care delivery. For example, in a small study of 12 patients with type 1 diabetes, using AI and DSS allowed quicker changes in therapy than waiting for the next caregiver appointment, without an increase in adverse events. 32

New study designs

With new technology for diagnosing, managing, and treating diseases, modifying the RCT design was essential. The development of master clinical trial protocols, platform trials, basket/bucket designs, and umbrella designs has been seen over the last decade. 33

Basket design : A basket trial is a clinical trial in which enrollment eligibility is based on the presence of a specific genomic alteration, irrespective of histology or cell type of origin, and which includes sub-trials of multiple tumor types. To qualify for the study, thousands of patients’ records need to be screened for the relevant genomic alteration in order to enroll a small number of patients into a sub-trial.
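
A minimal sketch of that screening step: filtering a large (here, hypothetical) table of tumor sequencing results to find the few patients who carry the target alteration. The column names, mutation labels, and sample rows are assumptions for illustration only.

    import pandas as pd

    # Hypothetical sequencing results for screened patients (real screens cover thousands).
    records = pd.DataFrame({
        "patient_id": ["P0001", "P0002", "P0003", "P0004"],
        "tumor_histology": ["lung", "colon", "thyroid", "sarcoma"],
        "alteration": ["KRAS G12C", "NTRK1 fusion", "BRAF V600E", "NTRK3 fusion"],
    })

    # Basket-trial eligibility: the genomic alteration, irrespective of histology.
    eligible = records[records["alteration"].str.contains("NTRK", na=False)]
    print(eligible[["patient_id", "tumor_histology", "alteration"]])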

Usually, sub-trials are designed as early-phase, single-arm studies, with one or two stages and an option to stop early if the study is considered futile. The study design is based on determining tumor pathophysiology/activity and matching the target mutation with a hypothesized treatment. Analogous to a screening test, a responsive sub-study would require a larger confirmatory study. For example, although rare cancers are uncommon on an individual basis, the sum of these cases makes “rare cancers” the fourth largest category of cancer in the United States and Europe. 34 These are challenging to diagnose and treat and have a worse 5-year survival rate compared with common cancers. One option to help these patients would be to make them eligible for a clinical trial based on the genetic dysregulation of the tumor rather than organ histology.

Drugs have been studied for a signature driver mutation rather than for an organ-specific disease. With enough information about the molecular definitions of the targets, the focus on the cancer's site of origin is diminishing. For example, the study drug larotrectinib showed significant, sustained antitumor activity in patients with 17 types of tropomyosin receptor kinase fusion–positive cancers, regardless of the age of the patient or the tumor's site of origin. 35 , 36 This landmark drug was the first to be FDA-approved for tumors with a specific mutation rather than for a specific disease.

Basket trials may also test off-label use of a drug in patients who have the same genomic alteration for which the drug was initially approved, or it could test a repurposed drug. 37

Umbrella design : The umbrella design examines a single disease, such as lung cancer, by testing various therapies against a variety of mutations. (Ferrarotto et al.; 28 Table 1 .)

Platform trials : Big data allows the pooling of resources. Data captured about biomarker status can give patients access to various trials. Compared with a traditional RCT with one control arm and one experimental arm, a platform trial uses a single control arm that can be compared with many experimental arms, which may not all need to be randomized at the start of the trial; a platform trial may therefore be seen as a prolonged screening process. 38

Even when a traditional RCT is planned, matching various data sets with AI to run different configurations can help determine possible therapy choices and reduce time and investment outlay. In the end, this could speed up the process of drug testing and result in a quicker arrival at the RCT stage.

Adverse Drug Events ( ADE ): ADE reporting is a continuous process. Big data in medicine includes literature searches for ADEs; data mining with AI can yield better results than traditional methods with regard to accuracy and precision. 39 In addition, big data tools can visualize ADE interactions between medications and can be updated daily.

Real-world evidence

Real-world evidence (RWE) is information obtained from routine clinical practice, and its use has increased with the adoption of the EHR. RWE in digital format can be significantly furthered by big data. Clinical practice guidelines that use RWE-based insights include those of the National Comprehensive Cancer Network. In addition, the American Society of Clinical Oncology suggests using RWE as a complement to randomized controlled trials. 40 Big data in RWE allows for more rapid evaluation of therapy in the clinical setting, which is a key element in the cost of R&D of drugs. The 21st Century Cures Act (signed into law 13 December 2016) resulted in the FDA creating a framework for evaluating the potential use of RWE to help support the approval of a new indication for a drug, or to help support post-approval study requirements. 41 Focusing on EHR data, industry is starting to show interest in this new pathway to drug approvals. An example would be using natural language processing and machine learning systems to produce observational clinical studies of adequate quality to attempt to justify approval of a new drug indication. Another example is using AI technology to identify the effect of comorbidities on therapy outcomes and to identify subgroups within a single disease entity, all of which will enhance personalized medicine. RWE data that are collected include demographics, family history, lifestyle, and genetics, and can be used to predict probabilities of future disease. Once a drug is marketed, RWE along with RCT data could speed up fulfillment of FDA requirements to get the therapy to the patient or to compare drugs. A recently published study that used RWE to compare cardiovascular outcomes between different therapies was the Cardiovascular Outcome Study of Linagliptin versus Glimepiride in Type 2 Diabetes (CAROLINA) trial. (Patorno et al.; 27 see Table 1 .)

Big data: technology and security

Computing technology has become cheaper, which allows for the extensive use of big data. Big data technologies can be characterized by function as either operational or analytic ( Table 2 ). Both types of systems have specific advantages, formats, data forms, and computer network capabilities ( Figure 2 ).

Big data technology with examples of systems in use.


Big data security.

Big data security should include measures and tools that guard big data at all points: data collection, transfer, analysis, storage, and processing. This includes the security needed to protect massive amounts of dynamic data and fast processing approaches such as massively parallel processing systems. The risks to data include theft, loss, or corruption, whether through human error, inadequate technology (for example, the crash of a server), or malicious intent. The potential loss of privacy with health-related information adds to the need for greater security and exposes the organizations involved to financial losses, fines, and litigation.

Processes to prevent data loss and corruption need to be in place at each access point; for example, during data collection, incoming threats must be intercepted. Security measures include encrypting data at input and output points, allowing only partial data volumes to be transferred and analyzed, separating storage compartments in cloud computing, and limiting access with firewalls and other filters. 45 For example, blockchain technology is a security tool that can authenticate users, track data access, and, because of its decentralized nature, limit the volume of data retrieved. 46 Standardizing big data security continues to be an area where further research and development is required. A review of 804 scholarly papers on big data analytics found data security to be a major challenge in managing large volumes of sensitive personal health data. 47
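
As one hedged illustration of "encrypting data at input and output points," the sketch below uses symmetric (Fernet) encryption from the Python cryptography package to protect a record before it leaves a collection point. The record contents are invented, and key management, access control, and the other measures listed above are outside this toy example.

    # Illustrative only: encrypt a health record at the point of collection.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # in practice, held by a key-management service
    cipher = Fernet(key)

    record = b'{"patient_id": "P0001", "glucose_mg_dl": 185}'   # hypothetical record
    token = cipher.encrypt(record)     # what gets transferred or stored

    # Only holders of the key (e.g., the analysis environment) can recover the data.
    assert cipher.decrypt(token) == record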

With changes in the scientific method, difficulties are to be expected. Examples of big data used with non-traditional research techniques, and their negative consequences, are listed in Table 3 . These include the preemptive release of drugs to the market, as in the Bellini trial; loss of privacy for the relatives of criminals who underwent ancestry determination; and questions about the ownership of data. Whether the developing research systems will justify the trust invested in them by altruistic participants, patients, and physicians remains to be seen. Government regulators are part of the struggle as well, as a shifting legal framework could challenge everyone involved ( Table 3 ).

Weaknesses and consequences faced by big data in the changing research landscape.

RCTs: randomized control trials; FDA: Food and Drug Administration; AI: artificial intelligence.

Changing cultural context and the physician

All hospitals collect biological specimens as part of their routine workflow, routine blood tests being one example. In an ideal world, many doctors would like to do some research; in the real world, however, research is performed by a minority of physicians. A survey of physicians across two hospitals in Australia found that physicians were interested in having biobanks in hospitals; 64 however, large biobanks may be more efficient and financially viable. Rather than discarding routinely collected specimens, ways to capture this potential resource should be explored. One option is to examine how to close the gap between those who routinely prepare the specimens, those who store them, and those who use the information for research. One such project, Polyethnic-1000, includes the collection of biological specimens from minority populations via community and academic hospitals in New York City. 65

Correlations between genetics and disease, and connections that were not obvious in the past, can become visible as the data set increases in size. Instead of starting with people who have the disease, in whom the new drug is tested in an RCT, and then waiting for post-marketing study outcomes, large collections of genetic and demographic data (including family history, lifestyle, etc.) can be used to show the risk of disease in a population and to predict whether risk modification can prevent illness. The shift toward prevention rather than cure may get a big boost from big data. In those with the disease, cellular specifics (receptors and cytokines, along with gene variants) can indicate which sites to target (increasing or decreasing their effects) in order to develop therapies personalized for that subset of the same disease.

The growth of the Internet over the last 20 years and the creation of open access to scientific literature have made essentially unlimited medical information available to patients. 66 This has led to the direct use of products and practices by the general public, at times eliminating the need for the clinician’s input. Lack of transparency has created an inconsistently safe environment, and this is especially true among those who participate in social media research. Minimally invasive activities such as mailing a saliva swab for genetic testing, often done out of curiosity about one’s ancestry, contribute to the collection and sale of large amounts of genetic information to third parties. The loss of privacy is a clear risk outlined in the several pages of online consent that most subjects will probably not read. 67 , 68 Many private organizations hold large data banks containing more than a million biospecimens. In the past, medical big data may have seemed more aspirational than practical, with both physicians and the general public unaware of its risks and benefits.

For physicians, researchers, and the general public, the flexibility to find answers rapidly is more vital to our well-being today than ever before. For example, during the coronavirus disease 2019 (COVID-19) pandemic, the FDA has engaged directly with more than 100 test developers since the end of January 2020. This unprecedented FDA policy aims to make rapid and widespread testing available. According to the policy update, responsibility for the tests, including those by commercial manufacturers, is being shared with state governments, and these laboratories are not required to pursue emergency use authorization (EUA) with the FDA. 69

An example of big data with an alternate research paradigm using public participation in the COVID-19 pandemic could be the direct-to-consumer marketing of a quantifiable antibody home test for COVID-19. The FDA is working with the Gates Foundation to produce a self-test kit for COVID-19 based on a nasopharyngeal swab. 70 If a biobank registry is subsequently created for COVID-19, it would provide tremendous information, including, but not limited to, an accurate mortality rate and the identification of those who have high antibody levels. The identification of participants with high antibody levels may then allow them to donate antibodies to those at risk for worse outcomes.

Limitations of the article

This article addresses various aspects of data and medical research and is limited to a relevant analysis of the literature rather than an exhaustive review. The most cited or most frequently accessed articles have been used as references. Changes in the many aspects of data handling, from collection to security, reflect rapidly changing technology. Information that once had physical restrictions and was held on controlled premises has migrated to the cloud with digital transformation. In addition, dynamic factors such as enterprise mobility, and even the current COVID-19 lockdown, have changed the way people work. A comprehensive review and in-depth analysis would be beyond the scope of a review article.

Final thoughts

The increasing use of big data and AI with large, heterogeneous data sets for analysis and predictive medicine may result in more contributions from physicians, patients, and citizen-scientists without having to go down the path of an expensive RCT. The formative pressures among altruistic public participants, government regulators, Internet-using patients in search of cures, clinicians who refer patients, and industries seeking to reduce cost, all supported by cheaper technology, will determine how new therapies are brought into use. Increased government interest and funding in this area is evident in programs like the “All of Us” initiative. 71 At present, pressing needs in the COVID-19 pandemic force flexibility among all interested parties to conduct investigations and find answers quickly.

Personalized health care is expanding rapidly, with more clues for cures than ever before. Each solution presented brings its own set of problems, which in turn need new solutions. Collaboration across silos, including government agencies, commercial manufacturers, researchers, and the public, needs to be flexible to help the greatest number of patients. Big data and biobanks are tools needed for basic research, which, if successful, may lead to new therapies and clinical trials, and ultimately to new cures. Data that are collected, analyzed, and managed still need to be converted into insight with the goal of “first do no harm.” All involved must share the common goals of data security and transparency to continue to build public trust.

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval: Ethical approval for this study was waived by “Institutional Review Board of State University of New York at Downstate” because “this is a review article and considered exempt.”

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Informed consent: Written informed consent was obtained from all subjects before the study.


  • Survey paper
  • Open access
  • Published: 29 September 2017

A bibliometric approach to tracking big data research trends

  • Ali Kalantari 1 ,
  • Amirrudin Kamsin 1 ,
  • Halim Shukri Kamaruddin 2 ,
  • Nader Ale Ebrahim 3 ,
  • Abdullah Gani 1 ,
  • Ali Ebrahimi 1 &
  • Shahaboddin Shamshirband 4 , 5  

Journal of Big Data, volume 4, Article number: 30 (2017)

The explosively growing volume of data from mobile devices, social media, the Internet of Things, and other applications has highlighted the emergence of big data. This paper aims to determine worldwide research trends in the field of big data and its most relevant research areas. A bibliometric approach was applied to a total of 6572 papers, including 28 highly cited papers; only papers published in the Web of Science™ Core Collection database from 1980 to 19 March 2015 were selected. The results were refined to the Web of Science categories relevant to computer science, and the bibliometric information for all the papers was then obtained. Microsoft Excel 2013 was used to analyze the general concentration, dispersion, and movement of the pool of data from the papers. The t test and ANOVA were used to test the hypotheses statistically and to characterize the relationships among the variables. A comprehensive analysis of publication trends is provided by document type and language, year of publication, contribution of countries, journals, research areas, Web of Science categories, authors, and author keywords and KeyWords Plus. In addition, the novelty of this study is that it provides a formula, derived from multiple regression analysis, for predicting citations based on the number of authors, number of pages, and number of references.

Introduction

The age of big data has arrived, and big data plays a significant role in the current Information Technology (IT) environment [ 1 ]. In 2015, there were over 3 billion Internet users around the world [ 2 ]. Accordingly, data have become more complex due to the increasing volume of structured and unstructured data produced by social media, the Internet of Things (IoT), multimedia, and a growing number of other applications [ 3 , 4 ]. Scientists commonly describe big data in terms of four V’s: volume, velocity, variety, and veracity. Another study [ 5 ] presented three more V’s: validity, volatility, and the special V of value. Researchers have highlighted that the needs for big data are increasing, which will have a powerful impact on computer science, healthcare, society, educational systems, social media, government, economic systems, and Islamic studies [ 6 , 7 , 8 , 9 , 10 ]. In [ 11 ], the state of the art of big data indexing using intelligent and non-intelligent approaches is investigated in order to show the strength of machine learning techniques for big data.

To identify the gaps in big data research trends across different fields, researchers have to investigate or review comprehensive sources and databases of the papers published in the field. We found that Web of Science (WoS) is the most complete and well-known online scientific citation index, provided by Thomson Reuters [ 12 , 13 ]. Hence, it is a valuable reference for researchers seeking to find and publish the latest technologies, trends, enhancements, experiments, challenges, and opportunities in research. In 1955, Garfield, E. wrote a paper entitled “Citation index for science: a new dimension in documentation through association of ideas” that introduced contemporary scientometrics. The Science Citation Index (SCI), however, has been used as a principal indexing tool since 1964 [ 14 , 15 , 16 ].

Bibliometrics is defined as the application of mathematical and statistical methods to papers, books, and other means of communication in the analysis of science publications [ 17 ]. To recognize research trends, bibliometric methods are usually used to evaluate scientific manuscripts [ 18 , 19 ]. Bibliometric methods have been used to measure scientific progress in many disciplines of science and engineering, and are a common research instrument for the systematic analysis of publications [ 20 , 21 , 22 , 23 , 24 ]. In this research, bibliometric analysis is employed in the field of “big data”.

Highly cited papers have a greater chance of visibility, thus attracting greater attention among researchers [ 25 ]. Evaluating the content of top-cited publications is very useful for obtaining information about the trends of specific fields from the perspective of research progress [ 26 ]. It can reveal to researchers how to find the best field or best journal in which to succeed with their publication. Although citation counts are not a scientific tool for assessing a publication, they are a valuable metric for recognizing research parameters [ 27 ]. The citation index, as a type of bibliometric method, shows the number of times an article has been used by other papers [ 28 ]. Hence, citation analysis helps researchers obtain a preliminary idea of the articles and research that have an impact in a particular field of interest, and it deals with the examination of the documents cited by scholarly works [ 29 , 30 ]. In addition, various bibliometric studies have been conducted based on different metrics and applications, such as forecasting emerging technologies using bibliometric and patent analysis [ 31 ], multiple regression analysis for Japanese patents [ 32 ], medical innovation using medical subject headings [ 33 ], analyses based on regions or countries [ 34 , 35 , 36 ], the number of authors [ 37 , 38 ], and a bibliometric analysis based on the numbers of publications and cited references [ 39 ].

In this study, bibliometric tools have been selected to determine the major and essential research trends in the field of big data research, and the most relevant research areas upon which big data has a significant impact. The Thomson Reuters Web of Science (WoS) database is used to extract the bibliometric information for “big data”. The WoS is a structured database that indexes selected top publications covering the majority of significant scientific results [ 13 ]. A total of 6572 papers were collected from WoS, and the aim of this research is to provide a comprehensive analysis and evaluation of the latest research trends by document type and language, publication output, contribution of countries, top WoS categories and journals, top authors, top research areas, and analysis of author keywords and KeyWords Plus related to the field of big data and its most relevant research areas. In addition, this paper analyzes the number of citations based on multiple characteristics of a research paper: the number of authors, number of pages, and number of references. Therefore, our analysis makes an important contribution for researchers interested in the field of big data because we outline research trends and identify the most relevant research areas to be taken into consideration when conducting future research on big data. We also provide a wide-ranging analysis of the research areas most used in the field of big data, along with emerging research streams. Thus, this study will be useful for researchers in determining the areas of big data research that have received broad focus, along with the gaps that should be addressed. The rest of this paper is organized as follows: the “ Methodology ” section discusses the proposed methodology; the “ Results and discussion ” section presents the results and discussion; and the “ Conclusion ” section concludes this work.

Methodology

The methodology is based on bibliometric techniques, which permit a robust analysis of “Big Data Research” publications at different levels. The proposed methodology depends on the quantitative analysis of all publications in the field, selected based on keyword searches in paper titles. In order to define the initial keywords, 30 documents from various sources relevant to the topic of “big data” were reviewed. Based on interviews with experts in the related field, the keywords were modified. The comments and suggestions from the interviewees were used to finalize the keyword list.

To achieve a complete set of keywords, an online questionnaire was designed and distributed to about 400 people through posts in “big data” community groups on social media platforms such as Facebook and LinkedIn. In addition, several questionnaires were sent by email to authors whose papers had been reviewed. The purpose of the survey was to obtain the participants’ comments and then analyze the collected data to determine the percentage of correct keywords that had been chosen and other relevant keywords that might be used for this study. The data collection process is illustrated in Figs.  1 and 2 , which show the results of the survey analysis; a total of 142 responses were received. After comparison with the survey results, the final research keywords relevant to the field of big data are shown in Table  1 .

Data collection processes

Survey results (number of responses)

The data for this paper were derived from the online version of the Web of Science™ (WoS) Core Collection database, which consists of the Science Citation Index Expanded (SCI), Social Science Citation Index (SSCI), and Arts & Humanities Citation Index (A&HCI) from 1980 to 19 March 2015, and the Conference Proceedings Citation Index—Science (CPCI-S) and Conference Proceedings Citation Index—Social Science & Humanities (CPCI-SSH) from 2004 to 19 March 2015. To ensure that each article was relevant to the research topic, the titles of the published papers in WoS were scrutinized for the list of keywords in Table  2 . The wildcard (*) and the Boolean operator (OR) were used with combinations of keywords to refine the results. The preliminary results included 11,307 papers. The results were refined to the WoS categories relevant to computer science, and the bibliometric information for all the papers was then obtained (Additional file 1 ). The final result consists of data for 6572 papers, which were downloaded into a Microsoft Excel spreadsheet. There are different ways of calculating author-level impact: the number of article citations, the number of publications, or a combination of publication and citation counts to create a “hybrid indicator” [ 13 ]. In this study, citation counts were selected for evaluation. Following this, 28 highly cited papers were selected according to the Essential Science Indicators (ESI) provided by Thomson Reuters [ 40 ]. Since citation rates vary by field and older papers are cited more than recent papers, the selection of highly cited papers is an important issue [ 41 ]. The selection procedure is summarized in Additional file 1 , and the citation report for the highly cited and all papers is shown in Table  3 . In summary, highly cited papers are those ranked within the top 1% over the past 10 years [ 42 , 43 ].
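
A hedged sketch of the kind of title search described above: combining keywords with the wildcard (*) and the Boolean OR into a single query string. The keywords shown are illustrative stand-ins, not the study's actual keyword list (Table 2); the TI= field tag follows common Web of Science advanced-search syntax.

    # Build a title-only query from a keyword list using wildcards and OR.
    # The keywords below are placeholders, not the study's finalized list.
    keywords = ['"big data"', "data analytic*", "hadoop", "mapreduce", "data warehous*"]

    query = "TI=(" + " OR ".join(keywords) + ")"
    print(query)
    # TI=("big data" OR data analytic* OR hadoop OR mapreduce OR data warehous*)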

Besides the highly cited papers reported by ESI, the citations per year were calculated by dividing an article's total citations by its life years (a brief sketch of this calculation follows the list below). Citations per year are a more accurate and more meaningful measure than total citations for identifying top-cited papers [ 44 , 45 ]. Citation statistics produced for a period of less than 3 years may not be sufficiently stable [ 46 , 47 ]; therefore, we selected only the papers published up to 19 March 2015 for citation analysis. The rest of the analyses were based on the whole dataset. The emphasis of the research was to describe trends in big data research from the following five aspects:

Trend of publications during 1980–2015

Analysis of distribution of author keywords

Analysis of distribution of KeyWords Plus

Comparison of papers citation based on author keywords with the KeyWords Plus

Citation analysis of the research output
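
As flagged above, a minimal sketch of the citations-per-year calculation (total citations divided by the article's life years) follows; the way life years are counted here and the sample rows are assumptions for illustration.

    # Citations per year = total citations / life years of the article (illustrative).
    def citations_per_year(total_citations: int, publication_year: int, census_year: int = 2015) -> float:
        life_years = max(census_year - publication_year + 1, 1)  # count the publication year itself
        return total_citations / life_years

    # Hypothetical records: (title, publication year, total citations)
    papers = [("Paper A", 2012, 240), ("Paper B", 2005, 300)]
    for title, year, cites in papers:
        print(title, round(citations_per_year(cites, year), 1))
    # Despite fewer total citations, Paper A ranks higher per year (60.0 vs 27.3).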

The research provides a guideline, based on publication trends and impact, for future research. After classifying and extracting the data, the analysis was carried out; the observations include statistical descriptions, statistical tests, ANOVA, and regression analysis of the chosen factors. The ANOVA tables were constructed using Microsoft Excel 2013 [ 48 ], which can be used to show the statistical relationship between two variables. In addition, a free version of the StatPlanet Plus [ 49 ] software was used in this study to create an interactive world map visualizing the distribution of all papers among different countries.

Four two-variable relationships were observed in this study. The first was the relationship between the number of publications of each country in all papers and its impact in highly cited papers. The second was the relationship between the number of an author’s publications in all papers and the impact in highly cited papers. The third was the relationship between the number of publications of journals in all papers and the impact in highly cited papers. Microsoft Excel 2013 was used to construct the ANOVA tables and to assess each two-factor relationship. The fourth was the relationship between the author keywords and KeyWords Plus in all papers and the impact in highly cited papers.

The last part of the analysis focused on the multiple regression of three factors against the number of citations. The three independent variables in this regression were the number of authors, number of pages, and number of references from each of the 6572 papers, and the dependent variable was the number of citations of each paper. Using Microsoft Excel 2013, t-tests were constructed to analyze how strongly these three factors contribute to the number of citations. At the end of this analysis, the study produced a multiple regression equation, which could be used to forecast the number of citations given the number of pages, authors, and references.

Results and discussion

As discussed above, the total of 6572 papers was refined to the WoS categories relevant to computer science (Additional file 1 ). Various bibliometric tools evaluating different metrics were then selected to determine the research trends based on the two groups extracted from WoS: the All Papers (6572) and Highly Cited Papers (28) groups. In addition, we did not correct for double counting of papers co-authored across multiple countries; therefore, the counts shown below or in other tables may exceed the actual totals of 28 highly cited papers and 6572 papers overall.

Table  4 shows the highly cited papers sorted by publication year, which were used for analysis and comparison with the total number of publications (the all papers group).

Document type and language

Table  5 illustrates the distribution of document types in both groups (all and highly cited papers). In the all papers group, proceedings papers (62.73%) and articles (38.61%) are the main contributors, whereas in the highly cited group, articles (89.28%) are by far the largest contributor. We found that English was the dominant language, with 6549 records (99.65%). Other, less significant document types for all papers were editorial material, review, meeting abstract, news item, book review, letter, correction, software review, book chapter, item about an individual, note, and reprint; for highly cited papers, the remaining types were review and proceedings paper, at lower percentages (Table  5 ).

Even though proceedings papers and articles were the most common document types, we also considered all the document types, as shown in Table  5 . We believe that each document type has its own intrinsic value and provides insight into the research trend.

Publication trends: annually, regions/countries, contribution of countries

Publication output.

Recent research concentration is reflected in publication output [ 76 , 77 ]. Figure  3 shows the number of published items spanning 36 years, from 1980 to 19 March 2015. In general, the number of publications increased over the period studied. A large jump from one year to the next can be seen from 2012 to 2013, with a difference of 395 publications. If the three most recent years are excluded, the largest drop occurred between 2009 and 2010, when the number of published items fell by 146. In the first 16 years shown in Fig.  3 , the publication count increased steadily but in very small increments. The trend of the graph also shows a sudden jump in the initial period, after a long steady stretch, between 1993 and 1994. On average, 182.56 papers were published per year, with a standard deviation of 280.03. It can be concluded that the distribution is skewed to the right, which is consistent with the large standard deviation computed in Excel.

Number of published items from 1980 to 19 March 2015 (citation statistics produced for a period of less than 3 years may not be sufficiently stable, as indicated in blue color bars)

In general, the number of publications increased over the years considered, but in 2009, 2012, and 2013 we observed sudden spikes in the number of publications. On further analysis, we found more research activity reported in proceedings papers in those years compared with the preceding years. For instance, there were 556 proceedings papers in 2012 versus 260 in 2011.

Contribution of regions/countries

It has been reported that “Each author of an article has made an independent contribution to the manuscript and therefore the institution and country the author affiliated could be considered the important contributors for the evaluation of research” [ 78 , 79 ]. Consequently, the publication counts for each country were used to evaluate the research contribution of each region/country in the related field.

Figure  4 shows the geographical distribution of the published papers in the world relevant to the field of big data. As shown in Fig.  4 , the USA (1852) was the most productive country with the largest number of publications regardless of the participation of international collaborators, followed by China (1059), Germany (303), England (285), Spain (282), Canada (255), India (253), France (226), Italy (198) and Australia (193).

Distribution of all papers in the world

The geographical world map shows that there has been a gradual increase in the number of publications in North and South America, and that this region has a higher impact in the world. We found that among the 196 countries of the world, 96 countries, such as South Africa and several countries in the Middle East, have no publications. The present results show that big data is a growing area of research in most countries.

Analysis of countries between all and highly cited papers

The comparison of the top ten countries with the highest numbers of publications for all and highly cited papers is shown in Additional file 1 . The top three countries in both groups were the USA, China, and Germany, with a combined total of over 3000 publications. Over the 36 years, the average for the USA was 51.44 publications per year, which was over 22 publications (22.02) higher than China’s yearly average. Our analysis shows a strong positive correlation between the number of publications by country in all papers and in highly cited papers: the coefficient of determination between these two groups was 0.90. The ANOVA table extracted from Excel is shown in Additional file 1 . The F value retrieved from the ANOVA table was 10.54, higher than the F-crit. (3.88) at the 5% level of significance. Based on this result, the null hypothesis was rejected, and there was strong evidence that the number of publications of each country in all papers had an impact on the number of that country’s publications in highly cited papers.
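
A hedged sketch of this kind of two-group comparison: the coefficient of determination is the square of the Pearson correlation, and scipy's one-way ANOVA plays the role of the Excel ANOVA table. The all-papers counts below are the figures quoted in this section, while the highly cited counts are invented for illustration, so the printed values are not the study's results.

    import numpy as np
    from scipy.stats import f_oneway, pearsonr

    # Publication counts for the same ten countries in the two groups.
    all_papers   = np.array([1852, 1059, 303, 285, 282, 255, 253, 226, 198, 193])
    highly_cited = np.array([14, 6, 3, 2, 2, 1, 1, 1, 1, 1])   # hypothetical counts

    r, _ = pearsonr(all_papers, highly_cited)
    print("r-squared:", round(r**2, 2))          # strength of the linear relationship

    f_stat, p_value = f_oneway(all_papers, highly_cited)
    print("F:", round(f_stat, 2), "p:", round(p_value, 4))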

Analysis of web of science categories and journals between all and highly cited papers

The comparison of the top ten WoS categories based on the total number of publications in all and highly cited papers is shown in Additional file 1 . In total, there were 96 WoS categories, of which Computer Science Theory Methods (2624) had the highest number of publications in the all papers group, while in the highly cited group the top WoS category was Computer Science Artificial Intelligence (10). It is evident that computer science categories and related subjects are the leading fields in big data research.

According to Garfield, E. [ 80 ], “Journal impact factors generally involve relatively large populations of articles and citations”. One of the most striking metrics for evaluating the contribution of a journal is the Journal Impact Factor [ 81 ].

Table  6 depicts the top 10 journals for all and highly cited papers by their impact factors for 2012 and 2013. Researchers commonly believe that a high journal impact factor indicates a high-value journal. However, we found that some journals with lower impact factors contained highly cited papers.

Overall, there were 2866 source titles, including journals and conferences. The average per source title was 2.29, showing that more than two papers were published for each source title. The coefficient of determination (r-squared) of these two groups was 0.21, so the two groups (all and highly cited papers) had a weak linear relationship. The constructed ANOVA table was examined to show how these two arrays of data are related. Table  7 highlights the results of the ANOVA, which was set at the 5% significance level. The null hypothesis is rejected, as the F value equals 790.75, far greater than the F-crit. (3.84). The result shows strong evidence that the number of publications of a source title in all papers has an impact on its number of publications in highly cited papers.

Analysis of authors between all and highly cited papers

The comparison of the top ten authors with the highest numbers of publications in the all and highly cited papers groups is shown in Additional file 1 . In total, there were 14,949 authors across all papers, and the average was 2.27, meaning that there were more than two authors for each paper (1.45 papers per author, with a standard deviation of 1.45). There were five authors with two publications in the highly cited group and 105 authors with one publication. The coefficient of determination (r²) for these two groups was 0.02, implying that the two data sets have close to no linear relationship. For further analysis, an ANOVA table was constructed to examine the relationship between the all and highly cited papers groups; the results, at the 5% significance level, are shown in Additional file 1 . The null hypothesis is rejected, as the F value equals 14,909.21, far greater than the F-crit. (3.84), providing strong evidence that the number of an author’s publications in all papers had an impact on that author’s number of publications in highly cited papers.

Analysis of research areas between all and highly cited papers

This section presents the research areas for the all and highly cited papers groups, sorted by the number of records based on total publications. Overall, there were 54 research areas for the all papers group and 11 research areas for the highly cited group. As shown in Additional file 1 , the top two research areas for both groups were Computer Science and Engineering, with the highest numbers of records; the numbers of publications in the other research areas show that the two groups differ slightly. The result shows that the most relevant research area was Computer Science, with the highest number of publications. However, there were other important research areas with lower numbers of records. For instance, in the all papers group these were led by “Telecommunications” (527), “Operations Research Management Science” (194), and “Medical Informatics” (184), and in the highly cited group by “Mathematics” (4), “Biochemistry Molecular Biology” (3), and “Biotechnology Applied Microbiology” (3).

Analysis of author keywords and KeyWords plus

Author keywords are one of the essential types of information about research trends from the researchers’ point of view and have been proven important for monitoring the development of science [ 82 , 83 , 84 ].

In this study, 10,002 author keywords from 1980 to 19 March 2015 were used for the analysis. Table  8 depicts the 20 most frequent author keywords used in all papers. The author keywords were compared by the total number of records across three different periods. The distribution of each keyword helps researchers identify the importance of each author keyword in different years or decades. The results show that, among the top 20 author keywords, only a few were used between 1980 and 1999. These included “machine learning”, “data warehouse”, “data mining”, “classification”, “neural networks” and “clustering”. Over the past 15 years, however, keyword use increased. For instance, only 6% of the uses of “machine learning” occurred between 1980 and 1999, 40% from 2000 to 2009, and 53% between 2010 and 2015. This means that more focus was given to each keyword from 2000 to 2015. In addition, machine learning was the most frequent author keyword and has added great value to big data. The main objective of machine learning is to learn from data in order to make suitable decisions. Data in the context of “big data” refers to complex data that are not easy to process on a single machine learning platform; therefore, a platform such as Hadoop is essential for running machine learning on big data [ 85 ].

In another example, the keyword “big data” itself was used only from 2010 to 2015. Table  8 also reveals that the most frequent author keywords played a pivotal role from 2000 to 2015. Thus, researchers are able to evaluate the latest research trends based on the most frequent author keywords relevant to the field in any particular decade, and to see how effectively any keyword might be used to extend a research study. In addition, “mapreduce”, “data warehouse”, “big data”, “hadoop” and “cloud computing” showed high growth in the ranking of author keyword frequency.
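
A minimal sketch of counting author-keyword frequency by period, of the kind underlying Table 8; the records are invented and the period boundaries follow the three windows used above.

    from collections import Counter

    # Hypothetical (year, author keyword) records extracted from a bibliographic export.
    records = [(1998, "data mining"), (2004, "machine learning"), (2011, "big data"),
               (2013, "hadoop"), (2013, "big data"), (2014, "mapreduce"), (2014, "big data")]

    def period(year: int) -> str:
        if year <= 1999:
            return "1980-1999"
        return "2000-2009" if year <= 2009 else "2010-2015"

    # Count how often each keyword appears in each period.
    counts = Counter((period(y), kw.lower()) for y, kw in records)
    for (window, kw), n in sorted(counts.items()):
        print(window, kw, n)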

Furthermore, another metric based on titles, known as KeyWords Plus, was used to evaluate the publications. It has been shown that [ 86 , 87 ], “KeyWords plus, which provides search terms extracted from the titles of papers cited in each new article in the ISI database, is an independent supplement for title-words and author keywords”.

Figure  5 shows the top 10 most frequently used KeyWords Plus terms out of a total of 3750 keywords. Based on Table  8 and Fig.  5 , the keywords shared between the most frequent author keywords and KeyWords Plus were “classification(s)/classifier(s)”, “neural network(s)”, “support vector machine(s)”, “performance” and “mapreduce”, which played a pivotal role across all keywords. In addition, there were other keywords with significant growth in percentage that did not appear in both the author keyword and KeyWords Plus lists. This means that several types of analysis were used in big data research.

Top 10 most frequently used terms of KeyWords plus (from 1980 to 2015)

Moreover, an ANOVA test was used to analyze the relationship between the author keywords and KeyWords Plus. As mentioned above, there were 10,002 author keywords and 3750 KeyWords Plus terms across all papers (6572). The average for a single author keyword was 1.84 (with a standard deviation of 8.35), meaning that there was more than one author keyword for each paper. The high standard deviation resulted from the wide range of keyword frequencies; in other words, the frequency distribution was widely spread. In addition, the average for a single KeyWords Plus term was 2.16 (with a standard deviation of 5.93), showing that there were more than two KeyWords Plus terms for each paper. The standard deviation of KeyWords Plus is lower than that of the author keywords because the distribution of KeyWords Plus was more condensed and concentrated. The coefficient of determination (r-squared) for these two groups, as retrieved from Excel, was 0.012, which indicates that the two sets of data had a very low linear relationship. Table  9 shows the ANOVA table for the KeyWords Plus and author keyword data sets at the 5% significance level. From the outcome of the ANOVA table, the null hypothesis is rejected, as the F value equals 121.97, which is greater than the F-crit. (3.84). In summary, there is strong evidence that the frequency of author keywords had an impact on the KeyWords Plus across the total number of publications.

Multi-regression analysis

Three factors were used in the multiple regression for each paper: the number of pages (NP), the number of references (NR), and the number of authors (NA). Microsoft Excel 2013 was used for the regression analysis to examine the effect of these factors on the number of citations (NC) over the total number of papers. The result of the analysis is shown in Table  10 , which indicates how each factor contributes to the value of the dependent variable. The coefficient column in Table  10 shows the effect of each factor on the number of citations. Below is the multiple regression equation produced from Excel, which relates these factors to the number of citations:

Given the number of authors, number of pages, and number of references, the number of citations can be predicted using this multiple regression. The t-test for each coefficient is displayed in Table  10 . For the number of pages and the number of references, the P-values are very small, so these factors are acceptable in the multiple regression equation. For the number of authors, the coefficient was rejected at the 10% significance level (but not at the 5% significance level). Consequently, for this multiple regression, all the mentioned factors were acceptable for determining the number of citations with the weights given above. Therefore, all of the factors had very small P-values except the number of authors, which needed a smaller significance level in order not to be rejected.
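
Since the regression equation itself is not reproduced here, the sketch below shows how such a model could be fitted in principle: an ordinary least-squares regression of citation counts on the number of authors, pages, and references, using invented data rather than the study's 6572 papers or its coefficients.

    import numpy as np

    # Hypothetical per-paper data: [number of authors, number of pages, number of references]
    X = np.array([
        [3, 10, 25],
        [5, 14, 40],
        [2,  8, 18],
        [7, 20, 60],
        [4, 12, 35],
        [6, 16, 48],
    ], dtype=float)
    citations = np.array([12, 30, 8, 55, 22, 41], dtype=float)

    # Add an intercept column and solve the least-squares problem.
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, citations, rcond=None)
    intercept, b_authors, b_pages, b_refs = coeffs

    def predict(n_authors, n_pages, n_refs):
        return intercept + b_authors * n_authors + b_pages * n_pages + b_refs * n_refs

    print(round(predict(4, 12, 30), 1))   # predicted citations for a hypothetical paper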

Conclusion

The bibliometric analysis of big data revealed the worldwide research trends and performance in the subject areas. So far, there are significant gaps in current research on the bibliometric analysis of big data. In this study, selected keywords were used to extract the most relevant papers from the Web of Science™ (WoS) Core Collection database, which consists of the Science Citation Index Expanded (SCI), Social Science Citation Index (SSCI), and Arts & Humanities Citation Index (A&HCI) from 1980 to 19 March 2015, and the Conference Proceedings Citation Index—Science (CPCI-S) and Conference Proceedings Citation Index—Social Science & Humanities (CPCI-SSH) from 2004 to 19 March 2015. In total, 6572 papers (including 28 highly cited papers reported by ESI) were refined to the WoS categories relevant to computer science, and the bibliometric information of all the papers was then obtained. In total, 2866 source titles, including journals and conferences, were listed in 96 Web of Science categories across 14 different document types. English was the dominant language, with 6549 records (99.65%), and the five most popular research areas were Computer Science, Engineering, Telecommunications, Operations Research Management Science, and Medical Informatics. The findings showed that the USA, China, and Germany were the most productive countries, playing a predominant role in this study with the highest numbers of published papers in the world. Other top countries, as mentioned above, also made significant contributions to the field. In contrast, there is an essential lack of big data research in the 96 countries with no publications.

The analysis of authors in the all and highly cited papers groups showed that there were 14,949 authors across all papers, with an average of 2.27, meaning that there were more than two authors for each paper. In addition, there were five authors with two publications and 105 authors with one publication in the highly cited group. The analysis of the author keywords showed that, among the top 20 most frequent keywords, there were fewer records from 1980 to 1999; most keyword use occurred from 2000 to 2015. The analysis of KeyWords Plus revealed that the keywords shared with the author keywords were “classification(s)/classifier(s)”, “neural network(s)”, “support vector machine(s)”, “performance”, and “mapreduce”, which played a pivotal role across all keywords. The analysis of the 20 most frequent author keywords used from 1980 to 2015 shows that all 20 are increasing and none declined during that period. Moreover, a multiple regression analysis of the number of pages, number of references, and number of authors against the number of citations was performed, and the resulting formula is provided in this study.

The findings of this study provide researchers with a panorama of worldwide big data research and establish directions for further study in the field and its most relevant research areas.

Wu X, et al. Data mining with big data. Knowl Data Eng IEEE Trans. 2014;26(1):97–107.

Banks R. There are now 3 billion Internet users worldwide in 2015. Mobile Industry Review 2015; http://www.mobileindustryreview.com/2015/01/3-billion-internet-users-2015.html .

Hashem IAT, et al. The rise of “big data” on cloud computing: review and open research issues. Info Syst. 2015;47:98–115.

Diaz M. et al. Big data on the internet of things. In 2012 sixth international conference on innovative mobile and internet services in ubiquitous computing. 2012.

Khan M, Uddin MF, Gupta N. Seven V’s of big data: understanding big data to extract value. In: Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1). IEEE; 2014.

Chen M, Mao S, Liu Y. Big data: a survey. Mob Netw Appl. 2014;19(2):171–209.

Menacer M, Menacer A, Arbaoui A. Islamic resources big data mining, extraction and archiving. Enhanc Res Manag Comput Appl. 2014;3(12):20–5.

Murdoch TB, Detsky AS. The inevitable application of big data to health care. JAMA. 2013;309(13):1351–2.

Michael K, Miller KW. Big data: new opportunities and new challenges [guest editors’ introduction]. Computer. 2013;46(6):22–4.

Xiang Z, et al. What can big data and text analytics tell us about hotel guest experience and satisfaction? Int J Hosp Manag. 2015;44:120–30.

Gani A, et al. A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst. 2016;46(2):241–84.

Drake M. Encyclopedia of library and information science, vol. 1. USA: CRC Press; 2003.

Wildgaard L. A comparison of 17 author-level bibliometric indicators for researchers in Astronomy, environmental science, philosophy and public health in web of science and google scholar. Scientometrics. 2015;104(3):1–34.

Garfield E. Citation indexes for science: a new dimension in documentation through association of ideas. Science. 1955;122(3159):108–11.

Ho Y-S. The top-cited research works in the science citation index expanded. Scientometrics. 2013;94(3):1297–312.

Garfield E. Science citation index-a new dimension in indexing. Science. 1964;144(3619):649–54.


Authors’ contributions

All authors contributed equally to this work. All authors read and approved the final manuscript.

Authors’ information

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and affiliations

Faculty of Computer Science and Information Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia

Ali Kalantari, Amirrudin Kamsin, Abdullah Gani & Ali Ebrahimi

Department of Actuarial and Applied Statistics, Faculty of Business & Information Science, UCSI University, 56000, Kuala Lumpur, Malaysia

Halim Shukri Kamaruddin

Centre for Research Services, Institute of Research Management and Monitoring (IPPP), University of Malaya (UM), Kuala Lumpur, Malaysia

Nader Ale Ebrahim

Department for Management of Science and Technology Development, Ton Duc Thang University, Ho Chi Minh City, Vietnam

Shahaboddin Shamshirband

Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam


Corresponding author

Correspondence to Shahaboddin Shamshirband.

Additional file

Additional file 1. Additional tables.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Kalantari, A., Kamsin, A., Kamaruddin, H.S. et al. A bibliometric approach to tracking big data research trends. J Big Data 4, 30 (2017). https://doi.org/10.1186/s40537-017-0088-1


Received: 17 May 2017

Accepted: 18 September 2017

Published: 29 September 2017

DOI: https://doi.org/10.1186/s40537-017-0088-1


  • Research trends
  • Highly cited papers
  • Citation analysis


  • Review article
  • Open access
  • Published: 02 November 2020

Big data in education: a state of the art, limitations, and future research directions

  • Maria Ijaz Baig 1 ,
  • Liyana Shuib   ORCID: orcid.org/0000-0002-7907-0671 1 &
  • Elaheh Yadegaridehkordi 1  

International Journal of Educational Technology in Higher Education, volume 17, Article number: 44 (2020)


Big data is an essential aspect of innovation that has recently gained major attention from both academics and practitioners. Considering the importance of the education sector, the current tendency is to examine the role of big data in this sector. So far, many studies have been conducted to understand the application of big data in different fields for various purposes; however, a comprehensive review of big data in education is still lacking. Thus, this study aims to conduct a systematic review of big data in education in order to explore the trends, classify the research themes, highlight the limitations, and provide possible future directions in the domain. Following a systematic review procedure, 40 primary studies published from 2014 to 2019 were utilized and the related information extracted. The findings show that the number of studies addressing big data in education has increased during the last 2 years. The current studies cover four main research themes under big data in education, namely learners' behavior and performance, modelling and educational data warehouses, improvement of the educational system, and integration of big data into the curriculum. Most big data educational research has focused on learners' behavior and performance. Moreover, this study highlights research limitations and portrays future directions. It provides a guideline for future studies and highlights new insights and directions for the successful utilization of big data in education.

Introduction

The world is changing rapidly due to the emergence of innovative technologies (Chae, 2019). Individuals now use a large number of technological devices (Shorfuzzaman, Hossain, Nazir, Muhammad, & Alamri, 2019), and at every moment an enormous amount of data is produced through these devices (ur Rehman et al., 2019). New technologies and applications are being developed to store and analyze this massive data (Kalaian, Kasim, & Kasim, 2019). Big data has therefore become a matter of interest for researchers (Anshari, Alas, & Yunus, 2019), who are trying to define and characterize it in different ways (Mikalef, Pappas, Krogstie, & Giannakos, 2018).

According to Yassine, Singh, Hossain, and Muhammad (2019), big data is a large volume of data. However, De Mauro, Greco, and Grimaldi (2016) referred to it as an informational asset characterized by high quantity, speed, and diversity. Moreover, Shahat (2019) described big data as large data sets that are difficult to process, control, or examine in a traditional way. Big data is generally characterized by 3 Vs: Volume, Variety, and Velocity (Xu & Duan, 2019). Volume refers to the large amount, or increasing scale, of data; the size of big data can be measured in terabytes and petabytes (Herschel & Miori, 2017), and high-capacity storage systems are required to accommodate it. Variety refers to the type or heterogeneity of data, which can be in a structured format (databases) or an unstructured format (images, video, emails); big data analytical tools are helpful in handling unstructured data. Velocity refers to the speed at which big data can be accessed; the data is virtually present in a real-time environment (Internet logs) (Sivarajah, Kamal, Irani, & Weerakkody, 2017).

Currently, the concept of 3 Vs has been expanded into several Vs. For instance, Demchenko, Grosso, De Laat, and Membrey (2013) classified big data into 5 Vs: Volume, Velocity, Variety, Veracity, and Value. Similarly, Saggi and Jain (2018) characterized big data by 7 Vs, namely Volume, Velocity, Variety, Valence, Veracity, Variability, and Value.
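To make the V-based characterization concrete, the following minimal Python sketch profiles a hypothetical educational data source along the three original Vs. The field names, thresholds, and example values are illustrative assumptions, not part of any cited framework.

```python
from dataclasses import dataclass

@dataclass
class DataSourceProfile:
    """Illustrative profile of a data source along the 3 Vs (assumed fields)."""
    name: str
    volume_tb: float          # Volume: size in terabytes
    events_per_second: float  # Velocity: arrival rate of new records
    formats: tuple            # Variety: e.g. ("database", "video", "email")

    def is_heterogeneous(self) -> bool:
        # Variety: more than one format suggests unstructured handling is needed
        return len(self.formats) > 1

# Hypothetical learning-management-system clickstream
lms_logs = DataSourceProfile(
    name="LMS clickstream",
    volume_tb=12.0,
    events_per_second=450.0,
    formats=("database", "internet logs"),
)
print(lms_logs.is_heterogeneous())  # True
```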

Big data demand is significantly increasing in different fields of endeavour such as insurance and construction (Dresner Advisory Services, 2017 ), healthcare (Wang, Kung, & Byrd, 2018 ), telecommunication (Ahmed et al., 2018 ), and e-commerce (Wu & Lin, 2018 ). According to Dresner Advisory Services ( 2017 ), technology (14%), financial services (10%), consulting (9%), healthcare (9%), education (8%) and telecommunication (7%) are the most active sectors in producing a vast amount of data.

The educational sector is no exception. In the educational realm, a large volume of data is produced through online courses and teaching and learning activities (Oi, Yamada, Okubo, Shimada, & Ogata, 2017). With the advent of big data, teachers can now access students' academic performance and learning patterns and provide instant feedback (Black & Wiliam, 2018). Timely and constructive feedback motivates and satisfies students, which has a positive impact on their performance (Zheng & Bender, 2019). Academic data can help teachers analyze their teaching pedagogy and effect changes according to students' needs and requirements. Many online educational sites have been designed, and multiple courses based on individual student preferences have been introduced (Holland, 2019). Improvement in the educational sector depends on data acquisition and technology. Large-scale administrative data can play a tremendous role in managing various educational problems (Sorensen, 2018). Therefore, it is essential for professionals to understand the effectiveness of big data in education in order to minimize educational issues.

So far, several review studies have been conducted in the big data realm. Mikalef et al. (2018) conducted a systematic literature review focusing on big data analytics capabilities in firms. Mohammadpoor and Torabi (2018), in their review study on big data, observed the emerging trends of big data in the oil and gas industry. Furthermore, another systematic literature review was conducted by Neilson, Daniel, and Tjandra (2019) on big data in the transportation system. Kamilaris, Kartakoullis, and Prenafeta-Boldú (2017) conducted a review study on the use of big data in agriculture. Similarly, Wolfert, Ge, Verdouw, and Bogaardt (2017) conducted a review study on the use of big data in smart farming. Moreover, Camargo Fiorini, Seles, Jabbour, Mariano, and Sousa Jabbour (2018) conducted a review study on big data and management theory. Although many fields have been covered in previous review studies, a comprehensive review of big data in the education sector is still lacking. Thus, this study aims to conduct a systematic review of big data in education in order to identify the primary studies, their trends and themes, as well as limitations and possible future directions. This research can play a significant role in the advancement of big data in the educational domain. The identified limitations and future directions will help new researchers bring advancement to this particular realm.

The research questions of this study are stated below:

What are the trends in the papers published on big data in education?

What research themes have been addressed in the big data in education domain?

What are the limitations and possible future directions?

The remainder of this study is organized as follows: Section 2 explains the review methodology and presents the SLR results; Section 3 reports the findings for the research questions; and finally, Section 4 presents the discussion, conclusion, and research implications.

Review methodology

In order to achieve the aforementioned objective, this study employs a systematic literature review method. An effective review is based on an analysis of the literature that finds the limitations and research gaps in a particular area. A systematic review can be defined as a process of analyzing, assessing, and understanding the existing research relevant to particular research questions and areas of research. The essential purpose of conducting a systematic review is to explore and conceptualize the extant studies, identify themes, relations, and gaps, and describe future directions accordingly. These reasons match the aim of this study. This research applies the Kitchenham and Charters (2007) strategies. A systematic review comprises three phases: organizing the review, managing the review, and reporting the review. Each phase has specific activities: 1) develop the review protocol, 2) formulate the inclusion and exclusion criteria, 3) describe the search strategy process, 4) define the selection process, 5) perform the quality evaluation procedure, and 6) extract and synthesize the data. The description of each activity is provided in the following sections.

Review protocol

The review protocol provides the foundation and mechanism for undertaking a systematic literature review. Its essential purpose is to minimize research bias. The review protocol comprises the background, research questions, search strategy, selection process, quality assessment, and data extraction and synthesis. The review protocol helps to maintain the consistency of the review and eases updating at a later stage when new findings are incorporated. This is the most significant aspect that distinguishes an SLR from other literature reviews.

Inclusion and exclusion criteria

The aim of defining the inclusion and exclusion criteria is to ensure that only highly relevant research is included in this study. This study considers articles published in journals, workshops, conferences, and symposiums. Articles consisting of introductions, tutorials, posters, and summaries were eliminated. Complete, full-length relevant studies published in the English language between January 2014 and March 2019 were considered. The searched words had to be present in the title, abstract, or keywords section.

Table  1 shows a summary of the inclusion and exclusion criteria.

Search strategy process

The search strategy comprised two stages, namely S1 (automatic stage) and S2 (manual stage). Initially, an automatic search (S1) was applied to identify the primary studies of big data in education. The following databases and search engines were explored: Science Direct, SAGE Journals, Emerald Insight, Springer Link, IEEE Xplore, ACM Digital Library, Taylor and Francis, and AIS e-Library. These databases were considered because they contain high-impact journals and relevant conference proceedings, workshops, and symposiums. According to Kitchenham and Charters (2007), electronic databases provide a broad perspective on a subject rather than a limited set of specific journals and conferences. In order to find the relevant articles, keywords on big data and education were searched. General words correlated with education were also explored (education OR academic OR university OR learning OR curriculum OR higher education OR school), and this search string was paired with "big data". The second stage was a manual search (S2), in which a manual search was performed on the references of all initially retrieved studies; Kitchenham (2004) suggested that a manual search should be applied to the references of the primary studies. EndNote was used to manage, sort, and remove duplicate studies.
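For illustration, the boolean search string described above can be assembled programmatically. The following minimal Python sketch only shows how the reported terms combine; the exact query syntax varies across the databases searched, so the composed string is indicative rather than the authors' literal query.

```python
# Illustrative construction of the S1 (automatic search) string.
# The actual field syntax differs per database (IEEE Xplore, Science Direct, ...);
# this sketch only combines the terms reported in the review.
education_terms = [
    "education", "academic", "university", "learning",
    "curriculum", "higher education", "school",
]

education_clause = " OR ".join(f'"{term}"' for term in education_terms)
search_string = f'("big data") AND ({education_clause})'

print(search_string)
# ("big data") AND ("education" OR "academic" OR "university" OR "learning"
#                   OR "curriculum" OR "higher education" OR "school")
```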

Selection process

The selection process is used to identify the studies that are relevant to the research questions of this review. The selection process of this study is presented in Fig. 1. By applying the string of keywords, a total of 559 studies were found through the automatic search. Of these, 348 were duplicates and were removed using the EndNote library. The inclusion and exclusion criteria were applied to the remaining 211 studies. According to Kitchenham and Charters (2007), recommendations and irrelevant studies should be excluded from the review. At this phase, 147 studies were excluded because full-length articles were not available; the remaining 64 full-length articles were downloaded. To ensure the comprehensiveness of the initial search results, the snowball technique was used: in the second stage, a manual search (S2) was performed on the references of all the relevant papers through Google Scholar (Fig. 1), and 1 additional study was found. The quality assessment criteria were then applied to the 65 studies, and 25 studies that did not fulfil the criteria were excluded. Therefore, a total of 40 highly relevant primary studies were included in this research. The selection of studies from the different databases and sources before and after results retrieval is shown in Table 2. The majority of retrieved studies came from Science Direct (90), SAGE Journals (50), Emerald Insight (81), Springer Link (38), IEEE Xplore (158), ACM Digital Library (73), Taylor and Francis (17), and AIS e-Library (52). Google Scholar was employed only for the second round of manual search.
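In the review, duplicate removal was handled with EndNote; the following self-contained Python sketch is only a conceptual illustration of that step, with an invented record structure and a simple title-based matching rule.

```python
# Illustrative duplicate removal, mimicking the EndNote step of the selection process.
# The record structure and title-based matching rule are assumptions for this sketch.
records = [
    {"title": "Big Data in Education", "source": "IEEE Xplore"},
    {"title": "big data in education ", "source": "Science Direct"},   # duplicate
    {"title": "Data warehouse with big data technology", "source": "Science Direct"},
]

def normalize(title: str) -> str:
    """Lower-case and collapse whitespace so near-identical titles match."""
    return " ".join(title.lower().split())

seen, unique = set(), []
for rec in records:
    key = normalize(rec["title"])
    if key not in seen:
        seen.add(key)
        unique.append(rec)

print(len(records), "retrieved ->", len(unique), "after duplicate removal")
```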

Figure 1. Selection process

Quality assessment

According to Kitchenham and Charters (2007), quality assessment plays a significant role in checking the quality of the primary research. The soundness of the assessment depends on the quality of the assessment instruments. The assessment mechanism can be based on a checklist of components or a set of questions, whose primary purpose is to analyze the quality of every study. For this study, four quality measurement standards were created to evaluate the quality of each piece of research. The measurement standards are given as:

QA1. Is the topic addressed in the study related to big data in education?

QA2. Does the study describe its context?

QA3. Is the research method given in the paper?

QA4. Is the data collection method described in the article?

The four quality assessment standards were applied to the 65 selected studies to determine the integrity of each piece of research. The resulting ratings were categorized into low, medium, and high; the quality of each study depends on its total score. Each quality assessment item has a two-point score: if the study fully meets the standard, a score of 2 is awarded; in the case of partial fulfillment, a score of 1; and if the standard is not met at all, a score of 0. If the total score is below 4, the study is rated 'low'; a total of exactly 4 is considered 'medium'; and a total above 4 is rated 'high'. The details of the studies are presented in Table 11 in Appendix B. The 25 studies that did not meet the quality assessment standard were excluded. Therefore, based on the quality assessment standard, a total of 40 primary studies were included in this systematic literature review (Table 10 in Appendix A). The scores of the studies (in terms of low, medium, and high) are presented in Fig. 2.
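The scoring rule above translates directly into a small function. In the sketch below the function and variable names are illustrative; the 0/1/2 item scores and the low/medium/high thresholds are as described in the text.

```python
# Illustrative implementation of the quality-assessment scoring rule.
# Each of QA1-QA4 is scored 0 (not met), 1 (partially met), or 2 (fully met).
def quality_rating(qa_scores):
    assert len(qa_scores) == 4 and all(s in (0, 1, 2) for s in qa_scores)
    total = sum(qa_scores)
    if total < 4:
        return "low"
    if total == 4:
        return "medium"
    return "high"

print(quality_rating([2, 2, 1, 2]))  # "high"
print(quality_rating([1, 1, 1, 1]))  # "medium"
print(quality_rating([1, 0, 1, 1]))  # "low"
```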

Figure 2. Scores of studies

Data extraction and synthesis

The data extraction and synthesis process was carried out by reading the 65 primary studies. The studies were thoroughly examined and the required details extracted. The objective of this stage is to obtain the needed facts and figures from the primary studies. The data were collected under the aspects of research ID, author names, research title, publishing year and place, research theme, research context, research method, and data collection method; data were extracted from the 65 studies using these aspects. The description of each item is given in Table 3. The data extracted from all primary studies were tabulated. The process of data synthesis is presented in the next section.
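For illustration only, the extraction form can be modelled as a simple record. The field names below paraphrase the aspects listed above, and the example values are entirely hypothetical; neither is taken from the authors' actual extraction instrument.

```python
# Illustrative data-extraction record mirroring the aspects listed in Table 3.
# Field names are paraphrased and the example values are hypothetical.
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    research_id: str
    authors: str
    title: str
    year: int
    publication_venue: str
    research_theme: str
    research_context: str
    research_method: str
    data_collection_method: str

example = ExtractionRecord(
    research_id="S01",
    authors="Author, A., & Author, B.",
    title="A hypothetical study of big data in education",
    year=2018,
    publication_venue="Example journal",
    research_theme="Learner's behavior and performance",
    research_context="Higher education",
    research_method="Quantitative",
    data_collection_method="Event log data",
)
print(example.research_theme)
```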

Figure 3 presents the allocation of studies based on their publication sources. All publications were from high-impact journals, high-level conferences, and workshops. The primary studies comprised 21 journal articles, 17 conference papers, 1 workshop paper, and 1 symposium paper. Of these, 14 studies were from Science Direct journals and conferences, 5 were from the SAGE group, and 1 was from SpringerLink, whereas 6 studies were from IEEE conferences and 2 were from an IEEE symposium and workshop. Moreover, 1 primary study was from an AISeL conference, 4 studies were from Emerald Insight journals, 5 were from ACM conferences, and 2 were from Taylor and Francis. The summary of publication sources is given in Table 4.

Figure 3. Allocation of studies based on publication source

Temporal view of studies

The selection period of this study is January 2014 to March 2019. The yearly allocation of primary studies is presented in Fig. 4. The big data in education trend started in 2014 and gradually gained popularity; in 2015, 8 studies were published in this domain. The number of studies rose in 2017, which saw the highest number of publications in the big data in education realm (12 studies). The trend continued in 2018, in which 11 studies on big data in education were published, and it is still continuing in 2019: as this paper covers the period up to March 2019, 4 studies published by that date were included.

Figure 4. Temporal view of papers

Google Scholar was used to find the total citation count of the studies; the numbers of citations are shown in Fig. 5. It was observed that 28 studies were cited 1–50 times by other sources, 11 studies were not cited by any other source, and 1 study was cited 127 times. The top-cited studies and their titles are presented in Table 5, which provides general verification; the data are not intended for comparison among the studies.

Figure 5. Number of citations

Research methodologies

The research methods employed by the primary studies are shown in Fig. 6. The majority are review-based studies, conducted in different educational contexts of big data; reviews covered 28% of the primary studies. The second most used research method was quantitative, covering 23% of the primary studies. Only 3% of the studies were based on a mixed-method approach, and the design science method likewise covered 3% of the primary studies. A further 20% of the studies used qualitative research methods, whereas the research method was not discussed in the remaining 25% of the studies.

Figure 6. Distribution of research methods of primary studies

Data collection methods

The data collection methods used by the primary studies are shown in Fig. 7. The primary studies employed different data collection methods, but the majority used extant literature. Five studies conducted surveys, covering 13% of the primary studies; 4 studies carried out experiments (10%); 6 studies conducted interviews (15%); and 4 studies used data logs (10%). Two studies collected data through observations, 1 study used social network data, and 3 studies used website data; the observational, social network data, and website-based studies covered 5%, 3%, and 8% of the primary studies, respectively. Moreover, 11 studies used extant literature (28%) and 1 study extracted data from a focus group discussion (3%). The data collection method was not available for the remaining 3 studies.

Figure 7. Distribution of data collection methods of primary studies

What research themes have been addressed in educational studies of big data?

A theme refers to an idea, topic, or area covered by different research studies. The central idea reflects the theme, which can be helpful in developing real insight and analysis. A theme can consist of a single word or a combination of words (Rimmon-Kenan, 1995). This study classified the big data research themes into four groups (Table 6). Figure 8 shows a mind map of the big data in education research themes, sub-themes, and methodologies.

Figure 8. Mind map of big data in education research themes, sub-themes, and methodologies

Figure 9 presents the research themes under big data in education, namely learners' behavior and performance; modelling and educational data warehouses; improvement of the educational system; and integration of big data into the curriculum.

Figure 9. Research themes

The first research theme concerns learners' behavior and performance. This theme covers 21 studies, 53% of the primary studies (Fig. 9). The studies in this theme are based on teaching and learning analytics, big data frameworks, user behaviour and attitude, learners' strategies, adaptive learning, and satisfaction. A total of 8 studies rely on teaching and learning analytics (Table 7), 3 studies deal with big data frameworks, 6 concentrate on user behaviour and attitude, and 2 dwell on learning strategies; adaptive learning and satisfaction are covered by 1 study each. In this theme, 2 studies conducted surveys, 4 carried out experiments, 1 employed the observational method, 5 reported extant literature, 4 used event log data, and 5 conducted interviews (Fig. 10).

Figure 10. Number of studies and data collection methods

In the second theme, the studies focused on modeling and educational data warehouses. This theme includes 6 studies, covering 15% of the primary studies. These studies investigated the cloud environment, big data modeling, cluster analysis, and data warehouses for educational purposes (Table 8). Three studies introduced big data modeling in education and highlighted its potential for organizing data from multiple sources; 1 study analyzed a data warehouse with big data tools (Hadoop); 1 study analyzed the accessibility of huge academic data in a cloud computing environment; and 1 study used clustering techniques and a data warehouse for educational purposes. In this theme, 4 studies reported extant literature, 1 conducted a survey, and 1 used social network data.

The third theme concentrated on the improvement of the educational system. This theme includes 9 studies, covering 23% of the primary studies, which address statistical tools and measurements, educational research implications, big data training, the introduction of a ranking system, usage of websites, and big data educational challenges and effectiveness (Table 9). Two studies considered statistical tools and measurements; educational research implications, the ranking system, usage of websites, and big data training were covered by 1 study each; and 3 studies considered big data effectiveness and challenges. In this theme, 1 study conducted a survey for data collection, 2 studies used website traffic data, 1 study employed the observational method, and 3 studies reported extant literature.

The fourth theme concentrated on incorporating big data approaches into the curriculum. This theme includes 4 studies, covering 10% of the primary studies, all of which considered the introduction of big data topics into different courses. Of these, 1 study conducted interviews, 1 employed a survey, and 1 used a focus group discussion.
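As a quick arithmetic check, the theme shares quoted above follow directly from the study counts per theme out of the 40 primary studies; the short Python sketch below reproduces them (the review rounds the half-percentages up).

```python
# Quick check of the theme shares quoted above (study counts from the review).
theme_counts = {
    "Learner's behavior and performance": 21,
    "Modelling and educational data warehouse": 6,
    "Improvement of the educational system": 9,
    "Integration of big data into the curriculum": 4,
}
total = sum(theme_counts.values())  # 40 primary studies
for theme, count in theme_counts.items():
    print(f"{theme}: {count}/{total} = {100 * count / total:.1f}%")
# 52.5% (reported as 53%), 15.0%, 22.5% (reported as 23%), 10.0%
```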

As noted, 20% of the studies (Fig. 6) used qualitative research methods (Dinter et al., 2017; Veletsianos et al., 2016; Yang & Du, 2016). Qualitative methods are mostly applicable for observing a single variable and its relationship with other variables; however, this approach does not quantify relationships. In qualitative research, understanding is attained through 'wording' (Chaurasia & Frieda Rosin, 2017). Behavior, attitude, satisfaction, and overall learning performance are related to human phenomena (Cantabella et al., 2019; Elia et al., 2018; Sedkaoui & Khelfaoui, 2019). Qualitative studies are not statistically tested (Chaurasia & Frieda Rosin, 2017). Big data educational studies that employed qualitative methods lack some of the certainties present in quantitative research methods. Therefore, future research might quantify educational big data applications and their impact on higher education.

Six studies conducted interviews for data collection (Chaurasia et al., 2018; Chaurasia & Frieda Rosin, 2017; Nelson & Pouchard, 2017; Troisi et al., 2018; Veletsianos et al., 2016), 2 studies used the observational method (Maldonado-Mahauad et al., 2018; Sooriamurthi, 2018), and 1 study conducted a focus group discussion (Buffum et al., 2014) for data collection (Fig. 10). The observational studies were conducted in uncontrolled environments, and their results sometimes suffer from self-selection bias. There is also a chance of ambiguity in data collection where human language and observation are involved. The findings of interviews, observations, and focus group discussions are limited and cannot be extended to a wider population of learners (Dinter et al., 2017).

Four big data educational studies analyzed event log data and conducted interviews (Cantabella et al., 2019; Hirashima et al., 2017; Liang et al., 2016; Yang & Du, 2016). However, longitudinal data are more appropriate for multidimensional measurements and for analyzing large data sets in the future (Sorensen, 2018).

Eight studies considered teaching and learning analytics (Chaurasia et al., 2018; Chaurasia & Frieda Rosin, 2017; Dessì et al., 2019; Roy & Singh, 2017). There is limited research covering the aspects of learning environments, ethical and cultural values, and government support in the adoption of educational big data (Yang & Du, 2016). In the future, comparisons of big data in different learning environments, ethical and cultural values, government support, and training in adopting big data in higher education could be covered in leading journals and conferences.

Three studies relate to big data frameworks for education (Cantabella et al., 2019; Muthukrishnan & Yasin, 2018). However, the existing frameworks do not cover organizational and institutional cultures and lack robust theoretical grounds (Dubey & Gunasekaran, 2015; Muthukrishnan & Yasin, 2018). In the future, a big data educational framework that concentrates on theories and the adoption of big data technology is recommended, along with the extension of existing models and the interpretation of data models. This will support better decisions and ensure predictive analysis in the academic realm. Moreover, further relations can be tested by integrating other constructs such as university size and type (Chaurasia et al., 2018).

Three studies dwelled on big data modeling (Pardos, 2017; Petrova-Antonova et al., 2017; Wassan, 2015). These models do not integrate with present systems (Santoso & Yulia, 2017). Therefore, efficient research solutions that can manage educational data, new interchanges, and resources are required in the future. One study explored a cloud-based solution for managing academic big data (Logica & Magdalena, 2015); however, this solution is expensive. In the future, a combination of an LMS supported by open-source applications and software could be used. This development would help universities obtain benefits from a unified LMS and introduce new trends and economic opportunities for the academic industry. A data warehouse with big data tools was investigated by one study (Santoso & Yulia, 2017). Nevertheless, a manifold node cluster could be implemented to process and access structured and unstructured data in the future (Ramos et al., 2015). In addition, new techniques based on relational and non-relational databases and the development of index catalogs are recommended to improve the overall retrieval system. Furthermore, the applicability of the latest analytical tools and parallel programming models needs to be tested for academic big data. MapReduce, MongoDB, Pig, Cassandra, YARN, and Mahout are suggested for exploring and analyzing educational big data (Wassan, 2015). These tools would improve the analysis process and help in the development of reliable models for academic analytics.
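To give a flavour of the MapReduce style of processing suggested for such educational log data, the following self-contained Python sketch (plain Python rather than Hadoop; the clickstream record format is invented for illustration) counts LMS events per student with explicit map, shuffle, and reduce steps.

```python
# Minimal MapReduce-flavoured sketch (pure Python, not Hadoop): count LMS
# events per student. The record format is invented for illustration only.
from itertools import groupby
from operator import itemgetter

events = [
    {"student": "s01", "course": "math101", "action": "video_view"},
    {"student": "s02", "course": "math101", "action": "quiz_submit"},
    {"student": "s01", "course": "math101", "action": "forum_post"},
]

# Map: emit (key, 1) pairs keyed by student
mapped = [(e["student"], 1) for e in events]

# Shuffle/sort: group the pairs by key
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'s01': 2, 's02': 1}
```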

One study detected ICT factors through data mining techniques and tools in order to enhance educational effectiveness and improve the educational system (Martínez-Abad et al., 2018). Additionally, two studies employed big data analytic tools on popular websites to examine academic users' interests (Martínez-Abad et al., 2018; Qiu et al., 2015). Thus, in future research, more targeted strategies and regions could be selected for organizing the academic data, and in-depth data mining techniques could be applied according to the nature of the data. Such research could validate the findings by applying them to other educational websites, and the present research could be extended by analyzing socioeconomic backgrounds and the use of other websites (Qiu et al., 2015).

Two research studies dealt with measurements and the selection of statistical software for educational big data (Ozgur et al., 2015; Selwyn, 2014). However, no statistical software fits every academic project. Therefore, future research should aim for 'all-in-one' statistical software for big data that fulfils the needs of all academic projects. Four research studies addressed incorporating big data into academic curricula (Buffum et al., 2014; Sledgianowski et al., 2017). However, integrating big data into the curriculum requires significant changes. Firstly, curricula need to be redeveloped or restructured according to the level and learning environment (Nelson & Pouchard, 2017). Secondly, the training factor, learning objectives, and outcomes should be well designed in future studies. Lastly, comparable exercises, learning activities, and assessment plans need to be well structured before integrating big data into curricula (Dinter et al., 2017).

Discussion and conclusion

Big data has become an essential part of the educational realm. This study presented a systematic review of the literature on big data in the educational sector. Three research questions were formulated to present the trends and themes of big data educational studies and to identify limitations and directions for further research. The primary studies were collected by performing a systematic search through the IEEE Xplore, ScienceDirect, Emerald Insight, AIS Electronic Library, Sage, ACM Digital Library, Springer Link, Taylor and Francis, and Google Scholar databases. Finally, 40 studies that met the research protocol were selected. These studies were published between January 2014 and March 2019. From the findings, it can be concluded that 53% of the extant studies were conducted on the learners' behavior and performance theme, 15% on modeling and educational data warehouses, and 23% on the improvement of the educational system, whereas only 10% of the studies addressed the integration of big data into the curriculum.

Thus, a large number of studies were conducted on the learners' behavior and performance theme, while the other themes gained less attention. Therefore, more research is expected in the future on modeling and educational data warehouses, improvement of the educational system, and integration of big data into the curriculum.

It was found that 20% of the studies used qualitative research methods: 6 studies conducted interviews, 2 used the observational method, and 1 conducted a focus group discussion for data collection. The findings of interviews, observations, and focus group discussions are limited and cannot be extended to a wider population of learners; therefore, prospective research might quantify educational big data applications and their impact on higher education. Longitudinal data are more appropriate for multidimensional measurements and future analysis of large data sets. Eight studies were carried out on teaching and learning analytics. In the future, comparisons of big data in different learning environments, ethical and cultural values, government support, and training to adopt big data in higher education could be covered in leading journals and conferences.

Three studies related to big data frameworks for education; in the future, a big data educational framework that dwells on theories and extends existing models is recommended. Three studies concentrated on big data modeling; these models cannot be incorporated into present systems, so efficient research solutions that can manage educational data, new interchanges, and resources are required in future studies. Two studies explored a cloud-based solution for managing academic big data and investigated a data warehouse with big data tools. Nevertheless, in the future, a manifold node cluster could be implemented for processing and accessing structured and unstructured data, and the applicability of the latest analytical tools and parallel programming models needs to be tested for academic big data.

One study considered the detection of ICT factors through data mining techniques, and 2 studies employed big data analytic tools on popular websites to examine academic users' interests; thus, more targeted strategies and regions could be selected for organizing academic data in the future. Four research studies focused on incorporating big data into academic curricula. However, big data based curricula need to be redeveloped with the learning objectives in mind, and well-designed learning activities for big data curricula are suggested for future work.

Research implications

This study has twofold implications for stakeholders and researchers. Firstly, this review explored the trends in publications in the big data in education realm. The identified trends uncover the allocation of studies, publication sources, the temporal view, and the most cited papers, and highlight the research methods used in these studies. The described trends can provide opportunities and new ideas to researchers for setting an accurate direction in future studies.

Secondly, this research explored the themes, sub-themes, and methodologies in the big data in education domain. The classified themes, sub-themes, and methodologies present a comprehensive overview of the existing literature on big data in education. The described themes and sub-themes can help researchers identify new research gaps and avoid repeating themes in future studies, while focusing on combinations of different themes in order to uncover new insights into how big data can improve the learning and teaching process. In addition, the illustrated methodologies can help researchers select a method according to the nature of their future studies.

The identified research also has implications for stakeholders regarding the holistic expansion of educational competencies. The identified themes give new insight to universities for planning blended learning programs that combine conventional learning with web-based learning, permitting students to accomplish focused learning outcomes through engaging exercises at an ideal pace. They can help teachers understand how to gauge students' learning behaviour and attitudes simultaneously and advance their teaching strategies accordingly. Understanding the latest trends in big data and education is of growing importance for ministries of education, as they can develop flexible policies to support institutions in improving the educational system.

Lastly, the identified limitations and possible future directions provide guidelines for researchers about what has been explored and what needs to be explored in the future. In addition, stakeholders can extract ideas for educating future cohorts and comprehending their learning and academic requirements.

Availability of data and materials

Not applicable.

Ahmed, E., Yaqoob, I., Hashem, I. A. T., Shuja, J., Imran, M., Guizani, N., & Bakhsh, S. T. (2018). Recent advances and challenges in mobile big data. IEEE Communications Magazine, 56(2), 102–108. https://doi.org/10.1109/MCOM.2018.1700294.

Anshari, M., Alas, Y., & Yunus, N. (2019). A survey study of smartphones behavior in Brunei: A proposal of Modelling big data strategies. In Multigenerational Online Behavior and Media Use: Concepts, Methodologies, Tools, and Applications , (pp. 201–214). IGI global.

Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in Education: Principles, Policy & Practice , 25 (6), 551–575. https://doi.org/10.1080/0969594X.2018.1441807 .


Buffum, P. S., Martinez-Arocho, A. G., Frankosky, M. H., Rodriguez, F. J., Wiebe, E. N., & Boyer, K. E. (2014, March). CS principles goes to middle school: Learning how to teach big data. In Proceedings of the 45th ACM technical Computer science education , (pp. 151–156). New York: ACM. https://doi.org/10.1145/2538862.2538949 .

Camargo Fiorini, P., Seles, B. M. R. P., Jabbour, C. J. C., Mariano, E. B., & Sousa Jabbour, A. B. L. (2018). Management theory and big data literature: From a review to a research agenda. International Journal of Information Management , 43 , 112–129. https://doi.org/10.1016/j.ijinfomgt.2018.07.005 .

Cantabella, M., Martínez-España, R., Ayuso, B., Yáñez, J. A., & Muñoz, A. (2019). Analysis of student behavior in learning management systems through a big data framework. Future Generation Computer Systems , 90 (2), 262–272. https://doi.org/10.1016/j.future.2018.08.003 .

Chae, B. K. (2019). A general framework for studying the evolution of the digital innovation ecosystem: The case of big data. International Journal of Information Management , 45 , 83–94. https://doi.org/10.1016/j.ijinfomgt.2018.10.023 .

Chaurasia, S. S., & Frieda Rosin, A. (2017). From big data to big impact: Analytics for teaching and learning in higher education. Industrial and Commercial Training , 49 (7), 321–328. https://doi.org/10.1108/ict-10-2016-0069 .

Chaurasia, S. S., Kodwani, D., Lachhwani, H., & Ketkar, M. A. (2018). Big data academic and learning analytics. International Journal of Educational Management , 32 (6), 1099–1117. https://doi.org/10.1108/ijem-08-2017-0199 .

Coccoli, M., Maresca, P., & Stanganelli, L. (2017). The role of big data and cognitive computing in the learning process. Journal of Visual Languages & Computing , 38 , 97–103. https://doi.org/10.1016/j.jvlc.2016.03.002 .

De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Library Review , 65 (3), 122–135. https://doi.org/10.1108/LR-06-2015-0061 .

Demchenko, Y., Grosso, P., De Laat, C., & Membrey, P. (2013). Addressing big data issues in scientific data infrastructure. In Collaboration Technologies and Systems (CTS), 2013 International Conference on , (pp. 48–55). San Diego: IEEE. https://doi.org/10.1109/CTS.2013.6567203 .

Dessì, D., Fenu, G., Marras, M., & Reforgiato Recupero, D. (2019). Bridging learning analytics and cognitive computing for big data classification in micro-learning video collections. Computers in Human Behavior , 92 (1), 468–477. https://doi.org/10.1016/j.chb.2018.03.004 .

Dinter, B., Jaekel, T., Kollwitz, C., & Wache, H. (2017). Teaching Big Data Management – An Active Learning Approach for Higher Education. Paper presented at the proceedings of the pre-ICIS 2017 SIGDSA (pp. 1–17). North America: AISeL.

Dresner Advisory Services. (2017). Big data adoption: State of the market. ZoomData. Retrieved from https://www.zoomdata.com/master-class/state-market/big-data-adoption


Dubey, R., & Gunasekaran, A. (2015). Education and training for successful career in big data and business analytics. Industrial and Commercial Training , 47 (4), 174–181. https://doi.org/10.1108/ict-08-2014-0059 .

Elia, G., Solazzo, G., Lorenzo, G., & Passiante, G. (2018). Assessing learners’ satisfaction in collaborative online courses through a big data approach. Computers in Human Behavior , 92 , 589–599. https://doi.org/10.1016/j.chb.2018.04.033 .

Gupta, D., & Rani, R. (2018). A study of big data evolution and research challenges. Journal of Information Science. , 45 (3), 322–340. https://doi.org/10.1177/0165551518789880 .

Herschel, R., & Miori, V. M. (2017). Ethics & big data. Technology in Society , 49 , 31–36. https://doi.org/10.1016/j.techsoc.2017.03.003 .

Hirashima, T., Supianto, A. A., & Hayashi, Y. (2017, September). Model-based approach for educational big data analysis of learners thinking with process data. In 2017 International Workshop on Big Data and Information Security (IWBIS) (pp. 11-16). San Diego: IEEE. https://doi.org/10.1177/0165551518789880

Holland, A. A. (2019). Effective principles of informal online learning design: A theory-building metasynthesis of qualitative research. Computers & Education , 128 , 214–226. https://doi.org/10.1016/j.compedu.2018.09.026 .

Kalaian, S. A., Kasim, R. M., & Kasim, N. R. (2019). Descriptive and predictive analytical methods for big data. In Web Services: Concepts, Methodologies, Tools, and Applications , (pp. 314–331). USA: IGI global. https://doi.org/10.4018/978-1-5225-7501-6.ch018 .

Kamilaris, A., Kartakoullis, A., & Prenafeta-Boldú, F. X. (2017). A review on the practice of big data analysis in agriculture. Computers and Electronics in Agriculture , 143 , 23–37. https://doi.org/10.1016/j.compag.2017.09.037 .

Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University , 33 (2004), 1–26.

Kitchenham, B., & Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering version 2.3. Engineering , 45 (4), 13–65.

Lia, Y., & Zhaia, X. (2018). Review and prospect of modern education using big data. Procedia Computer Science , 129 (3), 341–347. https://doi.org/10.1016/j.procs.2018.03.085 .

Liang, J., Yang, J., Wu, Y., Li, C., & Zheng, L. (2016). Big Data Application in Education: Dropout Prediction in Edx MOOCs. In Paper presented at the 2016 IEEE second international conference on multimedia big data (BigMM) , (pp. 440–443). USA: IEEE. https://doi.org/10.1109/BigMM.2016.70 .

Logica, B., & Magdalena, R. (2015). Using big data in the academic environment. Procedia Economics and Finance , 33 (2), 277–286. https://doi.org/10.1016/s2212-5671(15)01712-8 .

Maldonado-Mahauad, J., Pérez-Sanagustín, M., Kizilcec, R. F., Morales, N., & Munoz-Gama, J. (2018). Mining theory-based patterns from big data: Identifying self-regulated learning strategies in massive open online courses. Computers in Human Behavior, 80(1), 179–196. https://doi.org/10.1016/j.chb.2017.11.011.

Martínez-Abad, F., Gamazo, A., & Rodríguez-Conde, M. J. (2018). Big Data in Education. In Paper presented at the proceedings of the sixth international conference on technological ecosystems for enhancing Multiculturality - TEEM'18, Salamanca, Spain , (pp. 145–150). New York: ACM. https://doi.org/10.1145/3284179.3284206 .

Mikalef, P., Pappas, I. O., Krogstie, J., & Giannakos, M. (2018). Big data analytics capabilities: A systematic literature review and research agenda. Information Systems and e-Business Management , 16 (3), 547–578. https://doi.org/10.1007/10257-017-0362-y .

Mohammadpoor, M., & Torabi, F. (2018). Big Data analytics in oil and gas industry: An emerging trend. Petroleum. In press. https://doi.org/10.1016/j.petlm.2018.11.001 .

Muthukrishnan, S. M., & Yasin, N. B. M. (2018). Big Data Framework for Students’ Academic. Paper presented at the symposium on computer applications & industrial electronics (ISCAIE), Penang, Malaysia (pp. 376–382). USA: IEEE. https://doi.org/10.1109/ISCAIE.2018.8405502

Neilson, A., Daniel, B., & Tjandra, S. (2019). Systematic review of the literature on big data in the transportation Domain: Concepts and Applications. Big Data Research . In press. https://doi.org/10.1016/j.bdr.2019.03.001 .

Nelson, M., & Pouchard, L. (2017). A pilot “big data” education modular curriculum for engineering graduate education: Development and implementation. In Paper presented at the Frontiers in education conference (FIE), Indianapolis, USA , (pp. 1–5). USA: IEEE. https://doi.org/10.1109/FIE.2017.8190688 .

Nie, M., Yang, L., Sun, J., Su, H., Xia, H., Lian, D., & Yan, K. (2018). Advanced forecasting of career choices for college students based on campus big data. Frontiers of Computer Science , 12 (3), 494–503. https://doi.org/10.1007/s11704-017-6498-6 .

Oi, M., Yamada, M., Okubo, F., Shimada, A., & Ogata, H. (2017). Reproducibility of findings from educational big data. In Paper presented at the proceedings of the Seventh International Learning Analytics & Knowledge Conference , (pp. 536–537). New York: ACM. https://doi.org/10.1145/3027385.3029445 .

Ong, V. K. (2015). Big Data and Its Research Implications for Higher Education: Cases from UK Higher Education Institutions. In Paper presented at the 2015 IIAI 4th international confress on advanced applied informatics , (pp. 487–491). USA: IEEE. https://doi.org/10.1109/IIAI-AAI.2015.178 .

Ozgur, C., Kleckner, M., & Li, Y. (2015). Selection of statistical software for solving big data problems. SAGE Open , 5 (2), 59–94. https://doi.org/10.1177/2158244015584379 .

Pardos, Z. A. (2017). Big data in education and the models that love them. Current Opinion in Behavioral Sciences , 18 (2), 107–113. https://doi.org/10.1016/j.cobeha.2017.11.006 .

Petrova-Antonova, D., Georgieva, O., & Ilieva, S. (2017, June). Modelling of educational data following big data value chain. In Proceedings of the 18th International Conference on Computer Systems and Technologies (pp. 88–95). New York City: ACM. https://doi.org/10.1145/3134302.3134335

Qiu, R. G., Huang, Z., & Patel, I. C. (2015, June). A big data approach to assessing the US higher education service. In 2015 12th International Conference on Service Systems and Service Management (ICSSSM) (pp. 1–6). New York: IEEE. https://doi.org/10.1109/ICSSSM.2015.7170149

Ramos, T. G., Machado, J. C. F., & Cordeiro, B. P. V. (2015). Primary education evaluation in Brazil using big data and cluster analysis. Procedia Computer Science, 55(1), 1031–1039. https://doi.org/10.1016/j.procs.2015.07.061.

Rimmon-Kenan, S. (1995). What Is Theme and How Do We Get at It?. Thematics: New Approaches, 9–20.

Roy, S., & Singh, S. N. (2017). Emerging trends in applications of big data in educational data mining and learning analytics. In 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence , (pp. 193–198). New York: IEEE. https://doi.org/10.1109/confluence.2017.7943148 .

Saggi, M. K., & Jain, S. (2018). A survey towards an integration of big data analytics to big insights for value-creation. Information Processing & Management , 54 (5), 758–790. https://doi.org/10.1016/j.ipm.2018.01.010 .

Santoso, L. W., & Yulia (2017). Data warehouse with big data Technology for Higher Education. Procedia Computer Science , 124 (1), 93–99. https://doi.org/10.1016/j.procs.2017.12.134 .

Sedkaoui, S., & Khelfaoui, M. (2019). Understand, develop and enhance the learning process with big data. Information Discovery and Delivery , 47 (1), 2–16. https://doi.org/10.1108/idd-09-2018-0043 .

Selwyn, N. (2014). Data entry: Towards the critical study of digital data and education. Learning, Media and Technology , 40 (1), 64–82. https://doi.org/10.1080/17439884.2014.921628 .

Shahat, O. A. (2019). A novel big data analytics framework for smart cities. Future Generation Computer Systems , 91 (1), 620–633. https://doi.org/10.1016/j.future.2018.06.046 .

Shorfuzzaman, M., Hossain, M. S., Nazir, A., Muhammad, G., & Alamri, A. (2019). Harnessing the power of big data analytics in the cloud to support learning analytics in mobile learning environment. Computers in Human Behavior , 92 (1), 578–588. https://doi.org/10.1016/j.chb.2018.07.002 .

Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of big data challenges and analytical methods. Journal of Business Research , 70 , 263–286. https://doi.org/10.1016/j.jbusres.2016.08.001 .

Sledgianowski, D., Gomaa, M., & Tan, C. (2017). Toward integration of big data, technology and information systems competencies into the accounting curriculum. Journal of Accounting Education , 38 (1), 81–93. https://doi.org/10.1016/j.jaccedu.2016.12.008 .

Sooriamurthi, R. (2018). Introducing big data analytics in high school and college. In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education (pp. 373–374). New York: ACM. https://doi.org/10.1145/3197091.3205834

Sorensen, L. C. (2018). "Big data" in educational administration: An application for predicting school dropout risk. Educational Administration Quarterly , 45 (1), 1–93. https://doi.org/10.1177/0013161x18799439 .

Article   MathSciNet   Google Scholar  

Su, Y. S., Ding, T. J., Lue, J. H., Lai, C. F., & Su, C. N. (2017). Applying big data analysis technique to students’ learning behavior and learning resource recommendation in a MOOCs course. In 2017 International conference on applied system innovation (ICASI) (pp. 1229–1230). New York: IEEE. https://doi.org/10.1109/ICASI.2017.7988114

Troisi, O., Grimaldi, M., Loia, F., & Maione, G. (2018). Big data and sentiment analysis to highlight decision behaviours: A case study for student population. Behaviour & Information Technology , 37 (11), 1111–1128. https://doi.org/10.1080/0144929x.2018.1502355 .

Ur Rehman, M. H., Yaqoob, I., Salah, K., Imran, M., Jayaraman, P. P., & Perera, C. (2019). The role of big data analytics in industrial internet of things. Future Generation Computer Systems , 92 , 578–588. https://doi.org/10.1016/j.future.2019.04.020 .

Veletsianos, G., Reich, J., & Pasquini, L. A. (2016). The Life Between Big Data Log Events. AERA Open , 2 (3), 1–45. https://doi.org/10.1177/2332858416657002 .

Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change , 126 , 3–13. https://doi.org/10.1016/j.techfore.2015.12.019 .

Wassan, J. T. (2015). Discovering big data modelling for educational world. Procedia - Social and Behavioral Sciences , 176 , 642–649. https://doi.org/10.1016/j.sbspro.2015.01.522 .

Wolfert, S., Ge, L., Verdouw, C., & Bogaardt, M. J. (2017). Big data in smart farming–a review. Agricultural Systems , 153 , 69–80. https://doi.org/10.1016/j.agsy.2017.01.023 .

Wu, P. J., & Lin, K. C. (2018). Unstructured big data analytics for retrieving e-commerce logistics knowledge. Telematics and Informatics , 35 (1), 237–244. https://doi.org/10.1016/j.tele.2017.11.004 .

Xu, L. D., & Duan, L. (2019). Big data for cyber physical systems in industry 4.0: A survey. Enterprise Information Systems , 13 (2), 148–169. https://doi.org/10.1080/17517575.2018.1442934 .

Yang, F., & Du, Y. R. (2016). Storytelling in the age of big data. Asia Pacific Media Educator , 26 (2), 148–162. https://doi.org/10.1177/1326365x16673168 .

Yassine, A., Singh, S., Hossain, M. S., & Muhammad, G. (2019). IoT big data analytics for smart homes with fog and cloud computing. Future Generation Computer Systems , 91 (2), 563–573. https://doi.org/10.1016/j.future.2018.08.040 .

Zhang, M. (2015). Internet use that reproduces educational inequalities: Evidence from big data. Computers & Education , 86 (1), 212–223. https://doi.org/10.1016/j.compedu.2015.08.007 .

Zheng, M., & Bender, D. (2019). Evaluating outcomes of computer-based classroom testing: Student acceptance and impact on learning and exam performance. Medical Teacher , 41 (1), 75–82. https://doi.org/10.1080/0142159X.2018.1441984 .


Acknowledgements

Not applicable

Author information

Authors and Affiliations

Department of Information Systems, Faculty of Computer Science & Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia

Maria Ijaz Baig, Liyana Shuib & Elaheh Yadegaridehkordi


Contributions

Maria Ijaz Baig composed the manuscript under the guidance of Elaheh Yadegaridehkordi. Liyana Shuib supervised the project. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Liyana Shuib .

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Baig, M.I., Shuib, L. & Yadegaridehkordi, E. Big data in education: a state of the art, limitations, and future research directions. Int J Educ Technol High Educ 17 , 44 (2020). https://doi.org/10.1186/s41239-020-00223-0


Received : 09 March 2020

Accepted : 10 June 2020

Published : 02 November 2020

DOI : https://doi.org/10.1186/s41239-020-00223-0


Keywords

  • Data science applications in education
  • Learning communities
  • Teaching/learning strategies


Scientific Research and Big Data

Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse data, culminating in the emergence of the field of data science , which brings together computational, algorithmic, statistical and mathematical techniques towards extrapolating knowledge from big data. At the same time, the Open Data movement—emerging from policy trends such as the push for Open Government and Open Science—has encouraged the sharing and interlinking of heterogeneous research data via large digital infrastructures. The availability of vast amounts of data in machine-readable formats provides an incentive to create efficient procedures to collect, organise, visualise and model these data. These infrastructures, in turn, serve as platforms for the development of artificial intelligence, with an eye to increasing the reliability, speed and transparency of processes of knowledge creation. Researchers across all disciplines see the newfound ability to link and cross-reference data from diverse sources as improving the accuracy and predictive power of scientific findings and helping to identify future directions of inquiry, thus ultimately providing a novel starting point for empirical investigation. As exemplified by the rise of dedicated funding, training programmes and publication venues, big data are widely viewed as ushering in a new way of performing research and challenging existing understandings of what counts as scientific knowledge.

This entry explores these claims in relation to the use of big data within scientific research, and with an emphasis on the philosophical issues emerging from such use. To this aim, the entry discusses how the emergence of big data—and related technologies, institutions and norms—informs the analysis of the following themes:

  • how statistics, formal and computational models help to extrapolate patterns from data, and with which consequences;
  • the role of critical scrutiny (human intelligence) in machine learning, and its relation to the intelligibility of research processes;
  • the nature of data as research components;
  • the relation between data and evidence, and the role of data as source of empirical insight;
  • the view of knowledge as theory-centric;
  • understandings of the relation between prediction and causality;
  • the separation of fact and value; and
  • the risks and ethics of data science.

These are areas where attention to research practices revolving around big data can benefit philosophy, and particularly work in the epistemology and methodology of science. This entry does not cover the vast scholarship in the history and social studies of science that has emerged in recent years on this topic, though references to some of that literature can be found when conceptually relevant. Complementing historical and social scientific work in data studies, the philosophical analysis of data practices can also elicit significant challenges to the hype surrounding data science and foster a critical understanding of the role of data-fuelled artificial intelligence in research.

1. What Are Big Data?
2. Extrapolating Data Patterns: The Role of Statistics and Software
3. Human and Artificial Intelligence
4. The Nature of (Big) Data
5. Big Data and Evidence
6. Big Data, Knowledge and Inquiry
7. Big Data between Causation and Prediction
8. The Fact/Value Distinction
9. Big Data Risks and the Ethics of Data Science
10. Conclusion: Big Data and Good Science
Bibliography
Other Internet Resources
Related Entries

1. What Are Big Data?

We are witnessing a progressive “datafication” of social life. Human activities and interactions with the environment are being monitored and recorded with increasing effectiveness, generating an enormous digital footprint. The resulting “big data” are a treasure trove for research, with ever more sophisticated computational tools being developed to extract knowledge from such data. One example is the use of various types of data acquired from cancer patients, including genomic sequences, physiological measurements and individual responses to treatment, to improve diagnosis and treatment. Another example is the integration of data on traffic flow, environmental and geographical conditions, and human behaviour to produce safety measures for driverless vehicles, so that when confronted with unforeseen events (such as a child suddenly darting into the street on a very cold day), the data can be promptly analysed to identify and generate an appropriate response (the car swerving enough to avoid the child while also minimising the risk of skidding on ice and damage to other vehicles). Yet another instance is the understanding of the nutritional status and needs of a particular population that can be extracted from combining data on food consumption generated by commercial services (e.g., supermarkets, social media and restaurants) with data coming from public health and social services, such as blood test results and hospital intakes linked to malnutrition. In each of these cases, the availability of data and related analytic tools is creating novel opportunities for research and for the development of new forms of inquiry, which are widely perceived as having a transformative effect on science as a whole.

A useful starting point in reflecting on the significance of such cases for a philosophical understanding of research is to consider what the term “big data” actually refers to within contemporary scientific discourse. There are multiple ways to define big data (Kitchin 2014, Kitchin & McArdle 2016). Perhaps the most straightforward characterisation is as large datasets that are produced in a digital form and can be analysed through computational tools. Hence the two features most commonly associated with Big Data are volume and velocity. Volume refers to the size of the files used to archive and spread data. Velocity refers to the pressing speed with which data is generated and processed. The body of digital data created by research is growing at breakneck pace and in ways that are arguably impossible for the human cognitive system to grasp and thus require some form of automated analysis.

Volume and velocity are also, however, the most disputed features of big data. What may be perceived as “large volume” or “high velocity” depends on rapidly evolving technologies to generate, store, disseminate and visualise the data. This is exemplified by the high-throughput production, storage and dissemination of genomic sequencing and gene expression data, where both data volume and velocity have dramatically increased within the last two decades. Similarly, current understandings of big data as “anything that cannot be easily captured in an Excel spreadsheet” are bound to shift rapidly as new analytic software becomes established, and the very idea of using spreadsheets to capture data becomes a thing of the past. Moreover, data size and speed do not take account of the diversity of data types used by researchers, which may include data that are not generated in digital formats or whose format is not computationally tractable, and which underscores the importance of data provenance (that is, the conditions under which data were generated and disseminated) to processes of inference and interpretation. And as discussed below, the emphasis on physical features of data obscures the continuing dependence of data interpretation on circumstances of data use, including specific queries, values, skills and research situations.

An alternative is to define big data not by reference to their physical attributes, but rather by virtue of what can and cannot be done with them. In this view, big data is a heterogeneous ensemble of data collected from a variety of different sources, typically (but not always) in digital formats suitable for algorithmic processing, in order to generate new knowledge. For example, boyd and Crawford (2012: 663) identify big data with “the capacity to search, aggregate and cross-reference large datasets”, while O’Malley and Soyer (2012) focus on the ability to interrogate and interrelate diverse types of data, with the aim of consulting them as a single body of evidence. The examples of transformative “big data research” given above are all easily fitted into this view: it is not the mere fact that lots of data are available that makes a difference in those cases, but rather the fact that lots of data can be mobilised from a wide variety of sources (medical records, environmental surveys, weather measurements, consumer behaviour). This account makes sense of other characteristic “v-words” that have been associated with big data, including:

  • Variety in the formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests;
  • Veracity , understood as the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias. In the absence of appropriate validation and quality checks, this could result in a misleading or outright incorrect evidence base for knowledge claims (Floridi & Illari 2014; Cai & Zhu 2015; Leonelli 2017);
  • Validity , which indicates the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context (e.g., Loettgers 2009, Bogen 2010);
  • Volatility , i.e., the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term (Bowker 2006; Edwards 2010; Lagoze 2014; Borgman 2015);
  • Value , i.e., the multifaceted forms of significance attributed to big data by different sections of society, which depend as much on the intended use of the data as on historical, social and geographical circumstances (Leonelli 2016, D’Ignazio and Klein 2020). Alongside scientific value, researchers may impute financial, ethical, reputational and even affective value to data, depending on their intended use as well as the historical, social and geographical circumstances of their use. The institutions involved in governing and funding research also have ways of valuing data, which may not always overlap with the priorities of researchers (Tempini 2017).

This list of features, though not exhaustive, highlights how big data is not simply “a lot of data”. The epistemic power of big data lies in their capacity to bridge between different research communities, methodological approaches and theoretical frameworks that are difficult to link due to conceptual fragmentation, social barriers and technical difficulties (Leonelli 2019a). And indeed, appeals to big data often emerge from situations of inquiry that are at once technically, conceptually and socially challenging, and where existing methods and resources have proved insufficient or inadequate (Sterner & Franz 2017; Sterner, Franz, & Witteveen 2020).

This understanding of big data is rooted in a long history of researchers grappling with large and complex datasets, as exemplified by fields like astronomy, meteorology, taxonomy and demography (see the collections assembled by Daston 2017; Anorova et al. 2017; Porter & Chaderavian 2018; as well as Anorova et al. 2010, Sepkoski 2013, Stevens 2016, Strasser 2019 among others). Similarly, biomedical research—and particularly subfields such as epidemiology, pharmacology and public health—has an extensive tradition of tackling data of high volume, velocity, variety and volatility, and whose validity, veracity and value are regularly negotiated and contested by patients, governments, funders, pharmaceutical companies, insurances and public institutions (Bauer 2008). Throughout the twentieth century, these efforts spurred the development of techniques, institutions and instruments to collect, order, visualise and analyse data, such as: standard classification systems and formats; guidelines, tools and legislation for the management and security of sensitive data; and infrastructures to integrate and sustain data collections over long periods of time (Daston 2017).

This work culminated in the application of computational technologies, modelling tools and statistical methods to big data (Porter 1995; Humphreys 2004; Edwards 2010), increasingly pushing the boundaries of data analytics thanks to supervised learning, model fitting, deep neural networks, search and optimisation methods, complex data visualisations and various other tools now associated with artificial intelligence. Many of these tools are based on algorithms whose functioning and results are tested against specific data samples (a process called “training”). These algorithms are programmed to “learn” from each interaction with novel data: in other words, they have the capacity to change themselves in response to new information being inputted into the system, thus becoming more attuned to the phenomena they are analysing and improving their ability to predict future behaviour. The scope and extent of such changes is shaped by the assumptions used to build the algorithms and the capability of related software and hardware to identify, access and process information of relevance to the learning in question. There is however a degree of unpredictability and opacity to these systems, which can evolve to the point of defying human understanding (more on this below).
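
To make the idea of algorithms that adjust themselves in response to new data more concrete, the following minimal sketch (not part of the entry; it assumes Python with NumPy and scikit-learn, and uses synthetic data and invented variable names) shows a simple linear classifier whose parameters are updated batch by batch, so that each new round of data changes the model's internal state:

    # Minimal sketch of incremental ("online") learning: the model's parameters
    # are updated each time a new batch of data arrives.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    model = SGDClassifier()                       # simple linear learner updated by gradient steps
    classes = np.array([0, 1])

    for batch in range(5):
        # Synthetic batch: 100 samples, 3 features; the label depends (noisily) on feature 0.
        X = rng.normal(size=(100, 3))
        y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
        model.partial_fit(X, y, classes=classes)  # update the parameters using this batch only
        print("after batch", batch, "coefficients:", model.coef_.round(2))

The point of the sketch is simply that the “learning” described above is an iterative reconfiguration of parameters, whose outcome depends on the order and content of the data presented to the system.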

New institutions, communication platforms and regulatory frameworks also emerged to assemble, prepare and maintain data for such uses (Kitchin 2014), such as various forms of digital data infrastructures, organisations aiming to coordinate and improve the global data landscape (e.g., the Research Data Alliance), and novel measures for data protection, like the European Union’s General Data Protection Regulation, adopted in 2016 and in force since 2018. Together, these techniques and institutions afford the opportunity to assemble and interpret data at a much broader scale, while also promising to deliver finer levels of granularity in data analysis. [1] They increase the scope of any investigation by making it possible for researchers to link their own findings to those of countless others across the world, both within and beyond the academic sphere. By enhancing the mobility of data, they facilitate their repurposing for a variety of goals that may have been unforeseeable when the data were originally generated. And by transforming the role of data within research, they heighten their status as valuable research outputs in and of themselves. These technological and methodological developments have significant implications for philosophical conceptualisations of data, inferential processes and scientific knowledge, as well as for how research is conducted, organised, governed and assessed. It is to these philosophical concerns that I now turn.

Big data are often associated with the idea of data-driven research, where learning happens through the accumulation of data and the application of methods to extract meaningful patterns from those data. Within data-driven inquiry, researchers are expected to use data as their starting point for inductive inference, without relying on theoretical preconceptions—a situation described by advocates as “the end of theory”, in contrast to theory-driven approaches where research consists of testing a hypothesis (Anderson 2008, Hey et al. 2009). In principle at least, big data constitute the largest pool of data ever assembled and thus a strong starting point to search for correlations (Mayer-Schönberger & Cukier 2013). Crucial to the credibility of the data-driven approach is the efficacy of the methods used to extrapolate patterns from data and evaluate whether or not such patterns are meaningful, and what “meaning” may involve in the first place. Hence, some philosophers and data scholars have argued that

the most important and distinctive characteristic of Big Data [is] its use of statistical methods and computational means of analysis, (Symons & Alvarado 2016: 4)

such as for instance machine learning tools, deep neural networks and other “intelligent” practices of data handling.

The emphasis on statistics as key adjudicator of validity and reliability of patterns extracted from data is not novel. Exponents of logical empiricism looked for logically watertight methods to secure and justify inference from data, and their efforts to develop a theory of probability proceeded in parallel with the entrenchment of statistical reasoning in the sciences in the first half of the twentieth century (Romeijn 2017). In the early 1960s, Patrick Suppes offered a seminal link between statistical methods and the philosophy of science through his work on the production and interpretation of data models. As a philosopher deeply embedded in experimental practice, Suppes was interested in the means and motivations of key statistical procedures for data analysis such as data reduction and curve fitting. He argued that once data are adequately prepared for statistical modelling, all the concerns and choices that motivated data processing become irrelevant to their analysis and interpretation. This inspired him to differentiate between models of theory, models of experiment and models of data, noting that such different components of inquiry are governed by different logics and cannot be compared in a straightforward way. For instance,

the precise definition of models of the data for any given experiment requires that there be a theory of the data in the sense of the experimental procedure, as well as in the ordinary sense of the empirical theory of the phenomena being studied. (Suppes 1962: 253)

Suppes viewed data models as necessarily statistical: that is, as objects

designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory. (Suppes 1962: 258)

His formal definition of data models reflects this decision, with statistical requirements such as homogeneity, stationarity and order identified as the ultimate criteria to identify a data model Z and evaluate its adequacy:

Z is an N-fold model of the data for experiment Y if and only if there is a set Y and a probability measure P on subsets of Y such that \(\mathcal{Y} = \langle Y, P\rangle\) is a model of the theory of the experiment, Z is an N-tuple of elements of Y, and Z satisfies the statistical tests of homogeneity, stationarity and order. (1962: 259)

This analysis of data models portrayed statistical methods as key conduits between data and theory, and hence as crucial components of inferential reasoning.

The focus on statistics as entry point to discussions of inference from data was widely promoted in subsequent philosophical work. Prominent examples include Deborah Mayo, who in her book Error and the Growth of Experimental Knowledge asked:

What should be included in data models? The overriding constraint is the need for data models that permit the statistical assessment of fit (between prediction and actual data); (Mayo 1996: 136)

and Bas van Fraassen, who also embraced the idea of data models as “summarizing relative frequencies found in data” (Van Fraassen 2008: 167). Closely related is the emphasis on statistics as means to detect error within datasets in relation to specific hypotheses, most prominently endorsed by the error-statistical approach to inference championed by Mayo and Aris Spanos (Mayo & Spanos 2009a). This approach aligns with the emphasis on computational methods for data analysis within big data research, and supports the idea that the better the inferential tools and methods, the better the chance to extract reliable knowledge from data.

When it comes to addressing methodological challenges arising from the computational analysis of big data, however, statistical expertise needs to be complemented by computational savvy in the training and application of algorithms associated with artificial intelligence, including machine learning but also other mathematical procedures for operating upon data (Bringsjord & Govindarajulu 2018). Consider for instance the problem of overfitting, i.e., the mistaken identification of patterns in a dataset, which can be greatly amplified by the training techniques employed by machine learning algorithms. There is no guarantee that an algorithm trained to successfully extrapolate patterns from a given dataset will be as successful when applied to other data. Common approaches to this problem involve the re-ordering and partitioning of both data and training methods, so that it is possible to compare the application of the same algorithms to different subsets of the data (“cross-validation”), combine predictions arising from differently trained algorithms (“ensembling”) or use hyperparameters (parameters whose value is set prior to data training) to prepare the data for analysis.
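
As a rough illustration of overfitting and of cross-validation as a partial safeguard, consider the following minimal sketch (not part of the entry; it assumes Python with NumPy and scikit-learn, and the labels are deliberately pure noise): a flexible model can reproduce its training data perfectly even when there is no pattern to find, and only evaluation on held-out subsets of the data exposes this.

    # Overfitting illustration: the labels are random, so there is no genuine pattern to learn.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 20))      # 200 samples, 20 features of pure noise
    y = rng.integers(0, 2, size=200)    # labels assigned at random

    tree = DecisionTreeClassifier()     # an unconstrained, highly flexible model
    tree.fit(X, y)
    print("accuracy on the training data:", tree.score(X, y))          # close to 1.0: memorisation

    # Five-fold cross-validation: train on 4/5 of the data, test on the held-out 1/5, five times.
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
    print("mean accuracy on held-out folds:", round(scores.mean(), 2))  # close to 0.5: chance level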

Handling these issues, in turn, requires

familiarity with the mathematical operations in question, their implementations in code, and the hardware architectures underlying such implementations. (Lowrie 2017: 3)

For instance, machine learning

aims to build programs that develop their own analytic or descriptive approaches to a body of data, rather than employing ready-made solutions such as rule-based deduction or the regressions of more traditional statistics. (Lowrie 2017: 4)

In other words, statistics and mathematics need to be complemented by expertise in programming and computer engineering. The ensemble of skills thus construed results in a specific epistemological approach to research, which is broadly characterised by an emphasis on the means of inquiry as the most significant driver of research goals and outputs. This approach, which Sabina Leonelli characterised as data-centric , involves “focusing more on the processes through which research is carried out than on its ultimate outcomes” (Leonelli 2016: 170). In this view, procedures, techniques, methods, software and hardware are the prime motors of inquiry and the chief influence on its outcomes. Focusing more specifically on computational systems, John Symons and Jack Horner argued that much of big data research consists of software-intensive science rather than data-driven research: that is, science that depends on software for its design, development, deployment and use, and thus encompasses procedures, types of reasoning and errors that are unique to software, such as for example the problems generated by attempts to map real-world quantities to discrete-state machines, or approximating numerical operations (Symons & Horner 2014: 473). Software-intensive science is arguably supported by an algorithmic rationality focused on the feasibility, practicality and efficiency of algorithms, which is typically assessed by reference to concrete situations of inquiry (Lowrie 2017).

Algorithms are enormously varied in their mathematical structures and underpinning conceptual commitments, and more philosophical work needs to be carried out on the specifics of computational tools and software used in data science and related applications—with emerging work in philosophy of computer science providing an excellent way forward (Turner & Angius 2019). Nevertheless, it is clear that whether or not a given algorithm successfully applies to the data at hand depends on factors that cannot be controlled through statistical or even computational methods: for instance, the size, structure and format of the data, the nature of the classifiers used to partition the data, the complexity of decision boundaries and the very goals of the investigation.

In a forceful critique informed by the philosophy of mathematics, Christian Calude and Giuseppe Longo argued that there is a fundamental problem with the assumption that more data will necessarily yield more information:

very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. (Calude & Longo 2017: 595)

They conclude that big data analysis is by definition unable to distinguish spurious from meaningful correlations and is therefore a threat to scientific research. A related worry, sometimes dubbed “the curse of dimensionality” by data scientists, concerns the extent to which the analysis of a given dataset can be scaled up in complexity and in the number of variables being considered. It is well known that the more dimensions one considers in classifying samples, for example, the larger the dataset required for generalisations across those dimensions to be accurate. This demonstrates the continuing, tight dependence between the volume and quality of data on the one hand, and the type and breadth of research questions for which data need to serve as evidence on the other hand.
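
Calude and Longo's worry can be illustrated with a minimal sketch (not part of the entry; it assumes Python with NumPy, and the dataset sizes are arbitrary): even when every variable is generated independently at random, scanning a large enough number of variable pairs will turn up apparently strong correlations, which reflect the size of the search rather than any feature of the phenomena.

    # With enough variables, some pairs of purely random columns correlate strongly by chance alone.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_vars = 30, 2000
    data = rng.normal(size=(n_samples, n_vars))   # every column is independent noise

    corr = np.corrcoef(data, rowvar=False)        # pairwise correlations between all columns
    upper = np.abs(corr[np.triu_indices(n_vars, k=1)])

    print("variable pairs examined:", upper.size)
    print("largest absolute correlation found:", round(float(upper.max()), 2))
    print("pairs with |r| > 0.6:", int((upper > 0.6).sum()))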

Determining the fit between inferential methods and data requires high levels of expertise and contextual judgement (a situation known within machine learning as the “no free lunch theorem”). Indeed, overreliance on software for inference and data modelling can yield highly problematic results. Symons and Horner note that the use of complex software in big data analysis makes margins of error unknowable, because there is no clear way to test them statistically (Symons & Horner 2014: 473). The path complexity of programs with high conditionality imposes limits on standard error correction techniques. As a consequence, there is no effective method for characterising the error distribution in the software except by testing all paths in the code, which is unrealistic and intractable in the vast majority of cases due to the complexity of the code.
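
A back-of-the-envelope calculation (mine, not Symons and Horner's) conveys the scale of the problem: a program containing \(n\) independent binary branch points has up to \(2^n\) distinct execution paths, so even a modest \(n = 100\) yields \(2^{100} \approx 1.3 \times 10^{30}\) possible paths; exercising each path once per microsecond would still take on the order of \(10^{16}\) years, and real analysis pipelines contain far more conditional logic than that.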

Rather than acting as a substitute for human intelligence, the effective and responsible use of artificial intelligence tools in big data analysis requires the strategic exercise of human intelligence—but for this to happen, AI systems applied to big data need to be accessible to scrutiny and modification. Whether or not this is the case, and who is best qualified to exercise such scrutiny, is under dispute. Thomas Nickles argued that the increasingly complex and distributed algorithms used for data analysis follow in the footsteps of long-standing scientific attempts to transcend the limits of human cognition. The resulting epistemic systems may no longer be intelligible to humans: an “alien intelligence” within which “human abilities are no longer the ultimate criteria of epistemic success” (Nickles forthcoming). Such unbound cognition holds the promise of enabling powerful inferential reasoning from previously unimaginable volumes of data. The difficulties in contextualising and scrutinising such reasoning, however, shed doubt on the reliability of the results. It is not only machine learning algorithms that are becoming increasingly inaccessible to evaluation: beyond the complexities of programming code, computational data analysis requires a whole ecosystem of classifications, models, networks and inference tools which typically have different histories and purposes, and whose relation to each other—and effects when they are used together—are far from understood and may well be untraceable.

This raises the question of whether the knowledge produced by such data analytic systems is at all intelligible to humans, and if so, what forms of intelligibility it yields. It is certainly the case that deriving knowledge from big data may not involve an increase in human understanding, especially if understanding is understood as an epistemic skill (de Regt 2017). This may not be a problem to those who await the rise of a new species of intelligent machines, who may master new cognitive tools in a way that humans cannot. But as Nickles, Nicholas Rescher (1984), Werner Callebaut (2012) and others pointed out, even in that case “we would not have arrived at perspective-free science” (Nickles forthcoming). While the human histories and assumptions interwoven into these systems may be hard to disentangle, they still affect their outcomes; and whether or not these processes of inquiry are open to critical scrutiny, their telos, implications and significance for life on the planet arguably should be. As argued by Dan McQuillan (2018), the increasing automation of big data analytics may foster acceptance of a Neoplatonist machinic metaphysics , within which mathematical structures “uncovered” by AI would trump any appeal to human experience. Luciano Floridi echoes this intuition in his analysis of what he calls the infosphere :

The great opportunities offered by Information and Communication Technologies come with a huge intellectual responsibility to understand them and take advantage of them in the right way. (2014: vii)

These considerations parallel Paul Humphreys’s long-standing critique of computer simulations as epistemically opaque (Humphreys 2004, 2009)—and particularly his definition of what he calls essential epistemic opacity:

A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process. (Humphreys 2009: 618)

Different facets of the general problem of epistemic opacity are stressed within the vast philosophical scholarship on the role of modelling, computing and simulations in the sciences: the implications of lacking experimental access to the concrete parts of the world being modelled, for instance (Morgan 2005; Parker 2009; Radder 2009); the difficulties in testing the reliability of computational methods used within simulations (Winsberg 2010; Morrison 2015); the relation between opacity and justification (Durán & Formanek 2018); the forms of black-boxing associated with mechanistic reasoning implemented in computational analysis (Craver and Darden 2013; Bechtel 2016); and the debate over the intrinsic limits of computational approaches and related expertise (Collins 1990; Dreyfus 1992). Roman Frigg and Julian Reiss argued that such issues do not constitute fundamental challenges to the nature of inquiry and modelling, and in fact exist in a continuum with traditional methodological issues well-known within the sciences (Frigg & Reiss 2009). Whether or not one agrees with this position (Humphreys 2009; Beisbart 2012), big data analysis is clearly pushing computational and statistical methods to their limit, thus highlighting the boundaries to what even technologically augmented human beings are capable of knowing and understanding.

Research on big data analysis thus sheds light on elements of the research process that cannot be fully controlled, rationalised or even considered through recourse to formal tools.

One such element is the work required to present empirical data in a machine-readable format that is compatible with the software and analytic tools at hand. Data need to be selected, cleaned and prepared to be subjected to statistical and computational analysis. The processes involved in separating data from noise, clustering data so that it is tractable, and integrating data of different formats turn out to be highly sophisticated and theoretically structured, as demonstrated for instance by James McAllister’s (1997, 2007, 2011) and Uljana Feest’s (2011) work on data patterns, Marcel Boumans’s and Leonelli’s comparison of clustering principles across fields (forthcoming), and James Griesemer’s (forthcoming) and Mary Morgan’s (forthcoming) analyses of the peculiarities of datasets. Suppes was so concerned by what he called the “bewildering complexity” of data production and processing activities, that he worried that philosophers would not appreciate the ways in which statistics can and does help scientists to abstract data away from such complexity. He described the large group of research components and activities used to prepare data for modelling as “pragmatic aspects” encompassing “every intuitive consideration of experimental design that involved no formal statistics” (Suppes 1962: 258), and positioned them as the lowest step of his hierarchy of models—at the opposite end of its pinnacle, which are models of theory. Despite recent efforts to rehabilitate the methodology of inductive-statistical modelling and inference (Mayo & Spanos 2009b), this approach has been shared by many philosophers who regard processes of data production and processing as so chaotic as to defy systematic analysis. This explains why data have received so little consideration in philosophy of science when compared to models and theory.
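
To give a flavour of the choices involved, the following minimal sketch (not part of the entry; it assumes Python with pandas, and every value, column name and threshold is invented for illustration) prepares and integrates two differently formatted sources; each step, from unit harmonisation to the plausibility window, embodies a judgement about what counts as usable data rather than a neutral technicality:

    # Illustrative data preparation: each step is a substantive choice, not a neutral technicality.
    import pandas as pd

    # Two hypothetical sources with different conventions (values invented for illustration).
    lab = pd.DataFrame({
        "date": pd.to_datetime(["2020-01-05", "2020-01-06", "2020-01-07"]),
        "size_um": [12.4, None, 13.1],              # micrometres; one missing value
    })
    field = pd.DataFrame({
        "sample_date": pd.to_datetime(["2020-02-01", "2020-02-02"]),
        "size_mm": [0.0118, 2.5],                   # millimetres; the second value looks implausible
    })

    # Harmonise column names and units so that the two sources can be compared at all.
    field = field.rename(columns={"sample_date": "date", "size_mm": "size_um"})
    field["size_um"] = field["size_um"] * 1000.0    # mm -> um

    # "Cleaning": drop unusable records and trim values judged implausible.
    combined = pd.concat([lab, field], ignore_index=True)
    combined = combined.dropna(subset=["size_um"])
    combined = combined[combined["size_um"].between(1, 500)]   # the plausibility window is a judgement call

    print(combined)

Even this toy pipeline shows how much theory-laden judgement is packed into what Suppes called the “pragmatic aspects” of data handling.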

The question of how data are defined and identified, however, is crucial for understanding the role of big data in scientific research. Let us now consider two philosophical views—the representational view and the relational view —that are both compatible with the emergence of big data, and yet place emphasis on different aspects of that phenomenon, with significant implications for understanding the role of data within inferential reasoning and, as we shall see in the next section, as evidence. The representational view construes data as reliable representations of reality which are produced via the interaction between humans and the world. The interactions that generate data can take place in any social setting regardless of research purposes. Examples range from a biologist measuring the circumference of a cell in the lab and noting the result in an Excel file, to a teacher counting the number of students in her class and transcribing it in the class register. What counts as data in these interactions are the objects created in the process of description and/or measurement of the world. These objects can be digital (the Excel file) or physical (the class register) and form a footprint of a specific interaction with the natural world. This footprint—“trace” or “mark”, in the words of Ian Hacking (1992) and Hans-Jörg Rheinberger (2011), respectively—constitutes a crucial reference point for analytic study and for the extraction of new insights. This is the reason why data forms a legitimate foundation to empirical knowledge: the production of data is equivalent to “capturing” features of the world that can be used for systematic study. According to the representative approach, data are objects with fixed and unchangeable content, whose meaning, in virtue of being representations of reality, needs to be investigated and revealed step-by-step through adequate inferential methods. The data documenting cell shape can be modelled to test the relevance of shape to the elasticity, permeability and resilience of cells, producing an evidence base to understand cell-to-cell signalling and development. The data produced counting students in class can be aggregated with similar data collected in other schools, producing an evidence base to evaluate the density of students in the area and their school attendance frequency.

This reflects the intuition that data, especially when they come in the form of numerical measurements or images such as photographs, somehow mirror the phenomena that they are created to document, thus providing a snapshot of those phenomena that is amenable to study under the controlled conditions of research. It also reflects the idea of data as “raw” products of research, which are as close as it gets to unmediated knowledge of reality. This makes sense of the truth-value sometimes assigned to data as irrefutable sources of evidence—the Popperian idea that if data are found to support a given claim, then that claim is corroborated as true at least as long as no other data are found to disprove it. Data in this view represent an objective foundation for the acquisition of knowledge and this very objectivity—the ability to derive knowledge from human experience while transcending it—is what makes knowledge empirical. This position is well-aligned with the idea that big data is valuable to science because it facilitates the (broadly understood) inductive accumulation of knowledge: gathering data collected via reliable methods produces a mountain of facts ready to be analysed and, the more facts are produced and connected with each other, the more knowledge can be extracted.

Philosophers have long acknowledged that data do not speak for themselves and different types of data require different tools for analysis and preparation to be interpreted (Bogen 2009 [2013]). According to the representative view, there are correct and incorrect ways of interpreting data, which those responsible for data analysis need to uncover. But what is a “correct” interpretation in the realm of big data, where data are consistently treated as mobile entities that can, at least in principle, be reused in countless different ways and towards different objectives? Perhaps more than at any other time in the history of science, the current mobilisation and re-use of big data highlights the degree to which data interpretation—and with it, whatever data is taken to represent—may differ depending on the conceptual, material and social conditions of inquiry. The analysis of how big data travels across contexts shows that the expectations and abilities of those involved determine not only the way data are interpreted, but also what is regarded as “data” in the first place (Leonelli & Tempini forthcoming). The representative view of data as objects with fixed and contextually independent meaning is at odds with these observations.

An alternative approach is to embrace these findings and abandon the idea of data as fixed representations of reality altogether. Within the relational view, data are objects that are treated as potential or actual evidence for scientific claims in ways that can, at least in principle, be scrutinised and accounted for (Leonelli 2016). The meaning assigned to data depends on their provenance, their physical features and what these features are taken to represent, and the motivations and instruments used to visualise them and to defend specific interpretations. The reliability of data thus depends on the credibility and strictness of the processes used to produce and analyse them. The presentation of data; the way they are identified, selected, and included (or excluded) in databases; and the information provided to users to re-contextualise them are fundamental to producing knowledge and significantly influence its content. For instance, changes in data format—as most obviously involved in digitisation, data compression or archival procedures—can have a significant impact on where, when, and by whom the data are used as a source of knowledge.

This framework acknowledges that any object can be used as a datum, or stop being used as such, depending on the circumstances—a consideration familiar to big data analysts used to pick and mix data coming from a vast variety of sources. The relational view also explains how, depending on the research perspective interpreting it, the same dataset may be used to represent different aspects of the world (“phenomena” as famously characterised by James Bogen and James Woodward, 1988). When considering the full cycle of scientific inquiry from the viewpoint of data production and analysis, it is at the stage of data modelling that a specific representational value is attributed to data (Leonelli 2019b).

The relational view of data encourages attention to the history of data, highlighting their continual evolution and sometimes radical alteration, and the impact of this feature on the power of data to confirm or refute hypotheses. It explains the critical importance of documenting data management and transformation processes, especially with big data that transit far and wide over digital channels and are grouped and interpreted in different ways and formats. It also explains the increasing recognition of the expertise of those who produce, curate, and analyse data as indispensable to the effective interpretation of big data within and beyond the sciences; and the inextricable link between social and ethical concerns around the potential impact of data sharing and scientific concerns around the quality, validity, and security of data (boyd & Crawford 2012; Tempini & Leonelli, 2018).

Depending on which view on data one takes, expectations around what big data can do for science will vary dramatically. The representational view accommodates the idea of big data as providing the most comprehensive, reliable and generative knowledge base ever witnessed in the history of science, by virtue of its sheer size and heterogeneity. The relational view makes no such commitment, focusing instead on what inferences are being drawn from such data at any given point, how and why.

One thing that the representational and relational views agree on is the key epistemic role of data as empirical evidence for knowledge claims or interventions. While there is a large philosophical literature on the nature of evidence (e.g., Achinstein 2001; Reiss 2015; Kelly 2016), however, the relation between data and evidence has received less attention. This is arguably due to an implicit acceptance, by many philosophers, of the representational view of data. Within the representational view, the identification of what counts as data is prior to the study of what those data can be evidence for: in other words, data are “givens”, as the etymology of the word indicates, and inferential methods are responsible for determining whether and how the data available to investigators can be used as evidence, and for what. The focus of philosophical attention is thus on formal methods to single out errors and misleading interpretations, and the probabilistic and/or explanatory relation between what is unproblematically taken to be a body of evidence and a given hypothesis. Hence much of the expansive philosophical work on evidence avoids the term “data” altogether. Peter Achinstein’s seminal work is a case in point: it discusses observed facts and experimental results, and whether and under which conditions scientists would have reasons to believe such facts, but it makes no mention of data and related processing practices (Achinstein 2001).

By contrast, within the relational view an object can only be identified as a datum when it is viewed as having value as evidence. Evidence becomes a category of data identification, rather than a category of data use as in the representational view (Canali 2019). Evidence is thus constitutive of the very notion of data and cannot be disentangled from it. This involves accepting that the conditions under which a given object can serve as evidence, and thus be viewed as a datum, may change; and that should this evidential role stop altogether, the object would revert to being an ordinary, non-datum item. For example, the photograph of a plant taken by a tourist in a remote region may become relevant as evidence for an inquiry into the morphology of plants from that particular locality; yet most photographs of plants are never considered as evidence for an inquiry into the features and functioning of the world, and of those that are, many may subsequently be discarded as uninteresting or no longer pertinent to the questions being asked.

This view accounts for the mobility and repurposing that characterises big data use, and for the possibility that objects that were not originally generated in order to serve as evidence may be subsequently adopted as such. Consider Mayo and Spanos’s “minimal scientific principle for evidence”, which they define as follows:

Data \(x_0\) provide poor evidence for H if they result from a method or procedure that has little or no ability of finding flaws in H, even if H is false. (Mayo & Spanos 2009b)

This principle is compatible with the relational view of data since it incorporates cases where the methods used to generate and process data may not have been geared towards the testing of a hypothesis H: all it asks is that such methods can be made relevant to the testing of H, at the point in which data are used as evidence for H (I shall come back to the role of hypotheses in the handling of evidence in the next section).

The relational view also highlights the relevance of practices of data formatting and manipulation to the treatment of data as evidence, thus taking attention away from the characteristics of the data objects alone and focusing instead on the agency attached to and enabled by those characteristics. Nora Boyd has provided a way to conceptualise data processing as an integral part of inferential processes, and thus of how we should understand evidence. To this aim she introduced the notion of “line of evidence”, which she defines as:

a sequence of empirical results including the records of data collection and all subsequent products of data processing generated on the way to some final empirical constraint. (Boyd 2018: 406)

She thus proposes a conception of evidence that embraces both data and the way in which data are handled, and indeed emphasises the importance of auxiliary information used when assessing data for interpretation, which includes

the metadata regarding the provenance of the data records and the processing workflow that transforms them. (2018: 407)

As she concludes,

together, a line of evidence and its associated metadata compose what I am calling an “enriched line of evidence”. The evidential corpus is then to be made up of many such enriched lines of evidence. (2018: 407)

The relational view thus fosters a functional and contextualist approach to evidence as the manner through which one or more objects are used as warrant for particular knowledge items (which can be propositional claims, but also actions such as specific decisions or modes of conduct/ways of operating). This chimes with the contextual view of evidence defended by Reiss (2015), John Norton’s work on the multiple, tangled lines of inferential reasoning underpinning appeals to induction (2003), and Hasok Chang’s emphasis on the epistemic activities required to ground evidential claims (2012). Building on these ideas and on Stephen Toulmin’s seminal work on research schemas (1958), Alison Wylie has gone one step further in evaluating the inferential scaffolding that researchers (and particularly archaeologists, who so often are called to re-evaluate the same data as evidence for new claims; Wylie 2017) need to make sense of their data, interpret them in ways that are robust to potential challenges, and modify interpretations in the face of new findings. This analysis enabled Wylie to formulate a set of conditions for robust evidential reasoning, which include epistemic security in the chain of evidence, causal anchoring and causal independence of the data used as evidence, as well as the explicit articulation of the grounds for calibration of the instruments and methods involved (Chapman & Wylie 2016; Wylie forthcoming). A similar conclusion is reached by Jessey Wright’s evaluation of the diverse data analysis techniques that neuroscientists use to make sense of functional magnetic resonance imaging of the brain (fMRI scans):

different data analysis techniques reveal different patterns in the data. Through the use of multiple data analysis techniques, researchers can produce results that are locally robust. (Wright 2017: 1179)

Wylie’s and Wright’s analyses exemplify how a relational approach to data fosters a normative understanding of “good evidence” which is anchored in situated judgement—the arguably human prerogative to contextualise and assess the significance of evidential claims. The advantages of this view of evidence are eloquently expressed by Nancy Cartwright’s critique of both philosophical theories and policy approaches that do not recognise the local and contextual nature of evidential reasoning. As she notes,

we need a concept that can give guidance about what is relevant to consider in deciding on the probability of the hypothesis, not one that requires that we already know significant facts about the probability of the hypothesis on various pieces of evidence. (Cartwright 2013: 6)

Thus she argues for a notion of evidence that is not too restrictive, takes account of the difficulties in combining and selecting evidence, and allows for contextual judgement on what types of evidence are best suited to the inquiry at hand (Cartwright 2013, 2019). Reiss’s pragmatic theory of evidence similarly

takes scientific practice [..] seriously, both in terms of its greater use of knowledge about the conditions under which science is practised and in terms of its goal to develop insights that are relevant to practising scientists. (Reiss 2015: 361)

A better characterisation of the relation between data and evidence, predicated on the study of how data are processed and aggregated, may go a long way towards addressing these demands. As aptly argued by James Woodward, the evidential relationship between data and claims is not “a purely formal, logical, or a priori matter” (Woodward 2000: S172–173). This again sits uneasily with the expectation that big data analysis may automate scientific discovery and make human judgement redundant.

Let us now return to the idea of data-driven inquiry, often suggested as a counterpoint to hypothesis-driven science (e.g., Hey et al. 2009). Kevin Elliott and colleagues have offered a brief history of hypothesis-driven inquiry (Elliott et al. 2016), emphasising how scientific institutions (including funding programmes and publication venues) have pushed researchers towards a Popperian conceptualisation of inquiry as the formulation and testing of a strong hypothesis. Big data analysis clearly points to a different and arguably Baconian understanding of the role of hypothesis in science. Theoretical expectations are no longer seen as driving the process of inquiry, and empirical input is recognised as primary in determining the direction of research and the phenomena—and related hypotheses—considered by researchers.

The emphasis on data as a central component of research poses a significant challenge to one of the best-established philosophical views on scientific knowledge. According to this view, which I shall label the theory-centric view of science, scientific knowledge consists of justified true beliefs about the world. These beliefs are obtained through empirical methods aiming to test the validity and reliability of statements that describe or explain aspects of reality. Hence scientific knowledge is conceptualised as inherently propositional: what count as outputs are the claims published in books and journals, which are also typically presented as solutions to hypothesis-driven inquiry. This view acknowledges the significance of methods, data, models, instruments and materials within scientific investigations, but ultimately regards them as means towards one end: the achievement of true claims about the world. Reichenbach’s seminal distinction between contexts of discovery and justification exemplifies this position (Reichenbach 1938). Theory-centrism recognises research components such as data and related practical skills as essential to discovery, and more specifically to the messy, irrational part of scientific work that involves value judgements, trial-and-error, intuition and exploration, and within which the very phenomena to be investigated may not have been stabilised. The justification of claims, by contrast, involves the rational reconstruction of the research that has been performed, so that it conforms to established norms of inferential reasoning. Importantly, within the context of justification, only data that support the claims of interest are explicitly reported and discussed: everything else—including the vast majority of data produced in the course of inquiry—is lost to the chaotic context of discovery. [2]

Much recent philosophy of science, and particularly work on modelling and experimentation, has challenged theory-centrism by highlighting the role of models, methods and modes of intervention as research outputs rather than simple tools, and stressing the importance of expanding philosophical understandings of scientific knowledge to include these elements alongside propositional claims. The rise of big data offers another opportunity to reframe understandings of scientific knowledge as not necessarily centred on theories and to include non-propositional components—thus, in Cartwright’s paraphrase of Gilbert Ryle’s famous distinction, refocusing on knowing-how over knowing-that (Cartwright 2019). One way to construe data-centric methods is indeed to embrace a conception of knowledge as ability, such as that promoted by early pragmatists like John Dewey and more recently reprised by Chang, who specifically highlighted it as the broader category within which the understanding of knowledge-as-information needs to be placed (Chang 2017).

Another way to interpret the rise of big data is as a vindication of inductivism in the face of the barrage of philosophical criticism levelled against theory-free reasoning over the centuries. For instance, Jon Williamson (2004: 88) has argued that advances in automation, combined with the emergence of big data, lend plausibility to inductivist philosophy of science. Wolfgang Pietsch agrees with this view and has provided a sophisticated framework to understand just what kind of inductive reasoning is instigated by big data and related machine learning methods such as decision trees (Pietsch 2015). Following John Stuart Mill, he calls this approach variational induction and presents it as common to both big data approaches and exploratory experimentation, though the former can handle a much larger number of variables (Pietsch 2015: 913). Pietsch concludes that the problem of theory-ladenness in machine learning can be addressed by determining under which theoretical assumptions variational induction works (2015: 910ff).

Others are less inclined to see theory-ladenness as a problem that can be mitigated by data-intensive methods, and rather see it as a constitutive part of the process of empirical inquiry. Harking back to the extensive literature on perspectivism and experimentation (Gooding 1990; Giere 2006; Radder 2006; Massimi 2012), Werner Callebaut has forcefully argued that the most sophisticated and standardised measurements embody a specific theoretical perspective, and this is no less true of big data (Callebaut 2012). Elliott and colleagues emphasise that conceptualising big data analysis as atheoretical risks encouraging unsophisticated attitudes to empirical investigation as a

“fishing expedition”, having a high probability of leading to nonsense results or spurious correlations, being reliant on scientists who do not have adequate expertise in data analysis, and yielding data biased by the mode of collection. (Elliott et al. 2016: 880)

To address related worries in genetic analysis, Ken Waters has provided the useful characterisation of “theory-informed” inquiry (Waters 2007), which can be invoked to stress how theory informs the methods used to extract meaningful patterns from big data, and yet does not necessarily determine either the starting point or the outcomes of data-intensive science. This does not resolve the question of what role theory actually plays. Rob Kitchin (2014) has proposed to see big data as linked to a new mode of hypothesis generation within a hypothetical-deductive framework. Leonelli is more sceptical of attempts to match big data approaches, which are many and diverse, with a specific type of inferential logic. She focuses instead on the extent to which the theoretical apparatus at work within big data analysis rests on conceptual decisions about how to order and classify data—and proposes that such decisions can give rise to a particular form of theorization, which she calls classificatory theory (Leonelli 2016).

These disagreements point to big data as eliciting diverse understandings of the nature of knowledge and inquiry, and the complex iterations through which different inferential methods build on each other. Again, in the words of Elliott and colleagues,

attempting to draw a sharp distinction between hypothesis-driven and data-intensive science is misleading; these modes of research are not in fact orthogonal and often intertwine in actual scientific practice. (Elliott et al. 2016: 881, see also O’Malley et al. 2009, Elliott 2012)

Another epistemological debate strongly linked to reflection on big data concerns the specific kinds of knowledge emerging from data-centric forms of inquiry, and particularly the relation between predictive and causal knowledge.

Big data science is widely seen as revolutionary in the scale and power of predictions that it can support. Unsurprisingly perhaps, a philosophically sophisticated defence of this position comes from the philosophy of mathematics, where Marco Panza, Domenico Napoletani and Daniele Struppa have argued that big data science occasions a momentous shift in the predictive knowledge that mathematical analysis can yield, and thus in its role within broader processes of knowledge production. The whole point of big data analysis, they posit, is its disregard for causal knowledge:

answers are found through a process of automatic fitting of the data to models that do not carry any structural understanding beyond the actual solution of the problem itself. (Napoletani, Panza, & Struppa 2014: 486)

This view differs from simplistic popular discourse on “the death of theory” (Anderson 2008) and the “power of correlations” (Mayer-Schönberger and Cukier 2013) insofar as it does not side-step the constraints on the knowledge and generalisations that can be extracted from big data analysis. Napoletani, Panza and Struppa recognise that there are inescapable tensions around the ability of mathematical reasoning to overdetermine empirical input, to the point of providing a justification for any and every possible interpretation of the data. In their words,

the problem arises of how we can gain meaningful understanding of historical phenomena, given the tremendous potential variability of their developmental processes. (Napoletani et al. 2014: 487)

Their solution is to clarify that understanding phenomena is not the goal of predictive reasoning, which is rather a form of agnostic science: “the possibility of forecasting and analysing without a structured and general understanding” (Napoletani et al. 2011: 12). The opacity of algorithmic rationality thus becomes its key virtue and the reason for the extraordinary epistemic success of forecasting grounded on big data. While “the phenomenon may forever remain hidden to our understanding” (ibid.: 5), the application of mathematical models and algorithms to big data can still provide meaningful and reliable answers to well-specified problems—similarly to what has been argued in the case of false models (Wimsatt 2007). Examples include the use of “forcing” methods such as regularisation or diffusion geometry to facilitate the extraction of useful insights from messy datasets.

This view is at odds with accounts that posit scientific understanding as a key aim of science (de Regt 2017), and with the intuition that what researchers are ultimately interested in is

whether the opaque data-model generated by machine-learning technologies count as explanations for the relationships found between input and output. (Boon 2020: 44)

Within the philosophy of biology, for example, it is well recognised that big data facilitates effective extraction of patterns and trends, and that being able to model and predict how an organism or ecosystem may behave in the future is of great importance, particularly within more applied fields such as biomedicine or conservation science. At the same time, researchers are interested in understanding the reasons for observed correlations, and typically use predictive patterns as heuristics to explore, develop and verify causal claims about the structure and functioning of entities and processes. Emanuele Ratti (2015) has argued that big data mining within genome-wide association studies, as often used in cancer genomics, can actually underpin mechanistic reasoning, for instance by supporting eliminative inference to develop mechanistic hypotheses and by helping to explore and evaluate generalisations used to analyse the data. In a similar vein, Pietsch (2016) proposed to use variational induction as a method to establish what counts as causal relationships among big data patterns, by focusing on which analytic strategies allow for reliable prediction and effective manipulation of a phenomenon.

Through the study of data sourcing and processing in epidemiology, Stefano Canali has instead highlighted the difficulties of deriving mechanistic claims from big data analysis, particularly where data are varied and embody incompatible perspectives and methodological approaches (Canali 2016, 2019). Relatedly, the semantic and logistical challenges of organising big data give reason to doubt the reliability of causal claims extracted from such data. In terms of logistics, having a lot of data is not the same as having all of them, and cultivating illusions of comprehensiveness is a risky and potentially misleading strategy, particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of “omics” approaches (see also the next section). The constant worry about the partiality and reliability of data is reflected in the care put by database curators in enabling database users to assess such properties; and in the importance given by researchers themselves, particularly in the biological and environmental sciences, to evaluating the quality of data found on the internet (Leonelli 2014, Fleming et al. 2017). In terms of semantics, we are back to the role of data classifications as theoretical scaffolding for big data analysis that we discussed in the previous section. Taxonomic efforts to order and visualise data inform causal reasoning extracted from such data (Sterner & Franz 2017), and can themselves constitute a bottom-up method—grounded in comparative reasoning—for assigning meaning to data models, particularly in situations where a full-blown theory or explanation for the phenomenon under investigation is not available (Sterner 2014).

It is no coincidence that much philosophical work on the relation between causal and predictive knowledge extracted from big data comes from the philosophy of the life sciences, where the absence of axiomatized theories has elicited sophisticated views on the diversity of forms and functions of theory within inferential reasoning. Moreover, biological data are heterogeneous both in their content and in their format; are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and present curators with specific challenges to do with tracking complex, diverse and evolving organismal structures and behaviours, whose relation to an ever-changing environment is hard to pinpoint with any stability (e.g., Shavit & Griesemer 2009). Hence in this domain, some of the core methods and epistemic concerns of experimental research—including exploratory experimentation, sampling and the search for causal mechanisms—remain crucial parts of data-centric inquiry.

At the start of this entry I listed “value” as a major characteristic of big data and pointed to the crucial role of valuing procedures in identifying, processing, modelling and interpreting data as evidence. Identifying and negotiating different forms of data value is an unavoidable part of big data analysis, since these valuation practices determine which data are made available to whom, under which conditions and for which purposes. What researchers choose to consider as reliable data (and data sources) is closely intertwined not only with their research goals and interpretive methods, but also with their approach to data production, packaging, storage and sharing. Thus, researchers need to consider what value their data may have for future research by themselves and others, and how to enhance that value—such as through decisions around which data to make public, how, when and in which format; or, whenever dealing with data already in the public domain (such as personal data on social media), decisions around whether the data should be shared and used at all, and how.

No matter how one conceptualises value practices, it is clear that their key role in data management and analysis prevents facile distinctions between values and “facts” (understood as propositional claims for which data provide evidential warrant). For example, consider a researcher who values both openness—and related practices of widespread data sharing—and scientific rigour—which requires a strict monitoring of the credibility and validity of conditions under which data are interpreted. The scale and manner of big data mobilisation and analysis create tensions between these two values. While the commitment to openness may prompt interest in data sharing, the commitment to rigour may hamper it, since once data are freely circulated online it becomes very difficult to retain control over how they are interpreted, by whom and with which knowledge, skills and tools. How a researcher responds to this conflict affects which data are made available for big data analysis, and under which conditions. Similarly, the extent to which diverse datasets may be triangulated and compared depends on the intellectual property regimes under which the data—and related analytic tools—have been produced. Privately owned data are often unavailable to publicly funded researchers; and many algorithms, cloud systems and computing facilities used in big data analytics are only accessible to those with enough resources to buy relevant access and training. Whatever claims result from big data analysis are, therefore, strongly dependent on social, financial and cultural constraints that condition the data pool and its analysis.

This prominent role of values in shaping data-related epistemic practices is not surprising given existing philosophical critiques of the fact/value distinction (e.g., Douglas 2009), and the existing literature on values in science—such as Helen Longino’s seminal distinction between constitutive and contextual values, as presented in her 1990 book Science as Social Knowledge—may well apply in this case too. Similarly, it is well established that the technological and social conditions of research strongly condition its design and outcomes. What is particularly worrying in the case of big data is the temptation, prompted by hyped expectations around the power of data analytics, to hide or side-line the valuing choices that underpin the methods, infrastructures and algorithms used for big data extraction.

Consider the use of high-throughput data production tools, which enable researchers to easily generate a large volume of data in formats already geared to computational analysis. Just as in the case of other technologies, researchers have a strong incentive to adopt such tools for data generation; and may do so even in cases where such tools are not good, or even appropriate, means of pursuing the investigation. Ulrich Krohs uses the term convenience experimentation to refer to experimental designs that are adopted not because they are the most appropriate ways of pursuing a given investigation, but because they are easily and widely available and usable, and thus “convenient” means for researchers to pursue their goals (Krohs 2012).

Appeals to convenience can extend to other aspects of data-intensive analysis. Not all data are equally easy to digitally collect, disseminate and link through existing algorithms, which makes some data types and formats more convenient than others for computational analysis. For example, research databases often display the outputs of well-resourced labs within research traditions which deal with “tractable” data formats (such as “omics”). And indeed, the existing distribution of resources, infrastructure and skills determines high levels of inequality in the production, dissemination and use of big data for research. Big players with large financial and technical resources are leading the development and uptake of data analytics tools, leaving much publicly funded research around the world at the receiving end of innovation in this area. Contrary to popular depictions of the data revolution as harbinger of transparency, democracy and social equality, the digital divide between those who can access and use data technologies, and those who cannot, continues to widen. A result of such divides is the scarcity of data relating to certain subgroups and geographical locations, which again limits the comprehensiveness of available data resources.

In the vast ecosystem of big data infrastructures, it is difficult to keep track of such distortions and assess their significance for data interpretation, especially in situations where heterogeneous data sources structured through appeal to different values are mashed together. Thus, the systematic aggregation of convenient datasets and analytic tools over others often results in a big data pool where the relevant sources and forms of bias are impossible to locate and account for (Pasquale 2015; O’Neil 2016; Zuboff 2017; Leonelli 2019a). In such a landscape, arguments for a separation between fact and value—and even a clear distinction between the role of epistemic and non-epistemic values in knowledge production—become very difficult to maintain without discrediting the whole edifice of big data science. Given the extent to which this approach has penetrated research in all domains, however, it is arguably impossible to critique the value-laden structure of big data science without calling into question the legitimacy of science itself. A more constructive approach is to embrace the extent to which big data science is anchored in human choices, interests and values, and ascertain how this affects philosophical views on knowledge, truth and method.

In closing, it is important to consider at least some of the risks and related ethical questions raised by research with big data. As already mentioned in the previous section, reliance on big data collected by powerful institutions or corporations raises significant social concerns. Contrary to the view that sees big and open data as harbingers of democratic social participation in research, the way that scientific research is governed and financed is not challenged by big data. Rather, the increasing commodification and large value attributed to certain kinds of data (e.g., personal data) is associated with an increase in inequality of power and visibility between different nations, segments of the population and scientific communities (O’Neil 2016; Zuboff 2017; D’Ignazio and Klein 2020). The gap between those who can merely access data and those who can also use them is widening, leading from a state of digital divide to a condition of “data divide” (Bezuidenhout et al. 2017).

Moreover, the privatisation of data has serious implications for the world of research and the knowledge it produces. Firstly, it affects which data are disseminated, and with which expectations. Corporations usually only release data that they regard as having lesser commercial value and that they need public sector assistance to interpret. This introduces another distortion in the sources and types of data that are accessible online, while more expensive and complex data are kept secret. Even the ways in which citizens (researchers included) are encouraged to interact with databases and data interpretation sites tend to promote participation that generates further commercial value. Sociologists have recently described this type of social participation as a form of exploitation (Prainsack & Buyx 2017; Srnicek 2017). In turn, these ways of exploiting data strengthen their economic value over their scientific value. When it comes to the commerce of personal data between companies working in analytics, the value of data as commercial products (which includes the evaluation of the speed and efficiency with which access to certain data can help develop new products) often takes priority over scientific concerns such as, for example, the representativeness and reliability of the data and the ways they were analysed. This can result in decisions that are scientifically problematic, or that show no interest in investigating the consequences of the assumptions made and the processes used. Such lack of interest easily translates into ignorance of discrimination, inequality and potential errors in the data considered. This type of ignorance is highly strategic and economically productive, since it enables the use of data without concerns over their social and scientific implications. In this scenario the evaluation of data quality shrinks to an evaluation of their usefulness for the short-term analyses or forecasts required by the client. There are no incentives in this system to encourage evaluation of the long-term implications of data analysis. The risk here is that the commerce of data is accompanied by an increasing divergence between data and their context. Interest in the history of the data’s travels, in the plurality of their emotional or scientific value and in the re-evaluation of their origins tends to disappear over time, to be substituted by the increasing hold of the financial value of data.

The multiplicity of data sources and tools for aggregation also creates risks. The complexity of the data landscape is making it harder to identify which parts of the infrastructure require updating or have been put in doubt by new scientific developments. The situation worsens when considering the number of databases that populate every area of scientific research, each containing assumptions that influence the circulation and interoperability of data and that often are not updated in a reliable and regular way. To provide an idea of the numbers involved, the prestigious journal Nucleic Acids Research publishes a yearly special issue on new databases relevant to molecular biology, which included 56 new infrastructures in 2015, 62 in 2016, 54 in 2017 and 82 in 2018. These are just a small proportion of the hundreds of databases developed each year in the life sciences alone. The fact that these databases rely on short-term funding means that a growing percentage of resources remain available to consult online although they are long dead: a condition that is not always visible to users, who may trust a database without checking whether it is actively maintained. At what point do these infrastructures become obsolete? What are the risks involved in weaving an ever more extensive tapestry of infrastructures that depend on each other, given the disparity in the ways they are managed and the challenges in identifying and comparing the prerequisite conditions, theories and scaffolding used to build them? One of these risks is rampant conservatism: the insistence on recycling old data whose features and management become increasingly murky as time goes by, instead of encouraging the production of new data with features that specifically respond to the requirements and circumstances of their users. In disciplines such as biology and medicine, which study living beings that are by definition continually evolving and developing, such trust in old data is particularly alarming. It is not the case, for example, that data collected on fungi ten, twenty or even a hundred years ago are a reliable basis for explaining the behaviour of the same species of fungi now or in the future (Leonelli 2018).

Researchers of what Luciano Floridi calls the infosphere—the way in which the introduction of digital technologies is changing the world—are becoming aware of the destructive potential of big data and of the urgent need to focus efforts on managing and using data in active and thoughtful ways that improve the human condition. In Floridi’s own words:

ICT yields great opportunity which, however, entails the enormous intellectual responsibility of understanding this technology to use it in the most appropriate way. (Floridi 2014: vii; see also British Academy & Royal Society 2017)

In light of these findings, it is essential that ethical and social issues are seen as a core part of the technical and scientific requirements associated with data management and analysis. The ethical management of data is not achieved exclusively by regulating the commerce of research data and the management of personal data, nor by introducing oversight of research financing, even though these are important strategies. To guarantee that big data are used in the most scientifically and socially forward-thinking way, it is necessary to transcend the concept of ethics as something external and alien to research. An analysis of the ethical implications of data science should become a basic component of the background and activity of those who take care of data and of the methods used to visualise and analyse them. Ethical evaluations and choices are hidden in every aspect of data management, including those choices that may seem purely technical.

This entry has stressed how the emerging emphasis on big data signals the rise of a data-centric approach to research, in which efforts to mobilise, integrate, disseminate and visualise data are viewed as central contributions to discovery. The emergence of data-centrism highlights the challenges involved in gathering, classifying and interpreting data, and the concepts, technologies and institutions that surround these processes. Tools such as high-throughput measurement instruments and apps for smartphones are fast generating large volumes of data in digital formats. In principle, these data are immediately available for dissemination through internet platforms, which can make them accessible to anybody with a broadband connection in a matter of seconds. In practice, however, access to data is fraught with conceptual, technical, legal and ethical implications; and even when access can be granted, it does not guarantee that the data can be fruitfully used to spur further research. Furthermore, the mathematical and computational tools developed to analyse big data are often opaque in their functioning and assumptions, leading to results whose scientific meaning and credibility may be difficult to assess. This increases the worry that big data science may be grounded upon, and may ultimately support, processes that make human ingenuity hostage to an alien, artificial and ultimately unintelligible intelligence.

Perhaps the most confronting aspect of big data science as discussed in this entry is the extent to which it deviates from understandings of rationality grounded on individual agency and cognitive abilities (on which much of contemporary philosophy of science is predicated). The power of any one dataset to yield knowledge lies in the extent to which it can be linked with others: this is what lends high epistemic value to digital objects such as GPS locations or sequencing data, and what makes extensive data aggregation from a variety of sources into a highly effective surveillance tool. Data production and dissemination channels such as social media, governmental databases and research repositories operate in a globalised, interlinked and distributed network, whose functioning requires a wide variety of skills and expertise. The distributed nature of decision-making involved in developing big data infrastructures and analytics makes it impossible for any one individual to retain oversight over the quality, scientific significance and potential social impact of the knowledge being produced.

Big data analysis may therefore constitute the ultimate instance of a distributed cognitive system. Where does this leave accountability questions? Many individuals, groups and institutions end up sharing responsibility for the conceptual interpretation and social outcomes of specific data uses. A key challenge for big data governance is to find mechanisms for allocating responsibilities across this complex network, so that erroneous and unwarranted decisions—as well as outright fraudulent, unethical, abusive, discriminatory or misguided actions—can be singled out, corrected and appropriately sanctioned. Thinking about the complex history, processing and use of data can encourage philosophers to avoid ahistorical, uncontextualized approaches to questions of evidence, and instead consider the methods, skills, technologies and practices involved in handling data—and particularly big data—as crucial to understanding empirical knowledge-making.

Bibliography

  • Achinstein, Peter, 2001, The Book of Evidence , Oxford: Oxford University Press. doi:10.1093/0195143892.001.0001
  • Anderson, Chris, 2008, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired Magazine , 23 June 2008.
  • Aronova, Elena, Karen S. Baker, and Naomi Oreskes, 2010, “Big science and big data in biology: From the International Geophysical Year through the International Biological Program to the Long Term Ecological Research (LTER) Network, 1957–present”, Historical Studies in the Natural Sciences , 40: 183–224.
  • Aronova, Elena, Christine von Oertzen, and David Sepkoski, 2017, “Introduction: Historicizing Big Data”, Osiris , 32(1): 1–17. doi:10.1086/693399
  • Bauer, Susanne, 2008, “Mining Data, Gathering Variables and Recombining Information: The Flexible Architecture of Epidemiological Studies”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 39(4): 415–428. doi:10.1016/j.shpsc.2008.09.008
  • Bechtel, William, 2016, “Using Computational Models to Discover and Understand Mechanisms”, Studies in History and Philosophy of Science Part A , 56: 113–121. doi:10.1016/j.shpsa.2015.10.004
  • Beisbart, Claus, 2012, “How Can Computer Simulations Produce New Knowledge?”, European Journal for Philosophy of Science , 2(3): 395–434. doi:10.1007/s13194-012-0049-7
  • Bezuidenhout, Louise, Sabina Leonelli, Ann Kelly, and Brian Rappert, 2017, “Beyond the Digital Divide: Towards a Situated Approach to Open Data”, Science and Public Policy , 44(4): 464–475. doi:10.1093/scipol/scw036
  • Bogen, Jim, 2009 [2013], “Theory and Observation in Science”, in The Stanford Encyclopedia of Philosophy (Spring 2013 Edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2013/entries/science-theory-observation/ >.
  • –––, 2010, “Noise in the World”, Philosophy of Science , 77(5): 778–791. doi:10.1086/656006
  • Bogen, James and James Woodward, 1988, “Saving the Phenomena”, The Philosophical Review , 97(3): 303. doi:10.2307/2185445
  • Bokulich, Alisa, 2018, “Using Models to Correct Data: Paleodiversity and the Fossil Record”, Synthese , special issue: Abstraction and Idealization in Scientific Modelling, first online 29 May 2018. doi:10.1007/s11229-018-1820-x
  • Boon, Mieke, 2020, “How Scientists Are Brought Back into Science—The Error of Empiricism”, in A Critical Reflection on Automated Science , Marta Bertolaso and Fabio Sterpetti (eds.), (Human Perspectives in Health Sciences and Technology 1), Cham: Springer International Publishing, 43–65. doi:10.1007/978-3-030-25001-0_4
  • Borgman, Christine L., 2015, Big Data, Little Data, No Data , Cambridge, MA: MIT Press.
  • Boumans, M.J. and Sabina Leonelli, forthcoming, “From Dirty Data to Tidy Facts: Practices of Clustering in Plant Phenomics and Business Cycles”, in Leonelli and Tempini forthcoming.
  • Boyd, Danah and Kate Crawford, 2012, “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon”, Information, Communication & Society , 15(5): 662–679. doi:10.1080/1369118X.2012.678878
  • Boyd, Nora Mills, 2018, “Evidence Enriched”, Philosophy of Science , 85(3): 403–421. doi:10.1086/697747
  • Bowker, Geoffrey C., 2006, Memory Practices in the Sciences , Cambridge, MA: The MIT Press.
  • Bringsjord, Selmer and Naveen Sundar Govindarajulu, 2018, “Artificial Intelligence”, in The Stanford Encyclopedia of Philosophy (Fall 2018 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/fall2018/entries/artificial-intelligence/ >.
  • British Academy & Royal Society, 2017, Data Management and Use: Governance in the 21st Century. A Joint Report of the Royal Society and the British Academy , British Academy & Royal Society 2017 available online (see Report).
  • Cai, Li and Yangyong Zhu, 2015, “The Challenges of Data Quality and Data Quality Assessment in the Big Data Era”, Data Science Journal , 14: 2. doi:10.5334/dsj-2015-002
  • Callebaut, Werner, 2012, “Scientific Perspectivism: A Philosopher of Science’s Response to the Challenge of Big Data Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 69–80. doi:10.1016/j.shpsc.2011.10.007
  • Calude, Cristian S. and Giuseppe Longo, 2017, “The Deluge of Spurious Correlations in Big Data”, Foundations of Science , 22(3): 595–612. doi:10.1007/s10699-016-9489-4
  • Canali, Stefano, 2016, “Big Data, Epistemology and Causality: Knowledge in and Knowledge out in EXPOsOMICS”, Big Data & Society , 3(2): 205395171666953. doi:10.1177/2053951716669530
  • –––, 2019, “Evaluating Evidential Pluralism in Epidemiology: Mechanistic Evidence in Exposome Research”, History and Philosophy of the Life Sciences , 41(1): art. 4. doi:10.1007/s40656-019-0241-6
  • Cartwright, Nancy D., 2013, Evidence: For Policy and Wheresoever Rigor Is a Must , London School of Economics and Political Science (LSE), Order Project Discussion Paper Series [Cartwright 2013 available online ].
  • –––, 2019, Nature, the Artful Modeler: Lectures on Laws, Science, How Nature Arranges the World and How We Can Arrange It Better (The Paul Carus Lectures) , Chicago, IL: Open Court.
  • Chang, Hasok, 2012, Is Water H2O? Evidence, Realism and Pluralism , (Boston Studies in the Philosophy of Science 293), Dordrecht: Springer Netherlands. doi:10.1007/978-94-007-3932-1
  • –––, 2017, “VI—Operational Coherence as the Source of Truth”, Proceedings of the Aristotelian Society , 117(2): 103–122. doi:10.1093/arisoc/aox004
  • Chapman, Robert and Alison Wylie, 2016, Evidential Reasoning in Archaeology , London: Bloomsbury Publishing Plc.
  • Collins, Harry M., 1990, Artificial Experts: Social Knowledge and Intelligent Machines , Cambridge, MA: MIT Press.
  • Craver, Carl F. and Lindley Darden, 2013, In Search of Mechanisms: Discoveries Across the Life Sciences , Chicago: University of Chicago Press.
  • Daston, Lorraine, 2017, Science in the Archives: Pasts, Presents, Futures , Chicago: University of Chicago Press.
  • De Regt, Henk W., 2017, Understanding Scientific Understanding , Oxford: Oxford University Press. doi:10.1093/oso/9780190652913.001.0001
  • D’Ignazio, Catherine and Klein, Lauren F., 2020, Data Feminism , Cambridge, MA: The MIT Press.
  • Douglas, Heather E., 2009, Science, Policy and the Value-Free Ideal , Pittsburgh, PA: University of Pittsburgh Press.
  • Dreyfus, Hubert L., 1992, What Computers Still Can’t Do: A Critique of Artificial Reason , Cambridge, MA: MIT Press.
  • Durán, Juan M. and Nico Formanek, 2018, “Grounds for Trust: Essential Epistemic Opacity and Computational Reliabilism”, Minds and Machines , 28(4): 645–666. doi:10.1007/s11023-018-9481-6
  • Edwards, Paul N., 2010, A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming , Cambridge, MA: The MIT Press.
  • Elliott, Kevin C., 2012, “Epistemic and methodological iteration in scientific research”. Studies in History and Philosophy of Science , 43: 376–382.
  • Elliott, Kevin C., Kendra S. Cheruvelil, Georgina M. Montgomery, and Patricia A. Soranno, 2016, “Conceptions of Good Science in Our Data-Rich World”, BioScience , 66(10): 880–889. doi:10.1093/biosci/biw115
  • Feest, Uljana, 2011, “What Exactly Is Stabilized When Phenomena Are Stabilized?”, Synthese , 182(1): 57–71. doi:10.1007/s11229-009-9616-7
  • Fleming, Lora, Niccolò Tempini, Harriet Gordon-Brown, Gordon L. Nichols, Christophe Sarran, Paolo Vineis, Giovanni Leonardi, Brian Golding, Andy Haines, Anthony Kessel, Virginia Murray, Michael Depledge, and Sabina Leonelli, 2017, “Big Data in Environment and Human Health”, in Oxford Research Encyclopedia of Environmental Science , Oxford: Oxford University Press. doi:10.1093/acrefore/9780199389414.013.541
  • Floridi, Luciano, 2014, The Fourth Revolution: How the Infosphere is Reshaping Human Reality , Oxford: Oxford University Press.
  • Floridi, Luciano and Phyllis Illari (eds.), 2014, The Philosophy of Information Quality , (Synthese Library 358), Cham: Springer International Publishing. doi:10.1007/978-3-319-07121-3
  • Frigg, Roman and Julian Reiss, 2009, “The Philosophy of Simulation: Hot New Issues or Same Old Stew?”, Synthese , 169(3): 593–613. doi:10.1007/s11229-008-9438-z
  • Frigg, Roman and Stephan Hartmann, 2016, “Models in Science”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/win2016/entries/models-science/ >.
  • Gooding, David C., 1990, Experiment and the Making of Meaning , Dordrecht & Boston: Kluwer.
  • Giere, Ronald, 2006, Scientific Perspectivism , Chicago: University of Chicago Press.
  • Griesemer, James R., forthcoming, “A Data Journey through Dataset-Centric Population Biology”, in Leonelli and Tempini forthcoming.
  • Hacking, Ian, 1992, “The Self-Vindication of the Laboratory Sciences”, In Science as Practice and Culture , Andrew Pickering (ed.), Chicago, IL: The University of Chicago Press, 29–64.
  • Harris, Todd, 2003, “Data Models and the Acquisition and Manipulation of Data”, Philosophy of Science , 70(5): 1508–1517. doi:10.1086/377426
  • Hey, Tony, Stewart Tansley, and Kristin Tolle, 2009, The Fourth Paradigm: Data-Intensive Scientific Discovery , Redmond, WA: Microsoft Research.
  • Humphreys, Paul, 2004, Extending Ourselves: Computational Science, Empiricism, and Scientific Method , Oxford: Oxford University Press. doi:10.1093/0195158709.001.0001
  • –––, 2009, “The Philosophical Novelty of Computer Simulation Methods”, Synthese , 169(3): 615–626. doi:10.1007/s11229-008-9435-2
  • Karaca, Koray, 2018, “Lessons from the Large Hadron Collider for Model-Based Experimentation: The Concept of a Model of Data Acquisition and the Scope of the Hierarchy of Models”, Synthese , 195(12): 5431–5452. doi:10.1007/s11229-017-1453-5
  • Kelly, Thomas, 2016, “Evidence”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/win2016/entries/evidence/ >.
  • Kitchin, Rob, 2013, The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences , Los Angeles: Sage.
  • –––, 2014, “Big Data, new epistemologies and paradigm shifts”, Big Data and Society , 1(1) April-June. doi: 10.1177/2053951714528481
  • Kitchin, Rob and Gavin McArdle, 2016, “What Makes Big Data, Big Data? Exploring the Ontological Characteristics of 26 Datasets”, Big Data & Society , 3(1): 205395171663113. doi:10.1177/2053951716631130
  • Krohs, Ulrich, 2012, “Convenience Experimentation”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 52–57. doi:10.1016/j.shpsc.2011.10.005
  • Lagoze, Carl, 2014, “Big Data, data integrity, and the fracturing of the control zone,” Big Data and Society , 1(2) July-December. doi: 10.1177/2053951714558281
  • Leonelli, Sabina, 2014, “What Difference Does Quantity Make? On the Epistemology of Big Data in Biology”, Big Data & Society , 1(1): 205395171453439. doi:10.1177/2053951714534395
  • –––, 2016, Data-Centric Biology: A Philosophical Study , Chicago: University of Chicago Press.
  • –––, 2017, “Global Data Quality Assessment and the Situated Nature of ‘Best’ Research Practices in Biology”, Data Science Journal , 16: 32. doi:10.5334/dsj-2017-032
  • –––, 2018, “The Time of Data: Timescales of Data Use in the Life Sciences”, Philosophy of Science , 85(5): 741–754. doi:10.1086/699699
  • –––, 2019a, La Recherche Scientifique à l’Ère des Big Data: Cinq Façons Dont les Données Massives Nuisent à la Science, et Comment la Sauver , Milano: Éditions Mimésis.
  • –––, 2019b, “What Distinguishes Data from Models?”, European Journal for Philosophy of Science , 9(2): 22. doi:10.1007/s13194-018-0246-0
  • Leonelli, Sabina and Niccolò Tempini, 2018, “Where Health and Environment Meet: The Use of Invariant Parameters in Big Data Analysis”, Synthese , special issue on the Philosophy of Epidemiology , Sean Valles and Jonathan Kaplan (eds.). doi:10.1007/s11229-018-1844-2
  • –––, forthcoming, Data Journeys in the Sciences , Cham: Springer International Publishing.
  • Loettgers, Andrea, 2009, “Synthetic Biology and the Emergence of a Dual Meaning of Noise”, Biological Theory , 4(4): 340–356. doi:10.1162/BIOT_a_00009
  • Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton, NJ: Princeton University Press.
  • Lowrie, Ian, 2017, “Algorithmic Rationality: Epistemology and Efficiency in the Data Sciences”, Big Data & Society , 4(1): 1–13. doi:10.1177/2053951717700925
  • MacLeod, Miles and Nancy J. Nersessian, 2013, “Building Simulations from the Ground Up: Modeling and Theory in Systems Biology”, Philosophy of Science , 80(4): 533–556. doi:10.1086/673209
  • Massimi, Michela, 2011, “From Data to Phenomena: A Kantian Stance”, Synthese , 182(1): 101–116. doi:10.1007/s11229-009-9611-z
  • –––, 2012, “ Scientific perspectivism and its foes”, Philosophica , 84: 25–52.
  • –––, 2016, “Three Tales of Scientific Success”, Philosophy of Science , 83(5): 757–767. doi:10.1086/687861
  • Mayer-Schönberger, Viktor and Kenneth Cukier, 2013, Big Data: A Revolution that Will Transform How We Live, Work, and Think , New York: Eamon Dolan/Houghton Mifflin Harcourt.
  • Mayo, Deborah G., 1996, Error and the Growth of Experimental Knowledge , Chicago: University of Chicago Press.
  • Mayo, Deborah G. and Aris Spanos (eds.), 2009a, Error and Inference , Cambridge: Cambridge University Press.
  • Mayo, Deborah G. and Aris Spanos, 2009b, “Introduction and Background”, in Mayo and Spanos (eds.) 2009a, pp. 1–27.
  • McAllister, James W., 1997, “Phenomena and Patterns in Data Sets”, Erkenntnis , 47(2): 217–228. doi:10.1023/A:1005387021520
  • –––, 2007, “Model Selection and the Multiplicity of Patterns in Empirical Data”, Philosophy of Science , 74(5): 884–894. doi:10.1086/525630
  • –––, 2011, “What Do Patterns in Empirical Data Tell Us about the Structure of the World?”, Synthese , 182(1): 73–87. doi:10.1007/s11229-009-9613-x
  • McQuillan, Dan, 2018, “Data Science as Machinic Neoplatonism”, Philosophy & Technology , 31(2): 253–272. doi:10.1007/s13347-017-0273-3
  • Mitchell, Sandra D., 2003, Biological Complexity and Integrative Pluralism , Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802683
  • Morgan, Mary S., 2005, “Experiments versus Models: New Phenomena, Inference and Surprise”, Journal of Economic Methodology , 12(2): 317–329. doi:10.1080/13501780500086313
  • –––, forthcoming, “The Datum in Context”, in Leonelli and Tempini forthcoming.
  • Morrison, Margaret, 2015, Reconstructing Reality: Models, Mathematics, and Simulations , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199380275.001.0001
  • Müller-Wille, Staffan and Isabelle Charmantier, 2012, “Natural History and Information Overload: The Case of Linnaeus”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 4–15. doi:10.1016/j.shpsc.2011.10.021
  • Napoletani, Domenico, Marco Panza, and Daniele C. Struppa, 2011, “Agnostic Science. Towards a Philosophy of Data Analysis”, Foundations of Science , 16(1): 1–20. doi:10.1007/s10699-010-9186-7
  • –––, 2014, “Is Big Data Enough? A Reflection on the Changing Role of Mathematics in Applications”, Notices of the American Mathematical Society , 61(5): 485–490. doi:10.1090/noti1102
  • Nickles, Thomas, forthcoming, “Alien Reasoning: Is a Major Change in Scientific Research Underway?”, Topoi , first online: 20 March 2018. doi:10.1007/s11245-018-9557-1
  • Norton, John D., 2003, “A Material Theory of Induction”, Philosophy of Science , 70(4): 647–670. doi:10.1086/378858
  • O’Malley, Maureen A., Kevin C. Elliott, Chris Haufe, and Richard Burian, 2009, “Philosophies of Funding”, Cell , 138: 611–615. doi:10.1016/j.cell.2009.08.008
  • O’Malley, Maureen A. and Orkun S. Soyer, 2012, “The Roles of Integration in Molecular Systems Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 58–68. doi:10.1016/j.shpsc.2011.10.006
  • O’Neil, Cathy, 2016, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy , New York: Crown.
  • Parker, Wendy S., 2009, “Does Matter Really Matter? Computer Simulations, Experiments, and Materiality”, Synthese , 169(3): 483–496. doi:10.1007/s11229-008-9434-3
  • –––, 2017, “Computer Simulation, Measurement, and Data Assimilation”, The British Journal for the Philosophy of Science , 68(1): 273–304. doi:10.1093/bjps/axv037
  • Pasquale, Frank, 2015, The Black Box Society: The Secret Algorithms That Control Money and Information , Cambridge, MA: Harvard University Press.
  • Pietsch, Wolfgang, 2015, “Aspects of Theory-Ladenness in Data-Intensive Science”, Philosophy of Science , 82(5): 905–916. doi:10.1086/683328
  • –––, 2016, “The Causal Nature of Modeling with Big Data”, Philosophy & Technology , 29(2): 137–171. doi:10.1007/s13347-015-0202-2
  • –––, 2017, “Causation, probability and all that: Data science as a novel inductive paradigm”, in Frontiers in Data Science , Matthias Dehmer and Frank Emmert-Streib (eds.), Boca Raton, FL: CRC, 329–353.
  • Porter, Theodore M., 1995, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life , Princeton, NJ: Princeton University Press.
  • Porter, Theodore M. and Soraya de Chadarevian, 2018, “Introduction: Scrutinizing the Data World”, Historical Studies in the Natural Sciences , 48(5): 549–556. doi:10.1525/hsns.2018.48.5.549
  • Prainsack, Barbara and Buyx, Alena, 2017, Solidarity in Biomedicine and Beyond , Cambridge, UK: Cambridge University Press.
  • Radder, Hans, 2009, “The Philosophy of Scientific Experimentation: A Review”, Automated Experimentation , 1(1): 2. doi:10.1186/1759-4499-1-2
  • Ratti, Emanuele, 2015, “Big Data Biology: Between Eliminative Inferences and Exploratory Experiments”, Philosophy of Science , 82(2): 198–218. doi:10.1086/680332
  • Reichenbach, Hans, 1938, Experience and Prediction: An Analysis of the Foundations and the Structure of Knowledge , Chicago, IL: The University of Chicago Press.
  • Reiss, Julian, 2015, “A Pragmatist Theory of Evidence”, Philosophy of Science , 82(3): 341–362. doi:10.1086/681643
  • Reiss, Julian, 2015, Causation, Evidence, and Inference , New York: Routledge.
  • Rescher, Nicholas, 1984, The Limits of Science , Berkeley, CA: University of California Press.
  • Rheinberger, Hans-Jörg, 2011, “Infra-Experimentality: From Traces to Data, from Data to Patterning Facts”, History of Science , 49(3): 337–348. doi:10.1177/007327531104900306
  • Romeijn, Jan-Willem, 2017, “Philosophy of Statistics”, in The Stanford Encyclopedia of Philosophy (Spring 2017), Edward N. Zalta (ed.), URL: https://plato.stanford.edu/archives/spr2017/entries/statistics/ .
  • Sepkoski, David, 2013, “Toward ‘a natural history of data’: Evolving practices and epistemologies of data in paleontology, 1800–2000”, Journal of the History of Biology , 46: 401–444.
  • Shavit, Ayelet and James Griesemer, 2009, “There and Back Again, or the Problem of Locality in Biodiversity Surveys*”, Philosophy of Science , 76(3): 273–294. doi:10.1086/649805
  • Srnicek, Nick, 2017, Platform capitalism , Cambridge, UK and Malden, MA: Polity Press.
  • Sterner, Beckett, 2014, “The Practical Value of Biological Information for Research”, Philosophy of Science , 81(2): 175–194. doi:10.1086/675679
  • Sterner, Beckett and Nico M. Franz, 2017, “Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data”, Biological Theory , 12(2): 99–111. doi:10.1007/s13752-017-0259-5
  • Sterner, Beckett W., Nico M. Franz, and J. Witteveen, 2020, “Coordinating dissent as an alternative to consensus classification: insights from systematics for bio-ontologies”, History and Philosophy of the Life Sciences , 42(1): 8. doi: 10.1007/s40656-020-0300-z
  • Stevens, Hallam, 2016, “Hadooping the Genome: The Impact of Big Data Tools on Biology”, BioSocieties , 11: 352–371.
  • Strasser, Bruno, 2019, Collecting Experiments: Making Big Data Biology , Chicago: University of Chicago Press.
  • Suppes, Patrick, 1962, “Models of data”, in Logic, Methodology and Philosophy of Science , Ernest Nagel, Patrick Suppes, & Alfred Tarski (eds.), Stanford: Stanford University Press, 252–261.
  • Symons, John and Ramón Alvarado, 2016, “Can We Trust Big Data? Applying Philosophy of Science to Software”, Big Data & Society , 3(2): 1-17. doi:10.1177/2053951716664747
  • Symons, John and Jack Horner, 2014, “Software Intensive Science”, Philosophy & Technology , 27(3): 461–477. doi:10.1007/s13347-014-0163-x
  • Tempini, Niccolò, 2017, “Till Data Do Us Part: Understanding Data-Based Value Creation in Data-Intensive Infrastructures”, Information and Organization , 27(4): 191–210. doi:10.1016/j.infoandorg.2017.08.001
  • Tempini, Niccolò and Sabina Leonelli, 2018, “Concealment and Discovery: The Role of Information Security in Biomedical Data Re-Use”, Social Studies of Science , 48(5): 663–690. doi:10.1177/0306312718804875
  • Toulmin, Stephen, 1958, The Uses of Argument , Cambridge: Cambridge University Press.
  • Turner, Raymond and Nicola Angius, 2019, “The Philosophy of Computer Science”, in The Stanford Encyclopedia of Philosophy (Spring 2019 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2019/entries/computer-science/ >.
  • Van Fraassen, Bas C., 2008, Scientific Representation: Paradoxes of Perspective , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199278220.001.0001
  • Waters, C. Kenneth, 2007, “The Nature and Context of Exploratory Experimentation: An Introduction to Three Case Studies of Exploratory Research”, History and Philosophy of the Life Sciences , 29(3): 275–284.
  • Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, et al., 2016, “The FAIR Guiding Principles for Scientific Data Management and Stewardship”, Scientific Data , 3(1): 160018. doi:10.1038/sdata.2016.18
  • Williamson, Jon, 2004, “A Dynamic Interaction Between Machine Learning and the Philosophy of Science”, Minds and Machines , 14(4): 539–549.
  • Wimsatt, William C., 2007, Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality , Cambridge, MA: Harvard University Press.
  • Winsberg, Eric, 2010, Science in the Age of Computer Simulation , Chicago: University of Chicago Press.
  • Woodward, James, 2000, “Data, phenomena and reliability”, Philosophy of Science , 67(supplement): Proceedings of the 1998 Biennial Meetings of the Philosophy of Science Association. Part II: Symposia Papers (Sep., 2000), pp. S163–S179. https://www.jstor.org/stable/188666
  • –––, 2010, “Data, Phenomena, Signal, and Noise”, Philosophy of Science , 77(5): 792–803. doi:10.1086/656554
  • Wright, Jessey, 2017, “The Analysis of Data and the Evidential Scope of Neuroimaging Results”, The British Journal for the Philosophy of Science , 69(4): 1179–1203. doi:10.1093/bjps/axx012
  • Wylie, Alison, 2017, “How Archaeological Evidence Bites Back: Strategies for Putting Old Data to Work in New Ways”, Science, Technology, & Human Values , 42(2): 203–225. doi:10.1177/0162243916671200
  • –––, forthcoming, “Radiocarbon Dating in Archaeology: Triangulation and Traceability”, in Leonelli and Tempini forthcoming.
  • Zuboff, Shoshana, 2017, The Age of Surveillance Capitalism: The Fight for the Future at the New Frontier of Power , New York: Public Affairs.

Data Best Practices Workshop - Large language models as research tools

Event Details:


Bryce Grier , Neural Data Architect at Wu Tsai Neuro, hosts a rotating series of Data Best Practices workshops. This workshop introduces the basics of large language models and gives attendees hands-on experience in training and employing them. 

  • Thursday, April 4, 1:00pm – 4:00pm
  • Application is required

This workshop is open to the Stanford research community.

About the Data Best Practices Workshop Series: This workshop series aims to educate and empower the Stanford neuroscience and broader research communities to acquire, store, and analyze their data more effectively. These recurring workshops provide attendees with hands-on introductions and training with essential tools.

Visit the website for more information.

James Haberberger, Data Architect, Brain Resilience Laboratory




Developing a Data Quality Framework for Gen AI

By David Fogarty, April 11, 2024


The HBR article "Is Your Company's Data Ready for Generative AI?", written by Davenport and Tiwari, was very insightful. I had a chance to collaborate with Tom Davenport and his team when I worked at the General Electric Company and Cigna, and his dedication and commitment to research in the field of data and analytics, and now artificial intelligence (AI), have been instrumental to the entire data and analytics practice. Several recent studies by MIT and various data partners, including AWS, Snowflake, and Databricks, all concur that data is emerging as the key ingredient for developing a successful AI application. The MIT and Databricks article found that, in a survey of CIOs, major spending growth is planned to bolster AI's data foundations. This was especially emphasized by survey participants from companies considered to be AI leaders. Moreover, marketing applications of generative AI consistently top the list concerning where gen AI applications can have the greatest impact in firms.

In the HBR article, Davenport and Tiwari pointed to the fact that a survey of over 300 chief data officers indicated that, despite all the hype and the effort firms are making to become more AI-driven, these same firms have yet to create new data strategies or to manage their data in the ways necessary to make generative AI work for them. For example, while 93 percent of the firms in the study agreed that data strategy is critical for getting value from gen AI, 57 percent said they had made no changes thus far in their organizations. Moreover, only 11 percent of the 37 percent that did change their data strategy agreed very strongly that their organizations have the right data foundation for gen AI. Furthermore, the authors point out that the data generative AI uses needs to be of high quality if the models employing it are to be highly useful. It is the old saying: garbage in, garbage out. The phrase can be modernized for the age of AI: "Poor-quality internal data will yield poor-quality responses from gen AI models." Given the need for high-quality data to power gen AI applications, it is worthwhile to dig deeper and create a comprehensive data quality framework to identify and address the data quality issues related to implementing AI applications.

First, a general word of caution to all managers of gen AI projects: even if your data management team is telling you that the data in your firm is of high quality, the definition of high quality for one application does not necessarily carry over to another; data quality is unique to every application. Therefore, the quality of your data will have to be reevaluated for your new gen AI application.

Now how do we find the data quality issues themselves? The first step to understanding the data quality in your firm is to ground your investigation in a solid data quality framework. The seven key dimensions of data quality that comprise this framework are accuracy, completeness, consistency, timeliness, relevance, uniqueness, and validity. Below is a short definition of each:

  • Accuracy: Refers to the degree to which the data values are correct, not some made-up mumbo jumbo.
  • Completeness: Refers to the degree to which the data values are complete and not missing.
  • Consistency: Refers to the degree to which the data values are free from contradiction and conform to a set of established rules or standards.
  • Timeliness: Refers to the degree to which the data values are up to date, i.e., the timeliness or recency of the data.
  • Relevance: Refers to the degree to which data is pertinent and useful to a particular situation or decision-making process.
  • Uniqueness: Refers to the degree to which data items in a dataset are distinct from one another and each represents a unique piece of information.
  • Validity: Refers to the extent to which data conforms to the predefined format and is entered correctly, such that it fits within the constraints of the system.

Each one of the seven dimensions plays a critical role in ensuring the overall quality of data. Examining your data for gen AI against the above framework gives data managers a structured way to assess and improve its quality. The next step is to decide how to evaluate your data's fitness for gen AI in the context of this framework. For this evaluation I highly recommend some of the tools of Lean Six Sigma, especially DMAIC, as I have found this paradigm excellent for quantifying, measuring, and getting to the root causes of data quality issues. I spent 20 years at the General Electric Company as a certified Master Black Belt in Lean Six Sigma and used this quality paradigm many times for addressing data quality. (For more on this process, check out the American Society for Quality.) The important thing to remember is that one does not have to be a Black Belt or Master Black Belt in Lean Six Sigma (the highest-trained members of the discipline) to put some of these principles in place. Start small, one stage at a time, and build your data quality program.

Now for some practical examples of how to employ the Lean Six Sigma DMAIC paradigm to manage data quality issues. DMAIC is an acronym for define, measure, analyze, improve, and control. It is a phased approach that can be sequential but is often more parallel in practice. The first step, the Define Phase, is to define the quality goals for gen AI. Who will participate in the project? What are the limits of the project?
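To make the dimensions concrete before moving into the DMAIC phases, here is a minimal sketch, not taken from the article, of how a few of them might be scored on a small customer table using pandas; the table, its columns, the email format rule, and the 180-day timeliness window are all invented for illustration.

```python
# Minimal sketch of scoring a few data quality dimensions with pandas.
# The "customers" table, its columns, and all thresholds are hypothetical.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],  # 102 is duplicated on purpose
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "last_updated": pd.to_datetime(
        ["2024-03-01", "2023-01-15", "2024-02-20", "2024-03-28"]),
})

as_of = pd.Timestamp("2024-04-01")
email_pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

report = {
    # Completeness: share of non-missing email values.
    "completeness_email": customers["email"].notna().mean(),
    # Uniqueness: share of distinct customer_id values.
    "uniqueness_customer_id": customers["customer_id"].nunique() / len(customers),
    # Validity: share of present emails matching the expected format.
    "validity_email": customers["email"].dropna().str.match(email_pattern).mean(),
    # Timeliness: share of rows updated within the last 180 days.
    "timeliness_180d": (as_of - customers["last_updated"]).dt.days.le(180).mean(),
}

for dimension, score in report.items():
    print(f"{dimension}: {score:.2f}")
```

In practice, each score would be judged against thresholds agreed for the specific gen AI use case, since what counts as good enough differs by application.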


The second phase, the Measure Phase, is where the actual extent of the data quality issues is measured. What metrics will we use to measure data quality within the framework? At this phase, one would also benchmark the data quality against certain standards. Then, during the Analyze Phase, one would find the root causes of the situations creating the data quality issues. Tools such as fishbone diagrams and Pareto analyses can be very helpful in this context. Next, the Improve Phase is where improvement plans are implemented and tested. There is a myriad of commercial tools available to fix data quality issues, and this is the phase where they would be evaluated. Finally, during the Control Phase, tools like FMEA (failure mode and effects analysis) are used to make sure that the quality improvements are continuously monitored and do not slip back to the original level of defects, and that the data owners know what to measure and what to do if there is too much drift back toward lower quality. Control charts are a really useful tool during this phase; a rough sketch of such a check appears after the conclusion below. The Control Phase shouldn't necessarily signal the end of the DMAIC cycle: leaders can immediately propose another cycle to drive continuous improvement and focus on some of the items on the Pareto chart that were deprioritized in the first round.

In conclusion, if firms are planning to implement gen AI applications for marketing and other functions, they should start with their data quality. I hope this gives firms a useful framework and process for managing the data quality of their gen AI applications for marketing.
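Here is that sketch: a minimal, illustrative 3-sigma control check on a single quality metric. The baseline scores and batch scores are invented, and this is only one possible way to back the Control Phase with a control chart, not a prescription from the article.

```python
# Illustrative 3-sigma control check on a data quality metric (Control Phase idea).
# Baseline scores and the new batch scores are invented for this example.
from statistics import mean, stdev

baseline = [0.97, 0.96, 0.98, 0.97, 0.95, 0.97, 0.96]  # completeness of past batches

center = mean(baseline)
sigma = stdev(baseline)
lower, upper = center - 3 * sigma, center + 3 * sigma  # classic 3-sigma limits

def check_batch(score: float) -> str:
    """Flag a batch whose quality score drifts outside the control limits."""
    if lower <= score <= upper:
        return f"in control: {score:.3f} within [{lower:.3f}, {upper:.3f}]"
    return f"out of control: {score:.3f} outside [{lower:.3f}, {upper:.3f}] - investigate"

print(check_batch(0.96))  # expected: in control
print(check_batch(0.88))  # expected: out of control (quality has drifted)
```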

References: Databricks and MIT (2023). CIO vision 2025: Bridging the gap between BI and AI. Databricks.

Davenport, T. H., & Tiwari, P. (2024). Is Your Company's Data Ready for Generative AI? Harvard Business Review.

Davenport, T., Bean, R., & Wang, R. (2024). CDO Agenda 2024: Navigating Data and Generative AI Frontiers. CDO Agenda.

The views and opinions expressed are solely those of the contributor and do not necessarily reflect the official position of the ANA or imply endorsement from the ANA.

David Fogarty, PhD, MBA is the SVP of Data Excellence and Privacy at the Association of National Advertisers (ANA). Prior to the ANA, he was a seasoned Fortune 100 chief data and analytics executive and an adjunct professor at Columbia University, Cornell University and New York University. David is also a bestselling author with over 50 published research papers and books.


Creative Australia's report into the music festival sector shows how many of the country's big events are struggling

A crowd of people face a brightly lit stage with the words 'Spilt Milk' atop it.

More than one-third of Australian music festivals are losing money as they face skyrocketing operational costs and dwindling younger audiences, according to a new report from Creative Australia.

Billed as the first widespread report of its kind, Soundcheck: Insights into Australia's music festival sector  delves into the cultural, social and economic impacts of Australian music festivals, and paints a clear picture of the landscape as it stood in the 2022-23 financial year.

Spanning the 535 music festivals held nationwide in that time — that's almost 1.5 festivals per day — the 116-page report reflects the scope, scale and diversity of the Australian music festival landscape.

Given the highly publicised recent struggles festivals have faced, it's timely research that looks to help Australian audiences and funding bodies understand the challenges these events face.

Flume performs to a massive crowd at Splendour in the Grass

How much money do music festivals make?

Just 56 per cent of music festivals reported a profit in the 2022-23 financial year, with more than one third of festivals reporting a deficit and eight per cent breaking even.

The median cost to stage a music festival is $3.3 million, and those events that do make a profit pull in a median of $731,569 per event.

When looking at the mean of the same data, though, that figure skyrockets to $2.6 million, confirming that some festivals are in a much better financial position and stand to gain far more than their contemporaries.

For instance, the highest profit for a festival surveyed for this data was $47.4 million, while the smallest profit was just $20,000.
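The gap between the median and the mean is the familiar signature of a skewed distribution: a handful of very large profits drag the mean well above the typical event. A toy calculation with invented figures (not the report's data) makes the point.

```python
# Toy illustration of how a few very large profits pull the mean above the median.
# These profit figures are invented; they are not the Creative Australia data.
from statistics import mean, median

profits = [20_000, 150_000, 400_000, 730_000, 900_000, 1_200_000, 47_400_000]

print(f"median profit: ${median(profits):,.0f}")  # $730,000
print(f"mean profit:   ${mean(profits):,.0f}")    # about $7,257,143, dragged up by one outlier
```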

What are the biggest challenges festivals face?

Rising operational costs had the most severe impact on almost half of festival organisers (47 per cent) — overheads like artist fees, production, suppliers, freight, transportation and insurance.

Other major barriers included a lack of funding and grants, as well as extreme weather events. Almost one third of festivals said skyrocketing insurance costs were a major challenge.

Australian live music venues' public liability insurance policies increased 10-fold in the past financial year, climbing from $20,000 per year to as much as $120,000.

One festival organiser noted that necessary event cancellation insurance costs had "pretty much doubled" since the COVID-19 pandemic.

"The excess used to be like a standard commercial policy, which is like $4,000 or $5,000. Our excess for this year is $250,000."

Another organiser said navigating insurance paperwork had become an "absolute minefield" after making the tough call to cancel their festival.

"We had to wait until the morning of the show to make the final determination to cancel, otherwise there's the possibility that the insurance company could have said we could have worked out other alternatives.

"You're left with this real balancing act of, do you let your patrons know … who may have been booking accommodation, may have been getting drivers, getting babysitters, outlaying some money to attend the festival?"


The rising costs of securing police and security was another sore point. More than a quarter of festivals noted the challenges of navigating police and security requirements, and the difficulties of dealing with different government and council regulations across different states and jurisdictions.

"There's not enough consistency," said one logistics/operations worker from New South Wales.

"Whether you do an event in the metro area, or you do an event in Newcastle, or you do an event down the South Coast, or whatever the case may be, all these authorities have different expectations in regards to what they want from security and from the event. That makes it hard because some of the implications are more costs for the event promoter."

By contrast, most festivals found health, medical and liquor licensing requirements the least burdensome of the regulatory challenges, with only around seven per cent reporting that these elements had an impact.

Ongoing festival cancellations have created a vicious cycle where the more events pull the plug or lose headliners last-minute, the more hesitation it creates in the wider market — from both the industry and from punters holding off on purchasing tickets.

Who's buying festival tickets?

While music festival revenue comes from various avenues — from corporate sponsorship to hospitality services to merchandise and more — it's ticket sales that determine the ultimate feasibility of a music festival.

There is some good news on that front, with average ticket sales in 2022-23 higher than pre-COVID levels.

The average festival sold 8,116 tickets in 2018-19, which ballooned to 9,506 for 2022-23, indicating that the industry is slowly recovering from the decimating impacts of the COVID-19 pandemic.

The research suggests that young people are no longer the main consumer of music festivals, nor are they attending as much as they have in the past.

The 18-24-year-old group is no longer the biggest ticket-buying demographic, with people in their mid-to-late twenties overtaking them. The younger crowd slumped from 41 per cent of all ticket buyers in 2018/19 to 27 per cent in 2022/23. 

Genre-specific events faring better

The report arrives amid a feast or famine crisis for the Australian music festival scene.

There's been a growing list of festival cancellations, from major events like Splendour In The Grass, Groovin The Moo and Mona Foma, to newer players like This That, Summerground, Vintage Vibes, Tent Pole, Valleyways, Costal Jam and more.

Amid those reports, however, genre-focused events — such as Good Things, Knotfest, Listen Out, CMC Rocks — are still proving popular, and summer staples — like Laneway Festival, Beyond The Valley and Field Day — are adapting to current challenges with great success.

The vast majority of Australian festivals predominantly feature homegrown line-ups, with four out of five acts being Australian. The most popular genre offering was electronic music, accounting for almost a quarter of Australian festivals. Other popular genres included rock (21 per cent), country (19 per cent) and indie (17 per cent).

Georgie McClean of Creative Australia says she hopes this research will serve both as a tool for those in the industry and as a way to exhibit the contributions music festivals make to Australia's creative sector.

"We hope this report will help us to better understand the role and contribution of festivals within the broader creative industries as they face multiple challenges.

"To inform the future work of Music Australia, we will be undertaking further research into how Australians discover, engage with and consume music, in order to better understand the broader ecosystem that underpins live music including festivals."



Solar eclipse maps show 2024 totality path, peak times and how much of the eclipse people could see across the U.S.

By Aliza Chasan

Updated on: April 9, 2024 / 5:00 AM EDT / CBS News

A total solar eclipse  crossed North America Monday with parts of 15 U.S. states within the path of totality. Maps show  where and when astronomy fans could see the big event  as skies darkened in the middle of the day Monday, April 8.

The total eclipse first appeared along Mexico's Pacific Coast at around 11:07 a.m. PDT, then traveled across a swath of the U.S., from Texas to Maine, and into Canada.

About 31.6 million people live in the path of totality , the area where the moon fully blocked out the sun , according to NASA. The path ranged between 108 and 122 miles wide. An additional 150 million people live within 200 miles of the path of totality.

Solar eclipse path of totality map for 2024

United States map showing the path of the 2024 solar eclipse and the eclipse duration in specific regions.

The total solar eclipse started over the Pacific Ocean, and the first location in continental North America that experienced totality was Mexico's Pacific Coast, around 11:07 a.m. PDT, according to NASA. From there, the path continued into Texas, crossing more than a dozen states before the eclipse entered Canada in southern Ontario. The eclipse exited continental North America at around 5:16 p.m. NDT from Newfoundland, Canada.

The path of totality included portions of the following states:

  • Texas
  • Oklahoma
  • Arkansas
  • Missouri
  • Illinois
  • Kentucky
  • Indiana
  • Ohio
  • Pennsylvania
  • New York
  • Vermont
  • New Hampshire
  • Maine

Small parts of Tennessee and Michigan also experienced the total solar eclipse.

Several major cities across the U.S. were included in the eclipse's path of totality, while many others saw a partial eclipse. These were some of the best major cities for eclipse viewing — though the weather was a factor :

  • San Antonio, Texas (partially under the path)
  • Austin, Texas
  • Waco, Texas
  • Dallas, Texas
  • Little Rock, Arkansas
  • Indianapolis, Indiana
  • Dayton, Ohio
  • Cleveland, Ohio
  • Buffalo, New York
  • Rochester, New York
  • Syracuse, New York
  • Burlington, Vermont

Map of when the solar eclipse reached totality across its path

The eclipse began in the U.S. as a partial eclipse beginning at 12:06 p.m. CDT near Eagle Pass, Texas, before progressing to totality by about 1:27 p.m. CDT and then moving along its path to the northeast over the following few hours.


NASA shared times for several cities in the path of totality across the U.S. People could have also  checked their ZIP code on NASA's map  to see when the eclipse was to reach them if they were on, or near, the path of totality — or if they saw a partial eclipse instead.

How much of the eclipse did people see if they live outside the totality path?

While the April 8 eclipse covered a wide swath of the U.S., outside the path of totality observers may have spotted a partial eclipse, where the moon covers some, but not all, of the sun, according to NASA. The closer they were to the path of totality, the larger the portion of the sun that was hidden.

NASA allowed viewers to input a ZIP code and see how much of the sun was to be covered in their locations.

Could there be cloud cover during the solar eclipse?

Some areas along the path of totality had a higher likelihood of cloud cover that could interfere with viewing the eclipse. Here is a map showing the historical trends in cloud cover this time of year. 

You could have checked the latest forecast for your location with our partners at The Weather Channel .

United States map showing the percent of cloud cover in various regions of the eclipse path on April 8. The lakeshore region will be primarily affected.

Where did the solar eclipse reach totality for the longest?

Eclipse viewers near Torreón, Mexico, got to experience totality for the longest. Totality there lasted 4 minutes, 28 seconds, according to NASA. 

Most places along the centerline of the path of totality saw a totality duration of between 3.5 and 4 minutes, according to NASA. Some places in the U.S. came close to the maximum; Kerrville, Texas, had a totality duration of 4 minutes, 24 seconds.

What is the path of totality for the 2044 solar eclipse?

The next total solar eclipse that will be visible from the contiguous U.S. will be on Aug. 23, 2044.

Astronomy fans in the U.S. will have far fewer opportunities to see the 2044 eclipse than they had on April 8. NASA has not yet made maps available for the 2044 eclipse but, according to The Planetary Society , the path of totality will only touch three states.

The 2044 eclipse will start in Greenland, pass over Canada and end as the sun sets in Montana, North Dakota and South Dakota, according to the Planetary Society.

Map showing the path of the 2044 total solar eclipse from Greenland, Canada and parts of the United States.

Aliza Chasan is a digital producer at 60 Minutes and CBSNews.com. She has previously written for outlets including PIX11 News, The New York Daily News, Inside Edition and DNAinfo. Aliza covers trending news, often focusing on crime and politics.



About half of Americans say public K-12 education is going in the wrong direction

School buses arrive at an elementary school in Arlington, Virginia. (Chen Mengtong/China News Service via Getty Images)

About half of U.S. adults (51%) say the country’s public K-12 education system is generally going in the wrong direction. A far smaller share (16%) say it’s going in the right direction, and about a third (32%) are not sure, according to a Pew Research Center survey conducted in November 2023.

Pew Research Center conducted this analysis to understand how Americans view the K-12 public education system. We surveyed 5,029 U.S. adults from Nov. 9 to Nov. 16, 2023.

The survey was conducted by Ipsos for Pew Research Center on the Ipsos KnowledgePanel Omnibus. The KnowledgePanel is a probability-based web panel recruited primarily through national, random sampling of residential addresses. The survey is weighted by gender, age, race, ethnicity, education, income and other categories.

Here are the questions used for this analysis , along with responses, and the survey methodology .

A diverging bar chart showing that only 16% of Americans say public K-12 education is going in the right direction.

A majority of those who say it’s headed in the wrong direction say a major reason is that schools are not spending enough time on core academic subjects.

These findings come amid debates about what is taught in schools , as well as concerns about school budget cuts and students falling behind academically.

Related: Race and LGBTQ Issues in K-12 Schools

Republicans are more likely than Democrats to say the public K-12 education system is going in the wrong direction. About two-thirds of Republicans and Republican-leaning independents (65%) say this, compared with 40% of Democrats and Democratic leaners. In turn, 23% of Democrats and 10% of Republicans say it’s headed in the right direction.

Among Republicans, conservatives are the most likely to say public education is headed in the wrong direction: 75% say this, compared with 52% of moderate or liberal Republicans. There are no significant differences among Democrats by ideology.

Similar shares of K-12 parents and adults who don’t have a child in K-12 schools say the system is going in the wrong direction.

A separate Center survey of public K-12 teachers found that 82% think the overall state of public K-12 education has gotten worse in the past five years. And many teachers are pessimistic about the future.

Related: What’s It Like To Be A Teacher in America Today?

Why do Americans think public K-12 education is going in the wrong direction?

We asked adults who say the public education system is going in the wrong direction why that might be. About half or more say the following are major reasons:

  • Schools not spending enough time on core academic subjects, like reading, math, science and social studies (69%)
  • Teachers bringing their personal political and social views into the classroom (54%)
  • Schools not having the funding and resources they need (52%)

About a quarter (26%) say a major reason is that parents have too much influence in decisions about what schools are teaching.

How views vary by party

A dot plot showing that Democrats and Republicans who say public education is going in the wrong direction give different explanations.

Americans in each party point to different reasons why public education is headed in the wrong direction.

Republicans are more likely than Democrats to say major reasons are:

  • A lack of focus on core academic subjects (79% vs. 55%)
  • Teachers bringing their personal views into the classroom (76% vs. 23%)

A bar chart showing that views on why public education is headed in the wrong direction vary by political ideology.

In turn, Democrats are more likely than Republicans to point to:

  • Insufficient school funding and resources (78% vs. 33%)
  • Parents having too much say in what schools are teaching (46% vs. 13%)

Views also vary within each party by ideology.

Among Republicans, conservatives are particularly likely to cite a lack of focus on core academic subjects and teachers bringing their personal views into the classroom.

Among Democrats, liberals are especially likely to cite schools lacking resources and parents having too much say in the curriculum.

Note: Here are the questions used for this analysis , along with responses, and the survey methodology .




What Researchers Discovered When They Sent 80,000 Fake Résumés to U.S. Jobs

Some companies discriminated against Black applicants much more than others, and H.R. practices made a big difference.


By Claire Cain Miller and Josh Katz

A group of economists recently performed an experiment on around 100 of the largest companies in the country, applying for jobs using made-up résumés with equivalent qualifications but different personal characteristics. They changed applicants’ names to suggest that they were white or Black, and male or female — Latisha or Amy, Lamar or Adam.

On Monday, they released the names of the companies . On average, they found, employers contacted the presumed white applicants 9.5 percent more often than the presumed Black applicants.

Yet this practice varied significantly by firm and industry. One-fifth of the companies — many of them retailers or car dealers — were responsible for nearly half of the gap in callbacks to white and Black applicants.

Two companies favored white applicants over Black applicants significantly more than others. They were AutoNation, a used car retailer, which contacted presumed white applicants 43 percent more often, and Genuine Parts Company, which sells auto parts including under the NAPA brand, and called presumed white candidates 33 percent more often.

In a statement, Heather Ross, a spokeswoman for Genuine Parts, said, “We are always evaluating our practices to ensure inclusivity and break down barriers, and we will continue to do so.” AutoNation did not respond to a request for comment.

Companies With the Largest and Smallest Racial Contact Gaps

Of the 97 companies in the experiment, two stood out as contacting presumed white job applicants significantly more often than presumed Black ones. At 14 companies, there was little or no difference in how often they called back the presumed white or Black applicants.

Source: Patrick Kline, Evan K. Rose and Christopher R. Walters

Known as an audit study , the experiment was the largest of its kind in the United States: The researchers sent 80,000 résumés to 10,000 jobs from 2019 to 2021. The results demonstrate how entrenched employment discrimination is in parts of the U.S. labor market — and the extent to which Black workers start behind in certain industries.

“I am not in the least bit surprised,” said Daiquiri Steele, an assistant professor at the University of Alabama School of Law who previously worked for the Department of Labor on employment discrimination. “If you’re having trouble breaking in, the biggest issue is the ripple effect it has. It affects your wages and the economy of your community going forward.”

Some companies showed no difference in how they treated applications from people assumed to be white or Black. Their human resources practices — and one policy in particular (more on that later) — offer guidance for how companies can avoid biased decisions in the hiring process.

A lack of racial bias was more common in certain industries: food stores, including Kroger; food products, including Mondelez; freight and transport, including FedEx and Ryder; and wholesale, including Sysco and McLane Company.

“We want to bring people’s attention not only to the fact that racism is real, sexism is real, some are discriminating, but also that it’s possible to do better, and there’s something to be learned from those that have been doing a good job,” said Patrick Kline, an economist at the University of California, Berkeley, who conducted the study with Evan K. Rose at the University of Chicago and Christopher R. Walters at Berkeley.

The researchers first published details of their experiment in 2021, but without naming the companies. The new paper, which is set to run in the American Economic Review, names the companies and explains the methodology developed to group them by their performance, while accounting for statistical noise.

Sample Résumés From the Experiment

Fictitious résumés sent to large U.S. companies revealed a preference, on average, for candidates whose names suggested that they were white.


To assign names, the researchers started with a prior list that had been assembled using Massachusetts birth certificates from 1974 to 1979. They then supplemented this list with names found in a database of speeding tickets issued in North Carolina between 2006 and 2018, classifying a name as “distinctive” if more than 90 percent of people with that name were of a particular race.

The study includes 97 firms. The jobs the researchers applied to were entry level, not requiring a college degree or substantial work experience. In addition to race and gender, the researchers tested other characteristics protected by law , like age and sexual orientation.

They sent up to 1,000 applications to each company, applying for as many as 125 jobs per company in locations nationwide, to try to uncover patterns in companies’ operations versus isolated instances. Then they tracked whether the employer contacted the applicant within 30 days.
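The study's headline comparison is, at bottom, a contact-rate calculation by company and presumed race. A rough sketch of that arithmetic on invented toy records (not the researchers' data or code) might look like this.

```python
# Sketch of the basic contact-gap arithmetic behind an audit study.
# The application records below are invented for illustration only.
import pandas as pd

applications = pd.DataFrame({
    "company":       ["A", "A", "A", "A", "B", "B", "B", "B"],
    "presumed_race": ["white", "Black", "white", "Black"] * 2,
    "contacted":     [1, 0, 1, 1, 1, 1, 0, 0],  # employer responded within 30 days
})

# Contact rate per company and presumed race.
rates = (applications
         .groupby(["company", "presumed_race"])["contacted"]
         .mean()
         .unstack())

# Relative gap: how much more often presumed-white applicants were contacted.
rates["relative_gap"] = (rates["white"] - rates["Black"]) / rates["Black"]
print(rates)
```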

A bias against Black names

Companies requiring lots of interaction with customers, like sales and retail, particularly in the auto sector, were most likely to show a preference for applicants presumed to be white. This was true even when applying for positions at those firms that didn't involve customer interaction, suggesting that discriminatory practices were baked into corporate culture or H.R. practices, the researchers said.

Still, there were exceptions — some of the companies exhibiting the least bias were retailers, like Lowe’s and Target.

The study may underestimate the rate of discrimination against Black applicants in the labor market as a whole because it tested large companies, which tend to discriminate less, said Lincoln Quillian, a sociologist at Northwestern who analyzes audit studies. It did not include names intended to represent Latino or Asian American applicants, but other research suggests that they are also contacted less than white applicants, though they face less discrimination than Black applicants.

The experiment ended in 2021, and some of the companies involved might have changed their practices since. Still, a review of all available audit studies found that discrimination against Black applicants had not changed in three decades. After the Black Lives Matter protests in 2020, such discrimination was found to have disappeared among certain employers, but the researchers behind that study said the effect was most likely short-lived.

Gender, age and L.G.B.T.Q. status

On average, companies did not treat male and female applicants differently. This aligns with other research showing that gender discrimination against women is rare in entry-level jobs, and starts later in careers.

However, when companies did favor men (especially in manufacturing) or women (mostly at apparel stores), the biases were much larger than for race. Builders FirstSource contacted presumed male applicants more than twice as often as female ones. Ascena, which owns brands like Ann Taylor, contacted women 66 percent more than men.

Neither company responded to requests for comment.

The consequences of being female differed by race. The differences were small, but being female was a slight benefit for white applicants, and a slight penalty for Black applicants.

The researchers also tested several other characteristics protected by law, with a smaller number of résumés. They found there was a small penalty for being over 40.

Overall, they found no penalty for using nonbinary pronouns. Being gay, as indicated by including membership in an L.G.B.T.Q. club on the résumé, resulted in a slight penalty for white applicants but benefited Black applicants; although the effect was small, the racial penalty disappeared when this was on their résumés.

Under the Civil Rights Act of 1964, discrimination is illegal even if it’s unintentional . Yet in the real world, it is difficult for job applicants to know why they did not hear back from a company.

“These practices are particularly challenging to address because applicants often do not know whether they are being discriminated against in the hiring process,” Brandalyn Bickner, a spokeswoman for the Equal Employment Opportunity Commission, said in a statement. (It has seen the data and spoken with the researchers, though it could not use an academic study as the basis for an investigation, she said.)

What companies can do to reduce discrimination

Several common measures — like employing a chief diversity officer, offering diversity training or having a diverse board — were not correlated with decreased discrimination in entry-level hiring, the researchers found.

But one thing strongly predicted less discrimination: a centralized H.R. operation.

The researchers recorded the voice mail messages that the fake applicants received. When a company’s calls came from fewer individual phone numbers, suggesting that they were originating from a central office, there tended to be less bias . When they came from individual hiring managers at local stores or warehouses, there was more. These messages often sounded frantic and informal, asking if an applicant could start the next day, for example.

“That’s when implicit biases kick in,” Professor Kline said. A more formalized hiring process helps overcome this, he said: “Just thinking about things, which steps to take, having to run something by someone for approval, can be quite important in mitigating bias.”

At Sysco, a wholesale restaurant food distributor, which showed no racial bias in the study, a centralized recruitment team reviews résumés and decides whom to call. “Consistency in how we review candidates, with a focus on the requirements of the position, is key,” said Ron Phillips, Sysco’s chief human resources officer. “It lessens the opportunity for personal viewpoints to rise in the process.”

Another important factor is diversity among the people hiring, said Paula Hubbard, the chief human resources officer at McLane Company. It procures, stores and delivers products for large chains like Walmart, and showed no racial bias in the study. Around 40 percent of the company’s recruiters are people of color, and 60 percent are women.

Diversifying the pool of people who apply also helps, H.R. officials said. McLane goes to events for women in trucking and puts up billboards in Spanish.

So does hiring based on skills, versus degrees . While McLane used to require a college degree for many roles, it changed that practice after determining that specific skills mattered more for warehousing or driving jobs. “We now do that for all our jobs: Is there truly a degree required?” Ms. Hubbard said. “Why? Does it make sense? Is experience enough?”

Hilton, another company that showed no racial bias in the study, also stopped requiring degrees for many jobs, in 2018.

Another factor associated with less bias in hiring, the new study found, was more regulatory scrutiny — like at federal contractors, or companies with more Labor Department citations.

Finally, more profitable companies were less biased, in line with a long-held economics theory by the Nobel Prize winner Gary Becker that discrimination is bad for business. Economists said that could be because the more profitable companies benefit from a more diverse set of employees. Or it could be an indication that they had more efficient business processes, in H.R. and elsewhere.

Claire Cain Miller writes about gender, families and the future of work for The Upshot. She joined The Times in 2008 and was part of a team that won a Pulitzer Prize in 2018 for public service for reporting on workplace sexual harassment issues. More about Claire Cain Miller

Josh Katz is a graphics editor for The Upshot, where he covers a range of topics involving politics, policy and culture. He is the author of “Speaking American: How Y’all, Youse, and You Guys Talk,” a visual exploration of American regional dialects. More about Josh Katz



COMMENTS

  1. Handling big data: research challenges and future directions

    Today, an enormous amount of data is being continuously generated in all walks of life by all kinds of devices and systems every day. A significant portion of such data is being captured, stored, aggregated and analyzed in a systematic way without losing its "4V" (i.e., volume, velocity, variety, and veracity) characteristics. We review major drivers of big data today as well the recent ...

  2. Big Data: Challenges and Future Research Directions

    Industrial big data can benefit from past experiences, but challenges lie ahead. Figure 1. The big data movement stems from the availability of data, high-power computer technology, and analytics to handle data characterized by the four Vs — volume, variety, veracity, and velocity. Like any new, promising field, big data must be viewed in ...

  3. Moving back to the future of big data-driven research: reflecting on

    These developments and directions of genetic-based research and big data go far beyond the struggle of a discipline, namely sociology, with a paradigm shift in empirical research.

  4. Bibliometric mining of research directions and trends for big data

    In this paper a program and methodology for bibliometric mining of research trends and directions is presented. The method is applied to the research area Big Data for the time period 2012 to 2022, using the Scopus database. It turns out that the 10 most important research directions in Big Data are Machine learning, Deep learning and neural networks, Internet of things, Data mining, Cloud ...

  5. PDF Handling big data: research challenges and future directions

    Then, we present a classification of some of the most important challenges when handling big data. Based on this classification, we recommend solutions that could address the identified challenges, and in addition we highlight cross-disciplinary research directions that need further investigation in the future.

  6. Challenges and Future Directions of Big Data and Artificial

    11 Program of Learning Sciences, National Taiwan Normal University, Taipei, Taiwan. We discuss the new challenges and directions facing the use of big data and artificial intelligence (AI) in education research, policy-making, and industry. In recent years, applications of big data and AI in education have made significant headways.

  7. A review of big data and medical research

    In this descriptive review, we highlight the roles of big data, the changing research paradigm, and easy access to research participation via the Internet fueled by the need for quick answers. Universally, data volume has increased, with the collection rate doubling every 40 months, ever since the 1980s. 4 The big data age, starting in 2002 ...

  8. A bibliometric approach to tracking big data research trends

    The explosive growing number of data from mobile devices, social media, Internet of Things and other applications has highlighted the emergence of big data. This paper aims to determine the worldwide research trends on the field of big data and its most relevant research areas. A bibliometric approach was performed to analyse a total of 6572 papers including 28 highly cited papers and only ...

  9. Future research directions for big data in psychology.

    Also, the conclusion of this chapter offers a list of future research directions (immodestly abbreviated FReDs) that summarizes how psychology can be more strategically and actively involved in the big data research arena, alongside other disciplines already strongly engaged in this arena (e.g., applied statistics, computer science). Most FReDs ...

  10. Big Data Research

    About the journal. The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in dealing with big data, as ...

  11. Big Data computing and clouds: Trends and future directions

    Abstract. This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications. It revolves around four important areas of analytics and Big Data, namely (i) data management and supporting architectures; (ii) model development and scoring; (iii) visualisation and user interaction; and (iv) business ...

  12. PDF Big Data Quality: A systematic literature review and future research

    researchers who are interested in the big data quality subject. Through careful study of the selected papers, we propose a research tree that divides the works based on the type of processing, task, and technique. Further, the challenges and research directions are discussed. Keywords Big data, data quality, evaluation, cleaning, outlier ...

  13. The Library Big Data Research: Status and Directions

    Affelt, A. 2015. The accidental data scientist: big data applications and opportunities for librarians and information professionals. Medford, New Jersey. Google Scholar; Armour, F. 2012. Introduction to big data, presentation at the symposium Big Data and Business Analytics: Defining a Framework.

  14. Big data in education: a state of the art, limitations, and future

    Big data is an essential aspect of innovation which has recently gained major attention from both academics and practitioners. Considering the importance of the education sector, the current tendency is moving towards examining the role of big data in this sector. So far, many studies have been conducted to comprehend the application of big data in different fields for various purposes.

  15. Future research directions for big data in psychology

to offer a list of future research directions (immodestly abbreviated FReDs) that summarize how psychology can be more strategically and actively involved in the big data research arena ...

  16. PDF Big Data Directions in Entrepreneurship Research.indd

Johnson & Shneiderman, 1991. BIG DATA DIRECTIONS IN ENTREPRENEURSHIP RESEARCH: RESEARCHER VIEWPOINTS. The relative size of each technology group as well as the prominence of a subgroup. Languages and framework were the most common application data technologies, while API tools were the most significant utilities.

  17. Intellectual landscape and emerging trends of big data research in

    The superiority of big data has led to ample research on big data analytics in the hospitality and tourism context. It is thus important to capture the overall intellectual landscape by reviewing extant relevant literature. ... This study extends previous scientometric works in this research direction from three perspectives. Firstly, this ...

  18. (PDF) Big Data Analytics in Supply Chain Management: A Systematic

Big Data Analytics in Supply Chain Management: A Systematic Literature Review and Research Directions. In Lee and George Mangalaraj. School of Computer Sciences, College of Business and ...

  19. Scientific Research and Big Data

    Scientific Research and Big Data. First published Fri May 29, 2020. Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse ...

  20. Handling big data: research challenges and future directions

    A classification of some of the most important challenges when handling big data is presented and solutions that could address the identified challenges are recommended. Today, an enormous amount of data is being continuously generated in all walks of life by all kinds of devices and systems every day. A significant portion of such data is being captured, stored, aggregated and analyzed in a ...

  21. Research Directions for Engineering Big Data Analytics Software

    Many software startups and research and development efforts are actively trying to harness the power of big data and create software with the potential to improve almost every aspect of human life. As these efforts continue to increase, full consideration needs to be given to the engineering aspects of big data software. Since these systems exist to make predictions on complex and continuous ...

  22. The Library Big Data Research: Status and Directions

    Request PDF | The Library Big Data Research: Status and Directions | Libraries are widely used by government, universities, research institutes, and the public since they are storing and managing ...

  23. Big Data Directions in Entrepreneurship Research: Researcher Viewpoints

    GoDaddy, which is the world's largest registrar of domain names, has collaborated with researchers from University of Iowa and Arizona State University, sharing de-identified data on the 20 million active U.S. domain name websites that have traffic and services attached to them. At least three-quarters of these websites are commercial.

  24. Data Best Practices Workshop

    This workshop introduces the basics of large language models and gives attendees hands-on experience in training and employing them. Thursday, April 4, 1:00pm - 4:00pm. Application is required. Apply here. This workshop is open to the Stanford research community. About the Data Best Practices Workshop Series This workshop series aims to ...

  25. Developing a Data Quality Framework for Gen AI

    Given the need for high quality data in order to power gen AI applications, it would be now worthwhile to dip deeper and create a comprehensive data quality framework to identify and address the issues of data quality related to the implementation of AI applications. First, a general word of caution to all managers of gen AI projects.

  26. Total solar eclipse: Where and when it was most visible

    In the US, an estimated 32 million people live within the path of totality and a total solar eclipse was visible for those in Texas, Oklahoma, Arkansas, Missouri, Illinois, Kentucky, Indiana, Ohio ...

  27. Creative Australia's report into the music festival sector shows how

    In short: A new report from arts investment and advisory body Creative Australia has delved into the current state of play of Australia's music festivals. It found that only 56 per cent of music ...

  28. Solar eclipse maps show 2024 totality path, peak times and how much of

    Total solar eclipse cuts path across U.S. 03:57 A total solar eclipse crossed North America Monday with parts of 15 U.S. states within the path of totality. Maps show where and when astronomy fans ...

  29. About half of Americans say public K-12 education ...

    About half of U.S. adults (51%) say the country's public K-12 education system is generally going in the wrong direction. A far smaller share (16%) say it's going in the right direction, and about a third (32%) are not sure, according to a Pew Research Center survey conducted in November 2023.

  30. What Researchers Discovered When They Sent 80,000 Fake Résumés to U.S

    Known as an audit study, the experiment was the largest of its kind in the United States: The researchers sent 80,000 résumés to 10,000 jobs from 2019 to 2021. The results demonstrate how ...