Review Article
Challenges and Future Directions of Big Data and Artificial Intelligence in Education
- 1 Institute for Research Excellence in Learning Sciences, National Taiwan Normal University, Taipei, Taiwan
- 2 National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan
- 3 School of Dentistry, Faculty of Medicine & Dentistry, University of Alberta, Edmonton, AB, Canada
- 4 Graduate School of Education, Rutgers – The State University of New Jersey, New Brunswick, NJ, United States
- 5 Apprendis, LLC, Berlin, MA, United States
- 6 Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Central University, Taoyuan City, Taiwan
- 7 Graduate School of Informatics, Kyoto University, Kyoto, Japan
- 8 Department of Electrical Engineering, College of Technology and Engineering, National Taiwan Normal University, Taipei, Taiwan
- 9 Centro de Tecnologia, Universidade Federal de Santa Maria, Santa Maria, Brazil
- 10 Department of Chinese and Bilingual Studies, Faculty of Humanities, The Hong Kong Polytechnic University, Kowloon, Hong Kong
- 11 Program of Learning Sciences, National Taiwan Normal University, Taipei, Taiwan
We discuss the new challenges and directions facing the use of big data and artificial intelligence (AI) in education research, policy-making, and industry. In recent years, applications of big data and AI in education have made significant headway. This highlights a novel trend in leading-edge educational research. The convenience and embeddedness of data collection within educational technologies, paired with computational techniques, have made the analysis of big data a reality. We are moving beyond proof-of-concept demonstrations and applications of techniques, and are beginning to see substantial adoption in many areas of education. The key research trends in the domains of big data and AI are associated with assessment, individualized learning, and precision education. Model-driven data analytics approaches will grow quickly to guide the development, interpretation, and validation of algorithms. However, conclusions from educational analytics should, of course, be applied with caution. At the education policy level, governments should be devoted to supporting lifelong learning, offering teacher education programs, and protecting personal data. With regard to the education industry, reciprocal and mutually beneficial relationships should be developed in order to enhance academia-industry collaboration. Furthermore, it is important to ensure that technologies are guided by relevant theoretical frameworks and are empirically tested. Lastly, we advocate an in-depth dialog between supporters of “cold” technology and “warm” humanity, one that can lead to greater understanding among teachers and students of how technology, and specifically the big data explosion and AI revolution, brings new opportunities (and challenges) that can best be leveraged for pedagogical practices and learning.
Introduction
The purpose of this position paper is to present the current status, opportunities, and challenges of big data and AI in education. The work originated from the opinions and panel discussion minutes of an international conference on big data and AI in education ( The International Learning Sciences Forum, 2019 ), where prominent researchers and experts from disciplines such as education, psychology, data science, AI, and cognitive neuroscience exchanged their knowledge and ideas. This article is organized as follows: we start with an overview of recent progress of big data and AI in education. Then we present the major challenges and emerging trends. Finally, based on our discussions of big data and AI in education, conclusions and future directions are suggested.
Rapid advancements in big data and artificial intelligence (AI) technologies have had a profound impact on all areas of human society, including the economy, politics, science, and education. Thanks in large part to these developments, we are able to continue many of our social activities under the COVID-19 pandemic. Digital tools, platforms, applications, and the communications among people have generated vast amounts of data (‘big data’) across disparate locations. Big data technologies aim at harnessing the power of extensive data in real-time or otherwise ( Daniel, 2019 ). The characteristic attributes of big data are often referred to as the four V’s: volume (amount of data), variety (diversity of sources and types of data), velocity (speed of data transmission and generation), and veracity (accuracy and trustworthiness of data) ( Laney, 2001 ; Schroeck et al., 2012 ; Geczy, 2014 ). Recently, a fifth V was added, namely value (i.e., that data can be monetized; Dijcks, 2013 ). Because of these intrinsic characteristics (the five V’s), large and complex datasets cannot be processed and utilized with traditional data management techniques. Hence, novel and innovative computational technologies are required for the acquisition, storage, distribution, analysis, and management of big data ( Lazer et al., 2014 ; Geczy, 2015 ). Big data analytics commonly encompasses the processes of gathering, analyzing, and evaluating large datasets. Extraction of actionable knowledge and viable patterns from data is often viewed as the core benefit of the big data revolution ( Mayer-Schönberger and Cukier, 2013 ; Jagadish et al., 2014 ). Big data analytics employs a variety of technologies and tools, such as statistical analysis, data mining, data visualization, text analytics, social network analysis, signal processing, and machine learning ( Chen and Zhang, 2014 ).
As a subset of AI, machine learning focuses on building computer systems that can learn from and adapt to data automatically without explicit programming ( Jordan and Mitchell, 2015 ). Machine learning algorithms can provide new insights, predictions, and solutions tailored to the needs and circumstances of each individual. Given large quantities of high-quality input training data, machine learning processes can achieve accurate results and facilitate informed decision making ( Manyika et al., 2011 ; Gobert et al., 2012 , 2013 ; Gobert and Sao Pedro, 2017 ). These data-intensive machine learning methods sit at the intersection of big data and AI, and are capable of improving the services and productivity of education, as well as many other fields including commerce, science, and government.
Regarding education, our main area of interest here, the application of AI technologies can be traced back approximately 50 years. The first intelligent tutoring system, “SCHOLAR,” was designed to support geography learning and was capable of generating interactive responses to student statements ( Carbonell, 1970 ). While the amount of data was relatively small at that time, it was comparable to the amount of data collected in other traditional educational and psychological studies. Research on AI in education over the past few decades has been dedicated to advancing intelligent computing technologies such as intelligent tutoring systems ( Graesser et al., 2005 ; Gobert et al., 2013 ; Nye, 2015 ), robotic systems ( Toh et al., 2016 ; Anwar et al., 2019 ), and chatbots ( Smutny and Schreiberova, 2020 ). With the breakthroughs in information technologies in the last decade, educational psychologists have had greater access to big data. Concretely speaking, social media (e.g., Facebook, Twitter), online learning environments [e.g., Massive Open Online Courses (MOOCs)], intelligent tutoring systems (e.g., AutoTutor), learning management systems (LMSs), sensors, and mobile devices are generating ever-growing amounts of dynamic and complex data containing students’ personal records, physiological data, learning logs and activities, as well as their learning performance and outcomes ( Daniel, 2015 ). Learning analytics, described as “the measurement, collection, analysis, and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs” ( Long and Siemens, 2011 , p. 34), is often employed to analyze these huge amounts of data ( Aldowah et al., 2019 ). Machine learning and AI techniques further expand the capabilities of learning analytics ( Zawacki-Richter et al., 2019 ). The essential information extracted from big data can be used to optimize learning, teaching, and administration ( Daniel, 2015 ). Hence, research on big data and AI is gaining increasing significance in education ( Johnson et al., 2011 ; Becker et al., 2017 ; Hwang et al., 2018 ) and psychology ( Harlow and Oswald, 2016 ; Yarkoni and Westfall, 2017 ; Adjerid and Kelley, 2018 ; Cheung and Jak, 2018 ). Recently, the adoption of big data and AI in the psychology of learning and teaching has been trending as a novel method in cutting-edge educational research ( Daniel, 2015 ; Starcic, 2019 ).
The Position Formulation
A growing body of literature has attempted to uncover the value of big data at different education levels, from preschool to higher education ( Chen N.-S. et al., 2020 ). Several journal articles and book chapters have presented retrospective descriptions and the latest advances in this rapidly expanding research area from different angles, including systematic literature reviews ( Zawacki-Richter et al., 2019 ; Quadir et al., 2020 ), bibliometric study ( Hinojo-Lucena et al., 2019 ), qualitative analysis ( Malik et al., 2019 ; Chen L. et al., 2020 ), and social network analysis ( Goksel and Bozkurt, 2019 ). More details can be found in the previously mentioned reviews. In this paper, we aim to present the current progress of the application of big data and AI in education. By and large, research on the learner side is devoted to identifying students’ learning and affective behavior patterns and profiles, improving methods of assessment and evaluation, predicting individual students’ learning performance or dropouts, and providing adaptive systems for personalized support ( Papamitsiou and Economides, 2014 ; Zawacki-Richter et al., 2019 ). On the teacher side, numerous studies have attempted to enhance course planning and curriculum development, evaluation of teaching, and teaching support ( Zawacki-Richter et al., 2019 ; Quadir et al., 2020 ). Additionally, teacher dashboards driven by big data techniques, such as Inq-Blotter, are being used to inform teachers’ instruction in real time while students simultaneously work in Inq-ITS ( Gobert and Sao Pedro, 2017 ; Mislevy et al., 2020 ). Big data technologies employing learning analytics and machine learning have demonstrated high predictive accuracy for students’ academic performance ( Huang et al., 2020 ). Only a small number of studies have focused on the effectiveness of learning analytics programs and AI applications; however, recent findings have revealed encouraging results in terms of improving students’ academic performance and retention, as well as supporting teachers in learning design and teaching strategy refinement ( Viberg et al., 2018 ; Li et al., 2019 ; Sonderlund et al., 2019 ; Mislevy et al., 2020 ).
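To make the kind of analysis behind such accuracy claims concrete, the sketch below shows one common shape of the pipeline: a classifier trained on aggregated learning-log features and scored with cross-validation. This is a minimal illustration on synthetic data; the features, data, and model choice are hypothetical examples of ours, not the pipeline of any study cited above.

```python
# Minimal sketch (synthetic data, hypothetical features): estimating how well
# a classifier predicts course outcomes from aggregated learning-log features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical weekly aggregates per student: logins, minutes on task,
# quiz average, forum posts.
X = np.column_stack([
    rng.poisson(5, n),
    rng.normal(120, 30, n),
    rng.uniform(0, 100, n),
    rng.poisson(2, n),
])
# Synthetic pass/fail label, loosely tied to time on task and quiz average.
y = (0.01 * X[:, 1] + 0.03 * X[:, 2] + rng.normal(0, 1, n) > 3.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reported accuracies in the studies above come from far richer data and carefully validated models; the point here is only the overall shape of the analysis.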
Despite the growing number of reports and methods outlining implementations of big data and AI technologies in educational environments, we see a notable gap between contemporary technological capabilities and their utilization for education. The fast-growing education industry has developed numerous data processing techniques and AI applications, which may not be guided by current theoretical frameworks and research findings from the psychology of learning and teaching. The rapid pace of technological progress and the relatively slow educational adoption have contributed to the widening gap between technology readiness and its application in education ( Macfadyen, 2017 ). There is a pressing need to reduce this gap and stimulate technological adoption in education. This work presents varying viewpoints, the controversies surrounding them, contemporary research, and prospective future developments in the adoption of big data and AI in education. We advocate an interdisciplinary approach that encompasses educational, technological, and governmental spheres of influence. In the educational domain, there is a relative lack of knowledge and skills in AI and big data applications. On the technological side, few data scientists and AI developers are familiar with the advancements in educational psychology, though this is changing with the advent of graduate programs at the intersection of learning sciences and computer science. Finally, in terms of government policies, the main challenges are the regulatory and ethical dilemmas between supporting educational reforms and restricting the adoption of data-oriented technologies.
An Interdisciplinary Approach to Educational Adoption of Big Data and AI
In response to the new opportunities and challenges that the big data explosion and AI revolution are bringing, academics, educators, policy-makers, and professionals need to engage in productive collaboration. They must work together to cultivate learners’ competencies and the essential skills important for 21st-century work driven by the knowledge economy ( Bereiter, 2002 ). Collaboration across diverse disciplines and sectors is a demanding task—particularly when the individual parties lack a clear vision of their mutually beneficial interests and the necessary knowledge and skills to realize that vision. We highlight several overlapping spheres of interest at the intersection of research, policy-making, and industry engagements. Researchers and industry would benefit from targeted educational technology development and its efficient transfer to commercial products. Businesses and governments would benefit from legislation that stimulates technology markets while suitably protecting data and users’ privacy. Academics and policy-makers would benefit from prioritizing educational reforms enabling greater adoption of technology-enhanced curricula. The recent developments and evolving future trends at the intersections between researchers, policy-makers, and industry stakeholders, arising from advancements and deployments of big data and AI technologies in education, are illustrated in Figure 1.
Figure 1. Contemporary developments and future trends at the intersections between research, policy, and industry driven by big data and AI advances in education.
The domains of constructive collaboration among stakeholders progressively evolve along with scientific and technological developments. Therefore, it is important to reflect on longer-term projections and challenges. The following sections highlight the novel challenges and future directions of big data and AI technologies at the intersection of education research, policy-making, and industry.
Big Data and AI in Education: Research
An understanding of individual differences is critical for developing pedagogical tools to target specific students and to tailor education to individual needs at different stages. Intelligent educational systems employing big data and AI techniques are capable of collecting accurate and rich personal data. Data analytics can reveal students’ learning patterns and identify their specific needs ( Gobert and Sao Pedro, 2017 ; Mislevy et al., 2020 ). Hence, big data and AI have the potential to realize individualized learning to achieve precision education ( Lu et al., 2018 ). We see the following emerging trends, research gaps, and controversies in integrating big data and AI into education research so that there is a deep and rigorous understanding of individual differences that can be used to personalize learning in real time and at scale.
(1) Education is progressively moving from a one-size-fits-all approach to precision education or personalized learning ( Lu et al., 2018 ; Tsai et al., 2020 ). The one-size-fits-all approach was designed for average students, whereas precision education takes into consideration the individual differences of learners in their learning environments, along with their learning strategies. The main idea of precision education is analogous to “precision medicine,” where researchers harvest big data to identify patterns relevant to specific patients such that prevention and treatment can be customized. Based on the analysis of student learning profiles and patterns, precision education predicts students’ performance and provides timely interventions to optimize learning. The goal of precision education is to improve the diagnosis, prediction, treatment, and prevention of learning outcomes ( Lu et al., 2018 ). Contemporary research gaps related to adaptive tools and personalized educational experiences are impeding the transition to precision education. Adaptive educational tools and flexible learning systems are needed to accommodate individual learners’ interaction, pace, and learning progress, and to fit the specific needs of individual learners, such as students with learning disabilities ( Xie et al., 2019 ; Zawacki-Richter et al., 2019 ). Hence, as personalized learning is customized for different people, researchers are able to focus on individualized learning that is adaptive to individual needs in real time ( Gobert and Sao Pedro, 2017 ; Lu et al., 2018 ); a minimal sketch of the underlying predict-and-intervene loop follows below.
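As referenced in point (1), the following is a minimal, self-contained sketch of the predict-and-intervene loop at the heart of precision education. The data, features, and the 0.7 risk cutoff are all hypothetical; in a real deployment each element would require empirical validation against actual learning outcomes.

```python
# Minimal sketch of precision education's "predict, then intervene" loop.
# Data, features, and the risk threshold below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200

# Hypothetical mid-semester features: assignment completion rate (0-1)
# and quiz average (0-100).
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 100, n)])
# Synthetic label: 1 = poor final outcome.
y = (X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.3, n) < 1.0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
risk = model.predict_proba(X)[:, 1]  # estimated P(poor outcome) per student

AT_RISK_THRESHOLD = 0.7  # illustrative cutoff, to be validated empirically
for i in np.flatnonzero(risk >= AT_RISK_THRESHOLD):
    # Placeholder for a timely intervention: alert the teacher or
    # recommend remedial material for this student.
    print(f"student {i}: risk = {risk[i]:.2f} -> flag for intervention")
```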
(2) The research focus on deploying AI in education is gradually shifting from a computational focus that demonstrates use cases of new technology to a cognitive focus that incorporates cognition in its design, such as perception ( VanRullen, 2017 ), emotion ( Song et al., 2016 ), and cognitive thinking ( Bramley et al., 2017 ). Moreover, it is also shifting from a single domain (e.g., domain expertise, or expert systems) to a cross-disciplinary approach through collaboration ( Spikol et al., 2018 ; Krouska et al., 2019 ) and domain transfer ( L’heureux et al., 2017 ). These shifts are facilitating transitions from the “known unknown” (gaining insights through reasoning) to the “unknown unknown” (discovering hidden values and unanticipated results through algorithms) ( Abed Ibrahim and Fekete, 2019 ; Cutumisu and Guo, 2019 ). In other words, deterministic learning, aimed at deductive/inductive reasoning and inference engines, predominated in traditional expert systems and early AI, whereas dynamic and stochastic learning, whose outcomes involve some randomness and uncertainty, is gradually becoming the trend in modern machine learning techniques.
(3) The format of machine-generated data and the purpose of machine learning algorithms should be carefully designed, as there is a notable gap between theoretical design and its applicability. A theoretical model is needed to guide the development, interpretation, and validation of algorithms ( Gobert et al., 2013 ; Hew et al., 2019 ). The outcomes of data analytics and algorithmically generated evidence must be shared with educators and applied with caution. For instance, efforts to algorithmically detect mental states such as boredom, frustration, and confusion ( Baker et al., 2010 ) must be supported by operational definitions and constructs that have been prudently evaluated. Additionally, affective data collected by AI systems should take into account cultural differences combined with contextual factors, teachers’ observations, and students’ opinions ( Yadegaridehkordi et al., 2019 ). Data need to be informatively and qualitatively balanced in order to avoid implicit biases that may propagate into algorithms trained on such data ( Staats, 2016 ); a minimal sketch of a pre-training balance audit follows below.
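As noted in point (3), one minimal, concrete form of such a data check is to audit how labels are distributed across learner groups before training, so that severe imbalance is caught before it propagates into the model. The records below are toy placeholders; a real audit would cover many more groups, labels, and contextual factors.

```python
# Minimal sketch of a pre-training balance audit: how are annotated affect
# labels (e.g., 1 = "confused") distributed across learner groups?
from collections import Counter

# Toy (group, label) records standing in for an annotated affect dataset.
records = [("group_a", 1), ("group_a", 0), ("group_a", 0), ("group_a", 0),
           ("group_b", 1), ("group_b", 1), ("group_b", 1), ("group_b", 0)]

counts = Counter(records)
for g in sorted({grp for grp, _ in records}):
    pos, neg = counts[(g, 1)], counts[(g, 0)]
    print(f"{g}: {pos + neg} examples, {pos / (pos + neg):.0%} positive labels")
# A large gap (here 25% vs. 75%) warns that a model trained on these data
# may encode group-dependent labeling artifacts rather than real differences.
```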
(4) There are ethical and algorithmic challenges in balancing human-provided learning and machine-assisted learning. The significant influence of AI and contemporary technologies is a double-edged sword ( Khechine and Lakhal, 2018 ). On the one hand, it facilitates better usability and drives progress. On the other, it might lead to algorithmic bias and the loss of certain essential skills among students who rely extensively on technology. For instance, in creativity- or experience-based learning, technology may even become an obstacle to learning, since it may hinder students from attaining first-hand experiences and participating in the learning activities ( Cuthbertson et al., 2004 ). Appropriately balancing technology adoption and human involvement in various educational contexts will be a challenge in the foreseeable future. Nonetheless, the convergence of human and machine learning has the potential for highly effective teaching and learning beyond the simple “sum of the parts of human and artificial intelligence” ( Topol, 2019 ).
(5) Algorithmic bias is another controversial issue ( Obermeyer et al., 2019 ). Since modern AI algorithms rely extensively on data, their performance is governed largely by the data they are given. Algorithms adapt to the inherent qualitative and quantitative characteristics of the data. For example, if the data are unbalanced and contain disproportionately better information on students from the general population than on minorities, the algorithms may produce systematic and repeatable errors that disadvantage minorities. These issues need to be addressed before wide implementation in educational practice, since every single student matters. More rigorous studies and validation in real learning environments are required, though work along these lines is being done ( Sao Pedro et al., 2013 ); a minimal sketch of a disaggregated, per-group evaluation follows below.
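As referenced in point (5), a disaggregated evaluation is one minimal way to surface such systematic subgroup errors: compare error rates per group rather than reporting a single overall score. The predictions, labels, and group memberships below are hypothetical placeholders.

```python
# Minimal sketch of a disaggregated (per-group) error analysis.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # hypothetical true outcomes
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 0])   # hypothetical model predictions
group = np.array(["majority", "majority", "minority", "majority",
                  "minority", "minority", "minority", "majority"])

for g in np.unique(group):
    mask = group == g
    err = np.mean(y_true[mask] != y_pred[mask])
    print(f"{g}: error rate = {err:.2f} (n = {mask.sum()})")
# A persistent gap between subgroup error rates (here 0.00 vs. 0.75)
# indicates that the model, or its training data, disadvantages one group.
```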
(6) The fast expansion of technology and inequalities of learning opportunities have aroused great controversy. Due to the exponential nature of technological progress, particularly the big data and AI revolution, a fresh paradigm and new learning landscape are on the horizon. For instance, the elite smartphone of 2010 was the BlackBerry. Today, 10 years later, even in sub-Saharan Africa, 75% of the population has mobile phones several generations more advanced ( GSMA Intelligence, 2020 ). Hence, the entry barriers are shifting from technical requirements to the willingness of and/or need for adoption. This has been clearly demonstrated during the COVID-19 pandemic: the need for social distancing and continuing education has led to online/e-learning deployments within months ( United Nations, 2020 ). A huge amount of learning data is created accordingly, and the extraction of meaningful patterns and the discovery of knowledge from these data is expected to be carried out through learning analytics and AI techniques. Inevitably, current learning cultures, learning experiences, and classroom dynamics are changing as “we live algorithmic lives” ( Bucher, 2018 ). Thus, there is a critical need to adopt proper learning theories of educational psychology and to encourage our learners to be active participants rather than passive recipients or merely tracked objects ( Loftus and Madden, 2020 ). For example, under the constructivist framework ( Tsai, 2000 ), technology-enhanced or AI-powered education may empower students to know their learning activities and patterns, predict their possible learning outcomes, and strategically regulate their learning behavior ( Koh et al., 2014 ; Loftus and Madden, 2020 ). On the other hand, in the era of information explosion and AI revolution, disadvantaged students and developing countries are facing a wider digital divide. To reduce the inequalities and bring more opportunities, cultivating young people’s competencies seems to be one of the most promising means ( UNESCO, 2015 ). Meanwhile, overseas support from international organizations such as the World Bank and UNESCO is imperative for developing countries in establishing their communication infrastructure (e.g., hardware, software, connectivity, electricity). Naturally, technology will not replace or hinder human learning; rather, the smart use of new technologies will facilitate the transfer and acquisition of knowledge ( Azevedo et al., 2019 ).
An overarching theme of the above research trends is that we need theories of cognitive and educational psychology to guide our understanding of the individual learner (and individual differences) in order to develop the best tools, algorithms, and practices for personalized learning. Take, for example, VR (virtual reality) or AR (augmented reality) as a fast-developing technology for education. The industry has developed many different types of VR/AR applications (e.g., Google Expeditions with over 100 virtual field trips), but these have typically been developed from the industry’s perspective (see further discussion below) and may not be informed by theories and data from educational psychology about how students actually learn. To make VR/AR effective learning tools, we must separate the technological features from the human experiences and abilities (e.g., the cognitive, linguistic, and spatial abilities of the learner; see Li et al., 2020 ). For example, VR provides a high-fidelity 3D real-life virtual environment, and the technological tools are built on the assumption that 3D realism enables the learner to gain ‘perceptual grounding’ during learning (e.g., having access to visual, auditory, and tactile experiences as in the real world). Following ‘embodied cognition’ theory ( Barsalou, 2008 ), we should expect VR learning to yield better learning outcomes than traditional classroom learning. However, empirical data suggest significant individual differences: some students benefit more than others from VR learning. It may be that individuals with higher cognitive and perceptual abilities need no additional visuospatial information (provided in VR) to succeed in learning. In any case, we need to understand how embodied experiences (provided by the technology) interact with different learners’ inherent abilities (as well as their prior knowledge and background) for the best application of the relevant technology in education.
Big Data and AI in Education: Policy-Making
Following the revolution triggered by breakthroughs in big data and AI technology, policy-makers have attempted to formulate strategies and policies regarding how to incorporate AI and emerging technologies into primary, secondary, and tertiary education ( Pedró et al., 2019 ). Major challenges must be overcome in order to suitably integrate big data and AI into educational practice. The following three segments highlight pertinent policy-oriented challenges, gaps, and evolving trends.
(1) In digitally-driven knowledge economies, traditional formal education systems are undergoing drastic changes or even a paradigm shift ( Peters, 2018 ). Lifelong learning is quickly being adopted and implemented through online or project-based learning schemes that incorporate multiple ways of teaching ( Lenschow, 1998 ; Sharples, 2000 ; Field, 2001 ; Koper and Tattersall, 2004 ). This new concept of continual education will require micro-credits or micro-degrees to sustain learners’ efforts ( Manuel Moreno-Marcos et al., 2019 ). The need to change the scope and role of education will become evident in the near future ( Williams, 2019 ). For example, in the next few years, new instruction methods, forms of engagement, and assessments will need to be developed in formal education to support lifelong learning, with credentialing built on micro-credits or micro-degrees.
(2) Solutions for integrating cutting-edge research findings, innovative theory-driven curricula, and emerging technologies into students’ learning are evidently beneficial, and perhaps even ready for adoption. However, there is an apparent divergence between pre-service and in-service teachers in their willingness to support and adopt these emerging technologies ( Pedró et al., 2019 ). Pre-service teachers have greater exposure to modern technologies and, in general, are more willing to adopt them; in-service teachers have greater practical experience and tend to rely more on it. To bridge this gap, effective teacher education programs and continuing education programs have to be developed and offered to support the adoption of these new technologies so that they can be implemented with fidelity ( O’Donnell, 2008 ). This issue could become even more pressing in light of the extended period of the COVID-19 pandemic.
(3) A suitable legislative framework is needed to protect personal data from unscrupulous collection, unauthorized disclosure, commercial exploitation, and other abuses ( Boyd and Crawford, 2012 ; Pardo and Siemens, 2014 ). Education records and personal data are highly sensitive, and there are significant risks associated with students’ educational profiles, records, and other personal data. Appropriate security measures must be adopted by educational institutions; one minimal technical example is sketched below. Commercial educational system providers are actively exploiting both legislative gaps and concealed data acquisition channels, and increasing numbers of industry players are implementing data-oriented business models ( Geczy, 2018 ). There is a vital role to play for legislative, regulatory, and enforcement bodies at both the national and local levels. It is imperative that governments enact, implement, and enforce privacy and personal data protection legislation and measures. In doing so, a proper balance must be struck between the desirable use of personal data for educational purposes and the undesirable commercial monetization and abuse of personal data.
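On the technical side of these protections, the sketch below illustrates one widely used institutional measure, pseudonymization: replacing student identifiers with salted hashes before records are shared for analysis. This is a minimal illustration with hypothetical record fields, not a complete anonymization scheme, and it does not substitute for the legislative safeguards discussed above.

```python
# Minimal sketch of pseudonymizing student records with a salted hash.
# Record fields are hypothetical; a real deployment would also manage the
# salt securely and assess re-identification risk.
import hashlib
import secrets

SALT = secrets.token_bytes(16)  # kept secret by the institution

def pseudonymize(student_id: str) -> str:
    return hashlib.sha256(SALT + student_id.encode("utf-8")).hexdigest()[:12]

record = {"student_id": "s1234567", "quiz_avg": 87, "logins": 14}
safe_record = {**record, "student_id": pseudonymize(record["student_id"])}
print(safe_record)  # identifier replaced; learning data preserved
```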
Big Data and AI in Education: Industry
Just as the scientific and academic aspects of big data and AI in education present unique challenges, so does the commercialization of educational tools and systems ( Renz et al., 2020 ). Numerous countries have attempted to stimulate innovation-based growth through enhancing technology transfer and fostering academia-industry collaboration ( Huggins and Thompson, 2015 ); in the United States, this was initiated by the Bayh-Dole Act ( Mowery et al., 2001 ). Building reciprocal and sustained partnerships is strongly encouraged, as it facilitates technology transfer and strengthens the links between academia and the education industry. There are several points to consider when approaching academia-industry collaboration; above all, it is important that the collaboration be mutually beneficial. The following points highlight the overlapping spheres of benefits for both educational commerce and academia, and also expose existing gaps and future prospects.
(1) Commercializing intelligent educational tools and systems that incorporate the latest scientific and technological advances can provide educators with tools for developing more effective curricula, pedagogical frameworks, assessments, and programs. Timely release of educational research advances onto commercial platforms is desirable for vendors from development, marketing, and revenue perspectives ( Renz and Hilbig, 2020 ). Implementation of the latest research enables progressive development of commercial products and distinctive differentiation for marketing purposes. It could also help close the significant gap between what the industry knows and develops and what academic research says about student learning. Novel features may also be suitably monetized—hence expanding revenue streams. The gaps between the availability of the latest research and its practical adoption are slowing progress and negatively impacting commercial vendors. A viable solution is closer alignment and/or direct collaboration between academia and industry.
(2) A greater spectrum of commercially and freely available tools helps maintain healthy market competition. It also helps to avoid monopolies and oligopolies that stifle innovation, limit choices, and damage markets for educational tools. Some well-established or free-of-charge platforms (e.g., Moodle and other LMSs) showed such oligopolistic potential during the COVID-19 pandemic. With more tools available on the market, educators and academics may explore novel avenues for improving education and research, and new and more effective forms of education may be devised. For instance, multimodal virtual educational environments have high future potential; these are environments that would otherwise be impossible in conventional physical settings (see the previous discussion of VR/AR). Expanding educational markets and commerce should inevitably lead to expanding resources for research and development funding ( Popenici and Kerr, 2017 ). Collaborative research projects sponsored by industry should provide support and opportunities for academics to advance educational research. Controversially, in numerous regions there is a decreasing trend in collaborative research. To reverse this trend, it is desirable that academic researchers and industry practitioners increase their engagement via mutual presentations, education, and even government initiatives. All three stakeholders (i.e., academia, industry, and government) should play more active roles.
(3) Vocational and practical education provides numerous opportunities for fruitful academia-industry collaboration. With the changing nature of work and growing technology adoption, there is an increasing demand for radical changes in vocational education—for both teachers and students ( World Development Report, 2019 ). Domain knowledge provided by teachers is beneficially supplemented by AI-assisted learning environments in academia, while practical skills are enhanced in industrial environments with hands-on experience and feedback from both trainers and technology tools. Hence, students benefit from acquiring domain knowledge and enhancing their skills via interactions with human teachers and trainers, and equally from gaining practical skills via interactions with simulated and real-world technological environments. Effective vocational training demands teachers and trainers on the human-learning side, and AI environments and actual technology tools on the machine-learning side. Collaboration between academia and industry, as well as balanced human and machine learning approaches, are pertinent for vocational education.
Discussion and Conclusion
Big data and AI have enormous potential to realize highly effective learning and teaching. They stimulate new research questions and designs, exploit innovative technologies and tools for data collection and analysis, and are ultimately becoming a mainstream research paradigm ( Daniel, 2019 ). Nonetheless, they are still fairly novel and unfamiliar to many researchers and educators. In this paper, we have described the general background, core concepts, and recent progress of this rapidly growing domain. Along with the arising opportunities, we have highlighted the crucial challenges and emerging trends of big data and AI in education, which are reflected in educational research, policy-making, and industry. Table 1 concisely summarizes the major challenges and possible solutions of big data and AI in education. In summary, future studies should aim at theory-based precision education, incorporate cross-disciplinary applications, and make appropriate use of educational technologies. Governments should be devoted to supporting lifelong learning, offering teacher education programs, and protecting personal data. With regard to the education industry, reciprocal and mutually beneficial relationships should be developed in order to enhance academia-industry collaboration.
Table 1. Major challenges and possible solutions for integrating big data and AI into education.
Regarding the future development of big data and AI, we advocate an in-depth dialog between the supporters of “cold” technology and “warm” humanity so that users of technology can benefit from its capacity and not see it as a threat to their livelihood. An equally important issue is that overreliance on technology may lead to an underestimation of the role of humans in education. Remember the fundamental role of schooling: the school is a great equalizer as well as a central socialization agent. We need to better understand the role of social and affective processing (e.g., emotion, motivation) in addition to cognitive processing in student learning successes (or failures). After all, human learning is a social behavior, and a number of key regions in our brains are wired to be socially engaged (see Li and Jeong, 2020 for a discussion).
It has been estimated that approximately half of current routine jobs might be automated in the near future ( Frey and Osborne, 2017 ; World Development Report, 2019 ). However, the teacher’s job cannot be replaced: the teacher-student relationship is indispensable to students’ learning and inspirational for students’ personal growth ( Roorda et al., 2011 ; Cheng and Tsai, 2019 ). On the other hand, new developments in technologies will enable us to collect and analyze large-scale, multimodal, and continuous real-time data. Such data-intensive and technology-driven analysis of human behavior, in real-world and simulated environments, may assist teachers in identifying students’ learning trajectories and patterns, developing corresponding lesson plans, and adopting effective teaching strategies ( Klašnja-Milicevic et al., 2017 ; Gierl and Lai, 2018 ). It may also support teachers in tackling students’ more complex problems and cultivating students’ higher-order thinking skills by freeing teachers from monotonous and routine tasks ( Li, 2007 ; Belpaeme et al., 2018 ). Hence, it is now imperative for us to embrace AI and technology and prepare our teachers and students for the future of AI-enhanced and technology-supported education.
The adoption of big data and AI in learning and teaching is still in its infancy and limited by technological and mindset challenges for now; however, the convergence of developments in psychology, data science, and computer science shows great promise in revolutionizing educational research, practice, and industry. We hope that the latest achievements and future directions presented in this paper will advance our shared goal of helping learners and teachers pursue sustainable development.
Author Contributions
HLu wrote the initial draft of the manuscript. PG, HLa, JG, and PL revised the drafts and provided theoretical background. SY, HO, JB, and RG contributed content for the original draft preparation of the manuscript. C-CT provided theoretical focus, design, and draft feedback, and supervised throughout the research. All authors contributed to the article and approved the submitted version.
Funding
This work was financially supported by the Institute for Research Excellence in Learning Sciences of National Taiwan Normal University (NTNU) from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.
Conflict of Interest
JG was employed by the company Apprendis, LLC, Berlin, MA, United States.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abed Ibrahim, L., and Fekete, I. (2019). What machine learning can tell us about the role of language dominance in the diagnostic accuracy of german litmus non-word and sentence repetition tasks. Front. Psychol. 9:2757. doi: 10.3389/fpsyg.2018.02757
Adjerid, I., and Kelley, K. (2018). Big data in psychology: a framework for research advancement. Am. Psychol. 73, 899–917. doi: 10.1037/amp0000190
Aldowah, H., Al-Samarraie, H., and Fauzy, W. M. (2019). Educational data mining and learning analytics for 21st century higher education: a review and synthesis. Telemat. Inform. 37, 13–49. doi: 10.1016/j.tele.2019.01.007
Anwar, S., Bascou, N. A., Menekse, M., and Kardgar, A. (2019). A systematic review of studies on educational robotics. J. Pre-College Eng. Educ. Res. (J-PEER) 9, 19–42. doi: 10.7771/2157-9288.1223
Azevedo, J. P. W. D., Crawford, M. F., Nayar, R., Rogers, F. H., Barron Rodriguez, M. R., Ding, E. Y. Z., et al. (2019). Ending Learning Poverty: What Will It Take?. Washington, D.C: The World Bank.
Baker, R. S. J. D., D’Mello, S. K., Rodrigo, M. M. T., and Graesser, A. C. (2010). Better to be frustrated than bored: the incidence, persistence, and impact of learners’ cognitive-affective states during interactions with three different computer-based learning environments. Int. J. Human-Comp. Stud. 68, 223–241. doi: 10.1016/j.ijhcs.2009.12.003
Barsalou, L. W. (2008). “Grounding symbolic operations in the brain’s modal systems,” in Embodied Grounding: Social, Cognitive, Affective, and Neuroscientific Approaches , eds G. R. Semin and E. R. Smith (Cambridge: Cambridge University Press), 9–42. doi: 10.1017/cbo9780511805837.002
Becker, S. A., Cummins, M., Davis, A., Freeman, A., Hall, C. G., and Ananthanarayanan, V. (2017). NMC Horizon Report: 2017 Higher Education Edition. Austin, TX: The New Media Consortium.
Belpaeme, T., Kennedy, J., Ramachandran, A., Scassellati, B., and Tanaka, F. (2018). Social robots for education: a review. Sci. Robot. 3:eaat5954. doi: 10.1126/scirobotics.aat5954
Bereiter, C. (2002). Education and MIND in the Knowledge Age. Mahwah, NJ: LEA.
Boyd, D., and Crawford, K. (2012). Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inform. Commun. Soc. 15, 662–679. doi: 10.1080/1369118x.2012.678878
Bramley, N. R., Dayan, P., Griffiths, T. L., and Lagnado, D. A. (2017). Formalizing Neurath’s ship: approximate algorithms for online causal learning. Psychol. Rev. 124, 301–338. doi: 10.1037/rev0000061
Bucher, T. (2018). If Then: Algorithmic Power and Politics. New York, NY: Oxford University Press.
Carbonell, J. R. (1970). AI in CAI: an artificial-intelligence approach to computer-assisted instruction. IEEE Trans. Man-Machine Sys. 11, 190–202. doi: 10.1109/TMMS.1970.299942
Chen, C. P., and Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inform. Sci. 275, 314–347. doi: 10.1016/j.ins.2014.01.015
Chen, L., Chen, P., and Lin, Z. (2020). Artificial intelligence in education: a review. IEEE Access 8, 75264–75278. doi: 10.1109/ACCESS.2020.2988510
Chen, N.-S., Yin, C., Isaias, P., and Psotka, J. (2020). Educational big data: extracting meaning from data for smart education. Interact. Learn. Environ. 28, 142–147. doi: 10.1080/10494820.2019.1635395
Cheng, K.-H., and Tsai, C.-C. (2019). A case study of immersive virtual field trips in an elementary classroom: students’ learning experience and teacher-student interaction behaviors. Comp. Educ. 140:103600. doi: 10.1016/j.compedu.2019.103600
Cheung, M. W.-L., and Jak, S. (2018). Challenges of big data analyses and applications in psychology. Zeitschrift Fur Psychol. J. Psychol. 226, 209–211. doi: 10.1027/2151-2604/a000348
Cuthbertson, B., Socha, T. L., and Potter, T. G. (2004). The double-edged sword: critical reflections on traditional and modern technology in outdoor education. J. Adv. Educ. Outdoor Learn. 4, 133–144. doi: 10.1080/14729670485200491
Cutumisu, M., and Guo, Q. (2019). Using topic modeling to extract pre-service teachers’ understandings of computational thinking from their coding reflections. IEEE Trans. Educ. 62, 325–332. doi: 10.1109/te.2019.2925253
Daniel, B. (2015). Big data and analytics in higher education: opportunities and challenges. Br. J. Educ. Technol. 46, 904–920. doi: 10.1111/bjet.12230
Daniel, B. K. (2019). Big data and data science: a critical review of issues for educational research. Br. J. Educ. Technol. 50, 101–113. doi: 10.1111/bjet.12595
Dijcks, J. (2013). Oracle: Big data for the enterprise. Oracle White Paper . Redwood Shores, CA: Oracle Corporation.
Field, J. (2001). Lifelong education. Int. J. Lifelong Educ. 20, 3–15. doi: 10.1080/09638280010008291
Frey, C. B., and Osborne, M. A. (2017). The future of employment: how susceptible are jobs to computerisation? Technol. Forecast. Soc. Change 114, 254–280. doi: 10.1016/j.techfore.2016.08.019
Geczy, P. (2014). Big data characteristics. Macrotheme Rev. 3, 94–104.
Geczy, P. (2015). Big data management: relational framework. Rev. Bus. Finance Stud. 6, 21–30.
Geczy, P. (2018). Data-Oriented business models: gaining competitive advantage. Global J. Bus. Res. 12, 25–36.
Gierl, M. J., and Lai, H. (2018). Using automatic item generation to create solutions and rationales for computerized formative testing. Appl. Psychol. Measurement 42, 42–57. doi: 10.1177/0146621617726788
Gobert, J., Sao Pedro, M., Raziuddin, J., and Baker, R. S. (2013). From log files to assessment metrics for science inquiry using educational data mining. J. Learn. Sci. 22, 521–563. doi: 10.1080/10508406.2013.837391
Gobert, J. D., and Sao Pedro, M. A. (2017). “Digital assessment environments for scientific inquiry practices,” in The Wiley Handbook of Cognition and Assessment , eds A. A. Rupp and J. P. Leighton (West Sussex: Frameworks, Methodologies, and Applications), 508–534. doi: 10.1002/9781118956588.ch21
Gobert, J. D., Sao Pedro, M. A., Baker, R. S., Toto, E., and Montalvo, O. (2012). Leveraging educational data mining for real-time performance assessment of scientific inquiry skills within microworlds. J. Educ. Data Min. 4, 104–143. doi: 10.5281/zenodo.3554645
Goksel, N., and Bozkurt, A. (2019). “Artificial intelligence in education: current insights and future perspectives,” in Handbook of Research on Learning in the Age of Transhumanism , eds S. Sisman-Ugur and G. Kurubacak (Hershey, PA: IGI Global), 224–236 doi: 10.4018/978-1-5225-8431-5.ch014
Graesser, A. C., Chipman, P., Haynes, B. C., and Olney, A. (2005). AutoTutor: an intelligent tutoring system with mixed-initiative dialogue. IEEE Trans. Educ. 48, 612–618. doi: 10.1109/te.2005.856149
GSMA Intelligence (2020). The Mobile Economy 2020 . London: GSM Association.
Harlow, L. L., and Oswald, F. L. (2016). Big data in psychology: introduction to the special issue. Psychol. Methods 21, 447–457. doi: 10.1037/met0000120
Hew, K. F., Lan, M., Tang, Y., Jia, C., and Lo, C. K. (2019). Where is the “theory” within the field of educational technology research? Br. J. Educ. Technol. 50, 956–971. doi: 10.1111/bjet.12770
Hinojo-Lucena, F. J., Aznar-Díaz, I., Cáceres-Reche, M. P., and Romero-Rodríguez, J. M. (2019). Artificial intelligence in higher education: a bibliometric study on its impact in the scientific literature. Educ. Sci. 9:51. doi: 10.3390/educsci9010051
Huang, A. Y., Lu, O. H., Huang, J. C., Yin, C., and Yang, S. J. (2020). Predicting students’ academic performance by using educational big data and learning analytics: evaluation of classification methods and learning logs. Int. Learn. Environ. 28, 206–230. doi: 10.1080/10494820.2019.1636086
Huggins, R., and Thompson, P. (2015). Entrepreneurship, innovation and regional growth: a network theory. Small Bus. Econ. 45, 103–128. doi: 10.1007/s11187-015-9643-3
Hwang, G.-J., Spikol, D., and Li, K.-C. (2018). Guest editorial: trends and research issues of learning analytics and educational big data. Educ. Technol. Soc. 21, 134–136.
Jagadish, H. V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., et al. (2014). Big data and its technical challenges. Commun. ACM. 57, 86–94. doi: 10.1145/2611567
Johnson, L., Smith, R., Willis, H., Levine, A., and Haywood, K. (2011). The 2011 Horizon Report. Austin, TX: The New Media Consortium.
Jordan, M. I., and Mitchell, T. M. (2015). Machine learning: trends, perspectives, and prospects. Science 349, 255–260. doi: 10.1126/science.aaa8415
Khechine, H., and Lakhal, S. (2018). Technology as a double-edged sword: from behavior prediction with UTAUT to students’ outcomes considering personal characteristics. J. Inform. Technol. Educ. Res. 17, 63–102. doi: 10.28945/4022
Klašnja-Milicevic, A., Ivanovic, M., and Budimac, Z. (2017). Data science in education: big data and learning analytics. Comput. Applicat. Eng. Educ. 25, 1066–1078. doi: 10.1002/cae.21844
Koh, J. H. L., Chai, C. S., and Tsai, C. C. (2014). Demographic factors, TPACK constructs, and teachers’ perceptions of constructivist-oriented TPACK. J. Educ. Technol. Soc. 17, 185–196.
Koper, R., and Tattersall, C. (2004). New directions for lifelong learning using network technologies. Br. J. Educ. Technol. 35, 689–700. doi: 10.1111/j.1467-8535.2004.00427.x
Krouska, A., Troussas, C., and Virvou, M. (2019). SN-Learning: an exploratory study beyond e-learning and evaluation of its applications using EV-SNL framework. J. Comp. Ass. Learn. 35, 168–177. doi: 10.1111/jcal.12330
Laney, D. (2001). 3D data management: controlling data volume, velocity and variety. META Group Res. Note 6, 70–73.
Lazer, D., Kennedy, R., King, G., and Vespignani, A. (2014). The parable of Google Flu: traps in big data analysis. Science 343, 1203–1205. doi: 10.1126/science.1248506
Lenschow, R. J. (1998). From teaching to learning: a paradigm shift in engineering education and lifelong learning. Eur. J. Eng. Educ. 23, 155–161. doi: 10.1080/03043799808923494
L’heureux, A., Grolinger, K., Elyamany, H. F., and Capretz, M. A. (2017). Machine learning with big data: challenges and approaches. IEEE Access 5, 7776–7797. doi: 10.1109/ACCESS.2017.2696365
Li, H., Gobert, J., and Dickler, R. (2019). “Evaluating the transfer of scaffolded inquiry: what sticks and does it last?,” in Artificial Intelligence in Education , eds S. Isotani, E. Millán, A. Ogan, P. Hastings, B. McLaren, and R. Luckin (Cham: Springer), 163–168. doi: 10.1007/978-3-030-23207-8_31
Li, P., and Jeong, H. (2020). The social brain of language: grounding second language learning in social interaction. npj Sci. Learn. 5:8. doi: 10.1038/s41539-020-0068-7
Li, P., Legault, J., Klippel, A., and Zhao, J. (2020). Virtual reality for student learning: understanding individual differences. Hum. Behav. Brain 1, 28–36. doi: 10.37716/HBAB.2020010105
Li, X. (2007). Intelligent agent–supported online education. Dec. Sci. J. Innovat. Educ. 5, 311–331. doi: 10.1111/j.1540-4609.2007.00143.x
Loftus, M., and Madden, M. G. (2020). A pedagogy of data and Artificial intelligence for student subjectification. Teach. Higher Educ. 25, 456–475. doi: 10.1080/13562517.2020.1748593
Long, P., and Siemens, G. (2011). Penetrating the fog: analytics in learning and education. EDUCAUSE Rev. 46, 31–40. doi: 10.1007/978-3-319-38956-1_4
Lu, O. H. T., Huang, A. Y. Q., Huang, J. C. H., Lin, A. J. Q., Ogata, H., and Yang, S. J. H. (2018). Applying learning analytics for the early prediction of students’ academic performance in blended learning. Educ. Technol. Soc. 21, 220–232.
Macfadyen, L. P. (2017). Overcoming barriers to educational analytics: how systems thinking and pragmatism can help. Educ. Technol. 57, 31–39.
Malik, G., Tayal, D. K., and Vij, S. (2019). “An analysis of the role of artificial intelligence in education and teaching,” in Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing , eds P. Sa, S. Bakshi, I. Hatzilygeroudis, and M. Sahoo (Singapore: Springer), 407–417.
Manuel Moreno-Marcos, P., Alario-Hoyos, C., Munoz-Merino, P. J., and Delgado Kloos, C. (2019). Prediction in MOOCs: a review and future research directions. IEEE Trans. Learn. Technol. 12, 384–401. doi: 10.1109/TLT.2018.2856808
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., et al. (2011). Big data: The Next Frontier for Innovation, Competition and Productivity. New York, NY: McKinsey Global Institute.
Mayer-Schönberger, V., and Cukier, K. (2013). Big data: A Revolution That Will Transform How we live, Work, and Think. Boston, MA: Houghton Mifflin Harcourt.
Mislevy, R. J., Yan, D., Gobert, J., and Sao Pedro, M. (2020). “Automated scoring in intelligent tutoring systems,” in Handbook of Automated Scoring , eds D. Yan, A. A. Rupp, and P. W. Foltz (London: Chapman and Hall/CRC), 403–422. doi: 10.1201/9781351264808-22
Mowery, D. C., Nelson, R. R., Sampat, B. N., and Ziedonis, A. A. (2001). The growth of patenting and licensing by US universities: an assessment of the effects of the Bayh–Dole act of 1980. Res. Pol. 30, 99–119. doi: 10.1515/9780804796361-008
Nye, B. D. (2015). Intelligent tutoring systems by and for the developing world: a review of trends and approaches for educational technology in a global context. Int. J. Art. Intell. Educ. 25, 177–203. doi: 10.1007/s40593-014-0028-6
Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453. doi: 10.1126/science.aax2342
O’Donnell, C. (2008). Defining, conceptualizing, and measuring fidelity of implementation and its relationship to outcomes in K-12 curriculum intervention research. Rev. Educ. Res. 78, 33–84. doi: 10.3102/0034654307313793
Papamitsiou, Z., and Economides, A. A. (2014). Learning analytics and educational data mining in practice: a systematic literature review of empirical evidence. Educ. Technol. Soc. 17, 49–64.
Pardo, A., and Siemens, G. (2014). Ethical and privacy principles for learning analytics. Br. J. Educ. Technol. 45, 438–450. doi: 10.1111/bjet.12152
Pedró, F., Subosa, M., Rivas, A., and Valverde, P. (2019). Artificial Intelligence in Education: Challenges and Opportunities for Sustainable Development. Paris: UNESCO.
Peters, M. A. (2018). Deep learning, education and the final stage of automation. Educ. Phil. Theory 50, 549–553. doi: 10.1080/00131857.2017.1348928
Popenici, S. A., and Kerr, S. (2017). Exploring the impact of artificial intelligence on teaching and learning in higher education. Res. Pract. Technol. Enhanced Learn. 12:22. doi: 10.1186/s41039-017-0062-8
Quadir, B., Chen, N.-S., and Isaias, P. (2020). Analyzing the educational goals, problems and techniques used in educational big data research from 2010 to 2018. Int. Learn. Environ. 1–17. doi: 10.1080/10494820.2020.1712427
Renz, A., and Hilbig, R. (2020). Prerequisites for artificial intelligence in further education: identification of drivers, barriers, and business models of educational technology companies. Int. J. Educ. Technol. Higher Educ. 17:14. doi: 10.1186/s41239-020-00193-3
Renz, A., Krishnaraja, S., and Gronau, E. (2020). Demystification of artificial intelligence in education–how much ai is really in the educational technology? Int. J. Learn. Anal. Art. Intell. Educ. (IJAI). 2, 4–30. doi: 10.3991/ijai.v2i1.12675
Roorda, D. L., Koomen, H. M. Y., Spilt, J. L., and Oort, F. J. (2011). The influence of affective teacher-student relationships on students’ school engagement and achievement: a meta-analytic approach. Rev. Educ. Res. 81, 493–529. doi: 10.3102/0034654311421793
Sao Pedro, M., Baker, R., and Gobert, J. (2013). “What different kinds of stratification can reveal about the generalizability of data-mined skill assessment models,” in Proceedings of the 3rd Conference on Learning Analytics and Knowledge (Leuven), 190–194.
Schroeck, M., Shockley, R., Smart, J., Romero-Morales, D., and Tufano, P. (2012). Analytics: the real-world use of big data. IBM Global Bus. Serv. 12, 1–20. doi: 10.1002/9781119204183.ch1
Sharples, M. (2000). The design of personal mobile technologies for lifelong learning. Comp. Educ. 34, 177–193. doi: 10.1016/s0360-1315(99)00044-5
Smutny, P., and Schreiberova, P. (2020). Chatbots for learning: a review of educational chatbots for the facebook messenger. Comp. Educ. 151:103862. doi: 10.1016/j.compedu.2020.103862
Sonderlund, A. L., Hughes, E., and Smith, J. (2019). The efficacy of learning analytics interventions in higher education: a systematic review. Br. J. Educ. Technol. 50, 2594–2618. doi: 10.1111/bjet.12720
Song, Y., Dai, X.-Y., and Wang, J. (2016). Not all emotions are created equal: expressive behavior of the networked public on China’s social media site. Comp. Hum. Behav. 60, 525–533. doi: 10.1016/j.chb.2016.02.086
Spikol, D., Ruffaldi, E., Dabisias, G., and Cukurova, M. (2018). Supervised machine learning in multimodal learning analytics for estimating success in project-based learning. J. Comp. Ass. Learn. 34, 366–377. doi: 10.1111/jcal.12263
Staats, C. (2016). Understanding implicit bias: what educators should know. Am. Educ. 39, 29–33. doi: 10.2307/3396655
Starcic, A. I. (2019). Human learning and learning analytics in the age of artificial intelligence. Br. J. Educ. Technol. 50, 2974–2976. doi: 10.1111/bjet.12879
The International Learning Sciences Forum (2019). The International Learning Sciences Forum: International Trends for Ai and Big Data in Learning Sciences. Taipei: National Taiwan Normal University.
Toh, L. P. E., Causo, A., Tzuo, P. W., Chen, I. M., and Yeo, S. H. (2016). A review on the use of robots in education and young children. J. Educ. Technol. Soc. 19, 148–163.
Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56. doi: 10.1038/s41591-018-0300-7
Tsai, C. C. (2000). Relationships between student scientific epistemological beliefs and perceptions of constructivist learning environments. Educ. Res. 42, 193–205. doi: 10.1080/001318800363836
Tsai, S. C., Chen, C. H., Shiao, Y. T., Ciou, J. S., and Wu, T. N. (2020). Precision education with statistical learning and deep learning: a case study in Taiwan. Int. J. Educ. Technol. Higher Educ. 17, 1–13. doi: 10.1186/s41239-020-00186-2
UNESCO (2015). SDG4-Education 2030, Incheon Declaration (ID) and Framework for Action. For the Implementation of Sustainable Development Goal 4, Ensure Inclusive and Equitable Quality Education and Promote Lifelong Learning Opportunities for All, ED-2016/WS/28. London: UNESCO
United Nations (2020). Policy Brief: Education During Covid-19 and Beyond. New York, NY: United Nations
VanRullen, R. (2017). Perception science in the age of deep neural networks. Front. Psychol. 8:142. doi: 10.3389/fpsyg.2017.00142
Viberg, O., Hatakka, M., Bälter, O., and Mavroudi, A. (2018). The current landscape of learning analytics in higher education. Comput. Human Behav. 89, 98–110. doi: 10.1016/j.chb.2018.07.027
Williams, P. (2019). Does competency-based education with blockchain signal a new mission for universities? J. Higher Educ. Pol. Manag. 41, 104–117. doi: 10.1080/1360080x.2018.1520491
World Development Report (2019). The Changing Nature of Work. Washington, DC: The World Bank/International Bank for Reconstruction and Development.
Xie, H., Chu, H.-C., Hwang, G.-J., and Wang, C.-C. (2019). Trends and development in technology-enhanced adaptive/personalized learning: a systematic review of journal publications from 2007 to 2017. Comp. Educ. 140:103599. doi: 10.1016/j.compedu.2019.103599
Yadegaridehkordi, E., Noor, N. F. B. M., Ayub, M. N. B., Affal, H. B., and Hussin, N. B. (2019). Affective computing in education: a systematic review and future research. Comp. Educ. 142:103649. doi: 10.1016/j.compedu.2019.103649
Yarkoni, T., and Westfall, J. (2017). Choosing prediction over explanation in psychology: lessons from machine learning. Perspect. Psychol. Sci. 12, 1100–1122. doi: 10.1177/1745691617693393
Zawacki-Richter, O., Marín, V. I., Bond, M., and Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education–where are the educators? Int. J. Educ. Technol. Higher Educ. 16:39. doi: 10.1186/s41239-019-0171-0
Keywords: big data, artificial intelligence, education, learning, teaching
Citation: Luan H, Geczy P, Lai H, Gobert J, Yang SJH, Ogata H, Baltes J, Guerra R, Li P and Tsai C-C (2020) Challenges and Future Directions of Big Data and Artificial Intelligence in Education. Front. Psychol. 11:580820. doi: 10.3389/fpsyg.2020.580820
Received: 07 July 2020; Accepted: 22 September 2020; Published: 19 October 2020.
Copyright © 2020 Luan, Geczy, Lai, Gobert, Yang, Ogata, Baltes, Guerra, Li and Tsai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Chin-Chung Tsai, [email protected]
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
- Open access
- Published: 04 August 2020
Moving back to the future of big data-driven research: reflecting on the social in genomics
- Melanie Goisauf, ORCID: orcid.org/0000-0002-3909-8071 1,2
- Kaya Akyüz, ORCID: orcid.org/0000-0002-2444-2095 1,2
- Gillian M. Martin, ORCID: orcid.org/0000-0002-5281-8117 3
Humanities and Social Sciences Communications, volume 7, Article number: 55 (2020)
With the advance of genomics, specific individual conditions have received increased attention in the generation of scientific knowledge. This spans the extremes of the aim of curing genetic diseases and identifying the biological basis of social behaviour. In this development, the ways knowledge is produced have gained significant relevance, as the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory. This article argues that an in-depth discussion and critical reflection on the social configurations that are inscribed in, and reproduced by, genomic data-intensive research is urgently needed. This is illustrated by debating a recent case: a large-scale genome-wide association study (GWAS) on sexual orientation that suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). This case is analysed from three angles: (1) the demonstration of how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) the exploration of the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a); and (3) the demonstration of how the assumption of being ‘free from theory’ in this case does not mean free of choices made, which are themselves restricted by the data that are available. In questioning how key sociological categories are incorporated into a wider scientific debate on genetic conditions and knowledge production, the article shows how the underlying classifications and categorizations, which are inherently social in their production, can have wide-ranging implications. The conclusion cautions against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.
Introduction
With the advance of genomic research, specific individual conditions have received increased attention in scientific knowledge generation. While understanding the genetic foundations of diseases has become an important driver for the advancement of personalized medicine, the focus of interest has also expanded from disease to social behaviour. These developments are embedded in a wider discourse in science and society about the opportunities and limits of genomic research and intervention. With the emergence of the genome as a key concept for ‘life itself’, understandings of health and disease, responsibility and risk, and the relation between present conditions and future health outcomes have shifted, also impacting the ways in which identities are conceptualized under new genetic conditions (Novas and Rose 2000). At the same time, the growing literature of postgenomics points to evolving understandings of what ‘gene’ and ‘environment’ are (Landecker and Panofsky 2013; Fox Keller 2014; Meloni 2016). The postgenomic genome is no longer understood as merely directional and static, but rather as a complex and dynamic system that responds to its environment (Fox Keller 2015), where the social, as part of the environment, becomes a signal for the activation or silencing of genes (Landecker 2016). At the same time, genetic engineering, prominently known through the gene-editing technology CRISPR/Cas9, has received considerable attention, but has also caused concerns regarding its ethical, legal and societal implications (ELSI) and governance (Howard et al. 2018; Jasanoff and Hurlbut 2018). Taking these developments together, the big question of nature vs. nurture has taken on new significance.
Studies that aim to reveal how biology and culture relate to each other appear frequently and pursue a genomic re-thinking of social outcomes and phenomena, such as educational attainment (Lee et al. 2018) or social stratification (Abdellaoui et al. 2019). Yet, we also witness very controversial applications of biotechnology, such as the first known case of human germline editing by He Jiankui in China, which has impacted the scientific community as an impetus for wide protest and insecurity about the future of gene editing and its use, but has also instigated calls for public consensus to (re-)set boundaries on what is editable (Morrison and de Saille 2019).
Against this background, we are going to debate in this article a particular case that appeared within the same timeframe as these developments: a large-scale genome-wide association study (GWAS) on sexual orientation Footnote 1, which suggested a partial genetic basis for same-sex sexual behaviour (Ganna et al. 2019b). Some scientists have claimed for years that sexual orientation is partly heritable and have tried to identify a genetic basis for it (Hamer et al. 1993); however, this was the first time that genetic variants were identified as statistically significant and replicated in an independent sample. We consider this GWAS not only by questioning the ways genes are associated with “the social” within this research, but also by exploring how the complexity of the social is reduced through specific data practices in research.
The sexual orientation study also constitutes an interesting case to reflect on how knowledge is produced at a time when the data-intensive search for biology/sociality associations has repercussions on doing social research and on theory (Meloni 2014). Large amounts of genomic data are needed to identify genetic variations and to find correlations with different biological and social factors. The rise of the genome corresponds to the rise of big data, as the collection and sharing of genomic data gain power with the development of big data analytics (Parry and Greenhough 2017). A growing number of correlations, e.g. in the genomics of educational attainment (Lee et al. 2018; Okbay et al. 2016), are being found that link the genome to the social, increasingly blurring the established biological/social divide. These could open up new ways of understanding life and underpin the importance of culture, while, paradoxically, they may also carry the risk of a new genetic determinism and essentialism. The changing understanding of the now molecularised and datafied body also illustrates the changing significance of empirical research and sociology (Savage and Burrows 2007) in the era of postgenomics and ‘datafication’ (Ruckenstein and Schüll 2017). These developments are situated within methodological debates in which the social sciences often appear through the perspective of ELSI.
As the field of genomics is progressing rapidly and intervention in the human genome is no longer science fiction, we argue that it is important to discuss and reflect now on the social configurations that are inscribed in, and reproduced by, genomic data-driven research. These may co-produce the conception of certain potentially editable conditions, i.e. create new, and reproduce existing, classifications that are largely shaped by societal understandings of difference and order. Such definitions could have real consequences—as Thomas and Thomas (1929) remind us—for individuals and societies, and mark what has been described as an epistemic shift in biomedicine from the clinical gaze to the ‘molecular gaze’, where the processes of “medicalisation and biomedicalisation both legitimate and compel interventions that may produce transformations in individual, familial and other collective identities” (Clarke et al. 2013, p. 23). While Science and Technology Studies (STS) has demonstrated how science and society are co-produced in research (Jasanoff 2004), we want to use the momentum of the current discourse to critically reflect on these developments from three angles: (1) we demonstrate how, in the process of genomics research, societal relations, understandings and categorizations are used and inscribed into social phenomena and outcomes; (2) we explore the ways that (big) data-driven research is constituted by increasingly moving away from theory and from the methodological generation of theoretical concepts that foster the understanding of societal contexts and relations (Kitchin 2014a) and (3) using the GWAS case in focus, we show how the assumption of being ‘free from theory’ (Kitchin 2014a) in this case does not mean free of choices made, choices which are themselves restricted by the data that are available. We highlight Griffiths’ (2016) contention that the material nature of genes, their impacts on the biological makeup of individuals and their socially and culturally situated behaviour are not deterministic, and need to be understood within the dynamic, culturally and temporally situated context within which knowledge claims are made. We conclude by making the important point that ignoring the social may lead to a distorted, datafied, genomised body, overlooking the key fact that “genes are not stable but essentially malleable” (Prainsack 2015) and that this ‘malleability’ is rooted in the complex interplay between biological and social environments.
From this perspective, the body is understood through the lens of embodiment, considering that humans ‘live’ their genome within their own lifeworld contexts (Rehmann-Sutter and Mahr 2016). We also consider this paper an intervention against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.
In the following reflections, we proceed step by step: First, we introduce the case of the GWAS on same-sex sexual behaviour, as well as its limits, context and impact. Second, we recall key sociological theory on categorizations and their implications. Third, we discuss the emergence of a digital-datafication of scientific knowledge production. Finally, we conclude by cautioning against the marginalization of social science in the wake of developments in data-driven research that neglect social theory, established methodology and the contextual relevance of the social environment.
Studying sexual orientation: The case of same-sex sexual behaviour
Currently, a number of studies at the intersection of genetic and social conditions are appearing on the horizon. Just as in the examples we have already mentioned, such as those on educational attainment (Lee et al. 2018) or social stratification (Abdellaoui et al. 2019), it is important to note that the only limit to such studies is the availability of the data itself. In other words, once the data are available, there is always the potential that they will eventually be used. This said, an analysis of the entirety of genomic research on social outcomes and behaviour is beyond the scope of this article. Therefore, we want to exemplify our argument with reference to the research on the genetics of same-sex sexual behaviour.
Based on a sample of half a million individuals of European ancestry, the first large-scale GWAS of its kind claims that five genetic variants contribute to the assessed “same-sex sexual behaviour” (Ganna et al. 2019b). Among these variants, two are associated only with male–male sexual behaviour, one only with female–female sexual behaviour, and the remaining two with both. The data that led to this analysis were sourced from biobanks/cohorts with different methods of data collection. The authors conclude that these genetic variations are not predictive of sexual orientation; not only because genetics is supposedly only part of the picture, but also because the variations account for only a small part (<1% of the variance in same-sex sexual behaviour, p. 4) of the approximated genetic basis (8–25% of the variance in same-sex sexual behaviour) that may be identified with larger sample sizes (p. 1); the sketch below illustrates the arithmetic behind this gap. The study is an example of how the ‘gay gene’ discourse that has been around for years is transformed by the data accumulating in biobanks and the consequent genomic analysis, offering only one facet of a complex social phenomenon: same-sex sexual behaviour.
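To make that gap concrete: under an additive model, the variance a single biallelic SNP explains in a standardized trait is 2p(1-p)β², where p is the allele frequency and β the per-allele effect. The following minimal Python sketch uses hypothetical frequencies and effect sizes, not the study’s actual estimates, to show why a handful of significant variants sums to well under 1% of variance while the overall estimated genetic contribution remains far higher.

```python
# Minimal sketch with hypothetical numbers (not taken from Ganna et al. 2019b):
# variance in a standardized trait explained by a few significant SNPs.

def snp_variance_explained(p: float, beta: float) -> float:
    """Variance explained by one biallelic SNP under an additive model.

    p    -- allele frequency of the effect allele
    beta -- per-allele effect on the trait, in standard-deviation units
    """
    return 2 * p * (1 - p) * beta ** 2

# Five hypothetical variants with small effects, as is typical in GWAS.
variants = [(0.40, 0.025), (0.25, 0.030), (0.10, 0.040), (0.35, 0.020), (0.45, 0.015)]

total = sum(snp_variance_explained(p, b) for p, b in variants)
print(f"Variance explained by 5 significant SNPs: {total:.3%}")  # ~0.12%
print("Estimated overall genetic contribution reported in the study: 8-25%")
```

The point of the contrast is that genome-wide significance identifies individually tiny effects; the bulk of the estimated genetic contribution is spread across many variants that never reach the significance threshold.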
The way the GWAS was conducted was not novel in terms of data collection. Genome-wide studies of similar scale, e.g. on insomnia (Jansen et al. 2019) or blood pressure (Evangelou et al. 2018), often rely on data already collected in biobanks rather than trying to collect hundreds of thousands of individuals’ DNA from scratch. Furthermore, in line with wider developments, the study was preregistered Footnote 2 with an analysis plan for the data to be used by the researchers. Unlike other GWASes, however, the researchers partnered with an LGBTQIA+ advocacy group (GLAAD) and a science communication charity (Sense About Science), where individuals beyond the research team interpreted the findings and discussed how to convey the results Footnote 3. Following these engagements, the researchers produced a website Footnote 4 with potential frequently asked questions as well as a video about the study, highlighting what it does and does not claim.
Despite efforts to keep the study from drifting into genetically deterministic and discriminatory interpretations, it has been criticized by many Footnote 5. Indeed, the controversial “How gay are you?” Footnote 6 app on the GenePlaza website utilized the findings of the study, which in turn raised alarm bells and, ultimately, was taken down after much debate. The application, however, showed how rapidly such findings can translate into individualized systems of categorization, and consequently feed into and be fed by the public imaginary. One of the study authors calls for the research to continue, noting “[s]cientists have a responsibility to describe the human condition in a more nuanced and deeper way” (Maxmen 2019, p. 610). Critics, however, note that the context of the data collected from individuals may have influenced the findings; for instance, past developments (i.e. the decriminalization of homosexuality, the HIV/AIDS epidemic, and the legalization of same-sex marriage) are relevant to understanding the UK Biobank’s donor profile, and if the GWAS were redone according to the birth year of the individuals, different findings could have emerged (Richardson et al. 2019, p. 1461).
It has been pointed out that such research should be assessed by a competent ethical review board according to its potential risks and benefits (Maxmen 2019, p. 610), in addition to the review and approval by the UK Biobank Access Sub-Committee (Ganna et al. 2019a, p. 1461). Another ethical concern raised by critics is that the informed consent form of the UK Biobank does not specify that data could be used for such research, since “homosexuality has long been removed from disease classifications” and the broad consent forms allow only “health-related research” (Holm and Ploug 2019, p. 1460). We do not want to make a statement here for or against broad consent. However, we argue that discussions about informed consent showcase the complexities related to the secondary use of data in research. Similarly, the ‘gay gene’ app developed in the wake of the sexual orientation study revealed the difficulty of controlling how the produced knowledge may be used, including in ways that are openly denounced by the study authors.
To the best of our knowledge, no similar genome-wide studies on sexual orientation have been published and, while we acknowledge the limitations associated with focusing on a single case in our discussion, we see this case as relevant to opening up the following question: How are certain social categorizations incorporated into knowledge production practices? We want to answer this by first revisiting some of the fundamental sociological perspectives on categorizations and the social implications they may have.
Categorizing sex, gender, bodies, disease and knowledge
Sociological perspectives on categorizations
Categorizations and classifications take a central role in the sociology of knowledge, social stratification and data-based knowledge production. Categories like gender, race, sexuality and class (and their intersection, see Crenshaw 1989) have become key classifications for the study of societies and for understanding the reproduction of social order. One of the most influential theories about the intertwining of categories like gender and class with power relations was formulated by Bourdieu (2010, 2001). He claimed that belonging to a certain class or gender is an embodied practice that ensures the reproduction of a social structure which is shaped by power relations. The position of subjects within this structure reflects their acquired cultural capital, such as education. The incorporated dispositions and schemes of perception, appreciation and classification that make up the individual’s habitus are shaped by social structure, which actors reproduce in practice. One key mechanism of social categorization is gender classification. The gender order appears to be in the ‘nature of things’ of biologically different bodies, whereas it is in fact an incorporated social construction that reflects and constitutes power relations. Bourdieu’s theory links the structuring function of classifications with embodied knowledge and demonstrates that categories of understanding are pervaded by societal power relations.
In a similar vein, Foucault (2003, 2005) describes the intertwining of ordering classifications, bodies and power in his study of the clinic. Understandings of and knowledge about the body follow a specific way of looking at it—the ‘medical gaze’ of separating the patient’s body from identity and distinguishing the healthy from the diseased—which, too, is a process pervaded by power differentials. Such classifications evolved historically. Foucault reminds us that all periods in history are characterized by specific epistemological assumptions that shape discourses and manifest in modalities of order that make certain kinds of knowledge, for instance scientific knowledge, possible. The unnoticed “order of things”, as well as the social order, is implemented in classifications. Such categorizations also evolved historically in the discourse about sexuality; in particular, as Foucault pointed out writing in the late 1970s, in distinguishing the sexuality of married couples from other forms, such as homosexuality (Foucault 1998).
Bourdieu and Foucault offer two influential approaches within the wider field of the sociology of knowledge that provide a theoretical framework for how categorizations and classifications structure the world in conjunction with social practice and power relations. Their work demonstrates that such structuration is never free from theory, i.e. classifications do not exist prediscursively, but are embedded within a certain temporal and spatial context that constitutes ‘situated knowledge’ (Haraway 1988). Consequently, classifications create (social) order that cannot be understood as ‘naturally’ given, but as the result of relational social dynamics embedded in power differentials.
Feminist theory in the 1970s emphasized the inherently social dimension of male and female embodiment, distinguishing between biological sex and socially rooted gender. This distinction built the basis for a variety of approaches that examined gender as a social phenomenon, as something that is (re-)constructed in social interaction, impacted by collectively held beliefs and normative expectations. Consequently, the difference between men and women was no longer simply understood as a given biological fact, but as something that is, also, a result of socialization and relational exchanges within social contexts (see, e.g., Connell 2005; Lorber 1994). Belonging to a gender or sex is a complex practice of attribution, assignment, identification and, consequently, classification (Kessler and McKenna 1978). The influential concept of ‘doing gender’ emphasized that not only gender, but also the assignment of sex, is based on socially agreed-upon biological classification criteria that form the basis of placing a person in a sex category, which needs to be practically sustained in everyday life. The analytical distinction between sex and gender eventually became implausible, as it obscures the process in which the body itself is subject to social forces (West and Zimmerman 1991).
In a similar way, sexual behaviour and sexuality are also shaped by society, as societal expectations influence sexual attraction—in many societies within the normative boundaries of gender binary and heteronormativity (Butler 1990). This also had consequences for deviations from the norm, resulting, for example, in the medicalisation of homosexuality (Foucault 1998).
Reference to our illustrative case study on the recently published research into the genetic basis of sexuality brings the relevance of this theorization into focus. The study cautions against the ‘gay gene’ discourse, the use of the findings for prediction, and genetic determinism of sexual orientation, noting “the richness and diversity of human sexuality” and stressing that the results do not “make any conclusive statements about the degree to which ‘nature’ and ‘nurture’ influence sexual preference” (Ganna et al. 2019b , p. 6).
Coming back to categorizations, more recent approaches from STS are also based on the assumption that classifications are a “spatio-temporal segmentation of the world” (Bowker and Star 2000, p. 10), and that classification systems are, in the ideal case, consistent, mutually exclusive and complete, similar to concepts of gender theory (e.g. Garfinkel 1967). The International Classification of Diseases (ICD), a classification scheme of diseases based on their statistical significance, is an example of such a historically grown knowledge system. How the ICD is utilized in practice points to the ethical and social dimensions involved (Bowker and Star 2000). Such approaches help to unravel current epistemological shifts in medical research and intervention, including the removal of homosexuality from the disease classification half a century ago.
Re-classifying diseases in tandem with genetic conditions creates new forms of ‘genetic responsibility’ (Novas and Rose 2000). For instance, this may result in a change of the ‘sick role’ (described early on in Parsons 1951), creating new obligations not only for the diseased but also for actually healthy persons in relation to potential futures. Such genetic knowledge is increasingly produced using large-scale genomic databases and creates new categories based on genetic risk; consequently, it may result in new categories of individuals who are ‘genetically at risk’ (Novas and Rose 2000). The question now is how these new categories will alter, structure or replace evolved categories, in terms of constructing the social world and medical practice.
While advancement in genomics is changing understandings of bodies and diseases, the meanings of certain social categories for medical research remain rather stable. Developments in personalized medicine go along with “the ‘re-inscription’ of traditional epidemiological categories into people’s DNA” and adherence to “old population categories while working out new taxonomies of individual difference” (Prainsack 2015, pp. 28–29). This, again, highlights the fact that knowledge production draws on, and is shaped by, categories that have a political and cultural meaning within a social world that is pervaded by power relations.
From categorization to social implication and intervention
While categorizations are inherently social in their production, their use in knowledge production has wide-ranging implications. Such is the case with the geneticisation of sexual orientation, an issue that has both troubled and comforted LGBTQIA+ communities. Despite the absence of an identified gene, the ‘gay gene’ has been part of societal discourse. Such circulation disseminates an unequal emphasis on biologized interpretations of sexual orientation, which may be portrayed differently in the media and appeal to groups of opposing views in contrasting ways (Conrad and Markens 2001). Geneticisation, especially through the media, moves sexual orientation into an oppositional framework between individual choice and biological consequence (Fausto-Sterling 2007), and there have been mixed opinions within LGBTQIA+ communities as to whether this would resolve the moralization of sexual orientation or be a move back into its medicalisation (Nelkin and Lindee 2004). Thus, while some activists support geneticisation, others resist it and work against the potential medicalisation of homosexuality (Shostak et al. 2008). The ease of communicating to the general public a simple genetic basis for social outcomes that are genetically more complex than reported contributes to the geneticisation process, while scientific failures to replicate ‘genetic basis’ claims do not get reported (Conrad 1999). In other words, while the idea of a genetic basis becomes entrenched in the public imaginary, research showing the opposite does not get an equal share in the media and societal discourse; neither, of course, does the social sciences’ critique of knowledge production that has been discussed for decades.
A widely, and often quantitatively, studied aspect of the geneticisation of sexual orientation is how it plays out in the broader understanding of sexual orientation in society. While there are claims that the geneticisation of sexual orientation can result in a depoliticization of the identities (O’Riordan 2012), it may at the same time lead to a polarization of society. According to social psychologists, genetic attributions to conditions are likely to lead to perceptions of immutability, specificity in aetiology, homogeneity and discreteness, as well as to naturalistic fallacy (Dar-Nimrod and Heine 2011). Despite a multitude of surveys suggesting that belief in a genetic basis of homosexuality correlates with acceptance, some studies suggest that learning about genetic attributions to homosexuality can be polarizing and confirmatory of previously held negative or positive attitudes (Boysen and Vogel 2007; Mitchell and Dezarn 2014). Such conclusions can be taken as a precaution: just as scientific knowledge production is social, its consequences are, too.
Looking beyond the case
We want to exemplify this argument by taking a detour to another case where the intersection between scientific practice, knowledge production and the social environment is of particular interest. While we have discussed the social implications of geneticisation with a focus on sexual orientation, recent developments in biomedical sciences and biotechnology also have the potential to reframe the old debates in entirely different ways. For instance, while ‘designer babies’ were only an imaginary concept until recently, the facility and affordability of processes such as in vitro selection of a baby’s genotype and germline genome editing have potentially important impacts in this regard. When the CRISPR/Cas9 technique was developed for rapid and easy gene editing, both the hopes and worries associated with its use were high. Martin and others (2020, pp. 237–238) claim gene editing is causing both disruption within the postgenomic regime, specifically to its norms and practices, and the convergence of various biotechnologies such as sequencing and editing. Against this background, He Jiankui’s announcement in November 2018 through YouTube Footnote 7 that twins had been born with edited genomes was an unwelcome surprise for many. This unexpected move may have hijacked the discussions on the ethical, legal and societal implications of human germline genome editing, but it also rang alarm bells across the globe about similar “rogue” scientists planning experimentation with the human germline (Morrison and de Saille 2019). The facility to conduct germline editing is, logically, only one step away from ‘correcting’, and a correction would mean a return to a normative state. He’s construction of HIV infection as a genetic risk can be read as a placeholder for numerous questions about human germline editing: Which variations are “valuable” enough for a change in the germline? For instance, there are plans by Denis Rebrikov in Russia to genome-edit embryos to ‘fix’ a mutation that causes congenital deafness (Cyranoski 2019). If legalized, what would be the limits applied, and who would be able to afford such techniques? At a time when genomics research into human sociality is booming, would the knowledge currently produced in this field and others translate into ‘corrective’ genome editing? Who would decide?
The science itself is still unclear at this stage: for many complex conditions, the effect of using gene editing to change one allele to another is often minuscule, considering that numerous alleles altogether may affect phenotypes, while a single allele may affect multiple phenotypes. In another GWAS case, social genomicists claim there are thousands of variations that influence a particular social outcome such as educational attainment (Lee et al. 2018), each having a minimal effect. It has also been shown in the last few years that, as the same study is conducted with ever larger samples, more genomic variants become associated with the social outcome, i.e. 74 single nucleotide polymorphisms (SNPs) associated with the outcome in a sample size of 293,723 (Okbay et al. 2016) and 1271 SNPs associated with the outcome in a sample size of 1.1 million individuals (Lee et al. 2018); the sketch below illustrates the statistical mechanics of this pattern.
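The growth in the number of ‘hits’ with sample size follows directly from the statistics of association testing. The following sketch uses an assumed, hypothetical effect size rather than figures from the studies cited: for a 1-degree-of-freedom association test, the chi-square statistic has non-centrality of roughly n·q, where n is the sample size and q the fraction of trait variance the SNP explains, so power at the genome-wide threshold rises steeply with n.

```python
# Sketch (hypothetical effect size): why larger GWAS samples mechanically
# yield more genome-wide significant SNPs.
from scipy.stats import chi2, ncx2

ALPHA = 5e-8                      # conventional genome-wide significance level
crit = chi2.isf(ALPHA, df=1)      # critical value of the 1-df association test

def power(n: int, q: float) -> float:
    """Power to detect a SNP explaining a fraction q of trait variance
    in a sample of n individuals (non-centrality approximated as n * q)."""
    return ncx2.sf(crit, df=1, nc=n * q)

q = 5e-5  # a hypothetical SNP explaining 0.005% of the variance
for n in (293_723, 1_100_000):
    print(f"n = {n:>9,}: power = {power(n, q):.2f}")
# Roughly 0.05 at n ~ 294k, but ~0.98 at n = 1.1 million: the same tiny
# effect goes from near-invisible to a near-certain 'hit'.
```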
If this reasoning is applied to the GWAS on same-sex sexual behaviour, it is highly probable that the findings will be superseded in the coming years by similar studies with bigger data, increasing the number of associations.
A genomic re-thinking?
The examples outlined here have served to show how focusing the discussion on “genetic determinism” is fruitless, considering the complexity of knowledge production practices and how the produced knowledge can both mirror social dynamics and shape them further. A genomic rethinking of the social necessitates a new formulation of social equality, in which genomes are also relevant. Within the work of social genomics researchers, there has been cautious optimism about the contribution of findings from genomics research to understanding the social outcomes of policy change (Conley and Fletcher 2018; Lehrer and Ding 2019). Two fundamental thoughts govern this thinking. First, a genetic basis is not to be equated with fate; in other words, ‘genetic predispositions’ make sense only within the broader social and physical environmental frame, which often allows room for intervention. Second, genetics often relates to the heterogeneity of individuals within a population, in such a way that the same policy may be positive, neutral or negative for different individuals due to their genes. In this respect, knowledge gained via social genomics may be imagined as a basis for a more equal society in ‘uncovering’ invisible variables, while, paradoxically, it may also be a justification for the exclusion of certain groups. For example, a case that initially raised the possibility that policies affect individuals differently because of their genetic background was a genetic variant correlated with being unaffected by tax increases on tobacco (Fletcher 2012). The study suggested that raising taxes may be an ineffective tool for lowering smoking rates below a certain level, since those who continue to smoke may be those who cannot easily stop due to their genetic predisposition to smoking. Similar ideas could also apply to the diverse array of knowledge produced in social genomics, where policies may come under scrutiny according to how they are claimed to variably influence the members of a society due to their genetics.
Datafication of scientific knowledge production
From theory to data-driven science
More than a decade has gone by since Savage and Burrows (2007) described a crisis in empirical research, in which the well-developed methodologies for collecting data about the social world would become marginal as such data are increasingly generated and collected as a by-product of daily virtual transactions. Today, sociological research faces a widely datafied world, where (big) data analytics are profoundly changing the paradigm of knowledge production, as Facebook, Twitter, Google and others produce large amounts of socially relevant data. A similar phenomenon is taking place through the opportunities that public and private biobanks, such as the UK Biobank or 23andMe, offer. Crossing the boundaries of the social and biological sciences is facilitated through mapping correlations between genomic data and data on social behaviour or outcomes.
This shift from theory to data-driven science misleadingly implies a purely inductive knowledge production, neglecting the fact that data are not produced free of preceding theoretical framing, methodological decisions, technological conditions and the interpretation of correlations—i.e. an assemblage situated within a specific place, time, political regime and cultural context (Kitchin 2014a). It glosses over the fact that data cannot simply be treated as raw materials, but rather as “inherently partial, selective and representative”, the collection of which has consequences (Kitchin 2014b, p. 3). How knowledge of the body is generated starts with how data are produced and how they are used and mobilized. Through sequencing, biological samples are translated into digital data that are circulated, merged and correlated with other data. With the translation from genes into data, their meaning also changes (Saukko 2017). The kind of knowledge that is produced is likewise not free of scientific and societal concepts.
Categorical variables individually assigned to genomes have become important for genomic research and impact the ways in which identities are conceptualized under (social) genomic conditions. These characteristics include those of social identity, such as gender, ethnicity, and educational and socioeconomic status. Based on demographic and ascribed social characteristics, they are often used for the study of human genetic variation and individual differences, with the aim of advancing personalized medicine.
The sexual orientation study that is central to this paper can be read as a case where such categories intersect with the mode of knowledge production. As the largest contributor of data to the study, the UK Biobank is revealing: the data used in this research are based on the answer to the question “Have you ever had sexual intercourse with someone of the same sex?”, posed along with the statement “Sexual intercourse includes vaginal, oral or anal intercourse.” Footnote 8.
Furthermore, the authors accept that they made numerous reductive assumptions and that their study has methodological limitations. For instance, Ganna et al. (2019b) acknowledge both within the article (p. 1) and on an accompanying website Footnote 9 that the research is based on a binary ‘sex’ system that excludes non-complying groups, as the authors report that they “dropped individuals from [the] study whose biological sex and self-identified sex/gender did not match” (p. 2). However, both categorizing sexual orientation mainly on practice rather than on attraction or desire, and building it on normative assumptions about sexuality, i.e. gender binary and heteronormativity, are problematic, as sexual behaviour is diverse and does not necessarily correspond with such assumptions.
The variations found in the sexual orientation study, as is true for other genome-wide association studies, are relevant mainly for the populations studied, and in this case those populations belong to certain age groups and European ancestry. While the study avoids critique by stating that the research concerns the genetics not of sexual orientation but of same-sex sexual behaviour, whether such a genomic study would even be possible is also questionable. This example demonstrates that, despite the increasing influence of big data, a fundamental problem with the datafication of many social phenomena is whether or not they are amenable to measurement. In the case of sexual orientation, whether the answers to the sexual orientation questions correspond to “homosexuality” or to “willingness to reveal homosexuality”/“stated sexual orientation” is debatable, considering the social pressure and stigma that may be an element in certain social contexts (Conley 2009, p. 242).
While our aim is to bring a social scientific perspective, biologists have raised at least two different critical opinions on the knowledge production practice in the case of the sexual orientation study: first, on the implications of the produced knowledge Footnote 10 and, second, on the problems and flaws of the search for a genetic basis Footnote 11. In STS, however, genetic differences that were hypothesized to be relevant for health, especially under the category of race in the US, have been a major point of discussion within the genomic ‘inclusion’ debates of the 1990s (Reardon 2017, p. 49; Bliss 2015). In other words, a point of criticism of this knowledge production was its focus on certain “racial” or racialized groups, such as Americans of European ancestry, which supposedly biased the findings and the downstream development of therapies for ‘other’ groups. However, measuring health and medical conditions against the background of groups that are constituted based on social or cultural categories (e.g. age, gender, ethnicity) may also result in a reinscription/reconstitution of the social inequalities attached to these categories (Prainsack 2015), and at the same time result in health justice becoming a topic seen through a postgenomics lens, where postgenomics is “a frontline weapon against inequality” (Bliss 2015, p. 175). Socio-economic factors may recede into the background, while data, with their own often invisible politics, are foregrounded.
Unlike what Savage and Burrows suggested in 2007, the coming crisis can be seen not only as a crisis of sociology, but of science in general. Just as the shift of focus in the social sciences towards digital data is only one part of the picture, another part could be the developments in the genomisation of the social. Considering that censuses and large-scale statistics are not new, what distinguishes the current phenomenon is possibly the opportunity to individualize the data, while the categories themselves are often unable to capture the complexity, despite producing knowledge more efficiently. In that sense, the above-mentioned survey questions do not do justice to the complexity of social behaviour. What is most important to flag within these transformations is the lack of reflexivity regarding how big data comes to represent the world, and whether it adds to and/or takes away from the ways of knowing that preceded big data. These developments and directions of genetic-based research and big data go far beyond the struggle of a discipline, namely sociology, with a paradigm shift in empirical research. They could set the stage for real consequences for individuals and groups. Just as the definition of an editable condition happens as a social process that relies on socio-political categories, the knowledge acquired from big data relies in a similar way on the same kinds of categories.
The data choices and restrictions: ‘Free from theory’ or freedom of choice
Data, broadly understood, have become a fundamental part of our lives, from accepting and granting different kinds of consent for our data to travel on the internet, to gaining the ‘right to be forgotten’ in certain countries, as well as being able to retrieve collected information about ourselves from states, websites, even supermarket chains. While becoming part of our lives, the data collected about individuals in the form of big data are transferred between academic and non-academic research, scientific and commercial enterprises. The associated changes in knowledge production have important consequences for the ways in which we understand and live in the world (Jasanoff 2004). The co-productionist perspective in this sense does not relate to whether or how the social and the biological are co-produced; rather, it points to how knowledge produced in science is both shaped by and shaping societies. Thus, the increasing impact and authority of big data in general, and within the sexual orientation study in focus here, open up new avenues to claim, as some suggest, that we have reached the end of theory.
The “end of theory” has been actively debated within and beyond science. Kitchin (2014a) locates the recent origin of this debate in a piece in Wired, where the author states “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson 2008). Others call this a paradigm shift towards data-intensive research, leaving behind the empirical and theoretical stages (Gray 2009, p. xviii). While Google and others form the basis for this data-driven understanding in their predictive capacity or in letting the data speak, the idea that knowledge production is ‘free from theory’ in this case seems, at best, to ignore the data infrastructure and how the categories are formed within it.
Taking a deeper look at the same-sex sexual behaviour study from this angle suggests that such research cannot be free from theory, as it has to make an assumption regarding the role of genetics in the context of social dynamics. In other words, it has to move sexual orientation, at least partially in the form of same-sex sexual behaviour, out of the domain of the social and towards the biological. In doing so, just as the study concludes on the complexity of sexual orientation, the authors note in the informative video Footnote 12 on their website that “they found that about a third of the differences between people in their sexual behaviour could be explained by inherited genetic factors. But the environment also plays a large role in shaping these differences.” While the study points to a minuscule component of the biological, it also frames biology as the basis on which the social, as part of the environment, acts.
Reconsidering how the biological and the social are represented in the study, three theoretical choices are made due to the limitations of the data. First of all, the biological is taken to be “the genome-wide data” in the biobanks that the study relies on. This means sexual orientation is assumed to reside within the SNPs, points on the genome that are common variations across a population, and not in other kinds of variation that are rare or not captured by the genotyped SNPs. These differences include, but are not limited to, large- to small-scale duplications and deletions of genomic regions, rare variants, or even common variants in the population that the SNP chips do not capture. Such ignored differences are very important for a number of conditions, from cancer to neurobiology. Similarly, the genomic focus leaves aside the epigenetic factors that could theoretically be the missing link between genomes and environments. In noting this, we do not suggest that the authors of the study are unaware of or uninterested in epigenetics; however, regardless of their interest and/or knowledge, the availability of large-scale genome-wide data puts such data ahead of any other variation in the genome and epigenome. In other words, if the UK Biobank and 23andMe had similar amounts of epigenomic or whole-genome data beyond the SNPs, the study would most probably have relied on these other variations in the genome. The search for a genetic basis within SNPs is a theoretical choice, and in this case this choice is pre-determined by the limitations of the data infrastructures.
The second choice the authors make is to take three survey questions, in the case of the UK Biobank data, as encompassing enough of the complexity of sexual orientation for their research. As partly discussed earlier, these questions simply ask about sexual behaviour. Based on the UK Biobank’s definition of sexual intercourse as “vaginal, oral or anal intercourse”, the answers to the following questions were relevant for the research: “Have you ever had sexual intercourse with someone of the same sex?” (Data-Field 2159), “How many sexual partners of the same sex have you had in your lifetime?” (Data-Field 3669), and “About how many sexual partners have you had in your lifetime?” (Data-Field 2149). Answers to such questions do little justice to the complexity of the topic; the sketch below illustrates how much is collapsed in turning them into a phenotype. Considering that they were not included in the biobank for the purpose of identifying a genetic basis of same-sex sexual behaviour, there is much to consider regarding the capacity in which they are useful for that. It is worth noting here that the UK Biobank is primarily focused on health-related research, and thus these three survey questions could not have been asked with a genomic exploration of ‘same-sex sexual behaviour’ or ‘sexual orientation’ in mind. The degree of success in the way they have been used to identify the genetic basis for complex social behaviours is questionable.
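To illustrate the reduction involved, the following hypothetical sketch (not the authors’ pipeline; the column names and example rows are invented) shows how answers to the three survey questions collapse into the binary phenotype a GWAS requires: a respondent with a single same-sex encounter and a respondent with an exclusively same-sex history end up in the same category.

```python
# Hypothetical sketch of phenotype construction from the three survey
# questions (Data-Fields 2159, 3669, 2149); example rows are invented.
import pandas as pd

respondents = pd.DataFrame({
    "ever_same_sex":     [1, 0, 1, 0],    # Data-Field 2159 (yes = 1, no = 0)
    "same_sex_partners": [1, 0, 14, 0],   # Data-Field 3669
    "lifetime_partners": [25, 3, 14, 1],  # Data-Field 2149
})

# The GWAS phenotype reduces to the yes/no answer to Data-Field 2159:
# attraction, desire, identity and life history all disappear from view.
respondents["phenotype"] = respondents["ever_same_sex"]

print(respondents)
# Rows 0 and 2 receive the same phenotype despite very different histories.
```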
The authors of the study consider the UK Biobank sample to comprise relatively old individuals and regard this as a shortcoming Footnote 13. Similarly, the study authors claim that the 23andMe sample may be biased because “[i]ndividuals who engage in same-sex sexual behaviour may be more likely to self-select the sexual orientation survey”, which would then explain the high percentage of such individuals (18.9%) (Ganna et al. 2019b, p. 1). However, the authors do not problematize the fact that there is at least a three-fold difference between the youngest and oldest generations in the UK Biobank sample in their responses to the same-sex sexual behaviour question (Ganna et al. 2019b, p. 2). The study thus highlights the problematic issue of who should be regarded as the representative sample to be asked about their “same-sex sexual behaviour”. Still, this is a data choice that the authors make in drawing a universal explanation from a very specific and socially constrained collection of self-reported data that encompasses only part of what the researchers are interested in.
The third choice is a choice unmade. The study data mainly came from the UK Biobank, following a proposal by Brendan Zietsch with the title “Direct test whether genetic factors predisposing to homosexuality increase mating success in heterosexuals” Footnote 14. The original plan for the research frames “homosexuality” as a condition that heterosexuals can be “predisposed” to, and, as this condition is not eliminated through evolution, the scientists hypothesize that whatever genetic variation predisposes an individual to homosexuality may also function to increase the individual’s reproductive capacity. Despite using such an evolutionary explanation as the theoretical basis for obtaining the data from the UK Biobank, the authors use “evolution”/“evolutionary” only three times in the article, whereas the concept of “mating success” is entirely missing. Contrary to the expectation in the research proposal, the authors observe a lower number of offspring for individuals reporting same-sex sexual behaviour, and they conclude briefly: “This reproductive deficit raises questions about the evolutionary maintenance of the trait, but we do not address these here” (Ganna et al. 2019b, p. 2). In other words, the hypothesis that allowed the scientists to acquire the UK Biobank data became irrelevant when they reported their findings.
In this section, we have performed an analysis of how data choices are made at different steps of the research and hinted at how these choices reflect certain understandings of how society functions. These are evident in the ways sexual behaviour is represented and categorized according to quantitative data, and in the considerations of whether certain samples are contemporary enough (UK Biobank) or too self-selecting (reported same-sex sexual behaviour being high in 23andMe). The study, however, does not problematize how the percentage of individuals reporting same-sex sexual behaviour steadily increases according to year of birth, at least tripling for males and increasing more than five-fold for females between 1940 and 1970 (for the UK Biobank). Such details are among the data that the authors display as descriptive statistics in Fig. 1 (Ganna et al. 2019b, p. 2); however, they do not attract the discussion that the genomic data receive. The study itself starts from the idea that genetic markers associated with same-sex sexual behaviour could have an evolutionary advantage and ends by saying the behaviour is complex. Critics claim the “approach [of the study] implies that it is acceptable to issue claims of genetic drivers of behaviours and then lay the burden of proof on social scientists to perform post-hoc socio-cultural analysis” (Richardson et al. 2019, p. 1461).
Conclusions
In this paper, we have ‘moved back to the future’—taking stock of the present-day accelerated impact of big data and of its potential and real consequences. Using the sexual orientation GWAS as a point of reference, we have shown that claims to working under the premise of a ‘pure science’ of genomics are untenable, as the social is present by default—within the methodological choices made by the researchers, in the impact on/of the social imaginary, and in the epigenetic context.
By focusing on the contingency of knowledge production on social categories that are themselves reflections of the social in data practices, we have highlighted the relational processes at the root of knowledge production. We are experiencing a period in which the repertoire of what gets quantified continuously, and possibly exponentially, increases; however, this does not necessarily mean that our understanding of complexity increases at the same rate. Rather, it may lead to unintended simplification, where meaningful levels of understanding of causality are lost in the “triumph of correlations” in big data (Mayer-Schönberger and Cukier 2013; cited in Leonelli 2014). While sociology has much to offer through its qualitative roots, we think it should do more than critique, especially considering that culturally and temporally specific understandings of the social are also linked to socio-material consequences.
We want to highlight that now is the time to think about these broader developments in science and society, not merely from an external perspective, but within a new framework. Clearly, our discussion of a single case here cannot sustain suggestions for a comprehensive framework applicable to any study; however, we can flag the urgency of its requirement. We have shown that, in the context of the rapid developments within big data-driven and socio-genomic research, it is necessary to renew the argument for bringing the social, and its interrelatedness with the biological, clearly back into focus. We strongly believe that reemphasizing this argument is essential to underline the analytical strength of the social science perspective, and to avoid losing sight of the complexity of social phenomena, which risk being oversimplified in mainly statistical data-driven science.
We can also identify three interrelated dimensions of scientific practice that the framework would valorize: (1) recognition of the contingency of choices made within the research process, and sensitivity to their consequent impact within the social context; (2) ethical responsibilities that move beyond procedural contractual requirements, towards sustaining a process rooted in a clear understanding of societal environments; (3) interdisciplinarity in analytical practice that potentiates the impact of each perspectival lens.
Such a framework would facilitate moving out of the disciplinary or institutionalized silos of ELSI, STS, sociology, genetics, or even emerging social genomics. Rather than competing for authority on ‘the social’, the aim should be to critically complement each other and refract the produced knowledge with a multiplicity of lenses. Zooming ‘back to the future’ within the field of socio-biomedical science, we would flag the necessity of re-calibrating to a multi-perspectival endeavour—one that does justice to the complex interplay of social and biological processes within which knowledge is produced.
The GWAS primarily uses the term “same-sex sexual behaviour” as one of the facets of “sexual orientation”, where the former becomes the component that is directly associable with the genes and the latter the broader phenomenon of interest. Thus, while the article refers to “same-sex sexual behaviour” in its title, it is editorially presented in the same Science issue under the Human Genetics heading with the subheading “The genetics of sexual orientation” (p. 880) (see Funk 2019). Furthermore, the request for data from UK Biobank by the corresponding author Brendan P. Zietsch (see footnote 14) refers only to sexual orientation and homosexuality and not to same-sex sexual behaviour. Therefore, we follow the same interchangeable use in this article.
Source: https://osf.io/xwfe8 (04.03.2020).
Source: https://www.wsj.com/articles/research-finds-genetic-links-to-same-sex-behavior-11567101661 (04.03.2020).
Source: https://geneticsexbehavior.info (04.03.2020).
In addition to footnotes 10 and 11, for a discussion please see: https://www.nytimes.com/2019/08/29/science/gay-gene-sex.html (04.03.2020).
Later “122 Shades of Grey”: https://www.geneplaza.com/app-store/72/preview (04.03.2020).
Source: https://www.youtube.com/watch?v=th0vnOmFltc (04.03.2020).
Source: http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=2159 (04.03.2020).
Source: https://geneticsexbehavior.info/ (04.03.2020).
Source: https://www.broadinstitute.org/blog/opinion-big-data-scientists-must-be-ethicists-too (04.03.2020).
Source: https://medium.com/@cecilejanssens/study-finds-no-gay-gene-was-there-one-to-find-ce5321c87005 (03.03.2020).
Source: https://videos.files.wordpress.com/2AVNyj7B/gosb_subt-4_dvd.mp4 (04.03.2020).
Source: https://geneticsexbehavior.info/what-we-found/ (04.03.2020).
Source: https://www.ukbiobank.ac.uk/2017/04/direct-test-whether-genetic-factors-predisposing-to-homosexuality-increase-mating-success-in-heterosexuals/ (04.03.2020).
Abdellaoui A, Hugh-Jones D, Yengo L, Kemper KE, Nivard MG, Veul L, Holtz Y, Zietsch BP, Frayling TM, Wray NR (2019) Genetic correlates of social stratification in Great Britain. Nat Hum Behav 1–21. https://doi.org/10.1038/s41562-019-0757-5
Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete, Wired https://www.wired.com/2008/06/pb-theory/ . Accessed 31 Mar 2020
Bliss C (2015) Defining health justice in the postgenomic era. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham, Durham/London, pp. 174–191
Bourdieu P (2001) Masculine domination. Stanford University Press, Stanford
Bourdieu P (2010) Distinction: a social critique of the judgement of taste. Routledge, London/New York
Bowker GC, Star SL (2000) Sorting things out: classification and its consequences. MIT Press, Cambridge/London
Boysen GA, Vogel DL (2007) Biased assimilation and attitude polarization in response to learning about biological explanations of homosexuality. Sex Roles 57(9–10):755–762. https://doi.org/10.1007/s11199-007-9256-7
Butler J (1990) Gender trouble. Feminism and the subversion of identity. Routledge, New York
Clarke AE, Shim JK, Shostak S, Nelson A (2013) Biomedicalising genetic health, diseases and identities. In: Atkinson P, Glasner P, Lock M (eds) Handbook of genetics and society: mapping the new genomc era. Routledge, Oxon, pp. 21–40
Conley D (2009) The promise and challenges of incorporating genetic data into longitudinal social science surveys and research. Biodemogr Soc Biol 55(2):238–251. https://doi.org/10.1080/19485560903415807
Conley D, Fletcher J (2018) The genome factor: what the social genomics revolution reveals about ourselves, our history, and the future. Princeton University Press, Princeton/Oxford
Connell RW (2005) Masculinities. Polity, Cambridge
Conrad P (1999) A mirage of genes. Sociol Health Illn 21(2):228–241. https://doi.org/10.1111/1467-9566.00151
Conrad P, Markens S (2001) Constructing the ‘gay gene’ in the news: optimism and skepticism in the US and British press. Health 5(3):373–400. https://doi.org/10.1177/136345930100500306
Crenshaw K (1989) Demarginalizing the intersection of race and sex: a black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, vol 1989(8). University of Chicago Legal Forum. http://chicagounbound.uchicago.edu/uclf/vol1989/iss1/8 . Accessed 1 Apr 2020
Cyranoski D (2019) Russian ‘CRISPR-baby’ scientist has started editing genes in human eggs with goal of altering deaf gene. Nature 574(7779):465–466. https://doi.org/10.1038/d41586-019-03018-0
Dar-Nimrod I, Heine SJ (2011) Genetic essentialism: on the deceptive determinism of DNA. Psychol Bull 137(5):800–818. https://doi.org/10.1037/a0021860
Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, Ntritsos G, Dimou N, Cabrera CP, Karaman I (2018) Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet 50(10):1412–1425. https://doi.org/10.1038/s41588-018-0205-x
Fausto-Sterling A (2007) Frameworks of desire. Daedalus 136(2):47–57. https://doi.org/10.1162/daed.2007.136.2.47
Fletcher JM (2012) Why have tobacco control policies stalled? Using genetic moderation to examine policy impacts. PLoS ONE 7(12):e50576. https://doi.org/10.1371/journal.pone.0050576
Foucault M (1998) The history of sexuality 1: the will to knowledge. Penguin Books, London
Foucault M (2003) The birth of the clinic. Routledge, London/New York
Foucault M (2005) The order of things. Routledge, London/New York
Fox Keller E (2014) From gene action to reactive genomes. J Physiol 592(11):2423–2429. https://doi.org/10.1113/jphysiol.2014.270991
Fox Keller E (2015) The postgenomic genome. In: Richardson SS, Stevens H (eds) Postgenomics: perspectives on biology after the genome. Duke University Press, Durham/London, pp. 9–31
Funk M (2019) The genetics of sexual orientation. Science 365(6456):878–880. https://doi.org/10.1126/science.365.6456.878-k
Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019a) Genome studies must account for history—response. Science 366(6472):1461–1462. https://doi.org/10.1126/science.aaz8941
Ganna A, Verweij KJ, Nivard MG, Maier R, Wedow R, Busch AS, Abdellaoui A, Guo S, Sathirapongsasuti JF, Lichtenstein P (2019b) Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior. Science 365(6456):eaat7693. https://doi.org/10.1126/science.aat7693
Garfinkel H (1967) Studies in ethnomethodology. Polity Press, Cambridge
Gray J (2009) Jim Gray on eScience: a transformed scientific method. In: Hey T, Tansley S, Tolle KM (eds) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond, pp. xvii–xxxi
Griffiths DA (2016) Queer genes: realism, sexuality and science. J Crit Realism 15(5):511–529. https://doi.org/10.1080/14767430.2016.1210872
Hamer DH, Hu S, Magnuson VL, Hu N, Pattatucci AM (1993) A linkage between DNA markers on the X chromosome and male sexual orientation. Science 261(5119):321–327. https://doi.org/10.1126/science.8332896
Haraway D (1988) Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem Stud 14(3):575–599
Holm S, Ploug T (2019) Genome studies reveal flaws in broad consent. Science 366(6472):1460–1461. https://doi.org/10.1126/science.aaz3797
Howard HC, van El CG, Forzano F, Radojkovic D, Rial-Sebbag E, de Wert G, Borry P, Cornel MC (2018) One small edit for humans, one giant edit for humankind? Points and questions to consider for a responsible way forward for gene editing in humans. Eur J Hum Genet 26(1):1. https://doi.org/10.1038/s41431-017-0024-z
Jansen PR, Watanabe K, Stringer S, Skene N, Bryois J, Hammerschlag AR, de Leeuw CA, Benjamins JS, Muñoz-Manchado AB, Nagel M, Savage JE, Tiemeier H, White T, Agee M, Alipanahi B, Auton A, Bell RK, Bryc K, Elson SL, Fontanillas P, Furlotte NA, Hinds DA, Huber KE, Kleinman A, Litterman NK, McCreight JC, McIntyre MH, Mountain JL, Noblin ES, Northover CAM, Pitts SJ, Sathirapongsasuti JF, Sazonova OV, Shelton JF, Shringarpure S, Tian C, Wilson CH, Tung JY, Hinds DA, Vacic V, Wang X, Sullivan PF, van der Sluis S, Polderman TJC, Smit AB, Hjerling-Leffler J, Van Someren EJW, Posthuma D, and the 23andMe Research Team (2019) Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat Genet 51(3):394–403. https://doi.org/10.1038/s41588-018-0333-3
Jasanoff S (2004) The idiom of co-production. In: Jasanoff S (ed.) States of knowledge: the co-production of science and social order. Routledge, London, p 1–12
Jasanoff S, Hurlbut JB (2018) A global observatory for gene editing. Nature 555:435–437. https://doi.org/10.1038/d41586-018-03270-w
Kessler SJ, McKenna W (1978) Gender: an ethnomethodological approach. John Wiley & Sons, New York
Kitchin R (2014a) Big Data, new epistemologies and paradigm shifts. Big Data Soc. https://doi.org/10.1177/2053951714528481
Kitchin R (2014b) The data revolution. Big data, open data, data infrastructures and their consequences. Sage, London
Landecker H (2016) The social as signal in the body of chromatin. Sociol Rev 64(1_suppl):79–99. https://doi.org/10.1111/2059-7932.12014
Landecker H, Panofsky A (2013) From social structure to gene regulation, and back: a critical introduction to environmental epigenetics for sociology. Annu Rev Sociol 39:333–357. https://doi.org/10.1146/annurev-soc-071312-145707
Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK (2018) Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment. Nat Genet 50(8):1112. https://doi.org/10.1038/s41588-018-0147-3
Lehrer SF, Ding W (2019) Can social scientists use molecular genetic data to explain individual differences and inform public policy? In: Foster G (ed.) Biophysical measurement in experimental social science research. Academic Press, London/San Diego/Cambridge/Oxford, pp. 225–265
Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data Soc. https://doi.org/10.1177/2053951714534395
Lorber J (1994) Paradoxes of gender. Yale University Press, New Haven
Martin P, Morrison M, Turkmendag I, Nerlich B, McMahon A, de Saille S, Bartlett A (2020) Genome editing: the dynamics of continuity, convergence, and change in the engineering of life. New Genet Soc 39(2):219–242. https://doi.org/10.1080/14636778.2020.1730166
Maxmen A (2019) Controversial ‘gay gene’ app provokes fears of a genetic Wild West. Nature 574(7780):609–610. https://doi.org/10.1038/d41586-019-03282-0
Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, Boston/New York
Meloni M (2014) Biology without biologism: social theory in a postgenomic age. Sociology 48(4):731–746. https://doi.org/10.1177/0038038513501944
Meloni M (2016) Political biology: Science and social values in human heredity from eugenics to epigenetics. Palgrave Macmillan, n.p.p
Mitchell RW, Dezarn L (2014) Does knowing why someone is gay influence tolerance? Genetic, environmental, choice, and “reparative” explanations. Sex Cult 18(4):994–1009. https://doi.org/10.1007/s12119-014-9233-6
Morrison M, de Saille S (2019) CRISPR in context: towards a socially responsible debate on embryo editing. Palgrave Commun 5(1):1–9. https://doi.org/10.1057/s41599-019-0319-5
Nelkin D, Lindee MS (2004) The DNA mystique: the gene as a cultural icon. University of Michigan Press, Ann Arbor
Novas C, Rose N (2000) Genetic risk and the birth of the somatic individual. Econ Soc 29(4):485–513. https://doi.org/10.1080/03085140050174750
O’Riordan K (2012) The life of the gay gene: from hypothetical genetic marker to social reality. J Sex Res 49(4):362–368. https://doi.org/10.1080/00224499.2012.663420
Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, Rietveld CA, Turley P, Chen G-B, Emilsson V, Meddens SFW (2016) Genome-wide association study identifies 74 loci associated with educational attainment. Nature 533(7604):539–542. https://doi.org/10.1038/nature17671
Parry B, Greenhough B (2017) Bioinformation. Polity Press, Cambridge
Parsons T (1951) The social system. Free Press, New York
Prainsack B (2015) Is personalized medicine different? (Reinscription: the sequel) A response to Troy Duster. Br J Sociol 66(1):28–35. https://doi.org/10.1111/1468-4446.12117
Reardon J (2017) The postgenomic condition: ethics, justice, and knowledge after the genome. University of Chicago Press, Chicago/London
Rehmann-Sutter C, Mahr D (2016) The lived genome. In: Whitehead A, Woods A (eds) Edinburgh companion to the critical medical humanities. Edinburgh University Press, Edinburgh, pp. 87–103
Richardson SS, Borsa A, Boulicault M, Galka J, Ghosh N, Gompers A, Noll NE, Perret M, Reiches MW, Sandoval JCB (2019) Genome studies must account for history. Science 366(6472):1461. https://doi.org/10.1126/science.aaz6594
Ruckenstein M, Schüll ND (2017) The datafication of health. Annu Rev Anthropol 46:261–278. https://doi.org/10.1146/annurev-anthro-102116-041244
Saukko P (2017) Shifting metaphors in direct-to-consumer genetic testing: from genes as information to genes as big data. New Genet Soc 36(3):296–313. https://doi.org/10.1080/14636778.2017.1354691
Savage M, Burrows R (2007) The coming crisis of empirical sociology. Sociology 41(5):885–899. https://doi.org/10.1177/0038038507080443
Shostak S, Conrad P, Horwitz AV (2008) Sequencing and its consequences: path dependence and the relationships between genetics and medicalization. Am J Sociol 114(S1):S287–S316. https://doi.org/10.1086/595570
Thomas WJ, Thomas DS (1929) The child in America. Behavior problems and programs. Knopf, New York
West C, Zimmerman DH (1991) Doing gender. In: Lorber J, Farrell SA (eds) The social construction of gender. Sage, Newbury Park/London, pp. 13–37
Acknowledgements
Open access funding provided by University of Vienna. The authors thank Brígida Riso for contributing to a previous version of this article.
Author information
These authors contributed equally: Melanie Goisauf, Kaya Akyüz, Gillian M. Martin.
Authors and Affiliations
Department of Science and Technology Studies, University of Vienna, Vienna, Austria
Melanie Goisauf & Kaya Akyüz
BBMRI-ERIC, Graz, Austria
Department of Sociology, University of Malta, Msida, Malta
Gillian M. Martin
Corresponding authors
Correspondence to Melanie Goisauf or Kaya Akyüz .
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article
Goisauf, M., Akyüz, K. & Martin, G.M. Moving back to the future of big data-driven research: reflecting on the social in genomics. Humanit Soc Sci Commun 7, 55 (2020). https://doi.org/10.1057/s41599-020-00544-5
Received: 15 November 2019
Accepted: 09 July 2020
Published: 04 August 2020
DOI: https://doi.org/10.1057/s41599-020-00544-5
- Survey paper
- Open access
- Published: 04 June 2019
Uncertainty in big data analytics: survey, opportunities, and challenges
- Reihaneh H. Hariri (ORCID: orcid.org/0000-0003-2173-1331)
- Erik M. Fredericks
- Kate M. Bowers
Journal of Big Data, volume 6, Article number: 44 (2019)
Big data analytics has gained wide attention from both academia and industry as the demand for understanding trends in massive datasets increases. Recent developments in sensor networks, cyber-physical systems, and the ubiquity of the Internet of Things (IoT) have increased the collection of data (including health care, social media, smart cities, agriculture, finance, education, and more) to an enormous scale. However, the data collected from sensors, social media, financial records, etc. is inherently uncertain due to noise, incompleteness, and inconsistency. The analysis of such massive amounts of data requires advanced analytical techniques for efficiently reviewing and/or predicting future courses of action with high precision and advanced decision-making strategies. As the amount, variety, and speed of data increase, so too does the uncertainty inherent within it, leading to a lack of confidence in the resulting analytics process and the decisions made from it. In comparison to traditional data techniques and platforms, artificial intelligence techniques (including machine learning, natural language processing, and computational intelligence) provide more accurate, faster, and more scalable results in big data analytics. Previous research and surveys on big data analytics tend to focus on one or two techniques or specific application domains. However, little work has been done on uncertainty as it applies to big data analytics, or on how uncertainty affects the artificial intelligence techniques applied to such datasets. This article reviews previous work in big data analytics and presents a discussion of open challenges and future directions for recognizing and mitigating uncertainty in this domain.
Introduction
According to the National Security Agency, the Internet processes 1826 petabytes (PB) of data per day [ 1 ]. In 2018, the amount of data produced every day was 2.5 quintillion bytes [ 2 ]. The International Data Corporation (IDC) previously estimated that the amount of generated data would double every 2 years [ 3 ]; indeed, 90% of all the data in the world was generated over the last 2 years. Google now processes more than 40,000 searches every second, or 3.5 billion searches per day [ 2 ]. Facebook users upload 300 million photos, 510,000 comments, and 293,000 status updates per day [ 2 , 4 ]. Needless to say, the amount of data generated on a daily basis is staggering. As a result, techniques are required to analyze and understand this massive amount of data, as it is a great source from which to derive useful information.
Advanced data analysis techniques can be used to transform big data into smart data for the purposes of obtaining critical information regarding large datasets [ 5 , 6 ]. As such, smart data provides actionable information and improves decision-making capabilities for organizations and companies. For example, in the field of health care, analytics performed upon big datasets (provided by applications such as Electronic Health Records and Clinical Decision Systems) may enable health care practitioners to deliver effective and affordable solutions for patients by examining trends in the overall history of the patient, in comparison to relying on evidence provided with strictly localized or current data. Big data analysis is difficult to perform using traditional data analytics [ 7 ], as they can lose effectiveness due to the five V’s characteristics of big data: high volume, low veracity, high velocity, high variety, and high value [ 7 , 8 , 9 ]. Moreover, many other characteristics exist for big data, such as variability, viscosity, validity, and viability [ 10 ]. Several artificial intelligence (AI) techniques, such as machine learning (ML), natural language processing (NLP), computational intelligence (CI), and data mining, were designed to provide big data analytic solutions, as they can be faster, more accurate, and more precise for massive volumes of data [ 8 ]. The aim of these advanced analytic techniques is to discover information, hidden patterns, and unknown correlations in massive datasets [ 7 ]. For instance, a detailed analysis of historical patient data could lead to the detection of a destructive disease at an early stage, thereby enabling either a cure or a more effective treatment plan [ 11 , 12 ]. Additionally, risky business decisions (e.g., entering a new market or launching a new product) can profit from simulations that support better decision-making [ 13 ].
While big data analytics using AI holds a lot of promise, a wide range of challenges are introduced when such techniques are subjected to uncertainty. For instance, each of the V characteristics introduces numerous sources of uncertainty, such as unstructured, incomplete, or noisy data. Furthermore, uncertainty can be embedded in the entire analytics process (e.g., collecting, organizing, and analyzing big data). For example, dealing with incomplete and imprecise information is a critical challenge for most data mining and ML techniques. In addition, an ML algorithm may not obtain the optimal result if the training data is biased in any way [ 14 , 15 ]. Wang et al. [ 16 ] introduced six main challenges in big data analytics, including uncertainty. They focus mainly on how uncertainty impacts the performance of learning from big data, whereas a separate concern lies in mitigating the uncertainty inherent within a massive dataset. These challenges are normally present in data mining and ML techniques. Scaling these concerns up to the big data level will effectively compound any errors or shortcomings of the entire analytics process. Therefore, mitigating uncertainty in big data analytics must be at the forefront of any automated technique, as uncertainty can have a significant influence on the accuracy of its results.
Based on our examination of existing research, little work has been done on how uncertainty significantly impacts the confluence of big data and the analytics techniques in use. To address this shortcoming, this article presents an overview of the existing AI techniques for big data analytics, including ML, NLP, and CI, from the perspective of uncertainty challenges, as well as suitable directions for future research in these domains. The contributions of this work are as follows. First, we consider uncertainty challenges in each of the five V characteristics of big data. Second, we review several big data analytics techniques and the impact of uncertainty on each of them. Third, we discuss available strategies for handling each challenge presented by uncertainty.
To the best of our knowledge, this is the first article surveying uncertainty in big data analytics. The remainder of the paper is organized as follows. “ Background ” section presents background information on big data, uncertainty, and big data analytics. “ Uncertainty perspective of big data analytics ” section considers challenges and opportunities regarding uncertainty in different AI techniques for big data analytics. “ Summary of mitigation strategies ” section correlates the surveyed works with their respective uncertainties. Lastly, “ Discussion ” section summarizes this paper and presents future directions of research.
Background
This section reviews background information on the main characteristics of big data, uncertainty, and the analytics processes that address the uncertainty inherent in big data.
In May 2011, big data was announced as the next frontier for productivity, innovation, and competition [ 11 ]. By 2018, the number of Internet users had grown 7.5% from 2016, to over 3.7 billion people [ 2 ]. In 2010, over 1 zettabyte (ZB) of data was generated worldwide, a figure that rose to 7 ZB by 2014 [ 17 ]. In 2001, the emerging characteristics of big data were defined with three V’s (Volume, Velocity, and Variety) [ 18 ]. Similarly, IDC defined big data using four V’s (Volume, Variety, Velocity, and Value) in 2011 [ 19 ]. In 2012, Veracity was introduced as a fifth characteristic of big data [ 20 , 21 , 22 ]. While many other V’s exist [ 10 ], we focus on the five most common characteristics of big data, as illustrated in Fig. 1.
Fig. 1 Common big data characteristics
Volume refers to the massive amount of data generated every second and applies to the size and scale of a dataset. It is impractical to define a universal threshold for big data volume (i.e., what constitutes a ‘big dataset’) because the time and type of data can influence its definition [ 23 ]. Currently, datasets that reside in the exabyte (EB) or ZB ranges are generally considered as big data [ 8 , 24 ], however challenges still exist for datasets in smaller size ranges. For example, Walmart collects 2.5 PB from over a million customers every hour [ 25 ]. Such huge volumes of data can introduce scalability and uncertainty problems (e.g., a database tool may not be able to accommodate infinitely large datasets). Many existing data analysis techniques are not designed for large-scale databases and can fall short when trying to scan and understand the data at scale [ 8 , 15 ].
Variety refers to the different forms of data in a dataset including structured data, semi-structured data, and unstructured data. Structured data (e.g., stored in a relational database) is mostly well-organized and easily sorted, but unstructured data (e.g., text and multimedia content) is random and difficult to analyze. Semi-structured data (e.g., NoSQL databases) contains tags to separate data elements [ 23 , 26 ], but enforcing this structure is left to the database user. Uncertainty can manifest when converting between different data types (e.g., from unstructured to structured data), in representing data of mixed data types, and in changes to the underlying structure of the dataset at run time. From the point of view of variety, traditional big data analytics algorithms face challenges for handling multi-modal, incomplete and noisy data. Because such techniques (e.g., data mining algorithms) are designed to consider well-formatted input data, they may not be able to deal with incomplete and/or different formats of input data [ 7 ]. This paper focuses on uncertainty with regard to big data analytics, however uncertainty can impact the dataset itself as well.
Efficiently analysing unstructured and semi-structured data can be challenging, as the data under observation comes from heterogeneous sources with a variety of data types and representations. For example, real-world databases are negatively influenced by inconsistent, incomplete, and noisy data. Therefore, a number of data preprocessing techniques, including data cleaning, data integration, and data transformation, are used to remove noise from data [ 27 ]. Data cleaning techniques address data quality and uncertainty problems resulting from variety in big data (e.g., noise and inconsistent data). Removing noisy objects during the analysis process can significantly enhance the performance of data analysis. For example, data cleaning for error detection and correction is facilitated by identifying and eliminating mislabeled training samples, ideally resulting in an improvement in classification accuracy in ML [ 28 ].
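As a minimal illustration of such a preprocessing pipeline, the following Python sketch (our own, not drawn from the cited works; the column name, the 3-sigma threshold, and the synthetic data are assumptions) deduplicates records, mean-imputes missing numeric values, and filters gross outliers:

```python
# A minimal cleaning sketch with pandas/NumPy; thresholds and data are illustrative.
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, mean-impute missing numeric cells, and filter gross outliers."""
    df = df.drop_duplicates()                    # remove exact duplicate records
    df = df.fillna(df.mean(numeric_only=True))   # impute missing numeric values
    num = df.select_dtypes(include=np.number)
    # Treat rows more than 3 standard deviations from the column mean as noise.
    keep = ((num - num.mean()).abs() <= 3 * num.std()).all(axis=1)
    return df[keep]

rng = np.random.default_rng(0)
raw = pd.DataFrame({"reading": rng.normal(1.0, 0.1, size=1000)})
raw.loc[10, "reading"] = 50.0    # inject one gross outlier
raw.loc[20, "reading"] = np.nan  # inject one missing value
print(len(clean(raw)))           # 999: the outlier row is filtered out
```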
Velocity comprises the speed (represented in terms of batch, near-real time, real time, and streaming) of data processing, emphasizing that the speed with which the data is processed must meet the speed with which the data is produced [ 8 ]. For example, Internet of Things (IoT) devices continuously produce large amounts of sensor data. If the device monitors medical information, any delays in processing the data and sending the results to clinicians may result in patient injury or death (e.g., a pacemaker that reports emergencies to a doctor or facility) [ 20 ]. Similarly, devices in the cyber-physical domain often rely on real-time operating systems enforcing strict timing standards on execution, and as such, may encounter problems when data provided from a big data application fails to be delivered on time.
Veracity represents the quality of the data (e.g., uncertain or imprecise data). For example, IBM estimates that poor data quality costs the US economy $3.1 trillion per year [ 21 ]. Because data can be inconsistent, noisy, ambiguous, or incomplete, data veracity is categorized as good, bad, and undefined. Due to the increasingly diverse sources and variety of data, accuracy and trust become more difficult to establish in big data analytics. For example, an employee may use Twitter to share official corporate information but at other times use the same account to express personal opinions, causing problems with any techniques designed to work on the Twitter dataset. As another example, when analyzing millions of health care records to determine or detect disease trends, for instance to mitigate an outbreak that could impact many people, any ambiguities or inconsistencies in the dataset can interfere or decrease the precision of the analytics process [ 21 ].
Value represents the context and usefulness of data for decision making, whereas the prior V’s focus more on representing challenges in big data. For example, Facebook, Google, and Amazon have leveraged the value of big data via analytics in their respective products. Amazon analyzes large datasets of users and their purchases to provide product recommendations, thereby increasing sales and user participation. Google collects location data from Android users to improve location services in Google Maps. Facebook monitors users’ activities to provide targeted advertising and friend recommendations. These three companies have each become massive by examining large sets of raw data and drawing and retrieving useful insight to make better business decisions [ 29 ].
- Uncertainty
Generally, “uncertainty is a situation which involves unknown or imperfect information” [ 30 ]. Uncertainty exists in every phase of big data learning [ 7 ] and comes from many different sources, such as data collection (e.g., variance in environmental conditions and issues related to sampling), concept variance (e.g., the aims of analytics do not present similarly), and multimodality (e.g., the complexity and noise introduced with patient health records from multiple sensors, including numerical, textual, and image data). For instance, most of the attribute values relating to the timing of big data (e.g., when events occur/have occurred) are missing due to noise and incompleteness. Furthermore, the number of missing links between data points in social networks is approximately 80% to 90%, and the share of missing attribute values within patient reports transcribed from doctor diagnoses is more than 90% [ 31 ]. Based on IBM research in 2014, industry analysts believed that, by 2015, 80% of the world’s data would be uncertain [ 32 ].
Various forms of uncertainty exist in big data and big data analytics that may negatively impact the effectiveness and accuracy of the results. For example, if training data is biased in any way, incomplete, or obtained through inaccurate sampling, a learning algorithm using that corrupted training data will likely output inaccurate results. Therefore, it is critical to augment big data analytic techniques to handle uncertainty. Recently, meta-analysis studies that integrate uncertainty and learning from data have seen a sharp increase [ 33 , 34 , 35 ]. The handling of the uncertainty embedded in the entire process of data analytics has a significant effect on the performance of learning from big data [ 16 ]. Other research also points to two further features of big data: multimodality (very complex types of data) and the fact that the modeling and measurement of uncertainty in big data differ remarkably from those in small-size data. There is also a positive correlation between the size of a dataset and the uncertainty of both the data itself and its processing [ 34 ]. For example, fuzzy sets may be applied to model uncertainty in big data to combat vague or incorrect information [ 36 ]. Moreover, because the data may contain hidden relationships, the uncertainty is further increased.
Therefore, it is not an easy task to evaluate uncertainty in big data, especially when the data may have been collected in a manner that creates bias. To combat the many types of uncertainty that exist, many theories and techniques have been developed to model its various forms. We next describe several common techniques.
Bayesian theory assumes a subjective interpretation of probability based on past events/prior knowledge. In this interpretation, probability is defined as an expression of a rational agent’s degrees of belief about uncertain propositions [ 37 ]. Belief function theory is a framework for aggregating imperfect data through an information fusion process when under uncertainty [ 38 ]. Probability theory incorporates randomness and generally deals with the statistical characteristics of the input data [ 34 ]. Classification entropy measures ambiguity between classes to provide an index of confidence when classifying. Entropy varies on a scale from zero to one, where values closer to zero indicate more complete classification in a single class, while values closer to one indicate membership among several different classes [ 39 ]. Fuzziness is used to measure uncertainty in classes, notably in human language (e.g., good and bad) [ 16 , 33 , 40 ]. Fuzzy logic then handles the uncertainty associated with human perception by creating an approximate reasoning mechanism [ 41 , 42 ]. The methodology was intended to imitate human reasoning to better handle uncertainty in the real world [ 43 ]. Shannon’s entropy quantifies the amount of information in a variable to determine the amount of missing information on average in a random source [ 44 , 45 ]. The concept of entropy in statistics was introduced into the theory of communication and transmission of information by Shannon [ 46 ]. Shannon entropy provides a method of information quantification when it is not possible to measure criteria weights using a decision-maker. Rough set theory provides a mathematical tool for reasoning on vague, uncertain, or incomplete information. With the rough set approach, concepts are described by two approximations (upper and lower) instead of one precise concept [ 47 ], making such methods invaluable for dealing with uncertain information systems [ 48 ]. Probability theory and Shannon’s entropy are often used to model imprecise, incomplete, and inaccurate data. Moreover, fuzzy set and rough set theory are used for modeling vague or ambiguous data [ 49 ], as shown in Fig. 2.
Fig. 2 Measuring uncertainty in big data
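To make the entropy measures concrete, the following short Python sketch (ours, with invented class distributions) computes Shannon entropy and a classification entropy normalized to the zero-to-one scale described above:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum(p_i * log2(p_i)), ignoring zero terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def classification_entropy(p):
    """Entropy normalized to [0, 1]: 0 = one confident class, 1 = maximal ambiguity."""
    return shannon_entropy(p) / np.log2(len(p))

print(classification_entropy([1.0, 0.0, 0.0]))   # 0.0: unambiguous classification
print(classification_entropy([1/3, 1/3, 1/3]))   # 1.0: membership spread evenly
```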
Evaluating the level of uncertainty is a critical step in big data analytics. Although a variety of techniques exist to analyze big data, the accuracy of the analysis may be negatively affected if uncertainty in the data, or in the technique itself, is ignored. Uncertainty models such as probability theory, fuzziness, and rough set theory can be used to augment big data analytic techniques and provide more accurate and more meaningful results. Based on previous research, Bayesian models and fuzzy set theory are common choices for modeling uncertainty and decision-making. Table 1 compares and summarizes the techniques we have identified as relevant, including a comparison between different uncertainty strategies, focusing on probability theory, Shannon’s entropy, fuzzy set theory, and rough set theory.
- Big data analytics
Big data analytics describes the process of analyzing massive datasets to discover patterns, unknown correlations, market trends, user preferences, and other valuable information that previously could not be analyzed with traditional tools [ 52 ]. With the formalization of big data’s five V characteristics, analysis techniques needed to be reevaluated to overcome their limitations on processing in terms of time and space [ 29 ]. Opportunities for utilizing big data are growing in the modern world of digital data. The global annual growth rate of big data technologies and services is predicted to be about 36% between 2014 and 2019, with global income from big data and business analytics anticipated to increase by more than 60% [ 53 ].
Several advanced data analysis techniques (i.e., ML, data mining, NLP, and CI) and potential strategies such as parallelization, divide-and-conquer, incremental learning, sampling, granular computing, feature selection [ 16 ], and instance selection [ 34 ] can convert big problems to small problems and can be used to make better decisions, reduce costs, and enable more efficient processing.
With respect to big data analytics, parallelization reduces computation time by splitting large problems into smaller instances of themselves and performing the smaller tasks simultaneously (e.g., distributing the smaller tasks across multiple threads, cores, or processors). Parallelization does not decrease the amount of work performed but rather reduces computation time, as the small tasks are completed at the same point in time instead of one after another sequentially [ 16 ].
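A minimal Python sketch of this idea follows (our illustration; production big data systems would typically rely on a framework such as Hadoop or Spark rather than raw processes). It splits one large summation into four small tasks that run concurrently:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """The small task: sum one slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    step = len(data) // n_workers
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with Pool(n_workers) as pool:
        # The four partial sums are computed concurrently, then combined.
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```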
The divide-and-conquer strategy plays an important role in processing big data. Divide-and-conquer consists of three phases: (1) reduce one large problem into several smaller problems, (2) complete the smaller problems, where the solving of each small problem contributes to the solving of the large problem, and (3) incorporate the solutions of the smaller problems into one large solution such that the large problem is considered solved. For many years the divide-and-conquer strategy has been used in very massive databases to manipulate records in groups rather than all the data at once [ 54 ].
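The three phases can be made concrete with a toy Python sketch of our own (finding the maximum of a list; the recursion mirrors the phases named above):

```python
def dc_max(records):
    """Divide-and-conquer maximum, annotated with the three phases from the text."""
    if len(records) == 1:      # base case: small enough to solve directly
        return records[0]
    mid = len(records) // 2
    # Phase 1: reduce one large problem into two smaller problems.
    left, right = records[:mid], records[mid:]
    # Phase 2: complete the smaller problems.
    left_max, right_max = dc_max(left), dc_max(right)
    # Phase 3: incorporate the partial solutions into one overall solution.
    return max(left_max, right_max)

print(dc_max([7, 42, 3, 19, 8]))  # 42
```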
Incremental learning is a learning approach, popularly used with streaming data, in which the model is trained on new data as it arrives rather than being retrained on all existing data. Incremental learning adjusts the parameters of the learning algorithm over time according to each new input, and each input is used for training only once [ 16 ].
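As an illustrative sketch (ours; the synthetic stream and the choice of scikit-learn's SGDClassifier, whose partial_fit method supports exactly this one-pass batch training, are assumptions), consider:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()        # linear model that can be updated batch by batch
classes = np.array([0, 1])     # all classes must be declared up front

for _ in range(100):           # stand-in for an unbounded data stream
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    # Each batch updates the parameters once and is then discarded, so
    # memory use stays constant no matter how long the stream runs.
    model.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(200, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```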
Sampling can be used as a data reduction method for big data analytics for deriving patterns in large data sets by choosing, manipulating, and analyzing a subset of the data [ 16 , 55 ]. Some research indicates that obtaining effective results using sampling depends on the data sampling criteria used [ 56 ].
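One classic one-pass technique matching this description is reservoir sampling; the cited works do not prescribe a specific sampler, so the following Python sketch is our illustrative choice:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Draw a uniform sample of k items from a stream of unknown length in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # keep each later item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```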
Granular computing groups elements from a large space to simplify the elements into subsets, or granules [ 57 , 58 ]. Granular computing is an effective approach to define uncertainty of objects in the search space as it reduces large objects to a smaller search space [ 59 ].
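Clustering is one concrete way to form granules, though not the only formalism in the literature (rough and fuzzy sets are also used); the following scikit-learn sketch of ours is only a hedged illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Reduce 10,000 points to 20 granules; downstream analysis can then operate
# on 20 representatives instead of the full point set.
rng = np.random.default_rng(1)
points = rng.normal(size=(10_000, 3))
granulator = KMeans(n_clusters=20, n_init=10, random_state=1).fit(points)
granules = granulator.cluster_centers_   # one representative per granule
print(granules.shape)                    # (20, 3): the reduced search space
```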
Feature selection is a conventional approach to handling big data with the purpose of choosing a subset of relevant features for a more compact but precise data representation [ 60 , 61 ]. Feature selection is a very useful strategy in data mining for preparing high-dimensional data [ 60 ].
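For illustration, a short scikit-learn sketch of our own (the synthetic data is an assumption, and f_classif scoring is only one of several plausible selection criteria):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 100 features, of which only 5 are informative; keep the 5 highest-scoring ones.
X, y = make_classification(n_samples=500, n_features=100, n_informative=5,
                           random_state=0)
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)                       # (500, 5)
print(selector.get_support(indices=True))   # indices of the retained features
```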
Instance selection is practical in many ML or data mining tasks as a major feature in data pre-processing. By utilizing instance selection, it is possible to reduce training sets and runtime in the classification or training phases [ 62 ].
The costs of uncertainty (both monetarily and computationally) and challenges in generating effective models for uncertainties in big data analytics have become key to obtaining robust and performant systems. As such, we examine several open issues of the impacts of uncertainty on big data analytics in the next section.
Uncertainty perspective of big data analytics
This section examines the impact of uncertainty on three AI techniques for big data analytics. Specifically, we focus on ML, NLP, and CI, although many other analytics techniques exist. For each presented technique, we examine the inherent uncertainties and discuss methods and strategies for their mitigation.
Machine learning and big data
When dealing with data analytics, ML is generally used to create models for prediction and knowledge discovery to enable data-driven decision-making. Traditional ML methods are not computationally efficient or scalable enough to handle both the characteristics of big data (e.g., large volumes, high speeds, varying types, low value density, incompleteness) and uncertainty (e.g., biased training data, unexpected data types, etc.). Several commonly used advanced ML techniques proposed for big data analysis include feature learning, deep learning, transfer learning, distributed learning, and active learning. Feature learning comprises a set of techniques that enables a system to automatically discover, from raw data, the representations needed for feature detection or classification; the performance of ML algorithms is strongly influenced by the choice of data representation. Deep learning algorithms are designed for analyzing and extracting valuable knowledge from massive amounts of data collected from various sources (e.g., separate variations within an image, such as lighting, materials, and shapes) [ 56 ]; however, current deep learning models incur a high computational cost. Distributed learning can mitigate the scalability problem of traditional ML by carrying out calculations on datasets distributed among several workstations to scale up the learning process [ 63 ]. Transfer learning is the ability to apply knowledge learned in one context to new contexts, effectively improving a learner in one domain by transferring information from a related domain [ 64 ]. Active learning refers to algorithms that employ adaptive data collection [ 65 ] (i.e., processes that automatically adjust parameters to collect the most useful data as quickly as possible) in order to accelerate ML activities and overcome labeling problems.

The uncertainty challenges of ML techniques can be mainly attributed to learning from data with low veracity (i.e., uncertain and incomplete data) and data with low value (i.e., unrelated to the current problem). Among the ML techniques, we found that active learning, deep learning, and fuzzy logic theory are uniquely suited to reducing uncertainty, as shown in Fig. 3. Uncertainty can impact ML in terms of incomplete or imprecise training samples, unclear classification boundaries, and rough knowledge of the target data. In some cases, the data is unlabeled, which poses a challenge: manually labeling large data collections is an expensive and strenuous task, yet learning from unlabeled data is very difficult, as classifying data with unclear guidelines yields unclear results. Active learning addresses this issue by selecting a subset of the most important instances for labeling [ 65 , 66 ]. Deep learning is another learning method that can handle incompleteness and inconsistency issues in the classification procedure [ 15 ]. Fuzzy logic theory has also been shown to model uncertainty efficiently. For example, in fuzzy support vector machines (FSVMs), a fuzzy membership is applied to each input point of the support vector machine (SVM); the learning procedure then benefits from the flexibility provided by fuzzy logic, improving the SVM by decreasing the effect of noise in the data points [ 67 ].
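To ground the active learning idea, the following self-contained sketch (ours; the data, model, and query budget are invented assumptions) runs uncertainty sampling, querying in each round the label of the pool instance the current model is least confident about:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# Seed with five labeled examples per class; the rest form an unlabeled pool.
labeled = [int(i) for i in np.where(y == 0)[0][:5]]
labeled += [int(i) for i in np.where(y == 1)[0][:5]]
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression()
for _ in range(20):                          # twenty labeling rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # Uncertainty sampling: query the pool instance whose most probable
    # class has the lowest probability, i.e., the least confident prediction.
    query = pool[int(np.argmin(proba.max(axis=1)))]
    labeled.append(query)                    # the oracle supplies y[query]
    pool.remove(query)

print(f"labels used: {len(labeled)}, accuracy: {model.score(X, y):.2f}")
```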
Hence, while uncertainty is a notable problem for ML algorithms, incorporating effective techniques for measuring and modeling uncertainty can lead towards systems that are more flexible and efficient.
Fig. 3 How ML techniques handle uncertainty in big data
Natural language processing and big data
NLP is a technique grounded in ML that enables devices to analyze, interpret, and even generate text [ 8 ]. NLP and big data analytics tackle huge amounts of text data and can derive value from such a dataset in real-time [ 68 ]. Some common NLP methods include lexical acquisition (i.e., obtains information about the lexical units of a language), word sense disambiguation (i.e., determining which sense of the word is used in a sentence when a word has multiple meanings), and part-of-speech (POS) tagging (i.e., determining the function of the words through labeling categories such as verb, noun, etc.). Several NLP-based techniques have been applied to text mining including information extraction, topic modeling, text summarization, classification, clustering, question answering, and opinion mining [ 8 ]. For example, financial and fraud investigations may involve finding evidence of a crime in massive datasets. NLP techniques (particularly named entity extraction and information retrieval) can help manage and sift through huge amounts of textual information, such as criminal names and bank records, to support fraud investigations. Moreover, NLP techniques can help to create new traceability links and recover traceability links (i.e., missing or broken links at run-time) by finding semantic similarity among available textual artifacts [ 69 ]. Furthermore, NLP and big data can be used to analyze news articles and predict rises and falls on the composite stock price index [ 68 ].
Uncertainty impacts NLP in big data in a variety of ways. For example, keyword search is a classic approach in text mining that is used to handle large amounts of textual data. Keyword search accepts as input a list of relevant words or phrases and searches the desired set of data (e.g., a document or database) for occurrences of the relevant words (i.e., search terms). Uncertainty can impact keyword search, as the presence of a keyword in a document is no guarantee of that document’s relevance. For example, a keyword search usually matches exact strings and ignores words with spelling errors that may still be relevant. Boolean operators and fuzzy search technologies permit greater flexibility in that they can be used to search for words similar to the desired spelling [ 70 ]. Although keyword or key phrase search is useful, limited sets of search terms can miss key information. In comparison, using a wider set of search terms can result in a large set of ‘hits’ containing large numbers of irrelevant false positives [ 71 ]. Another example of uncertainty impacting NLP involves automatic POS taggers that must handle the ambiguity of certain words (Fig. 4) (e.g., the word “bimonthly” can mean twice a month or every two months depending on the context, and the word “quite” carries different meanings for American and British audiences), as well as classification problems due to the ambiguity of periods (‘.’), which can be interpreted as part of a token (e.g., an abbreviation), as punctuation (e.g., a full stop), or as both [ 72 , 73 ]. Although recent research indicates that using IBM Content Analytics (ICA) can mitigate these problems, the issue remains open for large-scale data [ 73 ]. Uncertainty and ambiguity also impact POS tagging, especially for biomedical language, which is quite different from general English; insufficient tagging accuracy has been reported when taggers trained on the Treebank corpus are applied to biomedical data [ 74 ]. To this end, stream processing systems deal with high data throughput while achieving low response latencies. Integrating NLP techniques with uncertainty modeling, such as fuzzy and probabilistic sets, into big data analytics may offer the ability to support handling big textual data in real time, however additional work is necessary in this area.
Fig. 4 Words with more than one POS tag (ambiguity)
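As a small illustration of fuzzy search absorbing spelling errors, the following sketch of ours uses Python's standard-library difflib (the documents and search term are invented); an exact keyword match would miss the misspelled token, while similarity-based matching recovers it:

```python
import difflib

documents = ["the bank recorded a suspicios transfer",
             "quarterly report shows stable transfers",
             "no relevant activity this month"]
search_term = "suspicious"

for doc in documents:
    # Exact matching would miss the misspelled token 'suspicios'; comparing
    # similarity ratios instead of exact strings recovers it.
    hits = difflib.get_close_matches(search_term, doc.split(), cutoff=0.8)
    if hits:
        print(f"fuzzy match {hits} in: {doc!r}")
```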
Computational intelligence and big data
CI includes a set of nature-inspired computational techniques that play an important role in big data analysis [ 75 ]. CI techniques have been used to tackle complicated data processes and analytics challenges such as high complexity, uncertainty, and any process where traditional techniques are not sufficient. Common CI techniques include evolutionary algorithms (EAs), artificial neural networks (ANNs), and fuzzy logic [ 76 ], with examples ranging from search-based problems, such as parameter optimization, to optimizing a robot controller.
CI techniques are suitable for dealing with the real-world challenges of big data as they are fundamentally capable of handling significant amounts of uncertainty. For example, generating models for predicting the emotions of users is one problem with many potential pitfalls for uncertainty: such models deal with large databases of information relating to human emotion and its inherent fuzziness [ 77 ]. Many challenges still exist in current CI techniques, especially when dealing with the value and veracity characteristics of big data. Accordingly, there is great interest in developing new CI techniques that can efficiently address massive amounts of data and can quickly respond to modifications in the dataset [ 78 ]. As reported by [ 78 ], big data analysis can be optimized by employing algorithms such as swarm intelligence, AI, and ML. These techniques are used for training machines to perform predictive analysis tasks, collaborative filtering, and building empirical statistical predictive models. By using CI-based big data analytics solutions, it is possible to minimize the complexity and uncertainty of processing massive volumes of data and to improve analysis results.
To support CI, fuzzy logic provides an approach for approximate reasoning and modeling of qualitative data for uncertainty challenges in big data analytics [ 76 , 79 , 80 ] using linguistic quantifiers (i.e., fuzzy sets). It represents uncertain real-world and user-defined concepts as interpretable fuzzy rules that can be used for inference and decision-making. Big data analytics also bears challenges due to the existence of noise in data, where the data consists of high degrees of uncertainty and outlier artifacts. Iqbal et al. [ 76 ] have demonstrated that fuzzy logic systems can efficiently handle such inherent uncertainties related to the data. In another study, fuzzy logic-based matching algorithms and MapReduce were used to perform big data analytics for clinical decision support; the developed system demonstrated great flexibility and could handle data from various sources [ 81 ]. Another useful CI technique for tackling the challenges of big data analytics is the EA, which discovers the optimal solution(s) to a complex problem by mimicking the evolutionary process, gradually developing a population of candidate solutions [ 73 ]. Since big data includes high volume, high variety, and low veracity, EAs are excellent tools for analyzing such datasets [ 82 ]. For example, applying parallel genetic algorithms to medical image processing yields effective results in a Hadoop-based system [ 83 ]. However, the results of CI-based algorithms may be impacted by motion, noise, and unexpected environments; moreover, an algorithm that can deal with one of these problems may function poorly when impacted by multiple factors [ 79 ].
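The select-and-vary loop at the heart of an EA can be illustrated with a deliberately tiny sketch of our own (the one-dimensional objective is invented; real applications optimize far more complex fitness functions):

```python
import random

rng = random.Random(0)

def fitness(x):
    """Toy objective to maximize: -(x - 3)^2, with its optimum at x = 3."""
    return -(x - 3.0) ** 2

population = [rng.uniform(-10, 10) for _ in range(30)]
for _ in range(50):                              # fifty generations
    population.sort(key=fitness, reverse=True)   # select: keep the fitter half
    parents = population[:15]
    offspring = [p + rng.gauss(0, 0.5) for p in parents]  # vary: mutate survivors
    population = parents + offspring

print(f"best candidate: {max(population, key=fitness):.3f}")  # approaches 3.0
```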
Summary of mitigation strategies
This paper has reviewed numerous techniques on big data analytics and the impact of uncertainty of each technique. Table 2 summarizes these findings. First, each AI technique is categorized as either ML, NLP, or CI. The second column illustrates how uncertainty impacts each technique, both in terms of uncertainty in the data and the technique itself. Finally, the third column summarizes proposed mitigation strategies for each uncertainty challenge. For example, the first row of Table 2 illustrates one possibility for uncertainty to be introduced in ML via incomplete training data. One approach to overcome this specific form of uncertainty is to use an active learning technique that uses a subset of the data chosen to be the most significant, thereby countering the problem of limited available training data.
Note that we have explained each big data characteristic separately. However, combining two or more big data characteristics will incur exponentially more uncertainty, thus requiring even further study.
Discussion
This paper has discussed how uncertainty can impact big data, both in terms of analytics and the dataset itself. Our aim was to discuss the state of the art with respect to big data analytics techniques, how uncertainty can negatively impact such techniques, and the open issues that remain. For each common technique, we have summarized relevant research to aid others in this community when developing their own techniques. We have discussed the issues surrounding the five V’s of big data; however, many other V’s exist. In terms of existing research, much focus has been placed on volume, variety, velocity, and veracity of data, with less available work on value (e.g., data related to corporate interests and decision making in specific domains).
Future research directions
This paper has uncovered many avenues for future work in this field. First, additional study must be performed on the interactions between each big data characteristic, as they do not exist separately but naturally interact in the real world. Second, the scalability and efficacy of existing analytics techniques being applied to big data must be empirically examined. Third, new techniques and algorithms must be developed in ML and NLP to handle the real-time needs of decisions made based on enormous amounts of data. Fourth, more work is necessary on how to efficiently model uncertainty in ML and NLP, as well as on how to represent the uncertainty resulting from big data analytics. Fifth, since CI algorithms are able to find approximate solutions within a reasonable time, they have been used in recent years to tackle ML problems and uncertainty challenges in data analytics; however, there is a lack of CI metaheuristic algorithms applicable to big data analytics for mitigating uncertainty.
Availability of data and materials
Not applicable.
Abbreviations
- IoT: Internet of Things
- IDC: International Data Corporation
- AI: artificial intelligence
- ML: machine learning
- NLP: natural language processing
- CI: computational intelligence
- FSVM: fuzzy support vector machines
- SVM: support vector machines
- POS: part-of-speech
- ICA: IBM Content Analytics
- EA: evolutionary algorithms
- ANN: artificial neural networks
Jaseena KU, David JM. Issues, challenges, and solutions: big data mining. Comput Sci Inf Technol (CS & IT). 2014;4:131–40.
Marr B. Forbes. How much data do we create every day? 2018. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#4146a89b60ba .
McAfee A, Brynjolfsson E, Davenport TH, Patil DJ, Barton D. Big data: the management revolution. Harvard Bus Rev. 2012;90(10):60–8.
Zephoria. Digital Marketing. The top 20 valuable Facebook statistics—updated November 2018. 2018. https://zephoria.com/top-15-valuable-facebook-statistics/ .
Iafrate F. A journey from big data to smart data. In: Digital enterprise design and management. Cham: Springer; 2014. p. 25–33.
Lenk A, Bonorden L, Hellmanns A, Roedder N, Jaehnichen S. Towards a taxonomy of standards in smart data. In: IEEE international conference on big data (Big Data). Piscataway: IEEE; 2015. p. 1749–54.
Tsai CW, Lai CF, Chao HC, Vasilakos AV. Big data analytics: a survey. J Big Data. 2015;2(1):21.
Chen M, Mao S, Liu Y. Big data: a survey. Mobile Netw Appl. 2014;19(2):171–209.
Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19(12):798–808.
Borne K. Top 10 big data challenges: a serious look at 10 big data V’s. 2014. https://mapr.com/blog/top-10-big-data-challenges-serious-look-10-big-data-vs . Accessed 11 Apr 2014.
Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big data: the next frontier for innovation, competition, and productivity. 2011.
Pouyanfar S, Yang Y, Chen SC, Shyu ML, Iyengar SS. Multimedia big data analytics: a survey. ACM Comput Surv (CSUR). 2018;51(1):10.
Cimaglobal. Using big data to reduce uncertainty in decision making. 2015. http://www.cimaglobal.com/Pages-that-we-will-need-to-bring-back/velocity-archive/Student-e-magazine/Velocity-December-2015/P2-using-big-data-to-reduce-uncertainty-in-decision-making/ .
Maugis PA. Big data uncertainties. J Forensic Legal Med. 2018;57:7–11.
Saidulu D, Sasikala R. Machine learning and statistical approaches for Big Data: issues, challenges and research directions. Int J Appl Eng Res. 2017;12(21):11691–9.
Wang X, He Y. Learning from uncertainty for big data: future analytical challenges and strategies. IEEE Syst Man Cybern Mag. 2016;2(2):26–31.
Villars RL, Olofson CW, Eastwood M. Big data: what it is and why you should care. White Paper IDC. 2011;14:1–14.
Laney D. 3D data management: controlling data volume, velocity and variety. META Group Res Note. 2001;6(70):1.
Gantz J, Reinsel D. Extracting value from chaos. IDC iview. 2011;1142(2011):1–12.
Jain A. The 5 Vs of big data. IBM Watson Health Perspectives. 2017. https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/ . Accessed 30 May 2017.
IBM big data and analytics hub. Extracting Business Value from the 4 V’s of Big Data. 2016. http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data .
Snow D. Dwaine Snow’s thoughts on databases and data management. 2012.
Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manage. 2015;35(2):137–44.
Vajjhala NR, Strang KD, Sun Z. Statistical modeling and visualizing open big data using a terrorism case study. In: 3rd international conference on future Internet of things and cloud (FiCloud). Piscataway: IEEE; 2015. p. 489–96.
Marr B. Really big data at Walmart: real-time insights from their 40+ Petabyte data cloud. 2017. https://www.forbes.com/sites/bernardmarr/2017/01/23/really-big-data-at-walmart-real-time-insights-from-their-40-petabyte-data-cloud/#2a0c16916c10 .
Pokorný J, Škoda P, Zelinka I, Bednárek D, Zavoral F, Kruliš M, Šaloun P. Big data movement: a challenge in data processing. In: Big data in complex systems. Cham: Springer; 2015. p. 29–69.
Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.
Xiong H, Pandey G, Steinbach M, Kumar V. Enhancing data analysis with noise removal. IEEE Trans Knowl Data Eng. 2006;18(3):304–19.
Court D. Getting big impact from big data. McKinsey Q. 2015;1:52–60.
Knight FH. Risk, uncertainty and profit, library of economics and liberty. 1921. (Retrieved May 17 2011).
DeLine R. Research opportunities for the big data era of software engineering. In: Proceedings of the first international workshop on BIG Data software engineering. Piscataway: IEEE Press; 2015. p. 26–9.
IBM Think Leaders. Veracity of data for marketing: step-by-step. 2014. https://www.ibm.com/blogs/insights-on-business/ibmix/veracity-of-data-for-marketing-step-by-step/ .
Wang XZ, Ashfaq RAR, Fu AM. Fuzziness based sample categorization for classifier performance improvement. J Intell Fuzzy Syst. 2015;29(3):1185–96.
Wang X, Huang JZ. Editorial: uncertainty in learning from big data. Fuzzy Sets Syst. 2015;258(1):1–4.
Xu ZB, Liang JY, Dang CY, Chin KS. Inclusion degree: a perspective on measures for rough set data analysis. Inf Sci. 2002;141(3–4):227–36.
López V, del Río S, Benítez JM, Herrera F. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst. 2015;258:5–38.
Bernardo JM, Smith AF. Bayesian theory, vol. 405. Hoboken: Wiley; 2009.
Cuzzolin F. (Ed.). Belief functions: theory and applications. Berlin: Springer International Publishing; 2014.
Brown DG. Classification and boundary vagueness in mapping presettlement forest types. Int J Geogr Inf Sci. 1998;12(2):105–29.
Correa CD, Chan YH, Ma KL. A framework for uncertainty-aware visual analytics. In: IEEE symposium on visual analytics science and technology, VAST 2009. Piscataway: IEEE; 2009. p. 51–8.
Zadeh LA. Toward a perception-based theory of probabilistic reasoning with imprecise probabilities. J Stat Plann Inference. 2002;105(2002):233–64.
Zadeh LA. Toward a generalized theory of uncertainty (GTU)-an outline. Inf Sci. 2005;172(1–2):1–40.
Özkan I, Türkşen IB. Uncertainty and fuzzy decisions. In: Chaos theory in politics. Dordrecht: Springer; 2014. p. 17–27.
Lesne A. Shannon entropy: a rigorous notion at the crossroads between probability, information theory, dynamical systems and statistical physics. Math Struct Comput Sci. 2014;24(3).
Vajapeyam S. Understanding Shannon’s entropy metric for information. 2014. arXiv preprint arXiv:1405.2061 .
Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423.
Pawlak Z. Rough sets. Int J Comput Inform Sci. 1982;11(5):341–56.
Rissino S, Lambert-Torres G. Rough set theory - fundamental concepts, principals, data extraction, and applications. In: Data mining and knowledge discovery in real life applications. New York: InTech; 2009.
Tavana M, Liu W, Elmore P, Petry FE, Bourgeois BS. A practical taxonomy of methods and literature for managing uncertain spatial data in geographic information systems. Measurement. 2016;81:123–62.
Salahdine F, Kaabouch N, El Ghazi H. Techniques for dealing with uncertainty in cognitive radio networks. In: 2017 IEEE 7th annual computing and communication workshop and conference (CCWC). Piscataway: IEEE; 2017. p. 1–6.
Düntsch I, Gediga G. Rough set dependency analysis in evaluation studies: an application in the study of repeated heart attacks. Inf Res Rep. 1995;10:25–30.
Golchha N. Big data—the information revolution. IJAR. 2015;1(12):791–4.
Khan M, Ayyoob M. Big data analytics evaluation. Int J Eng Res Comput Sci Eng (IJERCSE). 2018;5(2):25–8.
Jordan MI. Divide-and-conquer and statistical inference for big data. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. New York: ACM; 2012. p. 4.
Wang XZ, Dong LC, Yan JH. Maximum ambiguity-based sample selection in fuzzy decision tree induction. IEEE Trans Knowl Data Eng. 2012;24(8):1491–505.
Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1.
Bargiela A, Pedrycz W. Granular computing. In: Handbook on computational intelligence, vol. 1: Fuzzy logic, systems, artificial neural networks, and learning systems. 2016. p. 43–66.
Kacprzyk J, Filev D, Beliakov G. (Eds.). Granular, Soft and fuzzy approaches for intelligent systems: dedicated to Professor Ronald R. Yager (Vol. 344). Berlin: Springer; 2016.
Yager RR. Decision making under measure-based granular uncertainty. Granular Comput. 2018:1–9.
Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46(1–3):389–422.
Liu H, Motoda H. (Eds.). Computational methods of feature selection. Boca Raton: CRC Press; 2007.
Olvera-López JA, Carrasco-Ochoa JA, Martínez-Trinidad JF, Kittler J. A review of instance selection methods. Artif Intell Rev. 2010;34(2):133–43.
Qiu J, Wu Q, Ding G, Xu Y, Feng S. A survey of machine learning for big data processing. EURASIP J Adv Signal Process. 2016;2016(1):67.
Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.
Athmaja S, Hanumanthappa M, Kavitha V. A survey of machine learning algorithms for big data analytics. In: International conference on innovations in information, embedded and communication systems (ICIIECS). Piscataway: IEEE; 2017. p. 1–4.
Fu Y, Li B, Zhu X, Zhang C. Active learning without knowing individual instance labels: a pairwise label homogeneity query approach. IEEE Trans Knowl Data Eng. 2014;26(4):808–22.
Lin CF, Wang SD. Fuzzy support vector machines. IEEE Trans Neural Netw. 2002;13(2):464–71.
Wang L, Wang G, Alexander CA. Natural language processing systems and Big Data analytics. Int J Comput Syst Eng. 2015;2(2):76–84.
Hariri RH, Fredericks EM. Towards traceability link recovery for self-adaptive systems. In: Workshops at the thirty-second AAAI conference on artificial intelligence. 2018.
Crabb ES. “Time for some traffic problems”: enhancing e-discovery and big data processing tools with linguistic methods for deception detection. J Digit Forensics Secur Law. 2014;9(2):14.
Khan E. Addressing bioinformatics big data problems using natural language processing: help advancing scientific discovery and biomedical research. In: Buzatu C, editor. Modern computer applications in science and education. 2014; p. 221–8.
Clark A, Fox C, Lappin S. (Eds.). The handbook of computational linguistics and natural language processing. Hoboken: Wiley; 2013.
Holzinger A, Stocker C, Ofner B, Prohaska G, Brabenetz A, Hofmann-Wellenhof R. Combining HCI, natural language processing, and knowledge discovery: potential of IBM content analytics as an assistive technology in the biomedical field. In: Human-computer interaction and knowledge discovery in complex, unstructured, big data. Berlin, Heidelberg: Springer; 2013. p. 13–24.
Tsuruoka Y, Tateishi Y, Kim JD, Ohta T, McNaught J, Ananiadou S, Tsujii J. Developing a robust part-of-speech tagger for biomedical text. In: 10th Panhellenic conference on informatics. Volos: Springer; 2005. p. 382–92.
Fulcher J. Computational intelligence: an introduction. In: Computational intelligence: a compendium. Berlin, Heidelberg: Springer; 2008. p. 3–78.
Iqbal R, Doctor F, More B, Mahmud S, Yousuf U. Big data analytics: computational intelligence techniques and application areas. Technol Forecast Soc Change. 2018. https://doi.org/10.1016/j.techfore.2018.03.024 .
Wu D. Fuzzy sets and systems in building closed-loop affective computing systems for human-computer interaction: advances and new research directions. In: IEEE international conference on fuzzy systems (FUZZ-IEEE). Piscataway: IEEE; 2012. p. 1–8.
Gupta A. Big data analysis using computational intelligence and Hadoop: a study. In: 2nd international conference on computing for sustainable global development (INDIACom). Piscataway: IEEE; 2015. p. 1397–1401.
Doctor F, Syue CH, Liu YX, Shieh JS, Iqbal R. Type-2 fuzzy sets applied to multivariable self-organizing fuzzy logic controllers for regulating anesthesia. Appl Soft Comput. 2016;38:872–89.
Zadeh LA. Fuzzy sets. Inf Control. 1965;8(3):338–53.
Duggal R, Khatri SK, Shukla B. Improving patient matching: single patient view for clinical decision support using big data analytics. In: 4th international conference on reliability, infocom technologies and optimization (ICRITO) (trends and future directions). Piscataway: IEEE; 2015. p. 1–6.
Bhattacharya M, Islam R, Abawajy J. Evolutionary optimization: a big data perspective. J Netw Comput Appl. 2016;59:416–26.
Augustine DP. Enhancing the efficiency of parallel genetic algorithms for medical image processing with Hadoop. Int J Comput Appl. 2014;108(17):11–6.
Acknowledgements
The authors would like to thank Rana H. Hariri for her assistance with this paper.
This research has been supported in part by NSF Grant CNS-1657061, the Michigan Space Grant Consortium, the Comcast Innovation Fund, and Oakland University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Oakland University or other research sponsors.
Author information
Authors and affiliations
Oakland University, Rochester, MI, USA
Reihaneh H. Hariri, Erik M. Fredericks & Kate M. Bowers
Contributions
RHH proposed the idea of the survey, performed the literature review, analysis for the work, and wrote the manuscript. EMF supervised and provided technical guidance throughout the research and writing phases of the manuscript. KMB assisted with editing the manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Reihaneh H. Hariri .
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Hariri, R.H., Fredericks, E.M. & Bowers, K.M. Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data 6, 44 (2019). https://doi.org/10.1186/s40537-019-0206-3
Received: 09 March 2019
Accepted: 20 May 2019
Published: 04 June 2019
DOI: https://doi.org/10.1186/s40537-019-0206-3
Scientific Research and Big Data
Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse data, culminating in the emergence of the field of data science, which brings together computational, algorithmic, statistical and mathematical techniques towards extrapolating knowledge from big data. At the same time, the Open Data movement—emerging from policy trends such as the push for Open Government and Open Science—has encouraged the sharing and interlinking of heterogeneous research data via large digital infrastructures. The availability of vast amounts of data in machine-readable formats provides an incentive to create efficient procedures to collect, organise, visualise and model these data. These infrastructures, in turn, serve as platforms for the development of artificial intelligence, with an eye to increasing the reliability, speed and transparency of processes of knowledge creation. Researchers across all disciplines see the newfound ability to link and cross-reference data from diverse sources as improving the accuracy and predictive power of scientific findings and helping to identify future directions of inquiry, thus ultimately providing a novel starting point for empirical investigation. As exemplified by the rise of dedicated funding, training programmes and publication venues, big data are widely viewed as ushering in a new way of performing research and challenging existing understandings of what counts as scientific knowledge.
This entry explores these claims in relation to the use of big data within scientific research, and with an emphasis on the philosophical issues emerging from such use. To this aim, the entry discusses how the emergence of big data—and related technologies, institutions and norms—informs the analysis of the following themes:
- how statistics, formal and computational models help to extrapolate patterns from data, and with which consequences;
- the role of critical scrutiny (human intelligence) in machine learning, and its relation to the intelligibility of research processes;
- the nature of data as research components;
- the relation between data and evidence, and the role of data as source of empirical insight;
- the view of knowledge as theory-centric;
- understandings of the relation between prediction and causality;
- the separation of fact and value; and
- the risks and ethics of data science.
These are areas where attention to research practices revolving around big data can benefit philosophy, and particularly work in the epistemology and methodology of science. This entry doesn’t cover the vast scholarship in the history and social studies of science that has emerged in recent years on this topic, though references to some of that literature can be found when conceptually relevant. Complementing historical and social scientific work in data studies, the philosophical analysis of data practices can also elicit significant challenges to the hype surrounding data science and foster a critical understanding of the role of data-fuelled artificial intelligence in research.
- 1. What Are Big Data?
- 2. Extrapolating Data Patterns: The Role of Statistics and Software
- 3. Human and Artificial Intelligence
- 4. The Nature of (Big) Data
- 5. Big Data and Evidence
- 6. Big Data, Knowledge and Inquiry
- 7. Big Data between Causation and Prediction
- 8. The Fact/Value Distinction
- 9. Big Data Risks and the Ethics of Data Science
- 10. Conclusion: Big Data and Good Science
- Bibliography
- Other Internet Resources
- Related Entries
We are witnessing a progressive “datafication” of social life. Human activities and interactions with the environment are being monitored and recorded with increasing effectiveness, generating an enormous digital footprint. The resulting “big data” are a treasure trove for research, with ever more sophisticated computational tools being developed to extract knowledge from such data. One example is the use of various different types of data acquired from cancer patients, including genomic sequences, physiological measurements and individual responses to treatment, to improve diagnosis and treatment. Another example is the integration of data on traffic flow, environmental and geographical conditions, and human behaviour to produce safety measures for driverless vehicles, so that when confronted with unforeseen events (such as a child suddenly darting into the street on a very cold day), the data can be promptly analysed to identify and generate an appropriate response (the car swerving enough to avoid the child while also minimising the risk of skidding on ice and damaging other vehicles). Yet another instance is the understanding of the nutritional status and needs of a particular population that can be extracted from combining data on food consumption generated by commercial services (e.g., supermarkets, social media and restaurants) with data coming from public health and social services, such as blood test results and hospital intakes linked to malnutrition. In each of these cases, the availability of data and related analytic tools is creating novel opportunities for research and for the development of new forms of inquiry, which are widely perceived as having a transformative effect on science as a whole.
A useful starting point in reflecting on the significance of such cases for a philosophical understanding of research is to consider what the term “big data” actually refers to within contemporary scientific discourse. There are multiple ways to define big data (Kitchin 2014, Kitchin & McArdle 2016). Perhaps the most straightforward characterisation is as large datasets that are produced in a digital form and can be analysed through computational tools. Hence the two features most commonly associated with Big Data are volume and velocity. Volume refers to the size of the files used to archive and spread data. Velocity refers to the pressing speed with which data is generated and processed. The body of digital data created by research is growing at breakneck pace and in ways that are arguably impossible for the human cognitive system to grasp and thus require some form of automated analysis.
Volume and velocity are also, however, the most disputed features of big data. What may be perceived as “large volume” or “high velocity” depends on rapidly evolving technologies to generate, store, disseminate and visualise the data. This is exemplified by the high-throughput production, storage and dissemination of genomic sequencing and gene expression data, where both data volume and velocity have dramatically increased within the last two decades. Similarly, current understandings of big data as “anything that cannot be easily captured in an Excel spreadsheet” are bound to shift rapidly as new analytic software becomes established, and the very idea of using spreadsheets to capture data becomes a thing of the past. Moreover, data size and speed do not take account of the diversity of data types used by researchers, which may include data that are not generated in digital formats or whose format is not computationally tractable, and which underscores the importance of data provenance (that is, the conditions under which data were generated and disseminated) to processes of inference and interpretation. And as discussed below, the emphasis on physical features of data obscures the continuing dependence of data interpretation on circumstances of data use, including specific queries, values, skills and research situations.
An alternative is to define big data not by reference to their physical attributes, but rather by virtue of what can and cannot be done with them. In this view, big data is a heterogeneous ensemble of data collected from a variety of different sources, typically (but not always) in digital formats suitable for algorithmic processing, in order to generate new knowledge. For example, boyd and Crawford (2012: 663) identify big data with “the capacity to search, aggregate and cross-reference large datasets”, while O’Malley and Soyer (2012) focus on the ability to interrogate and interrelate diverse types of data, with the aim to be able to consult them as a single body of evidence. The examples of transformative “big data research” given above are all easily fitted into this view: it is not the mere fact that lots of data are available that makes a difference in those cases, but rather the fact that lots of data can be mobilised from a wide variety of sources (medical records, environmental surveys, weather measurements, consumer behaviour). This account makes sense of other characteristic “v-words” that have been associated with big data, including:
- Variety in the formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests;
- Veracity, understood as the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias. In the absence of appropriate validation and quality checks, this could result in a misleading or outright incorrect evidence base for knowledge claims (Floridi & Illari 2014; Cai & Zhu 2015; Leonelli 2017);
- Validity, which indicates the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context (e.g., Loettgers 2009, Bogen 2010);
- Volatility, i.e., the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term (Bowker 2006; Edwards 2010; Lagoze 2014; Borgman 2015);
- Value, i.e., the multifaceted forms of significance attributed to big data by different sections of society, which depend as much on the intended use of the data as on historical, social and geographical circumstances (Leonelli 2016, D’Ignazio and Klein 2020). Alongside scientific value, researchers may impute financial, ethical, reputational and even affective value to data, depending on their intended use as well as the historical, social and geographical circumstances of their use. The institutions involved in governing and funding research also have ways of valuing data, which may not always overlap with the priorities of researchers (Tempini 2017).
This list of features, though not exhaustive, highlights how big data is not simply “a lot of data”. The epistemic power of big data lies in their capacity to bridge between different research communities, methodological approaches and theoretical frameworks that are difficult to link due to conceptual fragmentation, social barriers and technical difficulties (Leonelli 2019a). And indeed, appeals to big data often emerge from situations of inquiry that are at once technically, conceptually and socially challenging, and where existing methods and resources have proved insufficient or inadequate (Sterner & Franz 2017; Sterner, Franz, & Witteveen 2020).
This understanding of big data is rooted in a long history of researchers grappling with large and complex datasets, as exemplified by fields like astronomy, meteorology, taxonomy and demography (see the collections assembled by Daston 2017; Aronova et al. 2017; Porter & Chadarevian 2018; as well as Aronova et al. 2010, Sepkoski 2013, Stevens 2016, Strasser 2019 among others). Similarly, biomedical research—and particularly subfields such as epidemiology, pharmacology and public health—has an extensive tradition of tackling data of high volume, velocity, variety and volatility, and whose validity, veracity and value are regularly negotiated and contested by patients, governments, funders, pharmaceutical companies, insurances and public institutions (Bauer 2008). Throughout the twentieth century, these efforts spurred the development of techniques, institutions and instruments to collect, order, visualise and analyse data, such as: standard classification systems and formats; guidelines, tools and legislation for the management and security of sensitive data; and infrastructures to integrate and sustain data collections over long periods of time (Daston 2017).
This work culminated in the application of computational technologies, modelling tools and statistical methods to big data (Porter 1995; Humphreys 2004; Edwards 2010), increasingly pushing the boundaries of data analytics thanks to supervised learning, model fitting, deep neural networks, search and optimisation methods, complex data visualisations and various other tools now associated with artificial intelligence. Many of these tools are based on algorithms whose functioning and results are tested against specific data samples (a process called “training”). These algorithms are programmed to “learn” from each interaction with novel data: in other words, they have the capacity to change themselves in response to new information being inputted into the system, thus becoming more attuned to the phenomena they are analysing and improving their ability to predict future behaviour. The scope and extent of such changes is shaped by the assumptions used to build the algorithms and the capability of related software and hardware to identify, access and process information of relevance to the learning in question. There is however a degree of unpredictability and opacity to these systems, which can evolve to the point of defying human understanding (more on this below).
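As a minimal sketch of this “learning from each interaction”, the snippet below streams batches of synthetic observations into a linear classifier that updates its parameters incrementally via scikit-learn’s partial_fit; the simulated data stream and the model choice are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()        # linear classifier trained by stochastic updates
classes = np.array([0, 1])     # all classes must be declared before streaming

for step in range(200):
    # Each interaction delivers a small batch of novel data.
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] + 0.3 * rng.normal(size=32) > 0).astype(int)
    # The model changes itself in response to the new information.
    model.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(1000, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print("accuracy after streaming updates:", round(model.score(X_test, y_test), 3))
```

The scope of what such a learner can adapt to is fixed by its design, as noted above: here the model can only ever track a linear boundary, however much data it sees.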
New institutions, communication platforms and regulatory frameworks also emerged to assemble, prepare and maintain data for such uses (Kitchin 2014), such as various forms of digital data infrastructures, organisations aiming to coordinate and improve the global data landscape (e.g., the Research Data Alliance), and novel measures for data protection, like the General Data Protection Regulation, adopted by the European Union in 2016 and enforced from 2018. Together, these techniques and institutions afford the opportunity to assemble and interpret data at a much broader scale, while also promising to deliver finer levels of granularity in data analysis. They increase the scope of any investigation by making it possible for researchers to link their own findings to those of countless others across the world, both within and beyond the academic sphere. By enhancing the mobility of data, they facilitate their repurposing for a variety of goals that may have been unforeseeable when the data were originally generated. And by transforming the role of data within research, they heighten their status as valuable research outputs in and of themselves. These technological and methodological developments have significant implications for philosophical conceptualisations of data, inferential processes and scientific knowledge, as well as for how research is conducted, organised, governed and assessed. It is to these philosophical concerns that I now turn.
Big data are often associated with the idea of data-driven research, where learning happens through the accumulation of data and the application of methods to extract meaningful patterns from those data. Within data-driven inquiry, researchers are expected to use data as their starting point for inductive inference, without relying on theoretical preconceptions—a situation described by advocates as “the end of theory”, in contrast to theory-driven approaches where research consists of testing a hypothesis (Anderson 2008, Hey et al. 2009). In principle at least, big data constitute the largest pool of data ever assembled and thus a strong starting point to search for correlations (Mayer-Schönberger & Cukier 2013). Crucial to the credibility of the data-driven approach is the efficacy of the methods used to extrapolate patterns from data and evaluate whether or not such patterns are meaningful, and what “meaning” may involve in the first place. Hence, some philosophers and data scholars have argued that
the most important and distinctive characteristic of Big Data [is] its use of statistical methods and computational means of analysis, (Symons & Alvarado 2016: 4)
such as for instance machine learning tools, deep neural networks and other “intelligent” practices of data handling.
The emphasis on statistics as key adjudicator of validity and reliability of patterns extracted from data is not novel. Exponents of logical empiricism looked for logically watertight methods to secure and justify inference from data, and their efforts to develop a theory of probability proceeded in parallel with the entrenchment of statistical reasoning in the sciences in the first half of the twentieth century (Romeijn 2017). In the early 1960s, Patrick Suppes offered a seminal link between statistical methods and the philosophy of science through his work on the production and interpretation of data models. As a philosopher deeply embedded in experimental practice, Suppes was interested in the means and motivations of key statistical procedures for data analysis such as data reduction and curve fitting. He argued that once data are adequately prepared for statistical modelling, all the concerns and choices that motivated data processing become irrelevant to their analysis and interpretation. This inspired him to differentiate between models of theory, models of experiment and models of data, noting that such different components of inquiry are governed by different logics and cannot be compared in a straightforward way. For instance,
the precise definition of models of the data for any given experiment requires that there be a theory of the data in the sense of the experimental procedure, as well as in the ordinary sense of the empirical theory of the phenomena being studied. (Suppes 1962: 253)
Suppes viewed data models as necessarily statistical: that is, as objects
designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory. (Suppes 1962: 258)
His formal definition of data models reflects this decision, with statistical requirements such as homogeneity, stationarity and order identified as the ultimate criteria to identify a data model Z and evaluate its adequacy:
\(Z\) is an \(N\)-fold model of the data for experiment \(Y\) if and only if there is a set \(Y\) and a probability measure \(P\) on subsets of \(Y\) such that \(\mathcal{Y} = \langle Y, P\rangle\) is a model of the theory of the experiment, \(Z\) is an \(N\)-tuple of elements of \(Y\), and \(Z\) satisfies the statistical tests of homogeneity, stationarity and order. (1962: 259)
This analysis of data models portrayed statistical methods as key conduits between data and theory, and hence as crucial components of inferential reasoning.
The focus on statistics as entry point to discussions of inference from data was widely promoted in subsequent philosophical work. Prominent examples include Deborah Mayo, who in her book Error and the Growth of Experimental Knowledge asked:
What should be included in data models? The overriding constraint is the need for data models that permit the statistical assessment of fit (between prediction and actual data); (Mayo 1996: 136)
and Bas van Fraassen, who also embraced the idea of data models as “summarizing relative frequencies found in data” (Van Fraassen 2008: 167). Closely related is the emphasis on statistics as means to detect error within datasets in relation to specific hypotheses, most prominently endorsed by the error-statistical approach to inference championed by Mayo and Aris Spanos (Mayo & Spanos 2009a). This approach aligns with the emphasis on computational methods for data analysis within big data research, and supports the idea that the better the inferential tools and methods, the better the chance to extract reliable knowledge from data.
When it comes to addressing methodological challenges arising from the computational analysis of big data, however, statistical expertise needs to be complemented by computational savvy in the training and application of algorithms associated to artificial intelligence, including machine learning but also other mathematical procedures for operating upon data (Bringsjord & Govindarajulu 2018). Consider for instance the problem of overfitting, i.e., the mistaken identification of patterns in a dataset, which can be greatly amplified by the training techniques employed by machine learning algorithms. There is no guarantee that an algorithm trained to successfully extrapolate patterns from a given dataset will be as successful when applied to other data. Common approaches to this problem involve the re-ordering and partitioning of both data and training methods, so that it is possible to compare the application of the same algorithms to different subsets of the data (“cross-validation”), combine predictions arising from differently trained algorithms (“ensembling”) or use hyperparameters (parameters whose value is set prior to data training) to prepare the data for analysis.
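The cross-validation strategy just described is simple to demonstrate. The sketch below compares two differently trained learners on the same folds of a synthetic dataset; the dataset and the choice of learners are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Few informative features hidden among many: a setting that invites overfitting.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    # 5-fold cross-validation: train on four partitions, test on the held-out fifth.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (spread {scores.std():.3f})")
```

A large gap between training performance and the cross-validated scores is the standard symptom of overfitting; averaging or voting over the two models would be a simple instance of the ensembling strategy also mentioned above.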
Handling these issues, in turn, requires
familiarity with the mathematical operations in question, their implementations in code, and the hardware architectures underlying such implementations. (Lowrie 2017: 3)
For instance, machine learning
aims to build programs that develop their own analytic or descriptive approaches to a body of data, rather than employing ready-made solutions such as rule-based deduction or the regressions of more traditional statistics. (Lowrie 2017: 4)
In other words, statistics and mathematics need to be complemented by expertise in programming and computer engineering. The ensemble of skills thus construed results in a specific epistemological approach to research, which is broadly characterised by an emphasis on the means of inquiry as the most significant driver of research goals and outputs. This approach, which Sabina Leonelli characterised as data-centric, involves “focusing more on the processes through which research is carried out than on its ultimate outcomes” (Leonelli 2016: 170). In this view, procedures, techniques, methods, software and hardware are the prime motors of inquiry and the chief influence on its outcomes. Focusing more specifically on computational systems, John Symons and Jack Horner argued that much of big data research consists of software-intensive science rather than data-driven research: that is, science that depends on software for its design, development, deployment and use, and thus encompasses procedures, types of reasoning and errors that are unique to software, such as the problems generated by attempts to map real-world quantities to discrete-state machines or to approximate numerical operations (Symons & Horner 2014: 473). Software-intensive science is arguably supported by an algorithmic rationality focused on the feasibility, practicality and efficiency of algorithms, which is typically assessed by reference to concrete situations of inquiry (Lowrie 2017).
Algorithms are enormously varied in their mathematical structures and underpinning conceptual commitments, and more philosophical work needs to be carried out on the specifics of computational tools and software used in data science and related applications—with emerging work in philosophy of computer science providing an excellent way forward (Turner & Angius 2019). Nevertheless, it is clear that whether or not a given algorithm successfully applies to the data at hand depends on factors that cannot be controlled through statistical or even computational methods: for instance, the size, structure and format of the data, the nature of the classifiers used to partition the data, the complexity of decision boundaries and the very goals of the investigation.
In a forceful critique informed by the philosophy of mathematics, Christian Calude and Giuseppe Longo argued that there is a fundamental problem with the assumption that more data will necessarily yield more information:
very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. (Calude & Longo 2017: 595)
They conclude that big data analysis is by definition unable to distinguish spurious from meaningful correlations and is therefore a threat to scientific research. A related worry, sometimes dubbed “the curse of dimensionality” by data scientists, concerns the extent to which the analysis of a given dataset can be scaled up in complexity and in the number of variables being considered. It is well known that the more dimensions one considers in classifying samples, for example, the larger the dataset needs to be for generalisations across those dimensions to be accurate. This demonstrates the continuing, tight dependence between the volume and quality of data on the one hand, and the type and breadth of research questions for which data need to serve as evidence on the other hand.
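Calude and Longo’s point is easy to reproduce numerically. The simulation below, a sketch under arbitrary parameter choices (50 samples, 2,000 variables), draws pure noise and still finds variable pairs that look strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_vars = 50, 2000     # "wide" data: far more variables than samples

# Pure noise: by construction, no variable is genuinely related to any other.
X = rng.normal(size=(n_samples, n_vars))

corr = np.corrcoef(X, rowvar=False)          # correlations between all columns
upper = corr[np.triu_indices(n_vars, k=1)]   # each pair counted once

print("pairs examined:", upper.size)
print("largest |r| found by chance:", round(float(np.abs(upper).max()), 3))
print("pairs with |r| > 0.5:", int((np.abs(upper) > 0.5).sum()))
```

Every correlation reported here is an artefact of the sheer number of comparisons, which is precisely the sense in which such patterns “appear only due to the size, not the nature, of data”.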
Determining the fit between inferential methods and data requires high levels of expertise and contextual judgement (a situation known within machine learning as the “no free lunch theorem”). Indeed, overreliance on software for inference and data modelling can yield highly problematic results. Symons and Horner note that the use of complex software in big data analysis makes margins of error unknowable, because there is no clear way to test them statistically (Symons & Horner 2014: 473). The path complexity of programs with high conditionality imposes limits on standard error correction techniques. As a consequence, there is no effective method for characterising the error distribution in the software except by testing all paths in the code, which is unrealistic and intractable in the vast majority of cases due to the complexity of the code.
Rather than acting as a substitute, the effective and responsible use of artificial intelligence tools in big data analysis requires the strategic exercise of human intelligence—but for this to happen, AI systems applied to big data need to be accessible to scrutiny and modification. Whether or not this is the case, and who is best qualified to exercise such scrutiny, is under dispute. Thomas Nickles argued that the increasingly complex and distributed algorithms used for data analysis follow in the footsteps of long-standing scientific attempts to transcend the limits of human cognition. The resulting epistemic systems may no longer be intelligible to humans: an “alien intelligence” within which “human abilities are no longer the ultimate criteria of epistemic success” (Nickles forthcoming). Such unbound cognition holds the promise of enabling powerful inferential reasoning from previously unimaginable volumes of data. The difficulties in contextualising and scrutinising such reasoning, however, cast doubt on the reliability of the results. It is not only machine learning algorithms that are becoming increasingly inaccessible to evaluation: beyond the complexities of programming code, computational data analysis requires a whole ecosystem of classifications, models, networks and inference tools which typically have different histories and purposes, and whose relation to each other—and effects when they are used together—are far from understood and may well be untraceable.
This raises the question of whether the knowledge produced by such data analytic systems is at all intelligible to humans, and if so, what forms of intelligibility it yields. It is certainly the case that deriving knowledge from big data may not involve an increase in human understanding, especially if understanding is construed as an epistemic skill (de Regt 2017). This may not be a problem for those who await the rise of a new species of intelligent machines, who may master new cognitive tools in a way that humans cannot. But as Nickles, Nicholas Rescher (1984), Werner Callebaut (2012) and others pointed out, even in that case “we would not have arrived at perspective-free science” (Nickles forthcoming). While the human histories and assumptions interwoven into these systems may be hard to disentangle, they still affect their outcomes; and whether or not these processes of inquiry are open to critical scrutiny, their telos, implications and significance for life on the planet arguably should be. As argued by Dan McQuillan (2018), the increasing automation of big data analytics may foster acceptance of a Neoplatonist machinic metaphysics, within which mathematical structures “uncovered” by AI would trump any appeal to human experience. Luciano Floridi echoes this intuition in his analysis of what he calls the infosphere:
The great opportunities offered by Information and Communication Technologies come with a huge intellectual responsibility to understand them and take advantage of them in the right way. (2014: vii)
These considerations parallel Paul Humphreys’s long-standing critique of computer simulations as epistemically opaque (Humphreys 2004, 2009)—and particularly his definition of what he calls essential epistemic opacity:
A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process. (Humphreys 2009: 618)
Different facets of the general problem of epistemic opacity are stressed within the vast philosophical scholarship on the role of modelling, computing and simulations in the sciences: the implications of lacking experimental access to the concrete parts of the world being modelled, for instance (Morgan 2005; Parker 2009; Radder 2009); the difficulties in testing the reliability of computational methods used within simulations (Winsberg 2010; Morrison 2015); the relation between opacity and justification (Durán & Formanek 2018); the forms of black-boxing associated with mechanistic reasoning implemented in computational analysis (Craver and Darden 2013; Bechtel 2016); and the debate over the intrinsic limits of computational approaches and related expertise (Collins 1990; Dreyfus 1992). Roman Frigg and Julian Reiss argued that such issues do not constitute fundamental challenges to the nature of inquiry and modelling, and in fact exist in a continuum with traditional methodological issues well-known within the sciences (Frigg & Reiss 2009). Whether or not one agrees with this position (Humphreys 2009; Beisbart 2012), big data analysis is clearly pushing computational and statistical methods to their limit, thus highlighting the boundaries to what even technologically augmented human beings are capable of knowing and understanding.
Research on big data analysis thus sheds light on elements of the research process that cannot be fully controlled, rationalised or even considered through recourse to formal tools.
One such element is the work required to present empirical data in a machine-readable format that is compatible with the software and analytic tools at hand. Data need to be selected, cleaned and prepared to be subjected to statistical and computational analysis. The processes involved in separating data from noise, clustering data so that it is tractable, and integrating data of different formats turn out to be highly sophisticated and theoretically structured, as demonstrated for instance by James McAllister’s (1997, 2007, 2011) and Uljana Feest’s (2011) work on data patterns, Marcel Boumans’s and Leonelli’s comparison of clustering principles across fields (forthcoming), and James Griesemer’s (forthcoming) and Mary Morgan’s (forthcoming) analyses of the peculiarities of datasets. Suppes was so concerned by what he called the “bewildering complexity” of data production and processing activities that he worried that philosophers would not appreciate the ways in which statistics can and does help scientists to abstract data away from such complexity. He described the large group of research components and activities used to prepare data for modelling as “pragmatic aspects” encompassing “every intuitive consideration of experimental design that involved no formal statistics” (Suppes 1962: 258), and positioned them as the lowest step of his hierarchy of models, at the opposite end from its pinnacle, the models of theory. Despite recent efforts to rehabilitate the methodology of inductive-statistical modelling and inference (Mayo & Spanos 2009b), this approach has been shared by many philosophers who regard processes of data production and processing as so chaotic as to defy systematic analysis. This explains why data have received so little consideration in philosophy of science when compared to models and theory.
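The “pragmatic” preparation steps Suppes worried about are anything but trivial in practice. As a small illustration (with synthetic data and arbitrary choices of cleaning rule and cluster count, all our own assumptions), the sketch below performs the kind of selection, cleaning, and clustering described above; each step embodies a substantive judgement about what counts as noise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
data = rng.normal(size=(300, 4))
data[rng.integers(0, 300, 20), rng.integers(0, 4, 20)] = np.nan  # missing values

# "Cleaning": dropping incomplete records is itself a theoretical choice.
clean = data[~np.isnan(data).any(axis=1)]

# "Preparation": rescaling decides how much each variable counts.
scaled = StandardScaler().fit_transform(clean)

# "Clustering": the choice of k imposes a structure the data must fit.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print("records kept:", len(clean), "| cluster sizes:", np.bincount(labels))
```

None of these steps is dictated by the data themselves, which is exactly the point made above: what looks like mechanical preprocessing encodes theoretical commitments.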
The question of how data are defined and identified, however, is crucial for understanding the role of big data in scientific research. Let us now consider two philosophical views—the representational view and the relational view—that are both compatible with the emergence of big data, and yet place emphasis on different aspects of that phenomenon, with significant implications for understanding the role of data within inferential reasoning and, as we shall see in the next section, as evidence. The representational view construes data as reliable representations of reality which are produced via the interaction between humans and the world. The interactions that generate data can take place in any social setting regardless of research purposes. Examples range from a biologist measuring the circumference of a cell in the lab and noting the result in an Excel file, to a teacher counting the number of students in her class and transcribing it in the class register. What counts as data in these interactions are the objects created in the process of description and/or measurement of the world. These objects can be digital (the Excel file) or physical (the class register) and form a footprint of a specific interaction with the natural world. This footprint—“trace” or “mark”, in the words of Ian Hacking (1992) and Hans-Jörg Rheinberger (2011), respectively—constitutes a crucial reference point for analytic study and for the extraction of new insights. This is the reason why data forms a legitimate foundation to empirical knowledge: the production of data is equivalent to “capturing” features of the world that can be used for systematic study. According to the representational approach, data are objects with fixed and unchangeable content, whose meaning, in virtue of being representations of reality, needs to be investigated and revealed step-by-step through adequate inferential methods. The data documenting cell shape can be modelled to test the relevance of shape to the elasticity, permeability and resilience of cells, producing an evidence base to understand cell-to-cell signalling and development. The data produced counting students in class can be aggregated with similar data collected in other schools, producing an evidence base to evaluate the density of students in the area and their school attendance frequency.
This reflects the intuition that data, especially when they come in the form of numerical measurements or images such as photographs, somehow mirror the phenomena that they are created to document, thus providing a snapshot of those phenomena that is amenable to study under the controlled conditions of research. It also reflects the idea of data as “raw” products of research, which are as close as it gets to unmediated knowledge of reality. This makes sense of the truth-value sometimes assigned to data as irrefutable sources of evidence—the Popperian idea that if data are found to support a given claim, then that claim is corroborated as true at least as long as no other data are found to disprove it. Data in this view represent an objective foundation for the acquisition of knowledge and this very objectivity—the ability to derive knowledge from human experience while transcending it—is what makes knowledge empirical. This position is well-aligned with the idea that big data is valuable to science because it facilitates the (broadly understood) inductive accumulation of knowledge: gathering data collected via reliable methods produces a mountain of facts ready to be analysed and, the more facts are produced and connected with each other, the more knowledge can be extracted.
Philosophers have long acknowledged that data do not speak for themselves and different types of data require different tools for analysis and preparation to be interpreted (Bogen 2009 [2013]). According to the representational view, there are correct and incorrect ways of interpreting data, which those responsible for data analysis need to uncover. But what is a “correct” interpretation in the realm of big data, where data are consistently treated as mobile entities that can, at least in principle, be reused in countless different ways and towards different objectives? Perhaps more than at any other time in the history of science, the current mobilisation and re-use of big data highlights the degree to which data interpretation—and with it, whatever data is taken to represent—may differ depending on the conceptual, material and social conditions of inquiry. The analysis of how big data travels across contexts shows that the expectations and abilities of those involved determine not only the way data are interpreted, but also what is regarded as “data” in the first place (Leonelli & Tempini forthcoming). The representational view of data as objects with fixed and contextually independent meaning is at odds with these observations.
An alternative approach is to embrace these findings and abandon the idea of data as fixed representations of reality altogether. Within the relational view, data are objects that are treated as potential or actual evidence for scientific claims in ways that can, at least in principle, be scrutinised and accounted for (Leonelli 2016). The meaning assigned to data depends on their provenance, their physical features and what these features are taken to represent, and the motivations and instruments used to visualise them and to defend specific interpretations. The reliability of data thus depends on the credibility and strictness of the processes used to produce and analyse them. The presentation of data; the way they are identified, selected, and included (or excluded) in databases; and the information provided to users to re-contextualise them are fundamental to producing knowledge and significantly influence its content. For instance, changes in data format—as most obviously involved in digitisation, data compression or archival procedures—can have a significant impact on where, when, and by whom the data are used as a source of knowledge.
This framework acknowledges that any object can be used as a datum, or stop being used as such, depending on the circumstances—a consideration familiar to big data analysts used to pick and mix data coming from a vast variety of sources. The relational view also explains how, depending on the research perspective interpreting it, the same dataset may be used to represent different aspects of the world (“phenomena” as famously characterised by James Bogen and James Woodward, 1988). When considering the full cycle of scientific inquiry from the viewpoint of data production and analysis, it is at the stage of data modelling that a specific representational value is attributed to data (Leonelli 2019b).
The relational view of data encourages attention to the history of data, highlighting their continual evolution and sometimes radical alteration, and the impact of this feature on the power of data to confirm or refute hypotheses. It explains the critical importance of documenting data management and transformation processes, especially with big data that transit far and wide over digital channels and are grouped and interpreted in different ways and formats. It also explains the increasing recognition of the expertise of those who produce, curate, and analyse data as indispensable to the effective interpretation of big data within and beyond the sciences; and the inextricable link between social and ethical concerns around the potential impact of data sharing and scientific concerns around the quality, validity, and security of data (boyd & Crawford 2012; Tempini & Leonelli, 2018).
Depending on which view on data one takes, expectations around what big data can do for science will vary dramatically. The representational view accommodates the idea of big data as providing the most comprehensive, reliable and generative knowledge base ever witnessed in the history of science, by virtue of its sheer size and heterogeneity. The relational view makes no such commitment, focusing instead on what inferences are being drawn from such data at any given point, how and why.
One thing that the representational and relational views agree on is the key epistemic role of data as empirical evidence for knowledge claims or interventions. Yet while there is a large philosophical literature on the nature of evidence (e.g., Achinstein 2001; Reiss 2015; Kelly 2016), the relation between data and evidence has received less attention. This is arguably due to an implicit acceptance, by many philosophers, of the representational view of data. Within the representational view, the identification of what counts as data is prior to the study of what those data can be evidence for: in other words, data are “givens”, as the etymology of the word indicates, and inferential methods are responsible for determining whether and how the data available to investigators can be used as evidence, and for what. The focus of philosophical attention is thus on formal methods to single out errors and misleading interpretations, and on the probabilistic and/or explanatory relation between what is unproblematically taken to be a body of evidence and a given hypothesis. Hence much of the expansive philosophical work on evidence avoids the term “data” altogether. Peter Achinstein’s seminal work is a case in point: it discusses observed facts and experimental results, and whether and under which conditions scientists would have reasons to believe such facts, but it makes no mention of data and related processing practices (Achinstein 2001).
By contrast, within the relational view an object can only be identified as a datum when it is viewed as having value as evidence. Evidence becomes a category of data identification, rather than a category of data use as in the representational view (Canali 2019). Evidence is thus constitutive of the very notion of data and cannot be disentangled from it. This involves accepting that the conditions under which a given object can serve as evidence—and thus be viewed as a datum—may change; and that should this evidential role stop altogether, the object would revert to being an ordinary, non-datum item. For example, the photograph of a plant taken by a tourist in a remote region may become relevant as evidence for an inquiry into the morphology of plants from that particular locality; yet most photographs of plants are never considered as evidence for an inquiry into the features and functioning of the world, and of those that are, many are subsequently discarded as uninteresting or no longer pertinent to the questions being asked.
This view accounts for the mobility and repurposing that characterise big data use, and for the possibility that objects that were not originally generated in order to serve as evidence may be subsequently adopted as such. Consider Mayo and Spanos’s “minimal scientific principle for evidence”, which they define as follows:
Data x₀ provide poor evidence for H if they result from a method or procedure that has little or no ability of finding flaws in H, even if H is false. (Mayo & Spanos 2009b)
This principle is compatible with the relational view of data since it incorporates cases where the methods used to generate and process data may not have been geared towards the testing of a hypothesis H: all it asks is that such methods can be made relevant to the testing of H at the point at which data are used as evidence for H (I shall come back to the role of hypotheses in the handling of evidence in the next section).
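The principle also lends itself to a schematic rendering. The following probabilistic paraphrase is my own gloss rather than Mayo and Spanos’s formalism; M stands for the method or procedure that generated and processed the data:

$$P\big(M \text{ detects a flaw in } H \mid H \text{ is false}\big) \approx 0 \;\Longrightarrow\; x_0 \text{ provides poor evidence for } H.$$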
The relational view also highlights the relevance of practices of data formatting and manipulation to the treatment of data as evidence, thus shifting attention away from the characteristics of the data objects alone and towards the agency attached to and enabled by those characteristics. Nora Boyd has provided a way to conceptualise data processing as an integral part of inferential processes, and thus of how we should understand evidence. To this end she introduced the notion of “line of evidence”, which she defines as:
a sequence of empirical results including the records of data collection and all subsequent products of data processing generated on the way to some final empirical constraint. (Boyd 2018:406)
She thus proposes a conception of evidence that embraces both data and the way in which data are handled, and indeed emphasises the importance of auxiliary information used when assessing data for interpretation, which includes
the metadata regarding the provenance of the data records and the processing workflow that transforms them. (2018: 407)
As she concludes,
together, a line of evidence and its associated metadata compose what I am calling an “enriched line of evidence”. The evidential corpus is then to be made up of many such enriched lines of evidence. (2018: 407)
The relational view thus fosters a functional and contextualist approach to evidence as the manner through which one or more objects are used as warrant for particular knowledge items (which can be propositional claims, but also actions such as specific decisions or modes of conduct/ways of operating). This chimes with the contextual view of evidence defended by Reiss (2015), John Norton’s work on the multiple, tangled lines of inferential reasoning underpinning appeals to induction (2003), and Hasok Chang’s emphasis on the epistemic activities required to ground evidential claims (2012). Building on these ideas and on Stephen Toulmin’s seminal work on research schemas (1958), Alison Wylie has gone one step further in evaluating the inferential scaffolding that researchers (and particularly archaeologists, who so often are called to re-evaluate the same data as evidence for new claims; Wylie 2017) need to make sense of their data, interpret them in ways that are robust to potential challenges, and modify interpretations in the face of new findings. This analysis enabled Wylie to formulate a set of conditions for robust evidential reasoning, which include epistemic security in the chain of evidence, causal anchoring and causal independence of the data used as evidence, as well as the explicit articulation of the grounds for calibration of the instruments and methods involved (Chapman & Wylie 2016; Wylie forthcoming). A similar conclusion is reached by Jessey Wright’s evaluation of the diverse data analysis techniques that neuroscientists use to make sense of functional magnetic resonance imaging of the brain (fMRI scans):
different data analysis techniques reveal different patterns in the data. Through the use of multiple data analysis techniques, researchers can produce results that are locally robust. (Wright 2017: 1179)
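Wright’s point can be illustrated with a toy computational analogue (my own sketch, far simpler than the fMRI pipelines he examines): two analysis techniques resting on different assumptions are applied to the same synthetic dataset, and their agreement is treated as lending local robustness to the detected pattern.

```python
# A toy analogue of local robustness (an illustrative sketch, not Wright's
# fMRI case): two techniques with different assumptions probe the same data.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)   # noisy monotone relationship

r_linear, _ = pearsonr(x, y)    # assumes a linear association
r_rank, _ = spearmanr(x, y)     # assumes only a monotone (rank) association

# Agreement between techniques that rest on different assumptions lends
# local robustness to the claim that x and y co-vary.
print(f"Pearson: {r_linear:.2f}, Spearman: {r_rank:.2f}")
```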
Wylie’s and Wright’s analyses exemplify how a relational approach to data fosters a normative understanding of “good evidence” which is anchored in situated judgement—the arguably human prerogative to contextualise and assess the significance of evidential claims. The advantages of this view of evidence are eloquently expressed by Nancy Cartwright’s critique of both philosophical theories and policy approaches that do not recognise the local and contextual nature of evidential reasoning. As she notes,
we need a concept that can give guidance about what is relevant to consider in deciding on the probability of the hypothesis, not one that requires that we already know significant facts about the probability of the hypothesis on various pieces of evidence. (Cartwright 2013: 6)
Thus she argues for a notion of evidence that is not too restrictive, takes account of the difficulties in combining and selecting evidence, and allows for contextual judgement on what types of evidence are best suited to the inquiry at hand (Cartwright 2013, 2019). Reiss’s pragmatic theory of evidence similarly
takes scientific practice [..] seriously, both in terms of its greater use of knowledge about the conditions under which science is practised and in terms of its goal to develop insights that are relevant to practising scientists. (Reiss 2015: 361)
A better characterisation of the relation between data and evidence, predicated on the study of how data are processed and aggregated, may go a long way towards addressing these demands. As aptly argued by James Woodward, the evidential relationship between data and claims is not “a purely formal, logical, or a priori matter” (Woodward 2000: S172–173). This again sits uneasily with the expectation that big data analysis may automate scientific discovery and make human judgement redundant.
Let us now return to the idea of data-driven inquiry, often suggested as a counterpoint to hypothesis-driven science (e.g., Hey et al. 2009). Kevin Elliott and colleagues have offered a brief history of hypothesis-driven inquiry (Elliott et al. 2016), emphasising how scientific institutions (including funding programmes and publication venues) have pushed researchers towards a Popperian conceptualisation of inquiry as the formulation and testing of a strong hypothesis. Big data analysis clearly points to a different and arguably Baconian understanding of the role of hypothesis in science. Theoretical expectations are no longer seen as driving the process of inquiry, and empirical input is recognised as primary in determining the direction of research and the phenomena—and related hypotheses—considered by researchers.
The emphasis on data as a central component of research poses a significant challenge to one of the best-established philosophical views on scientific knowledge. According to this view, which I shall label the theory-centric view of science, scientific knowledge consists of justified true beliefs about the world. These beliefs are obtained through empirical methods aiming to test the validity and reliability of statements that describe or explain aspects of reality. Hence scientific knowledge is conceptualised as inherently propositional: what counts as an output are claims published in books and journals, which are also typically presented as solutions to hypothesis-driven inquiry. This view acknowledges the significance of methods, data, models, instruments and materials within scientific investigations, but ultimately regards them as means towards one end: the achievement of true claims about the world. Reichenbach’s seminal distinction between contexts of discovery and justification exemplifies this position (Reichenbach 1938). Theory-centrism recognises research components such as data and related practical skills as essential to discovery, and more specifically to the messy, irrational part of scientific work that involves value judgements, trial-and-error, intuition and exploration and within which the very phenomena to be investigated may not have been stabilised. The justification of claims, by contrast, involves the rational reconstruction of the research that has been performed, so that it conforms to established norms of inferential reasoning. Importantly, within the context of justification, only data that support the claims of interest are explicitly reported and discussed: everything else—including the vast majority of data produced in the course of inquiry—is lost to the chaotic context of discovery.
Much recent philosophy of science, particularly work on modelling and experimentation, has challenged theory-centrism by highlighting the role of models, methods and modes of intervention as research outputs rather than simple tools, and stressing the importance of expanding philosophical understandings of scientific knowledge to include these elements alongside propositional claims. The rise of big data offers another opportunity to reframe understandings of scientific knowledge as not necessarily centred on theories and to include non-propositional components—thus, in Cartwright’s paraphrase of Gilbert Ryle’s famous distinction, refocusing on knowing-how over knowing-that (Cartwright 2019). One way to construe data-centric methods is indeed to embrace a conception of knowledge as ability, such as that promoted by early pragmatists like John Dewey and more recently reprised by Chang, who specifically highlighted it as the broader category within which the understanding of knowledge-as-information needs to be placed (Chang 2017).
Another way to interpret the rise of big data is as a vindication of inductivism in the face of the barrage of philosophical criticism levelled against theory-free reasoning over the centuries. For instance, Jon Williamson (2004: 88) has argued that advances in automation, combined with the emergence of big data, lend plausibility to inductivist philosophy of science. Wolfgang Pietsch agrees with this view and has provided a sophisticated framework to understand just what kind of inductive reasoning is instigated by big data and related machine learning methods such as decision trees (Pietsch 2015). Following John Stuart Mill, he calls this approach variational induction and presents it as common to both big data approaches and exploratory experimentation, though the former can handle a much larger number of variables (Pietsch 2015: 913). Pietsch concludes that the problem of theory-ladenness in machine learning can be addressed by determining under which theoretical assumptions variational induction works (2015: 910ff).
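To make the idea of variational induction more concrete, consider the following minimal sketch (my illustration rather than Pietsch’s own example; the data are synthetic and every name is a placeholder). A decision tree learner induces conditional rules purely from variation across observed instances, with no structural hypothesis supplied in advance:

```python
# A minimal sketch of variational induction in machine-learning practice
# (an illustrative example, not Pietsch's own; data are synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Many candidate variables, no prior hypothesis about which ones matter.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The tree searches over variations in each variable for splits that track
# differences in the outcome; no structural model is supplied in advance.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree))                      # the induced conditional rules
print("held-out accuracy:", tree.score(X_test, y_test))
```

Whether such induced rules are genuinely theory-free is precisely what is at issue here: choices such as the depth limit and the encoding of the variables embody background assumptions of the kind Pietsch aims to make explicit.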
Others are less inclined to see theory-ladenness as a problem that can be mitigated by data-intensive methods, and rather see it as a constitutive part of the process of empirical inquiry. Harking back to the extensive literature on perspectivism and experimentation (Gooding 1990; Giere 2006; Radder 2006; Massimi 2012), Werner Callebaut has forcefully argued that the most sophisticated and standardised measurements embody a specific theoretical perspective, and this is no less true of big data (Callebaut 2012). Elliott and colleagues emphasise that conceptualising big data analysis as atheoretical risks encouraging unsophisticated attitudes to empirical investigation as a
“fishing expedition”, having a high probability of leading to nonsense results or spurious correlations, being reliant on scientists who do not have adequate expertise in data analysis, and yielding data biased by the mode of collection. (Elliott et al. 2016: 880)
To address related worries in genetic analysis, Ken Waters has provided the useful characterisation of “theory-informed” inquiry (Waters 2007), which can be invoked to stress how theory informs the methods used to extract meaningful patterns from big data, and yet does not necessarily determine either the starting point or the outcomes of data-intensive science. This does not resolve the question of what role theory actually plays. Rob Kitchin (2014) has proposed to see big data as linked to a new mode of hypothesis generation within a hypothetical-deductive framework. Leonelli is more sceptical of attempts to match big data approaches, which are many and diverse, with a specific type of inferential logic. She focuses instead on the extent to which the theoretical apparatus at work within big data analysis rests on conceptual decisions about how to order and classify data, and proposes that such decisions can give rise to a particular form of theorisation, which she calls classificatory theory (Leonelli 2016).
These disagreements point to big data as eliciting diverse understandings of the nature of knowledge and inquiry, and of the complex iterations through which different inferential methods build on each other. Again, in the words of Elliott and colleagues,
attempting to draw a sharp distinction between hypothesis-driven and data-intensive science is misleading; these modes of research are not in fact orthogonal and often intertwine in actual scientific practice. (Elliott et al. 2016: 881, see also O’Malley et al. 2009, Elliott 2012)
Another epistemological debate strongly linked to reflection on big data concerns the specific kinds of knowledge emerging from data-centric forms of inquiry, and particularly the relation between predictive and causal knowledge.
Big data science is widely seen as revolutionary in the scale and power of predictions that it can support. Unsurprisingly perhaps, a philosophically sophisticated defence of this position comes from the philosophy of mathematics, where Marco Panza, Domenico Napoletani and Daniele Struppa argued for big data science as occasioning a momentous shift in the predictive knowledge that mathematical analysis can yield, and thus its role within broader processes of knowledge production. The whole point of big data analysis, they posit, is its disregard for causal knowledge:
answers are found through a process of automatic fitting of the data to models that do not carry any structural understanding beyond the actual solution of the problem itself. (Napoletani, Panza, & Struppa 2014: 486)
This view differs from simplistic popular discourse on “the end of theory” (Anderson 2008) and the “power of correlations” (Mayer-Schönberger and Cukier 2013) insofar as it does not side-step the constraints on the knowledge and generalisations that can be extracted from big data analysis. Napoletani, Panza and Struppa recognise that there are inescapable tensions around the ability of mathematical reasoning to overdetermine empirical input, to the point of providing a justification for any and every possible interpretation of the data. In their words,
the problem arises of how we can gain meaningful understanding of historical phenomena, given the tremendous potential variability of their developmental processes. (Napoletani et al. 2014: 487)
Their solution is to clarify that understanding phenomena is not the goal of predictive reasoning, which is rather a form of agnostic science: “the possibility of forecasting and analysing without a structured and general understanding” (Napoletani et al. 2011: 12). The opacity of algorithmic rationality thus becomes its key virtue and the reason for the extraordinary epistemic success of forecasting grounded on big data. While “the phenomenon may forever remain hidden to our understanding” (ibid.: 5), the application of mathematical models and algorithms to big data can still provide meaningful and reliable answers to well-specified problems—similarly to what has been argued in the case of false models (Wimsatt 2007). Examples include the use of “forcing” methods such as regularisation or diffusion geometry to facilitate the extraction of useful insights from messy datasets.
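As a minimal illustration of such a “forcing” method (my own sketch, using regularised linear regression on synthetic data rather than the authors’ diffusion-geometry examples): ridge regression forces a stable predictive fit out of noisy, collinear inputs while offering no structural account of the process that generated them.

```python
# A sketch of "forcing" via regularisation (an illustrative example, not the
# authors' own): a stable predictive fit with no structural understanding.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # nearly collinear columns
y = 3 * X[:, 0] + rng.normal(size=n)

# The penalty (alpha) forces an answer even where ordinary least squares
# would be unstable; the model predicts without explaining.
model = Ridge(alpha=1.0).fit(X, y)
print("in-sample R^2:", round(model.score(X, y), 3))
```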
This view is at odds with accounts that posit scientific understanding as a key aim of science (de Regt 2017), and the intuition that what researchers are ultimately interested in is
whether the opaque data-model generated by machine-learning technologies count as explanations for the relationships found between input and output. (Boon 2020: 44)
Within the philosophy of biology, for example, it is well recognised that big data facilitates effective extraction of patterns and trends, and that being able to model and predict how an organism or ecosystem may behave in the future is of great importance, particularly within more applied fields such as biomedicine or conservation science. At the same time, researchers are interested in understanding the reasons for observed correlations, and typically use predictive patterns as heuristics to explore, develop and verify causal claims about the structure and functioning of entities and processes. Emanuele Ratti (2015) has argued that big data mining within the genome-wide association studies commonly used in cancer genomics can actually underpin mechanistic reasoning, for instance by supporting eliminative inference to develop mechanistic hypotheses and by helping to explore and evaluate generalisations used to analyse the data. In a similar vein, Pietsch (2016) proposed to use variational induction as a method to establish what counts as causal relationships among big data patterns, by focusing on which analytic strategies allow for reliable prediction and effective manipulation of a phenomenon.
Through the study of data sourcing and processing in epidemiology, Stefano Canali has instead highlighted the difficulties of deriving mechanistic claims from big data analysis, particularly where data are varied and embody incompatible perspectives and methodological approaches (Canali 2016, 2019). Relatedly, the semantic and logistical challenges of organising big data give reason to doubt the reliability of causal claims extracted from such data. In terms of logistics, having a lot of data is not the same as having all of them, and cultivating illusions of comprehensiveness is a risky and potentially misleading strategy, particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of “omics” approaches (see also the next section). The constant worry about the partiality and reliability of data is reflected in the care put by database curators into enabling database users to assess such properties, and in the importance given by researchers themselves, particularly in the biological and environmental sciences, to evaluating the quality of data found on the internet (Leonelli 2014, Fleming et al. 2017). In terms of semantics, we are back to the role of data classifications as theoretical scaffolding for big data analysis that we discussed in the previous section. Taxonomic efforts to order and visualise data inform causal reasoning extracted from such data (Sterner & Franz 2017), and can themselves constitute a bottom-up method—grounded in comparative reasoning—for assigning meaning to data models, particularly in situations where a full-blown theory or explanation for the phenomenon under investigation is not available (Sterner 2014).
It is no coincidence that much philosophical work on the relation between causal and predictive knowledge extracted from big data comes from the philosophy of the life sciences, where the absence of axiomatized theories has elicited sophisticated views on the diversity of forms and functions of theory within inferential reasoning. Moreover, biological data are heterogeneous both in their content and in their format; are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and present curators with specific challenges to do with tracking complex, diverse and evolving organismal structures and behaviours, whose relation to an ever-changing environment is hard to pinpoint with any stability (e.g., Shavit & Griesemer 2009). Hence in this domain, some of the core methods and epistemic concerns of experimental research—including exploratory experimentation, sampling and the search for causal mechanisms—remain crucial parts of data-centric inquiry.
At the start of this entry I listed “value” as a major characteristic of big data and pointed to the crucial role of valuing procedures in identifying, processing, modelling and interpreting data as evidence. Identifying and negotiating different forms of data value is an unavoidable part of big data analysis, since these valuation practices determine which data are made available to whom, under which conditions and for which purposes. What researchers choose to consider as reliable data (and data sources) is closely intertwined not only with their research goals and interpretive methods, but also with their approach to data production, packaging, storage and sharing. Thus, researchers need to consider what value their data may have for future research by themselves and others, and how to enhance that value—such as through decisions around which data to make public, how, when and in which format; or, whenever dealing with data already in the public domain (such as personal data on social media), decisions around whether the data should be shared and used at all, and how.
No matter how one conceptualises value practices, it is clear that their key role in data management and analysis prevents facile distinctions between values and “facts” (understood as propositional claims for which data provide evidential warrant). For example, consider a researcher who values both openness—and related practices of widespread data sharing—and scientific rigour, which requires a strict monitoring of the credibility and validity of conditions under which data are interpreted. The scale and manner of big data mobilisation and analysis create tensions between these two values. While the commitment to openness may prompt interest in data sharing, the commitment to rigour may hamper it, since once data are freely circulated online it becomes very difficult to retain control over how they are interpreted, by whom and with which knowledge, skills and tools. How a researcher responds to this conflict affects which data are made available for big data analysis, and under which conditions. Similarly, the extent to which diverse datasets may be triangulated and compared depends on the intellectual property regimes under which the data—and related analytic tools—have been produced. Privately owned data are often unavailable to publicly funded researchers; and many algorithms, cloud systems and computing facilities used in big data analytics are only accessible to those with enough resources to buy relevant access and training. Whatever claims result from big data analysis are, therefore, strongly dependent on social, financial and cultural constraints that condition the data pool and its analysis.
This prominent role of values in shaping data-related epistemic practices is not surprising given existing philosophical critiques of the fact/value distinction (e.g., Douglas 2009), and the existing literature on values in science—such as Helen Longino’s seminal distinction between constitutive and contextual values, as presented in her 1990 book Science as Social Knowledge—may well apply in this case too. Similarly, it is well-established that the technological and social conditions of research strongly condition its design and outcomes. What is particularly worrying in the case of big data is the temptation, prompted by hyped expectations around the power of data analytics, to hide or side-line the valuing choices that underpin the methods, infrastructures and algorithms used for big data extraction.
Consider the use of high-throughput data production tools, which enable researchers to easily generate a large volume of data in formats already geared to computational analysis. Just as in the case of other technologies, researchers have a strong incentive to adopt such tools for data generation; and may do so even in cases where such tools are not good or even appropriate means to pursue the investigation. Ulrich Krohs uses the term convenience experimentation to refer to experimental designs that are adopted not because they are the most appropriate ways of pursuing a given investigation, but because they are easily and widely available and usable, and thus “convenient” means for researchers to pursue their goals (Krohs 2012).
Appeals to convenience can extend to other aspects of data-intensive analysis. Not all data are equally easy to digitally collect, disseminate and link through existing algorithms, which makes some data types and formats more convenient than others for computational analysis. For example, research databases often display the outputs of well-resourced labs within research traditions which deal with “tractable” data formats (such as “omics”). And indeed, the existing distribution of resources, infrastructure and skills determines high levels of inequality in the production, dissemination and use of big data for research. Big players with large financial and technical resources are leading the development and uptake of data analytics tools, leaving much publicly funded research around the world at the receiving end of innovation in this area. Contrary to popular depictions of the data revolution as harbinger of transparency, democracy and social equality, the digital divide between those who can access and use data technologies, and those who cannot, continues to widen. A result of such divides is the scarcity of data relating to certain subgroups and geographical locations, which again limits the comprehensiveness of available data resources.
In the vast ecosystem of big data infrastructures, it is difficult to keep track of such distortions and assess their significance for data interpretation, especially in situations where heterogeneous data sources structured through appeal to different values are mashed together. Thus, the systematic aggregation of convenient datasets and analytic tools over others often results in a big data pool where the relevant sources and forms of bias are impossible to locate and account for (Pasquale 2015; O’Neil 2016; Zuboff 2017; Leonelli 2019a). In such a landscape, arguments for a separation between fact and value—and even a clear distinction between the role of epistemic and non-epistemic values in knowledge production—become very difficult to maintain without discrediting the whole edifice of big data science. Given the extent to which this approach has penetrated research in all domains, it is arguably impossible to critique the value-laden structure of big data science without calling into question the legitimacy of science itself. A more constructive approach is to embrace the extent to which big data science is anchored in human choices, interests and values, and to ascertain how this affects philosophical views on knowledge, truth and method.
In closing, it is important to consider at least some of the risks and related ethical questions raised by research with big data. As already mentioned in the previous section, reliance on big data collected by powerful institutions or corporations raises significant social concerns. Contrary to the view that sees big and open data as harbingers of democratic social participation in research, the way that scientific research is governed and financed is not challenged by big data. Rather, the increasing commodification of, and large value attributed to, certain kinds of data (e.g., personal data) is associated with an increase in inequality of power and visibility between different nations, segments of the population and scientific communities (O’Neil 2016; Zuboff 2017; D’Ignazio and Klein 2020). The gap between those who can not only access data but also use them, and those who cannot, is widening, leading from a state of digital divide to a condition of “data divide” (Bezuidenhout et al. 2017).
Moreover, the privatisation of data has serious implications for the world of research and the knowledge it produces. Firstly, it affects which data are disseminated, and with which expectations. Corporations usually release only data that they regard as having lesser commercial value and that they need public sector assistance to interpret. This introduces another distortion in the sources and types of data that are accessible online, while more expensive and complex data are kept secret. Even the ways in which citizens (researchers included) are encouraged to interact with databases and data interpretation sites tend to foster participation that generates further commercial value. Sociologists have recently described this type of social participation as a form of exploitation (Prainsack & Buyx 2017; Srnicek 2017). In turn, these ways of exploiting data strengthen their economic value over their scientific value. When it comes to the commerce of personal data between analytics companies, the value of data as commercial products (which includes an evaluation of the speed and efficiency with which access to certain data can help develop new products) often takes priority over scientific concerns such as the representativeness and reliability of the data and of the ways they were analysed. This can result in decisions that are scientifically problematic, or in a lack of interest in investigating the consequences of the assumptions made and the processes used. Such lack of interest easily translates into ignorance of discrimination, inequality and potential errors in the data considered. This type of ignorance is highly strategic and economically productive, since it enables the use of data without concern for their social and scientific implications. In this scenario, the evaluation of data quality shrinks to an evaluation of their usefulness for the short-term analyses or forecasts required by the client. There are no incentives in this system to encourage evaluation of the long-term implications of data analysis. The risk is that the commerce of data is accompanied by an increasing divergence between data and their context: interest in the history of data, in the plurality of their emotional or scientific value, and in the re-evaluation of their origins tends to fade over time, supplanted by the increasing hold of the financial value of data.
The multiplicity of data sources and tools for aggregation also creates risks. The complexity of the data landscape is making it harder to identify which parts of the infrastructure require updating or have been put in doubt by new scientific developments. The situation worsens when considering the number of databases that populate every area of scientific research, each containing assumptions that influence the circulation and interoperability of data and that often are not updated in a reliable and regular way. To provide an idea of the numbers involved, the prestigious journal Nucleic Acids Research publishes a special issue on new databases relevant to molecular biology every year: it included 56 new infrastructures in 2015, 62 in 2016, 54 in 2017 and 82 in 2018. These are just a small proportion of the hundreds of databases developed each year in the life sciences alone. Because these databases rely on short-term funding, a growing percentage of such resources remain available to consult online although they are effectively dead, a condition not always visible to users, who may trust a database without checking whether it is actively maintained. At what point do these infrastructures become obsolete? What are the risks involved in weaving an ever more extensive tapestry of infrastructures that depend on each other, given the disparity in the ways they are managed and the challenges in identifying and comparing their prerequisite conditions, the theories and the scaffolding used to build them? One of these risks is rampant conservatism: the insistence on recycling old data whose features and management become increasingly murky as time goes by, instead of encouraging the production of new data with features that specifically respond to the requirements and circumstances of their users. In disciplines such as biology and medicine, which study living beings that are by definition continually evolving and developing, such trust in old data is particularly alarming. It cannot be assumed, for example, that data collected on fungi ten, twenty or even a hundred years ago reliably explain the behaviour of the same species of fungi now or in the future (Leonelli 2018).
Researchers of what Luciano Floridi calls the infosphere (the way in which the introduction of digital technologies is changing the world) are becoming aware of the destructive potential of big data, and of the urgent need to direct efforts to manage and use data actively and thoughtfully towards improving the human condition. In Floridi’s own words:
ICT yields great opportunity which, however, entails the enormous intellectual responsibility of understanding this technology to use it in the most appropriate way. (Floridi 2014: vii; see also British Academy & Royal Society 2017)
In light of these findings, it is essential that ethical and social issues are seen as a core part of the technical and scientific requirements associated with data management and analysis. The ethical management of data is not achieved solely by regulating the commerce in research and personal data, nor by introducing oversight of research financing, important though these strategies are. To guarantee that big data are used in the most scientifically and socially forward-thinking way, it is necessary to transcend the concept of ethics as something external and alien to research. Analysis of the ethical implications of data science should become a basic component of the background and activity of those who take care of data and of the methods used to visualise and analyse them. Ethical evaluations and choices are hidden in every aspect of data management, including choices that may seem purely technical.
This entry has stressed how the emerging emphasis on big data signals the rise of a data-centric approach to research, in which efforts to mobilise, integrate, disseminate and visualise data are viewed as central contributions to discovery. The emergence of data-centrism highlights the challenges involved in gathering, classifying and interpreting data, and the concepts, technologies and institutions that surround these processes. Tools such as high-throughput measurement instruments and apps for smartphones are fast generating large volumes of data in digital formats. In principle, these data are immediately available for dissemination through internet platforms, which can make them accessible to anybody with a broadband connection in a matter of seconds. In practice, however, access to data is fraught with conceptual, technical, legal and ethical challenges; and even when access can be granted, it does not guarantee that the data can be fruitfully used to spur further research. Furthermore, the mathematical and computational tools developed to analyse big data are often opaque in their functioning and assumptions, leading to results whose scientific meaning and credibility may be difficult to assess. This increases the worry that big data science may be grounded upon, and ultimately supporting, the process of making human ingenuity hostage to an alien, artificial and ultimately unintelligible intelligence.
Perhaps the most confronting aspect of big data science as discussed in this entry is the extent to which it deviates from understandings of rationality grounded on individual agency and cognitive abilities (on which much of contemporary philosophy of science is predicated). The power of any one dataset to yield knowledge lies in the extent to which it can be linked with others: this is what lends high epistemic value to digital objects such as GPS locations or sequencing data, and what makes extensive data aggregation from a variety of sources into a highly effective surveillance tool. Data production and dissemination channels such as social media, governmental databases and research repositories operate in a globalised, interlinked and distributed network, whose functioning requires a wide variety of skills and expertise. The distributed nature of decision-making involved in developing big data infrastructures and analytics makes it impossible for any one individual to retain oversight over the quality, scientific significance and potential social impact of the knowledge being produced.
Big data analysis may therefore constitute the ultimate instance of a distributed cognitive system. Where does this leave accountability questions? Many individuals, groups and institutions end up sharing responsibility for the conceptual interpretation and social outcomes of specific data uses. A key challenge for big data governance is to find mechanisms for allocating responsibilities across this complex network, so that erroneous and unwarranted decisions—as well as outright fraudulent, unethical, abusive, discriminatory or misguided actions—can be singled out, corrected and appropriately sanctioned. Thinking about the complex history, processing and use of data can encourage philosophers to avoid ahistorical, uncontextualized approaches to questions of evidence, and instead consider the methods, skills, technologies and practices involved in handling data—and particularly big data—as crucial to understanding empirical knowledge-making.
- Achinstein, Peter, 2001, The Book of Evidence , Oxford: Oxford University Press. doi:10.1093/0195143892.001.0001
- Anderson, Chris, 2008, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired Magazine , 23 June 2008.
- Aronova, Elena, Karen S. Baker, and Naomi Oreskes, 2010, “Big science and big data in biology: From the International Geophysical Year through the International Biological Program to the Long Term Ecological Research (LTER) Network, 1957–present”, Historical Studies in the Natural Sciences , 40: 183–224.
- Aronova, Elena, Christine von Oertzen, and David Sepkoski, 2017, “Introduction: Historicizing Big Data”, Osiris , 32(1): 1–17. doi:10.1086/693399
- Bauer, Susanne, 2008, “Mining Data, Gathering Variables and Recombining Information: The Flexible Architecture of Epidemiological Studies”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 39(4): 415–428. doi:10.1016/j.shpsc.2008.09.008
- Bechtel, William, 2016, “Using Computational Models to Discover and Understand Mechanisms”, Studies in History and Philosophy of Science Part A , 56: 113–121. doi:10.1016/j.shpsa.2015.10.004
- Beisbart, Claus, 2012, “How Can Computer Simulations Produce New Knowledge?”, European Journal for Philosophy of Science , 2(3): 395–434. doi:10.1007/s13194-012-0049-7
- Bezuidenhout, Louise, Leonelli, Sabina, Kelly, Ann and Rappert, Brian, 2017, “Beyond the Digital Divide: Towards a Situated Approach to Open Data”. Science and Public Policy , 44(4): 464–475. doi: 10.1093/scipol/scw036
- Bogen, Jim, 2009 [2013], “Theory and Observation in Science”, in The Stanford Encyclopedia of Philosophy (Spring 2013 Edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2013/entries/science-theory-observation/ >.
- –––, 2010, “Noise in the World”, Philosophy of Science , 77(5): 778–791. doi:10.1086/656006
- Bogen, James and James Woodward, 1988, “Saving the Phenomena”, The Philosophical Review , 97(3): 303. doi:10.2307/2185445
- Bokulich, Alisa, 2018, “Using Models to Correct Data: Paleodiversity and the Fossil Record”, in S.I.: Abstraction and Idealization in Scientific Modelling by Synthese , 29 May 2018. doi:10.1007/s11229-018-1820-x
- Boon, Mieke, 2020, “How Scientists Are Brought Back into Science—The Error of Empiricism”, in A Critical Reflection on Automated Science , Marta Bertolaso and Fabio Sterpetti (eds.), (Human Perspectives in Health Sciences and Technology 1), Cham: Springer International Publishing, 43–65. doi:10.1007/978-3-030-25001-0_4
- Borgman, Christine L., 2015, Big Data, Little Data, No Data , Cambridge, MA: MIT Press.
- Boumans, M.J. and Sabina Leonelli, forthcoming, “From Dirty Data to Tidy Facts: Practices of Clustering in Plant Phenomics and Business Cycles”, in Leonelli and Tempini forthcoming.
- Boyd, Danah and Kate Crawford, 2012, “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon”, Information, Communication & Society , 15(5): 662–679. doi:10.1080/1369118X.2012.678878
- Boyd, Nora Mills, 2018, “Evidence Enriched”, Philosophy of Science , 85(3): 403–421. doi:10.1086/697747
- Bowker, Geoffrey C., 2006, Memory Practices in the Sciences , Cambridge, MA: The MIT Press.
- Bringsjord, Selmer and Naveen Sundar Govindarajulu, 2018, “Artificial Intelligence”, in The Stanford Encyclopedia of Philosophy (Fall 2018 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/fall2018/entries/artificial-intelligence/ >.
- British Academy & Royal Society, 2017, Data Management and Use: Governance in the 21st Century. A Joint Report of the Royal Society and the British Academy , available online.
- Cai, Li and Yangyong Zhu, 2015, “The Challenges of Data Quality and Data Quality Assessment in the Big Data Era”, Data Science Journal , 14: 2. doi:10.5334/dsj-2015-002
- Callebaut, Werner, 2012, “Scientific Perspectivism: A Philosopher of Science’s Response to the Challenge of Big Data Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 69–80. doi:10.1016/j.shpsc.2011.10.007
- Calude, Cristian S. and Giuseppe Longo, 2017, “The Deluge of Spurious Correlations in Big Data”, Foundations of Science , 22(3): 595–612. doi:10.1007/s10699-016-9489-4
- Canali, Stefano, 2016, “Big Data, Epistemology and Causality: Knowledge in and Knowledge out in EXPOsOMICS”, Big Data & Society , 3(2): 205395171666953. doi:10.1177/2053951716669530
- –––, 2019, “Evaluating Evidential Pluralism in Epidemiology: Mechanistic Evidence in Exposome Research”, History and Philosophy of the Life Sciences , 41(1): art. 4. doi:10.1007/s40656-019-0241-6
- Cartwright, Nancy D., 2013, Evidence: For Policy and Wheresoever Rigor Is a Must , London School of Economics and Political Science (LSE), Order Project Discussion Paper Series [Cartwright 2013 available online ].
- –––, 2019, Nature, the Artful Modeler: Lectures on Laws, Science, How Nature Arranges the World and How We Can Arrange It Better (The Paul Carus Lectures) , Chicago, IL: Open Court.
- Chang, Hasok, 2012, Is Water H2O? Evidence, Realism and Pluralism , (Boston Studies in the Philosophy of Science 293), Dordrecht: Springer Netherlands. doi:10.1007/978-94-007-3932-1
- –––, 2017, “VI—Operational Coherence as the Source of Truth”, Proceedings of the Aristotelian Society , 117(2): 103–122. doi:10.1093/arisoc/aox004
- Chapman, Robert and Alison Wylie, 2016, Evidential Reasoning in Archaeology , London: Bloomsbury Publishing Plc.
- Collins, Harry M., 1990, Artificial Experts: Social Knowledge and Intelligent Machines , Cambridge, MA: MIT Press.
- Craver, Carl F. and Lindley Darden, 2013, In Search of Mechanisms: Discoveries Across the Life Sciences , Chicago: University of Chicago Press.
- Daston, Lorraine, 2017, Science in the Archives: Pasts, Presents, Futures , Chicago: University of Chicago Press.
- De Regt, Henk W., 2017, Understanding Scientific Understanding , Oxford: Oxford University Press. doi:10.1093/oso/9780190652913.001.0001
- D’Ignazio, Catherine and Klein, Lauren F., 2020, Data Feminism , Cambridge, MA: The MIT Press.
- Douglas, Heather E., 2009, Science, Policy and the Value-Free Ideal , Pittsburgh, PA: University of Pittsburgh Press.
- Dreyfus, Hubert L., 1992, What Computers Still Can’t Do: A Critique of Artificial Reason , Cambridge, MA: MIT Press.
- Durán, Juan M. and Nico Formanek, 2018, “Grounds for Trust: Essential Epistemic Opacity and Computational Reliabilism”, Minds and Machines , 28(4): 645–666. doi:10.1007/s11023-018-9481-6
- Edwards, Paul N., 2010, A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming , Cambridge, MA: The MIT Press.
- Elliott, Kevin C., 2012, “Epistemic and methodological iteration in scientific research”. Studies in History and Philosophy of Science , 43: 376–382.
- Elliott, Kevin C., Kendra S. Cheruvelil, Georgina M. Montgomery, and Patricia A. Soranno, 2016, “Conceptions of Good Science in Our Data-Rich World”, BioScience , 66(10): 880–889. doi:10.1093/biosci/biw115
- Feest, Uljana, 2011, “What Exactly Is Stabilized When Phenomena Are Stabilized?”, Synthese , 182(1): 57–71. doi:10.1007/s11229-009-9616-7
- Fleming, Lora, Niccolò Tempini, Harriet Gordon-Brown, Gordon L. Nichols, Christophe Sarran, Paolo Vineis, Giovanni Leonardi, Brian Golding, Andy Haines, Anthony Kessel, Virginia Murray, Michael Depledge, and Sabina Leonelli, 2017, “Big Data in Environment and Human Health”, in Oxford Research Encyclopedia of Environmental Science , Oxford: Oxford University Press. doi:10.1093/acrefore/9780199389414.013.541
- Floridi, Luciano, 2014, The Fourth Revolution: How the Infosphere is Reshaping Human Reality , Oxford: Oxford University Press.
- Floridi, Luciano and Phyllis Illari (eds.), 2014, The Philosophy of Information Quality , (Synthese Library 358), Cham: Springer International Publishing. doi:10.1007/978-3-319-07121-3
- Frigg, Roman and Julian Reiss, 2009, “The Philosophy of Simulation: Hot New Issues or Same Old Stew?”, Synthese , 169(3): 593–613. doi:10.1007/s11229-008-9438-z
- Frigg, Roman and Stephan Hartmann, 2016, “Models in Science”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/win2016/entries/models-science/ >.
- Gooding, David C., 1990, Experiment and the Making of Meaning , Dordrecht & Boston: Kluwer.
- Giere, Ronald, 2006, Scientific Perspectivism , Chicago: University of Chicago Press.
- Griesemer, James R., forthcoming, “A Data Journey through Dataset-Centric Population Biology”, in Leonelli and Tempini forthcoming.
- Hacking, Ian, 1992, “The Self-Vindication of the Laboratory Sciences”, In Science as Practice and Culture , Andrew Pickering (ed.), Chicago, IL: The University of Chicago Press, 29–64.
- Harris, Todd, 2003, “Data Models and the Acquisition and Manipulation of Data”, Philosophy of Science , 70(5): 1508–1517. doi:10.1086/377426
- Hey, Tony, Stewart Tansley, and Kristin Tolle, 2009, The Fourth Paradigm: Data-Intensive Scientific Discovery , Redmond, WA: Microsoft Research.
- Humphreys, Paul, 2004, Extending Ourselves: Computational Science, Empiricism, and Scientific Method , Oxford: Oxford University Press. doi:10.1093/0195158709.001.0001
- –––, 2009, “The Philosophical Novelty of Computer Simulation Methods”, Synthese , 169(3): 615–626. doi:10.1007/s11229-008-9435-2
- Karaca, Koray, 2018, “Lessons from the Large Hadron Collider for Model-Based Experimentation: The Concept of a Model of Data Acquisition and the Scope of the Hierarchy of Models”, Synthese , 195(12): 5431–5452. doi:10.1007/s11229-017-1453-5
- Kelly, Thomas, 2016, “Evidence”, in The Stanford Encyclopedia of Philosophy (Winter 2016 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/win2016/entries/evidence/ >.
- Kitchin, Rob, 2013, The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences , Los Angeles: Sage.
- –––, 2014, “Big Data, new epistemologies and paradigm shifts”, Big Data and Society , 1(1) April-June. doi: 10.1177/2053951714528481
- Kitchin, Rob and Gavin McArdle, 2016, “What Makes Big Data, Big Data? Exploring the Ontological Characteristics of 26 Datasets”, Big Data & Society , 3(1): 205395171663113. doi:10.1177/2053951716631130
- Krohs, Ulrich, 2012, “Convenience Experimentation”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 52–57. doi:10.1016/j.shpsc.2011.10.005
- Lagoze, Carl, 2014, “Big Data, data integrity, and the fracturing of the control zone,” Big Data and Society , 1(2) July-December. doi: 10.1177/2053951714558281
- Leonelli, Sabina, 2014, “What Difference Does Quantity Make? On the Epistemology of Big Data in Biology”, Big Data & Society , 1(1): 205395171453439. doi:10.1177/2053951714534395
- –––, 2016, Data-Centric Biology: A Philosophical Study , Chicago: University of Chicago Press.
- –––, 2017, “Global Data Quality Assessment and the Situated Nature of ‘Best’ Research Practices in Biology”, Data Science Journal , 16: 32. doi:10.5334/dsj-2017-032
- –––, 2018, “The Time of Data: Timescales of Data Use in the Life Sciences”, Philosophy of Science , 85(5): 741–754. doi:10.1086/699699
- –––, 2019a, La Recherche Scientifique à l’Ère des Big Data: Cinq Façons Dont les Données Massives Nuisent à la Science, et Comment la Sauver , Milano: Éditions Mimésis.
- –––, 2019b, “What Distinguishes Data from Models?”, European Journal for Philosophy of Science , 9(2): 22. doi:10.1007/s13194-018-0246-0
- Leonelli, Sabina and Niccolò Tempini, 2018, “Where Health and Environment Meet: The Use of Invariant Parameters in Big Data Analysis”, Synthese , special issue on the Philosophy of Epidemiology , Sean Valles and Jonathan Kaplan (eds.). doi:10.1007/s11229-018-1844-2
- –––, forthcoming, Data Journeys in the Sciences , Cham: Springer International Publishing.
- Loettgers, Andrea, 2009, “Synthetic Biology and the Emergence of a Dual Meaning of Noise”, Biological Theory , 4(4): 340–356. doi:10.1162/BIOT_a_00009
- Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton, NJ: Princeton University Press.
- Lowrie, Ian, 2017, “Algorithmic Rationality: Epistemology and Efficiency in the Data Sciences”, Big Data & Society , 4(1): 1–13. doi:10.1177/2053951717700925
- MacLeod, Miles and Nancy J. Nersessian, 2013, “Building Simulations from the Ground Up: Modeling and Theory in Systems Biology”, Philosophy of Science , 80(4): 533–556. doi:10.1086/673209
- Massimi, Michela, 2011, “From Data to Phenomena: A Kantian Stance”, Synthese , 182(1): 101–116. doi:10.1007/s11229-009-9611-z
- –––, 2012, “Scientific perspectivism and its foes”, Philosophica , 84: 25–52.
- –––, 2016, “Three Tales of Scientific Success”, Philosophy of Science , 83(5): 757–767. doi:10.1086/687861
- Mayer-Schönberger, Victor and Kenneth Cukier, 2013, Big Data: A Revolution that Will Transform How We Live, Work, and Think , New York: Eamon Dolan/Houghton Mifflin Harcourt.
- Mayo, Deborah G., 1996, Error and the Growth of Experimental Knowledge , Chicago: University of Chicago Press.
- Mayo, Deborah G. and Aris Spanos (eds.), 2009a, Error and Inference , Cambridge: Cambridge University Press.
- Mayo, Deborah G. and Aris Spanos, 2009b, “Introduction and Background”, in Mayo and Spanos (eds.) 2009a, pp. 1–27.
- McAllister, James W., 1997, “Phenomena and Patterns in Data Sets”, Erkenntnis , 47(2): 217–228. doi:10.1023/A:1005387021520
- –––, 2007, “Model Selection and the Multiplicity of Patterns in Empirical Data”, Philosophy of Science , 74(5): 884–894. doi:10.1086/525630
- –––, 2011, “What Do Patterns in Empirical Data Tell Us about the Structure of the World?”, Synthese , 182(1): 73–87. doi:10.1007/s11229-009-9613-x
- McQuillan, Dan, 2018, “Data Science as Machinic Neoplatonism”, Philosophy & Technology , 31(2): 253–272. doi:10.1007/s13347-017-0273-3
- Mitchell, Sandra D., 2003, Biological Complexity and Integrative Pluralism , Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802683
- Morgan, Mary S., 2005, “Experiments versus Models: New Phenomena, Inference and Surprise”, Journal of Economic Methodology , 12(2): 317–329. doi:10.1080/13501780500086313
- –––, forthcoming, “The Datum in Context”, in Leonelli and Tempini forthcoming.
- Morrison, Margaret, 2015, Reconstructing Reality: Models, Mathematics, and Simulations , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199380275.001.0001
- Müller-Wille, Staffan and Isabelle Charmantier, 2012, “Natural History and Information Overload: The Case of Linnaeus”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 4–15. doi:10.1016/j.shpsc.2011.10.021
- Napoletani, Domenico, Marco Panza, and Daniele C. Struppa, 2011, “Agnostic Science. Towards a Philosophy of Data Analysis”, Foundations of Science , 16(1): 1–20. doi:10.1007/s10699-010-9186-7
- –––, 2014, “Is Big Data Enough? A Reflection on the Changing Role of Mathematics in Applications”, Notices of the American Mathematical Society , 61(5): 485–490. doi:10.1090/noti1102
- Nickles, Thomas, forthcoming, “Alien Reasoning: Is a Major Change in Scientific Research Underway?”, Topoi , first online: 20 March 2018. doi:10.1007/s11245-018-9557-1
- Norton, John D., 2003, “A Material Theory of Induction”, Philosophy of Science , 70(4): 647–670. doi:10.1086/378858
- O’Malley, Maureen A., Kevin C. Elliott, Chris Haufe, and Richard Burian, 2009, “Philosophies of funding”, Cell , 138: 611–615. doi: 10.1016/j.cell.2009.08.008
- O’Malley, Maureen A. and Orkun S. Soyer, 2012, “The Roles of Integration in Molecular Systems Biology”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 43(1): 58–68. doi:10.1016/j.shpsc.2011.10.006
- O’Neil, Cathy, 2016, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy , New York: Crown.
- Parker, Wendy S., 2009, “Does Matter Really Matter? Computer Simulations, Experiments, and Materiality”, Synthese , 169(3): 483–496. doi:10.1007/s11229-008-9434-3
- –––, 2017, “Computer Simulation, Measurement, and Data Assimilation”, The British Journal for the Philosophy of Science , 68(1): 273–304. doi:10.1093/bjps/axv037
- Pasquale, Frank, 2015, The Black Box Society: The Secret Algorithms That Control Money and Information , Cambridge, MA: Harvard University Press.
- Pietsch, Wolfgang, 2015, “Aspects of Theory-Ladenness in Data-Intensive Science”, Philosophy of Science , 82(5): 905–916. doi:10.1086/683328
- –––, 2016, “The Causal Nature of Modeling with Big Data”, Philosophy & Technology , 29(2): 137–171. doi:10.1007/s13347-015-0202-2
- –––, 2017, “Causation, probability and all that: Data science as a novel inductive paradigm”, in Frontiers in Data Science , Matthias Dehmer and Frank Emmert-Streib (eds.), Boca Raton, FL: CRC, 329–353.
- Porter, Theodore M., 1995, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life , Princeton, NJ: Princeton University Press.
- Porter, Theodore M. and Soraya de Chadarevian, 2018, “Introduction: Scrutinizing the Data World”, Historical Studies in the Natural Sciences , 48(5): 549–556. doi:10.1525/hsns.2018.48.5.549
- Prainsack, Barbara and Buyx, Alena, 2017, Solidarity in Biomedicine and Beyond , Cambridge, UK: Cambridge University Press.
- Radder, Hans, 2009, “The Philosophy of Scientific Experimentation: A Review”, Automated Experimentation , 1(1): 2. doi:10.1186/1759-4499-1-2
- Ratti, Emanuele, 2015, “Big Data Biology: Between Eliminative Inferences and Exploratory Experiments”, Philosophy of Science , 82(2): 198–218. doi:10.1086/680332
- Reichenbach, Hans, 1938, Experience and Prediction: An Analysis of the Foundations and the Structure of Knowledge , Chicago, IL: The University of Chicago Press.
- Reiss, Julian, 2015, “A Pragmatist Theory of Evidence”, Philosophy of Science , 82(3): 341–362. doi:10.1086/681643
- Reiss, Julian, 2015, Causation, Evidence, and Inference , New York: Routledge.
- Rescher, Nicholas, 1984, The Limits of Science , Berkely, CA: University of California Press.
- Rheinberger, Hans-Jörg, 2011, “Infra-Experimentality: From Traces to Data, from Data to Patterning Facts”, History of Science , 49(3): 337–348. doi:10.1177/007327531104900306
- Romeijn, Jan-Willem, 2017, “Philosophy of Statistics”, in The Stanford Encyclopedia of Philosophy (Spring 2017), Edward N. Zalta (ed.), URL: https://plato.stanford.edu/archives/spr2017/entries/statistics/ .
- Sepkoski, David, 2013, “Toward ‘a natural history of data’: Evolving practices and epistemologies of data in paleontology, 1800–2000”, Journal of the History of Biology , 46: 401–444.
- Shavit, Ayelet and James Griesemer, 2009, “There and Back Again, or the Problem of Locality in Biodiversity Surveys*”, Philosophy of Science , 76(3): 273–294. doi:10.1086/649805
- Srnicek, Nick, 2017, Platform capitalism , Cambridge, UK and Malden, MA: Polity Press.
- Sterner, Beckett, 2014, “The Practical Value of Biological Information for Research”, Philosophy of Science , 81(2): 175–194. doi:10.1086/675679
- Sterner, Beckett and Nico M. Franz, 2017, “Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data”, Biological Theory , 12(2): 99–111. doi:10.1007/s13752-017-0259-5
- Sterner, Beckett W., Nico M. Franz, and J. Witteveen, 2020, “Coordinating dissent as an alternative to consensus classification: insights from systematics for bio-ontologies”, History and Philosophy of the Life Sciences , 42(1): 8. doi: 10.1007/s40656-020-0300-z
- Stevens, Hallam, 2016, “Hadooping the Genome: The Impact of Big Data Tools on Biology”, BioSocieties , 11: 352–371.
- Strasser, Bruno, 2019, Collecting Experiments: Making Big Data Biology , Chicago: University of Chicago Press.
- Suppes, Patrick, 1962, “Models of data”, in Logic, Methodology and Philosophy of Science , Ernest Nagel, Patrick Suppes, & Alfred Tarski (eds.), Stanford: Stanford University Press, 252–261.
- Symons, John and Ramón Alvarado, 2016, “Can We Trust Big Data? Applying Philosophy of Science to Software”, Big Data & Society , 3(2): 1-17. doi:10.1177/2053951716664747
- Symons, John and Jack Horner, 2014, “Software Intensive Science”, Philosophy & Technology , 27(3): 461–477. doi:10.1007/s13347-014-0163-x
- Tempini, Niccolò, 2017, “Till Data Do Us Part: Understanding Data-Based Value Creation in Data-Intensive Infrastructures”, Information and Organization , 27(4): 191–210. doi:10.1016/j.infoandorg.2017.08.001
- Tempini, Niccolò and Sabina Leonelli, 2018, “Concealment and Discovery: The Role of Information Security in Biomedical Data Re-Use”, Social Studies of Science , 48(5): 663–690. doi:10.1177/0306312718804875
- Toulmin, Stephen, 1958, The Uses of Arguments , Cambridge: Cambridge University Press.
- Turner, Raymond and Nicola Angius, 2019, “The Philosophy of Computer Science”, in The Stanford Encyclopedia of Philosophy (Spring 2019 edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2019/entries/computer-science/ >.
- Van Fraassen, Bas C., 2008, Scientific Representation: Paradoxes of Perspective , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199278220.001.0001
- Waters, C. Kenneth, 2007, “The Nature and Context of Exploratory Experimentation: An Introduction to Three Case Studies of Exploratory Research”, History and Philosophy of the Life Sciences , 29(3): 275–284.
- Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, et al., 2016, “The FAIR Guiding Principles for Scientific Data Management and Stewardship”, Scientific Data , 3(1): 160018. doi:10.1038/sdata.2016.18
- Williamson, Jon, 2004 “A dynamic interaction between machine learning and the philosophy of science”, Minds and Machines , 14(4): 539–54. doi:10.1093/bjps/axx012
- Wimsatt, William C., 2007, Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality , Cambridge, MA: Harvard University Press.
- Winsberg, Eric, 2010, Science in the Age of Computer Simulation , Chicago: University of Chicago Press.
- Woodward, James, 2000, “Data, phenomena and reliability”, Philosophy of Science , 67(supplement): Proceedings of the 1998 Biennial Meetings of the Philosophy of Science Association. Part II: Symposia Papers (Sep., 2000), pp. S163–S179. https://www.jstor.org/stable/188666
- –––, 2010, “Data, Phenomena, Signal, and Noise”, Philosophy of Science , 77(5): 792–803. doi:10.1086/656554
- Wright, Jessey, 2017, “The Analysis of Data and the Evidential Scope of Neuroimaging Results”, The British Journal for the Philosophy of Science , 69(4): 1179–1203. doi:10.1093/bjps/axx012
- Wylie, Alison, 2017, “How Archaeological Evidence Bites Back: Strategies for Putting Old Data to Work in New Ways”, Science, Technology, & Human Values , 42(2): 203–225. doi:10.1177/0162243916671200
- –––, forthcoming, “Radiocarbon Dating in Archaeology: Triangulation and Traceability”, in Leonelli and Tempini forthcoming.
- Zuboff, Shoshana, 2017, The Age of Surveillance Capitalism: The Fight for the Future at the New Frontier of Power , New York: Public Affairs.
How to cite this entry . Preview the PDF version of this entry at the Friends of the SEP Society . Look up topics and thinkers related to this entry at the Internet Philosophy Ontology Project (InPhO). Enhanced bibliography for this entry at PhilPapers , with links to its database.
[Please contact the author with suggestions.]
artificial intelligence | Bacon, Francis | biology: experiment in | computer science, philosophy of | empiricism: logical | evidence | human genome project | models in science | Popper, Karl | science: theory and observation in | scientific explanation | scientific method | scientific theories: structure of | statistics, philosophy of
Acknowledgments
The research underpinning this entry was funded by the European Research Council (grant award 335925) and the Alan Turing Institute (EPSRC Grant EP/N510129/1).
Copyright © 2020 by Sabina Leonelli < s . leonelli @ exeter . ac . uk >
- Accessibility
Support SEP
Mirror sites.
View this site from another server:
- Info about mirror sites
The Stanford Encyclopedia of Philosophy is copyright © 2023 by The Metaphysics Research Lab , Department of Philosophy, Stanford University
Library of Congress Catalog Data: ISSN 1095-5054
Big data in education: a state of the art, limitations, and future research directions
Maria Ijaz Baig, Liyana Shuib, and Elaheh Yadegaridehkordi
International Journal of Educational Technology in Higher Education, volume 17, Article number 44 (2020)
Big data is an essential aspect of innovation which has recently gained major attention from both academics and practitioners. Given the importance of the education sector, the current tendency is to examine the role of big data in this sector. Many studies have been conducted to comprehend the application of big data in different fields for various purposes, but a comprehensive review of big data in education is still lacking. Thus, this study aims to conduct a systematic review of big data in education in order to explore the trends, classify the research themes, highlight the limitations, and suggest possible future directions in the domain. Following a systematic review procedure, 40 primary studies published from 2014 to 2019 were selected and the related information extracted. The findings show an increase in the number of studies addressing big data in education during the last 2 years. The current studies cover four main research themes: learners' behavior and performance, modelling and educational data warehousing, improvement of the educational system, and integration of big data into the curriculum. Most big data educational research has focused on learners' behavior and performance. Moreover, this study highlights the research limitations and portrays future directions. It provides a guideline for future studies and highlights new insights and directions for the successful utilization of big data in education.
Introduction
The world is changing rapidly due to the emergence of innovational technologies (Chae, 2019 ). Currently, a large number of technological devices are used by individuals (Shorfuzzaman, Hossain, Nazir, Muhammad, & Alamri, 2019 ), and at every moment an enormous amount of data is produced through these devices (ur Rehman et al., 2019 ). To cater for this massive data, new technologies and applications are being developed for data analysis and storage (Kalaian, Kasim, & Kasim, 2019 ). Big data has thus become a matter of interest for researchers (Anshari, Alas, & Yunus, 2019 ), who are trying to define and characterize it in different ways (Mikalef, Pappas, Krogstie, & Giannakos, 2018 ).
According to Yassine, Singh, Hossain, and Muhammad ( 2019 ), big data is a large volume of data. De Mauro, Greco, and Grimaldi ( 2016 ) referred to it as an informational asset characterized by high quantity, speed, and diversity, while Shahat ( 2019 ) described big data as large data sets that are difficult to process, control, or examine in a traditional way. Big data is generally characterized by 3 Vs: Volume, Variety, and Velocity (Xu & Duan, 2019 ). Volume refers to the large amount, or increasing scale, of data; the size of big data can be measured in terabytes and petabytes (Herschel & Miori, 2017 ), and high-capacity storage systems are required to cater for this volume. Variety refers to the type or heterogeneity of data: data can be in a structured format (databases) or an unstructured format (images, video, emails), and big data analytical tools are helpful in handling unstructured data. Velocity refers to the speed at which big data can be accessed; the data is virtually present in a real-time environment (e.g., Internet logs) (Sivarajah, Kamal, Irani, & Weerakkody, 2017 ).
Currently, the concept of 3 Vs has been expanded into several Vs. For instance, Demchenko, Grosso, De Laat, and Membrey ( 2013 ) classified big data into 5 Vs: Volume, Velocity, Variety, Veracity, and Value. Similarly, Saggi and Jain ( 2018 ) characterized big data by 7 Vs, namely Volume, Velocity, Variety, Valence, Veracity, Variability, and Value.
Big data demand is significantly increasing in different fields of endeavour such as insurance and construction (Dresner Advisory Services, 2017 ), healthcare (Wang, Kung, & Byrd, 2018 ), telecommunication (Ahmed et al., 2018 ), and e-commerce (Wu & Lin, 2018 ). According to Dresner Advisory Services ( 2017 ), technology (14%), financial services (10%), consulting (9%), healthcare (9%), education (8%) and telecommunication (7%) are the most active sectors in producing a vast amount of data.
The educational sector is no exception. In the educational realm, a large volume of data is produced through online courses and teaching and learning activities (Oi, Yamada, Okubo, Shimada, & Ogata, 2017 ). With the advent of big data, teachers can now access students' academic performance and learning patterns and provide instant feedback (Black & Wiliam, 2018 ). Timely and constructive feedback motivates and satisfies students, which has a positive impact on their performance (Zheng & Bender, 2019 ). Academic data can help teachers analyze their teaching pedagogy and effect changes according to students' needs and requirements. Many online educational sites have been designed, and multiple courses based on individual student preferences have been introduced (Holland, 2019 ). Improvement in the educational sector depends upon the acquisition of data and technology, and large-scale administrative data can play a tremendous role in managing various educational problems (Sorensen, 2018 ). Therefore, it is essential for professionals to understand the effectiveness of big data in education in order to minimize educational issues.
So far, several review studies have been conducted in the big data realm. Mikalef et al. ( 2018 ) conducted a systematic literature review focusing on big data analytics capabilities in firms. Mohammadpoor and Torabi ( 2018 ), in their review study on big data, observed the emerging trends of big data in the oil and gas industry. Another systematic literature review was conducted by Neilson, Daniel, and Tjandra ( 2019 ) on big data in the transportation system. Kamilaris, Kartakoullis, and Prenafeta-Boldú ( 2017 ) reviewed the use of big data in agriculture, and Wolfert, Ge, Verdouw, and Bogaardt ( 2017 ) reviewed its use in smart farming. Moreover, Camargo Fiorini, Seles, Jabbour, Mariano, and Sousa Jabbour ( 2018 ) conducted a review study on big data and management theory. Even though many fields have been covered in previous review studies, a comprehensive review of big data in the education sector is still lacking. Thus, this study aims to conduct a systematic review of big data in education in order to identify the primary studies, their trends and themes, as well as limitations and possible future directions. This research can play a significant role in the advancement of big data in the educational domain, and the identified limitations and future directions will help new researchers bring advancement to this particular realm.
The research questions of this study are stated below:
What are the trends in the papers published on big data in education?
What research themes have been addressed in the big data in education domain?
What are the limitations and possible future directions?
The remainder of this study is organized as follows: Section 2 explains the review methodology and presents the SLR results; Section 3 reports the findings for the research questions; and finally, Section 4 presents the discussion, conclusion, and research implications.
Review methodology
In order to achieve the aforementioned objective, this study employs a systematic literature review method. An effective review is based on an analysis of the literature that identifies the limitations and research gaps in a particular area. A systematic review can be defined as a process of identifying, assessing, and interpreting the available research relevant to particular research questions in a given area of research. The essential purpose of conducting a systematic review is to explore and conceptualize the extant studies, identify themes, relations, and gaps, and describe future directions accordingly. These purposes match the aim of this study. This research applies the Kitchenham and Charters ( 2007 ) strategies. A systematic review comprises three phases: planning the review, conducting the review, and reporting the review. Each phase has specific activities: 1) develop the review protocol, 2) formulate inclusion and exclusion criteria, 3) describe the search strategy process, 4) define the selection process, 5) perform the quality evaluation procedure, and 6) extract and synthesize the data. Each activity is described in the following sections.
Review protocol
The review protocol provides the foundation and mechanism for undertaking a systematic literature review, and its essential purpose is to minimize research bias. The protocol comprises the background, research questions, search strategy, selection process, quality assessment, and data extraction and synthesis. It helps to maintain the consistency of the review and allows easy updating at a later stage when new findings are incorporated. This is the most significant aspect that distinguishes an SLR from other literature reviews.
Inclusion and exclusion criteria
The aim of defining the inclusion and exclusion criteria is to ensure that only highly relevant research is included in this study. This study considers articles published in journals, workshops, conferences, and symposia; articles consisting of introductions, tutorials, posters, and summaries were eliminated. Complete, full-length, relevant studies published in English between January 2014 and March 2019 were considered. The searched words had to be present in the title, abstract, or keywords section.
Table 1 shows a summary of the inclusion and exclusion criteria.
Search strategy process
The search strategy comprised two stages, namely S1 (automatic search) and S2 (manual search). Initially, an automatic search (S1) was applied to identify the primary studies of big data in education. The following databases and search engines were explored: Science Direct, SAGE Journals, Emerald Insight, Springer Link, IEEE Xplore, ACM Digital Library, Taylor and Francis, and AIS e-Library. These databases were considered because they contain the highest-impact journals and germane conference proceedings, workshops, and symposia. According to Kitchenham and Charters ( 2007 ), electronic databases provide a broad perspective on a subject rather than a limited set of specific journals and conferences. In order to find the relevant articles, keywords on big data and education were searched. General words correlated with education were also explored (education OR academic OR university OR learning OR curriculum OR higher education OR school), and this search string was paired with "big data". The second stage was the manual search stage (S2), in which a manual search was performed on the references of all initially retrieved studies; Kitchenham ( 2004 ) suggested that a manual search should be applied to the primary study references. EndNote was used to manage, sort, and remove the duplicate studies.
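As a concrete illustration of this search and deduplication step, the Python sketch below builds the boolean search string quoted above and removes replica records by normalized title. It is a minimal sketch, not the authors' actual EndNote workflow: the record structure and the title-matching rule are assumptions.

```python
# Illustrative sketch of the search string construction and duplicate
# removal described above; the record fields and the title-based
# duplicate key are assumptions.

education_terms = ["education", "academic", "university", "learning",
                   "curriculum", "higher education", "school"]

# Pair the education-related terms with "big data", as in the review.
search_string = ('"big data" AND ('
                 + " OR ".join(f'"{t}"' for t in education_terms) + ")")
print(search_string)

def remove_duplicates(records):
    """Drop replica studies by normalized title, analogous to EndNote."""
    seen, unique = set(), []
    for record in records:
        key = " ".join(record["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

print(remove_duplicates([{"title": "Big Data in Education"},
                         {"title": "big data in education"}]))
```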
Selection process
The selection process is used to identify the studies that are relevant to the research questions of this review. The selection process of this study is presented in Fig. 1 . By applying the string of keywords, a total of 559 studies were found through the automatic search. Of these, 348 were duplicates and were removed using the EndNote library. The inclusion and exclusion criteria were applied to the remaining 211 studies; according to Kitchenham and Charters ( 2007 ), irrelevant studies should be excluded from the review. At this phase, 147 studies were excluded because full-length articles were not available to download, leaving 64 full-length articles, which were downloaded. To ensure the comprehensiveness of the initial search results, the snowball technique was used: in the second stage, a manual search (S2) was performed on the references of all the relevant papers through Google Scholar (Fig. 1 ), and 1 additional study was found. The quality assessment criteria were then applied to the 65 studies, and 25 were excluded as they did not fulfil the criteria. Therefore, a total of 40 highly relevant primary studies were included in this research. The selection of studies from different databases and sources before and after retrieval is shown in Table 2 . The majority of candidate studies were found in Science Direct (90), SAGE Journals (50), Emerald Insight (81), Springer Link (38), IEEE Xplore (158), ACM Digital Library (73), Taylor and Francis (17), and AIS e-Library (52). Google Scholar was employed only for the second round of manual search.
Selection Process
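The screening funnel just described can be cross-checked from the reported counts. The following sketch is purely illustrative: the stage labels are ours, while the numbers are taken directly from the text.

```python
# Screening funnel reconstructed from the counts reported above; the
# stage labels are ours, the numbers come directly from the text.
funnel = [
    ("Retrieved by automatic search (S1)",  559),
    ("After duplicate removal",             559 - 348),  # 211 screened
    ("After inclusion/exclusion screening", 211 - 147),  # 64 downloadable
    ("After manual snowball search (S2)",   64 + 1),     # 65 assessed
    ("After quality assessment",            65 - 25),    # 40 primary studies
]
for stage, count in funnel:
    print(f"{stage}: {count}")
```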
Quality assessment
According to Kitchenham and Charters ( 2007 ), quality assessment plays a significant role in checking the quality of primary research. The subtleties of the assessment depend on the quality of the instruments, which can be based on a checklist of components or a set of questions whose purpose is to analyze the quality of every study. For this study, four quality assessment standards were created to evaluate the quality of each research study, given as:
QA1. Does the topic address in the study related to big data in education?
QA2. Does the study describe the context?
QA3. Is the research method given in the paper?
QA4. Is the data collection method described in the article?
The four quality assessment standards were applied to the 65 selected studies to determine the integrity of each research study. The scores were categorized into low, medium, and high, with the quality of each study depending on its total score. Each quality assessment item is worth two points: if the study fully meets the standard, a score of 2 is awarded; in the case of partial fulfilment, a score of 1 is given; and if the standard is not met, a score of 0 is awarded. A total score below 4 is counted as 'low', exactly 4 as 'medium', and above 4 as 'high'. The details of the studies are presented in Table 11 in Appendix B . The 25 studies that did not meet the quality assessment standard were excluded. Therefore, based on the quality assessment standard, a total of 40 primary studies were included in this systematic literature review (Table 10 in Appendix A ). The scores of the studies (in terms of low, medium, and high) are presented in Fig. 2 .
Scores of studies
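The scoring rubric above is simple enough to state as code. The sketch below is a minimal implementation of that rubric, assuming one 0/1/2 score per criterion; the function and variable names are ours.

```python
# Minimal implementation of the scoring rubric described above: four
# criteria (QA1-QA4), each scored 0 (not met), 1 (partially met), or
# 2 (fully met); totals below 4 are 'low', exactly 4 'medium', and
# above 4 'high'.
def quality_band(scores):
    assert len(scores) == 4 and all(s in (0, 1, 2) for s in scores), \
        "expected one 0/1/2 score per quality criterion"
    total = sum(scores)
    if total < 4:
        return "low"
    elif total == 4:
        return "medium"
    return "high"

print(quality_band([2, 1, 0, 0]))  # low    (total 3)
print(quality_band([2, 2, 0, 0]))  # medium (total 4)
print(quality_band([2, 2, 1, 2]))  # high   (total 7)
```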
Data extraction and synthesis
The data extraction and synthesis process was carried out by reading the 65 primary studies. The studies were studied thoroughly, and the required details were extracted accordingly. The objective of this stage is to obtain the needed facts and figures from the primary studies. The data were collected under the aspects of research ID, author names, title of the research, publication year and venue, research themes, research context, research method, and data collection method; data were extracted from the 65 studies using these aspects. The narration of each item is given in Table 3 , and the data extracted from all primary studies are tabulated. The process of data synthesis is presented in the next section.
Figure 3 presents the allocation of studies based on their publication sources. All publications were from high-impact journals, high-level conferences, and workshops. The primary studies comprise 21 journal articles, 17 conference papers, 1 workshop paper, and 1 symposium paper. Fourteen studies were from Science Direct journals and conferences, 5 from the SAGE group, and 1 from Springer Link, whereas 6 studies were from IEEE conferences and 2 from an IEEE symposium and workshop. Moreover, 1 primary study was from an AISeL conference, 4 studies were from Emerald Insight journals, 5 from ACM conferences, and 2 from Taylor and Francis. The summary of published sources is given in Table 4 .
Allocation of studies based on publication
Temporal view of research
The selection period of this study is January 2014 to March 2019. The yearly allocation of primary studies is presented in Fig. 4 . The trend of big data in education started in 2014 and gradually gained popularity: in 2015, 8 studies were published in this domain, and the number of studies rose in 2017, the year with the highest number of publications in the big data in education realm (12 studies). The trend continued in 2018, when 11 studies belonging to big data in education were published, and it has continued into 2019, of which this paper covers the period up to March; 4 studies were published by March 2019.
Temporal view of Papers
Google Scholar was used to find the total citation count for the studies; the number of citations is shown in Fig. 5 . It has been observed that 28 studies were cited by other sources between 1 and 50 times, 11 studies were not cited by any other source, and 1 study was cited 127 times. The top cited studies and their titles are presented in Table 5 for general verification; the data provided here are not for comparison among the studies.
Research methodologies
The research methods employed by the primary studies are shown in Fig. 6 . The majority are review-based studies, conducted in different educational and big data contexts; reviews covered 28% of the primary studies. The second most used research method was quantitative, covering 23% of the primary studies. Only 3% of the studies were based on a mixed-method approach, and the design science method likewise covered 3%. Furthermore, 20% of the studies used qualitative research methods, whereas the research method was not given in the remaining 25% of the articles.
Distribution of Research Methods of Primary Studies
Data collection methods
The data collection methods used by the primary studies are shown in Fig. 7 . The primary studies employed different data collection methods, though the majority used extant literature. Five studies conducted surveys, covering 13% of the primary studies. Four studies carried out experiments for data collection (10%), 6 studies conducted interviews (15%), and 4 studies used data logs (10%). Two studies collected data through observations, 1 study used social network data, and 3 studies used website data; the observational, social network data, and website-based studies covered 5%, 3%, and 8% of the primary studies, respectively. Moreover, 11 studies used extant literature (28%) and 1 study extracted data from a focus group discussion (3%). The data collection method was not available for the remaining 3 studies.
Distribution of Data Collection Methods of Primary Studies
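These shares can be verified from the reported counts. The sketch below tallies the methods over the 40 primary studies; the method labels are ours, and the text's whole-percentage figures follow from rounding.

```python
# Cross-check of the data collection shares reported above: counts per
# method over the 40 primary studies, printed with one decimal place
# (the text rounds to whole percentages).
counts = {
    "survey": 5, "experiment": 4, "interview": 6, "data logs": 4,
    "observation": 2, "social network data": 1, "website data": 3,
    "extant literature": 11, "focus group": 1, "not reported": 3,
}
total = sum(counts.values())  # 40 primary studies
for method, n in counts.items():
    print(f"{method}: {n} studies ({n / total * 100:.1f}%)")
```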
What research themes have been addressed in educational studies of big data?
A theme refers to an idea, topic, or area covered by different research studies. The central idea reflects the theme, which can be helpful in developing real insight and analysis; a theme can consist of a single word or a combination of words (Rimmon-Kenan, 1995 ). This study classified big data research themes into four groups (Table 6 ). Figure 8 shows a mind map of big data in education research themes, sub-themes, and methodologies.
Mind Map of big data in education research themes, sub-themes, and the methodologies
Figure 9 presents the research themes under big data in education, namely learners' behavior and performance; modelling and educational data warehousing; improvement of the educational system; and integration of big data into the curriculum.
Research Themes
The first research theme is based on learners' behavior and performance. This theme covers 21 studies, 53% of the overall primary studies (Fig. 9 ). The studies in this theme address teaching and learning analytics, big data frameworks, user behaviour and attitude, learners' strategies, adaptive learning, and satisfaction. A total of 8 studies rely on teaching and learning analytics (Table 7 ), 3 studies deal with big data frameworks, 6 concentrate on user behaviour and attitude, and 2 dwell on learning strategies; adaptive learning and satisfaction are covered by 1 study each. In this theme, 2 studies conducted surveys, 4 carried out experiments, and 1 employed the observational method; 5 studies reported on extant literature. In addition, 4 studies used event log data and 5 conducted interviews (Fig. 10 ).
Number of Studies and Data Collection Methods
The studies in the second theme focused on modeling and educational data warehouses. The 6 studies in this theme cover 15% of the primary studies. They investigated the cloud environment, big data modeling, cluster analysis, and data warehouses for educational purposes (Table 8 ). Three studies introduced big data modeling in education and highlighted the potential for organizing data from multiple sources; 1 study analyzed a data warehouse with big data tools (Hadoop); 1 study analyzed the accessibility of huge academic data in a cloud computing environment; and 1 study used clustering techniques and a data warehouse for educational purposes. In this theme, 4 studies reported on extant literature, 1 conducted a survey, and 1 used social network data.
The third theme concentrated on the improvement of the educational system. The 9 studies in this theme cover 23% of the primary studies. They address statistical tools and measurements, educational research implications, big data training, the introduction of a ranking system, usage of websites, and big data educational challenges and effectiveness (Table 9 ). Two studies considered statistical tools and measurements; educational research implications, the ranking system, usage of websites, and big data training were covered by 1 study each; and 3 studies considered big data effectiveness and challenges. In this theme, 1 study conducted a survey for data collection, 2 used website traffic data, and 1 employed the observational method, while 3 studies reported on extant literature.
The fourth theme concentrated on incorporating big data approaches into the curriculum. The 4 studies in this theme cover 10% of the primary studies and considered the introduction of big data topics into different courses. Of these, 1 study conducted interviews, 1 employed the survey method, and 1 used a focus group discussion.
Twenty percent of the studies (Fig. 6 ) used qualitative research methods (Dinter et al., 2017 ; Veletsianos et al., 2016 ; Yang & Du, 2016 ). Qualitative methods are mostly applicable for observing a single variable and its relationship with other variables, but they do not quantify relationships; in qualitative research, understanding is attained through 'wording' (Chaurasia & Frieda Rosin, 2017 ). Behavior, attitude, satisfaction, and overall learning performance are human phenomena (Cantabella et al., 2019 ; Elia et al., 2018 ; Sedkaoui & Khelfaoui, 2019 ), and qualitative research is not statistically tested (Chaurasia & Frieda Rosin, 2017 ). Big data educational studies that employed qualitative methods lack some of the certainty present in quantitative research methods. Therefore, future research might quantify educational big data applications and their impact on higher education.
Six studies conducted interviews for data collection (Chaurasia et al., 2018 ; Chaurasia & Frieda Rosin, 2017 ; Nelson & Pouchard, 2017 ; Troisi et al., 2018 ; Veletsianos et al., 2016 ), 2 studies used the observational method (Maldonado-Mahauad et al., 2018 ; Sooriamurthi, 2018 ), and 1 study conducted a focus group discussion (Buffum et al., 2014 ) (Fig. 10 ). The observational studies were conducted in uncontrolled environments, and the results of such studies sometimes suffer from self-selection bias. There is also a chance of ambiguity in data collection where human language and observation are involved. The findings of interviews, observations, and focus group discussions are limited and cannot be extended to a wider population of learners (Dinter et al., 2017 ).
Four big data educational studies analyzed event log data and conducted interviews (Cantabella et al., 2019 ; Hirashima et al., 2017 ; Liang et al., 2016 ; Yang & Du, 2016 ). However, longitudinal data are more appropriate for multidimensional measurements and for analyzing large data sets in the future (Sorensen, 2018 ).
Eight studies considered teaching and learning analytics (Chaurasia et al., 2018 ; Chaurasia & Frieda Rosin, 2017 ; Dessì et al., 2019 ; Roy & Singh, 2017 ). There is limited research covering the aspects of learning environments, ethical and cultural values, and government support in the adoption of educational big data (Yang & Du, 2016 ). In the future, comparisons of big data across different learning environments, ethical and cultural values, government support, and training in adopting big data in higher education could be covered in leading journals and conferences.
Three studies relate to big data frameworks for education (Cantabella et al., 2019 ; Muthukrishnan & Yasin, 2018 ). However, the existing frameworks do not cover organizational and institutional cultures and lack robust theoretical grounds (Dubey & Gunasekaran, 2015 ; Muthukrishnan & Yasin, 2018 ). In the future, a big data educational framework that concentrates on theories and the adoption of big data technology is recommended, as are the extension of existing models and the interpretation of data models. This will support better decisions and ensure predictive analysis in the academic realm. Moreover, further relations can be tested by integrating other constructs such as university size and type (Chaurasia et al., 2018 ).
Three studies dwelled on big data modeling (Pardos, 2017 ; Petrova-Antonova et al., 2017 ; Wassan, 2015 ), but these models do not integrate with present systems (Santoso & Yulia, 2017 ). Therefore, efficient research solutions that can manage educational data and support new forms of data interchange and resource sharing are required in the future. One study explored a cloud-based solution for managing academic big data (Logica & Magdalena, 2015 ); however, this solution is expensive. In the future, an LMS supported by open-source applications and software could be used instead. This development would help universities obtain the benefits of a unified LMS and introduce new trends and economic opportunities for the academic industry. A data warehouse with big data tools was investigated by one study (Santoso & Yulia, 2017 ); nevertheless, a manifold node cluster could be implemented to process and access structured and unstructured data in the future (Ramos et al., 2015 ). In addition, new techniques based on relational and non-relational databases and the development of index catalogs are recommended to improve the overall retrieval system. Furthermore, the applicability of the latest analytical tools and parallel programming models needs to be tested on academic big data; MapReduce, MongoDB, Pig, Cassandra, Yarn, and Mahout are suggested for exploring and analyzing educational big data (Wassan, 2015 ). These tools would improve the analysis process and help in the development of reliable models for academic analytics.
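To give a flavour of the map/reduce pattern behind several of these tools, the sketch below performs a minimal, library-free map/reduce-style aggregation over hypothetical learner event logs. The log format and event names are invented for illustration; this is not any of the cited systems.

```python
from collections import defaultdict

# Minimal map/reduce-style aggregation over hypothetical learner event
# logs, illustrating the pattern behind tools such as Hadoop MapReduce;
# the log format and event names are invented.
logs = [
    {"student": "s01", "event": "video_play"},
    {"student": "s01", "event": "quiz_submit"},
    {"student": "s02", "event": "video_play"},
    {"student": "s01", "event": "video_play"},
]

# Map step: emit ((student, event), 1) pairs.
mapped = (((rec["student"], rec["event"]), 1) for rec in logs)

# Reduce step: sum the emitted counts per key.
reduced = defaultdict(int)
for key, value in mapped:
    reduced[key] += value

for (student, event), count in sorted(reduced.items()):
    print(student, event, count)
```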
One study detected ICT factors through data mining techniques and tools in order to enhance educational effectiveness and improve the educational system (Martínez-Abad et al., 2018 ). Additionally, two studies employed big data analytic tools on popular websites to examine academic users' interests (Martínez-Abad et al., 2018 ; Qiu et al., 2015 ). In future research, more targeted strategies and regions could be selected for organizing the academic data, and in-depth data mining techniques could be applied according to the nature of the data. The foreseen research could validate the findings by applying them to other educational websites, and the present research could be extended by analyzing socioeconomic backgrounds and the use of other websites (Qiu et al., 2015 ).
Two research studies were conducted on measurements and the selection of statistical software for educational big data (Ozgur et al., 2015 ; Selwyn, 2014 ). However, no statistical software is fit for every academic project; therefore, 'all-in-one' statistical software is recommended for big data in future research in order to fulfil the needs of all academic projects. Four research studies were based on incorporating big data into academic curricula (Buffum et al., 2014 ; Sledgianowski et al., 2017 ). However, in order to integrate big data into the curriculum, significant changes are required. Firstly, in future research, curricula need to be redeveloped or restructured according to the level and learning environment (Nelson & Pouchard, 2017 ). Secondly, the training factor, learning objectives, and outcomes should be well designed in future studies. Lastly, comparable exercises, learning activities, and assessment plans need to be well structured before integrating big data into curricula (Dinter et al., 2017 ).
Discussion and conclusion
Big data has become an essential part of the educational realm. This study presented a systematic review of the literature on big data in the educational sector. Three research questions were formulated to present the trends and themes of big data educational studies and to identify limitations and directions for further research. The primary studies were collected by performing a systematic search through the IEEE Xplore, Science Direct, Emerald Insight, AIS Electronic Library, SAGE, ACM Digital Library, Springer Link, Taylor and Francis, and Google Scholar databases. Finally, 40 studies that met the research protocols were selected; these studies were published between January 2014 and March 2019. Through the findings of this study, it can be concluded that 53% of the extant studies were conducted on the learners' behavior and performance theme. Moreover, 15% of the studies were on modeling and educational data warehousing, and 23% were on improvement of the educational system. Only 10% of the studies were on the integration of big data into the curriculum theme.
Thus, a large number of studies were conducted on the learners' behavior and performance theme, while the other themes gained less attention. Therefore, more research is expected in the future on the modeling and educational data warehousing, improvement of the educational system, and integration of big data into the curriculum themes.
It was found that 20% of the studies used qualitative research methods: 6 studies conducted interviews, 2 used the observational method, and 1 conducted a focus group discussion for data collection. The findings of interviews, observations, and focus group discussions are limited and cannot be extended to a wider population of learners; therefore, prospective research might quantify educational big data applications and their impact on higher education. Longitudinal data are more appropriate for multidimensional measurements and future analysis of large data sets. Eight studies were carried out on teaching and learning analytics. In the future, comparisons of big data across different learning environments, ethical and cultural values, government support, and training to adopt big data in higher education can be covered in leading journals and conferences.
Three studies related to big data frameworks for education. In the future, a big data educational framework grounded in theory, along with extensions of existing models, is recommended. Three studies concentrated on big data modeling; these models cannot integrate with present systems, so efficient research solutions that can manage educational data and support new forms of data interchange and resource sharing are required in future studies. Two studies explored a cloud-based solution for managing academic big data and investigated data warehouses with big data tools. Nevertheless, in the future, a manifold node cluster could be implemented for processing and accessing structured and unstructured data, and the applicability of the latest analytical tools and parallel programming models needs to be tested on academic big data.
One study considered the detection of ICT factors through data mining techniques, and 2 studies employed big data analytic tools on popular websites to examine academic users' interests; more targeted strategies and regions can be selected for organizing the academic data in the future. Four research studies featured incorporating big data into academic curricula. However, big data based curricula need to be redeveloped with the learning objectives in mind, and well-designed learning activities for big data curricula are suggested for the future.
Research implications
This study has two-fold implications for stakeholders and researchers. Firstly, this review explored the trends in studies published in the big data in education realm. The identified trends uncover the allocation of studies, publication sources, temporal view, and most cited papers, and highlight the research methods used in these studies. The described trends can provide opportunities and new ideas to researchers to predict the accurate direction of future studies.
Secondly, this research explored the themes, sub-themes, and methodologies in the big data in education domain. The classified themes, sub-themes, and methodologies present a comprehensive overview of the existing literature on big data in education. They can help researchers identify new research gaps, avoid repeating themes in future studies, and focus on combinations of different themes in order to uncover new insights on how big data can improve the learning and teaching process. In addition, the illustrated methodologies can be useful for researchers in selecting a method according to the nature of a future study.
The identified research can be an implication for stakeholders towards the holistic expansion of educational competencies. The identified themes give new insight to universities in planning mixed learning programs that combine conventional learning with web-based learning. This permits students to accomplish focused learning outcomes, engaging in exercises at an ideal pace. It can help teachers apprehend ways to gauge students' learning behaviour and attitude simultaneously and advance their teaching strategies accordingly. Understanding the latest trends in big data and education is of growing importance for ministries of education, as they can develop flexible policies to support institutions in improving the educational system.
Lastly, the identified limitations and possible future directions can provide guidelines for researchers about what has been explored and what needs to be explored in the future. In addition, stakeholders can extract ideas to educate future cohorts and comprehend their learning and academic requirements.
Availability of data and materials
Not applicable.
Ahmed, E., Yaqoob, I., Hashem, I. A. T., Shuja, J., Imran, M., Guizani, N., & Bakhsh, S. T. (2018). Recent advances and challenges in mobile big data. IEEE Communications Magazine , 56 (2), 102–108. https://doi.org/10.1109/MCOM.2018.1700294 .
Anshari, M., Alas, Y., & Yunus, N. (2019). A survey study of smartphones behavior in Brunei: A proposal of Modelling big data strategies. In Multigenerational Online Behavior and Media Use: Concepts, Methodologies, Tools, and Applications , (pp. 201–214). IGI global.
Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in Education: Principles, Policy & Practice , 25 (6), 551–575. https://doi.org/10.1080/0969594X.2018.1441807 .
Buffum, P. S., Martinez-Arocho, A. G., Frankosky, M. H., Rodriguez, F. J., Wiebe, E. N., & Boyer, K. E. (2014, March). CS principles goes to middle school: Learning how to teach big data. In Proceedings of the 45th ACM Technical Symposium on Computer Science Education , (pp. 151–156). New York: ACM. https://doi.org/10.1145/2538862.2538949 .
Camargo Fiorini, P., Seles, B. M. R. P., Jabbour, C. J. C., Mariano, E. B., & Sousa Jabbour, A. B. L. (2018). Management theory and big data literature: From a review to a research agenda. International Journal of Information Management , 43 , 112–129. https://doi.org/10.1016/j.ijinfomgt.2018.07.005 .
Cantabella, M., Martínez-España, R., Ayuso, B., Yáñez, J. A., & Muñoz, A. (2019). Analysis of student behavior in learning management systems through a big data framework. Future Generation Computer Systems , 90 (2), 262–272. https://doi.org/10.1016/j.future.2018.08.003 .
Chae, B. K. (2019). A general framework for studying the evolution of the digital innovation ecosystem: The case of big data. International Journal of Information Management , 45 , 83–94. https://doi.org/10.1016/j.ijinfomgt.2018.10.023 .
Chaurasia, S. S., & Frieda Rosin, A. (2017). From big data to big impact: Analytics for teaching and learning in higher education. Industrial and Commercial Training , 49 (7), 321–328. https://doi.org/10.1108/ict-10-2016-0069 .
Chaurasia, S. S., Kodwani, D., Lachhwani, H., & Ketkar, M. A. (2018). Big data academic and learning analytics. International Journal of Educational Management , 32 (6), 1099–1117. https://doi.org/10.1108/ijem-08-2017-0199 .
Coccoli, M., Maresca, P., & Stanganelli, L. (2017). The role of big data and cognitive computing in the learning process. Journal of Visual Languages & Computing , 38 , 97–103. https://doi.org/10.1016/j.jvlc.2016.03.002 .
De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of big data based on its essential features. Library Review , 65 (3), 122–135. https://doi.org/10.1108/LR-06-2015-0061 .
Demchenko, Y., Grosso, P., De Laat, C., & Membrey, P. (2013). Addressing big data issues in scientific data infrastructure. In Collaboration Technologies and Systems (CTS), 2013 International Conference on , (pp. 48–55). San Diego: IEEE. https://doi.org/10.1109/CTS.2013.6567203 .
Dessì, D., Fenu, G., Marras, M., & Reforgiato Recupero, D. (2019). Bridging learning analytics and cognitive computing for big data classification in micro-learning video collections. Computers in Human Behavior , 92 (1), 468–477. https://doi.org/10.1016/j.chb.2018.03.004 .
Dinter, B., Jaekel, T., Kollwitz, C., & Wache, H. (2017). Teaching Big Data Management – An Active Learning Approach for Higher Education . Paper presented at the proceedings of the pre-ICIS 2017 SIGDSA, (pp. 1–17). North America: AISeL.
Dresner Advisory Services. (2017). Big data adoption: State of the market. ZoomData. Retrieved from https://www.zoomdata.com/master-class/state-market/big-data-adoption
Dubey, R., & Gunasekaran, A. (2015). Education and training for successful career in big data and business analytics. Industrial and Commercial Training , 47 (4), 174–181. https://doi.org/10.1108/ict-08-2014-0059 .
Elia, G., Solazzo, G., Lorenzo, G., & Passiante, G. (2018). Assessing learners’ satisfaction in collaborative online courses through a big data approach. Computers in Human Behavior , 92 , 589–599. https://doi.org/10.1016/j.chb.2018.04.033 .
Gupta, D., & Rani, R. (2018). A study of big data evolution and research challenges. Journal of Information Science. , 45 (3), 322–340. https://doi.org/10.1177/0165551518789880 .
Herschel, R., & Miori, V. M. (2017). Ethics & big data. Technology in Society , 49 , 31–36. https://doi.org/10.1016/j.techsoc.2017.03.003 .
Hirashima, T., Supianto, A. A., & Hayashi, Y. (2017, September). Model-based approach for educational big data analysis of learners thinking with process data. In 2017 International Workshop on Big Data and Information Security (IWBIS) (pp. 11-16). San Diego: IEEE. https://doi.org/10.1177/0165551518789880
Holland, A. A. (2019). Effective principles of informal online learning design: A theory-building metasynthesis of qualitative research. Computers & Education , 128 , 214–226. https://doi.org/10.1016/j.compedu.2018.09.026 .
Kalaian, S. A., Kasim, R. M., & Kasim, N. R. (2019). Descriptive and predictive analytical methods for big data. In Web Services: Concepts, Methodologies, Tools, and Applications , (pp. 314–331). USA: IGI global. https://doi.org/10.4018/978-1-5225-7501-6.ch018 .
Kamilaris, A., Kartakoullis, A., & Prenafeta-Boldú, F. X. (2017). A review on the practice of big data analysis in agriculture. Computers and Electronics in Agriculture , 143 , 23–37. https://doi.org/10.1016/j.compag.2017.09.037 .
Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University , 33 (2004), 1–26.
Kitchenham, B., & Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering version 2.3. Engineering , 45 (4), 13–65.
Lia, Y., & Zhaia, X. (2018). Review and prospect of modern education using big data. Procedia Computer Science , 129 (3), 341–347. https://doi.org/10.1016/j.procs.2018.03.085 .
Liang, J., Yang, J., Wu, Y., Li, C., & Zheng, L. (2016). Big Data Application in Education: Dropout Prediction in Edx MOOCs. In Paper presented at the 2016 IEEE second international conference on multimedia big data (BigMM) , (pp. 440–443). USA: IEEE. https://doi.org/10.1109/BigMM.2016.70 .
Logica, B., & Magdalena, R. (2015). Using big data in the academic environment. Procedia Economics and Finance , 33 (2), 277–286. https://doi.org/10.1016/s2212-5671(15)01712-8 .
Maldonado-Mahauad, J., Pérez-Sanagustín, M., Kizilcec, R. F., Morales, N., & Munoz-Gama, J. (2018). Mining theory-based patterns from big data: Identifying self-regulated learning strategies in massive open online courses. Computers in Human Behavior , 80 (1), 179196. https://doi.org/10.1016/j.chb.2017.11.011 .
Martínez-Abad, F., Gamazo, A., & Rodríguez-Conde, M. J. (2018). Big Data in Education. In Paper presented at the proceedings of the sixth international conference on technological ecosystems for enhancing Multiculturality - TEEM'18, Salamanca, Spain , (pp. 145–150). New York: ACM. https://doi.org/10.1145/3284179.3284206 .
Mikalef, P., Pappas, I. O., Krogstie, J., & Giannakos, M. (2018). Big data analytics capabilities: A systematic literature review and research agenda. Information Systems and e-Business Management , 16 (3), 547–578. https://doi.org/10.1007/10257-017-0362-y .
Mohammadpoor, M., & Torabi, F. (2018). Big Data analytics in oil and gas industry: An emerging trend. Petroleum. In press. https://doi.org/10.1016/j.petlm.2018.11.001 .
Muthukrishnan, S. M., & Yasin, N. B. M. (2018). Big Data Framework for Students’ Academic. Paper presented at the symposium on computer applications & industrial electronics (ISCAIE), Penang, Malaysia (pp. 376–382). USA: IEEE. https://doi.org/10.1109/ISCAIE.2018.8405502
Neilson, A., Daniel, B., & Tjandra, S. (2019). Systematic review of the literature on big data in the transportation Domain: Concepts and Applications. Big Data Research . In press. https://doi.org/10.1016/j.bdr.2019.03.001 .
Nelson, M., & Pouchard, L. (2017). A pilot “big data” education modular curriculum for engineering graduate education: Development and implementation. In Paper presented at the Frontiers in education conference (FIE), Indianapolis, USA , (pp. 1–5). USA: IEEE. https://doi.org/10.1109/FIE.2017.8190688 .
Nie, M., Yang, L., Sun, J., Su, H., Xia, H., Lian, D., & Yan, K. (2018). Advanced forecasting of career choices for college students based on campus big data. Frontiers of Computer Science , 12 (3), 494–503. https://doi.org/10.1007/s11704-017-6498-6 .
Oi, M., Yamada, M., Okubo, F., Shimada, A., & Ogata, H. (2017). Reproducibility of findings from educational big data. In Paper presented at the proceedings of the Seventh International Learning Analytics & Knowledge Conference , (pp. 536–537). New York: ACM. https://doi.org/10.1145/3027385.3029445 .
Ong, V. K. (2015). Big Data and Its Research Implications for Higher Education: Cases from UK Higher Education Institutions. In Paper presented at the 2015 IIAI 4th international confress on advanced applied informatics , (pp. 487–491). USA: IEEE. https://doi.org/10.1109/IIAI-AAI.2015.178 .
Ozgur, C., Kleckner, M., & Li, Y. (2015). Selection of statistical software for solving big data problems. SAGE Open , 5 (2), 59–94. https://doi.org/10.1177/2158244015584379 .
Pardos, Z. A. (2017). Big data in education and the models that love them. Current Opinion in Behavioral Sciences , 18 (2), 107–113. https://doi.org/10.1016/j.cobeha.2017.11.006 .
Petrova-Antonova, D., Georgieva, O., & Ilieva, S. (2017, June). Modelling of educational data following big data value chain. In Proceedings of the 18th International Conference on Computer Systems and Technologies (pp. 88–95). New York City: ACM. https://doi.org/10.1145/3134302.3134335
Qiu, R. G., Huang, Z., & Patel, I. C. (2015, June). A big data approach to assessing the US higher education service. In 2015 12th International Conference on Service Systems and Service Management (ICSSSM) (pp. 1–6). New York: IEEE. https://doi.org/10.1109/ICSSSM.2015.7170149
Ramos, T. G., Machado, J. C. F., & Cordeiro, B. P. V. (2015). Primary education evaluation in Brazil using big data and cluster analysis. Procedia Computer Science , 55 (1), 10311039. https://doi.org/10.1016/j.procs.2015.07.061 .
Rimmon-Kenan, S. (1995). What Is Theme and How Do We Get at It?. Thematics: New Approaches, 9–20.
Roy, S., & Singh, S. N. (2017). Emerging trends in applications of big data in educational data mining and learning analytics. In 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence , (pp. 193–198). New York: IEEE. https://doi.org/10.1109/confluence.2017.7943148 .
Saggi, M. K., & Jain, S. (2018). A survey towards an integration of big data analytics to big insights for value-creation. Information Processing & Management , 54 (5), 758–790. https://doi.org/10.1016/j.ipm.2018.01.010 .
Santoso, L. W., & Yulia (2017). Data warehouse with big data Technology for Higher Education. Procedia Computer Science , 124 (1), 93–99. https://doi.org/10.1016/j.procs.2017.12.134 .
Sedkaoui, S., & Khelfaoui, M. (2019). Understand, develop and enhance the learning process with big data. Information Discovery and Delivery , 47 (1), 2–16. https://doi.org/10.1108/idd-09-2018-0043 .
Selwyn, N. (2014). Data entry: Towards the critical study of digital data and education. Learning, Media and Technology , 40 (1), 64–82. https://doi.org/10.1080/17439884.2014.921628 .
Shahat, O. A. (2019). A novel big data analytics framework for smart cities. Future Generation Computer Systems , 91 (1), 620–633. https://doi.org/10.1016/j.future.2018.06.046 .
Shorfuzzaman, M., Hossain, M. S., Nazir, A., Muhammad, G., & Alamri, A. (2019). Harnessing the power of big data analytics in the cloud to support learning analytics in mobile learning environment. Computers in Human Behavior , 92 (1), 578–588. https://doi.org/10.1016/j.chb.2018.07.002 .
Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of big data challenges and analytical methods. Journal of Business Research , 70 , 263–286. https://doi.org/10.1016/j.jbusres.2016.08.001 .
Sledgianowski, D., Gomaa, M., & Tan, C. (2017). Toward integration of big data, technology and information systems competencies into the accounting curriculum. Journal of Accounting Education , 38 (1), 81–93. https://doi.org/10.1016/j.jaccedu.2016.12.008 .
Sooriamurthi, R. (2018). Introducing big data analytics in high school and college. In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education (pp. 373–374). New York: ACM. https://doi.org/10.1145/3197091.3205834
Sorensen, L. C. (2018). "Big data" in educational administration: An application for predicting school dropout risk. Educational Administration Quarterly , 45 (1), 1–93. https://doi.org/10.1177/0013161x18799439 .
Article MathSciNet Google Scholar
Su, Y. S., Ding, T. J., Lue, J. H., Lai, C. F., & Su, C. N. (2017). Applying big data analysis technique to students’ learning behavior and learning resource recommendation in a MOOCs course. In 2017 International conference on applied system innovation (ICASI) (pp. 1229–1230). New York: IEEE. https://doi.org/10.1109/ICASI.2017.7988114
Troisi, O., Grimaldi, M., Loia, F., & Maione, G. (2018). Big data and sentiment analysis to highlight decision behaviours: A case study for student population. Behaviour & Information Technology , 37 (11), 1111–1128. https://doi.org/10.1080/0144929x.2018.1502355 .
Ur Rehman, M. H., Yaqoob, I., Salah, K., Imran, M., Jayaraman, P. P., & Perera, C. (2019). The role of big data analytics in industrial internet of things. Future Generation Computer Systems , 92 , 578–588. https://doi.org/10.1016/j.future.2019.04.020 .
Veletsianos, G., Reich, J., & Pasquini, L. A. (2016). The Life Between Big Data Log Events. AERA Open , 2 (3), 1–45. https://doi.org/10.1177/2332858416657002 .
Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change , 126 , 3–13. https://doi.org/10.1016/j.techfore.2015.12.019 .
Wassan, J. T. (2015). Discovering big data modelling for educational world. Procedia - Social and Behavioral Sciences , 176 , 642–649. https://doi.org/10.1016/j.sbspro.2015.01.522 .
Wolfert, S., Ge, L., Verdouw, C., & Bogaardt, M. J. (2017). Big data in smart farming–a review. Agricultural Systems , 153 , 69–80. https://doi.org/10.1016/j.agsy.2017.01.023 .
Wu, P. J., & Lin, K. C. (2018). Unstructured big data analytics for retrieving e-commerce logistics knowledge. Telematics and Informatics , 35 (1), 237–244. https://doi.org/10.1016/j.tele.2017.11.004 .
Xu, L. D., & Duan, L. (2019). Big data for cyber physical systems in industry 4.0: A survey. Enterprise Information Systems , 13 (2), 148–169. https://doi.org/10.1080/17517575.2018.1442934 .
Yang, F., & Du, Y. R. (2016). Storytelling in the age of big data. Asia Pacific Media Educator , 26 (2), 148–162. https://doi.org/10.1177/1326365x16673168 .
Yassine, A., Singh, S., Hossain, M. S., & Muhammad, G. (2019). IoT big data analytics for smart homes with fog and cloud computing. Future Generation Computer Systems , 91 (2), 563–573. https://doi.org/10.1016/j.future.2018.08.040 .
Zhang, M. (2015). Internet use that reproduces educational inequalities: Evidence from big data. Computers & Education , 86 (1), 212–223. https://doi.org/10.1016/j.compedu.2015.08.007 .
Zheng, M., & Bender, D. (2019). Evaluating outcomes of computer-based classroom testing: Student acceptance and impact on learning and exam performance. Medical Teacher , 41 (1), 75–82. https://doi.org/10.1080/0142159X.2018.1441984 .
Acknowledgements
Not applicable
Author information
Authors and affiliations
Department of Information Systems, Faculty of Computer Science & Information Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia
Maria Ijaz Baig, Liyana Shuib & Elaheh Yadegaridehkordi
Contributions
Maria Ijaz Baig composed the manuscript under the guidance of Elaheh Yadegaridehkordi. Liyana Shuib supervised the project. All authors discussed the results and contributed to the final manuscript.
Corresponding author
Correspondence to Liyana Shuib .
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article
Baig, M.I., Shuib, L. & Yadegaridehkordi, E. Big data in education: a state of the art, limitations, and future research directions. Int J Educ Technol High Educ 17 , 44 (2020). https://doi.org/10.1186/s41239-020-00223-0
Received : 09 March 2020
Accepted : 10 June 2020
Published : 02 November 2020
DOI : https://doi.org/10.1186/s41239-020-00223-0
Keywords
- Data science applications in education
- Learning communities
- Teaching/learning strategies
Research on the impact of industrial big data on the collaborative governance of pollution reduction and carbon reduction
- Original Paper
- Published: 17 October 2024
Xiaofeng Zhang, Yuhui Li, Xiaoli Lv & Dongri Han
Climate change has become a common concern of the international community, transcending national boundaries to pose a global challenge. China's ecological progress has entered a pivotal period in which carbon reduction is the strategic focus for achieving a qualitative improvement in ecological and environmental quality, and using industrial big data as a new carrier to drive the collaborative governance of pollution reduction and carbon reduction has likewise been elevated to the strategic level. This paper uses the coupling coordination degree model to measure the collaborative governance of pollution reduction and carbon reduction. To explore the direct, indirect, and nonlinear effects of industrial big data on this collaborative governance, a random effects model, a mediating effect model, and a threshold regression model were estimated on Chinese provincial data from 2011 to 2021. Industrial big data significantly promote the collaborative governance of pollution reduction and carbon reduction, with an effect coefficient of 0.107; green process innovation plays a significant mediating role in this process; and, reflecting provincial heterogeneity, industrial big data exhibit diversified threshold effects on collaborative governance. Under higher levels of green process innovation and a more reasonable energy consumption structure (the estimated threshold intervals are reported in the paper), industrial big data exert a stronger positive effect. Further analysis finds that industrial big data also promote sustainable economic development. As China enters a new stage of building its big data system, the development of industrial big data should be further improved and promoted so as to strengthen the collaborative governance of pollution reduction and carbon reduction as well as sustainable economic development. The research extends the boundary of "data empowerment" research in the field of management, and its conclusions offer a reference for governments formulating provincially differentiated policies and building a sound digital ecology.
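To make the measurement step concrete: the coupling coordination degree model referred to above has a standard two-subsystem form in this literature. The sketch below illustrates that standard form only; the equal weights, the example subsystem scores, and the function name are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def coupling_coordination(u1, u2, alpha=0.5, beta=0.5):
    """Two-subsystem coupling coordination degree D.

    u1, u2 : composite scores in (0, 1] for the pollution-reduction and
             carbon-reduction subsystems (e.g., entropy-weighted indices).
    alpha, beta : subsystem weights; equal weighting is assumed here.
    """
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    # Coupling degree C in [0, 1]: approaches 1 when the two subsystems
    # develop at similar levels.
    c = 2.0 * np.sqrt(u1 * u2) / (u1 + u2)
    # Comprehensive development index T: weighted overall level of both.
    t = alpha * u1 + beta * u2
    # Coordination degree D combines interaction strength and overall level.
    return np.sqrt(c * t)

# Illustrative values only: a province with balanced, moderate scores.
print(coupling_coordination(0.62, 0.58))  # ~0.77
```

The threshold effects reported above would then be estimated by regressing this D on industrial big data development, allowing the slope to switch once a threshold variable (green process innovation, energy consumption structure) crosses an estimated cut-off.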
Data availability
No datasets were generated or analyzed during the current study.
Acknowledgements
We are very grateful to the editors and the anonymous reviewers for their review of this manuscript.
This work was supported by the National Social Science Foundation Project "Research on Driving Mechanism and Path Selection of Synergistic Efficiency of Pollution Reduction and Carbon Reduction in Energy-rich Areas under Whole-process Governance" (23CGL041).
Author information
Authors and affiliations
Shandong University of Technology, Business School, Zibo, 255000, People’s Republic of China
Xiaofeng Zhang, Yuhui Li, Xiaoli Lv & Dongri Han
Contributions
Xiaofeng Zhang and Dongri Han contributed to conceptualization and supervision; Yuhui Li and Xiaoli Lv were involved in methodology and visualization and provided software; Dongri Han contributed to formal analysis, resources, data curation, and funding acquisition; Xiaofeng Zhang was involved in investigation and writing—original draft preparation; and Xiaoli Lv contributed to writing—review and editing.
Corresponding author
Correspondence to Dongri Han .
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below are the links to the electronic supplementary material.
Supplementary file1 (DOCX 16 KB)
Supplementary file2 (DOCX 22 KB)
Supplementary file3 (TXT 1 KB)
Supplementary file4 (XLSX 61 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Zhang, X., Li, Y., Lv, X. et al. Research on the impact of industrial big data on the collaborative governance of pollution reduction and carbon reduction. Clean Techn Environ Policy (2024). https://doi.org/10.1007/s10098-024-03038-z
Received : 19 February 2024
Accepted : 03 October 2024
Published : 17 October 2024
DOI : https://doi.org/10.1007/s10098-024-03038-z
Keywords
- Industrial big data
- Collaborative governance of pollution reduction and carbon reduction
- Mediating effect
- Threshold regression
Big Data is still gaining attention as a fundamental building block of the Artificial Intelligence and Machine Learning world, and substantial effort has therefore been invested in Big Data research over the last 15 years. The objective of this Systematic Literature Review is to summarize the state of the art of the previous 15 years of Big Data research by providing answers to a set of ...
Today, an enormous amount of data is continuously generated in all walks of life by all kinds of devices and systems. A significant portion of such data is captured, stored, aggregated, and analyzed systematically without losing its "4V" characteristics (volume, velocity, variety, and veracity). We review the major drivers of big data today as well as the recent ...
In this paper, a program and a methodology for bibliometric mining of research trends and directions are presented. The method is applied to the research area of Big Data for the period 2012 to 2022, using the Scopus database. It turns out that the 10 most important research directions in Big Data are Machine learning, Deep learning and neural networks, Internet of things, Data mining, Cloud ...
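To illustrate the kind of counting such a bibliometric-mining program performs, the sketch below ranks candidate research directions by author-keyword frequency. The record structure and sample entries are invented for illustration; this is not the cited program or its Scopus export format.

```python
from collections import Counter

# Hypothetical bibliographic records (e.g., parsed from a Scopus CSV export);
# the titles and keywords below are made up.
records = [
    {"title": "Deep learning on streaming big data",
     "keywords": ["Deep Learning", "Big Data"]},
    {"title": "IoT data pipelines in the cloud",
     "keywords": ["Internet of Things", "Cloud Computing"]},
    {"title": "Mining MOOC logs at scale",
     "keywords": ["Machine Learning", "Data Mining", "Big Data"]},
]

def rank_directions(records, top_n=10):
    """Rank candidate research directions by author-keyword frequency."""
    counts = Counter()
    for rec in records:
        # Lowercase and strip so spelling variants collapse to one key.
        counts.update(kw.strip().lower() for kw in rec["keywords"])
    return counts.most_common(top_n)

print(rank_directions(records))
# -> [('big data', 2), ('deep learning', 1), ('internet of things', 1), ...]
```

Real bibliometric pipelines add keyword thesauri, co-occurrence clustering, and per-year trend slopes on top of this basic counting step.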
An SLR presents a comprehensive review of the state of the art, revealing existing methods, challenges, and potential future research directions for research communities (Brereton et al., 2007). ... This SLR is needed to identify, classify, and compare the existing research reviews on big data analytics in social networks. In order to show that ...
These developments and directions in genetics-based research and big data go far beyond the struggle of a single discipline, namely sociology, with a paradigm shift in empirical research.
Then, we present a classification of some of the most important challenges when handling big data. Based on this classification, we recommend solutions that could address the identified challenges, and in addition we highlight cross-disciplinary research directions that need further investigation in the future.
Current and Future Trends in Big Data-based Smart Mobility. Big data and related technologies, such as Artificial Intelligence (AI), the Internet of Things (IoT), and more, are important drivers of so-called smart mobility. This concept refers to "local and supra-local accessibility, availability of ICTs, …". Submission deadline: 28 February ...
The superiority of big data has led to ample research on big data analytics in the hospitality and tourism context. It is thus important to capture the overall intellectual landscape by reviewing extant relevant literature. ... This study extends previous scientometric works in this research direction from three perspectives. Firstly, this ...
Big data analytics has gained wide attention from both academia and industry as the demand for understanding trends in massive datasets increases. Recent developments in sensor networks, cyber-physical systems, and the ubiquity of the Internet of Things (IoT) have increased the collection of data (including health care, social media, smart cities, agriculture, finance, education, and more) to ...
The study of big data analytics (BDA) methods for data-driven industries is gaining research attention and adoption in today's industrial activities and business intelligence, and it is rapidly changing how industrial revolutions are perceived. The unique characteristics of big data and BDA have created unprecedented research calls to solve challenges in data generation, storage, visualization, and processing ...
Big data analytics for data-driven industry: a review of data sources, tools, challenges, solutions, and research directions
This brief concluding chapter draws on the collective scholarship, useful and cutting-edge insights, and hard work contributed by all the authors of this book addressing the nature and role of big data in psychology. Also, the conclusion of this chapter offers a list of future research directions (immodestly abbreviated FReDs) that summarizes how psychology can be more strategically and ...
Scientific Research and Big Data. First published Fri May 29, 2020. Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse ...
In this work, the authors discuss the characteristics of datasets in libraries and argue against the popular misconception that the data involved in library research are not big enough; they then review the research work on library big data and summarize the applications and research directions in this field.
In this article, we will categorize BIA (business intelligence and analytics) research activities into three broad research directions: (a) big data analytics, (b) text analytics, and (c) network analytics. The article aims to review the state-of-the-art techniques and models and to summarize their use in BIA applications. For each research direction, we will also determine a few ...
Big data is an essential aspect of innovation which has recently gained major attention from both academics and practitioners. Considering the importance of the education sector, the current tendency is moving towards examining the role of big data in this sector. So far, many studies have been conducted to comprehend the application of big data in different fields for various purposes.
BIG DATA DIRECTIONS IN ENTREPRENEURSHIP RESEARCH: RESEARCHER VIEWPOINTS. The accompanying visualization (after Johnson & Shneiderman, 1991) shows the relative size of each technology group as well as the prominence of each subgroup: languages and frameworks were the most common application data technologies, while API tools were the most significant utilities.
In the era of big data, artificial intelligence (AI) technology is widely used across fields. Floral design, a field that blends art and creativity, can also benefit from the development of AI. This work studies and discusses the application of AI in floral design in a big data environment, exploring application methods and technical means ...
Addressing this gap in evidence, we were able to use the power of big data to determine RSV vaccine effectiveness, information needed to inform vaccine policy," said study co-author Shaun Grannis ...