How Data Mining is Used by Nasdaq, DHL, Cerner, PBS, and The Pegasus Group: Case Studies

Companies understand that data mining can provide insights to improve the organization. Yet, many struggle with the right types of data to collect, where to start, or what project may benefit from data mining.

Examining the data mining success of others in a variety of circumstances illuminates how certain methods and software in the market can assist companies. See below how five organizations benefited from data mining in different industries: cybersecurity, finance, health care, logistics, and media.

See more: What is Data Mining? Types & Examples

1. Cerner Corporation

Over 14,000 hospitals, physician’s offices, and other medical facilities use Cerner Corporation’s software solutions.

Cerner’s access allows them to combine patient medical records and medical device data to create an integrated medical database and improve health care.

Using Cloudera’s data mining allows different devices to feed into a common database and predict medical conditions.

“In our first attempts to build this common platform, we immediately ran into roadblocks,” says Ryan Brush, senior director and distinguished engineer at Cerner.

“Our clients are reporting that the new system has actually saved hundreds of lives by being able to predict if a patient is septic more effectively than they could before.”

Industry: Health care

Data mining provider: Cloudera

  • Collect data from unlimited and different sources
  • Enhance operational and financial performance for health care facilities
  • Improve patient diagnosis and save lives

Read the Cerner Corporation and Cloudera, Inc. case study.

2. DHL Temperature Management Solutions

DHL Temperature Management Solutions provides temperature-controlled pharmaceutical logistics to ensure pharmaceutical and biological goods stay within required temperature ranges to retain potency.

Previously, DHL transferred data into spreadsheets that took a week to compile and would only contain a portion of the potential information.

Moving to DOMO’s data mining platform allows for real-time reporting of a broader set of data categories to improve insight.

“We’re able to pinpoint issues that we couldn’t see before. For example, a certain product, on a certain lane, at a certain station is experiencing an issue repeatedly,” says Dina Bunn, global head of central operations and IT for DHL Temperature Management Solutions.

Industry: Logistics

Data mining provider: DOMO

  • Real-time versus week-old logistics information
  • More insight into sources of delays or problems at both a high and a detailed level
  • More customer engagement

Read the DHL and DOMO case study.

See more: Current Trends & Future Scope of Data Mining

3. Nasdaq

The Nasdaq electronic stock exchange integrates Sisense’s data mining capabilities into its IR Insight software to help customers analyze huge data sets.

“Our customers rely on a range of content sets, including information that they license from others, as well as data that they input themselves,” says James Tickner, head of data analytics for Nasdaq Corporate Solutions.

“Being able to layer those together and attain a new level of value from content that they’ve been looking at for years but in another context.”

The combined application provides real-time analysis and clear reports easy for customers to understand and communicate internally.

Industry: Finance

Data mining provider: Sisense

  • Meets rigorous data security regulations
  • Quickly processes huge data sets from a variety of sources
  • Provides clients with new ways to visualize and interpret data to extract new value

Read or watch the Nasdaq and Sisense case study.

4. PBS

The Public Broadcasting Service (PBS) of the U.S. manages an online website serving 353 PBS member stations and their viewers. Its 330 million sessions, 800 million page views, and 17.5 million episode plays generate enormous amounts of data that the PBS team struggled to analyze.

PBS worked with LunaMetrics to perform data mining on the Google Analytics 360 platform to speed up insights into PBS customers.

Dan Haggerty, director of digital analytics for PBS, says “that was the coolest thing about it. A machine took our data without prior assumptions and reaffirmed and strengthened ideas that subject matter experts already suspected about our audiences based on our contextual knowledge.”

Industry: Media

Data mining provider: Google Analytics and LunaMetrics

  • Identified seven key audience segments based on web behaviors
  • Developed in-depth personas per segment through data mining
  • Insights help direct future content and feature development

Read the PBS, LunaMetrics, and Google Analytics case study.

5. The Pegasus Group

Cyber attackers compromised and targeted the data mining system (DMS) of a major network client of The Pegasus Group and launched a distributed denial-of-service (DDoS) attack against 1,500 services.

Under extreme time pressure, The Pegasus Group needed to find a way to use data mining to analyze up to 35GB of data with no prior knowledge of the data contents.

“[I analyzed] the first three million lines and [used RapidMiner’s data mining to perform] a stratified sampling to see which ones [were] benign, which packets [were] really part of the network, and which packets were part of the attack,” says Rodrigo Fuentealba Cartes of The Pegasus Group.

“In just 15 minutes … I used this amazing simulator to see what kinds of parameters I could use to filter packets … and in another two hours, the attack was stopped.”

Industry: Cybersecurity

Data mining provider: RapidMiner

  • Uploaded and analyzed three million lines of data 
  • Recommended analysis models provided answers within 15 minutes
  • Data analysis suggested solutions that stopped the attack within two hours

Watch The Pegasus Group and RapidMiner case study.

See more: Top Data Mining Tools


Data Mining Case Studies & Benefits


  • Key Takeaways

Data mining has improved the decision-making process for over 80% of companies (source: Gartner).

Statista reports that global spending on robotic process automation (RPA) is projected to reach $98 billion by 2024, indicating a significant investment in automation technologies.

According to Grand View Research, the global data mining market will reach $16.9 billion by 2027.

Ethical Data Mining preserves individual rights and fosters trust.

A successful implementation requires defining clear goals, choosing data wisely, and constant adaptation.

Data mining case studies help businesses explore data for smart decision-making. It’s about finding valuable insights from big datasets. This is crucial for businesses in all industries as data guides strategic planning. By spotting patterns in data, businesses gain intelligence to innovate and stay competitive. Real examples show how data mining improves marketing and healthcare. Data mining isn’t just about analyzing data; it’s about using it wisely for meaningful changes.

The Importance of Data Mining for Modern Business

Understanding the Role in Decision Making

Data mining has taken on a central role in the modern world of business. Businesses today generate and collect vast amounts of data, and making informed decisions with this data can be crucial to staying competitive. This article explores the many aspects of data mining and its impact on decisions.

  • Unraveling Data Landscape

Businesses generate a staggering amount of data, including customer interactions, market patterns, and internal operations. Decision-makers face an information overload without effective tools for sorting through all this data.

Data mining is a process that organizes and structures this vast amount of data and extracts patterns and insights from it. It acts as a compass, guiding decision-makers through the complex landscape of data.

  • Empowering Strategic Decision Making

Data mining is a powerful tool for strategic decision making. Businesses can predict future trends and market behavior by analyzing historical data. This insight allows businesses to better align their strategies with predicted shifts.

Data mining can provide the strategic insights required for successful decision making, whether it is launching a product, optimizing supply chain, or adjusting pricing strategies.

  • Customer-Centric Decision Making

Understanding and meeting the needs of customers is paramount in an era where customer-centricity reigns. Data mining is crucial in determining customer preferences, behaviors, and feedback.

This information allows businesses to customize products and services in order to meet the expectations of customers, increase satisfaction and build lasting relationships. With customer-centric insights, decision-makers can make choices that resonate with their target audiences and foster loyalty and brand advocacy.

Data Mining: Applications across industries

Data mining is transforming the way companies operate and make business decisions. This article explores the various applications of data-mining, highlighting case studies that illuminate its impact in the healthcare, retail, and finance sectors.

  • Healthcare Case Studies:

Revolutionizing Patient Care

Data mining is a powerful tool in the healthcare industry. It can improve patient outcomes and treatment plans. Discover compelling case studies in which data mining played a crucial role in predicting patterns of disease, optimizing treatment and improving patient care. These examples, which range from early detection of health risks to personalized medicines, show the impact that data mining has had on the healthcare industry.

  • Retail Success stories:

Retail is at the forefront of leveraging data mining to enhance customer experiences and streamline operations. Discover success stories of how data mining empowered businesses to better understand consumer behavior, optimize their inventory management and create personalized marketing strategies.

These case studies, which range from e-commerce giants to brick-and-mortar shops, show how data mining can boost sales, improve customer satisfaction, and transform the retail landscape.

  • Financial Sector Examples:

Data mining is a valuable tool in the finance industry, where precision and risk assessment are key. Explore case studies that demonstrate how data mining can be used for fraud detection and risk assessment. These examples demonstrate how financial institutions use data mining to make better decisions, protect against fraud, and customize services to their clients’ needs.

  • Data Mining and Education:

Beyond healthcare, retail, and finance, data mining has been used in the education sector to enhance learning. Learn how educational institutions use data mining to optimize learning outcomes, analyze student performance, and personalize materials. These examples, ranging from adaptive learning platforms to predictive analytics, demonstrate the potential for data mining to revolutionize how we approach education.

  • Manufacturing efficiency:

Streamlining Production Processes

Data mining is a powerful tool for streamlining manufacturing processes. Examine case studies that demonstrate how data mining can be used to improve supply chain management, predict maintenance requirements, and increase overall operational efficiency. These examples show how data-driven insights can lead to cost savings, increased productivity, and a competitive advantage in manufacturing.

Data mining is a key component in each of these applications. It unlocks insights, streamlines operations, and shapes the future of decisions. Data mining is transforming the landscapes of many industries, including healthcare, retail, education, finance, and manufacturing.

Data Mining Techniques

Data mining techniques help businesses gain an edge by extracting valuable insights and information from large datasets. This exploration will provide an overview of the most popular data mining methods, and back each one with insightful case studies.

  • Popular Data Mining Techniques

Clustering Analysis

The clustering technique involves grouping data points based on a set of criteria. This method is useful for detecting patterns in data sets and can be used to segment customers, detect anomalies, or recognize structure in the data. The case studies show how clustering can be used to improve marketing strategies, streamline product lines, and increase overall operational efficiency.
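To make the idea concrete, here is a minimal sketch of customer segmentation with k-means clustering using Apache Spark's Scala API. The table name and behavioral columns (customer_id, annual_spend, visit_frequency, avg_basket_size) are hypothetical assumptions, not taken from any of the case studies above.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object CustomerSegmentation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CustomerSegmentation").getOrCreate()

    // Hypothetical customer table: one row per customer with numeric behaviour columns.
    val customers = spark.read.parquet("customers.parquet")

    // Combine the behaviour columns into a single feature vector.
    val assembled = new VectorAssembler()
      .setInputCols(Array("annual_spend", "visit_frequency", "avg_basket_size"))
      .setOutputCol("features")
      .transform(customers)

    // Group customers into five segments.
    val model = new KMeans().setK(5).setSeed(42L).setFeaturesCol("features").fit(assembled)

    // Each customer gets a segment id in the "prediction" column.
    model.transform(assembled).select("customer_id", "prediction").show(10)

    spark.stop()
  }
}
```

The number of clusters (five here) is an illustrative choice; in practice it is tuned, for example by comparing silhouette scores across several values of k.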

Association Rule Mining

Association rule mining reveals relationships between variables within large datasets. Market basket analysis is a common application of association rule mining, which identifies patterns of products that co-occur in transactions. Real-world examples show how association rule mining is used in retail to improve product placement, increase sales, and enhance the customer experience.
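As a hedged illustration of market basket analysis, the sketch below mines frequent itemsets and association rules with Spark ML's FPGrowth over a few hard-coded toy transactions; real retail data would replace the example rows.

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object MarketBasket {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MarketBasket").getOrCreate()
    import spark.implicits._

    // Toy transactions; each row is the set of items bought together.
    val transactions = Seq(
      Seq("bread", "milk"),
      Seq("bread", "diapers", "beer"),
      Seq("milk", "diapers", "beer", "cola"),
      Seq("bread", "milk", "diapers", "beer")
    ).toDF("items")

    val model = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.5)    // itemset must appear in at least half the transactions
      .setMinConfidence(0.6) // keep only reasonably reliable rules
      .fit(transactions)

    model.freqItemsets.show()      // frequent co-occurring products
    model.associationRules.show()  // rules such as {diapers} => {beer}

    spark.stop()
  }
}
```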

Decision Tree Analysis

The decision tree is a visual representation of the process of making decisions. This technique is a powerful tool for classification tasks. It helps businesses make decisions using a set of criteria. Through case studies, you will learn how decision tree analyses have been used in the healthcare industry for disease diagnosis and fraud detection, as well as predictive maintenance in manufacturing.

Regression Analysis

Regression analysis is a way to explore the relationship between variables. This allows businesses to predict and understand how one variable affects another. Discover case studies that demonstrate how regression analysis is used to predict customer behavior, forecast sales trends, and optimize pricing strategies.

Benefits and ROI:

Businesses are increasingly realizing the benefits of data mining in the current dynamic environment. The benefits are numerous and tangible, ranging from improved decision-making to increased operational efficiency. We’ll explore these benefits, and how businesses can leverage data mining to achieve significant gains.

  • Enhancing Decision Making

Data mining provides businesses with actionable insight derived from massive datasets. Analyzing patterns and trends allows organizations to make more informed decisions. This reduces uncertainty and increases the chances of success. There are many case studies that show how data mining has transformed the decision-making process of businesses in various sectors.

  • Operational Efficiency

Data mining is essential to achieving efficiency, which is the cornerstone of any successful business. Organizations can improve their efficiency by optimizing processes, identifying bottlenecks, and streamlining operations. These real-world examples show how businesses have made remarkable improvements in their operations, leading to savings and resource optimization.

  • Personalized Customer Experiences

Data mining has the ability to customize experiences for customers. Businesses can increase customer satisfaction and loyalty by analyzing the behavior and preferences of their customers. Discover case studies that show how data mining has been used to create engaging and personalized customer journeys.

  • Competitive Advantage

Gaining a competitive advantage is essential in today’s highly competitive environment. Data mining gives businesses insights into the market, competitor strategies, and customer expectations. These insights can give organizations a competitive edge and help them achieve success. Look at case studies that show how companies have outperformed their competitors by using data mining.

Calculating ROI and Benefits

To justify investments, businesses must also quantify their return on investment. Calculating ROI for data mining initiatives requires a thorough analysis of the costs, benefits, and long-term impacts. Let’s examine the complexities of ROI within the context of data-mining.

  • Cost-Benefit Analysis

Prior to focusing on ROI, companies must perform a cost-benefit assessment of their data mining projects. It involves comparing the costs associated with implementing data-mining tools, training staff, and maintaining infrastructure to the benefits anticipated, such as higher revenue, cost savings and better decision-making. Case studies from real-world situations provide insight into cost-benefit analysis.

  • Quantifying Tangible and intangible benefits

Data mining initiatives can yield tangible and intangible benefits. Quantifying tangible benefits such as an increase in sales or a reduction in operational costs is easier. Intangible benefits such as improved brand reputation or customer satisfaction are also important, but they may require a nuanced measurement approach. Examine case studies that quantify both types.

  • Long-term Impact Assessment

ROI calculations should not be restricted to immediate gains. Businesses need to assess the impact their data mining projects will have in the future. Consider factors like sustainability, scalability, and ongoing benefits. Case studies that demonstrate the success of data-mining strategies over time can provide valuable insight into long-term impact assessment.

  • Key Performance Indicators for ROI

Businesses must establish KPIs that are aligned with their goals in order to measure ROI. KPIs can be used to evaluate the success of data-mining initiatives, whether it is tracking sales growth, customer satisfaction rates, or operational efficiency. Explore case studies to learn how to select and monitor KPIs strategically for ROI measurement.

Data Mining Ethics

Data mining is a field where ethical considerations are crucial to ensuring transparent and responsible practices. It is important to carefully navigate the ethical landscape as organizations use data to extract valuable insights. This section examines ethical issues in data mining and highlights cases that demonstrate ethical practices.

  • Understanding Ethical Considerations

Data mining ethics revolves around privacy, consent, and responsible information use. Businesses are faced with the question of how they use and collect data. Ethics also includes the biases in data and the fairness of algorithms.

  • Balancing Innovation and Privacy

Finding the right balance between privacy and innovation is a major ethical issue in data mining. In order to gain an edge in the market through data insights and to innovate, organizations must walk a tightrope between innovation and privacy. Case studies will illuminate how companies have successfully balanced innovation and privacy.

  • Transparency and informed consent

Transparency of process is another important aspect of ethical data mining: it ensures that individuals are informed and have given consent before their data is used. This subtopic explores the importance of transparency in data collection and processing, with case studies that highlight instances where organizations have set exemplary standards for obtaining informed consent.

Exploring Data Mining Ethics is crucial as data usage evolves. Businesses must balance innovation, privacy, and transparency while gaining informed consent. Real-world cases show how ethical data mining protects privacy and builds trust.

Implementing Data Mining is complex yet rewarding. This guide helps set goals, choose data sources, and use algorithms effectively. Challenges like data security and resistance to change are common but manageable.

Considering ethics while implementing data mining shows responsibility and opens new opportunities. Organizations prioritizing ethical practices become industry leaders, mitigating risks and achieving positive impacts on business, society, and technology. Ethics and implementation synergize in data mining, unlocking its true potential.

  • Q. What ethical considerations are important in data mining?

Privacy and consent are important ethical considerations for data mining.

  • Q. How can companies avoid common pitfalls when implementing data mining?

By ensuring the security of data, addressing cultural opposition, and encouraging continuous learning and adaptation.

  • Q. Why is transparency important in data mining?

Transparency and consent to use collected data ethically are key elements of building trust.

  • Q. What are the main steps to implement data mining in businesses?

Define your objectives, select data sources, choose algorithms, and monitor continuously.

  • Q. How can successful organizations use data mining to gain a strategic advantage?

By making informed decisions, improving operations, and staying ahead of the competition.


Big data case study: How UPS is using analytics to improve performance


A new initiative at UPS will use real-time data, advanced analytics and artificial intelligence to help employees make better decisions.

As chief information and engineering officer for logistics giant UPS, Juan Perez is placing analytics and insight at the heart of business operations.

"Big data at UPS takes many forms because of all the types of information we collect," he says. "We're excited about the opportunity of using big data to solve practical business problems. We've already had some good experience of using data and analytics and we're very keen to do more."

Perez says UPS is using technology to improve its flexibility, capability, and efficiency, and that the right insight at the right time helps line-of-business managers to improve performance.

The aim for UPS, says Perez, is to use the data it collects to optimise processes, to enable automation and autonomy, and to continue to learn how to improve its global delivery network.

Leading data-fed projects that change the business for the better

Perez says one of his firm's key initiatives, known as Network Planning Tools, will help UPS to optimise its logistics network through the effective use of data. The system will use real-time data, advanced analytics and artificial intelligence to help employees make better decisions. The company expects to begin rolling out the initiative from the first quarter of 2018.

"That will help all our business units to make smart use of our assets and it's just one key project that's being supported in the organisation as part of the smart logistics network," says Perez, who also points to related and continuing developments in Orion (On-road Integrated Optimization and Navigation), which is the firm's fleet management system.

Orion uses telematics and advanced algorithms to create optimal routes for delivery drivers. The IT team is currently working on the third version of the technology, and Perez says this latest update to Orion will provide two key benefits to UPS.

First, the technology will include higher levels of route optimisation which will be sent as navigation advice to delivery drivers. "That will help to boost efficiency," says Perez.

Second, Orion will use big data to optimise delivery routes dynamically.

"Today, Orion creates delivery routes before drivers leave the facility and they stay with that static route throughout the day," he says. "In the future, our system will continually look at the work that's been completed, and that still needs to be completed, and will then dynamically optimise the route as drivers complete their deliveries. That approach will ensure we meet our service commitments and reduce overall delivery miles."

Once Orion is fully operational for more than 55,000 drivers this year, it will lead to a reduction of about 100 million delivery miles -- and 100,000 metric tons of carbon emissions. Perez says these reductions represent a key measure of business efficiency and effectiveness, particularly in terms of sustainability.

Projects such as Orion and Network Planning Tools form part of a collective of initiatives that UPS is using to improve decision making across the package delivery network. The firm, for example, recently launched the third iteration of its chatbot that uses artificial intelligence to help customers find rates and tracking information across a series of platforms, including Facebook and Amazon Echo.

"That project will continue to evolve, as will all our innovations across the smart logistics network," says Perez. "Everything runs well today but we also recognise there are opportunities for continuous improvement."

Overcoming business challenges to make the most of big data

"Big data is all about the business case -- how effective are we as an IT team in defining a good business case, which includes how to improve our service to our customers, what is the return on investment and how will the use of data improve other aspects of the business," says Perez.

These alternative use cases are not always at the forefront of executive thinking. Consultancy McKinsey says too many organisations drill down on a single data set in isolation and fail to consider what different data sets mean for other parts of the business.

However, Perez says the re-use of information can have a significant impact at UPS. Perez talks, for example, about using delivery data to help understand what types of distribution solutions work better in different geographical locations.

"Should we have more access points? Should we introduce lockers? Should we allow drivers to release shipments without signatures? Data, technology, and analytics will improve our ability to answer those questions in individual locations -- and those benefits can come from using the information we collect from our customers in a different way," says Perez.

Perez says this fresh, open approach creates new opportunities for other data-savvy CIOs. "The conversation in the past used to be about buying technology, creating a data repository and discovering information," he says. "Now the conversation is changing and it's exciting. Every time we talk about a new project, the start of the conversation includes data."

By way of an example, Perez says senior individuals across the organisation now talk as a matter of course about the potential use of data in their line-of-business and how that application of insight might be related to other models across the organisation.

These senior executives, he says, also ask about the availability of information and whether the existence of data in other parts of the business will allow the firm to avoid a duplication of effort.

"The conversation about data is now much more active," says Perez. "That higher level of collaboration provides benefits for everyone because the awareness across the organisation means we'll have better repositories, less duplication and much more effective data models for new business cases in the future."


  • Open access
  • Published: 23 April 2020

Predictive analytics using big data for increased customer loyalty: Syriatel Telecom Company case study

  • Wissam Nazeer Wassouf   ORCID: orcid.org/0000-0001-8301-6320 1 ,
  • Ramez Alkhatib 2 ,
  • Kamal Salloum 1 &
  • Shadi Balloul 3  

Journal of Big Data, volume 7, Article number: 29 (2020)


Abstract

Given the growing importance of customer behavior in today’s business market, telecom operators focus not only on customer profitability to increase market share but also on highly loyal customers as well as customers who are likely to churn. The emergence of big data concepts introduced a new wave of Customer Relationship Management (CRM) strategies. Big data analysis helps to describe customers’ behavior, understand their habits, and develop appropriate marketing plans for organizations to identify sales transactions and build long-term loyalty relationships. This paper provides a methodology for telecom companies to target different-value customers with appropriate offers and services. The methodology was implemented and tested using a dataset supplied by Syriatel that contains about 127 million records for training and testing. Firstly, customers were segmented based on the new Time-Frequency-Monetary (TFM) approach, where Time (T) is the total duration of calls and Internet sessions in a certain period, Frequency (F) is how frequently services are used within that period, and Monetary (M) is the money spent during that period, and a level of loyalty was defined for each segment or group. Secondly, the loyalty-level descriptors were taken as categories, and the best behavioral features for customers were chosen together with their demographic information, such as age, gender, and the services they subscribe to. Thirdly, several classification algorithms were applied, based on the descriptors and the chosen features, to build different predictive models that were used to classify new users by loyalty. Finally, those models were evaluated against several criteria and the rules of loyalty prediction were derived. By analyzing these rules, the reasons for loyalty at each level were discovered in order to target each segment with the most appropriate offers and services.

Introduction

The telecom sector is witnessing a massive increase in data, and by analyzing this massive data, telecom operators can manage and retain customers. It is also important for companies to be able to predict the amount of income they may receive from their active customers. For this purpose, they need models able to determine customer loyalty. The cost associated with acquiring a customer is usually higher than the cost associated with retaining one [ 1 ]. Prediction can be directed at customer loyalty to identify both customers whose strong loyalty should be preserved and customers who intend to switch to the competitors. This capability is necessary, especially for modern telecommunications operators. Nowadays companies face more complexity and competition in their business and need to develop innovative activities to capture and improve customer satisfaction and retention [ 2 ]. Growing profitability is the goal of most companies; to reach this goal, companies must analyze customer relationship management (CRM) and provide appropriate marketing strategies [ 3 ]. Some studies provided a new model of transactions based on both services and customer satisfaction and showed that price is not the only measure affecting customer buying decisions; it is also important that both the customer and the company agree on product value and good customer service. Therefore, organizations should not only seek to develop a product to satisfy their customers, but must also follow customer purchasing behavior and offer distinct products for each segment. In other words, segmenting customers based on purchasing behavior is necessary to develop successful marketing strategies, which in turn create and maintain competitive advantage. Current methods of customer value analysis, which are based on past customer behavior patterns or demographic variables, are limited in predicting future customer behavior. So, better patterns were exch

Research objectives

The goals of this research are as follows:

Customer value was analyzed by segmenting customers according to the new TFM approach and then determining the level of loyalty for each segment in a big data environment in telecom.

A set of features was derived from the telecom data.

The best behavioral features for customers, together with their demographic information, were chosen. Based on these features and the level of loyalty for each segment, the following classification algorithms were applied and classification models were built: random forest classifier, decision tree classifier, gradient-boosted tree classifier, and multilayer perceptron classifier (MLPC).

These models were evaluated against several criteria in order to select the most accurate model.

The loyalty rules were derived from this model; these rules show the characteristics of each loyalty level, so the reasons for loyalty were identified in each segment in order to target it appropriately. A further advantage of applying classification algorithms is a model that classifies new users by loyalty.
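A minimal sketch of this classification step with Spark ML in Scala is shown below. The column names ("features", "loyalty"), the hidden-layer size, and the helper compareClassifiers are illustrative assumptions, not the paper's actual code; the 70/30 split is the one described later in the paper.

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassifier, GBTClassifier, MultilayerPerceptronClassifier, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

// Assumed input: DataFrames with a "features" vector column and a numeric "loyalty" label,
// already split 70/30 into train and test sets.
def compareClassifiers(train: DataFrame, test: DataFrame, numFeatures: Int, numClasses: Int): Unit = {
  val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol("loyalty")
    .setPredictionCol("prediction")
    .setMetricName("accuracy")

  def accuracyOf(predictions: DataFrame): Double = evaluator.evaluate(predictions)

  val rf   = new RandomForestClassifier().setLabelCol("loyalty").setFeaturesCol("features")
  val dt   = new DecisionTreeClassifier().setLabelCol("loyalty").setFeaturesCol("features")
  val mlpc = new MultilayerPerceptronClassifier()
    .setLabelCol("loyalty").setFeaturesCol("features")
    .setLayers(Array(numFeatures, 64, numClasses)) // input, one hidden, output layer

  println("Random forest accuracy:         " + accuracyOf(rf.fit(train).transform(test)))
  println("Decision tree accuracy:         " + accuracyOf(dt.fit(train).transform(test)))
  println("Multilayer perceptron accuracy: " + accuracyOf(mlpc.fit(train).transform(test)))

  // GBTClassifier in Spark ML supports binary labels only, so it applies to the
  // binary (loyal / not loyal) experiments reported in the paper.
  if (numClasses == 2) {
    val gbt = new GBTClassifier().setLabelCol("loyalty").setFeaturesCol("features")
    println("Gradient-boosted tree accuracy: " + accuracyOf(gbt.fit(train).transform(test)))
  }
}
```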

Related works

Various efforts have been made to build effective prediction models for retaining customers using different techniques. Oladapo et al. [ 4 ] designed a logistic regression model over customer data to predict customer retention in a telecommunications company with 95.5% accuracy. This model predicts customer retention based on billing, value-added services, and SMS service issues.

Aluri et al. [ 5 ] focused on using machine learning to determine customer value in hospitality sectors, such as restaurants and hotels, by dynamically engaging customers with the loyalty program brand. Their results also show that automated learning processes excel at identifying higher-value customers in specific promotions. They deepened the practical and theoretical understanding of machine learning in the customer loyalty value chain, in a framework that uses a dynamic model for customer engagement.

Wiaya and Gersang [ 6 ] predicted customer loyalty at the National Multimedia Company of Indonesia, using three data mining algorithms to form a customer loyalty classification model: C4.5, Naive Bayes, and Nearest Neighbor. These algorithms were applied to a dataset of 2269 records with 9 attributes. Comparing the resulting models, the C4.5 algorithm with its own data split (80% training data and 20% test data) had the highest accuracy, 81.02%, compared to the other algorithms and data splits. In the attribute analysis, the disconnection attribute (interpreted as the reason why customers stopped) was the most influential attribute on the accuracy of the results in predicting customer loyalty. That article does not discuss feature selection algorithms, methods for obtaining important features, or their impact on model accuracy.

Wong and Wei [ 7 ] presented research to develop a tool that analyzes customer behavior and predicts upcoming purchases for an air travel company. They provided a tool that integrates data mining, competitor pricing, customer segmentation, and predictive analysis. In the customer segmentation analysis, 110,840 clients were identified and segmented based on their purchasing behavior. Customer profiles were split using a weighted RFM model, and customer purchasing behavior was analyzed in response to competitor price changes. The following destinations were predicted for high-value customers identified using pre-link rules, and customized packages were promoted to the targeted customer segments.

Moedjionom et al. [ 8 ] predicted customer loyalty in a multimedia services company that offers many services to win the market. The contribution of that research is to segment and split potential customers based on the RFM model and then apply classification, improving the accuracy of the customer loyalty rating. Although the C4.5 algorithm with k-means segmentation gives a better result, some important steps could still be added: using an optimization algorithm to select features or to adjust the label values in order to obtain a more accurate model.

Kaya et al. [ 9 ] built a predictive model based on spatial, temporal, and optional behavioral features using individual transaction logs. Their results show that the proposed dynamic behavioral models can predict churn decisions much better than demographics-based features, and that this effect remains constant across multiple datasets and different definitions of customer churn. They examined the relative importance of different behavioral features in predicting churn, and how predictive power differs across different population groups.

Cheng and Sun [ 10 ] presented another application of the RFM model (named TFM) to identify high-value customers in the communications industry. They use three main features to describe users: those who have accumulated a greater amount of service time (T), who often purchase 3G services (F), and who generate large invoice amounts per month (M).

This study proposes a comprehensive CRM strategy framework that includes customer segmentation and behavior analysis, using a dataset that contains about 500 million records (the full dataset at Syriatel). Al Janabi and Razaq [ 11 ] used intelligent big data analysis to design smart predictors for customer churn in the telecommunication industry. The goal of that research is to retain customers and improve revenue. The proposed system consists of three basic phases. First phase: understanding the company’s data; this phase focuses on the initial processing of data that is fragmented and unbalanced, and the imbalance problem was addressed by building the DSMOTE algorithm. Second phase: constructing a GBM-based predictor and then replacing its decision-making part, a decision tree (DT), with a genetic algorithm (GA), which overcomes DT problems and reduces implementation time. Third phase: the accuracy of the predictor results was verified using the confusion matrix. A comparison was made between the traditional preprocessing method, SMOTE, and DSMOTE in terms of error rate and accuracy; the GBM-GA method has higher accuracy than GBM.

One of the biggest challenges of the current big data landscape is our inability to process vast amounts of information in a reasonable time. Reyes-Ortiz et al. [ 12 ] explored and compared two distributed computing frameworks implemented on commodity cluster architectures: MPI/OpenMP on Beowulf that is high-performance oriented and exploits multi-machine/multi- core infrastructures, and Apache Spark on Hadoop which targets iterative algorithms through in-memory computing. The Google Cloud Platform service was used to create virtual machine clusters, run the frameworks, and evaluate two supervised machine learning algorithms: KNN and Pegasos SVM. Results obtained from experiments with a particle physics data set show MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed and provides more consistent performance. However, Spark shows better data management infrastructure and the possibility of dealing with other aspects such as node failure and data replication.

There are several studies in the field of communication that deal with predicting the age and gender of the customer on a big data platform by analyzing their personal data, including the study presented by Zaubi [ 13 ], who designed a model using a reliable dataset of 18,000 users provided by SyriaTel Telecom Company for training and testing. The model was applied using big data technology and achieved 85.6% accuracy for user gender prediction and 65.5% for user age prediction. The main contribution of that work is the improvement in accuracy of user gender and age prediction based on mobile phone data, and an end-to-end solution that approaches customer data from multiple aspects in the telecom domain.

Other studies have also dealt with predicting customer churn in telecom using machine learning on a big data platform, including the study presented by Ahmad [ 14 ]. The main contribution of his work is to develop a churn prediction model that assists telecom operators in predicting customers who are most likely to churn. The model developed in that work uses machine learning techniques on big data platforms and builds a new way of feature engineering and selection. To measure the performance of the model, the Area Under Curve (AUC) standard measure is adopted, and the AUC value obtained is 93.3%. Another main contribution is the use of customer social networks in the prediction model by extracting Social Network Analysis (SNA) features. The use of SNA enhanced the performance of the model from 84% to 93.3% on the AUC measure.

With regard to how some studies approached customer value analysis, retention, and loyalty: the study in [ 4 ] did not apply big data, as it studied all customers according to some features using a machine learning method (a logistic regression model) to show the role of machine learning in retaining customers and increasing their loyalty. In [ 5 ], machine learning was implemented at a major hospitality company and compared to traditional methods to determine customer value in the loyalty program. The study in [ 6 ] predicted customer loyalty at the National Multimedia Company of Indonesia using three data mining algorithms applied to a dataset of 2269 records with 9 attributes; comparing the models, the C4.5 algorithm with its own data split had the highest accuracy, 81.02%. In our study, a model is built to predict customer loyalty based on the new TFM methodology and machine learning. Our experiments demonstrated that TFM is more appropriate for the telecom sector than RFM. The concept of TFM is adjusted so that T is the sum of call durations and Internet session durations during a certain period. The dataset obtained contains 127 million records and 220 features. Both binary and multi-class classification were applied. After comparing the classifiers, the gradient-boosted tree classifier was found to be the best for binary classification and the random forest classifier the best for multi-class classification.

Research tools

Hortonworks Data Platform (HDP)

HDP is an open-source framework for distributed storage and processing of large, multi-source datasets [ 15 ]. HDP enables flexible application deployment, machine learning and deep learning workloads, real-time data storage, security, and governance. It is a key element of a modern data architecture (Fig. 1).

Figure 1: Hortonworks Data Platform (HDP 3.1) [ 15 ].

The HDP framework was custom-installed to obtain only the tools and systems required for all stages of this work. These tools and systems were: the Hadoop HDFS distributed file system [ 16 ] for data storage, the Spark execution engine for data processing [ 17 ], YARN for resource management, Zeppelin as a development user interface, Ambari for system monitoring, Ranger for system security, and Flume and Sqoop for data acquisition from Syriatel company data sources into HDFS in our dedicated framework.

Hive is an ETL and data warehouse tool on top of the Hadoop ecosystem used for processing structured and semi-structured data. Hive is a database in the Hadoop ecosystem that performs DDL and DML operations and provides a flexible query language, HQL, for better querying. Hive in MapReduce mode was used because the data was distributed across multiple data nodes, allowing queries to execute in parallel with better performance. The hardware resources used included 12 nodes with 32 GB of RAM, 10 TB of storage capacity, and a 16-core processor per node. The Spark engine [ 17 ] was used in most phases of model building, such as data processing, feature engineering, training, and model testing, because it can keep its data in the compute engine’s memory (RAM) and process that in-memory data, eliminating the need for continuous input/output (I/O) of writing/reading data to and from disk. In addition, it has many other advantages; one of them is that the engine contains a variety of libraries covering all stages of the machine learning life cycle.
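The sketch below shows, under assumed table and column names (dwh.cdr_events, gsm_id, and so on), how such a Spark-on-Hadoop setup typically reads a Hive table and caches it in executor memory; it illustrates the stack described above, not the project's actual code.

```scala
import org.apache.spark.sql.SparkSession

object LoadCdrs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SyriatelLoyalty")
      .enableHiveSupport()          // query Hive tables through HQL
      .getOrCreate()

    // Hypothetical Hive table of CDR events stored on HDFS.
    val cdrs = spark.sql(
      """SELECT gsm_id, call_duration, session_duration, charge_amount, event_time
        |FROM dwh.cdr_events
        |WHERE event_time >= '2019-01-01'""".stripMargin)

    cdrs.cache()                    // keep the working set in executor memory (RAM)
    println(s"CDR rows loaded: ${cdrs.count()}")

    spark.stop()
  }
}
```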

Syriatel data sources

Call Details Records (CDRs)

Each time a call is made, a message is sent, the Internet is used, or an operation is performed on the network, the descriptive information is stored as a call details record (CDR). Table 1 illustrates some of the types of call, message, and Internet detail records available at Syriatel that were used in this research to predict customer loyalty:

Record types: call logs, SMS message logs, MMS multimedia message logs, DATA Internet usage logs, MON fee and monthly information logs, VOU recharge logs, web metadata information, and EGGSK roaming records.

Detailed data stored in relational databases

The call details records were linked to the customer detail data stored in relational databases using the GSM number. These databases include the Customer Management Database, the Customer Complaints Database, the Towers Information Database, and the Mobile Phone Information Database.

Customer services

All services registered by the customer were collected and classified manually based on the type of service, such as political news, sports news, horoscopes, etc.; these categories are treated as customer features. As a result, a customer services table is produced; Table 2 is a sample.

Customer contract information

Customer contract information was fetched from the CRM system and contains basic customer information (gender, age, location, etc.) and customer subscription information. A single customer may have more than one subscription (two or more GSMs) with different types of subscription: pre-subscription, prepaid, 3G, 4G, etc.

Database of cells and sites

The company’s location data for cells and sites and their components was stored in a relational database. This data was used to extract spatial features. Table 3 is a sample.

Demographics data for customers

Building such a predictive system requires data containing the real demographics, such as gender and age, for each GSM, since the real user of a GSM and its registered owner are sometimes not the same person. Table 4 is a sample.

Extraction of features

The features were engineered and extracted based on our research and our experience in the telecom domain. 223 features were extracted for each GSM. These features belong to 6 feature categories; each category is described below with examples.

Segmentation Features T, F, M (3 features)

Time (T): total duration of calls and Internet sessions in a certain period of time (Fig. 2).

Frequency (F): how frequently services are used within a certain period (Fig. 3).

Monetary (M): The money spent during a certain period (Fig. 4 ).
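As an illustrative sketch only, the three segmentation features could be derived from a CDR DataFrame roughly as follows; the column names (gsm_id, call_duration, session_duration, charge_amount) are assumptions, with the real schema described in Tables 1-4.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, sum}

// Derive T, F and M per GSM from CDR rows for a chosen time window.
def segmentationFeatures(cdrs: DataFrame): DataFrame =
  cdrs.groupBy("gsm_id").agg(
    sum(col("call_duration") + col("session_duration")).as("T_total_duration"),
    count(lit(1)).as("F_frequency"),
    sum(col("charge_amount")).as("M_monetary")
  )
```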

Classification Features (220 features)

Individual Behavioral Features

Individual behavior can be defined as how an individual behaves with services.

For example:

Calls duration per day: calls duration per day for each GSM.

Duration per day: calls and sessions duration per day for each GSM (Figs. 5 , 6 , 7 ).

Entropy of duration

High entropy means the data has high variance and thus contains a lot of information and/or noise.

Daily outgoing calls: for each GSM the daily outgoing calls.

Calls incoming daily at night: for each GSM, the daily incoming calls at night (Fig. 8).

SMS received daily at work time, etc. About 200 features belong to this category.

Social behavior features

Social behavior is behavior among two or more organisms of the same species, encompassing any behavior in which one member affects another due to the interaction between them.

Some examples of these features:

Number of contacts: for each customer, the number of contacts with other customers.

Transactions received per customer: number of calls, SMS messages, and Internet sessions received by each customer.

Transactions sent per contact, etc. (20 features).

Spatial and navigation features

Features about customers’ spatial movements, for example:

Holiday navigation: customer movements on holidays.

Home zone: location of the customer’s home.

Antenna location: location of the antenna.

Daytime antenna: antenna used by the customer’s transactions in the daytime.

Workday antenna: antenna used by the customer’s transactions on workdays.

Vacation antenna: antenna used by the customer’s transactions on vacation, etc. There are about 21 such features.

Time-sliced features for each working day (Sunday to Thursday), for holidays, for daytime (9:00 to 16:00), and for night (17:00 to 8:00), for example the average number of SMS received per day on holidays, etc. (165 features).

Types of services registered

Technical news services, educational services, sports news services, political news services, entertainment services, etc. (13 features).

Contract information

Tariff type and GSM type (2 features).

The total number of features listed above is 421, but about 201 features belong to more than one category, so the total number of distinct features is 220. Figures 2-8 show the mass distribution of some features with respect to loyalty.

Figure 2: Mass distribution for T in our study.
Figure 3: Mass distribution for F in our study.
Figure 4: Mass distribution for M in our study.
Figure 5: Mass distribution for “Avg dur per day daylight”.
Figure 6: Mass distribution for “Avg dur per day night”.
Figure 7: Mass distribution for “Avg dur per day worktime”.
Figure 8: Avg dur per call.

Feature engineering: ways to choose features

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work well. The most important reasons to use feature selection are:

It enables machine learning algorithm to train faster.

It reduces the complexity of the model and makes it easy to interpret.

It improves the accuracy of the model if the correct subset is selected.

It reduces overfitting.

Next, we discuss several methodologies and techniques that can be used to shape the feature space and help models perform better and more efficiently.

Attribute (feature) selection algorithms

Many feature selection algorithms have been discovered and reported in the literature. According to their mathematical models, they are categorized into three categories: the filter model, the wrapper model, and the embedded model.

Filter model

The filter model depends on the general characteristics of the data and evaluates features without involving any learning algorithm [ 13 ]. Filter model algorithms include Relief-F and Information Gain. Entropy (H) is used to calculate the homogeneity of a sample.

Information gain is the decrease in entropy after splitting the dataset on an attribute. Other filter criteria include the Gini index, Chi-squared, and Gain Ratio [ 18 ].
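For reference, these filter criteria rest on the standard definitions of entropy and information gain:

\[ H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i , \qquad IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \]

where \(p_i\) is the proportion of samples of class \(i\) in the set \(S\), and \(S_v\) is the subset of \(S\) for which attribute \(A\) takes the value \(v\).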

Wrapper model

A predefined learning algorithm is required and its performance is used as a benchmark for evaluation and feature identification.

Embedded model

The embedded model chooses a set of features while training and building a model, then tests feature importance with respect to the goal of the learning model. You can obtain the importance of each feature in your dataset by using the feature-importance property of the model. Feature importance gives a score for each feature; the higher the score, the more important or relevant the feature is to the output variable. Feature importance is a built-in capability of tree-based classifiers.
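A minimal sketch of embedded selection with Spark ML in Scala is shown below; the label and feature column names and the tree count are assumptions for illustration.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.DataFrame

// Fit a tree ensemble and rank features by the model's built-in importance scores.
def rankFeatures(train: DataFrame, featureNames: Array[String]): Seq[(String, Double)] = {
  val model = new RandomForestClassifier()
    .setLabelCol("loyalty")
    .setFeaturesCol("features")
    .setNumTrees(100)
    .fit(train)

  // featureImportances is a vector aligned with the assembled feature order.
  featureNames.zip(model.featureImportances.toArray)
    .sortBy { case (_, importance) => -importance }
    .toSeq
}
```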

Big data analytics builds on several technological advances in memory usage and data processing: handling high-volume data with decreasing costs of storage and CPU power, effective storage management for flexible computation and storage, and distributed computing systems with flexible parallel processing [ 19 ], together with the development of new frameworks such as Hadoop. Big data frameworks such as the Hadoop ecosystem and NoSQL databases efficiently handle complex queries, analytics, and the extract, transform, and load (ETL) operations that are complex in conventional data warehouses. These technological changes have brought a number of improvements over conventional analytics.

Feature selection aims to select a feature subset that has the lowest dimension while retaining classification accuracy, so as to optimize a specific evaluation criterion. Feature selection methods can be filter or wrapper methods. Filter methods make a selection on the basis of the characteristics of the dataset itself; the selection may be fast, but it can lead to poor performance. Wrapper methods take a subsequent classification algorithm as the evaluation index; they achieve high classification accuracy but have very low efficiency due to the large amount of computation. Backward feature elimination and forward feature construction have been used [ 19 ].

Backward elimination: In contrast to the forward selection strategy, the backward elimination strategy starts with the complete attribute set as initial subset and iteratively (and also heuristically) removes attributes from that subset, until no performance gain can be achieved by removing another attribute.

Forward selection: initially uses attribute subsets containing exactly one attribute; additional attributes are then added heuristically until no further performance gain can be achieved by adding an attribute. Both methods can be implemented within the Hadoop ecosystem.
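A minimal sketch of wrapper-style selection, assuming a feature matrix X and labels y already exist: scikit-learn's SequentialFeatureSelector is used here as a stand-in for the forward-selection and backward-elimination strategies described above (it stops at a requested subset size rather than at the point of no further gain).

```python
# Hedged sketch of wrapper-based selection; X and y are assumed to be defined already.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)   # the predefined learning algorithm

# Forward selection: start from one attribute and greedily add the best next one.
forward = SequentialFeatureSelector(estimator, n_features_to_select=10,
                                    direction="forward", cv=5)
X_fwd = forward.fit_transform(X, y)

# Backward elimination: start from the full set and greedily remove attributes.
backward = SequentialFeatureSelector(estimator, n_features_to_select=10,
                                     direction="backward", cv=5)
X_bwd = backward.fit_transform(X, y)
```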

Implementation methodology

The proposed model was designed to determine customer loyalty. It relied on data mining and customer value analysis methods based on TFM to improve customer relationship management. Customers were segmented by calculating the TFM score.

After dividing the customers and determining the degree of loyalty in each segment, classification was performed based on labels expressing the level of loyalty in each segment and a set of behavioral and demographic features, using several algorithms whose models were evaluated to obtain the one with the highest accuracy. The steps required for the proposed model are illustrated in Fig. 9.

Fig. 9: Steps required for the proposed model

The loyalty features of these models were extracted to find the causes of loyalty and to precisely target these segments.

Data preparation

The data were collected from Syriatel data sources into the Hadoop environment.

The Scala language was chosen to perform data preparation, attribute extraction, model training, and testing because it is the language in which the distributed execution engine (Spark) is developed. Spark achieves high-speed execution in addition to stability. The machine learning library provided by the Spark engine (the Spark ML extension) was used.

The original data were divided into two parts, a training set and a test set, in a 70/30 ratio.
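A minimal PySpark sketch of this split, assuming the prepared data already sits in the Hadoop environment; the HDFS path is hypothetical.

```python
# Hedged sketch: load prepared data from HDFS and split it 70/30 into train and test sets.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("loyalty-model").getOrCreate()
data = spark.read.parquet("hdfs:///path/to/customer_features")   # hypothetical path

train, test = data.randomSplit([0.7, 0.3], seed=42)
print(train.count(), test.count())
```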

Address missing and text values

Filling in missing values with either zero or the average of several nearby values enabled us to use the information in most attributes. In this research, the following steps were applied:

Attributes with at least 70% missing values were deleted.

Missing numerical values were replaced with the mean of the attribute itself.

StringIndexer was applied to string-valued attributes to convert them into numbers. A common workflow is to produce indices with StringIndexer, train a model with these indices, and then retrieve the original labels from the predicted index column using IndexToString; you are also free to provide your own labels. Emphasis was placed on the attribute preparation and attribute selection process.

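The sketch below illustrates these preparation steps with Spark ML, assuming the `data` DataFrame from the previous sketch; the column names are invented for illustration.

```python
# Hedged sketch: drop mostly-missing attributes, mean-impute numeric values, and index
# string attributes. `data` and the column names are assumptions, not the real schema.
from pyspark.ml.feature import Imputer, StringIndexer, IndexToString

# Delete attributes with at least 70% missing values.
n = data.count()
keep = [c for c in data.columns
        if data.filter(data[c].isNull()).count() / n < 0.70]
data = data.select(keep)

# Replace missing numerical values with the mean of the attribute itself.
numeric_cols = ["total_call_duration", "sms_count"]        # illustrative columns
imputer = Imputer(strategy="mean", inputCols=numeric_cols,
                  outputCols=[c + "_imp" for c in numeric_cols])
data = imputer.fit(data).transform(data)

# Convert a string attribute to numeric indices; IndexToString maps predictions back.
indexer = StringIndexer(inputCol="tariff_plan", outputCol="tariff_plan_idx").fit(data)
data = indexer.transform(data)
back = IndexToString(inputCol="tariff_plan_idx", outputCol="tariff_plan_label",
                     labels=indexer.labels)
```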

Data processing and application of extraction and selection of attributes

The T, F, M features were calculated for each customer, and the behavioral features were chosen.

The most important attributes were chosen based on the Chi-Squared function. This function was applied to the groups of categorical characteristics and the candidate features to assess the probability of a correlation between them using their frequency distributions.
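A hedged sketch of this step with Spark ML's ChiSqSelector; `feature_cols`, the label column name, and the number of selected features are assumptions.

```python
# Hedged sketch: Chi-Squared-based attribute selection in Spark ML.
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(data)

selector = ChiSqSelector(numTopFeatures=50,               # illustrative cut-off
                         featuresCol="features",
                         labelCol="loyalty_level",
                         outputCol="selected_features")
selected = selector.fit(assembled).transform(assembled)
```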

Calculating the TFM score

To calculate the TFM results, each of the three input parameters was divided into five subcategories. The time, frequency, and monetary scores were calculated and then combined to obtain a comprehensive TFM analysis score.

Cumulative total duration (T)

Time (T): total duration of calls and Internet sessions in a certain period of time.

The customers were divided into different categories (Table 5) according to the cumulative total duration of service, i.e., the total duration of calls and Internet sessions over 3 months (T).

Some research defines T as the average time spent communicating or using application services in one month; in our study the period is three months. In that definition, when the user's connection time is greater than the previously calculated average, the value of T is 2; otherwise, it is 1. In our study, five levels were used instead: the maximum and minimum cumulative values of the total duration of calls and usage per GSM were calculated, and five levels of T resulted, where T1 is the first category with the lowest value, T2 the second, T3 the third, T4 the fourth (high value), and T5 the fifth (very high value).

Frequency (F): how frequently services are used within a certain period.

Customers were divided into different frequency categories (Table 6) according to the total number of transactions performed with the company (calls, messages, Internet access) during the past 3 months.

F1 represents clients who have performed less than or equal to 2 transactions in the last 3 months and F5 represents customers who have performed more than or equal to 11 transactions in the last 3 months.

In our study, five levels were calculated after computing the maximum and minimum cumulative values of the total number of calls and uses per GSM, resulting in five levels of F.

Monetary (M): the money spent during a certain period. Customers were divided into different monetary categories (Table 7) according to the total amount paid for transactions with the organization over the past 3 months. M1 represents clients who paid a total of 100 or less in the last 3 months, and M5 represents customers who paid 10,000 or more in the last 3 months.

In our study, five levels were calculated after computing the maximum and minimum cumulative values of the total amount spent on calls and usage per GSM, resulting in five levels of M.

Based on the TF results calculated above (Table 8 ), we calculate the TFM results (Table 9 ).
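As a hedged illustration, the sketch below assigns 1-5 scores for T, F, and M with quantile binning in pandas and concatenates them into a TFM score; the quantile cut-offs and the toy data are assumptions, whereas the paper derives its levels from the minimum and maximum cumulative values per GSM.

```python
# Hedged sketch: quintile-based T, F, M scoring and a combined TFM score on toy data.
import pandas as pd

tfm = pd.DataFrame({
    "customer": ["c1", "c2", "c3", "c4", "c5"],
    "total_duration": [1200, 54000, 300, 91000, 15000],   # calls + sessions, 3 months
    "n_transactions": [2, 40, 1, 120, 15],
    "total_paid": [90, 4200, 50, 15000, 800],
})

for col, score in [("total_duration", "T"), ("n_transactions", "F"), ("total_paid", "M")]:
    ranks = tfm[col].rank(method="first")
    tfm[score] = pd.qcut(ranks, 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Combine the three digits into one TFM score, e.g. T=5, F=4, M=5 -> "545".
tfm["TFM"] = tfm["T"].astype(str) + tfm["F"].astype(str) + tfm["M"].astype(str)
print(tfm[["customer", "T", "F", "M", "TFM"]])
```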

Segment and target customers

Customer categories.

Customer segmentation involves splitting the customer base into different subsets, each containing customers with the same interests and spending habits [20]. Based on the TFM results as calculated above, customers can be divided into five parts:

Very high value customers (greater loyalty): these are the customers who make the highest profit for the operator. Without them, the operator would lose its market share and competitive advantage. These customers are given appropriate care and attention by the operator.

High value customers (great loyalty): these customers also generate high profit for the operator and likewise receive appropriate care and attention.

Medium value customers (average loyalty): these are the customers who generate medium profit.

Low value customers (little loyalty): these are the clients who generate very little profit.

Customers who churn from the company (very little loyalty).

Customers who have the least loyalty are those who have left the company or are about to leave. Efforts were made to prevent them from leaving; if they do leave the company, the cost of losing them is calculated as the total cost (1) associated with these potential churners when they end their relationship:

Total cost (customer leakage) = lost revenue + marketing cost (1)

Lost revenue is the revenue that these customers would generate if they did not end their relationship with the operator. The marketing cost is the cost associated with replacing these customers with new customers.

Target customers

By calculating the TFM score individually for the total time spent on calls and SMS and for the total Internet data, high-value customers were recognized, as well as customers likely to leave the company. Today, most people use a range of telecommunications operator services that include both connection (calls and SMS) and Internet data. The exceptions were people who used only the operator's Internet services or only calls, but not both. Therefore, to target customers, both the TFM score for total call time and SMS and the TFM score for the amount of Internet data were considered (Tables 10, 11).

Considering both the TFM score for total call time and number of messages and the TFM score for Internet data:

If the TFM score is high, the customers use large amounts of Internet data, spend a lot of time on calls, and send a large number of messages. Average customers use average amounts of Internet data, call time, and messages. If the TFM score is low, the customers use less Internet data, spend less time on calls, and send fewer messages. Heavy users can be targeted with loyalty points and personalized offers tailored to them, as they are the key to the operator's competitive advantage in the marketplace. Medium users can be offered a combination of both call rates and data bundles. For low users, operators can deliver offers that encourage them to use more of the services provided, for example, free local calls to numbers on the same operator late at night. For potential churners, the TFM scores were integrated with unstructured data, such as social media data and call center feedback, to predict them accurately. Customers who are likely to churn must be taken seriously by telecom operators because of the revenue and marketing cost associated with each of them. Therefore, the operator should make offers such as free talk time, free packet data for a specified period, or an additional number of messages, for example, 200 MB of data for 3 days, to retain them.

Results and discussion

Apply classification algorithms.

Having segmented customers using the scores and recognized the loyalty of each segment, at this stage the causes of loyalty were needed, i.e., the behavioral features of customers in each segment. The 220 behavioral features, together with the labels resulting from the segmentation process, were used as input to the classification algorithms to identify the causes of loyalty and the influential features at each level of loyalty. The other benefit of applying classification algorithms was to build an accurate predictive model for classifying new users by loyalty. Multi-class and binary classifiers were built, and the results were compared using different criteria. The classifier with the highest accuracy gave the best correlation between behavioral features and loyalty categories and identified the behavioral features that best described the categories (classes), thereby assisting decision-making in building marketing offers for each category and thus increasing the company's profit.

Performance measurement

The confusion matrix shown in Table 12 contains information on the actual and predicted classifications made by the binary classification system (loyal = 1, not loyal = 0). Each term corresponds to a specific situation, as follows:

True positive (TP): the prediction is yes (the customer is loyal to the company), and the customer is in fact loyal.

True negative (TN): the prediction is no (the customer is not loyal to the company), and the customer is in fact not loyal.

False positive (FP): the prediction is yes (the customer is loyal), but the customer actually left the company. This is also known as a "Type 1 error".

False negative (FN): the prediction is no (the customer is not loyal), but the customer is loyal and did not actually leave the company. This is also known as a "Type 2 error".

Some performance measures can be calculated directly from the confusion matrix [ 21 ].

The true positive rate, TPR = TP / (TP + FN), is also known as recall or sensitivity.

Accuracy denotes the rate of correctly classified cases from both categories. It is expressed by the following equation: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Area under the curve (AUC): measures the overall effectiveness of the classifier and can be calculated as described in [21].

F1-measure: the harmonic mean of precision and recall, calculated as F1 = 2 × (precision × recall) / (precision + recall).
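A small sketch computing these measures directly from confusion-matrix counts; the counts are illustrative, not the paper's results.

```python
# Hedged sketch: performance measures from illustrative confusion-matrix counts.
tp, tn, fp, fn = 850, 9000, 120, 150

recall    = tp / (tp + fn)                       # TPR / sensitivity
precision = tp / (tp + fp)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
```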

Compare binary classifiers

Confusion matrix.

An example of the confusion matrix for the Multilayer Perceptron Classifier algorithm (Tables 13 , 14 ).

Loyalty categories (loyal 1, not loyal 0)

After comparing the classifiers, the gradient-boosted tree classifier was found to be the best.
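A hedged Spark ML sketch of training and scoring one of the compared binary classifiers (a gradient-boosted tree); the `train`/`test` DataFrames with a "features" vector and a binary "loyal" label are assumed to come from the earlier preparation steps.

```python
# Hedged sketch: binary loyalty classification with a gradient-boosted tree and AUC.
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

gbt = GBTClassifier(featuresCol="features", labelCol="loyal", maxIter=50)
model = gbt.fit(train)
predictions = model.transform(test)

evaluator = BinaryClassificationEvaluator(labelCol="loyal", metricName="areaUnderROC")
print("AUC =", evaluator.evaluate(predictions))
```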

Comparison of multiple classes

An example of recall computed from a multi-class confusion matrix (Table 15), using Recall = TP / (TP + FN) for each class:

R(A) = 100/200 = 0.5 (TP: 100, FN: 100); R(B) = 9/10 = 0.9 (TP: 9, FN: 1); R(C) = 8/10 = 0.8 (TP: 8, FN: 2); R(D) = 9/10 = 0.9 (TP: 9, FN: 1).

Average recall = [R(A) + R(B) + R(C) + R(D)] / 4 = 0.775

Multi-classification (1, 2, 3, 4, 5)

It reflects the loyalty levels, where 5 is very high loyalty, 4 is high loyalty, 3 is medium loyalty, 2 is low loyalty, and 1 is very low loyalty.

Note 1: Multilayer perceptron classifier; input: 220 features; 4 layers with 5 nodes in each layer; output: 5 classes.

Note 2: Gradient-boosted tree classifier currently only supports binary classification (Table 16 ).

After comparing the classification algorithms, the random forest classifier turned out to be the best. An example of the distinctive features of each level of loyalty derived from the binary classification model is given in Table 17.

TFM segmentation and the setting of loyalty levels were relied upon. The classification algorithms were applied using the loyalty levels as classification categories and the selected attributes; the results were compared and the best classification model in terms of accuracy was selected. The rules of loyalty prediction were then derived from this model, expressing the correlation of behavioral features with the classification categories and thus revealing the causes of loyalty in each segment. Target customers could then be addressed with appropriate offers and services. The other benefit of applying the classification algorithms was to build an accurate predictive model for classifying new users by loyalty.
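A hedged sketch of the multi-class setting with Spark ML's random forest, which the comparison above found to perform best; column names are carried over from the earlier sketches, and the five loyalty levels are assumed to have been indexed to 0-4 (e.g., with StringIndexer).

```python
# Hedged sketch: multi-class loyalty model (5 levels) and its evaluation in Spark ML.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(featuresCol="features", labelCol="loyalty_idx", numTrees=100)
predictions = rf.fit(train).transform(test)

for metric in ("accuracy", "weightedRecall", "f1"):
    evaluator = MulticlassClassificationEvaluator(labelCol="loyalty_idx",
                                                  metricName=metric)
    print(metric, "=", evaluator.evaluate(predictions))
```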

Availability of data and materials

The data that support the findings of this study are available from SyriaTel Telecom Company but restrictions apply to the availability of this data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of SyriaTel Telecom Company.

https://hadoop.apache.org/ .

https://spark.apache.org/ .

https://zeppelin.apache.org/ .

https://flume.apache.org .

https://sqoop.apache.org/ .

https://hive.apache.org/ .

Abbreviations

GSM: Global system for mobile communications
CDR: Call detail record
CRM: Customer relationship management
SMS: Short message service
QoE: Quality of experience
HDFS: Hadoop distributed file system
LDA: Linear discriminant analysis
XGBoost: Extreme gradient boosting
GBM: Gradient boosting machine
Bagging CART: Bagging classification and regression trees
AUC: Area under the curve
SD: Standard deviation
Received by the customer; sent by the customer
TP: True positive
TN: True negative
FN: False negative
TP + FN: True positive + false negative
TN + FP: True negative + false positive
TPR: True positive rate

Saini N, Monika Garg K. Churn prediction in telecommunication industry using decision tree. 2017

Qiasi R, Baqeri-Dehnavi M, Minaei-Bidgoli B, Amooee G. Developing a model for measuring customer’s loyalty and value with rfm technique and clustering algorithms. J Math Comput Sci. 2012;4 (2):172–81.


Kim S-Y, Jung T-S, Suh E-H, Hwang H-S. Customer segmentation and strategy development based on customer lifetime value: a case study. Expert Syst Appl. 2006;31 (1):101–7.

Oladapo K, Omotosho O, Adeduro O. Predictive analytics for increased loyalty and customer retention in telecommunication industry. Int J Comput Appl. 2018;975:8887.


Aluri A, Price BS, McIntyre NH. Using machine learning to cocreate value through dynamic customer engagement in a brand loyalty program. J Hosp Tour Res. 2019;43 (1):78–100.

Wijaya A, Girsang AS. Use of data mining for prediction of customer loyalty. CommIT J. 2015;10 (1):41–7.

Wong E, Wei Y. Customer online shopping experience data analytics: integrated customer segmentation and customised services prediction model. Int J Retail Distrib Manag. 2018;46 (4):406–20.

Moedjiono S, Isak YR, Kusdaryono A. Customer loyalty prediction in multimedia service provider company with k-means segmentation and C4.5 algorithm. In: 2016 international conference on informatics and computing (ICIC), IEEE. 2016:210–5.

Kaya E, Dong X, Suhara Y, Balcisoy S, Bozkaya B, et al. Behavioral attributes and financial churn prediction. EPJ Data Sci. 2018;7 (1):41.

Cheng L-C, Sun L-M. Exploring consumer adoption of new services by analyzing the behavior of 3G subscribers: an empirical case study. Elect Comm Res Appl. 2012;11 (2):89–100.

Janabi S, Razaq F. Intelligent big data analysis to design smart predictor for customer churn in telecommunication industry. In: Farhaoui Y, Moussaid L, editors. Big data and smart digital environment. Cham: Springer; 2019. p. 246–72.

Reyes-Ortiz JL, Oneto L, Anguita D. Big data analytics in the cloud: spark on hadoop vs mpi/openmp on beowulf. Proc Comput Sci. 2015;53:121–30.

Al-Zuabi IM, Jafar A, Aljoumaa K. Predicting customer’s gender and age depending on mobile phone data. J Big Data. 2019;6 (1):18.

Ahmad AK, Jafar A, Aljoumaa K. Customer churn prediction in telecom using machine learning in big data platform. J Big Data. 2019;6 (1):28. https://doi.org/10.1186/s40537-019-0191-6 .

Hortonworks Data Platform (HDP) Kernel Description. Cloudera. https://www.cloudera.com/products/hdp.htm . Accessed 2019.

Shvachko K, Kuang H, Radia S, Chansler R, et al. The hadoop distributed file system. MSST. 2010;10:1–10.

Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. HotCloud. 2010;10 (10–10):95.

El-Hasnony IM, El Bakry HM, Saleh AA. Comparative study among data reduction techniques over classification accuracy. Int J Comput Appl. 2015;122:2.

Seelammal C, Devi KV. Hadoop based feature selection and decision making models on big data. Middle-East J Sci Res. 2017;25 (3):660–5.

Singh I, Singh S. Framework for targeting high value customers and potential churn customers in telecom using big data analytics. Int J Educ Manag Eng. 2017;7 (1):36–45.

Bradley AP. The use of the area under the roc curve in the evaluation of machine learning algorithms. Patt Recogn. 1997;30 (7):1145–59.


Acknowledgements

This research was sponsored by SyriaTel Telecom Co. We thank our colleagues Mem. Mjida (SyriaTel CEO) and Mr. Adham Troudi (Big Data manager), who provided insight and expertise that greatly assisted the research, although they may not agree with all of the interpretations/conclusions of this paper. The authors also thank Housam Wassouf, Hazem S Nassar, Majdi Msallam, Mahmoud Eissa, Nasser Abo Saleh, and Mustafa Mustafa for their great ideas, help with the data processing, and useful discussions. Finally, the authors thank Rama Saleh, Marita Wassouf, Amal Abbas, Nazier Wassouf, Weaam Wassouf, Walaa Wassouf, and Rawad Wassouf for their moral support.

The authors declare that they have no funding.

Author information

Authors and affiliations.

Faculty of Information Technology-Department of software engineering and information systems, Al-Baath University, Homs, Syria

Wissam Nazeer Wassouf & Kamal Salloum

Faculty of Applied Sciences, Hama University, Hama, Syria

Ramez Alkhatib

Faculty of Information Technology, Higher Institute for Applied Sciences and Technology, Damascus, Syria

Shadi Balloul


Contributions

WNW-W took on the main role: he performed the literature review, implemented the proposed model, conducted the experiments, and wrote the manuscript. RA and KS took on a supervisory role and oversaw the completion of the work. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Wissam Nazeer Wassouf .

Ethics declarations

Ethics approval and consent to participate.

The authors provide ethics approval and consent to participate.

Consent for publication

The authors consent to publication.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Wassouf, W.N., Alkhatib, R., Salloum, K. et al. Predictive analytics using big data for increased customer loyalty: Syriatel Telecom Company case study. J Big Data 7 , 29 (2020). https://doi.org/10.1186/s40537-020-00290-0


Received : 19 October 2019

Accepted : 17 February 2020

Published : 23 April 2020

DOI : https://doi.org/10.1186/s40537-020-00290-0


Keywords

  • Customer loyalty
  • Classification algorithms
  • Customer behavior
  • Machine learning
  • Features selection



Published: 5 April 2024 Contributors: Tim Mucci, Cole Stryker

Big data analytics refers to the systematic processing and analysis of large amounts of data and complex data sets, known as big data, to extract valuable insights. Big data analytics allows for the uncovering of trends, patterns and correlations in large amounts of raw data to help analysts make data-informed decisions. This process allows organizations to leverage the exponentially growing data generated from diverse sources, including internet-of-things (IoT) sensors, social media, financial transactions and smart devices to derive actionable intelligence through advanced analytic techniques.

In the early 2000s, advances in software and hardware capabilities made it possible for organizations to collect and handle large amounts of unstructured data. With this explosion of useful data, open-source communities developed big data frameworks to store and process this data. These frameworks are used for distributed storage and processing of large data sets across a network of computers. Along with additional tools and libraries, big data frameworks can be used for:

  • Predictive modeling by incorporating artificial intelligence (AI) and statistical algorithms
  • Statistical analysis for in-depth data exploration and to uncover hidden patterns
  • What-if analysis to simulate different scenarios and explore potential outcomes
  • Processing diverse data sets, including structured, semi-structured and unstructured data from various sources.

Four main data analysis methods  – descriptive, diagnostic, predictive and prescriptive  – are used to uncover insights and patterns within an organization's data. These methods facilitate a deeper understanding of market trends, customer preferences and other important business metrics.



The main difference between big data analytics and traditional data analytics is the type of data handled and the tools used to analyze it. Traditional analytics deals with structured data, typically stored in relational databases. This type of database helps ensure that data is well-organized and easy for a computer to understand. Traditional data analytics relies on statistical methods and tools like structured query language (SQL) for querying databases.

Big data analytics involves massive amounts of data in various formats, including structured, semi-structured and unstructured data. The complexity of this data requires more sophisticated analysis techniques. Big data analytics employs advanced techniques like machine learning and data mining to extract information from complex data sets. It often requires distributed processing systems like Hadoop to manage the sheer volume of data.

These are the four methods of data analysis at work within big data:

The "what happened" stage of data analysis. Here, the focus is on summarizing and describing past data to understand its basic characteristics.

The “why it happened” stage. By delving deep into the data, diagnostic analysis identifies the root patterns and trends observed in descriptive analytics.

The “what will happen” stage. It uses historical data, statistical modeling and machine learning to forecast trends.

Describes the “what to do” stage, which goes beyond prediction to provide recommendations for optimizing future actions based on insights derived from all previous.

The following dimensions highlight the core challenges and opportunities inherent in big data analytics.

The sheer volume of data generated today, from social media feeds, IoT devices, transaction records and more, presents a significant challenge. Traditional data storage and processing solutions are often inadequate to handle this scale efficiently. Big data technologies and cloud-based storage solutions enable organizations to store and manage these vast data sets cost-effectively, protecting valuable data from being discarded due to storage limitations.

Data is being produced at unprecedented speeds, from real-time social media updates to high-frequency stock trading records. The velocity at which data flows into organizations requires robust processing capabilities to capture, process and deliver accurate analysis in near real-time. Stream processing frameworks and in-memory data processing are designed to handle these rapid data streams and balance supply with demand.

Today's data comes in many formats, from structured numeric data in traditional databases to unstructured text, video and images from diverse sources like social media and video surveillance. This variety demands flexible data management systems to handle and integrate disparate data types for comprehensive analysis. NoSQL databases, data lakes and schema-on-read technologies provide the necessary flexibility to accommodate the diverse nature of big data.

Data reliability and accuracy are critical, as decisions based on inaccurate or incomplete data can lead to negative outcomes. Veracity refers to the data's trustworthiness, encompassing data quality, noise and anomaly detection issues. Techniques and tools for data cleaning, validation and verification are integral to ensuring the integrity of big data, enabling organizations to make better decisions based on reliable information.

Big data analytics aims to extract actionable insights that offer tangible value. This involves turning vast data sets into meaningful information that can inform strategic decisions, uncover new opportunities and drive innovation. Advanced analytics, machine learning and AI are key to unlocking the value contained within big data, transforming raw data into strategic assets.

Data professionals, analysts, scientists and statisticians prepare and process data in a data lakehouse, which combines the performance of a data warehouse with the flexibility of a data lake to clean data and ensure its quality. The process of turning raw data into valuable insights encompasses several key stages:

  • Collect data: The first step involves gathering data, which can be a mix of structured and unstructured forms from myriad sources like cloud, mobile applications and IoT sensors. This step is where organizations adapt their data collection strategies and integrate data from varied sources into central repositories like a data lake, which can automatically assign metadata for better manageability and accessibility.
  • Process data: After being collected, data must be systematically organized, extracted, transformed and then loaded into a storage system to ensure accurate analytical outcomes. Processing involves converting raw data into a format that is usable for analysis, which might involve aggregating data from different sources, converting data types or organizing data into structured formats. Given the exponential growth of available data, this stage can be challenging. Processing strategies may vary between batch processing, which handles large data volumes over extended periods, and stream processing, which deals with smaller real-time data batches.
  • Clean data: Regardless of size, data must be cleaned to ensure quality and relevance. Cleaning data involves formatting it correctly, removing duplicates and eliminating irrelevant entries. Clean data prevents the corruption of output and safeguards reliability and accuracy.
  • Analyze data: Advanced analytics, such as data mining, predictive analytics, machine learning and deep learning, are employed to sift through the processed and cleaned data. These methods allow users to discover patterns, relationships and trends within the data, providing a solid foundation for informed decision-making (a minimal sketch of these stages follows this list).
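A minimal sketch of these stages in Python, under the assumption of a simple CSV extract with invented column names; real pipelines would distribute this work across big data frameworks.

```python
# Hedged sketch of collect -> process -> clean -> analyze on an assumed CSV extract.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Collect: load raw records gathered from one of many possible sources.
raw = pd.read_csv("transactions.csv")                     # hypothetical file

# Process: convert types and organize fields into a usable structure.
raw["timestamp"] = pd.to_datetime(raw["timestamp"])

# Clean: drop duplicates and rows missing key fields.
clean = raw.drop_duplicates().dropna(subset=["customer_id", "amount"])

# Analyze: fit a simple predictive model on two illustrative features.
X = clean[["amount", "num_items"]]
y = clean["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```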

Under the Analyze umbrella, there are potentially many technologies at work, including data mining, which is used to identify patterns and relationships within large data sets; predictive analytics, which forecasts future trends and opportunities; and deep learning, which mimics human learning patterns to uncover more abstract ideas.

Deep learning uses an artificial neural network with multiple layers to model complex patterns in data. Unlike traditional machine learning algorithms, deep learning learns from images, sound and text without manual help. For big data analytics, this powerful capability means the volume and complexity of data is not an issue.

Natural language processing (NLP) models allow machines to understand, interpret and generate human language. Within big data analytics, NLP extracts insights from massive unstructured text data generated across an organization and beyond.

Structured Data

Structured data refers to highly organized information that is easily searchable and typically stored in relational databases or spreadsheets. It adheres to a rigid schema, meaning each data element is clearly defined and accessible in a fixed field within a record or file. Examples of structured data include:

  • Customer names and addresses in a customer relationship management (CRM) system
  • Transactional data in financial records, such as sales figures and account balances
  • Employee data in human resources databases, including job titles and salaries

Structured data's main advantage is its simplicity for entry, search and analysis, often using straightforward database queries like SQL. However, the rapidly expanding universe of big data means that structured data represents a relatively small portion of the total data available to organizations.

Unstructured Data

Unstructured data lacks a pre-defined data model, making it more difficult to collect, process and analyze. It comprises the majority of data generated today, and includes formats such as:

  • Textual content from documents, emails and social media posts
  • Multimedia content, including images, audio files and videos
  • Data from IoT devices, which can include a mix of sensor data, log files and time-series data

The primary challenge with unstructured data is its complexity and lack of uniformity, requiring more sophisticated methods for indexing, searching and analyzing. NLP, machine learning and advanced analytics platforms are often employed to extract meaningful insights from unstructured data.

Semi-structured data

Semi-structured data occupies the middle ground between structured and unstructured data. While it does not reside in a relational database, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Examples include:

  • JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) files, which are commonly used for web data interchange
  • Email, where the data has a standardized format (e.g., headers, subject, body) but the content within each section is unstructured
  • NoSQL databases, which can store and manage semi-structured data more efficiently than traditional relational databases

Semi-structured data is more flexible than structured data but easier to analyze than unstructured data, providing a balance that is particularly useful in web applications and data integration tasks.
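As a hedged illustration, the sketch below flattens an invented semi-structured JSON record into a structured table with pandas, which is the kind of integration task where semi-structured formats shine.

```python
# Hedged sketch: flattening a semi-structured JSON record into tabular form.
import pandas as pd

record = {
    "order_id": 1001,
    "customer": {"name": "A. Example", "country": "DE"},
    "items": [{"sku": "X1", "qty": 2}, {"sku": "Y9", "qty": 1}],
}

# One output row per order item; nested fields become dotted column names.
flat = pd.json_normalize(record, record_path="items",
                         meta=["order_id", ["customer", "country"]])
print(flat)
```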

Ensuring data quality and integrity, integrating disparate data sources, protecting data privacy and security and finding the right talent to analyze and interpret data can present challenges to organizations looking to leverage their extensive data volumes. What follows are the benefits organizations can realize once they see success with big data analytics:

Real-time intelligence

One of the standout advantages of big data analytics is the capacity to provide real-time intelligence. Organizations can analyze vast amounts of data as it is generated from myriad sources and in various formats. Real-time insight allows businesses to make quick decisions, respond to market changes instantaneously and identify and act on opportunities as they arise.

Better-informed decisions

With big data analytics, organizations can uncover previously hidden trends, patterns and correlations. A deeper understanding equips leaders and decision-makers with the information needed to strategize effectively, enhancing business decision-making in supply chain management, e-commerce, operations and overall strategic direction.  

Cost savings

Big data analytics drives cost savings by identifying business process efficiencies and optimizations. Organizations can pinpoint wasteful expenditures by analyzing large datasets, streamlining operations and enhancing productivity. Moreover, predictive analytics can forecast future trends, allowing companies to allocate resources more efficiently and avoid costly missteps.

Better customer engagement

Understanding customer needs, behaviors and sentiments is crucial for successful engagement and big data analytics provides the tools to achieve this understanding. Companies gain insights into consumer preferences and tailor their marketing strategies by analyzing customer data.

Optimized risk management strategies

Big data analytics enhances an organization's ability to manage risk by providing the tools to identify, assess and address threats in real time. Predictive analytics can foresee potential dangers before they materialize, allowing companies to devise preemptive strategies.

As organizations across industries seek to leverage data to drive decision-making, improve operational efficiencies and enhance customer experiences, the demand for skilled professionals in big data analytics has surged. Here are some prominent career paths that utilize big data analytics:

Data scientist

Data scientists analyze complex digital data to assist businesses in making decisions. Using their data science training and advanced analytics technologies, including machine learning and predictive modeling, they uncover hidden insights in data.

Data analyst

Data analysts turn data into information and information into insights. They use statistical techniques to analyze and extract meaningful trends from data sets, often to inform business strategy and decisions.

Data engineer

Data engineers prepare, process and manage big data infrastructure and tools. They also develop, maintain, test and evaluate data solutions within organizations, often working with massive datasets to assist in analytics projects.

Machine learning engineer

Machine learning engineers focus on designing and implementing machine learning applications. They develop sophisticated algorithms that learn from and make predictions on data.

Business intelligence analyst

Business intelligence (BI) analysts help businesses make data-driven decisions by analyzing data to produce actionable insights. They often use BI tools to convert data into easy-to-understand reports and visualizations for business stakeholders.

Data visualization specialist

These specialists focus on the visual representation of data. They create data visualizations that help end users understand the significance of data by placing it in a visual context.

Data architect

Data architects design, create, deploy and manage an organization's data architecture. They define how data is stored, consumed, integrated and managed by different data entities and IT systems.


  • Open access
  • Published: 11 August 2021

Data mining in clinical big data: the frequently used databases, steps, and methodological models

  • Wen-Tao Wu,
  • Yuan-Jie Li,
  • Ao-Zi Feng,
  • Tao Huang,
  • An-Ding Xu &
  • Jun Lyu (ORCID: orcid.org/0000-0002-2237-8771)

Military Medical Research volume 8, Article number: 44 (2021)


Many high quality studies have emerged from public databases, such as Surveillance, Epidemiology, and End Results (SEER), National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity, so their value is not fully utilized. Data-mining technology has become a frontier field in medical research, as it demonstrates excellent performance in evaluating patient risks and assisting clinical decision-making when building disease-prediction models. Therefore, data mining has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduced the main medical public databases and described the steps, tasks, and models of data mining in simple language. Additionally, we described data-mining methods along with their practical applications. The goal of this work was to aid clinical researchers in gaining a clear and intuitive understanding of the application of data-mining technology to clinical big data in order to promote the production of research results that are beneficial to doctors and patients.

With the rapid development of computer software/hardware and internet technology, the amount of data has increased at an amazing speed. "Big data" as an abstract concept currently affects all walks of life [1], and although its importance has been recognized, its definition varies slightly from field to field. In the field of computer science, big data refers to a dataset that cannot be perceived, acquired, managed, processed, or served within a tolerable time by using traditional IT and software and hardware tools. Generally, big data refers to a dataset that exceeds the scope of a simple database and data-processing architecture used in the early days of computing; it is characterized by high-volume, high-dimensional data that is rapidly updated, and represents a phenomenon or feature that has emerged in the digital age. Across the medical industry, various types of medical data are generated at a high speed, and trends indicate that applying big data in the medical field helps improve the quality of medical care and optimizes medical processes and management strategies [2, 3]. Currently, this trend is shifting from civilian medicine to military medicine. For example, the United States is exploring the potential use of one of its largest healthcare systems (the Military Healthcare System) to provide healthcare to eligible veterans, which could benefit > 9 million eligible personnel [4]. Another data-management system has been developed to assess the physical and mental health of active-duty personnel, and this is expected to yield significant economic benefits to the military medical system [5]. However, in medical research, the wide variety of clinical data and the differences between several medical concepts in different classification standards result in a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity in existing clinical data [6, 7]. Furthermore, new data analysis techniques have yet to be popularized in medical research [8]. These reasons hinder the full realization of the value of existing data, and the intensive exploration of the value of clinical data remains a challenging problem.

Computer scientists have made outstanding contributions to the application of big data and introduced the concept of data mining to solve difficulties associated with such applications. Data mining (also known as knowledge discovery in databases) refers to the process of extracting potentially useful information and knowledge hidden in a large amount of incomplete, noisy, fuzzy, and random practical application data [ 9 ]. Unlike traditional research methods, several data-mining technologies mine information to discover knowledge based on the premise of unclear assumptions (i.e., they are directly applied without prior research design). The obtained information should have previously unknown, valid, and practical characteristics [ 9 ]. Data-mining technology does not aim to replace traditional statistical analysis techniques, but it does seek to extend and expand statistical analysis methodologies. From a practical point of view, machine learning (ML) is the main analytical method in data mining, as it represents a method of training models by using data and then using those models for predicting outcomes. Given the rapid progress of data-mining technology and its excellent performance in other industries and fields, it has introduced new opportunities and prospects to clinical big-data research [ 10 ]. Large amounts of high quality medical data are available to researchers in the form of public databases, which enable more researchers to participate in the process of medical data mining in the hope that the generated results can further guide clinical practice.

This article provided a valuable overview to medical researchers interested in studying the application of data mining on clinical big data. To allow a clearer understanding of the application of data-mining technology on clinical big data, the second part of this paper introduced the concept of public databases and summarized those commonly used in medical research. In the third part of the paper, we offered an overview of data mining, including introducing an appropriate model, tasks, and processes, and summarized the specific methods of data mining. In the fourth and fifth parts of this paper, we introduced data-mining algorithms commonly used in clinical practice along with specific cases in order to help clinical researchers clearly and intuitively understand the application of data-mining technology on clinical big data. Finally, we discussed the advantages and disadvantages of data mining in clinical analysis and offered insight into possible future applications.

Overview of common public medical databases

A public database is a data repository used for research and dedicated to housing data related to scientific research on an open platform. Such databases collect and store heterogeneous, multi-dimensional health, medical, and scientific research data in a structured form and are characterized by mass scale, multi-ownership, complexity, and security. These databases cover a wide range of data, including those related to cancer research, disease burden, nutrition and health, and genetics and the environment. Table 1 summarizes the main public medical databases [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Researchers can apply for access to data based on the scope of the database and the application procedures required to perform relevant medical research.

Data mining: an overview

Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines [ 27 ]. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians make clinical decisions [ 28 , 29 , 30 , 31 ].

Data-mining models

Data-mining has two kinds of models: descriptive and predictive. Predictive models are used to predict unknown or future values of other variables of interest, whereas descriptive models are often used to find patterns that describe data that can be interpreted by humans [ 32 ].

Data-mining tasks

A model is usually implemented by a task, with the goal of description being to generalize patterns of potential associations in the data. Therefore, using a descriptive model usually results in a few collections with the same or similar attributes. Prediction mainly refers to estimation of the variable value of a specific attribute based on the variable values of other attributes, including classification and regression [ 33 ].

Data-mining methods

After defining the data-mining model and task, the data mining methods required to build the approach based on the discipline involved are then defined. The data-mining method depends on whether or not dependent variables (labels) are present in the analysis. Predictions with dependent variables (labels) are generated through supervised learning, which can be performed by the use of linear regression, generalized linear regression, a proportional hazards model (the Cox regression model), a competitive risk model, decision trees, the random forest (RF) algorithm, and support vector machines (SVMs). In contrast, unsupervised learning involves no labels. The learning model infers some internal data structure. Common unsupervised learning methods include principal component analysis (PCA), association analysis, and clustering analysis.

Data-mining algorithms for clinical big data

Data mining based on clinical big data can produce effective and valuable knowledge, which is essential for accurate clinical decision-making and risk assessment [ 34 ]. Data-mining algorithms enable realization of these goals.

Supervised learning

A concept often mentioned in supervised learning is the partitioning of datasets. To prevent overfitting of a model, a dataset can generally be divided into two or three parts: a training set, validation set, and test set. Ripley [35] defined these parts as a set of examples used for learning and used to fit the parameters (i.e., weights) of the classifier, a set of examples used to tune the parameters (i.e., architecture) of a classifier, and a set of examples used only to assess the performance (generalization) of a fully-specified classifier, respectively. Briefly, the training set is used to train the model or determine the model parameters, the validation set is used to perform model selection, and the test set is used to verify model performance. In practice, data are generally divided into training and test sets, whereas the verification set is less involved. It should be emphasized that the results of the test set do not guarantee model correctness but only show that similar data can obtain similar results using the model. Therefore, the applicability of a model should be analysed in combination with specific problems in the research. Classical statistical methods, such as linear regression, generalized linear regression, and a proportional risk model, have been widely used in medical research. Notably, most of these classical statistical methods have certain data requirements or assumptions; however, in the face of complicated clinical data, assumptions about data distribution are difficult to make. In contrast, some ML methods (algorithmic models) make no assumptions about the data and cross-verify the results; thus, they are likely to be favoured by clinical researchers [36]. For these reasons, this chapter focuses on ML methods that do not require assumptions about data distribution and classical statistical methods that are used in specific situations.
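A minimal sketch of such a three-way partition, assuming a feature matrix X and outcome y; the 60/20/20 ratio is an illustrative choice.

```python
# Hedged sketch: training/validation/test partition (60/20/20) with scikit-learn.
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)
# Fit on the training set, select the model on the validation set,
# and report final performance on the held-out test set.
```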

Decision tree

A decision tree is a basic classification and regression method that generates a result similar to the tree structure of a flowchart, where each tree node represents a test on an attribute, each branch represents the output of an attribute, each leaf node (decision node) represents a class or class distribution, and the topmost part of the tree is the root node [ 37 ]. The decision tree model is called a classification tree when used for classification and a regression tree when used for regression. Studies have demonstrated the utility of the decision tree model in clinical applications. In a study on the prognosis of breast cancer patients, a decision tree model and a classical logistic regression model were constructed, respectively, with the predictive performance of the different models indicating that the decision tree model showed stronger predictive power when using real clinical data [ 38 ]. Similarly, the decision tree model has been applied to other areas of clinical medicine, including diagnosis of kidney stones [ 39 ], predicting the risk of sudden cardiac arrest [ 40 ], and exploration of the risk factors of type II diabetes [ 41 ]. A common feature of these studies is the use of a decision tree model to explore the interaction between variables and classify subjects into homogeneous categories based on their observed characteristics. In fact, because the decision tree accounts for the strong interaction between variables, it is more suitable for use with decision algorithms that follow the same structure [ 42 ]. In the construction of clinical prediction models and exploration of disease risk factors and patient prognosis, the decision tree model might offer more advantages and practical application value than some classical algorithms. Although the decision tree has many advantages, it recursively separates observations into branches to construct a tree; therefore, in terms of data imbalance, the precision of decision tree models needs improvement.
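A hedged sketch of a classification tree on a public illustrative dataset (not one of the cited clinical cohorts), including the human-readable rules that make such models easy to interpret.

```python
# Hedged sketch: a shallow classification tree and its extracted decision rules.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(X.columns)))   # flowchart-like rules
```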

The RF method

The RF algorithm was developed as an application of an ensemble-learning method based on a collection of decision trees. The bootstrap method [ 43 ] is used to randomly retrieve sample sets from the training set, with decision trees generated by the bootstrap method constituting a “random forest” and predictions based on this derived from an ensemble average or majority vote. The biggest advantage of the RF method is that the random sampling of predictor variables at each decision tree node decreases the correlation among the trees in the forest, thereby improving the precision of ensemble predictions [ 44 ]. Given that a single decision tree model might encounter the problem of overfitting [ 45 ], the initial application of RF minimizes overfitting in classification and regression and improves predictive accuracy [ 44 ]. Taylor et al. [ 46 ] highlighted the potential of RF in correctly differentiating in-hospital mortality in patients experiencing sepsis after admission to the emergency department. Nowhere in the healthcare system is the need more pressing to find methods to reduce uncertainty than in the fast, chaotic environment of the emergency department. The authors demonstrated that the predictive performance of the RF method was superior to that of traditional emergency medicine methods and the methods enabled evaluation of more clinical variables than traditional modelling methods, which subsequently allowed the discovery of clinical variables not expected to be of predictive value or which otherwise would have been omitted as a rare predictor [ 46 ]. Another study based on the Medical Information Mart for Intensive Care (MIMIC) II database [ 47 ] found that RF had excellent predictive power regarding intensive care unit (ICU) mortality [ 48 ]. These studies showed that the application of RF to big data stored in the hospital healthcare system provided a new data-driven method for predictive analysis in critical care. Additionally, random survival forests have recently been developed to analyse survival data, especially right-censored survival data [ 49 , 50 ], which can help researchers conduct survival analyses in clinical oncology and help develop personalized treatment regimens that benefit patients [ 51 ].

The SVM is a relatively new classification or prediction method developed by Cortes and Vapnik and represents a data-driven approach that does not require assumptions about data distribution [ 52 ]. The core purpose of an SVM is to identify a separation boundary (called a hyperplane) to help classify cases; thus, the advantages of SVMs are obvious when classifying and predicting cases based on high dimensional data or data with a small sample size [ 53 , 54 ].

In a study of drug compliance in patients with heart failure, researchers used an SVM to build a predictive model for patient compliance in order to overcome the problem of a large number of input variables relative to the number of available observations [ 55 ]. Additionally, the mechanisms of certain chronic and complex diseases observed in clinical practice remain unclear, and many risk factors, including gene–gene interactions and gene-environment interactions, must be considered in the research of such diseases [ 55 , 56 ]. SVMs are capable of addressing these issues. Yu et al. [ 54 ] applied an SVM for predicting diabetes onset based on data from the National Health and Nutrition Examination Survey (NHANES). Furthermore, these models have strong discrimination ability, making SVMs a promising classification approach for detecting individuals with chronic and complex diseases. However, a disadvantage of SVMs is that when the number of observation samples is large, the method becomes time- and resource-intensive, which is often highly inefficient.
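A hedged sketch of an SVM classifier with feature scaling and cross-validation on the same public illustrative dataset, standing in for the clinical prediction tasks cited above.

```python
# Hedged sketch: RBF-kernel SVM with standardization, evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print("5-fold CV accuracy:", cross_val_score(svm, X, y, cv=5).mean())
```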

Competing risk model

The Kaplan–Meier estimator and the Cox proportional hazards model are widely used for survival analysis in clinical studies. Classical survival analysis usually considers only one endpoint and its impact on patient survival time. In clinical research, however, multiple endpoints often coexist, and these endpoints compete with one another, generating competing risk data [ 57 ]. When multiple endpoint events exist, analysing a single endpoint in isolation can yield biased estimates of the probability of the endpoint events because of the competing risks [ 58 ]. The competing risk model is a classical statistical model based on assumptions about the data distribution. Its main advantage is its accurate estimation of the cumulative incidence of outcomes for right-censored survival data with multiple endpoints [ 59 ]. In data analysis, the cumulative risk is estimated using the cumulative incidence function in univariate analysis, and Gray's test is used for between-group comparisons [ 60 ].

Multivariate analysis uses the Fine-Gray and cause-specific (CS) risk models to explore the cumulative risk [ 61 ]. The difference between the Fine-Gray and CS models is that the former is suited to building clinical prediction models and predicting the risk of a single endpoint of interest [ 62 ], whereas the latter is suited to aetiological questions, where the regression coefficient reflects the relative effect of covariates on the incidence of the main endpoint among subjects still free of the event [ 63 ]. In databases with cause-of-death records, such as Surveillance, Epidemiology, and End Results (SEER), competing risk models currently perform well in exploring disease risk factors and prognosis [ 64 ]. A study of prognosis in patients with oesophageal cancer from SEER showed that Cox proportional hazards models might misestimate the effects of age and disease location on patient prognosis, whereas competing risk models provided more accurate estimates of the factors affecting prognosis [ 65 ]. In another study, of the prognosis of penile cancer patients, researchers found that a competing risk model was more helpful for developing personalized treatment plans [ 66 ].
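As a small illustration of the nonparametric side of this analysis, the sketch below estimates a cumulative incidence function in the presence of a competing event with the Aalen–Johansen estimator from the lifelines library (assuming lifelines is installed); the toy durations and event codes are invented for the example, and the Fine-Gray regression itself is not covered here.

```python
import pandas as pd
from lifelines import AalenJohansenFitter

# Toy follow-up data: event 0 = censored, 1 = endpoint of interest, 2 = competing event.
df = pd.DataFrame({
    "time":  [2, 3, 5, 7, 8, 10, 11, 12, 15, 18],
    "event": [1, 2, 1, 0, 2, 1,  0,  1,  2,  0],
})

# Cumulative incidence of event 1, accounting for the competing event rather than censoring it.
ajf = AalenJohansenFitter()
ajf.fit(durations=df["time"], event_observed=df["event"], event_of_interest=1)
print(ajf.cumulative_density_)
```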

Unsupervised learning

In many data-analysis settings, the amount of labelled data available is small, and labelling data is a tedious process [ 67 ]. Unsupervised learning is then needed to group and categorize data according to similarities, characteristics, and correlations, and it has three main applications: data clustering, association analysis, and dimensionality reduction. Accordingly, the unsupervised learning methods introduced in this section are clustering analysis, association rules, and PCA.

Clustering analysis

A classification algorithm needs to "know" information about each category in advance, and all of the data to be classified must have corresponding category labels. When these conditions cannot be met, cluster analysis can be applied instead [ 68 ]. Clustering groups similar objects into categories or subsets through a process of static classification, so that objects in the same subset share similar properties. Many kinds of clustering techniques exist; here, we introduce the four most commonly used.

Partition clustering

The core idea of this clustering method is to treat the centre of a group of data points as the centre of the corresponding cluster. The k-means method [ 69 ] is a representative example of this technique. The k-means method takes n observations and an integer k and outputs a partition of the n observations into k sets such that each observation belongs to the cluster with the nearest mean [ 70 ]. The k-means method has low time complexity and high computing efficiency, but it performs poorly on high dimensional data and cannot identify nonspherical clusters.
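A minimal k-means sketch with scikit-learn on synthetic two-dimensional data; the number of clusters and the data are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups of "patients" described by two features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Partition the observations into k = 3 sets around the nearest cluster mean.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)
print(labels[:10])
```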

Hierarchical clustering

The hierarchical clustering algorithm decomposes a dataset hierarchically to facilitate subsequent clustering [ 71 ]. Common algorithms for hierarchical clustering include BIRCH [ 72 ], CURE [ 73 ], and ROCK [ 74 ]. An agglomerative algorithm starts by treating every point as its own cluster and then repeatedly merges the closest clusters; the grouping process ends when further merging would combine clusters that are no longer sufficiently similar or when only one cluster remains. This method is widely applicable and makes the relationships between clusters easy to detect, but its time complexity is high [ 75 ].
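A small sketch of agglomerative hierarchical clustering using SciPy; the linkage criterion and the number of flat clusters are illustrative choices, not recommendations from the text.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with three underlying groups.
X, _ = make_blobs(n_samples=60, centers=3, random_state=7)

# Build the merge tree (Ward linkage), then cut it into three flat clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```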

Clustering according to density

The density algorithm defines areas with a high density of data points as belonging to the same cluster [ 76 ]. This approach can find arbitrarily shaped clusters, with the most representative algorithm being DBSCAN [ 77 ]. In practice, DBSCAN does not require the number of clusters to be specified in advance and can handle clusters of various shapes; however, its time complexity is high, the quality of the clusters decreases when the data density is irregular, and it does not handle high dimensional data well [ 75 ].
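A minimal DBSCAN sketch with scikit-learn on a synthetic nonspherical dataset; eps and min_samples are placeholder values that would need tuning on real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape k-means cannot separate cleanly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=3)

# Points in dense regions are grouped together; sparse points receive the noise label -1.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)
print(set(labels))
```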

Clustering according to a grid

Neither partition nor hierarchical clustering can identify clusters with nonconvex shapes. Although density-based algorithms can accomplish this task, their time complexity is high. To address this problem, data-mining researchers proposed grid-based algorithms, which map the original data space onto a grid structure of a certain size. A representative algorithm is STING, which divides the data space into rectangular cells at different resolutions and clusters the data at these different structural levels [ 78 ]. The main advantages of this method are its high processing speed and the fact that its cost depends only on the number of cells in each dimension of the quantized space rather than on the number of data points.

In clinical studies, subjects tend to be actual patients. Although researchers adopt complex inclusion and exclusion criteria before determining the subjects to be included in the analyses, heterogeneity among different patients cannot be avoided [ 79 , 80 ]. The most common application of cluster analysis in clinical big data is in classifying heterogeneous mixed groups into homogeneous groups according to the characteristics of existing data (i.e., “subgroups” of patients or observed objects are identified) [ 81 , 82 ]. This new information can then be used in the future to develop patient-oriented medical-management strategies. Docampo et al. [ 81 ] used hierarchical clustering to reduce heterogeneity and identify subgroups of clinical fibromyalgia, which aided the evaluation and management of fibromyalgia. Additionally, Guo et al. [ 83 ] used k-means clustering to divide patients with essential hypertension into four subgroups, which revealed that the potential risk of coronary heart disease differed between different subgroups. On the other hand, density- and grid-based clustering algorithms have mostly been used to process large numbers of images generated in basic research and clinical practice, with current studies focused on developing new tools to help clinical research and practices based on these technologies [ 84 , 85 ]. Cluster analysis will continue to have extensive application prospects along with the increasing emphasis on personalized treatment.

Association rules

Association rules discover interesting associations and correlations between item sets in large amounts of data. They were first proposed by Agrawal et al. [ 86 ] and applied to analyse customer buying habits and help retailers create sales plans. Data mining based on association rules proceeds in two steps: 1) all frequent (high-frequency) itemsets in the collection are listed, and 2) association rules are generated from these frequent itemsets [ 87 ]. Therefore, before association rules can be obtained, the frequent itemsets must be computed using an appropriate algorithm. The Apriori algorithm uses the a priori principle to find all itemsets in a transaction database that satisfy a minimum support threshold and any other specified constraints [ 88 ]. Most other algorithms are variants of the Apriori algorithm [ 64 ]. Because the Apriori algorithm must rescan the entire database on each pass over the transactions, its performance deteriorates as database size increases [ 89 ], which can make it unsuitable for analysing large databases. The frequent pattern (FP) growth algorithm was proposed to improve efficiency: after a first scan, it compresses the frequent items in the database into an FP tree while retaining the associated information and then mines the conditional pattern bases separately [ 90 ]. Association-rule technology is often used in medical research to identify association rules between disease risk factors (i.e., to explore the joint effects of a disease risk factor in combination with other risk factors). For example, Li et al. [ 91 ] used an association-rule algorithm to identify atrial fibrillation as the most important stroke risk factor, followed by diabetes and a family history of stroke. On the same principle, association rules can also be used to evaluate treatment effects and other aspects of care. For example, Guo et al. [ 92 ] used the FP-growth algorithm to generate association rules and evaluate the individual characteristics and treatment effects of patients with diabetes, thereby helping to reduce the readmission rate of these patients. Association rules reveal a connection between premises and conclusions; however, reasonable and reliable application of this information can only be achieved through validation by experienced medical professionals and through extensive causal research [ 92 ].
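A small sketch of the two-step process using the mlxtend library (assuming it is available); the one-hot transaction table and the support and confidence thresholds are invented for illustration.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot transaction table: each row is a "patient", each column a risk factor.
data = pd.DataFrame({
    "hypertension":        [1, 1, 0, 1, 1, 0, 1, 1],
    "diabetes":            [1, 0, 0, 1, 1, 0, 1, 0],
    "atrial_fibrillation": [1, 1, 0, 1, 0, 0, 1, 1],
    "stroke":              [1, 1, 0, 1, 0, 0, 1, 0],
}).astype(bool)

# Step 1: frequent itemsets above a minimum support threshold.
itemsets = apriori(data, min_support=0.4, use_colnames=True)

# Step 2: rules derived from those itemsets, filtered by confidence.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```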

Principal component analysis

PCA is a widely used data-mining method that aims to reduce data dimensionality in an interpretable way while retaining most of the information present in the data [ 93 , 94 ]. The main purpose of PCA is descriptive: it requires no assumptions about the data distribution and is therefore an adaptive, exploratory method. The main steps of PCA are standardization of the original data, calculation of the correlation coefficient matrix, calculation of eigenvalues and eigenvectors, selection of the principal components, and calculation of a comprehensive evaluation value. PCA rarely appears as a stand-alone method and is usually combined with other statistical methods [ 95 ]. In practical clinical studies, multicollinearity often biases multivariate analyses. A feasible solution is to build a regression model on the principal components, using each principal component as a new independent variable in place of the original variables; this is most commonly seen in the analysis of dietary patterns in nutritional epidemiology [ 96 ]. In a study of socioeconomic status and child developmental delays, PCA was used to derive a new variable (the household wealth index) from a series of household property reports, and this variable was incorporated into the logistic regression model as the main analytical variable [ 97 ]. PCA can also be combined with cluster analysis. Burgel et al. [ 98 ] used PCA to transform clinical data and address the lack of independence between existing variables when exploring the heterogeneity of chronic obstructive pulmonary disease subtypes. In studies of disease subtypes and heterogeneity, PCA can therefore eliminate noisy variables that might otherwise corrupt the cluster structure, increasing the accuracy of the clustering results [ 98 , 99 ].
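A minimal PCA sketch with scikit-learn, assuming standardized continuous variables; the synthetic data and the choice of two components are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with correlated features, standing in for e.g. dietary intake variables.
X, _ = make_classification(n_samples=500, n_features=12, n_informative=4, random_state=5)

# Standardize, then project onto the first two principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print(scores[:5])
```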

The data-mining process and examples of its application using common public databases

Open-access databases offer large volumes of data, wide coverage, and rich information, and they provide a cost-efficient way to conduct research, making them valuable to medical researchers. In this section, we illustrate the data-mining process and methods through examples that apply data-mining algorithms to public databases.

The data-mining process

Figure 1 outlines the overall research process. Data mining is divided into several steps: (1) database selection according to the research purpose; (2) data extraction and integration, including downloading the required data and combining data from multiple sources; (3) data cleaning and transformation, including removing incorrect data, filling in missing data, generating new variables, converting data formats, and ensuring data consistency; (4) data mining, i.e., extracting implicit relational patterns through traditional statistics or ML; (5) pattern evaluation, which focuses on the validity parameters and values of the extracted relationship patterns; and (6) assessment of the results, in which the extracted data-relationship model is translated into comprehensible knowledge made available to the public.

Figure 1. The steps of data mining in medical public databases

Examples of data-mining applied using public databases

Establishment of warning models for the early prediction of disease

A previous study identified sepsis as a major cause of death in ICU patients [ 100 ]. The authors noted that previously developed predictive models used a limited number of variables and that model performance required improvement. The data-mining process applied to address these issues was as follows: (1) data selection using the MIMIC III database; (2) extraction and integration of three types of data, namely multivariate features (demographic information and clinical biochemical indicators), time series data (temperature, blood pressure, and heart rate), and clinical latent features (various disease-related scores); (3) data cleaning and transformation, including fixing irregular time series measurements, imputing missing values, deleting outliers, and addressing data imbalance; (4) data mining using logistic regression, a decision tree, the RF algorithm, an SVM, and an ensemble algorithm (a combination of multiple classifiers) to establish the prediction model; (5) pattern evaluation using sensitivity, precision, and the area under the receiver operating characteristic curve to evaluate model performance; and (6) assessment of the results, in this case whether the model could predict the prognosis of patients with sepsis and whether it outperformed current scoring systems.
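A compressed sketch of step (4), combining the individual classifiers into a simple voting ensemble with scikit-learn; the synthetic data stands in for the engineered MIMIC features, and the model settings are illustrative rather than those of the cited study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the engineered sepsis features.
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Soft-voting ensemble over the four base learners mentioned in the text.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=300)),
        ("svm", SVC(probability=True)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1]))
```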

Exploring prognostic risk factors in cancer patients

Wu et al. [ 101 ] noted that traditional survival-analysis methods often ignore the influence of competing risk events, such as death from suicide or car accidents, on outcomes, leading to biased estimates and misjudgements of the effects of risk factors. They used the SEER database, which records cause of death for cancer patients, and a competing risk model to address this problem according to the following process: (1) data were obtained from the SEER database; (2) the demographics, clinical characteristics, treatment modality, and cause of death of cecum cancer patients were extracted; (3) patients with missing demographic, clinical, therapeutic, or cause-of-death variables were excluded; (4) Cox regression and two kinds of competing risk models were applied for survival analysis; (5) the results of the three models were compared; and (6) the results revealed that, for survival data with multiple endpoints, the competing risk model was more appropriate.

Derivation of dietary patterns

A study by Martínez Steele et al. [ 102 ] applied PCA in a nutritional epidemiological analysis to determine dietary patterns and evaluate the overall nutritional quality of the population based on those patterns. Their process was as follows: (1) data were extracted from the NHANES database covering the years 2009–2010; (2) demographic characteristics and two 24-h dietary recall interviews were obtained; (3) the data were weighted, and subjects not meeting specific criteria were excluded; (4) PCA was used to determine dietary patterns in the United States population, and Gaussian regression and restricted cubic splines were used to assess associations between ultra-processed foods and nutritional balance; (5) eigenvalues, scree plots, and the interpretability of the principal components were reviewed to screen and evaluate the results; and (6) the results revealed a negative association between ultra-processed food intake and overall dietary quality, indicating that a nutritionally balanced eating pattern is characterized by high fibre, potassium, magnesium, and vitamin C intake and low sugar and saturated fat consumption.

The use of "big data" has changed multiple aspects of modern life, and combining it with data-mining methods can improve the status quo [ 86 ]. The aim of this study was to help clinical researchers understand how data-mining technology can be applied to clinical big data and public medical databases to further their research goals and benefit clinicians and patients. The examples provided offer insight into the data-mining process as applied to clinical research. Notably, researchers have raised concerns that big data and data-mining methods may not adequately replicate actual clinical conditions and that their results could mislead doctors and patients [ 86 ]. Therefore, given the rate at which new technologies and trends progress, it is necessary to remain positive about their potential impact while staying cautious in examining the results they provide.

In the future, the healthcare system will need to utilize ever larger volumes of big data with higher dimensionality. The tasks and objectives of data analysis will also become more demanding, requiring a higher degree of visualization, more accurate results, and stronger real-time performance, so the methods used to mine and process big data will continue to improve. Furthermore, to increase the formality and standardization of data-mining methods, a new programming language dedicated to this purpose may need to be developed, along with novel methods for handling unstructured data such as graphics, audio, and handwritten text. In terms of application, developing data-management and disease-screening systems for large-scale populations, such as the military, will help identify the best interventions and formulate auxiliary standards, benefitting both cost-efficiency and personnel. Data-mining technology can also be applied to hospital management to improve patient satisfaction, detect medical-insurance fraud and abuse, and reduce costs and losses while improving management efficiency. The technology is already being applied to predict patient disease, and further improvements will increase the accuracy and speed of these predictions. Moreover, it is worth noting that technological development will concomitantly require higher quality data, which is a prerequisite for accurate application of the technology.

Finally, the ultimate goal of this study was to explain the methods associated with data mining and commonly used to process clinical big data. This review will potentially promote further study and aid doctors and patients.

Abbreviations

BioLINCC: Biologic Specimen and Data Repositories Information Coordinating Center
CHARLS: China Health and Retirement Longitudinal Study
CHNS: China Health and Nutrition Survey
CKB: China Kadoorie Biobank
CS: Cause-specific risk
CTD: Comparative Toxicogenomics Database
eICU-CRD: eICU Collaborative Research Database
FP: Frequent pattern
GBD: Global burden of disease
GEO: Gene expression omnibus
HRS: Health and Retirement Study
ICGC: International Cancer Genome Consortium
MIMIC: Medical Information Mart for Intensive Care
ML: Machine learning
NHANES: National Health and Nutrition Examination Survey
PCA: Principal component analysis
PIC: Paediatric intensive care
RF: Random forest
SEER: Surveillance, epidemiology, and end results
SVM: Support vector machine
TCGA: The Cancer Genome Atlas

Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):1–35.

Wang F, Zhang P, Wang X, Hu J. Clinical risk prediction by exploring high-order feature correlations. AMIA Annu Symp Proc. 2014;2014:1170–9.

Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinform. 2014;15:105. https://doi.org/10.1186/1471-2105-15-105 .

Ramachandran S, Erraguntla M, Mayer R, Benjamin P, Editors. Data mining in military health systems-clinical and administrative applications. In: 2007 IEEE international conference on automation science and engineering; 2007. https://doi.org/10.1109/COASE.2007.4341764 .

Vie LL, Scheier LM, Lester PB, Ho TE, Labarthe DR, Seligman MEP. The US army person-event data environment: a military-civilian big data enterprise. Big Data. 2015;3(2):67–79. https://doi.org/10.1089/big.2014.0055 .

Mohan A, Blough DM, Kurc T, Post A, Saltz J. Detection of conflicts and inconsistencies in taxonomy-based authorization policies. IEEE Int Conf Bioinform Biomed. 2012;2011:590–4. https://doi.org/10.1109/BIBM.2011.79 .

Luo J, Wu M, Gopukumar D, Zhao Y. Big data application in biomedical research and health care: a literature review. Biomed Inform Insights. 2016;8:1–10. https://doi.org/10.4137/BII.S31559 .

Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.

Sahu H, Shrma S, Gondhalakar S. A brief overview on data mining survey. Int J Comput Technol Electron Eng. 2011;1(3):114–21.

Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.

Doll KM, Rademaker A, Sosa JA. Practical guide to surgical data sets: surveillance, epidemiology, and end results (SEER) database. JAMA Surg. 2018;153(6):588–9.

Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. https://doi.org/10.1038/sdata.2016.35 .

Ahluwalia N, Dwyer J, Terry A, Moshfegh A, Johnson C. Update on NHANES dietary data: focus on collection, release, analytical considerations, and uses to inform public policy. Adv Nutr. 2016;7(1):121–34.

Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1204–22. https://doi.org/10.1016/S0140-6736(20)30925-9 .

Palmer LJ. UK Biobank: Bank on it. Lancet. 2007;369(9578):1980–2. https://doi.org/10.1016/S0140-6736(07)60924-6 .

Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20. https://doi.org/10.1038/ng.2764 .

Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23(14):1846–7.

Zhang J, Bajari R, Andric D, Gerthoffert F, Lepsa A, Nahal-Bose H, et al. The international cancer genome consortium data portal. Nat Biotechnol. 2019;37(4):367–9.

Chen Z, Chen J, Collins R, Guo Y, Peto R, Wu F, et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int J Epidemiol. 2011;40(6):1652–66.

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, et al. The comparative toxicogenomics database: update 2019. Nucleic Acids Res. 2019;47(D1):D948–54. https://doi.org/10.1093/nar/gky868 .

Zeng X, Yu G, Lu Y, Tan L, Wu X, Shi S, et al. PIC, a paediatric-specific intensive care database. Sci Data. 2020;7(1):14.

Giffen CA, Carroll LE, Adams JT, Brennan SP, Coady SA, Wagner EL. Providing contemporary access to historical biospecimen collections: development of the NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). Biopreserv Biobank. 2015;13(4):271–9.

Zhang B, Zhai FY, Du SF, Popkin BM. The China Health and Nutrition Survey, 1989–2011. Obes Rev. 2014;15(Suppl 1):2–7. https://doi.org/10.1111/obr.12119 .

Zhao Y, Hu Y, Smith JP, Strauss J, Yang G. Cohort profile: the China Health and Retirement Longitudinal Study (CHARLS). Int J Epidemiol. 2014;43(1):61–8.

Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU collaborative research database, a freely available multi-centre database for critical care research. Sci Data. 2018;5:180178. https://doi.org/10.1038/sdata.2018.178 .

Fisher GG, Ryan LH. Overview of the health and retirement study and introduction to the special issue. Work Aging Retire. 2018;4(1):1–9.

Iavindrasana J, Cohen G, Depeursinge A, Müller H, Meyer R, Geissbuhler A. Clinical data mining: a review. Yearb Med Inform. 2009:121–33.

Zhang Y, Guo SL, Han LN, Li TL. Application and exploration of big data mining in clinical medicine. Chin Med J. 2016;129(6):731–8. https://doi.org/10.4103/0366-6999.178019 .

Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019;20(5):e262–73.

Huang C, Murugiah K, Mahajan S, Li S-X, Dhruva SS, Haimovich JS, et al. Enhancing the prediction of acute kidney injury risk after percutaneous coronary intervention using machine learning techniques: a retrospective cohort study. PLoS Med. 2018;15(11):e1002703.

Rahimian F, Salimi-Khorshidi G, Payberah AH, Tran J, Ayala Solares R, Raimondi F, et al. Predicting the risk of emergency admission with machine learning: development and validation using linked electronic health records. PLoS Med. 2018;15(11):e1002695.

Kantardzic M. Data Mining: concepts, models, methods, and algorithms. Technometrics. 2003;45(3):277.

Jothi N, Husain W. Data mining in healthcare—a review. Procedia Comput Sci. 2015;72:306–13.

Piatetsky-Shapiro G, Tamayo P. Microarray data mining: facing the challenges. SIGKDD. 2003;5(2):1–5. https://doi.org/10.1145/980972.980974 .

Ripley BD. Pattern recognition and neural networks. Cambridge: Cambridge University Press; 1996.

Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4:40–79. https://doi.org/10.1214/09-SS054 .

Shouval R, Bondi O, Mishan H, Shimoni A, Unger R, Nagler A. Application of machine learning algorithms for clinical predictive modelling: a data-mining approach in SCT. Bone Marrow Transp. 2014;49(3):332–7.

Momenyan S, Baghestani AR, Momenyan N, Naseri P, Akbari ME. Survival prediction of patients with breast cancer: comparisons of decision tree and logistic regression analysis. Int J Cancer Manag. 2018;11(7):e9176.

Topaloğlu M, Malkoç G. Decision tree application for renal calculi diagnosis. Int J Appl Math Electron Comput. 2016. https://doi.org/10.18100/ijamec.281134.

Li H, Wu TT, Yang DL, Guo YS, Liu PC, Chen Y, et al. Decision tree model for predicting in-hospital cardiac arrest among patients admitted with acute coronary syndrome. Clin Cardiol. 2019;42(11):1087–93.

Ramezankhani A, Hadavandi E, Pournik O, Shahrabi J, Azizi F, Hadaegh F. Decision tree-based modelling for identification of potential interactions between type 2 diabetes risk factors: a decade follow-up in a Middle East prospective cohort study. BMJ Open. 2016;6(12):e013336.

Carmona-Bayonas A, Jiménez-Fonseca P, Font C, Fenoy F, Otero R, Beato C, et al. Predicting serious complications in patients with cancer and pulmonary embolism using decision tree modelling: the EPIPHANY Index. Br J Cancer. 2017;116(8):994–1001.

Efron B. Bootstrap methods: another look at the jackknife. In: Kotz S, Johnson NL, editors. Breakthroughs in statistics. New York: Springer; 1992. p. 569–93.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. https://doi.org/10.1023/A:1010933404324 .

Franklin J. The elements of statistical learning: data mining, inference and prediction. Math Intell. 2005;27(2):83–5.

Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of in-hospital mortality in emergency department patients with sepsis: a local big data-driven, machine learning approach. Acad Emerg Med. 2016;23(3):269–78.

Lee J, Scott DJ, Villarroel M, Clifford GD, Saeed M, Mark RG. Open-access MIMIC-II database for intensive care research. Annu Int Conf IEEE Eng Med Biol Soc. 2011:8315–8. https://doi.org/10.1109/IEMBS.2011.6092050 .

Lee J. Patient-specific predictive modelling using random forests: an observational study for the critically Ill. JMIR Med Inform. 2017;5(1):e3.

Wongvibulsin S, Wu KC, Zeger SL. Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis. BMC Med Res Methodol. 2019;20(1):1.

Taylor JMG. Random survival forests. J Thorac Oncol. 2011;6(12):1974–5.

Hu C, Steingrimsson JA. Personalized risk prediction in clinical oncology research: applications and practical issues using survival trees and random forests. J Biopharm Stat. 2018;28(2):333–49.

Dietrich R, Opper M, Sompolinsky H. Statistical mechanics of support vector networks. Phys Rev Lett. 1999;82(14):2975.

Verplancke T, Van Looy S, Benoit D, Vansteelandt S, Depuydt P, De Turck F, et al. Support vector machine versus logistic regression modelling for prediction of hospital mortality in critically ill patients with haematological malignancies. BMC Med Inform Decis Mak. 2008;8:56. https://doi.org/10.1186/1472-6947-8-56 .

Yu W, Liu T, Valdez R, Gwinn M, Khoury MJ. Application of support vector machine modelling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Med Inform Decis Mak. 2010;10:16. https://doi.org/10.1186/1472-6947-10-16 .

Son YJ, Kim HG, Kim EH, Choi S, Lee SK. Application of support vector machine for prediction of medication adherence in heart failure patients. Healthc Inform Res. 2010;16(4):253–9.

Schadt EE, Friend SH, Shaywitz DA. A network view of disease and compound screening. Nat Rev Drug Discov. 2009;8(4):286–95.

Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601–9.

Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Stat Med. 2007;26(11):2389–430. https://doi.org/10.1002/sim.2712 .

Klein JP. Competing risks. WIREs Comp Stat. 2010;2(3):333–9. https://doi.org/10.1002/wics.83 .

Haller B, Schmidt G, Ulm K. Applying competing risks regression models: an overview. Lifetime Data Anal. 2013;19(1):33–58. https://doi.org/10.1007/s10985-012-9230-8 .

Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94(446):496–509.

Koller MT, Raatz H, Steyerberg EW, Wolbers M. Competing risks and the clinical community: irrelevance or ignorance? Stat Med. 2012;31(11–12):1089–97.

Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol. 2009;170(2):244–56.

Yang J, Li Y, Liu Q, Li L, Feng A, Wang T, et al. Brief introduction of medical database and data mining technology in big data era. J Evid Based Med. 2020;13(1):57–69.

Yu Z, Yang J, Gao L, Huang Q, Zi H, Li X. A competing risk analysis study of prognosis in patients with esophageal carcinoma 2006–2015 using data from the surveillance, epidemiology, and end results (SEER) database. Med Sci Monit. 2020;26:e918686.

Yang J, Pan Z, He Y, Zhao F, Feng X, Liu Q, et al. Competing-risks model for predicting the prognosis of penile cancer based on the SEER database. Cancer Med. 2019;8(18):7881–9.

Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform. 2018;19(6):1236–46.

Alashwal H, El Halaby M, Crouse JJ, Abdalla A, Moustafa AA. The application of unsupervised clustering methods to Alzheimer’s disease. Front Comput Neurosci. 2019;13:31.

Macqueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA: University of California Press;1967.

Forgy EW. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics. 1965;21:768–9.

Johnson SC. Hierarchical clustering schemes. Psychometrika. 1967;32(3):241–54.

Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Rec. 1996;25(2):103–14.

Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Rec. 1998;27(2):73–84.

Guha S, Rastogi R, Shim K. ROCK: a robust clustering algorithm for categorical attributes. Inf Syst. 2000;25(5):345–66.

Xu D, Tian Y. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Kriegel HP, Kröger P, Sander J, Zimek A. Density-based clustering. WIRES Data Min Knowl. 2011;1(3):231–40. https://doi.org/10.1002/widm.30 .

Ester M, Kriegel HP, Sander J, Xu X, editors. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of 2nd international conference on knowledge discovery and data mining Portland, Oregon: AAAI Press; 1996. p. 226–31.

Wang W, Yang J, Muntz RR. STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd international conference on very large data bases, Morgan Kaufmann Publishers Inc.; 1997. p. 186–95.

Iwashyna TJ, Burke JF, Sussman JB, Prescott HC, Hayward RA, Angus DC. Implications of heterogeneity of treatment effect for reporting and analysis of randomized trials in critical care. Am J Respir Crit Care Med. 2015;192(9):1045–51.

Ruan S, Lin H, Huang C, Kuo P, Wu H, Yu C. Exploring the heterogeneity of effects of corticosteroids on acute respiratory distress syndrome: a systematic review and meta-analysis. Crit Care. 2014;18(2):R63.

Docampo E, Collado A, Escaramís G, Carbonell J, Rivera J, Vidal J, et al. Cluster analysis of clinical data identifies fibromyalgia subgroups. PLoS ONE. 2013;8(9):e74873.

Sutherland ER, Goleva E, King TS, Lehman E, Stevens AD, Jackson LP, et al. Cluster analysis of obesity and asthma phenotypes. PLoS ONE. 2012;7(5):e36631.

Guo Q, Lu X, Gao Y, Zhang J, Yan B, Su D, et al. Cluster analysis: a new approach for identification of underlying risk factors for coronary artery disease in essential hypertensive patients. Sci Rep. 2017;7:43965.

Hastings S, Oster S, Langella S, Kurc TM, Pan T, Catalyurek UV, et al. A grid-based image archival and analysis system. J Am Med Inform Assoc. 2005;12(3):286–95.

Celebi ME, Aslandogan YA, Bergstresser PR. Mining biomedical images with density-based clustering. In: International conference on information technology: coding and computing (ITCC’05), vol II. Washington, DC, USA: IEEE; 2005. https://doi.org/10.1109/ITCC.2005.196 .

Agrawal R, Imieliński T, Swami A, editors. Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD conference on management of data. Washington, DC, USA: Association for Computing Machinery; 1993. p. 207–16. https://doi.org/10.1145/170035.170072 .

Sethi A, Mahajan P. Association rule mining: A review. TIJCSA. 2012;1(9):72–83.

Kotsiantis S, Kanellopoulos D. Association rules mining: a recent overview. GESTS Int Trans Comput Sci Eng. 2006;32(1):71–82.

Narvekar M, Syed SF. An optimized algorithm for association rule mining using FP tree. Procedia Computer Sci. 2015;45:101–10.

Verhein F. Frequent pattern growth (FP-growth) algorithm. Sydney: The University of Sydney; 2008. p. 1–16.

Li Q, Zhang Y, Kang H, Xin Y, Shi C. Mining association rules between stroke risk factors based on the Apriori algorithm. Technol Health Care. 2017;25(S1):197–205.

Guo A, Zhang W, Xu S. Exploring the treatment effect in diabetes patients using association rule mining. Int J Inf Pro Manage. 2016;7(3):1–9.

Pearson K. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417.

Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374(2065):20150202.

Zhang Z, Castelló A. Principal components analysis in clinical studies. Ann Transl Med. 2017;5(17):351.

Apio BRS, Mawa R, Lawoko S, Sharma KN. Socio-economic inequality in stunting among children aged 6–59 months in a Ugandan population based cross-sectional study. Am J Pediatri. 2019;5(3):125–32.

Burgel PR, Paillasseur JL, Caillaud D, Tillie-Leblond I, Chanez P, Escamilla R, et al. Clinical COPD phenotypes: a novel approach using principal component and cluster analyses. Eur Respir J. 2010;36(3):531–9.

Vogt W, Nagel D. Cluster analysis in diagnosis. Clin Chem. 1992;38(2):182–98.

Layeghian Javan S, Sepehri MM, Layeghian Javan M, Khatibi T. An intelligent warning model for early prediction of cardiac arrest in sepsis patients. Comput Methods Programs Biomed. 2019;178:47–58. https://doi.org/10.1016/j.cmpb.2019.06.010 .

Wu W, Yang J, Li D, Huang Q, Zhao F, Feng X, et al. Competitive risk analysis of prognosis in patients with cecum cancer: a population-based study. Cancer Control. 2021;28:1073274821989316. https://doi.org/10.1177/1073274821989316 .

Martínez Steele E, Popkin BM, Swinburn B, Monteiro CA. The share of ultra-processed foods and the overall nutritional quality of diets in the US: evidence from a nationally representative cross-sectional study. Popul Health Metr. 2017;15(1):6.

This study was supported by the National Social Science Foundation of China (No. 16BGL183).

Author information

Wen-Tao Wu and Yuan-Jie Li have contributed equally to this work

Authors and Affiliations

Department of Clinical Research, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China

Wen-Tao Wu, Ao-Zi Feng, Li Li, Tao Huang & Jun Lyu

School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Yuan-Jie Li

Department of Neurology, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China

Contributions

WTW, YJL and JL designed the review. JL, AZF, TH, LL and ADX reviewed and criticized the original paper. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to An-Ding Xu or Jun Lyu .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Wu, WT., Li, YJ., Feng, AZ. et al. Data mining in clinical big data: the frequently used databases, steps, and methodological models. Military Med Res 8 , 44 (2021). https://doi.org/10.1186/s40779-021-00338-z

Received : 24 January 2020

Accepted : 03 August 2021

Published : 11 August 2021

DOI : https://doi.org/10.1186/s40779-021-00338-z

  • Clinical big data
  • Data mining
  • Medical public database

BusinessTechWeekly.com

Big Data Use Case: How Amazon uses Big Data to drive eCommerce revenue

Amazon is no stranger to big data. In this big data use case, we’ll look at how Amazon is leveraging data analytic technologies to improve products and services and drive overall revenue.

Big data has changed how we interact with the world and continues to strengthen its hold on businesses worldwide. New data sets can be mined, managed, and analyzed using a combination of technologies.

These applications pair the fallacy-prone human brain with the consistency of computers. If you can think of applications for machine learning to predict things, optimize systems and processes, or automatically sequence tasks, then big data is relevant to you.

Amazon’s algorithm is another secret to its success. The online shop has not only made it possible to order products with just one mouse click, but it also uses personalization data combined with big data to achieve excellent conversion rates.

On this page:

  • Amazon and Big Data
  • Amazon's Big Data strategy
  • Amazon's collection of data and its use
  • Big Data use case: the key points

The fascinating world of Big Data can help you gain an edge over your competitors. The data collected by networks of sensors, smart meters, and other means can provide insights into customer spending behavior and help retailers better target their services and products.

RELATED: Big Data Basics: Understanding Big Data

Machine Learning (a type of artificial intelligence) processes data through a learning algorithm to spot trends and patterns while continually refining the algorithms.

Amazon is one of the world's largest businesses, with an estimated 310 million-plus active customers worldwide and recent transactions reaching a value of $90 billion, which shows the popularity of online shopping across continents. It provides services such as payments and shipping and continually introduces new ideas for its customers.

Amazon is a giant with its own cloud: Amazon Web Services (AWS) offers cloud computing platforms to individuals, companies, and governments. Amazon entered cloud computing when Amazon Web Services launched in 2003.

Amazon Web Services has expanded its business lines since then. Amazon has hired brilliant minds in analytics and predictive modeling to further mine the massive volume of data it has accumulated, and it innovates by introducing new products and strategies based on customer experience and feedback.

Big Data has assisted Amazon in ascending to the top of the e-commerce heap.

Amazon uses an anticipatory delivery model that predicts the products most likely to be purchased by its customers based on vast amounts of data.

In practice, Amazon assesses your purchase patterns and ships products you are likely to buy in the future to the warehouse closest to you.

Amazon stores and processes as much customer and product information as possible – collecting specific information on every customer who visits its website. It also monitors the products a customer views, their shipping address, and whether or not they post reviews.

Amazon optimizes the prices on its websites by considering other factors, such as user activity, order history, rival prices, product availability, etc., providing discounts on popular items and earning a profit on less popular things using this strategy. This is how Amazon utilizes big data in its business operations.

Data science has established a preeminent place in industry and contributed to its growth and improvement.

RELATED: How Artificial Intelligence Is Used for Data Analytics

Ever wonder how Amazon knows what you want before you even order it? The answer is mathematics, but you know that.

You may not know that the company has been running a data-gathering program for almost 15 years now that reaches back to the site’s earliest days.

In the quest to make every single interaction between buyers and sellers as efficient as possible, getting down to the most minute levels of detail has been essential, with data collection coming from a variety of sources – from sellers themselves and customers with apps on their phones – giving Amazon insights into every step along the way.

Voice recording by Alexa

Alexa is a speech interaction service developed by Amazon.com. It uses a cloud-based service to create voice-controlled smart devices. Through voice commands, Alexa can respond to queries, play music, read the news, and manage smart home devices such as lights and appliances.

Users may subscribe to an Alexa Voice Service (AVS) or use AWS Lambda to embed the system into other hardware and software.

You can spend all day with your microphone, smartphone, or barcode scanner recording every interaction, receipt, and voice note. But you don’t have to with tools like Amazon Echo.

With its always-on Alexa Voice Service, say what you need to add to your shopping list when you need it. It’s fast and straightforward.

Single click order

There is intense competition between companies using big data. Through big data, Amazon realized that customers might switch to alternative vendors if they experience a delay in their orders, so Amazon created single-click ordering.

With this method, you only need to set your address and payment method once. Each customer is then given a 30-minute window to change or cancel the order; after that, it is finalized automatically.

Persuade Customers

Persuasive technology is a new area at Amazon. It’s an intersection of AI, UX, and the business goal of getting customers to take action at any point in the shopping journey.

One of the most significant ways Amazon utilizes data is through its recommendation engine. When a client searches for a specific item, Amazon can better anticipate other items the buyer may be interested in.

Consequently, Amazon can expedite the process of convincing a buyer to purchase the product. It is estimated that its personalized recommendation system accounts for 35 percent of the company’s annual sales.
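To make the idea concrete, here is a heavily simplified, hypothetical sketch of item-based collaborative filtering, the family of techniques commonly associated with this kind of recommendation engine; it is not Amazon's actual system, and the ratings matrix is invented.

```python
import numpy as np

# Invented user-item ratings matrix: rows are users, columns are products (0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Score items for user 0 by similarity-weighted sums of their existing ratings.
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf  # do not re-recommend items already rated
print("Recommend item:", int(np.argmax(scores)))
```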

The Amazon Assistant helps you discover new and exciting products, browse best sellers, and shop by department—there's no place on the web with a better selection. Plus, it automatically notifies you when prices drop or items you've been watching are marked down, so customers get the best deal possible.

Price dropping

Amazon constantly changes the price of its products by using Big data trends. On many competitor sites, the product’s price remains the same.

But Amazon has created another way to attract customers by constantly changing the price of the products. Amazon continually updates prices to deliver you the best deals.

Customers now check the site regularly, because the price of a product they want may drop at any time, letting them buy it easily when it does.

Shipping optimization

Shipping optimization by Amazon allows you to choose your preferred carrier, service options, and expected delivery time for millions of items on Amazon.com. With Shipping optimization by Amazon, you can end surprises like unexpected carrier selection, unnecessary service fees, or delays that can happen with even standard shipping.

Today, Amazon offers customers the choice to pick up their packages at over 400 U.S. locations. Whether you need one-day delivery or same-day pickup in select metro areas, Prime members can choose how fast they want to get their goods in an easy-to-use mobile app.

RELATED: Amazon Supply Chain: Understanding how Amazon’s supply chain works

Using shipping partners makes this selection possible, allowing Amazon to offer the most comprehensive selection in the industry and provide customers with multiple options for picking up their orders.

To better serve the customer, Amazon has adopted a technology that allows them to receive information from shoppers’ web browsing habits and use it to improve existing products and introduce new ones.

Amazon is only one example of a corporation that uses big data. Airbnb is another industry leader that employs big data in its operations; you can also review their case study. Below are four ways big data plays a significant role in every organization.

1. Helps you understand the market condition: Big Data assists you in comprehending market circumstances, trends, and wants, as well as your competitors, through data analysis.

It helps you to research customer interests and behaviors so that you may adjust your products and services to their requirements.

2. It helps you increase customer satisfaction: Using big data analytics, you may determine the demographics of your target audience, the products and services they want, and much more.

This information enables you to design business plans and strategies with the needs and demands of customers in mind. Customer satisfaction will grow immediately if your business strategy is based on consumer requirements.

3. Increase sales: Once you thoroughly understand the market environment and client needs, you can develop products, services, and marketing tactics accordingly. This helps you dramatically enhance your sales.

4. Optimize costs: By analyzing the data acquired from client databases, services, and internet resources, you may determine what prices benefit customers, how cost increases or decreases will impact your business, etc.

You can determine the optimal price for your items and services, which will benefit your customers and your company.

Businesses need to adapt to the ever-changing needs of their customers. Within this dynamic online marketplace, competitive advantage is often gained by those players who can adapt to market changes faster than others. Big data analytics provides that advantage.

RELATED: Top 5 Big Data Privacy Issues Businesses Must Consider

However, the sheer volume of data generated at all levels, from individual consumer click streams to the aggregate opinions of millions of individuals, poses a considerable barrier to companies that would like to customize their offerings or interact efficiently with customers.

James joined BusinessTechWeekly.com in 2018, following a 19-year career in IT in which he covered a wide range of support, management, and consultancy roles across a variety of industry sectors. He has a broad technical knowledge base backed by an impressive list of technical certifications, with a focus on applications, cloud, and infrastructure.

Twitter Data Mining: A Guide to Big Data Analytics Using Python

Twitter is a goldmine of data. Unlike other social platforms, almost every user’s tweets are completely public and pullable.

In this tutorial, Toptal Freelance Software Engineer Anthony Sistilli will be exploring how you can use Python, the Twitter API, and data mining techniques to gather useful data.

By Anthony Sistilli

With four years of experience, Anthony specializes in machine learning and artificial intelligence as an engineer and a researcher.

Big data is everywhere. Period. In the process of running a successful business in today’s day and age, you’re likely going to run into it whether you like it or not.

Whether you’re a businessman trying to catch up to the times or a coding prodigy looking for their next project, this tutorial will give you a brief overview of what big data is. You will learn how it’s applicable to you, and how you can get started quickly through the Twitter API and Python.

What Is Big Data?

Big data is exactly what it sounds like—a lot of data. Alone, a single data point can't give you much insight. But terabytes of data, combined with complex mathematical models and abundant computing power, can create insights human beings aren't capable of producing on their own. The value that big data analytics provides to a business is hard to overstate and surpasses human capabilities more and more each day.

The first step to big data analytics is gathering the data itself. This is known as “data mining.” Data can come from anywhere. Most businesses deal with gigabytes of user, product, and location data. In this tutorial, we’ll be exploring how we can use data mining techniques to gather Twitter data, which can be more useful than you might think.

For example, let’s say you run Facebook, and want to use Messenger data to provide insights on how you can advertise to your audience better. Messenger has 1.2 billion monthly active users . In this case, the big data are conversations between users. If you were to individually read the conversations of each user, you would be able to get a good sense of what they like, and be able to recommend products to them accordingly. Using a machine learning technique known as Natural Language Processing (NLP), you can do this on a large scale with the entire process automated and left up to machines.

This is just one of the countless examples of how machine learning and big data analytics can add value to your company.

Why Twitter data?

Twitter is a gold mine of data. Unlike other social platforms, almost every user’s tweets are completely public and pullable. This is a huge plus if you’re trying to get a large amount of data to run analytics on. Twitter data is also pretty specific. Twitter’s API allows you to do complex queries like pulling every tweet about a certain topic within the last twenty minutes, or pull a certain user’s non-retweeted tweets.

A simple application of this could be analyzing how your company is perceived by the general public. You could collect the last 2,000 tweets that mention your company (or any term you like) and run a sentiment analysis algorithm over them.

We can also target users that specifically live in a certain location, which is known as spatial data. Another application of this could be to map the areas on the globe where your company has been mentioned the most.

As you can see, Twitter data can be a large window into public opinion and how people receive a topic. That, combined with the openness and the generous rate limiting of Twitter's API, can produce powerful results.

Tools Overview

We’ll be using Python 2.7 for these examples. Ideally, you should have an IDE to write this code in. I will be using PyCharm - Community Edition .

To connect to Twitter’s API, we will be using a Python library called Tweepy , which we’ll install in a bit.

Getting Started

Twitter developer account

In order to use Twitter’s API, we have to create a developer account on the Twitter apps site .

  • Log in or make a Twitter account at https://apps.twitter.com/ .

[Screenshot: location of the button to create an app]

We’ll need the consumer key, consumer secret, access token, and access token secret later, so make sure you keep this tab open.

Installing Tweepy

Tweepy is an excellently supported tool for accessing the Twitter API. It supports Python 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6. There are a couple of different ways to install Tweepy. The easiest way is using pip.

Simply type pip install tweepy into your terminal.

Using GitHub

You can follow the instructions on Tweepy’s GitHub repository. The basic steps are as follows:
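Roughly, and assuming the standard layout of the Tweepy repository at the time, a source install looks like this:

  • Clone the repository: git clone https://github.com/tweepy/tweepy.git
  • Move into the new directory: cd tweepy
  • Run the installer: python setup.py install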

You can troubleshoot any installation issues there as well.

Authenticating

Now that we have the necessary tools ready, we can start coding! The baseline of each application we’ll build today requires using Tweepy to create an API object which we can call functions with. In order to create the API object, however, we must first authenticate ourselves with our developer information.

First, let’s import Tweepy and add our own authentication information.
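As a minimal sketch, the step looks like the following; the credential variables are placeholders you would fill in with the keys and tokens from your Twitter app page.

    import tweepy

    # Fill these placeholders in with the values from your Twitter app page.
    consumer_key = "CONSUMER_KEY"
    consumer_secret = "CONSUMER_SECRET"
    access_token = "ACCESS_TOKEN"
    access_token_secret = "ACCESS_TOKEN_SECRET"

    # Authenticate with Twitter via OAuth.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)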

Now it’s time to create our API object.
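Continuing the sketch, the auth handler from the previous step is all the constructor needs:

    # Create the API object we'll use in every example below.
    api = tweepy.API(auth)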

This will be the basis of every application we build, so make sure you don’t delete it.

Example 1: Your Timeline

In this example, we’ll be pulling the ten most recent tweets from your Twitter feed. We’ll do this by using the API object’s home_timeline() function. We can then store the result in a variable, and loop through it to print the results.
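A minimal version of this example might look like the sketch below; passing count=10 is simply one way to limit the pull to ten tweets.

    # Pull the ten most recent tweets from your home timeline.
    public_tweets = api.home_timeline(count=10)

    # Loop through the pulled tweets and print the text of each one.
    for tweet in public_tweets:
        print(tweet.text)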

The result should look like a bunch of random tweets, followed by the URL to the tweet itself.

[Screenshot: tweet contents and links in the terminal]

Following one of these links will bring you to the tweet itself in your browser. For example, following the link from the first tweet would give us the following result:

[Screenshot: the tweet that the first link in the previous screenshot pointed to]

Note that if you’re running this through terminal and not an IDE like PyCharm, you might have some formatting issues when attempting to print the tweet’s text.

The JSON behind the Results

In the example above, we printed the text from each tweet using tweet.text. To refer to specific attributes of each tweet object, we have to look at the JSON returned by the Twitter API.

The result you receive from the Twitter API is in JSON format and has quite a lot of information attached. For simplicity, this tutorial mainly focuses on the “text” attribute of each tweet, and information about the tweeter (the user that created the tweet). For the above sample, you can see the entire returned JSON object here.

Here’s a quick look at some attributes a tweet has to offer.

[Screenshot: some of the tweet attributes returned by the Twitter API]

If you wanted to find the date the tweet was created, you would query it with print tweet.created_at.

You can also see that each tweet object comes with information about the tweeter.

[Screenshot: user attributes returned by the Twitter API]

To get the “screen name” and “location” attributes of the tweeter, you could run print tweet.user.screen_name and print tweet.user.location.

Note that these attributes can be extremely useful if your application depends on spatial data.

Example 2: Tweets from a Specific User

In this example, we’ll simply pull the latest twenty tweets from a user of our choice.

First, we’ll examine the Tweepy documentation to see if a function like that exists. With a bit of research, we find that the user_timeline() function is what we’re looking for.

[Screenshot: documentation for the user_timeline() function]

We can see that the user_timeline() function has some useful parameters we can use, specifically id (the ID of the user) and count (the number of tweets we want to pull). Note that we can only pull a limited number of tweets per query due to Twitter’s rate limits.

Let’s try pulling the latest twenty tweets from the Twitter account @NyTimes.

[Screenshot: the @NyTimes Twitter account at the time of writing]

We can create variables to store the number of tweets we want to pull (count), and the user we want to pull them from (name). We can then call the user_timeline() function with those two parameters. Below is the updated code (note that you should have kept the authentication and API object creation at the top of your code).
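Sticking with the sketch from earlier, the new part might look like this; the variable names name and tweet_count are purely illustrative.

    # The Twitter user we want to pull tweets from, and how many tweets to pull.
    name = "nytimes"
    tweet_count = 20

    # Pull the user's most recent tweets.
    results = api.user_timeline(id=name, count=tweet_count)

    # Print the text of each tweet.
    for tweet in results:
        print(tweet.text)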

Our results should look something like this:

[Screenshot: contents of the user timeline for @NyTimes]

Popular applications of this type of data can include:

  • Running analysis on specific users, and how they interact with the world
  • Finding Twitter influencers and analyzing their follower trends and interactions
  • Monitoring the changes in the followers of a user

Example 3: Finding Tweets Using a Keyword

Let’s do one last example: Getting the most recent tweets that contain a keyword. This can be extremely useful if you want to monitor specifically mentioned topics in the Twitter world, or even to see how your business is getting mentioned. Let’s say we want to see how Twitter’s been mentioning Toptal.

After looking through the Tweepy documentation, we find that the search() function seems to be the best tool to accomplish our goal.

[Screenshot: documentation for the search() function]

The most important parameter here is q—the query parameter, which is the keyword we’re searching for.

We can also set the language parameter so we don’t get any tweets from an unwanted language. Let’s only return English (“en”) tweets.

We can now modify our code to reflect the changes we want to make. We first create variables to store our parameters (query and language), and then call the function via the API object. Let’s also print the screen name of the user that created each tweet in our loop.
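As a sketch, assuming the same api object and the search() method available in the Tweepy releases this tutorial targets:

    # The search term and the language of the tweets we want back.
    query = "Toptal"
    language = "en"

    # Search for recent tweets that contain our query term.
    results = api.search(q=query, lang=language)

    # Print the screen name of the tweeter, followed by the tweet's text.
    for tweet in results:
        print(tweet.user.screen_name + ": " + tweet.text)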

[Screenshot: search results when querying for Toptal]

Here are some practical ways you can use this information:

  • Create a spatial graph on where your company is mentioned the most around the world
  • Run sentiment analysis on tweets to see if the overall opinion of your company is positive or negative
  • Create social graphs of the most popular users that tweet about your company or product

We can cover some of these topics in future articles.

Twitter’s API is immensely useful in data mining applications, and can provide vast insights into public opinion. If the Twitter API and big data analytics are something you have further interest in, I encourage you to read more about the Twitter API, Tweepy, and Twitter’s Rate Limiting guidelines.

We covered only the basics of accessing and pulling data. Twitter’s API can be leveraged in very complex big data problems, involving people, trends, and social graphs too complicated for the human mind to grasp alone.


Understanding the basics

What are data mining and big data?

Data mining is the task of pulling a huge amount of data from a source and storing it. The result of this is “big data,” which is just a large amount of data in one place.

Why is Twitter data useful?

Twitter data is open, personal, and extensive. You can extract quite a bit from a user by analyzing their tweets and trends. You can also see how people are talking about specific topics by searching for keywords and business names.

How is big data analytics useful for an organization?

For an organization, big data analytics can provide insights that surpass human capability. Being able to run large amounts of data through computation-heavy analysis is something mathematical models and machines thrive at.




The Big Data World: Benefits, Threats and Ethical Challenges

Ethical Issues in Covert, Security and Surveillance Research

ISBN : 978-1-80262-414-4 , eISBN : 978-1-80262-411-3

ISSN : 2398-6018

Publication date: 9 December 2021

Advances in Big Data, artificial Intelligence and data-driven innovation bring enormous benefits for the overall society and for different sectors. By contrast, their misuse can lead to data workflows bypassing the intent of privacy and data protection law, as well as of ethical mandates. It may be referred to as the ‘creep factor’ of Big Data, and needs to be tackled right away, especially considering that we are moving towards the ‘datafication’ of society, where devices to capture, collect, store and process data are becoming ever-cheaper and faster, whilst the computational power is continuously increasing. If using Big Data in truly anonymisable ways, within an ethically sound and societally focussed framework, is capable of acting as an enabler of sustainable development, using Big Data outside such a framework poses a number of threats, potential hurdles and multiple ethical challenges. Some examples are the impact on privacy caused by new surveillance tools and data gathering techniques, including also group privacy, high-tech profiling, automated decision making and discriminatory practices. In our society, everything can be given a score and critical life changing opportunities are increasingly determined by such scoring systems, often obtained through secret predictive algorithms applied to data to determine who has value. It is therefore essential to guarantee the fairness and accurateness of such scoring systems and that the decisions relying upon them are realised in a legal and ethical manner, avoiding the risk of stigmatisation capable of affecting individuals’ opportunities. Likewise, it is necessary to prevent the so-called ‘social cooling’. This represents the long-term negative side effects of the data-driven innovation, in particular of such scoring systems and of the reputation economy. It is reflected in terms, for instance, of self-censorship, risk-aversion and lack of exercise of free speech generated by increasingly intrusive Big Data practices lacking an ethical foundation. Another key ethics dimension pertains to human-data interaction in Internet of Things (IoT) environments, which is increasing the volume of data collected, the speed of the process and the variety of data sources. It is urgent to further investigate aspects like the ‘ownership’ of data and other hurdles, especially considering that the regulatory landscape is developing at a much slower pace than IoT and the evolution of Big Data technologies. These are only some examples of the issues and consequences that Big Data raise, which require adequate measures in response to the ‘data trust deficit’, moving not towards the prohibition of the collection of data but rather towards the identification and prohibition of their misuse and unfair behaviours and treatments, once government and companies have such data. At the same time, the debate should further investigate ‘data altruism’, deepening how the increasing amounts of data in our society can be concretely used for public good and the best implementation modalities.

  • Artificial intelligence
  • Data analytics
  • Ethics challenges
  • Individuals’ control over personal data
  • Dataveillance

Bormida, M.D. (2021), "The Big Data World: Benefits, Threats and Ethical Challenges", Iphofen, R. and O'Mathúna, D. (Ed.) Ethical Issues in Covert, Security and Surveillance Research ( Advances in Research Ethics and Integrity, Vol. 8 ), Emerald Publishing Limited, Leeds, pp. 71-91. https://doi.org/10.1108/S2398-601820210000008007

Emerald Publishing Limited

Copyright © 2022 Marina Da Bormida

These works are published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of these works (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

The Era of Big Data and the ‘Datafication’ of Society

We live in the era of Big Data, where governments, organisations and marketers know, or can deduce, an increasing number of data items about aspects of our lives that in previous eras we could assume were reasonably private (e.g. our race, ethnicity, religion, politics, sexuality, interests, hobbies, health information, income, credit rating and history, travel history and plans, spending habits, decision-making capabilities and biases and much else). Devices to capture, collect, store and process data are becoming ever-cheaper and faster, whilst the computational power to handle these data is continuously increasing. Digital technologies have made possible the ‘datafication’ of society, affecting all sectors and everyone’s daily life. The growing importance of data for the economy and society is unquestionable and more is to come. 1

But what does ‘Big Data’ mean? Though frequently used, the term has no agreed definition. It is usually associated with complex and large datasets on which special tools and methods are used to perform operations to derive meaningful information and support better decision making. However, the Big Data concept is not just about the quantity of data available, but also encompasses new ways of analysing existing data and generating new knowledge. In public discourse, the term tends to refer to the increasing ubiquity of data, the size of datasets, the growth of digital data and other new or alternative data sources. From a more specifically technical perspective, Big Data has five essential features:

Volume : the size of the data, notably the quantity generated and stored. The volume of data determines its value and potential insight. In order to have Big Data, the volume has to be massive (Terabytes and Petabytes or more). 2

Variety : the type and nature of the data, as well as the way of structuring it. Big Data may draw from text, images, audio, video (and data fusion can complete missing pieces) and can be structured, semi-structured or unstructured. Data can be obtained from many different sources, whose importance varies depending on the nature of the analysis: from social networks, to in-house devices, to smartphone GPS technology. Big Data can also have many layers and be in different formats.

Velocity : the time needed to generate and process information. Data have to flow quickly and in as close to real-time as possible because, certainly in a business context, high speed can deliver a competitive advantage.

Veracity : data quality and reliability; it is essential to have ways of detecting and correcting any false, incorrect or incomplete data.

Value : the analysis of reliable data adds value within and across disciplines and domains. Value arises from the development of actionable information.

Big Data as an Enabler of Growth but Harbinger of Ethical Challenges

Big Data is increasingly recognised as an enabling factor that promises to transform contemporary societies and industry. Far-reaching social changes enabled by datasets are increasingly becoming part of our daily life with benefits ranging from finance to medicine, meteorology to genomics, and biological or environmental research to statistics and business.

Data will reshape the way we produce, consume and live. Benefits will be felt in every single aspect of our lives, ranging from more conscious energy consumption and product, material and food traceability, to healthier lives and better health-care …. Data is the lifeblood of economic development: it is the basis for many new products and services, driving productivity and resource efficiency gains across all sectors of the economy, allowing for more personalised products and services and enabling better policy making and upgrading government services …. The availability of data is essential for training artificial intelligence systems, with products and services rapidly moving from pattern recognition and insight generation to more sophisticated forecasting techniques and, thus, better decision making …. Moreover, making more data available and improving the way in which data is used is essential for tackling societal, climate and environment-related challenges, contributing to healthier, more prosperous and more sustainable societies. It will for example lead to better policies to achieve the objectives of the European Green Deal. (COM, 2020b )

The exploitation of Big Data can unlock significant value in areas such as decision making, customer experience, market demand predictions, product and market development and operational efficiency. McKinsey & Company (Bailly & Manyika, 2013 ) report that the manufacturing industry stores more data than any other sector, with Big Data (soon to be made available through Cyber-physical Systems) expected to have an important role in the fourth industrial revolution, the so-called ‘Industry 4.0’ (Kagermann & Wahlster, 2013). This revolution has the potential to enhance productivity by improving supply chain management (Reichert, 2014 ) and creating more efficient risk management systems based on better-informed decisions. Industry 4.0 is also aimed at developing intelligent products (smart products) capable of capturing and transmitting huge amounts of data on their production and use. These data have to be gathered and analysed in real-time so as to pinpoint customers’ preferences and shape future products. Data are also expected to fuel the massive uptake of transformative practices such as the use of digital twins in manufacturing.

As mentioned, Big Data also creates value in many other domains including health care, government administration and education. The application of transparency and open government policies is expected to have a positive impact on many aspects of citizens’ lives. This will hopefully lead to the development of more democratic and participative societies by improved administrative efficiency, alongside perhaps more obvious uses such as better disease prevention in the health sector or self-monitoring in the education sector.

However, these positive effects must be offset against complex and multi-dimensional challenges. In the health care sector, an area that could benefit enormously from Big Data solutions, concerns relate, for instance, to the difficulty of respecting ethical boundaries around sensitive data, where the sheer volume of data may prevent the acquisition of the informed and specific consent required before each processing instance takes place. Another example, in the education sector, is the risk that students feel under surveillance at all times due to the constant collection and processing of their data, potentially leading to reduced creativity and/or higher levels of stress.

When considering Big Data, the debate needs to highlight the several potential ethical and social dimensions that arise, and explore the legal, societal and ethical issues. Here, there is a need to elaborate a societal and ethical framework for safeguarding human rights, mitigating risks and ensuring a consistent alignment between ethical values and behaviours. Such a framework should be able to enhance the confidence of citizens and businesses towards Big Data and the data economy. As acknowledged by the European Data Protection Supervisor (EDPS), ‘big data comes with big responsibility and therefore appropriate data protection safeguards must be in place’. 3

Recent ethical debate has focussed on concerns about privacy, anonymisation, encryption, surveillance and, above all, trust. The debate is increasingly moving towards artificial intelligence (AI) and autonomous technology, in line with technological advances. It is likely that as technology changes even further upcoming new types of harms may also be identified and debated.

The Continuity (Or Not) of Data Science Research Ethics with Social and Behavioural Science Research Ethics

Given data-intensive advances, a pertinent question is whether ethical principles developed in the social and behavioural sciences using core concepts such as informed consent, risk, harm, ownership, etc. can be applied directly to data science, or whether they require augmentation with other principles specifically conceived for ‘human-subjects’ protection in data-intensive research activities. Traditionally, human-subjects’ protection applies when data can be readily associated with the individual who bears a risk of harm in his or her everyday life. However, with Big Data there may be a substantial distance between everyday life and the uses of personal data. If technical protections are inadequate, and do not prevent the re-identification of sensitive data across distinct databases, it is challenging to predict the types of possible harms to human subjects due to the multiple, complex reasons for sharing, re-using and circulating research data.

If these difficulties are insurmountable within existing paradigms of research ethics, we will need to re-think the traditional paradigms. Here, a new framework of research ethics specific to data science could perhaps be built that could better move the ‘person’ to the centre of the debate. The expanding literature on privacy and other civil rights confirms that the ethical dimension of Big Data is becoming more and more central in European Union (EU) debate, and that the common goal is to seek concrete solutions that balance making the most of the value of Big Data without sacrificing fundamental human rights. Here, the Resolution on the fundamental rights implications of Big Data (2016/2225), adopted by the European Parliament, underlines that though Big Data has valuable potential for citizens, academia, the scientific community and the public and private sectors, it also entails significant risks namely with regard to the protection of fundamental rights, the right to privacy, data protection, non-discrimination and data security. The European Parliament has therefore stressed the need for regulatory compliance together with strong scientific and ethical standards, and awareness-raising initiatives, whilst recognising the importance of greater accountability, transparency, due process and legal certainty with regard to data processing by the private and public sectors.

Likewise, the European Commission (EC) recognises the importance of safeguarding European fundamental rights and values in the data strategy and its implementation (COM, 2020b ), whilst in the COM ( 2020a ), built upon the European strategy for AI, it is underlined that in order to address the opportunities and challenges raised by AI systems and to achieve the objective of trustworthy, ethical and human-centric AI, it is necessary to rely on European values and to ensure ‘that new technologies are at the service of all Europeans – improving their lives while respecting their rights’ (COM, 2020a ). In the same direction, a coordinated European approach on the human and ethical implications of AI, as well as a reflection on the better use of Big Data for innovation, was announced in her political guidelines by the Commission President Ursula von der Leyen (2019).

Big Data and Its Impact on Privacy

Human Dignity at Risk Due to the ‘Creep Factor’ of Big Data

The use of Big Data, new surveillance tools and data gathering techniques represent a fundamental step for the European economy. Nevertheless, it also poses significant legal problems from a data protection perspective, despite the renewed legal framework (General Regulation on the Protection of Personal Data, GDPR). In the Big Data paradigm, traditional methods and notions of privacy protections might be inadequate in some instances (e.g. informed consent approaches), whilst the data are often used and re-used in ways that were inconceivable when the data were collected.

As acknowledged by the EDPS, the respect for human dignity is strictly interrelated with the respect for the right to privacy and the right to the protection of personal data. That human dignity is an inviolable right of human beings is recognised in the European Charter of Fundamental Rights. This essential right might be infringed by violations like objectification, which occurs when an individual is treated as an object serving someone else’s purposes (European Data Protection Supervisor, Opinion 4/2015).

The impact of Big Data technologies on privacy (and thereby human dignity) ranges from group privacy and high-tech profiling, to data discrimination and automated decision making. It is even more significant if people disseminate personal data in the digital world at different levels of awareness throughout their main life phases. Here, people can often make themselves almost completely transparent for data miners who use freely accessible data from social networks and other data associated with an IP address for profiling purposes.

This ‘creep factor’ of Big Data, due to unethical and deliberate practices, bypasses the intent of privacy law. Such practices are allowed by advances in analysing and using Big Data for revealing previously private individual data (or statistically close proxies for it) and often have the final aim of targeting and profiling customers.

Another concern in relation to Big Data is the possibility of re-identification of the data subject after the process of anonymisation. This might occur using de-anonymisation technologies made available by the increased computational power of modern-day personal computers, enabling a trace back to the original personal data. Indeed, traditional anonymisation techniques, which make each data entry non-identifiable by removing (or substituting) uniquely identifiable information, have limits: despite the substitution of users’ personal information in a dataset, the anonymisation can be overcome in a relatively short period of time through simple links between such anonymous datasets, other datasets (e.g. web search history) and personal data. Re-identification of the data subject might also derive from the powerful insights produced when multiple and specific datasets from different sources are joined. This might allow interested parties to uniquely identify specific physical persons or small groups of persons, with varying degrees of certainty.

The re-identification of data poses serious privacy concerns: once anonymised (or pseudo-anonymised), data may be freely processed without any prior consent by the data subject, before the subject is then re-identified. The situation is exacerbated by the lack of adequate transparency regarding the use of Big Data: this affects the ability of a data subject to allow disclosure of his/her information and to control access to these data by third parties, also impacting civil rights.

It is advisable that organisations willing to use Big Data adopt transparent procedures and ensure that these procedures are easily accessible and knowable by the public. In this way, an ethical perspective would truly drive innovation and boundary setting, properly taking into account the individual’s need for privacy and self-determination.

New Types of Stigmatisation and Manipulation of Civil Rights in the ‘Group Privacy’ Landscape

The right to privacy is undergoing an evolution. Originally arising as the right to be let alone and to exclude others from personal facts, over the years it has shifted to the right to being able to control personal data, and is now moving further in the direction of improved control. The current direction is towards the right to manage identity and the analytical profile created by third parties which select the relevant patterns to be considered in metadata. This third phase dwells not only on data that enable the identification of specific physical persons, but more on data suitable for finding out specific patterns of behaviour such as health data, shopping preferences, health status, sleep cycles, mobility patterns, online consumption, friendships, etc., of groups rather than of individuals. Despite the data being anonymous (in the sense of being de-individualised), groups are increasingly becoming more transparent: indeed, stripping data from all elements pertaining to any sort of group belongingness would result in stripping the collection itself from its content and therefore its usefulness.

This information gathered from Big Data can be used in a targeted way to encourage people to behave or consume in a certain way. Targeted marketing is an example, but other initiatives (for instance, in the political landscape), based on the ability of Big Data to discover hidden correlations and on the inferred preferences and conditions of a specific group, could be adopted to encourage or discourage a certain behaviour, with incentives whose purposes are less transparent (including not only market intelligence, but other forms of manipulations in several sectors – such as in voting behaviour).

New types of stigmatisation might also arise, for instance, in relation to the commercial choices and other personal information of groups. Forms of discrimination are likely, especially when the groups get smaller (identified by geographical, age, sex, etc. settings). In this sense, Big Data techniques might eclipse longstanding civil rights protections.

What increases ethics concern is the related collection and aggregation of mass Big Data, and the resulting structured information and quantitative analysis for this purpose that are not subject to the application of current data protection regulations. Therefore, innovative ways of re-thinking citizens’ protection are needed, capable of offering adequate and full protection.

The ‘Sharing the Wealth’ Model and the ‘Personal Data Store’ Approach for Balancing Big Data Exploitation and Data Protection

As pointed out by the EU Agency for Network and Information Security (ENISA), it is necessary to overcome the conceptual conflict between privacy and Big Data and between privacy and innovation. The need is to shift ‘… the discussion from “big data versus privacy” to “big data with privacy”’, and to recognise the privacy and data protection principles as ‘an essential value of big data, not only for the benefit of the individuals, but also for the very prosperity of big data analytics’ (ENISA, 2015, p. 5). There is no dichotomy between ethics and innovation if feasible balancing solutions are figured out and implemented. The respect for citizens’ privacy and dignity and the exploitation of Big Data’s potential can fruitfully coexist and prosper together, balancing the fundamental human values (privacy, confidentiality, transparency, identity, free choice and others) with the compelling uses of Big Data for economic gains. This is aligned with EDPS’s recent opinion (European Data Protection Supervisor, Opinion 3/2020 on the European strategy for data) underlining that data strategy’s objectives could encompass ‘to prove the viability and sustainability of an alternative data economy model – open, fair and democratic’ where, in contrast with the current predominant business model,

characterised by unprecedented concentration of data in a handful of powerful players, as well as pervasive tracking, the European data space should serve as an example of transparency, effective accountability and proper balance between the interests of the individual data subjects and the shared interest of the society as a whole.

The key question is how to ensure this coexistence and the underlying balance is achieved. The answer is not simple and relies on multiple dimensions. From a technological perspective, Privacy by Design and Privacy Enhancing Technologies (PETs) come into play. 4

As stated in EU Regulation 2016/679, the data protection principles should be taken into consideration at a very early stage, and privacy measures and PETs should be identified in conjunction with the determination of the means for processing and deployed at the time of the processing itself. ENISA proposed an array of privacy-by-design strategies, ranging from data minimisation and separate processing of personal data, to hiding personal data and their interrelations, opting for the highest level of aggregation. The PETs to implement these strategies are already applied in the Big Data industry: they rely on anonymisation, encryption, transparency and access, security and accountability control, and consent ownership and control mechanisms. Even so, adequate investment in this sector is required, as confirmed by the small number of patents for PETs compared to those granted for data analytics technologies. Efforts need to be directed towards strengthening data subject control, thereby bringing transparency and trust to the online environment. In fact, trust has emerged as a complex topic within the contemporary Big Data landscape. At the same time, it has become a key factor for economic development and for the adoption of new services, such as public e-government services, as well as for users’ willingness to provide personal data. In some instances, such as in the medical field, the choice not to provide full disclosure of the requested information might impact the individual’s wellbeing or health (besides indirectly hindering progress in research), given that these are personal data and the trust relationship with the data collector (e.g. the staff of a hospital) is functional to the individual’s wellbeing and/or health.

The ‘sharing the wealth’ strategy proposed by Tene and Polonetsky ( 2013 ) for addressing Big Data challenges is based on the idea of providing individuals access to their data in a usable format and, above all, allowing them to take advantage of solutions capable of analysing their own data and drawing useful conclusions from it. The underlying vision is to share the wealth individuals’ data helps to create with individuals themselves, letting them make use of and benefit from their own personal data. This approach is also aligned with the vision of the Big Data Value Association (BDVA Position Paper, 2019), which outlines opportunities of data economy arising over the next decade for the industry (business), the private users (citizens as customers), the research and academic community (science) and local, national and European government and public bodies (government).

Other authors (Rubinstein, 2013 ) underline the potentialities of a new business model based on the personal data store or personal data space (PDS). Such a business model shifts data acquisition and control to a user-centric paradigm, based on better control of data and joint benefits from its use. This solution (and the necessary implementing technology), if developed, might enable users’ empowerment and full control over their personal data. In fact, it would permit users to gather, store, update, correct, analyse and/or share personal data, as well as having the ability to grant and withdraw consent to third parties for access to data. In this way, it would also work towards more accountable companies, where the commitment in personal data protection might become an economic asset for digital players.

PDS are also aligned with the importance of data portability, strongly advocated by the EDPS in view of guaranteeing people the right to access, control and correct their personal data, whilst enhancing their awareness. Data portability also nurtures the suggested approach of allowing people to share the benefits of data and can foster the development of a more competitive market environment, where the data protection policy is transformed into a strategical economic asset , thus triggering a virtuous circle. Companies would be encouraged to invest to find and implement the best ways to guarantee the privacy of their customers: indeed, data portability allows customers to switch providers more easily, also by taking into account the provider more committed to respecting personal data and to investing in privacy-friendly technical measures and internal procedures.

The ‘sharing the wealth’ paradigm and the potentialities of a new ethically driven business model relying on personal data are at the basis of the European Project DataVaults – ‘Persistent Personal DataVaults Empowering a Secure and Privacy Preserving Data Storage, Analysis, Sharing and Monetisation Platform’ (Grant Agreement no. 871755), funded under the H2020 Programme. 5 This project, currently under development, is aimed at setting, sustaining and mobilising an ever-growing ecosystem for personal data and insights sharing, capable of enhancing the collaboration between stakeholders (data owners and data seekers). Its value-driven tools and methods for addressing concerns about privacy, data protection, security and Intellectual Property Rights (IPR) ownership will enable the ethically sound sharing both of personal data and proprietary/commercial/industrial data, following strict and fair mechanism for defining how to generate, capture, release and cash out value for the benefit of all the stakeholders involved, as well as securing value flow based on smart contract, moving towards a win–win data sharing ecosystem.

The European Privacy Association even proposes that digital companies see data protection not as a mere legal compliance obligation, but as part of a broader corporate social responsibility and of socially responsible investments in the Big Data industry. It recommends valorising such commitments as assets within renewed business models, able to help companies responsibly achieve their economic targets.

From a wider perspective, as also underlined by BDVA (2020) in particular in relation to the Smart Manufacturing environment, the soft law in the form of codes of conduct could bring a set of advantages at ecosystem level in each domain. In fact, such sources are expected to offer guidance and to address in meaningful, flexible and practical ways the immediate issues and ethical challenges of Big Data and AI innovations in each sector, going beyond current gaps in the legal system: they can operate as a rulebook, providing more granular ethical guidance as regards problems and concerns, resulting in an increase of confidence and legal certainty of individuals which also encompass trust building and consolidation.

In parallel, this calls for promoting the acquisition of skills concerning privacy as a value and a right, the ethical issues of behaviour profiling, ownership of personal content, virtual identity-related risks and digital reputation control, as well as other topics related to Big Data advancements. For this purpose, Bachelor’s and Master’s degree programmes in Data Science, Informatics, Computer Science, Artificial Intelligence and related subjects could be adequately expanded to cover these themes. In this way, Big Data businesses could include dedicated professional figures among their human resources.

At the same time, in order to promote the commitment of the business world, it is advisable that the efforts of those companies which invest in ethical relationships with customers are recognised by governments and properly communicated by the companies themselves to their customer base. The certification approach should also be explored, as inspired by the Ethics Certification Program for Autonomous and Intelligent Systems launched by the Institute of Electrical and Electronics Engineers for AIS 6 products, services and systems.

This would let such companies benefit from an improved reputation and from increased customer trust in their products and services. At the same time, information on business ethics violations occurring through the improper use of Big Data analytics should be transparent and not kept opaque to consumers.

A Critical Perspective on the ‘Notice and Consent’ Model and on the Role of Transparency in the Evolving World of Big Data Analytics

Emerging commentators argue that the data protection principles, as embodied in national and EU law, are no longer adequate to deal with the Big Data world: in particular, they criticise the role of transparency in the evolving world of Big Data analytics, assuming that it no longer makes sense considering the complex and opaque nature of algorithms. They also debate the actual suitability of the so-called ‘notice and consent’ model, on the grounds of consumers’ lack of time, willingness or ability to read long privacy notices.

Others prefer to emphasise accountability, as opposed to transparency for answering Big Data ethics challenges, being focussed on mechanisms more aligned with the nature of Big Data (such as assessing the technical design of algorithms and auditability). GDPR itself highlights, besides the role of transparency, the growing importance of accountability.

Instead of denying the role of transparency in the Big Data context, others suggest that it is not possible to offer a wholesale replacement for transparency and propose a more ‘layered’ approach to it (for instance, as regards privacy notices to individuals and also the information detail), in conjunction with a greater level of detail and access being given to auditors and accredited certification bodies.

On the contrary, transparency itself might be considered as a requirement needed for accountability and seems unavoidable in the context of respect for human dignity. Traditional notice and consent models might be rather insufficient and obsolete in view of the effective exercise of control and in order to avoid a situation where individuals feel powerless in relation to their data. Nevertheless, to overcome this weakness, an alternative, more challenging path is to make consent more granular and capable of covering all the different processing (and related) purposes and the re-use of personal data. This effort should be combined with increased citizens’ awareness and a higher participation level, as well as with effective solutions to guarantee the so-called right to be forgotten.

In the same user-centric approach, based on control and joint benefits and promoted by EC and European-wide initiatives, 7 a number of views foster new approaches premised on consumer empowerment in the data-driven business world. These approaches, strongly aligned with the transparency and accountability requirements, ask for proper internal policies and control systems, focussed on pragmatic, smart and dynamic solutions and able to prevent the risk of companies becoming stuck in bureaucracy.

Discrimination, Social Cooling, Big Data Divide and Social Sorting

A possible side effect of datafication is the risk of discrimination by data mining technologies in several aspects of daily life, such as employment and credit scoring (Favaretto, De Clercq, & Elger, 2019). It ranges from discriminatory practices based on profiling and related privacy concerns (e.g. racial profiling enabled by Big Data platforms in subtle ways, by targeting characteristics like home address and misleading vulnerable, less-educated groups with scams or harmful offers), 8 to the impact of Big Data on the daily operation of organisations and public administrations (e.g. within human resources offices). In the latter context, crucial decisions, like those about employment, might rely on Big Data practices which bring the risk of unfair treatment through discrimination based on gender, race, disability, national origin, sexual orientation and so on.

Social Cooling as a Side Effect of Big Data

We live in a society where everything can be given a score and critical life changing opportunities are increasingly determined by such scoring systems, often obtained through secret predictive algorithms applied to data to determine which individuals or which social group has value. It is therefore essential to consider human values as oversight in the design and implementation of these systems and, at the same time, to guarantee that the policies and practices using data and scoring machines to make decisions are realised in a legal and ethical manner (including avoiding automated decision-making practices not compliant with regulatory boundaries set forth by art. 22 GDPR). Fair and accurate scoring systems have to be ensured, whilst also avoiding the risk that data might be biased to arbitrarily assign individuals to a stigmatising group. Such an assignment might potentially allow that decisions relevant for them are not fair and, in the end, might negatively affect their concrete opportunities.

Any Big Data system has to ensure that, if existing, automated decision making, especially in areas such as employment, health care, education and financial lending, operates fairly for all communities, and safeguards the interests of those who are disadvantaged. The use of Big Data, in other words, should not result in infringements of the fundamental rights of individuals, neither in differential treatment or indirect discrimination against groups of people, for instance, as regards the fairness and equality of opportunities for access to services.

As indicated by the European Parliament, all measures possible need to be taken to minimise algorithmic discrimination and bias and to develop a common ethical framework for the transparent processing of personal data and automated decision making. This common framework should guide data usage and the ongoing enforcement of EU law. From this perspective, it is necessary that the use of algorithms to provide services – useful for identifying patterns in data – rely on a comprehensive understanding of the context in which they are expected to function and are capable of picking up what matters. It is also essential to establish oversight activities and human intervention in automated systems as well, besides considering that Big Data needs to be coupled with room for politics and with mechanisms to hold power to account. In this way, unintended negative societal consequences of possible errors introduced by algorithms, especially in terms of the risk of systematic discrimination across society in the provision of services, might be prevented or at least minimised.

This will also limit the widening of one of the chilling effects of Big Data related to discrimination, the so-called social cooling. Social cooling could limit people’s desire to take risks or exercise free speech, which, over the long term, could ‘cool down’ society. 9 The term describes the long-term negative side effects, for instance self-censorship, risk-aversion and a reduced exercise of free speech, of living in a reputation economy where Big Data practices that lack an ethical dimension are increasingly apparent and intrusive.

Social cooling is due to people’s emerging perception that their data, including the data reflecting their weaknesses, is turned into thousands of different scores and that their resulting ‘digital reputation’ could limit their opportunities. As a consequence, they feel pressure to conform to a bureaucratic average, start to apply self-censorship and tend to change their behaviour to achieve better scores. This might result, especially if public awareness remains very low, in increased social rigidity, limiting people’s ability and willingness to protest injustice and, in the end, in a subtle form of socio-political control. The related societal question is whether this trend will have an impact on the human ability to evolve as a society, where minority views are still able to flourish.

The social cooling effect emphasises another dimension of a mature and nuanced perception of data and privacy: its ability to protect the right to be imperfect, in other words the right to be human.

Big Data Divide

The expression Big Data Divide has a two-fold meaning. First, it refers to the difficulty in accessing services delivered through the use of the Internet and other new technologies and to the complexity in understanding how these technologies and related services work. This kind of digital divide might have consequences, for instance, with regard to online job hunting: senior citizens, who are unfamiliar with this new way of job hunting, can be harmed in terms of lost job opportunities. The same may happen with regard to other tools such as online dating services for finding a new partner or for social interactions. The consequences might be frustration and social withdrawal. Similarly, inclusion concerns are related to the possible definition of new policies based on a data-driven approach (e.g. data collected via sensors, social media, etc.); there is the concrete possibility that some individuals or portions of a society might not be considered. The risk is that the new policy will only take into account the needs of people having access to the given technological means. Secondly, the notion of a ‘Big Data divide’ refers to the asymmetric relationship between those ‘who collect, store, and mine large quantities of data, and those whom data collection targets’ (Andrejevic, 2014 ). The Big Data divide is perceived as potentially able to exacerbate power imbalances in the digital era and increase the individual’s sense of powerlessness in relation to emerging forms of data collection and data mining.

Furthermore, it has been argued that Big Data and data mining emphasise correlation and prediction and call to mind the emergent Big Data-driven forms of social sorting (and related risk of discrimination). This remark refers to the ability – enabled by Big Data and data mining – of discerning unexpected, unanticipated correlations and of generating patterns of actionable information. Such ability provides powerful insights for decision making and prediction purposes, unavailable to those without access to such data, processing power and findings: those with access are advantageously positioned compared to those without it.

Predictive analytics for data-driven decision making and social sorting can also lead to ‘predictive policing’ (Meijer & Wessels, 2019), where extra surveillance is set for certain individuals, groups or streets if it is more likely that a crime will be committed. Though systematic empirical research capable of generating an evidence base on the benefits and drawbacks of this practice still seems to be missing, predictive policing poses a political challenge: while it is difficult to ignore these kinds of findings and do nothing to prevent the occurrence of the crime, at the same time the risk of stigmatisation of such individuals or groups has to be tackled. A balance could be sought by considering, for instance, the intervention threshold and correlating the type of intervention with the likelihood of crime anticipated by the algorithms, being careful to exclude incidental co-occurrences.

Big Data from the Public Sector Perspective

Big Data for Public Use

Another area to investigate is how Big Data might be used for public good and with public support.

Both the ‘European Strategy for Data’ (COM, 2020b) and the recent Proposal for a Regulation on European Data Governance (‘Data Governance Act’), the first of a set of measures announced in the strategy, facilitate data altruism, meaning ‘data voluntarily made available by individuals or companies for the common good’ (COM, 2020c). The increasing amounts of data in society might change the type of evidence that is available to policy makers and, at the same time, policy makers can draw on computer models and predictive analytics as a basis for their decisions. Drawing meaningful insights (relevant for policy elaboration purposes) from data would require a comprehensive data infrastructure, where data sources are well organised and can be accessed by authorised people for the appropriate use. The discussion mainly explores the opportunities in local services, with a view to backing local decisions with evidence in order to secure investment from central budget holders. The surveys ranged from identifying which approaches work better for the public at a lower cost to demonstrating where resources are lacking and investment is needed. However, the possible use of data analysis in many local authorities is confronted by more traditional approaches, as well as by civil servants’ diffidence in exploiting the potential of cutting-edge technologies. An organisational and cultural change therefore needs to be supported, through awareness campaigns and other initiatives.

An interesting example of how Big Data can be exploited for the common good and public interest in conjunction with private business’ priorities is the solution developed in the project AEGIS – ‘Advanced Big Data Value Chain for Public Safety and Personal Security’ (Grant Agreement no. 732189), funded by the European Commission in the H2020 Programme. The project brought

together the data, the network and the technologies to create a curated, semantically enhanced, interlinked and multilingual repository for public and personal safety-related Big Data. It delivers a data-driven innovation that expands over multiple business sectors and takes into consideration structured, unstructured and multilingual datasets, rejuvenates existing models and facilitates organisations in the Public Safety and Personal Security linked sectors to provide better & personalised services to their users. 10

The services enabled by this technology aim to generate value from Big Data and renovate the Public Safety and Personal Security sector, positively influencing the welfare and protection of the general public. Project achievements aim to have positive impacts in terms of economic growth and enhanced public security, as well as for individuals, by improving safety and wellbeing through prevention and protection from dangers affecting safety (such as accidents or disasters).

Dataveillance, Big Data Governance and Legislation

Big Data poses multiple strategic challenges for governance and legislation, with the final aim of minimising harm and maximising benefit from the use of data. Such challenges require consideration of risks and risk management.

The first issue is related to the practice of the so-called ‘dataveillance’, where the use of data improves surveillance and security. It refers to the continuous monitoring and collecting of users’ online data (data resulting from email, credit card transactions, GPS coordinates, social networks, etc.), including communication and other actions across various platforms and digital media, as well as metadata. This kind of surveillance is partially unknown and happens discreetly. Dataveillance can be individual dataveillance (concerning the individual’s personal data), mass dataveillance (concerning data on groups of people) and facilitative mechanisms (without either considering the individual as part of a group, or targeting any specific group).

In the public perception, the idea that one’s position and activity might be in some way tracked at most times has become an ordinary fact of life, in conjunction with an increased perception of safety: almost everyone is aware of the ubiquitous use of CCTV 11 circuits, the GPS 12 positioning capabilities inside mobile devices, the use of credit cards and ATM 13 cards and other forms of tracking. At the same time, this active surveillance might also have an impact on citizens’ liberties and might be used by governments (and businesses too) for unethical purposes.

Ethical concerns revolve around individual rights and liberties, as well as on the ‘data trust deficit’, whereby citizens have lower levels of trust in institutions to use their data appropriately.

Other important tools for accountability to the public should be implemented, in order to avoid the public perception that there are no mechanisms for accountability outside of public outcry. This implies tackling the challenge for Big Data governance. For instance, it would be useful if there were a formulation and upholding of an authoritative ethical framework at the national or international level, drawing upon a wide range of knowledge, skills and interests across the public, private and academic sectors, and confirmed by a wide public consultation.

Alongside this ethical framework, an update of the current legislative system would be opportune for minimising harm and maximising benefit from the use of data: regulation is developing at a much slower pace than Big Data technology and its applications. As a result, responsibility falls to the business community to decide how to harness the insights offered by data from multiple sources and devices, according to its core ethical values.

Data Ownership

Another dimension of the debate on Big Data revolves around data ownership, which might be considered an IPR issue separate from technology IPR.

The latter refers to the procedures and technologies used to acquire, process, curate, analyse and use the data. Big Data technology IPRs are largely covered by the general considerations applicable to software and hardware IPRs and the related business processes, applied to the Big Data domain. In this view, special IPR approaches are not needed, since existing models for the assertion, assignment and enforcement of copyright, design rights, trademarks and patents for IT in general suffice.

By contrast, data ownership refers to the IP related to the substantive data itself, including both raw data and derived data. The main IP rights in relation to data are database rights, copyright and confidentiality: because database rights and copyright protect expression and form rather than the substance of information, the provisions safeguarding the confidentiality of information are often considered the best form of IP protection for data, as they can protect the substance of data that is not generally publicly known.

IP challenges relating to the data themselves, however, differ from those covered by existing approaches and need special care, especially as regards protection, security and liability, in addition to data ownership. At the same time, addressing these IP issues is essential, given the high revenues expected from increased Big Data innovation and technology diffusion.

Data ownership and the rights to use data might be covered by copyright and by the contracts in force when the data are collected, which often also include confidentiality clauses. Where big datasets are further processed, it has to be established when and how this creates new ownership: the acquisition of data, its curation and combination with other datasets, and any analysis and resulting insights create new rights in the resulting data, which need to be asserted and enforced.

Regardless of the considerations stemming from the regulatory perspective, notably Directive 96/9/EC on the legal protection of databases, the main ethical dilemma concerns how to treat users’ data. In other words, to whom do these data belong: still to the user, to the company that conducted the analyses, or to the company that gathered the original data? 14

All these issues should not only be specifically addressed by national and European legislation on IPR in relation to data, which is of uncertain scope at the moment, but also investigated in the data ethics debate: best practices for collection, recommendations and guidelines would be very useful. Currently, a key role in addressing these issues is played by contract provisions.

To ensure the fair attribution of the value created through data, while taking into account the multiple, competing interests at stake in B2B 15 data sharing, a balance must be struck between the data producers’ interest in remaining in control of their data and retaining their rights as original owners, the public interest in avoiding data monopolies (since data fuel innovation, creativity and research) and data subjects’ interest in the personal information a company collects about them.

Regarding the first of these interests and the related ownership claims, the legal framework is still uncertain and fragmented. The situation is further complicated by the difficulty of applying legal categories: data are an intangible good that is hard to define, and the very legal concept of data ownership is not clearly settled. Many questions arise: does existing EU law provide sufficient protection for data? If not, what more is needed? Are data capable of ownership (under a sui generis right or copyright law)? Is there a legal basis for claims of ownership of data? Is it necessary to enact exclusive rights in data, or is it better to explore alternatives?

Regarding alternatives, an interesting option is to establish de facto exclusivity of data through flexible and pragmatic solutions that provide certainty and predictability, combining agile contracting with enabling technological tools. The contractual layer of this solution consists of ad hoc, on-the-fly B2B data exchange contracts governed by a well-defined data sovereignty principle that safeguards data producers’ control over the data they generate; for this purpose, access and usage policies or protocols need to be implemented, while also balancing other interests, such as individuals’ interest in their personal data. The technological layer, in turn, provides the enabling technologies to implement and enforce the terms and conditions set out in the data sharing agreements. Technologies to be explored include, for instance, sticky policies, Blockchain, Distributed Ledger Technologies and smart contracts, Digital Rights Management technologies and APIs. 16
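
To make the interplay between the contractual and technological layers more concrete, the following is a minimal, illustrative sketch in Python of a ‘sticky policy’ that travels with a dataset and is evaluated before the data can be used. All class and field names (StickyPolicy, DataEnvelope, can_use, etc.) are hypothetical and are not drawn from any specific standard, project or product; a real deployment would combine such machine-readable terms with cryptographic enforcement and the contract provisions discussed above.

```python
# Illustrative sketch only: a "sticky policy" that travels with the data and is
# checked before any use. Class and field names are hypothetical and are not
# taken from any specific standard or project.
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StickyPolicy:
    owner: str                      # data producer asserting data sovereignty
    allowed_parties: set[str]       # consumers named in the B2B contract
    allowed_purposes: set[str]      # e.g. {"analytics", "quality-control"}
    expires_at: datetime            # "time to live" of the usage right
    forwarding_allowed: bool = False


@dataclass
class DataEnvelope:
    payload: dict
    policy: StickyPolicy

    def can_use(self, party: str, purpose: str, now: datetime | None = None) -> bool:
        """Evaluate the attached terms before releasing the payload."""
        now = now or datetime.now(timezone.utc)
        return (
            party in self.policy.allowed_parties
            and purpose in self.policy.allowed_purposes
            and now < self.policy.expires_at
        )


if __name__ == "__main__":
    envelope = DataEnvelope(
        payload={"sensor_id": 42, "temperature_c": 7.9},
        policy=StickyPolicy(
            owner="ProducerCo",
            allowed_parties={"LogisticsCo"},
            allowed_purposes={"analytics"},
            expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
        ),
    )
    print(envelope.can_use("LogisticsCo", "analytics"))   # True: matches the agreed terms
    print(envelope.can_use("ThirdPartyCo", "analytics"))  # False: not a contract party
```

The design choice here is simply that the usage terms are data that accompany the payload, so every access decision can be traced back to the contract that authorised it.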

This kind of solution is well developed by the International Data Space Association (IDSA), 17 consisting of more than one hundred companies and institutions from various industries, of different sizes and from 20 countries, collaborating to design and develop a trustworthy architecture for the data economy. Its vision and reference architecture revolve around the concept of ‘data sovereignty’, defined as ‘a natural person’s or corporate entity’s capability of being entirely self-determined with regard to its data’ (IDSA, 2019). Data sovereignty is materialised in ‘terms and conditions’ (such as time to live, forwarding rights, pricing information, etc.) linked to the data before they are exchanged and shared. Such terms and conditions are supported and enforced through the technical infrastructure, including tools for secure and trusted authorisation, authentication and data exchange (such as blockchain, smart contracts, identity management and point-to-point encryption) that can be customised to the needs of individual participants.
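
As a purely illustrative complement, and not a rendering of the IDSA reference architecture itself, the sketch below shows how an append-only, hash-chained log (the basic idea behind ledger-style infrastructure) could make the agreed terms and each exchange event auditable. The class, method and field names are hypothetical assumptions introduced only for this example.

```python
# Illustrative sketch only: a toy append-only, hash-chained log of data exchange
# events, gesturing at how ledger-style infrastructure can make agreed terms and
# each exchange auditable. This is NOT the IDSA reference architecture; all names
# are hypothetical.
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone


def _digest(record: dict) -> str:
    """Deterministic SHA-256 digest of a JSON-serialisable record."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


class ExchangeLedger:
    def __init__(self) -> None:
        self._entries: list[dict] = []

    def record(self, provider: str, consumer: str, terms: dict) -> dict:
        """Append one exchange event, chained to the previous entry's hash."""
        entry = {
            "provider": provider,
            "consumer": consumer,
            "terms": terms,  # e.g. {"time_to_live_days": 30, "forwarding": False}
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": self._entries[-1]["hash"] if self._entries else None,
        }
        entry["hash"] = _digest(entry)
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash to detect tampering with the recorded terms."""
        for i, entry in enumerate(self._entries):
            body = {k: v for k, v in entry.items() if k != "hash"}
            expected_prev = self._entries[i - 1]["hash"] if i else None
            if entry["prev_hash"] != expected_prev or entry["hash"] != _digest(body):
                return False
        return True


if __name__ == "__main__":
    ledger = ExchangeLedger()
    ledger.record("ProducerCo", "LogisticsCo", {"time_to_live_days": 30, "forwarding": False})
    print(ledger.verify())  # True while the recorded terms are untampered
```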

In line with the joint benefit approach and the related user-centric business model based on PDS, a similar path could also be extended to strengthen the contract provisions underpinning high-value personal data ecosystems, keeping the process under individuals’ control, as in the DataVaults Project. This is also the goal of the new Smart Cities Marketplace Initiative within the Citizen Focus Action Cluster, ‘Citizen Control of Personal Data’, 18 launched on 27 January 2021. Its intention is

to contribute to speeding up the adoption, at scale, of common open urban data platforms, and ensure that 300 million European citizens are served by cities with competent urban data platforms, by 2025. The potential for citizen’s personal data to contribute to data ecosystems will be significantly enhanced by introducing secure, ethical and legal access to this highly coveted and valuable personal data, incorporating citizen-generated data as ‘city data’.

Novel contract rights, including IPR provisions, might spread further in the data-driven economy, confirming users’ control over their data and their empowerment, and thereby helping to overcome existing differences between national laws and gaps in European legislation.

Nevertheless, just as IPR development has historically followed the commercialisation of innovation, the growth of the Big Data market is likely to drive a further renewal of the IPR regulatory framework underpinning it and to pave the way for a coherent system at the European level.

Conclusions

The rise of Big Data and the underlying ability to capture and analyse datasets from highly diversified contexts and to generate novel, unanticipated knowledge, together with AI developments relying on data, are capable of producing economic growth and bringing significant benefits at both the social and the individual level. This rapidly expanding phenomenon is expected to have a significant influence on governance, policing, economics, security, science, education, health care and much more.

The collection of Big Data, and inferences based on them, enables both economic growth and the generation of value, with the potential to bring further improvement to everyday life in the near future. Examples range from road safety to health services, agriculture, retail, education and climate change mitigation. Possible improvements rely on the direct use and collection of Big Data or on inferences or ‘nowcasting’ based on them: new knowledge and insights are generated, and real-time reports and analyses with alerting purposes can be produced.

At the same time, Big Data practices and techniques raise several ethical, social and policy challenges, threats and potential hurdles. These are often interrelated and range from concerns about data ownership to the ‘datafication’ of society, privacy dilemmas and the potential trade-off between privacy and progress in data analytics, social cooling, dataveillance, discriminatory practices and the emerging Big Data divide. They also include, for instance, data-driven business ethics violations, the ‘data trust deficit’, concerns about the use of Big Data in the public sector and the desirable role of government in fair policy development and the provision of enhanced public services.

These and similar issues need greater ethical engagement and reflection, within an interdependent ecosystem of different and complementary competences (primarily legislators, data-driven businesses, IT developers and data scientists, civil society organisations and academia), in order to arrive at a Big Data market that fully respects human dignity and citizens’ rights and can continue to develop in an ethically acceptable way.

The fruitful development of this ecosystem might also require adjusting familiar conceptual models and archetypes of research ethics, to better align them with the epistemic conditions of Big Data and data analytics work. The envisioned alignment should also reflect on the shift towards algorithmic knowledge production, in order to identify and address possible mismatches between Big Data research and the extant research ethics regimes. In parallel, inquiry should move beyond traditional categories of harm (e.g. physical pain and psychological distress) to cover other types and forms (e.g. the effects of constant surveillance on human behaviour and dignity, and group discrimination). Likewise, the concept of the human subject and the related foundational assumptions should be revisited to include not only individuals, but also distributed groupings or classifications.

The need to productively rethink some concepts of research ethics and regulation, prompted by the development of large-scale data analytics, represents an opportunity to reaffirm the basic principles and values of human dignity, respect, transparency, accountability and justice. The ultimate aim is to help shape the future trajectory of the Big Data revolution, and its interplay with AI breakthroughs, in a way that is truly responsive to foundational ethical principles.

COM (2020b). This communication is part of a wider package of strategic documents, including COM (2020a) and the Communication on Shaping Europe’s digital future.

The volume of data produced is growing quickly, from 33 zettabytes in 2018 to an expected 175 zettabytes worldwide in 2025 (IDC, 2018).

European Data Protection Supervisor, Opinion 3/2020 on the European strategy for data. In the same document, the EDPS applauds the EC’s commitment to ensuring that European fundamental rights and values underpin all aspects of the data strategy and its implementation.

A valuable description and classification of PETs and of their role is provided in the deliverables of the e-SIDES project (https://e-sides.eu/e-sides-project). In the e-SIDES Deliverable D3.2 and the related White Paper, the overview of existing PETs is accompanied by a methodology for assessing them against legal and ethical implications, based on interviews and desk research: it provides, on the one hand, a technology-specific assessment of selected classes of PETs and, on the other, a more general assessment of such technologies.

In particular, within the call H2020-ICT-2019-2, topic ICT-13-2018-2019 ‘Supporting the emergence of data markets and the data economy’. Further information on DataVaults can be retrieved at https://www.datavaults.eu/.

Autonomous and Intelligent Systems.

See, for instance, the EC’s COM (2019) and EFFRA (2013, 2020).

An interesting reading on the risk of racial profiling that might be generated by new technological tools and methods, such as Big Data, automated decision-making and AI, is the ‘General recommendation No. 36 (2020) on preventing and combating racial profiling by law enforcement officials’, released by the United Nations’ Committee on the Elimination of Racial Discrimination (2020) on 17 December.

https://www.socialcooling.com/

https://cordis.europa.eu/project/rcn/206179_it.html

Closed-Circuit Television.

Global Positioning System.

Automated Teller Machine.

An interesting reading on this topic is AA.VV (2016).

Business to Business.

Application Programming Interfaces.

https://www.internationaldataspaces.org/

https://smart-cities-marketplace.ec.europa.eu/news/new-initiative-citizen-control-personal-data-within-citizen-focus-action-cluster. This initiative is committed to seeking to remove existing constraints and to helping create the conditions and relationships whereby ‘the citizen will be willing to share personal data with a city and with other actors in the data economy. The ambition behind this new initiative is to give the smart cities movement a boost by providing cities with access to a rich personal data pool. This pool of data, in turn, would stimulate further activity within the data economy, accelerate the take-up of urban data platforms and contribute to the improvement of mobility, health, energy efficiency and better governance among other’.

AA.VV. (2016). Data ownership and access to data. Position statement of the Max Planck Institute for Innovation and Competition of 16 August 2016 on the current European debate.

Andrejevic, M. (2014). The Big Data divide. International Journal of Communication, 8, 1673–1689.

Bailly, M., & Manyika, J. (2013). Is manufacturing ‘cool’ again? McKinsey Global Institute. Retrieved from https://www.brookings.edu/opinions/is-manufacturing-cool-again/

BDVA. (2020). Big Data challenges in smart manufacturing industry. A white paper on digital Europe Big Data challenges for smart manufacturing industry. Retrieved from https://bdva.eu/sites/default/files/BDVA_SMI_Discussion_Paper_Web_Version.pdf. Accessed on July 26, 2021.

BDVA Position Paper. (2019, April). Towards a European data sharing space – Enabling data exchange and unlocking AI potential. Retrieved from https://www.bdva.eu/node/1277. Accessed on July 26, 2021.

COM. (2019). 168 final “Building trust in human-centric artificial intelligence”. Retrieved from https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:52019DC0168. Accessed on July 26, 2021.

COM. (2020a). 65 final “White paper on artificial intelligence – A European approach to excellence and trust”. Retrieved from https://ec.europa.eu/info/sites/default/files/commission-white-paper-artificial-intelligence-feb2020_en.pdf. Accessed on July 26, 2021.

COM. (2020b). 66 final “A European strategy for data”. Retrieved from https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52020DC0066. Accessed on July 26, 2021.

COM. (2020c). 767 final “Proposal for a regulation of the European Parliament and of the Council on European data governance (Data Governance Act)”. Retrieved from https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52020PC0767. Accessed on July 26, 2021.

DataVaults Project. Retrieved from https://www.datavaults.eu/

EFFRA. (2013). Factories of the future multi-annual roadmap for the contractual PPP under Horizon 2020. Retrieved from https://www.effra.eu/sites/default/files/190312_effra_roadmapmanufacturingppp_eversion.pdf

EFFRA. (2020). Vision for a manufacturing partnership in Horizon Europe 2021–2027.

ENISA. (2015). Privacy by design in Big Data. An overview of privacy enhancing technologies in the era of Big Data analytics. Retrieved from www.enisa.europa.eu. Accessed on July 26, 2021.

e-SIDES Deliverable D3.2 and White Paper. (2018). How effective are privacy-enhancing technologies in addressing ethical and societal issues? Retrieved from https://e-sides.eu/resources. Accessed on July 26, 2021.

e-SIDES Project. Retrieved from https://e-sides.eu/e-sides-project

European Data Protection Supervisor. Opinion 3/2020 on the European strategy for data. Retrieved from https://edps.europa.eu/sites/default/files/publication/20-06-16_opinion_data_strategy_en.pdf

European Data Protection Supervisor. Opinion 4/2015. Towards a new digital ethics. Data, dignity and technology, 11 September 2015. Retrieved from https://edps.europa.eu/sites/edp/files/publication/15-09-11_data_ethics_en.pdf

Executive Office of the President. (2014). Big Data: Seizing opportunities, preserving values. Retrieved from https://obamawhitehouse.archives.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf. Accessed on July 26, 2021.

Favaretto, M., De Clercq, E., & Elger, B. S. (2019). Big Data and discrimination: Perils, promises and solutions. A systematic review. Journal of Big Data, 6(1), 12.

IDSA. (2019). Reference architecture model, version 3.0. Retrieved from https://internationaldataspaces.org/use/reference-architecture/. Accessed on July 26, 2021.

Kagermann, H., & Wahlster, W. (2013). Securing the future of German manufacturing industry: Recommendations for implementing the strategic initiative Industrie 4.0. Final report of the Industrie 4.0 Working Group, Acatech – National Academy of Science and Engineering, Germany. Retrieved from https://www.academia.edu/36867338/Securing_the_future_of_German_manufacturing_industry_Recommendations_for_implementing_the_strategic_initiative_INDUSTRIE_4_0_Final_report_of_the_Industrie_4_0_Working_Group. Accessed on July 26, 2021.

Meijer, A., & Wessels, M. (2019). Predictive policing: Review of benefits and drawbacks. International Journal of Public Administration, 42(12), 1031–1039.

Metcalf, J., & Crawford, K. (2016). Where are human subjects in Big Data research? The emerging ethics divide. Big Data & Society, 3(1), 1–14.

Reichert, P. (2014). Comarch EDI platform case study: The advanced electronic data interchange hub as a supply-chain performance booster. In P. Golinska (Ed.), Logistics operations, supply chain management and sustainability (pp. 143–155). Cham: Springer.

Reinsel, R., Gantz, J., & Rydning, J. (2018). Data Age 2025. The digitization of the world. From edge to core. An IDC White Paper. Retrieved from https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

Rubinstein, I. S. (2013). Big Data: The end of privacy or a new beginning? International Data Privacy Law, 3(2), 74–87.

Smart Cities Marketplace Initiative within the Citizen Focus Action Cluster. (2021). Citizen control of personal data. Retrieved from https://smart-cities-marketplace.ec.europa.eu/news/new-initiative-citizen-control-personal-data-within-citizen-focus-action-cluster

Tene, O., & Polonetsky, J. (2013). Big Data for all: Privacy and user control in the age of analytics. Northwestern Journal of Technology and Intellectual Property, 11(5), 237–273.

United Nations’ Committee on the Elimination of Racial Discrimination. (2020). General recommendation No. 36 on preventing and combating racial profiling by law enforcement officials. Retrieved from https://digitallibrary.un.org/record/3897913. Accessed on July 26, 2021.

van Brakel, R. (2016). Pre-emptive Big Data surveillance and its (dis)empowering consequences: The case of predictive policing. In B. van der Sloot, D. Broeders, & E. Schrijvers (Eds.), Exploring the boundaries of Big Data (pp. 117–141). Amsterdam: Amsterdam University Press.

von der Leyen, U. (2019). A union that strives for more. My agenda for Europe. Retrieved from https://ec.europa.eu/info/sites/info/files/political-guidelines-next-commission_en_0.pdf. Accessed on July 26, 2021.
