
Python Case Studies

Machine learning business case studies solved using Python. These are examples of how you can solve similar use cases for your own projects and deploy the models into production.

I have discussed the following points in each of the case studies:

  • How to explore the given data?
  • How to perform data pre-processing (missing values, outliers, transformations, etc.)?
  • How to create new columns based on existing columns (feature engineering)?
  • How to select the best columns for machine learning (feature selection)?
  • How to find the best ML algorithm for the given data?
  • How to tune the predictive models?
  • How to deploy predictive models into production?
  • What happens after model deployment?

Time Series Use Cases

  • Time Series Forecasting: Forecasting monthly sales quantity for the Superstore dataset.

NLP Use Cases

  • TF-IDF Text Classification: Support ticket classification using TF-IDF vectorization.
  • Sentiment Analysis using BERT: Finding the sentiment of Indigo flight passengers based on their tweets.
  • Transfer Learning using GloVe: Microsoft ticket classification using GloVe.
  • Text Classification using Word2Vec: How to create classification models on text data using word2vec encodings.

Regression Use Cases

  • Zomato Restaurant Rating: How to predict the future rating of a restaurant with an ML model. A case study in Python.
  • Predicting Diamond Prices: Creating an ML model to predict the apt price of a given diamond.
  • Evaluating Old Car Prices: Predicting the right price for an old car using Python machine learning.
  • Bike Rental Demand Prediction: Creating an ML model to forecast the demand for rental bikes for every hour of the day.
  • Computer Price Prediction: Estimating the price of a computer based on its specs.
  • Concrete Strength Prediction: How strong will this concrete be? Predicting the strength of concrete based on its mixture details.
  • Boston Housing Price Prediction: House price prediction case study on the famous Boston dataset.

Classification Use Cases

  • Loan Classification: A predictive model to approve or reject a new loan application.
  • German Credit Risk: Classifying a loan as a potential risk or safe for the bank.
  • Salary Band Classification: Identifying whether a person deserves a salary of more than $50,000.
  • Titanic Survival: A case study of what types of passengers survived the Titanic disaster.
  • Advertisement Click Prediction: A case study to predict whether a user will click on advertisements.

Deep Learning Use Cases

  • ANN Regression: Creating an artificial neural network model for regression.
  • ANN Classification: Creating an artificial neural network model for classification.
  • LSTM: Predicting Infosys stock price using a Long Short-Term Memory network.
  • CNN: Creating a face recognition model using a convolutional neural network.

10 Real World Data Science Case Studies Projects with Example

Top 10 Data Science Case Studies Projects with Examples and Solutions in Python to inspire your data science learning in 2023.


Data science has been a trending buzzword in recent times. With wide applications in sectors like healthcare, education, retail, transportation, media, and banking, data science applications are at the core of pretty much every industry out there. The possibilities are endless: analysis of fraud in the finance sector or the personalization of recommendations for eCommerce businesses. We have developed ten exciting data science case studies to explain how data science is leveraged across various industries to make smarter decisions and develop innovative personalized products tailored to specific customers.


Table of Contents

  • Data science case studies in retail
  • Data science case study examples in the entertainment industry
  • Data analytics case study examples in the travel industry
  • Case studies for data analytics in social media
  • Real world data science projects in healthcare
  • Data analytics case studies in oil and gas
  • What is a case study in data science?
  • How do you prepare a data science case study?
  • 10 most interesting data science case studies with examples


So, without much ado, let's get started with data science business case studies!

1) Walmart

With humble beginnings as a simple discount retailer, Walmart today operates 10,500 stores and clubs in 24 countries, along with eCommerce websites, and employs around 2.2 million people around the globe. For the fiscal year ended January 31, 2021, Walmart's total revenue was $559 billion, a growth of $35 billion driven by the expansion of its eCommerce sector. Walmart is a data-driven company that works on the principle of 'Everyday low cost' for its consumers. To achieve this goal, it depends heavily on its data science and analytics department for research and development, also known as Walmart Labs. Walmart is home to the world's largest private cloud, which can manage 2.5 petabytes of data every hour. To analyze this humongous amount of data, Walmart created 'Data Café,' a state-of-the-art analytics hub located within its Bentonville, Arkansas headquarters. The Walmart Labs team invests heavily in building and managing technologies like cloud, data, DevOps, infrastructure, and security.


Walmart is experiencing massive digital growth as the world's largest retailer. Walmart has been leveraging big data and advances in data science to build solutions that enhance, optimize, and customize the shopping experience and serve its customers in a better way. At Walmart Labs, data scientists are focused on creating data-driven solutions that power the efficiency and effectiveness of complex supply chain management processes. Here are some of the applications of data science at Walmart:

i) Personalized Customer Shopping Experience

Walmart analyses customer preferences and shopping patterns to optimize the stocking and displaying of merchandise in its stores. Analysis of big data also helps Walmart understand new-item sales, decide which products to discontinue, and evaluate the performance of brands.

ii) Order Sourcing and On-Time Delivery Promise

Millions of customers view items on Walmart.com, and Walmart provides each customer a real-time estimated delivery date for the items purchased. Walmart runs a backend algorithm that estimates this based on the distance between the customer and the fulfillment center, inventory levels, and shipping methods available. The supply chain management system determines the optimum fulfillment center based on distance and inventory levels for every order. It also has to decide on the shipping method to minimize transportation costs while meeting the promised delivery date.
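To make the idea concrete, here is a minimal sketch, under invented assumptions, of how an order-sourcing rule might pick a fulfillment center: filter to centers that have stock and can meet the promised date, then take the cheapest. The class, the numbers, and the cost model are all illustrative, not Walmart's actual algorithm.

```python
# Hypothetical sketch of fulfillment-center selection; all data is invented.
from dataclasses import dataclass

@dataclass
class Center:
    name: str
    distance_km: float
    in_stock: bool
    transit_days: int
    cost_per_km: float

def pick_center(centers, promised_days):
    """Return the cheapest center that has inventory and ships in time."""
    feasible = [c for c in centers
                if c.in_stock and c.transit_days <= promised_days]
    if not feasible:
        return None  # fall back to splitting the order, delaying, etc.
    return min(feasible, key=lambda c: c.distance_km * c.cost_per_km)

centers = [
    Center("Dallas", 320, True, 2, 0.05),
    Center("Memphis", 150, True, 4, 0.04),
    Center("Reno", 900, False, 1, 0.06),
]
best = pick_center(centers, promised_days=3)
print(best.name if best else "no feasible center")  # -> Dallas
```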


iii) Packing Optimization 

Box recommendation is a daily task in the shipping of items in retail and eCommerce businesses. Whenever the items of an order, or of multiple orders placed by the same customer, are picked from the shelf and are ready for packing, Walmart's recommender system determines the best-sized box that holds all the ordered items with the least wasted in-box space, within a fixed amount of time. This is the Bin Packing Problem, a classic NP-hard problem familiar to data scientists.
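As a rough illustration of the underlying problem, here is a minimal first-fit decreasing heuristic for one-dimensional bin packing. Real box-recommendation systems reason about 3-D dimensions, box inventory, and latency budgets; this sketch only shows the greedy core idea.

```python
# First-fit decreasing heuristic for 1-D bin packing (illustrative only).

def first_fit_decreasing(item_volumes, box_capacity):
    """Pack items into as few boxes as possible (greedy heuristic)."""
    free = []    # remaining free volume per open box
    packed = []  # item volumes assigned to each box
    for item in sorted(item_volumes, reverse=True):
        for i, space in enumerate(free):
            if item <= space:
                free[i] -= item
                packed[i].append(item)
                break
        else:  # no existing box fits: open a new one
            free.append(box_capacity - item)
            packed.append([item])
    return packed

order = [4.0, 8.0, 1.0, 4.0, 2.0, 1.0]  # item volumes in liters
print(first_fit_decreasing(order, box_capacity=10.0))
# -> [[8.0, 2.0], [4.0, 4.0, 1.0, 1.0]]
```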

Here is a link to a sales prediction data science case study to help you understand the applications of data science in the real world. The Walmart Sales Forecasting Project uses historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and you must build a model to project the sales for each department in each store. This data science case study aims to create a predictive model to predict the sales of each product. You can also try your hand at the Inventory Demand Forecasting Data Science Project to develop a machine learning model that forecasts inventory demand accurately based on historical sales data.


2) Amazon

Amazon is an American multinational technology company based in Seattle, USA. It started as an online bookseller, but today it focuses on eCommerce, cloud computing, digital streaming, and artificial intelligence. It hosts an estimated 1,000,000,000 gigabytes of data across more than 1,400,000 servers. Through its constant innovation in data science and big data, Amazon is always ahead in understanding its customers. Here are a few data analytics case study examples at Amazon:

i) Recommendation Systems

Data science models help Amazon understand customers' needs and recommend products before the customer even searches for them; these models use collaborative filtering. Amazon uses data from 152 million customer purchases to help users decide which products to buy. The company generates 35% of its annual sales using the recommendation-based systems (RBS) method.

Here is a Recommender System Project to help you build a recommendation system using collaborative filtering. 
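For a feel of how collaborative filtering works, here is a toy item-based sketch on a made-up user-item rating matrix. The products and ratings are invented, and production systems operate on vastly larger, sparse matrices.

```python
# Toy item-based collaborative filtering; all data is invented.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# User-item rating matrix (0 = not rated)
ratings = pd.DataFrame(
    {"book": [5, 4, 0, 0], "kettle": [0, 0, 5, 4],
     "toaster": [0, 1, 4, 5], "novel": [4, 5, 0, 1]},
    index=["u1", "u2", "u3", "u4"],
)

# Item-item similarity from the pattern of user ratings
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

def recommend(user, k=2):
    """Score each unrated item by its similarity to the user's rated items."""
    user_ratings = ratings.loc[user]
    rated = user_ratings[user_ratings > 0]
    unrated = user_ratings[user_ratings == 0].index
    scores = item_sim.loc[unrated, rated.index].dot(rated)
    return scores.sort_values(ascending=False).head(k)

print(recommend("u1"))  # scores for u1's unrated items, highest first
```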

ii) Retail Price Optimization

Amazon product prices are optimized by a predictive model that determines the best price, so that users do not refuse to buy an item because of its price. The model determines optimal prices by considering the customers' likelihood of purchasing the product and how the price will affect their future buying patterns. The price of a product is determined according to your activity on the website, competitors' pricing, product availability, item preferences, order history, expected profit margin, and other factors.

Check Out this Retail Price Optimization Project to build a Dynamic Pricing Model.

iii) Fraud Detection

Being a significant eCommerce business, Amazon remains at high risk of retail fraud. As a preemptive measure, the company collects historical and real-time data for every order. It uses Machine learning algorithms to find transactions with a higher probability of being fraudulent. This proactive measure has helped the company restrict clients with an excessive number of returns of products.

You can look at this Credit Card Fraud Detection Project to implement a fraud detection model to classify fraudulent credit card transactions.


Let us explore data analytics case study examples in the entertainment industry.


3) Netflix

Netflix started as a DVD rental service in 1997 and has since expanded into the streaming business. Headquartered in Los Gatos, California, Netflix is the largest content streaming company in the world. Currently, Netflix has over 208 million paid subscribers worldwide, and with streaming supported on thousands of smart devices, around 3 billion hours of content are watched every month. The secret to this massive growth and popularity is Netflix's advanced use of data analytics and recommendation systems to provide personalized and relevant content to its users. Netflix collects data from over 100 billion events every day. Here are a few examples of data analytics case studies applied at Netflix:

i) Personalized Recommendation System

Netflix uses over 1,300 recommendation clusters based on consumer viewing preferences to provide a personalized experience. The data Netflix collects from its users includes viewing time, keyword searches on the platform, and metadata related to content abandonment, such as pauses, rewinds, and rewatches. Using this data, Netflix can predict what a viewer is likely to watch and give the user a personalized watchlist. Some of the algorithms used by the Netflix recommendation system are Personalized Video Ranking, the Trending Now ranker, and the Continue Watching ranker.

ii) Content Development using Data Analytics

Netflix uses data science to analyze the behavior and patterns of its users to recognize themes and categories that the masses prefer to watch. This data is used to produce shows like The Umbrella Academy, Orange Is the New Black, and The Queen's Gambit. Such shows might seem like huge risks, but the decisions behind them were grounded in data analytics, which assured Netflix that they would succeed with its audience. Data analytics helps Netflix come up with content that its viewers want to watch even before they know they want to watch it.

iii) Marketing Analytics for Campaigns

Netflix uses data analytics to find the right time to launch shows and ad campaigns for maximum impact on the target audience. Marketing analytics also helps produce different trailers and thumbnails for different groups of viewers. For example, the House of Cards Season 5 trailer with a giant American flag was launched during the American presidential elections, as it would resonate well with the audience.

Here is a Customer Segmentation Project using association rule mining to understand the primary grouping of customers based on various parameters.


4) Spotify

In a world where purchasing music is a thing of the past and streaming is the current trend, Spotify has emerged as one of the most popular streaming platforms. With 320 million monthly users, around 4 billion playlists, and approximately 2 million podcasts, Spotify leads the pack among well-known streaming platforms like Apple Music, Wynk, Songza, Amazon Music, etc. The success of Spotify has depended largely on data analytics. By analyzing massive volumes of listener data, Spotify provides real-time and personalized services to its listeners. Most of Spotify's revenue comes from paid premium subscriptions. Here are some examples of data analytics case studies at Spotify:

i) Personalization of Content using Recommendation Systems

Spotify uses BART (Bayesian Additive Regression Trees) to generate music recommendations for its listeners in real time. BART ignores any song a user listens to for less than 30 seconds, and the model is retrained every day to provide updated recommendations. A patent recently granted to Spotify covers an AI application that identifies a user's musical tastes based on audio signals, gender, age, and accent to make better music recommendations.

Spotify creates daily playlists for its listeners based on their taste profiles, called 'Daily Mixes,' which contain songs the user has added to playlists or songs by artists the user has included in playlists. They also include new artists and songs that the user might be unfamiliar with but that might improve the playlist. Similar to these are the weekly 'Release Radar' playlists, which contain newly released songs by artists the listener follows or has liked before.

ii) Targeted Marketing through Customer Segmentation

Having collected user data to enhance personalized song recommendations, Spotify also uses this massive dataset for targeted ad campaigns and personalized service recommendations. Spotify uses ML models to analyze listener behavior and group listeners based on music preferences, age, gender, ethnicity, etc. These insights help create ad campaigns for a specific target audience. One of its well-known campaigns was the meme-inspired ads for potential target customers, which were a huge success globally.

iii) CNNs for Classification of Songs and Audio Tracks

Spotify builds audio models to evaluate songs and tracks, which helps develop better playlists and recommendations for its users. These allow Spotify to filter new tracks based on their lyrics and rhythms and recommend them to users who like similar tracks (collaborative filtering). Spotify also uses NLP (natural language processing) to scan articles and blogs and analyze the words used to describe songs and artists. These analytical insights help group and identify similar artists and songs and can be leveraged to build playlists.

Here is a Music Recommender System Project for you to start learning. We have listed another music recommendations dataset for you to use for your projects: Dataset1. You can use this dataset of Spotify metadata to classify songs based on artist, mood, and liveliness. Plot histograms and heatmaps to get a better understanding of the dataset, and use algorithms like logistic regression, SVM, and principal component analysis to generate valuable insights from it.
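As a starting point, here is a hedged sketch of that classification exercise. The feature names mirror typical Spotify metadata columns (danceability, energy, valence, tempo), but the data below is synthetic, and the "happy" labeling rule is an assumption made purely for illustration.

```python
# Mood classification on synthetic, Spotify-style audio features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0, 1, n),     # danceability
    rng.uniform(0, 1, n),     # energy
    rng.uniform(0, 1, n),     # valence
    rng.uniform(60, 200, n),  # tempo (BPM)
])
# Synthetic label: call a track "happy" when valence plus energy is high
y = ((X[:, 2] + X[:, 1]) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```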


Below you will find case studies for data analytics in the travel and tourism industry.

5) Airbnb

Airbnb was born in 2007 in San Francisco and has since grown to 4 million hosts and 5.6 million listings worldwide, welcoming more than 1 billion guest arrivals in almost every country across the globe. Airbnb is active in every country on the planet except Iran, Sudan, Syria, and North Korea; that is around 97.95% of the world. Treating data as the voice of its customers, Airbnb uses its large volume of customer reviews and host inputs to understand trends across communities, rate user experiences, and make informed decisions to build a better business model. The data scientists at Airbnb develop exciting new solutions to boost the business and find the best match between customers and hosts, creating a supreme customer experience. Airbnb's data servers serve approximately 10 million requests a day and process around one million search queries.

i) Recommendation Systems and Search Ranking Algorithms

Airbnb helps people find 'local experiences' in a place with the help of search algorithms that make searches and listings precise. Airbnb uses a 'listing quality score' to find homes based on the proximity to the searched location and uses previous guest reviews. Airbnb uses deep neural networks to build models that take the guest's earlier stays into account and area information to find a perfect match. The search algorithms are optimized based on guest and host preferences, rankings, pricing, and availability to understand users’ needs and provide the best match possible.

ii) Natural Language Processing for Review Analysis

Airbnb characterizes data as the voice of its customers. The customer and host reviews give a direct insight into the experience, but star ratings alone cannot capture it quantitatively. Hence Airbnb uses natural language processing to understand reviews and the sentiments behind them. The NLP models are developed using convolutional neural networks.

Practice this Sentiment Analysis Project for analyzing product reviews to understand the basic concepts of natural language processing.

iii) Smart Pricing using Predictive Analytics

The Airbnb host community uses the service as supplementary income. The vacation homes and guest houses rented to customers raise local community earnings, as Airbnb guests stay 2.4 times longer and spend approximately 2.3 times as much money as a hotel guest. These earnings have a significant positive impact on the local neighborhood community. Airbnb uses predictive analytics to predict the prices of listings and help hosts set a competitive and optimal price. The overall profitability of an Airbnb host depends on factors like the time invested by the host and responsiveness to changing demand across seasons. The factors that impact real-time smart pricing are the location of the listing, proximity to transport options, season, and amenities available in the neighborhood.

Here is a Price Prediction Project to help you understand the concept of predictive analytics, which is common in case studies for data analytics.
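In the same spirit, here is a minimal pricing-regression sketch: predict a nightly price from a few listing features. The data is synthetic, and the chosen features (distance to transit, amenities, season) are assumptions, not Airbnb's actual model inputs.

```python
# Synthetic nightly-price regression; features and coefficients are invented.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
dist_to_transit = rng.uniform(0, 5, n)    # km to nearest transit
amenities_score = rng.integers(0, 10, n)  # count of amenities
peak_season = rng.integers(0, 2, n)       # 1 = high season
price = (120 - 10 * dist_to_transit + 6 * amenities_score
         + 30 * peak_season + rng.normal(0, 15, n))

X = np.column_stack([dist_to_transit, amenities_score, peak_season])
X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print(f"R^2 on held-out listings: {model.score(X_te, y_te):.2f}")
```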

6) Uber

Uber is the biggest global taxi service provider. As of December 2018, Uber had 91 million monthly active consumers and 3.8 million drivers, and it completes 14 million trips each day. Uber uses data analytics and big-data-driven technologies to optimize its business processes and provide enhanced customer service. The data science team at Uber has constantly been exploring futuristic technologies to provide better service. Machine learning and data analytics help Uber make data-driven decisions that enable benefits like ride-sharing, dynamic price surges, better customer support, and demand forecasting. Here are some of the real world data science projects used by Uber:

i) Dynamic Pricing for Price Surges and Demand Forecasting

Uber prices change at peak hours based on demand. Uber uses surge pricing to encourage more cab drivers to sign up with the company and meet passenger demand. When prices increase, both the driver and the passenger are informed about the surge. Uber uses a patented predictive model for price surges called 'Geosurge,' based on ride demand and location.

ii) One-Click Chat

Uber has developed a machine learning and natural language processing solution called One-Click Chat (OCC) for coordination between drivers and users. This feature anticipates responses to commonly asked questions, making it easy for drivers to respond to customer messages with the click of just one button. One-Click Chat is built on Uber's machine learning platform Michelangelo to perform NLP on rider chat messages and generate appropriate responses.

iii) Customer Retention

Failure to meet customer demand for cabs could lead users to opt for other services. Uber uses machine learning models to bridge this demand-supply gap: by predicting demand in any location, Uber retains its customers. Uber also uses a tier-based reward system, which segments customers into different levels based on usage; the higher the level a user achieves, the better the perks. Uber also provides personalized destination suggestions based on the user's history and frequently traveled destinations.

You can take a look at this Python Chatbot Project and build a simple chatbot application to better understand the techniques used for natural language processing. You can also practice the workings of a demand forecasting model with this project using time series analysis, and look at this project which uses time series forecasting and clustering on a dataset containing geospatial data to forecast customer demand for Ola rides.


7) LinkedIn 

LinkedIn is the largest professional social networking site with nearly 800 million members in more than 200 countries worldwide. Almost 40% of the users access LinkedIn daily, clocking around 1 billion interactions per month. The data science team at LinkedIn works with this massive pool of data to generate insights to build strategies, apply algorithms and statistical inferences to optimize engineering solutions, and help the company achieve its goals. Here are some of the real world data science projects at LinkedIn:

i) Search Algorithms and Recommendation Systems in LinkedIn Recruiter

LinkedIn Recruiter helps recruiters build and manage a talent pool to optimize the chances of hiring candidates successfully. This sophisticated product works on search and recommendation engines. LinkedIn Recruiter handles complex queries and filters on a constantly growing large dataset, and the results delivered have to be relevant and specific. The initial search model was based on linear regression but was eventually upgraded to gradient boosted decision trees to capture non-linear correlations in the dataset. In addition to these models, LinkedIn Recruiter also uses a Generalized Linear Mixed model to improve prediction and give personalized results.

ii) Recommendation Systems Personalized for News Feed

The LinkedIn news feed is the heart and soul of the professional community. A member's newsfeed is a place to discover conversations among connections, career news, posts, suggestions, photos, and videos. Every time a member visits LinkedIn, machine learning algorithms identify the best exchanges to be displayed on the feed by sorting through posts and ranking the most relevant results on top. The algorithms help LinkedIn understand member preferences and help provide personalized news feeds. The algorithms used include logistic regression, gradient boosted decision trees and neural networks for recommendation systems.

iii) CNNs to Detect Inappropriate Content

Providing a professional space where people can trust and express themselves in a safe community has been a critical goal at LinkedIn. LinkedIn has invested heavily in building solutions to detect fake accounts and abusive behavior on its platform. Any form of spam, harassment, or inappropriate content is immediately flagged and taken down; these can range from profanity to advertisements for illegal services. LinkedIn uses a convolutional neural network (CNN) based machine learning model. This classifier trains on a dataset of accounts labeled as either "inappropriate" or "appropriate." The inappropriate list consists of accounts containing content from "blocklisted" phrases or words, plus a small portion of manually reviewed accounts reported by the user community.

Here is a Text Classification Project to help you understand NLP basics for text classification. You can find a news recommendation system dataset to help you build a personalized news recommender system. You can also use this dataset to build a classifier using logistic regression, Naive Bayes, or Neural networks to classify toxic comments.
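As a toy version of that last exercise, here is a TF-IDF plus logistic regression sketch for flagging inappropriate text. The comments are invented, and this stands in for, rather than reproduces, LinkedIn's CNN-based classifier.

```python
# Tiny inappropriate-text classifier; the training comments are made up,
# so predictions are only illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "great insights, thanks for sharing",
    "congratulations on the new role",
    "buy followers cheap, click this link now",
    "you are a total idiot and a fraud",
]
labels = [0, 0, 1, 1]  # 1 = inappropriate

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, labels)

print(clf.predict(["thanks for the thoughtful comment",
                   "click this link to buy followers"]))  # likely [0 1]
```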


8) Pfizer

Pfizer is a multinational pharmaceutical company headquartered in New York, USA. It is one of the largest pharmaceutical companies globally, known for developing a wide range of medicines and vaccines in disciplines like immunology, oncology, cardiology, and neurology. Pfizer became a household name in 2020 when it was the first company to receive FDA emergency use authorization for a COVID-19 vaccine, and in early November 2021 the CDC approved the Pfizer vaccine for kids aged 5 to 11. Pfizer has been using machine learning and artificial intelligence to develop drugs and streamline trials, which played a massive role in developing and deploying the COVID-19 vaccine. Here are a few data analytics case studies at Pfizer:

i) Identifying Patients for Clinical Trials

Artificial intelligence and machine learning are used to streamline and optimize clinical trials and increase their efficiency. Natural language processing and exploratory data analysis of patient records can help identify suitable patients for clinical trials, for example patients with distinct symptoms. They can also help examine interactions of potential trial members' specific biomarkers and predict drug interactions and side effects, which helps avoid complications. Pfizer's AI implementation helped rapidly identify signals within the noise of millions of data points across its 44,000-candidate COVID-19 clinical trial.

ii) Supply Chain and Manufacturing

Data science and machine learning techniques help pharmaceutical companies better forecast demand for vaccines and drugs and distribute them efficiently. Machine learning models can help identify efficient supply systems by automating and optimizing production steps, which will help supply drugs customized to small pools of patients in specific gene pools. Pfizer uses machine learning to predict the maintenance cost of the equipment used; predictive maintenance using AI is the next big step for pharmaceutical companies to reduce costs.

iii) Drug Development

Computer simulations of proteins, tests of their interactions, and yield analysis help researchers develop and test drugs more efficiently. In 2016, Watson Health and Pfizer announced a collaboration to utilize IBM Watson for Drug Discovery to help accelerate Pfizer's research in immuno-oncology, an approach to cancer treatment that uses the body's immune system to help fight cancer. Deep learning models have recently been used for bioactivity and synthesis prediction for drugs and vaccines, in addition to molecular design. Deep learning has been a revolutionary technique for drug discovery, as it factors in everything from new applications of medications to possible toxic reactions, which can save millions in drug trials.

You can create a machine learning model to predict molecular activity to help design medicine using this dataset. You may build a CNN or a deep neural network for this data analytics case study project.


9) Shell Data Analyst Case Study Project

Shell is a global group of energy and petrochemical companies with over 80,000 employees in around 70 countries. Shell uses advanced technologies and innovations to help build a sustainable energy future. Shell is going through a significant transition, aiming to become a clean energy company by 2050 as the world needs more and cleaner energy solutions, and this requires substantial changes in the way energy is produced and used. Digital technologies, including AI and machine learning, play an essential role in this transformation. These include more efficient exploration and energy production, more reliable manufacturing, more nimble trading, and a personalized customer experience. Using AI in various phases of the organization will help Shell achieve this goal and stay competitive in the market. Here are a few data analytics case studies in the petrochemical industry:

i) Precision Drilling

Shell is involved in the full oil and gas supply chain, from extracting hydrocarbons to refining the fuel to retailing it to customers. Recently Shell has adopted reinforcement learning to control the drilling equipment used in mining. Reinforcement learning works on a reward system based on the outcome of the AI model. The algorithm is designed to guide the drills as they move through the subsurface, based on historical data from drilling records, including information such as the size of drill bits, temperatures, pressures, and knowledge of seismic activity. This model helps the human operator understand the environment better, leading to better and faster results with less damage to the machinery used.

ii) Efficient Charging Terminals

Due to climate changes, governments have encouraged people to switch to electric vehicles to reduce carbon dioxide emissions. However, the lack of public charging terminals has deterred people from switching to electric cars. Shell uses AI to monitor and predict the demand for terminals to provide efficient supply. Multiple vehicles charging from a single terminal may create a considerable grid load, and predictions on demand can help make this process more efficient.

iii) Monitoring Service and Charging Stations

Another Shell initiative, trialed in Thailand and Singapore, is the use of computer vision cameras that watch for potentially hazardous activities, like lighting cigarettes in the vicinity of the pumps while refueling. The model processes the captured images and labels and classifies their content, and the algorithm can then alert the staff, reducing the risk of fires. The model could further be trained to detect rash driving or theft in the future.

Here is a project to help you understand multiclass image classification. You can use the Hourly Energy Consumption Dataset to build an energy consumption prediction model. You can use time series with XGBoost to develop your model.
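To make that last suggestion concrete, here is a hedged sketch of hourly load forecasting with XGBoost on calendar features. The series below is synthetic; swap in the Hourly Energy Consumption dataset for the real exercise, and note that production models usually add lag and rolling-window features as well.

```python
# Synthetic hourly-load forecasting with XGBoost and calendar features.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

idx = pd.date_range("2023-01-01", periods=24 * 365, freq="h")
daily = 10 * np.sin(2 * np.pi * idx.hour / 24)  # daily cycle
weekly = 5 * (idx.dayofweek < 5)                # weekday bump
load = 100 + daily + weekly + np.random.default_rng(1).normal(0, 2, len(idx))
df = pd.DataFrame({"load": load}, index=idx)

# Calendar features derived from the timestamp
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

train, test = df.iloc[:-24 * 30], df.iloc[-24 * 30:]  # hold out last 30 days
features = ["hour", "dayofweek", "month"]
model = XGBRegressor(n_estimators=200, max_depth=4)
model.fit(train[features], train["load"])
pred = model.predict(test[features])
print(f"MAE: {np.mean(np.abs(pred - test['load'])):.2f}")
```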

10) Zomato Case Study on Data Analytics

Zomato was founded in 2010 and is currently one of the most well-known food tech companies. Zomato offers services like restaurant discovery, home delivery, online table reservation, online payments for dining, etc. Zomato partners with restaurants to provide tools to acquire more customers, while also providing delivery services and easy procurement of ingredients and kitchen supplies. Currently, Zomato has over 2 lakh (200,000) restaurant partners and around 1 lakh (100,000) delivery partners, and it has completed over ten crore (100 million) delivery orders to date. Zomato uses ML and AI to boost its business growth, drawing on the massive amount of data collected over the years from food orders and user consumption patterns. Here are a few examples of data analytics case studies developed by the data scientists at Zomato:

i) Personalized Recommendation System for Homepage

Zomato uses data analytics to create personalized homepages for its users. Zomato uses data science to provide order personalization, like giving recommendations to the customers for specific cuisines, locations, prices, brands, etc. Restaurant recommendations are made based on a customer's past purchases, browsing history, and what other similar customers in the vicinity are ordering. This personalized recommendation system has led to a 15% improvement in order conversions and click-through rates for Zomato. 

You can use the Restaurant Recommendation Dataset to build a restaurant recommendation system to predict what restaurants customers are most likely to order from, given the customer location, restaurant information, and customer order history.

ii) Analyzing Customer Sentiment

Zomato uses natural language processing and machine learning to understand customer sentiment from social media posts and customer reviews. These help the company gauge the inclination of its customer base towards the brand. Deep learning models analyze the sentiment of brand mentions on social networking sites like Twitter, Instagram, LinkedIn, and Facebook. These analytics give the company insights that help build the brand and understand the target audience.

iii) Predicting Food Preparation Time (FPT)

Food preparation time is an essential variable in the estimated delivery time of an order placed through Zomato. It depends on numerous factors, like the number of dishes ordered, the time of day, footfall in the restaurant, and the day of the week. Accurate prediction of the food preparation time enables a better estimated delivery time, making delivery partners less likely to breach it. Zomato uses a bidirectional LSTM-based deep learning model that considers all these features and predicts the food preparation time for each order in real time.
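The following Keras sketch shows what a bidirectional-LSTM regressor for prep time might look like. The input representation (a short sequence of recent restaurant-load signals) and every shape and number here are illustrative assumptions, not Zomato's actual feature set.

```python
# Hypothetical bidirectional-LSTM prep-time regressor; data is random.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 12, 4  # e.g., last 12 intervals of order-level signals
X = np.random.rand(256, timesteps, n_features).astype("float32")
y = np.random.uniform(5, 40, size=(256, 1)).astype("float32")  # minutes

model = keras.Sequential([
    keras.Input(shape=(timesteps, n_features)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),  # predicted prep time in minutes
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print(model.predict(X[:1], verbose=0))  # one order's predicted prep time
```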

Data scientists are companies' secret weapons when it comes to analyzing customer sentiment and behavior and leveraging them to drive conversion, loyalty, and profits. These 10 data science case studies show how various organizations use data science technologies to succeed and stay at the top of their field. To summarize, data science has not only accelerated the performance of companies but has also made it possible to manage and sustain that performance with ease.

FAQs on Data Analysis Case Studies

What is a case study in data science?

A case study in data science is an in-depth analysis of a real-world problem using data-driven approaches. It involves collecting, cleaning, and analyzing data to extract insights and solve challenges, offering practical insights into how data science techniques can address complex issues across various industries.

How do you prepare a data science case study?

To create a data science case study, identify a relevant problem, define objectives, and gather suitable data. Clean and preprocess the data, perform exploratory data analysis, and apply appropriate algorithms for analysis. Summarize findings, visualize results, and provide actionable recommendations, showcasing the problem-solving potential of data science techniques.



Data Science Case Studies: Solved using Python

Aman Kharwal

  • February 19, 2021
  • Machine Learning

Solving a Data Science case study means analyzing and solving a problem statement intensively. Solving case studies will help you show unique and amazing data science use cases in your portfolio. In this article, I’m going to introduce you to 3 data science case studies solved and explained using Python.

Data Science Case Studies

If you’ve learned data science by taking a course or certification program, you’re still not that close to finding a job easily. The most important point of your Data Science interview is to show how you can use your skills in real use cases. Below are 3 data science case studies that will help you understand how to analyze and solve a problem. All of the data science case studies mentioned below are solved and explained using Python.

Case Study 1:  Text Emotions Detection

If you are interested in natural language processing, then this use case is for you. The idea is to train a machine learning model to generate emojis based on an input text. This machine learning model can then be used in training artificially intelligent chatbots.

Use Case: A human can express emotions in many forms, such as facial expressions, gestures, speech, and text. The detection of text emotions is a content-based classification problem. Detecting a person's emotions is a difficult task in general, and detecting emotions from text alone is even harder, precisely because a human can express the same emotion in so many forms.

Recognizing emotions from text written by a person plays an important role in applications such as chatbots, customer support forums, customer reviews, etc. So you have to train a machine learning model that can identify the emotion of a text by presenting the most relevant emoji for the input text.
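For a sense of what a first pass could look like, here is a tiny sketch that maps predicted emotions to emojis. The handful of training sentences and the emoji mapping are invented; a real solution would train on a proper labeled emotions corpus and a stronger model.

```python
# Toy text-emotion detector mapped to emojis; training data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["i am so happy today", "this made my day, wonderful",
         "i am very sad and lonely", "i feel like crying",
         "this makes me so angry", "i am furious about the delay"]
emotions = ["joy", "joy", "sadness", "sadness", "anger", "anger"]
emoji = {"joy": "😊", "sadness": "😢", "anger": "😠"}

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, emotions)

msg = "i am happy with the result"
print(msg, "->", emoji[clf.predict([msg])[0]])  # likely 😊 on this toy data
```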


Case Study 2:  Hotel Recommendation System

A hotel recommendation system typically works on collaborative filtering that makes recommendations based on ratings given by other customers in the same category as the user looking for a product.

Use Case: We all plan trips, and the first thing to do when planning a trip is finding a hotel. There are many websites recommending the best hotel for a trip. A hotel recommendation system aims to predict which hotel a user is most likely to choose among all available hotels, so you have to build a system that helps the user book the best hotel out of all the options. We can do this using customer reviews.

For example, suppose you want to go on a business trip; the hotel recommendation system should show you the hotels that other customers have rated best for business travel. Our approach, therefore, is to build a recommendation system based on customer reviews and ratings: use the ratings and reviews given by customers who belong to the same category as the user, and build a hotel recommendation system from them.
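Here is a toy sketch of exactly that approach: rank hotels by the mean rating given by reviewers in the same trip category as the user. The review table is invented for illustration.

```python
# Rank hotels by mean rating within the user's trip category; toy data.
import pandas as pd

reviews = pd.DataFrame({
    "hotel": ["Alpha", "Alpha", "Beam", "Beam", "Coral", "Coral"],
    "trip_type": ["business", "family", "business", "business",
                  "family", "business"],
    "rating": [4.5, 3.0, 4.8, 4.6, 4.9, 3.2],
})

def recommend_hotels(trip_type, top_n=2):
    """Mean rating per hotel among reviewers with the same trip type."""
    same_group = reviews[reviews["trip_type"] == trip_type]
    return (same_group.groupby("hotel")["rating"]
            .mean().sort_values(ascending=False).head(top_n))

print(recommend_hotels("business"))
# Beam tops the list: business travelers rated it highest
```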


Case Study 3:  Customer Personality Analysis

Customer analysis is one of the most important tasks for a data scientist working at a product-based company. So if you are someone who wants to join a product-based company, then this data science case study is best for you.

Use Case:   Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviours and concerns of different types of customers.

You have to perform an analysis that helps a business tailor its products to target customers in different segments. For example, instead of spending money to market a new product to every customer in the company's database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.
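A common first step for this case study is unsupervised segmentation. Below is a minimal k-means sketch on synthetic income and spending data; the two-feature setup is an assumption made purely to keep the example readable.

```python
# K-means customer segmentation on synthetic income/spend data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
income = np.concatenate([rng.normal(30, 5, 100), rng.normal(80, 8, 100)])
spend = np.concatenate([rng.normal(20, 4, 100), rng.normal(70, 6, 100)])
X = StandardScaler().fit_transform(np.column_stack([income, spend]))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label in np.unique(kmeans.labels_):
    mask = kmeans.labels_ == label
    print(f"segment {label}: mean income {income[mask].mean():.0f}k, "
          f"mean spend {spend[mask].mean():.0f}")
# Marketing can now target the high-income, high-spend segment first.
```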


These three data science case studies are based on real-world problems. The first, Text Emotions Detection, is entirely based on natural language processing, and the machine learning model you train can be used in training an AI chatbot. The second, the Hotel Recommendation System, is also based on NLP, but here you will learn how to generate recommendations using collaborative filtering. The last, Customer Personality Analysis, is for someone who wants to focus on the analysis part.

All these data science case studies are solved using Python, here are the resources where you will find these use cases solved and explained:

  • Text Emotions Detection
  • Hotel Recommendation System
  • Customer Personality Analysis

I hope you liked this article on data science case studies solved and explained using the Python programming language. Feel free to ask your valuable questions in the comments section below.


Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.


Practical Business Python

Taking care of business, one python script at a time

Sharing Your Python Case Studies

Posted by Chris Moffitt in articles   


Introduction

I would like to offer this blog as a platform for people to share their success stories with python. Over the past couple of weeks, I have had a handful of conversations related to the topic of how to get python implemented in an organization. In these conversations, I have noticed a lot of common themes related to getting the process started and sustaining it over time. Some of the key items are:

  • How do I figure out where to start?
  • What resources help newbies vs. more experienced users?
  • How do I select a good problem to tackle?
  • How do I operationalize a solution and sustain it over time?

I am hopeful that the combination of real-world case studies plus the detailed articles I have done in the past will be a helpful guide for people on this journey. Please read on for more of the back story and learn how you can help.

Situation 1

On Saturday, April 23rd, I presented at Minnebar #11. The topic of my presentation was “Escaping Excel Hell with Python and Pandas.” For those that are interested, I have placed a copy of the slides as well as my example notebook in my github repo. My presentation boiled down to a few key points:

  • People find themselves in a position where they need to solve a fairly basic data wrangling task and reach for Excel as that solution.
  • Excel is really not an ideal tool for the solution but it is the only one many people know.
  • Frequently the Excel “solution” evolves and grows over time into an unmanageable mess.
  • Python plus pandas is a really good solution to this problem.
  • If someone can build a super gnarly excel formula, they could probably learn to code python.
  • One approach to solving this problem is to train the “Excel Alpha Geek” on how they can use python to solve their problems in a better way.

Image compare

Overall, the feedback was positive and I think people enjoyed the presentation. There’s just one problem. When I asked the people in the room, “how many of you know about or use python?”, the overwhelming majority raised their hand. While it is always good to speak to a friendly audience, I feel like I was probably preaching to the choir. This group mostly knew about the python solution and would be able to evaluate its application to their needs. How do we reach people that only know VBA?

Situation 2

Through this blog, I have had the really good fortune to speak to some really smart people that are interested in the same thing I am. Basically, they feel that there is a big opportunity to introduce python into organizations and help people accomplish their jobs in a more efficient way. They have all had the experience of seeing organizations struggle with fairly simple processes because they were stuck in the Excel mindset. Many of these people have then introduced python into their workplace and seen tremendous improvements in productivity.

I have had similar experiences and here is a small example experience I had just the other day.

I asked someone to help pull some disparate data together and summarize it. The analyst (who is plenty smart) did the following tasks:

  • pulled data from 2 or 3 systems
  • exported and formatted the data for excel
  • pasted it into multiple tabs on a workbook
  • did a bunch of pivot tables, vlookups, manual manipulations and formulas to get the data to answer the question

I saw the results (which were what I was looking for) and then said: “Ok, thanks for doing this. How much time would it take for you to update this every week?” The pained look on his face confirmed my suspicions. It was probably several hours of work - based on the way the solution was built. Clearly time that he did not want to sign up for.

Since this was data I had familiarity with, I used the python+pandas approach and built a ~100 line script that does the same thing in a cleaner and more repeatable fashion. I probably spent as much time on the script as he did for the Excel creation. I do not say this to boast. I say this to highlight how much opportunity there is to streamline and improve day to day processes.
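To show the flavor of such a script (not the actual one; the file and column names below are placeholders), a pandas version of that weekly workflow can collapse to a handful of lines:

```python
# Hypothetical miniature of the ~100-line script: pull system exports,
# join them, and produce the weekly summary with one pivot.
import pandas as pd

# Each system exports a CSV; pandas replaces the copy-paste into tabs
orders = pd.read_csv("orders_export.csv")  # order_id, region, amount
customers = pd.read_csv("crm_export.csv")  # order_id, segment

merged = orders.merge(customers, on="order_id", how="left")

# One pivot_table call replaces the manual pivot tables and vlookups
summary = pd.pivot_table(merged, values="amount", index="region",
                         columns="segment", aggfunc="sum", fill_value=0)
summary.to_excel("weekly_summary.xlsx")

# Re-running next week is one command, not hours of manual rework
```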

Situation 3

As I mentioned above, I have spoken to several people working on products to help with the python deployment problem. During one of the conversations, someone mentioned that working in San Francisco gives people a distorted view of what the average workplace is really like. This person mentioned that almost everyone at a company like Facebook has the ability to write custom SQL queries against their massive database. Sure enough, I looked this up and found:

Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day. https://prestodb.io/

I don’t know about you but I certainly don’t work in an area where people write queries against Petabytes of data!

Situation 4

I was talking to someone that had recently moved to a new position at a local government entity. She is a savvy user but not a developer. Our exchange went something like this (names and acronyms changed to protect the innocent):

Me: “What are you working on in your new job?”

Amy: “I am helping them upgrade their system to Excel and Access.”

Me: “Uhh. Upgrading to Excel and Access. What in the world are they using now?”

Amy: “I don’t know. Some kind of green screen thing named BINGO.”

Amy: “Yeah, they hope to have it replaced by mid-2017.”

Me: “Oh. Ok…”

My point with these anecdotes is that there is such a disconnect between the extreme of a highly technical company like Facebook and the rest of the world just trying to do their job. It’s a huge chasm, and you cannot assume that a multi-petabyte database solution is going to work for someone trying to migrate away from a terminal solution or a heavily Excel-driven mindset.

Get To The Point

As I was thinking about these various observations, I wanted to try to draw out some common threads. I strongly believe that python is a great tool to help with these types of organizational problems but there are challenges:

  • How do we let people know that python would be a good solution?
  • Assuming they buy in to python, how do they get started?
  • How do you simply and efficiently deploy python-based solutions?

Regarding point #3, Wes McKinney wrote a good article about the challenges and the python community’s opportunity to fix this. The community has made progress. It is still a challenge but I’m hopeful people will take up Wes’ call to action.

I want to focus on points #1 and #2. I don’t know that I can build a technical solution but I think there may be an opportunity to share best practice with others and raise awareness of python and how it could be used to help people solve their day to day challenges.

A couple of weeks ago, this thread on reddit was extremely active and illustrated the interest people had in learning about real world examples of how python helped them solve a problem. There were lots of really good ideas and lots of interest in learning more.

What I would like to do is offer to help people post their solutions as case studies on this blog. The main goals would be:

  • Show concrete examples of how python helped solve a real world business problem.
  • The issue could be as big or as small as you’d like but I would lean towards solutions built by individuals or very small teams - not a massive project.
  • You can share as much or as little as you’d like.
  • Posting here would provide a level of anonymity (if desired). I think people are hesitant to talk about their work solutions for fear that someone will come after them.
  • Possible topics include organizational buy-in and change management, what went well and what didn’t, and what you would do differently.

The true value might not be in the actual sharing of code but in the ideas and processes used to solve a problem and make it scalable. In many situations, the challenges are not technical in nature.

I think there is a real need to spread this information out in a format that is less threatening to a non-programmer. If we could get some good case studies out there it might spark some ideas and help people understand how to tackle their own problems.

If you are interested in sharing your experiences, let me know. I would be more than willing to work with you to put together as much or as little detail as you would like in order to get the word out there. This can be a small but meaningful way that you could give back to the community.

So, what do you think? Put your thoughts in the comments and reach out to me if you have any great ideas.


Case Studies in Neural Data Analysis


This repository is a companion to the textbook Case Studies in Neural Data Analysis , by Mark Kramer and Uri Eden. That textbook uses MATLAB to analyze examples of neuronal data. The material here is similar, except that we use Python.

The intended audience is the practicing neuroscientist - e.g., the students, researchers, and clinicians collecting neuronal data in the hospital or lab. The material can get pretty math-heavy, but we’ve tried to outline the main concepts as directly as possible, with hands-on implementations of all concepts. We focus on only two main types of data: spike trains and electric fields (such as the local field potential [LFP], or electroencephalogram [EEG]). If you’re interested in other data (e.g., calcium imaging, or BOLD), you may still find the examples indirectly useful (for example, demonstrations of how to compute and interpret a power spectrum of a signal).

This repository was created by Emily Schlafly and Mark Kramer, with important contributions from Dr. Anthea Cheung.

Thank you to:

MIT Press for publishing the MATLAB version of this material.

NIH NIGMS R25GM114827 and NSF DMS #1451384 for support.

Quick start to learning Python for neural data analysis:

Visit the web-formatted version of the book.

Watch this 2-minute video.

Read and interact with the Python code in your web browser.

Slow start to learning Python for neural data analysis:

There are multiple ways to interact with these notebooks.

Simple: Visit the web-formatted version of the notebooks.

Intermediate: Open a notebook in Binder and interact with the notebooks through a JupyterHub server. Binder provides an easy interface to interact with this material; read about it in eLife.

Advanced: Download the notebooks and run them locally (i.e. on your own computer) in Jupyter. You’ll then be able to read, edit and execute the Python code directly in your browser and you can save any changes you make or notes that you want to record. You will need to install Python, and we recommend that you configure a Python environment as well.

Install Python

We assume you have installed Python and can get it running on your computer.

If this is your first time working with Python, using Anaconda is probably a good choice. It provides a simple, graphical interface to start Jupyter.

Configure Python


Once you have installed Anaconda or Miniconda, we recommend setting up an environment to run the notebooks. If you downloaded the repository from GitHub, then you can run make config in your terminal to configure your local environment to match the Binder environment. If you have never used the terminal before, consider using Anaconda Navigator, Anaconda’s desktop graphical user interface (GUI). The environment file we use on Binder is located in the binder folder.

Doing so will ensure that you have all the packages needed to run the notebooks. Note that you can use make clean to remove the changes made during make config.

Finally, whenever you are ready to work with the notebooks, activate your environment and start Jupyter with the jupyter notebook command.

If you prefer, you can use jupyter lab instead of jupyter notebook.

Contributions

We very much appreciate your contributions to this material. Contributions may include:

Error corrections

Suggestions

New material to include (please start from this template).

There are two ways to suggest a contribution:

Simple: Visit Case Studies Python, locate the file to edit, and follow these instructions.

Advanced: Fork Case Studies Python and submit a pull request.

If you enjoy Case-Studies-Python and would like to share your enjoyment with us, sponsor our coffee consumption here.


Essential Statistics for Data Science: A Case Study using Python, Part I


Get to know some of the essential statistics you should be very familiar with when learning data science

Our last post dove straight into linear regression. In this post, we'll take a step back to cover essential statistics that every data scientist should know. To demonstrate these essentials, we'll look at a hypothetical case study involving an administrator tasked with improving school performance in Tennessee.

You should already know:

  • Python fundamentals — learn on dataquest.io

Note that this tutorial is intended to serve solely as an educational tool and not as a scientific explanation of the causes of various school outcomes in Tennessee.

Article Resources

  • Notebook and Data: Github
  • Libraries: pandas, matplotlib, seaborn

Introduction

Meet Sally, a public school administrator. Some schools in her state of Tennessee are performing below average academically. Her superintendent, under pressure from frustrated parents and voters, approached Sally with the task of understanding why these schools are under-performing. Not an easy problem, to be sure.

To improve school performance, Sally needs to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers.

Though Sally is eager to build an impressive explanatory model, she knows the importance of conducting preliminary research to prevent possible pitfalls or blind spots (e.g., cognitive biases). Thus, she engages in a thorough exploratory analysis, which includes a lit review, data collection, descriptive and inferential statistics, and data visualization.

Sally has strong opinions as to why some schools are under-performing, but opinions won't do, nor will a handful of facts; she needs rigorous statistical evidence.

Sally conducts a lit review, which involves reading a variety of credible sources to familiarize herself with the topic. Most importantly, Sally keeps an open mind and embraces a scientific world view to help her resist confirmation bias (seeking solely to confirm one's own world view).

In Sally's lit review, she finds multiple compelling explanations of school performance: curricula, income, and parental involvement. These sources will help Sally select her model and data, and will guide her interpretation of the results.

Data Collection

The data we want isn't always available, but Sally lucks out and finds student performance data based on test scores ( school_rating ) for every public school in middle Tennessee. The data also includes various demographic, school faculty, and income variables (see readme for more information). Satisfied with this dataset, she writes a web-scraper to retrieve the data.

But data alone can't help Sally; she needs to convert the data into useful information.

Descriptive and Inferential Statistics

Sally opens her stats textbook and finds that there are two major types of statistics, descriptive and inferential.

Descriptive statistics identify patterns in the data, but they don't allow for making hypotheses about the data.

Within descriptive statistics, there are two measures used to describe the data: central tendency and deviation . Central tendency refers to the central position of the data (mean, median, mode) while the deviation describes how far spread out the data are from the mean. Deviation is most commonly measured with the standard deviation. A small standard deviation indicates the data are close to the mean, while a large standard deviation indicates that the data are more spread out from the mean.

Inferential statistics allow us to make hypotheses (or inferences ) about a sample that can be applied to the population. For Sally, this involves developing a hypothesis about her sample of middle Tennessee schools and applying it to her population of all schools in Tennessee.

For now, Sally puts aside inferential statistics and digs into descriptive statistics.

To begin learning about the sample, Sally uses pandas' describe method, as seen below. The column headers in bold text represent the variables Sally will be exploring. Each row header represents a descriptive statistic about the corresponding column.
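A minimal sketch of this step, assuming the data have been loaded into a pandas DataFrame named df (the file name is hypothetical):

    import pandas as pd

    # Load the middle Tennessee schools data (hypothetical file name).
    df = pd.read_csv('middle_tn_schools.csv')

    # Summary statistics (count, mean, std, min, quartiles, max) for each column.
    print(df.describe())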

Looking at the output above, Sally's variables can be put into two classes: measurements and indicators.

Measurements are variables that can be quantified. All data in the output above are measurements. Some of these measurements, such as state_percentile_16 , avg_score_16 and school_rating , are outcomes; these outcomes cannot be used to explain one another. For example, explaining school_rating as a result of state_percentile_16 (test scores) is circular logic. Therefore we need a second class of variables.

The second class, indicators, are used to explain our outcomes. Sally chooses indicators that describe the student body (for example, reduced_lunch ) or school administration ( stu_teach_ratio ) hoping they will explain school_rating .

Sally sees a pattern in one of the indicators, reduced_lunch . reduced_lunch is a variable measuring the average percentage of students per school enrolled in a federal program that provides lunches for students from lower-income households. In short, reduced_lunch is a good proxy for household income, which Sally remembers from her lit review was correlated with school performance.

Sally isolates reduced_lunch and groups the data by school_rating using pandas' groupby method and then uses describe on the re-shaped data (see below).
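With the same assumed DataFrame, the grouped summary can be produced like this:

    # Distribution of reduced_lunch within each school_rating level.
    print(df.groupby('school_rating')['reduced_lunch'].describe())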

Below is a discussion of the metrics from the table above and what each result indicates about the relationship between school_rating and reduced_lunch :

count : the number of schools at each rating. Most of the schools in Sally's sample have a 4- or 5-star rating, but 25% of schools have a 1-star rating or below. This confirms that poor school performance isn't merely anecdotal, but a serious problem that deserves attention.

mean : the average percentage of students on reduced_lunch among all schools at each school_rating . As school performance increases, the average number of students on reduced lunch decreases. Schools with a 0-star rating have 83.6% of students on reduced lunch, while at the other end of the spectrum, 5-star schools on average have 21.6% of students on reduced lunch. We'll examine this pattern further in the graphing section.

std : the standard deviation of the variable. Referring to the school_rating of 0, a standard deviation of 8.813498 indicates that 68.2% (refer to readme ) of all observations are within 8.81 percentage points on either side of the average, 83.6%. Note that the standard deviation increases as school_rating increases, indicating that reduced_lunch loses explanatory power as school performance improves. As with the mean, we'll explore this idea further in the graphing section.

min : the minimum value of the variable. This represents the school with the lowest percentage of students on reduced lunch at each school rating. For 0- and 1-star schools, the minimum percentage of students on reduced lunch is 53%. The minimum for 5-star schools is 2%. The minimum value tells a similar story as the mean, but looking at it from the low end of the range of observations.

25% : the first quartile; 25% of the values for reduced_lunch fall below it. For 0-star schools, 25% of the observations are less than 79.5%. Sally sees the same trend in the bottom quartile as in the metrics above: as school_rating increases, the bottom 25% of reduced_lunch decreases.

50% : the median; half of the values fall below it. Looking at the trend in school_rating and reduced_lunch , the same relationship is present here.

75% : the third quartile; 75% of the values fall below it. The trend continues.

max : the maximum value for that variable. You guessed it: the trend continues!

The descriptive statistics consistently reveal that schools with more students on reduced lunch under-perform when compared to their peers. Sally is on to something.

Sally decides to look at reduced_lunch from another angle using a correlation matrix with pandas' corr method. The values in the correlation matrix table will be between -1 and 1 (see below). A value of -1 indicates the strongest possible negative correlation, meaning as one variable decreases the other increases. And a value of 1 indicates the opposite. The result below, -0.815757, indicates strong negative correlation between reduced_lunch and school_rating . There's clearly a relationship between the two variables.
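With the assumed df, one way to compute this in pandas:

    # Pairwise Pearson correlation between the two columns.
    print(df[['reduced_lunch', 'school_rating']].corr())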

Sally continues to explore this relationship graphically.

Essential Graphs for Exploring Data

Box-and-whisker plot.

In her stats book, Sally sees a box-and-whisker plot . A box-and-whisker plot is helpful for visualizing the distribution of the data. Understanding the distribution allows Sally to see how far spread out her data are; the larger the spread, the less robust reduced_lunch is at explaining school_rating .

See below for an explanation of the box-and-whisker plot.

[Figure: anatomy of a box-and-whisker plot]

Now that Sally knows how to read the box-and-whisker plot, she graphs reduced_lunch to see the distributions. See below.
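A plotting sketch using seaborn (one of the article's listed libraries), under the same df assumption:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # One box per school_rating level, showing the spread of reduced_lunch.
    sns.boxplot(x='school_rating', y='reduced_lunch', data=df)
    plt.show()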

[Figure: box-and-whisker plots of reduced_lunch at each school_rating]

In her box-and-whisker plots, Sally sees that the minimum and maximum reduced_lunch values tend to get closer to the mean as school_rating decreases; that is, as school_rating decreases so does the standard deviation in reduced_lunch .

What does this mean?

Starting with the top box-and-whisker plot, as school_rating decreases, reduced_lunch becomes a more powerful way to explain outcomes. This could be because as parents' incomes decrease, they have fewer resources to devote to their children's education (such as after-school programs, tutors, time spent on homework, or computer camps) than higher-income parents. Above a 3-star rating, more predictors are needed to explain school_rating due to an increasing spread in reduced_lunch .

Having used box-and-whisker plots to reaffirm her idea that household income and school performance are related, Sally seeks further validation.

Scatter Plot

To further examine the relationship between school_rating and reduced_lunch , Sally graphs the two variables on a scatter plot. See below.
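A sketch of the scatter plot with a fitted trend line, again assuming df:

    # regplot draws the scatter and overlays an ordinary least squares fit.
    sns.regplot(x='reduced_lunch', y='school_rating', data=df)
    plt.show()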

[Figure: scatter plot of school_rating against reduced_lunch with a fitted trend line]

In the scatter plot above, each dot represents a school. The placement of the dot represents that school's rating (y-axis) and the percentage of its students on reduced lunch (x-axis).

The downward trend line shows the negative correlation between school_rating and reduced_lunch (as one increases, the other decreases). The slope of the trend line indicates how much school_rating decreases as reduced_lunch increases. A steeper slope would indicate that a small change in reduced_lunch has a big impact on school_rating while a more horizontal slope would indicate that the same small change in reduced_lunch has a smaller impact on school_rating .

Sally notices that the scatter plot further supports what she saw with the box-and-whisker plot: when reduced_lunch increases, school_rating decreases. The tighter spread of the data as school_rating declines indicates the increasing influence of reduced_lunch . Now she has a hypothesis.

Correlation Matrix

Sally is ready to test her hypothesis: a negative relationship exists between school_rating and reduced_lunch (to be covered in a follow up article). If the test is successful, she'll need to build a more robust model using additional variables. If the test fails, she'll need to re-visit her dataset to choose other variables that possibly explain school_rating . Either way, Sally could benefit from an efficient way of assessing relationships among her variables.

An efficient graph for assessing relationships is the correlation matrix, as seen below; its color-coded cells make it easier to interpret than the tabular correlation matrix above. Red cells indicate positive correlation; blue cells indicate negative correlation; white cells indicate no correlation. The darker the colors, the stronger the correlation (positive or negative) between those two variables.
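A sketch of such a heatmap with seaborn (the diverging colormap is an assumption chosen to match the red/blue description):

    # Center the colormap at zero so white means no correlation.
    sns.heatmap(df.corr(), cmap='RdBu_r', center=0)
    plt.show()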

[Figure: color-coded correlation matrix of all variables]

With the correlation matrix in mind as a future starting point for finding additional variables, Sally moves on for now and prepares to test her hypothesis.

Sally was approached with a problem: why are some schools in middle Tennessee under-performing? To answer this question, she did the following:

  • Conducted a lit review to educate herself on the topic.
  • Gathered data from a reputable source to explore school ratings and characteristics of the student bodies and schools in middle Tennessee.
  • Explored the data visually and found a robust relationship between school_rating and reduced_lunch .
  • Developed a hypothesis: a negative relationship exists between school_rating and reduced_lunch .
  • Kept her mind open to other explanations, though satisfied with her preliminary findings.

In a follow up article, Sally will test her hypothesis. Should she find a satisfactory explanation for her sample of schools, she will attempt to apply her explanation to the population of schools in Tennessee.


Meet the Authors

Tim Dobbins, Author

A graduate of Belmont University, Tim is a Nashville, TN-based software engineer and statistician at Perception Health, an industry leader in healthcare analytics, and co-founder of Sidekick, LLC, a data consulting company. Find him on Twitter and GitHub.

John Burke, Author

John is a research analyst at Laffer Associates, a macroeconomic consulting firm based in Nashville, TN. He graduated from Belmont University. Find him on GitHub and LinkedIn.


Data Science Projects with Python: A case study approach to gaining valuable insights from real data with machine learning

Gain hands-on experience of Python programming with industry-standard machine learning techniques using pandas, scikit-learn, and XGBoost

Key Features

  • Think critically about data and use it to form and test a hypothesis
  • Choose an appropriate machine learning model and train it on your data
  • Communicate data-driven insights with confidence and clarity

Book Description

What you will learn

  • Load, explore, and process data using the pandas Python package
  • Use Matplotlib to create compelling data visualizations
  • Implement predictive machine learning models with scikit-learn
  • Use lasso and ridge regression to reduce model overfitting
  • Evaluate random forest and logistic regression model performance
  • Deliver business insights by presenting clear, convincing conclusions

Who this book is for

Data Science Projects with Python – Second Edition is for anyone who wants to get started with data science and machine learning. If you’re keen to advance your career by using data analysis and predictive modeling to generate business insights, then this book is the perfect place to begin. To quickly grasp the concepts covered, it is recommended that you have basic experience of programming with Python or another similar language, and a general interest in statistics.


Humanities Data Analysis: Case Studies with Python


Humanities Data Analysis: Case Studies with Python is a practical guide to data-intensive humanities research using the Python programming language. The book, written by Folgert Karsdorp, Mike Kestemont, and Allen Riddell, was originally published by Princeton University Press in 2021 (for a printed version of the book, see the publisher’s website), and is now available as an Open Access interactive Jupyter Book.

The book begins with an overview of the place of data science in the humanities, and proceeds to cover data carpentry: the essential techniques for gathering, cleaning, representing, and transforming textual and tabular data. Then, drawing from real-world, publicly available data sets that cover a variety of scholarly domains, the book delves into detailed case studies. Focusing on textual data analysis, the authors explore such diverse topics as network analysis, genre theory, onomastics, literacy, author attribution, mapping, stylometry, topic modeling, and time series analysis. Exercises and resources for further reading are provided at the end of each chapter.

What is the book about?

Learn how to effectively gather, read, store, and parse different data formats, such as CSV, XML, HTML, PDF, and JSON.

Construct Vector Space Models for texts and represent data in a tabular format. Learn how to use these and other representations (such as topics) to assess similarities and distances between texts.

Practice visual storytelling via data visualizations of character networks, patterns of cultural change, statistical distributions, and (shifts in) geographical distributions.

Work on real-world case studies using publicly available data sets. Dive into the world of historical cookbooks, French drama, Danish folktale collections, the Tate art gallery, mysterious medieval manuscripts, and many more.

Accompanying Data

The book features a large number of high-quality datasets. These datasets are published online under the DOI 10.5281/zenodo.891264 and can be downloaded from https://doi.org/10.5281/zenodo.891264.

Citing HDA

If you use Humanities Data Analysis in an academic publication, please cite the original publication:

Python OpenCV at Work: Real-Life Examples and Case Studies

Python OpenCV is a powerful library for image processing and computer vision, used by developers in various industries. In this article, we'll explore real-world examples and case studies of Python OpenCV applications, from facial recognition to object tracking, and learn how it's used in various industries.

Table of Contents

  • Introduction to Python OpenCV
  • Facial Recognition
  • Object Tracking
  • Augmented Reality
  • License Plate Recognition
  • Text Recognition
  • Retail Analytics
  • Healthcare
  • Security and Surveillance

OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library. It was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Python OpenCV is the Python binding for OpenCV, making it easy to use OpenCV functions with Python programming.

With Python OpenCV, developers can perform various tasks, such as:

  • Image processing (filtering, transformations, color space conversion)
  • Feature detection and description
  • Object detection and recognition
  • Image stitching and panorama creation
  • Video analysis and motion tracking

Real-World Python OpenCV Examples

Facial Recognition

Facial recognition is one of the most common applications of Python OpenCV. It can be used to identify or verify a person's identity based on their facial features. OpenCV provides pre-trained models for facial detection and recognition (a minimal detection sketch appears after the list below), which can be used to build applications like:

  • Access control systems
  • Personal photo organization
  • Social media tagging
  • Surveillance and security
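As a small illustration of the detection step behind these applications, here is a sketch using OpenCV's bundled Haar cascade (the input file name is hypothetical):

    import cv2

    # Load the frontal-face Haar cascade that ships with opencv-python.
    cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
    face_cascade = cv2.CascadeClassifier(cascade_path)

    img = cv2.imread('group_photo.jpg')            # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # detectMultiScale returns (x, y, w, h) boxes for detected faces.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite('faces_marked.jpg', img)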

Object Tracking

Python OpenCV can be used to track objects in video streams or image sequences (a motion-tracking sketch appears after the list below). This has various applications, such as:

  • Robotics (navigation, object manipulation)
  • Automated quality control and inspection
  • Traffic analysis and vehicle tracking
  • Sports analysis and player tracking
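A small motion-tracking sketch using background subtraction, one of several tracking approaches OpenCV offers (the video file name is hypothetical):

    import cv2

    cap = cv2.VideoCapture('traffic.mp4')          # hypothetical video file
    subtractor = cv2.createBackgroundSubtractorMOG2()

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)             # moving pixels become foreground
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            if cv2.contourArea(c) > 500:           # ignore small noise blobs
                x, y, w, h = cv2.boundingRect(c)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cap.release()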

Augmented Reality

Python OpenCV can be used to create augmented reality (AR) applications by overlaying virtual objects onto real-world images. Developers can use OpenCV's feature detection, camera calibration, and 3D rendering capabilities to build AR applications such as:

  • Mobile gaming
  • Virtual dressing rooms
  • Architectural visualization
  • Industrial training and simulation

License Plate Recognition

License plate recognition (LPR) is another popular application of Python OpenCV. It can be used to automatically detect and recognize license plates in images or video streams. This has various applications, such as:

  • Parking management systems
  • Toll collection systems
  • Traffic enforcement and monitoring
  • Vehicle tracking and identification

Text Recognition

Python OpenCV can be used in conjunction with Optical Character Recognition (OCR) libraries like Tesseract to detect and recognize text in images (a small OCR sketch appears after the list below). This has applications in:

  • Document scanning and digitization
  • License plate recognition
  • Text-based augmented reality
  • Translation and language learning apps
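A small OCR sketch combining the two libraries (pytesseract requires the Tesseract binary to be installed; the file name is hypothetical):

    import cv2
    import pytesseract

    img = cv2.imread('scanned_page.png')           # hypothetical scanned document
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Otsu binarization often improves OCR accuracy on clean scans.
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    print(pytesseract.image_to_string(thresh))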

Case Studies

Retail Analytics

Python OpenCV can be used for retail analytics to gather insights on customer behavior, store layout, and product performance. For example, OpenCV can track customer movement patterns, heat maps, and dwell time, which can help retailers optimize store layouts and product placements.

Healthcare

Python OpenCV can be used in healthcare applications for tasks like medical image analysis, diagnosis support, and surgical assistance. For example, OpenCV can be used to analyze X-rays, MRIs, and CT scans, detect abnormalities, and assist in image-guided surgery.

Security and Surveillance

Python OpenCV is extensively used in security and surveillance applications for tasks like facial recognition, motion detection, and object tracking. For example, OpenCV can be used to build intelligent surveillance systems that can detect and track intruders, recognize people or vehicles of interest, and generate alerts.

Python OpenCV is a versatile and powerful library for computer vision and image processing, with real-world applications in various industries. In this article, we explored examples and case studies of Python OpenCV at work in facial recognition, object tracking, augmented reality, and more. By leveraging Python OpenCV, developers can build innovative and effective solutions for a wide range of applications.


Python case study

Compilation, value objects, inheritance, variable scope.


R and Python Together: A Second Case Study Using LangChain’s LLM Tools

Posted on April 14, 2024 by Mark White in R bloggers

Following my previous post , I am again looking at how to employ R and Python seamlessly to use large language models (LLMs). Last time, I scraped information off of Wikipedia using the rvest package, fed that information to OpenAI’s Python API, and asked it to extract information for me.

But what if we could skip that scraping step? What if we had a more complex question where writing an rvest or RSelenium script was not feasible?

Enter the LangChain Python library. I recently read Generative AI with LangChain and Developing Apps with GPT-4 and ChatGPT , both of which do a fabulous job of introducing LangChain’s capabilities.

I’ve been thinking of LangChain as an LLM version of scikit-learn: It is a model-agnostic framework where you can build LLM pipelines. Most relevant to our needs here, though, is that you can employ tools in these pipelines. Tools allow the LLM to rely on integrations to answer prompts. One tool is Wikipedia, which allows the LLM to search and read Wikipedia in trying to answer the question it’s been given. This is especially useful if you want to ask it information about something that happened after it was trained.

I’m continuing to use my Best Picture model as the project, using LLMs to get more features for me to add to it. This means I’m mostly using LLMs as information extractors instead of information generators . Does this really map one-to-one with what something like GPT 3.5 Turbo was meant to do? I truly don’t know. I don’t think many people know what the hell these things can and should be used for. Which is why I’m testing it and reporting out the accuracy here!

Past Lives (my favorite movie of last year) was nominated for Best Picture, even though it was director Celine Song’s debut feature-length film. This is rare, and I think a helpful feature for my Best Picture model would be how many films the director had directed before the nominated film.

Getting this information is a bit more complicated. It would involve an RSelenium script going to the movie’s page, finding the director, clicking on their profile, and then either pulling down filmography information from there or clicking into their filmography page. Pages aren’t formatted the same, either: sometimes the section is “Filmography,” sometimes it is “Works”; sometimes the information is in a table, while other times it is in a bulleted list.

The idea here is to use LangChain to give an LLM the Wikipedia tool to find this information for me. As opposed to my last post, I am not giving it the relevant slice of info from Wikipedia anymore. Instead, I am asking it a question and giving it Wikipedia as a tool to use itself.

Methodology

I took the following steps:

Wrote a prompt that asks an LLM to figure out how many feature-length films the director of a movie had directed before making that film. The prompt is a template that takes a film’s name and release year.

Gave the LLM the Wikipedia tool and a calculator (my thinking was it might need this to sum up the number of movies, since these models are optimized on language, not math).

Collected and cleaned the responses for every movie that’s been nominated for Best Picture.

Tested the accuracy by hand-checking 100 of these cases.

The Prompt and Function

I started by making a file named funs.py :
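A plausible reconstruction of such a file under a recent LangChain API (the prompt wording and exact wiring are assumptions; the function name, model, temperature, tools, and six-step cap follow the post):

    # funs.py -- a sketch, not the author's exact code
    from langchain.agents import AgentType, initialize_agent, load_tools
    from langchain_openai import ChatOpenAI

    # GPT 3.5 Turbo at temperature zero, per the post.
    llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)

    # The Wikipedia tool plus a calculator ('llm-math').
    tools = load_tools(['wikipedia', 'llm-math'], llm=llm)

    agent = initialize_agent(
        tools,
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        max_iterations=6,  # the six-step cap mentioned below
        handle_parsing_errors=True,
    )

    def prior_films(year, title):
        prompt = (
            f'How many feature-length films had the director of the {year} '
            f'movie "{title}" directed before making it?'
        )
        return agent.run(prompt)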

It turns out that many of the functions in the two books above, despite being published in October and December of 2023, have been deprecated (though the author of Developing Apps has published updated code, and a second edition of the book is being released this year). This is a good reminder of how quickly this field is moving, and it’s OK to not be entirely sure how to use these models—so long as you have appropriate respect for that lack of knowledge. This is to say: I’m no expert here. What I have above is cobbled together from LangChain docs, StackOverflow posts, and GitHub threads. The prompt here lays out the basic steps it should be following to get the information.

Bringing It To R

I wrote an R script to use this function in an R session. Why? Because I think the tidyverse makes it easier to inspect, clean, and wrangle data than anything in Python currently.

We start off by loading the R packages, sourcing the R script, activating the Python virtual environment (the path is relative to my file structure in my drive), and sourcing the Python script. I read in the data from a Google Sheet of mine and do one step of cleaning, as the read_sheet() function was bringing the title variable in as a list of lists instead of a character vector. I then initialize a new column, resp , where I will collect the responses from the LLM.

I iterate through each film using a for loop. This is not the R way, I know, but if something snags, I want to catch the response that I’m paying for. You’ll see that I take extra precautions to catch everything by writing out the results in .csv row-by-row. (My solution because I had been running this script and got an aborted R session in the middle and lost everything.)

As I said in my previous post, this is an example of how we can use R and Python together in harmony. prior_films is a Python function, but we use it inside of an R script.

An example call that returns the correct information:
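    > prior_films(2017, "phantom thread")
    [1] "The director of the movie \"Phantom Thread\" is Paul Thomas Anderson. ..."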

I then pulled in this .csv to a separate cleaning script. Since I asked it for standardized feedback, I could use a regular expression to clean most of the responses. I read the rest myself. This went into a new column called resp_clean , with only the integer representing the number of movies the director had directed before the movie in question.

Performance

I used OpenAI’s GPT 3.5 Turbo alone this time. The trade-off of using the Wikipedia tool in LangChain is that there are many, many more “context tokens,” which can drive up the expense of each prompt quite a bit (hence no GPT 4). I also set the temperature to zero to get the most probable, and thus most reproducible, response each time. You could set this higher to get more “creative” responses, but that is not what I want from information extraction.

I was able to get valid responses for 81% of the films, which meant the LLM couldn’t find an answer for 115 of them. It either told me it couldn’t find the information or simply said "Agent stopped due to max iterations." , which meant it couldn’t find the answer in the limited six steps I gave it (so as to not run up a bill by the model running in circles, reading the same few Wikipedia pages over and over).

Correlation

First, let’s look at the correlation between GPT 3.5 Turbo’s responses and the hand-checked responses, alongside a scatterplot. The purple line is a smoothed loess, while the green line is OLS. The dotted black line would be perfect performance, where hand-checked equals GPT 3.5 Turbo. This also means any dot above the line is an undercount, while any below it is an overcount.

[Figure: hand-checked film counts vs. GPT 3.5 Turbo counts, with loess (purple), OLS (green), and identity (dotted) lines]

Oof, some huge misses. Not great performance from the LLM here. Most of these huge misses are due to early filmmakers and the studio system, which would churn out massive amounts of films, especially during the silent era. So, let’s look at absolute error by year.

Error by Year

We get a mean absolute error of nearly ten films. We also see that the model gave us the correct answer 28% of the time, an undercount 57% of the time, and an overcount 15% of the time.

When we plot absolute error against release year, we can see the poor performance is driven by earlier films:

[Figure: absolute error against film release year]

So how about we remove the studio-era films, since they wouldn’t be a good input into a model trying to predict Best Picture next year anyway? The cleanest cutoff I could think of is 1970 and later, since RKO closed in 1969:

We’re still off by 1.5 films, which is still more error than I’d like to include in my model. (Spoiler: I won’t be using the data generated here for my Best Picture model.)

Problematic Films and Directors

Let’s look at which films were the biggest misses, with an error of more than thirty films.

[Figure: films with an absolute error of more than thirty]

The directors in question here are W.S. Van Dyke, Michael Curtiz, John Ford, William Wyler, and Edmund Goulding. I would invite you to visit their Wikipedia pages and try to make sense of their filmography sections; it’s a lot. John Ford’s page, for example, lists all of the informational “short films” he made with the military, including Sex Hygiene and How to Operate Behind Enemy Lines . These pages were hard for me to hand-code according to the prompt.

Lastly, let’s examine overcounts. It makes more intuitive sense to get an undercount: The model didn’t pick up on films the person already directed—it missed them. But an overcount is stranger: How does that happen? A few examples:

We can see it didn’t make up new movies. Instead, it listed movies that came out after the movie in question. This is a great example of how LLMs are trained on language and not mathematical reasoning: the model doesn’t understand that 2008 comes after 1999, and thus that a 2008 film couldn’t have been made before American Beauty.

A few takeaways:

Again, we can see R and Python work together seamlessly.

LangChain provides an LLM with tools, but this comes at greater cost; GPT 4 probably would have done much better here, but it would have cost much more money in context tokens.

Prompt engineering is important: I could have been more explicit in the language Wikipedia tends to use, I could have asked it to use a calculator to check that the years didn’t difference out to below zero (e.g., 1999 - 2008 < 0), and I could have asked it to ignore silent films (even though the first Best Picture winner had no dialogue).

Domain expertise remains huge in data science: I think I’m pretty knowledgeable about film, but I don’t know the silent era. I wasn’t aware how many of the early directors had many dozens of silent films. I didn’t know about quota quickies . Understanding domain knowledge is vital for a data scientist.

KEEP HUMANS IN THE LOOP. What I did here is keep myself in the loop by checking performance against 100 hand-coded examples. This is a very nascent field using technology that has only been available to the public for a few years. Keep humans in the loop to make sure things don’t go off-track. For example, I won’t be adding these data to my model due to the error being too great.

Copyright © 2024 | MH Corporate basic by MH Themes

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Module re: add possibility to return matches during substitution

I would like to have access to the matches from a substitution. This would allow to reconstruct the original string (or parts of it) from the replaced string. An example use case is to (efficiently) parse a long string and then parse it again, e.g.,

Now, I would like to reconstruct the original string partially to obtain ab and ac . Of course, this example is oversimplified: here a different regular expression ( pattern='a( and )?b|c( and )d' ) would work as well, but in my use case I have much longer patterns that are impossible to combine into a single one.

Currently, there are:

  • re.sub(...) , which returns only the substituted text, and
  • re.subn(...) , which additionally returns the number of substitutions.

I suggest adding either a function (probably my preference):

  • re.subm(...) which returns both the substituted text and an iterable over the replaced matches (or a list, depending on the implementation)

or an additional argument to the existing re.sub(...) , e.g.,

  • re.sub(..., matches=False) which would return additionally the replaced matches iterable.

Possible implementation in Python

The following code implements this function in Python, but it could likely be done much more efficiently using the proper CPython _sre module.
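A minimal sketch of one possible pure-Python subm , using a callable replacement to record each match as it is substituted (the details here are illustrative, not the original posting):

    import re

    def subm(pattern, repl, string, count=0, flags=0):
        """Like re.sub, but also returns the list of replaced Match objects."""
        matches = []

        def record(match):
            matches.append(match)
            # Honor both string templates (group references) and callables.
            return match.expand(repl) if isinstance(repl, str) else repl(match)

        result = re.sub(pattern, record, string, count=count, flags=flags)
        return result, matches

For example, text, matches = subm(' and ', '', 'a and b and c and d') returns the substituted text together with the three Match objects that were replaced.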

Nevertheless, in my tests the timing was already not too bad, with less than a factor of two between re.sub and subm .

Edit: fixed example above.

Do you know that you can pass a callable as replacement?

Moving this to the help category as the capability already exists.

What the… That’s not even remotely true.

I now fixed the example, above.

Sure, I also use that feature in the proposed Python implementation of subm . But it does not solve the case (which is my case) where the pattern includes groups, too. Your suggestion surely helps to simplify it further; here is a simplified (and also slightly quicker) version of a possible Python implementation of subm :

Still, a big part of that code with the groups could be done more efficiently within _sre. Or are there other functions that can be used with the output of re._compile_repl ?

(edit: improved execution time by removing unnecessary re.compile call if repl is a callable)
