Lending Club Case Study

A case study by Srikanth Chakravarthy (LinkedIn).

In this case study, we put our knowledge of EDA to use to understand risk analytics in banking and financial services. It is intended to showcase how data is analyzed to minimize the risk of losing money while lending to customers.

Table of Contents

  • General Info

Conclusions

Technologies used, acknowledgements, general information.

This project/case study is about helping a consumer finance company identify the pattern of defaulters among the loans that the institution has given. Depending on the pattern, the institution would like to decide whether to provide or reject a loan for clients who appear to fall into this pattern of defaulting.

The business problem that this study intends to solve is to provide the institution with enough information to help it make loan-approval decisions. Effectively, the institution should be able to decide whether an applicant will be able to repay the loan or not.

For this purpose, the dataset used for analysis contains the information about past loan applicants and whether they defaulted or not. Studying this dataset would provide insights into the pattern of defaulters.

Some of the attributes studied are:

  • Number of years of employment at the time of receiving the loan
  • The debt-to-income ratio of the applicant
  • The purpose of taking the loan
  • The pattern of the other attributes against the loans that were charged off

  • Applicants with employment tenure of more than 8 years form the largest group of applicants in need of a loan.
  • The bivariate analysis complements the univariate analysis: most applicants with charged-off loans are in the 8–10 year employment range.
  • The heat maps show the main purposes and loan amounts that characterise the applicants whose loans were charged off.
  • The Grade of the loan does not play a part in the defaulting pattern.
  • Numpy, Pandas, Seaborn, Matplotlib.pyplot.

Mr. Aditya Bhattacharya for providing an overview of the Case Study and guidance in approaching the task.

Created by scorsagg - feel free to contact me!

Case Study #1: Lending Club Loans

Devin Masiak.

This dataset has 10,000 observations and 55 variables. Each observation is an accepted loan through the Lending Club platform. The variables describe the attributes of the person who requested the loan as well as the terms of the loan. One problem with this dataset is that several observations have missing data for certain variables. This is most likely because the variable does not apply to them (for example, if they aren’t applying for a joint loan), so the data is NA. Another note is that the “issue_month” variable contains both the month and year of when the loan was issued; splitting this into two separate variables will make the data easier to work with. Additionally, for the modeling part of this case study, it’s important to note which variables should be used to estimate interest rates. Only the attributes of the person requesting the loan should be used in modeling, not the attributes of the loan itself. Differentiating the two will be important for creating a feature set.
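A minimal sketch of that split, assuming the column is named `issue_month` and holds values like `"Mar-2018"` (the exact format in the dataset may differ):

```python
import pandas as pd

# Hypothetical values; the real column name and format may differ.
loans = pd.DataFrame({"issue_month": ["Mar-2018", "Jan-2017", "Dec-2016"]})

# Parse once, then split into two separate variables.
parsed = pd.to_datetime(loans["issue_month"], format="%b-%Y")
loans["issue_mon"] = parsed.dt.month    # 3, 1, 12
loans["issue_year"] = parsed.dt.year    # 2018, 2017, 2016
print(loans)
```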

Observations

There is no significant difference between loan grades of joint applications versus individual applications. This is counterintuitive to my common sense, but this is most likely a side effect of the dataset being accepted loans.

This graph is hard to interpret. Under a certain annual income, there is no strong correlation between interest rate and income, but above a certain threshold, higher incomes indicate higher grades. Another interesting observation from this graph is that interest rates are not continuous; they take on specific values.

There appears to be a slight correlation between loan grade and type of homeownership. There appears to be no distinction between owning and mortgaging a house, but people who rent a home have lower grades than the aforementioned.

Income verification plays a large role in loan grading. For reasons that elude me, unverified incomes have the highest loan grades while verified incomes have a normal distribution.

Overall, state does not appear to play a large factor in interest rates outside of a few exceptions. The state with the highest average interest rate is Hawaii at 14.78%. The state with the lowest average interest rate is South Dakota at 11.73%.

Model prediction

The feature set used to predict interest rates includes all given variables except for variables relating to the loan itself, variables relating to joint income, and the job title. All joint applicants were removed from the dataset entirely. The feature set ended up containing 36 features and 8,505 samples. There are two types of features in this data set, categorical and numerical. To prep the data for use, all categorical variables were encoded using a one-hot encoder. All null values were replaced with the median value from that feature. The two models I used to model interest rates were a Random Forest Classifier and a Stochastic Gradient Descent Classifier. Unfortunately, the highest accuracy achieved was only ~21%, by the random forest. The stochastic gradient descent only had ~11% accuracy.
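A rough scikit-learn sketch of that pipeline; the toy columns below stand in for the real 36-feature set, and the categorical imputation step is an approximation of what is described above rather than the exact procedure used:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the borrower-only feature set and discrete interest-rate labels.
X = pd.DataFrame({
    "annual_income":  [55000, 82000, None, 40000, 61000, 73000],
    "emp_length":     [3, 10, 5, None, 1, 8],
    "home_ownership": ["RENT", "MORTGAGE", "OWN", "RENT", "RENT", "MORTGAGE"],
})
y = pd.Series(["11.99", "7.35", "11.99", "14.65", "14.65", "7.35"])

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

prep = ColumnTransformer([
    # Numerical features: nulls replaced with the feature's median.
    ("num", SimpleImputer(strategy="median"), num_cols),
    # Categorical features: most-frequent imputation, then one-hot encoding.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

for name, clf in [("random_forest", RandomForestClassifier(random_state=0)),
                  ("sgd", SGDClassifier(random_state=0))]:
    model = Pipeline([("prep", prep), ("clf", clf)])
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```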

Model Results

Improvements

Given more time to complete this assignment, I would fine-tune the features to only be the most important ones and try more models. The random forest classifier shows the weights it assigned to each feature during training. Using this information (shown below), I can prune the least important features and hopefully get better results. I would also be able to add joint accounts to the model and see if they can be supported as well. My biggest assumption was that the interest rates are not a continuous variable. I assumed this because the “Annual Income vs. Interest Rate” graph showed that interest rates had specific values within each grade.
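A small illustration (on made-up data, not the case-study feature set) of reading a fitted random forest’s feature importances and pruning to the strongest ones:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up features and interest-rate-like labels, purely for illustration.
X = pd.DataFrame({
    "annual_income": [55000, 82000, 61000, 40000, 73000, 39000],
    "emp_length":    [3, 10, 5, 1, 8, 2],
    "dti":           [18.2, 9.5, 22.1, 30.0, 12.4, 27.3],
})
y = ["11.99", "7.35", "11.99", "14.65", "7.35", "14.65"]

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Weights the forest assigned to each feature during training.
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)

# Prune to the most important features and refit.
top = importances.head(2).index
rf_pruned = RandomForestClassifier(random_state=0).fit(X[top], y)
```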

Feature Weights

LendingClubCaseStudy

Lending club case study, table of contents, problem statement.

  • Dataset Analysis

Case Study Analysis

Conclusions, technologies used.

  • Acknowledgements

Lending Club is a consumer finance marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.

It specialises in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile.

Like most other lending companies, lending to ‘risky’ applicants is the largest source of financial loss (called credit loss). The credit loss is the amount of money lost by the lender when the borrower fails to repay or absconds with the money owed.

In other words, borrowers who default cause the largest amount of loss to the lenders. In this case, the customers labelled as ‘charged-off’ are the ‘defaulters’.

The core objective of the exercise is to help the company minimise credit loss. The two potential sources of credit loss are:

  • Applicant likely to repay the loan: such an applicant will bring in profit for the company through interest. Rejecting such applicants will result in loss of business.
  • Applicant not likely to repay the loan, i.e. one who will potentially default: approving the loan may lead to a financial loss for the company

The goal is to identify these risky loan applicants so that such loans can be reduced, thereby cutting down the amount of credit loss. Identifying such applicants using EDA on the given dataset is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.

DataSet Analysis

The data given below contains the information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

  • The overall objective will be to observe key leading indicators (driver variables) in the dataset which contribute to defaults
  • Use the analysis as the foundation of the hypothesis
  • Potential borrower requests for loan amount (loan_amnt)
  • The approver approves/rejects an amount based on past history/risk (funded_amnt)
  • The final amount offered as loan by the investor (funded_amnt_inv)

Analysis based on Domain Understanding

Leading attribute (loan_status).

  • Fully-Paid - The customer has successfully paid the loan
  • Charged-Off - The customer is “Charged-Off” or has “Defaulted”
  • For the given case study, “Current” status rows will be ignored

Decision Matrix (loan_status)

  • Fully Paid - Applicant has fully paid the loan (the principal and the interest)
  • Current - Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as ‘defaulted’.
  • Charged-off - Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan
  • Loan Rejected - The company had rejected the loan (because the candidate does not meet their requirements etc.). Since the loan was rejected, there is no transactional history for those applicants with the company, and so this data is not available (and thus not in this dataset)

Important Columns

The given columns are leading attributes, or predictors. These attributes are available at the time of the loan application and strongly help in predicting loan approval or rejection. Some of these key attributes may get dropped due to empty data in the dataset.

  • Annual Income (annual_inc) - Annual income of the customer. Generally, the higher the income, the higher the chances of loan approval
  • Home Ownership (home_ownership) - Whether the customer owns a home or stays rented. Owning a home adds collateral, which increases the chances of loan approval.
  • Employment Length (emp_length) - Employment tenure of a customer (this is overall tenure). The higher the tenure, the more financial stability, and thus the higher the chances of loan approval
  • Debt to Income (dti) - The percentage of the salary which goes towards paying loans. The lower the DTI, the higher the chances of loan approval.
  • State (addr_state) - Location of the customer. Can be used to create a generic demographic analysis. There could be higher delinquency or default rates in some demographics.
  • Loan Amount (loan_amnt)
  • Grade (grade)
  • Term (term)
  • Loan Date (issue_date)
  • Purpose of Loan (purpose)
  • Verification Status (verification_status)
  • Interest Rate (int_rate)
  • Installment (installment)
  • Public Records (public_rec) - Derogatory public records. The value adds risk to the loan; the higher the value, the lower the success rate.
  • Public Records Bankruptcy (public_rec_bankruptcy) - Number of bankruptcy records publicly available for the customer. The higher the value, the lower the success rate.

Ignored Columns

  • Customer Behaviour Columns - Columns which describe customer behaviour will not contribute to the analysis. The current analysis is at the time of the loan application, but the customer behaviour variables are generated after the loan is approved. Thus these attributes will not be considered in the loan approval/rejection process.
  • Granular Data - Columns which describe the next level of detail may not be required for the analysis. For example, grade may be relevant for creating business outcomes and visualizations, but sub-grade is very granular and will not be used in the analysis

Data Set Analysis based on understanding of EDA

Rows analysis.

  • Summary Rows: No summary rows were there in the dataset
  • Header & Footer Rows - No header or footer rows in the dataset
  • Extra Rows - No column number, indicators etc. found in the dataset
  • Rows where loan_status = CURRENT will be dropped, as CURRENT loans are in progress and do not contribute to deciding whether a loan is paid or defaulted. The rows are dropped before the column analysis, since this also cleans up CURRENT-only columns early, and columns with NA values can then be cleaned in one go
  • Find duplicate rows in the dataset and drop them if any exist (see the sketch after this list)
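A minimal pandas sketch of this row cleanup (the file name is assumed; column names follow the Lending Club data dictionary):

```python
import pandas as pd

loans = pd.read_csv("loan.csv", low_memory=False)   # assumed file name

# Drop in-progress loans: they cannot tell us whether the applicant defaulted.
loans = loans[loans["loan_status"] != "Current"]

# Drop exact duplicate rows, if any exist.
loans = loans.drop_duplicates()
```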

Columns Analysis of the Dataset

Drop columns.

  • This is evaluated after dropping rows with loan_status = Current
  • (next_pymnt_d, mths_since_last_major_derog, annual_inc_joint, dti_joint, verification_status_joint, tot_coll_amt, tot_cur_bal, open_acc_6m, open_il_6m, open_il_12m, open_il_24m, mths_since_rcnt_il, total_bal_il, il_util, open_rv_12m, open_rv_24m, max_bal_bc, all_util, total_rev_hi_lim, inq_fi, total_cu_tl, inq_last_12m, acc_open_past_24mths, avg_cur_bal, bc_open_to_buy, bc_util, mo_sin_old_il_acct, mo_sin_old_rev_tl_op, mo_sin_rcnt_rev_tl_op, mo_sin_rcnt_tl, mort_acc, mths_since_recent_bc, mths_since_recent_bc_dlq, mths_since_recent_inq, mths_since_recent_revol_delinq, num_accts_ever_120_pd, num_actv_bc_tl, num_actv_rev_tl, num_bc_sats, num_bc_tl, num_il_tl, num_op_rev_tl, num_rev_accts, num_rev_tl_bal_gt_0, num_sats, num_tl_120dpd_2m, num_tl_30dpd, num_tl_90g_dpd_24m, num_tl_op_past_12m, pct_tl_nvr_dlq, percent_bc_gt_75, tot_hi_cred_lim, total_bal_ex_mort, total_bc_limit, total_il_high_credit_limit)
  • There are multiple columns where the values are only zero; these columns will be dropped
  • There are columns where the values are constant. They don’t contribute to the analysis, so these columns will be dropped
  • There are columns where the only value present is a constant and the rest are NA. Such columns are treated as constant and will be dropped
  • There are columns where more than 65% of the data is empty (mths_since_last_delinq, mths_since_last_record) - these columns will be dropped (see the sketch after this list)
  • Drop columns (id, member_id) as they are index variables with unique values and don’t contribute to the analysis
  • Drop columns (emp_title, desc, title) as they are descriptive free text and don’t contribute to the analysis
  • Drop redundant columns (url). On closer analysis, url is a static path with the loan id appended as a query; it is redundant with the (id) column
  • The columns below capture customer behaviour. Behaviour is recorded after loan approval and is not available at the time of application, so these variables will not be considered in the analysis and are dropped
  • (delinq_2yrs, earliest_cr_line, inq_last_6mths, open_acc, pub_rec, revol_bal, revol_util, total_acc, out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv, total_rec_prncp, total_rec_int, total_rec_late_fee, recoveries, collection_recovery_fee, last_pymnt_d, last_pymnt_amnt, last_credit_pull_d, application_type)
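A sketch of the column drops described above, continuing from the row-cleanup sketch; the explicit drop list is abbreviated here, the full lists are in the bullets:

```python
# Columns that are entirely empty or constant (including "constant plus NA") carry no signal.
loans = loans.loc[:, loans.nunique(dropna=True) > 1]

# Columns with more than 65% missing values.
loans = loans.drop(columns=loans.columns[loans.isna().mean() > 0.65])

# Index-like, descriptive, redundant and post-approval (behaviour) columns; abbreviated list.
drop_cols = ["id", "member_id", "emp_title", "desc", "title", "url",
             "delinq_2yrs", "revol_bal", "total_pymnt", "recoveries", "application_type"]
loans = loans.drop(columns=[c for c in drop_cols if c in loans.columns])
```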

Convert Column Format

  • (loan_amnt, funded_amnt, funded_amnt_inv) columns are Object and will be converted to float
  • (int_rate, installment, dti) columns are Object and will be converted to float
  • Strip the “months” text from the term column and convert it to an integer (see the sketch after this list)
  • Percentage columns (int_rate) are object. Strip “%” characters and convert column to float
  • issue_d column converted to datetime format
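A sketch of these conversions, continuing from the sketches above; the issue_d format `"%b-%y"` (e.g. "Dec-11") is an assumption based on the 2007–2011 extract:

```python
# Strip the "%" sign from the interest rate and convert to float.
loans["int_rate"] = loans["int_rate"].str.rstrip("% ").astype(float)

# Strip the "months" text from term and keep the number (e.g. " 36 months" -> 36).
loans["term"] = loans["term"].str.extract(r"(\d+)", expand=False).astype(int)

# Amount and ratio columns stored as objects become floats.
for col in ["loan_amnt", "funded_amnt", "funded_amnt_inv", "installment", "dti"]:
    loans[col] = pd.to_numeric(loans[col], errors="coerce")

# Issue date to datetime (assumed "Dec-11"-style format).
loans["issue_d"] = pd.to_datetime(loans["issue_d"], format="%b-%y")
```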

Standardise Values

  • All currency columns are rounded off to 2 decimal places as currency are limited to cents/paise etc only.

Convert Column Values

  • loan_status column converted to boolean: Charged Off = False and Fully Paid = True. This makes the column usable as a 0/1 value
  • emp_length is mapped to an ordinal integer (see the sketch after this list):
  • < 1 year: 0,
  • 2 years: 2,
  • 3 years: 3,
  • 7 years: 7,
  • 4 years: 4,
  • 5 years: 5,
  • 6 years: 6,
  • 8 years: 8,
  • 9 years: 9,
  • 10+ years: 10
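A sketch of these two conversions, continuing from the sketches above; note that the “1 year: 1” entry is an assumption, since it is not listed above:

```python
# loan_status as a boolean flag: Fully Paid -> True, Charged Off -> False.
loans["loan_status"] = loans["loan_status"].map({"Fully Paid": True, "Charged Off": False})

# emp_length as an ordinal integer ("1 year": 1 is assumed; it is not in the list above).
emp_length_map = {
    "< 1 year": 0, "1 year": 1, "2 years": 2, "3 years": 3, "4 years": 4,
    "5 years": 5, "6 years": 6, "7 years": 7, "8 years": 8, "9 years": 9,
    "10+ years": 10,
}
loans["emp_length"] = loans["emp_length"].map(emp_length_map)
```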

Added new columns

  • verification_status_n added. Based on domain knowledge of lending, Verified > Source Verified > Not Verified; verification_status_n corresponds to {Verified: 3, Source Verified: 2, Not Verified: 1} for better analysis (see the sketch after this list)
  • issue_y is year extracted from issue_d
  • issue_m is month extracted from issue_d
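A sketch of the derived columns, continuing from the sketches above:

```python
# Ordinal verification status: Verified > Source Verified > Not Verified.
loans["verification_status_n"] = loans["verification_status"].map(
    {"Verified": 3, "Source Verified": 2, "Not Verified": 1}
)

# Year and month extracted from the (already datetime) issue date.
loans["issue_y"] = loans["issue_d"].dt.year
loans["issue_m"] = loans["issue_d"].dt.month
```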

Ignored Rows and Columns because of missing data

  • Columns with high percentage of missing values will be dropped (65% above for this case study)
  • Columns with less percentage of missing value will be imputed
  • Rows with high percentage of missing values will be removed (65% above for this case study)
  • Step 0 - Data Cleaning & Manipulation Checklist
  • Step 1 - Dropping Rows - where loan_status = “Current”
  • Step 2 - Dropping Columns based on EDA and Domain Knowledge
  • Step 3 - Convert the data types
  • Step 4 - Identify columns with blank values which need to be imputed
  • Step 5 - Analysis of the dataset post cleanup
  • Step 6 - Outlier Treatment
  • Step 7 - Analysis - Univariate, Bivariate and Derived Metrics Analysis
  • Step 8 - Conclusions Inferences and Recommendations
  • Python - Version 3.9.12
  • numpy - Version 1.21.5
  • pandas - Version 1.4.2
  • matplotlib - Version 3.5.1
  • seaborn - Version 0.11.2
  • Jupyter Notebook - Version 6.4.11
  • JupyterLab - Version 3.3.2
  • Anaconda - Version 2.1.4

Acknowledgements and References

  • The project references insights and inferences from a live presentation given by Aditya Bhattacharya
  • The project references course materials from upGrad’s curriculum
  • What is Exploratory Data Analysis? by Prasad Patil
  • https://stackoverflow.com/questions/56611698/pandas-how-to-read-csv-file-from-google-drive-public
  • EDA - Exploratory Data Analytics
  • Manoj Kumar Shukla
  • Siddhartha Lahiri



15Nik / Lending-club-case-study


Lending_Club_Case_Study

Lending club case study.

To identify the patterns of defaulter/risky customers and non-defaulter customers.

Table of Contents

  • General Info

Technologies Used

Conclusions, acknowledgements, general information.

  • We have data from a consumer finance company which specialises in lending various types of loans to urban customers. When the company receives a loan application, it has to make a decision on loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision
  • The scope is to perform an exploratory data analysis on the given data
  • and find patterns that identify good customers (who are likely to repay the loan) and defaulters (who are not likely to repay the loan and cause losses to the finance company)
  • If annual income is between 25k and 50k then the chance of ‘Charged Off’ is high, whereas the chance of ‘Fully Paid’ is high at an annual income of ~1.4 million (1,400,000).
  • Loan status has strong relationship with grades of customer. ~94% of Grade A customer has paid the loan (i.e., Loan Status=’Fully Paid’)
  • Applicants whose ‘Last Payment Amount’ is low (~ < 2000) are more likely to be defaulters
  • The maximum number of defaulters are those whose verification_status is ‘Source Verified’ and whose purpose of loan is either ‘small business’, ‘other’ or ‘moving’
  • Pandas and Numpy


Created by [https://ash4929.github.io/Lending_Club_Case_Study/] - feel free to contact me!


Ph.D. Candidate @ UIUC

Lending Club Loan

Jifu Zhao, 05 March 2018


Lending Club Loan Data Analysis (imbalanced classification problem)

Classification is one of the two most common data science problems (the other one is regression). For supervised classification, imbalanced data is pretty common yet very challenging. For example, credit card fraud detection, disease classification, network intrusion and so on are classification problems with imbalanced data. In this project, working with the Lending Club loan data, we hope to correctly predict whether or not a loan will default using the historical data.

This blog can be roughly divided into the following 7 parts. From the problem statement to the final conclusion, as a case study, I will go through a typical data science project’s major procedures. (For more details, please refer to my GitHub jupyter notebook.)

  • Problem Statement
  • Data Exploration
  • Data Cleaning and Initial Feature Engineering
  • Visualization
  • Further Feature Engineering
  • Machine Learning
  • Conclusions

1. Problem Statement

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using the historical data, more specifically the Lending Club loan data from 2007 to 2015, we hope to build a machine learning model such that we can predict the chance of default for future loans. As I will show later, this dataset is highly imbalanced and includes a lot of features, which makes this problem more challenging.

2. Data Exploration

There are several ways to download the dataset; for example, you can go to Lending Club’s website, or you can go to Kaggle. I will use the loan data from 2007 to 2015 as the training set (+ validation set), and use the data from 2016 as the test set. Below is a summary of the dataset (part of the columns).


We should notice some differences between the training and test set, and look into the details. Some major differences are:

  • For the test set, id, member_id, and url are totally missing, which is different from the training set
  • open_acc_6m, open_il_6m, open_il_12m, open_il_24m, mths_since_rcnt_il, total_bal_il, il_util, open_rv_12m, open_rv_24m, max_bal_bc, all_util, inq_fi, total_cu_tl, and inq_last_12m are almost entirely missing in the training set, which is different from the test set
  • desc, mths_since_last_delinq, mths_since_last_record, mths_since_last_major_derog, annual_inc_joint, dti_joint, and verification_status_joint have a large amount of missing values
  • There are multiple loan statuses, but we are only concerned with whether or not the loan defaults

3. Data Cleaning and Initial Feature Engineering

Data cleaning and feature engineering are two of the most important steps. For this project, I have done the following parts.

I. Transform feature int_rate and revol_util in test set

II. Transform target values loan_status

In the training set, only 7% of all data have label 1. It’s clear that our dataset is highly imbalanced. (Note that here we treat current status also as label 0 to increase the difficulty. In other related projects, some people simply drop all the loan in current status. I have also explored that case, and you can easily get over 99% AUC on both training and test set even with logistic regression.)
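A rough sketch of that target construction; the exact set of statuses mapped to label 1 is an assumption here (the notebook may use a different list), and `loans` is assumed to be the cleaned training DataFrame:

```python
# Statuses treated as "bad" (label 1) are assumed; "Current" deliberately stays at 0.
bad_status = ["Charged Off", "Default",
              "Does not meet the credit policy. Status:Charged Off"]

loans["label"] = loans["loan_status"].isin(bad_status).astype(int)
print(loans["label"].mean())   # fraction of label-1 loans (~0.07 reported above)
```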

III. Drop useless features

Now we have successfully reduced the features from 74 to 40. Next, let’s focus on more detailed feature engineering. First, let’s look at the data again. From the table below, we can see that:

  • Most features are numerical, but there are several categorical features.
  • There are still some missing values among numerical and categorical features.

IV. Feature transformation

  • Transform numerical values into categorical values
  • Transform categorical values into numerical values (discrete)

V. Fill missing values

  • For numerical features, use median
  • For categorical features, use mode (here, we don’t have missing categorical values)

4. Visualization

I. Visualize categorical features


II. Visualize numerical features


5. Further Feature Engineering

From the above heatmap and the categorical variable countplot, we can see that some features have strong correlations:

  • loan_amnt, funded_amnt, funded_amnt_inv, installment
  • int_rate, sub_grade
  • total_pymnt, total_pymnt_inv, total_rec_prncp
  • out_prncp, out_prncp_inv
  • recoveries, collection_recovery_fee

We can drop some of them to reduce redundancy. Now we have only 14 categorical features and 17 numerical features. Let’s check the correlation again.


6. Machine Learning

After the above procedures, we are ready to build the predictive models. In this part, I explored three different models: Logistic regression, Random Forest, and Deep Learning.

I used to use scikit-learn a lot, but there is one problem with scikit-learn: you need to do one-hot encoding manually, which can sometimes dramatically increase the feature space. In this part, for logistic regression and random forest, I use the H2O package, which has better support for categorical features. For the deep learning model, I use Keras with a TensorFlow backend.

I. Logistic Regression

After grid search over alpha and lambda, I got test AUC of 0.841.

II. Random Forest

After a similar grid search over the model’s hyperparameters, I got a test AUC of 0.848.

Feature Importance


As shown above, the top 9 most important features are:

  • out_prncp: Remaining outstanding principal for total amount funded
  • recoveries: Post charge off gross recovery
  • last_pymnt_amnt: Last total payment amount received
  • total_pymnt: Payments received to date for total amount funded
  • int_rate: Interest Rate on the loan
  • addr_state: The state provided by the borrower in the loan application
  • total_rec_late_fee: Late fees received to date
  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
  • total_rec_int: Interest received to date

III. Neural Networks

In this part, let’s manually build a fully-connected neural network (NN) model to finish the classification task. I use a relatively small model with only two hidden layers. Without comprehensive parameter tuning, the model gives AUC of 0.834.
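A minimal tf.keras sketch of such a network; the layer sizes, dropout, and training settings here are illustrative assumptions, not the ones used in the original notebook:

```python
import tensorflow as tf

def build_model(n_features: int) -> tf.keras.Model:
    """Small fully-connected network with two hidden layers for binary default prediction."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of default
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

# Example usage (X_train / y_train assumed to be the prepared arrays):
# model = build_model(X_train.shape[1])
# model.fit(X_train, y_train, epochs=10, batch_size=256, validation_split=0.1)
```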

7. Conclusions

From our above analysis, we can see that for the above three algorithms: Logistic Regression, Random Forest, and Neural Networks, their performance on the test set is pretty similar. Based on our simple analysis and grid search, Random Forest gives the best result.

There are a lot of other methods, such as AdaBoost and XGBoost, and we can tune a lot of parameters for different models, especially for Neural Networks. Here, I didn’t explore all possible algorithms or conduct comprehensive parameter tuning. For more details, please refer to my GitHub jupyter notebook.


In this project, to increase the problem difficulty, loans with a status like “current” are treated as status 0, which might not be very appropriate. There are other online analyses that exclude all loans that are still in “current” status. I also conducted the analysis with that kind of dataset, which is less imbalanced. I can easily get over 0.99 AUC on both the training and test datasets even with simple logistic regression. You can try it yourself if interested.


Model Training and Tuning

Baseline models, feature engineering attempts, decision tree model tuning/comparison, random forest model tuning/comparison, comparison of top decision tree vs. top random forest models.

When building and testing models for real-world use, we should choose model performance metrics based on the goal and usage of the model. In our case, we will be focusing on a high precision bar where the positive outcome is a fully paid off loan and then maximizing the total number of true positives. We have chosen to limit ourselves to decision tree-based algorithms because they are flexible and apply to broad types of data. We have some important ordinal variables in our dataset, including loan subgrade.

Imports and Loading Data

Since we have performed feature selection using Random Forest, we will use a single Decision Tree as our baseline with the top 15 features chosen by feature importances.

First, for comparison purposes, we compute an accuracy score from a trivial model: a model in which we simply predict all loans to have the outcome of the most common class (i.e., predicting all loans to be fully paid).

For the purposes of our baseline model, we simply select the top 15 features from the random forest feature importances. We train and compare decision trees using this subset of features.

In subsequent models, we will engineer new features and carefully tune the subset of features to include.

Now we train our baseline models on the subset of features. For the baseline models we simply use the default DecisionTreeClassifier (which uses class_weight=None), trained on max_depths from 2 to 10.

We store various performance metrics; in addition to the accuracy score, we also store the balanced accuracy score, precision score, and confusion matrix for each model so that we can investigate beyond a simple accuracy score.
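A sketch of that baseline sweep, assuming X_train / X_val / y_train / y_val already hold the top-15-feature subset with 1 = fully paid:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, precision_score)

results = {}
for depth in range(2, 11):                        # max_depths from 2 to 10
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)  # class_weight=None by default
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_val)
    results[depth] = {
        "accuracy": accuracy_score(y_val, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_val, y_pred),
        "precision": precision_score(y_val, y_pred),
        "confusion_matrix": confusion_matrix(y_val, y_pred),
    }
```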


Comments: Our baseline accuracy values are not impressive; the best accuracy score on the validation set is about 80%, which is the accuracy we’d achieve if we simply predicted all loans to be fully paid.

Though we have used accuracy as the scoring metric for our baseline model, we realize that this is not the most appropriate metric to use going forward. We should also consider other performance metrics, such as precision scores and balanced accuracy scores.

Going forward, we should also place more weight (i.e., by modifying the class_weight when training models) on correctly classifying loans that are charged off. There are two main reasons for this:

  • 80% of the loans in our dataset are, in fact, fully paid. We should add more weight to the charged-off loans to account for this imbalance in the loan outcome labels.
  • For the purposes of building a sound, reasonably low-risk investment strategy, we hope to minimize the particular error in which the model predicts a loan to be fully paid when it is truly charged off. We will tune the class_weight parameter in future models to select an optimal value for our purposes.

To enhance model performance while tuning our models, we explored generating additional features not in the raw data set to potentially capture additional relationships with the response variable.

Our attempts included interaction variables between top features, as well as polynomial terms of degree 2. Unfortunately, we did not see any notable improvement in model performance from these changes, so they were not included in our final model.

There are many other higher-order polynomial and interaction terms and combinations of relevant/related predictors that could also be fine-tuned into better summary variables to boost model performance. This feature engineering step is an area where there is room for substantial improvement upon our model in the future.

With similar total funded fully paid loans between the best Decision Tree and Random Forest, we need to look at the probability distributions. Because it is unrealistic to expect investors to invest in all loans we recommend, we will rank the loans by order of probability. As such, the performance of the models at predicting loans near 1.0 matters more than the loans near 0.5 as those will be fulfilled first.
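A sketch of that ranking step, assuming `rf` is the chosen fitted model and X_val the held-out features:

```python
import pandas as pd

# Probability of the "fully paid" class; check rf.classes_ for the column order.
proba_fully_paid = rf.predict_proba(X_val)[:, 1]

# Rank loans so that the most confident "fully paid" predictions are funded first.
ranked = pd.Series(proba_fully_paid, index=X_val.index).sort_values(ascending=False)
top_picks = ranked.head(100)   # e.g. the 100 loans an investor would fulfil first
```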


Because we chose models with a high base precision, the calibration curve is not surprising. Our use case will only focus on the upper end of the probability scores, where an approved charged-off loan is the worst misclassification, so it is not a big issue that the probabilities are not 1:1 with the proportion of fully paid loans. What we see is that our chosen Random Forest model has a higher proportion of fully paid loans in comparison to the Decision Tree at high probability values. While the Decision Tree is more interpretable, the Random Forest is a good balance between performance and interpretability.

lending-club

Lending club.

EDA to understand how consumer attributes and loan attributes influence the tendency to default

Table of Contents

  • General Info

Technologies Used

Conclusions, acknowledgements, general information.

When a loan application is received, the company must make a loan approval decision based on the applicant’s profile. The bank’s decision is associated with two types of risks:

  • If the applicant is likely to repay the loan, then the company loses business by not approving the loan.
  • If the applicant is unlikely to repay the loan, i.e. if he or she will default, approving the loan may result in a financial loss for the company.

The loan dataset contains information about previous loan applicants and whether they ‘defaulted’ or not. The goal is to identify patterns that indicate whether a person is likely to default, which can then be used to take actions such as denying the loan, reducing the loan amount, lending (to risky applicants) at a higher interest rate, and so on.

Loan Dataset Image

When a person applies for a loan, the company may make one of two decisions:

  • Fully paid: The applicant has paid off the loan in full (the principal and the interest)
  • Current: The applicant is in the process of paying the installments, so the loan’s tenure has not yet been completed. These candidates are not marked as ‘defaulted’.
  • Charged-off: The applicant has not paid the installments on time for an extended period of time, indicating that he or she has defaulted on the loan.
  • Loan rejected: The loan had been rejected by the company (because the candidate does not meet their requirements etc.). Because the loan was rejected, the applicants have no transactional history with the company, and thus this data is not available to the company (and thus not in this dataset)
  • Loan Data Set : This contains the complete loan data for all loans issued through the time period 2007-2011.
  • Data Dictionary : This dataset describes the meaning of the variables mentioned in the Loan Data Set.

Python 3.8.10

| Library    | Version |
| ---------- | ------- |
| matplotlib | 3.5.3   |
| numpy      | 1.22.4  |
| pandas     | 1.3.5   |
| seaborn    | 0.11.2  |

  • Term has the most impact on default rate: loans with a 60-month term are 2.5x more likely to default than those with a 36-month term.
  • Loan grades A through G show a pattern of increasing risk.
  • Employment Length does not show much impact on default rates.
  • Verification Status being source verified and verified slightly increases the default rate.
  • Home ownership status (mortgaged, owned or rented) very slightly increases the default risk in that order, with mortgaged being the least risky.
  • Higher the interest rate, higher is the risk of default.
  • Higher annual income may very slightly reduce the risk of defaulting the loan.
  • As DTI(Debt-To-Income) ratio increases, the risk of default increases as well.
  • This case study was done as a part of EPGP ML & AI, IIIT-B.
  • Loan Data Set and Data Dictionary was provided by upGrad and IIIT-B.

Case Study done by:

  • Rahul Nanwani
  • Nitin Katiyar

Lending Club Default Prediction

Harvard University CS109A Summer 2018 Kenneth Brown - David Gil Garcaa - Nikat Patel

Modeling and Predictions

Load, visualize, and clean data.

We see that some categories of loan_status have very few observations. Since we are really interested in whether the loan reaches good terms, instead of trying to predict the exact status we turn it into a binary category, indicating whether it is current (non-risky) or falls into any of the other categories, which we’ll call risky.


We see some correlations that may be interesting to explore further between features that indicate potential risk (as captured by the new feature added to the dataset) and others. We see that the variables by which Lending Club seems to grade loans do indeed have a potential effect on risk (such as term of loan or loan amount), but we also see others that they don’t seem to take into as much consideration, such as having a tax lien or derogatory public records.

It’s interesting to point out that better grades, as assigned by Lending Club, don’t necessarily correspond with less risk of default, as seen by “charged off” having a negative correlation with Grade G and positive with better levels.

In favor of Lending Club’s grading system we see that there seems to be an intrinsic higher risk on higher interest paying loans, at least through this rough preliminary analysis.

After all that visualization! Let’s Clean the Data!

We now process NaNs on a column-by-column basis to impute the appropriate value in each case.

After cleaning the loans_df dataset, we were able to reduce it from 145 columns to 55 columns.

Dimensionality Reduction


Even though the first 10 PCA components together explain just short of 50% of the variance, we find it interesting that some of the variables with the highest absolute coefficients in the first vectors are in the group that the previous analysis indicated could be good predictors.

From the results of the dimensionality reduction, we reduce the set of features on which to build our further models to the first 30 with the highest coefficients in PCA1.
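A sketch of that selection, assuming `X` is the cleaned numeric feature matrix as a DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise, fit PCA, then keep the 30 features with the largest absolute
# loadings on the first principal component.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10).fit(X_scaled)

loadings_pc1 = pd.Series(np.abs(pca.components_[0]), index=X.columns)
selected = loadings_pc1.sort_values(ascending=False).head(30).index
X_reduced = X[selected]
```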

Random Forest

A Random Forest Classifier was fitted on the training data set and resulted in 90% accuracy using these features when evaluating the model’s performance on the test set. Repeated runs of each model would result in minor changes to the accuracy scores.


The random forest model generated a forest of trees of depth 5. From these multiple generated trees, the importance of each feature is plotted above.


The following features (inq_last_12m, bc_open_to_buy, mo_sin_old_il_acct, mths_since_recent_bc, mths_since_recent_bc_dlq, num_rev_accts, percent_bc_gt_75) are the ones the random forest model considers to be the most important across all of the generated trees. We were very impressed by the results from fitting the Random Forest to such a small subset, from which we obtained a score of almost 90% on our test set with a model that trained in only a few seconds.
