Credit ScoreCard Model on SQLFlow
The credit scorecard model is a common model in the financial lending scenario. A scorecard model outputs a score representing how likely the lender will repay on time. This tutorial introduces how to train a credit scorecard model with SQLFlow.
Dataset Introduction
This tutorial using a public dataset Give Me Some Credit on the Kaggle website. The dataset contains 150,000 rows training data, and each row includes 11 columns, where serious_dlqin2yrs
is the target column and represents person experienced 90 days past due delinquency or worse. The description for each column is as the table:
Column | Description | Type |
---|---|---|
serious_dlqin2yrs | Person experienced 90 days past due delinquency or worse | 1/0 |
revolving_utilization_of_unsecured_lines | Total balance on credit cards and personal lines of credit except for real estate and no installment debt like car loans divided by the sum of credit limits | Float |
age | Age of borrower in years | Integer |
number_of_time30_59days_past_due_not_worse | Number of times borrower has been 30-59 days past due but no worse in the last two years. | Integer |
debt_ratio | Monthly debt payments, alimony, living costs divided by monthly gross income | Float |
monthly_income | Monthly income | Integer |
number_of_open_credit_lines_and_loans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | Integer |
number_of_times_90_days_late | Number of times borrower has been 90 days or more past due | Integer |
number_real_estate_loans_or_lines | Number of mortgage and real estate loans including home equity lines of credit | Integer |
number_of_time60_89_days_past_due_not_worse | Number of times borrower has been 60-89 days past due but no worse in the last two years | Integer |
number_of_dependents | Number of dependents in family excluding themselves (spouse, children, etc.) | Integer |
Dataset Loading and Preprocessing
Please note, you can skip this section and use the preprocessed dataset inner SQLFlow playground.
Download and extract the zip file from the download page, and execute the Python script give_me_some_credit.py to finish the data preprocess. The Python scripts would output a CSV file train.csv
, and you can run the following SQL program to popularize the training table scorecard.train
.
DROP DATABASE IF EXISTS scorecard;
CREATE DATABASE scorecard;
DROP TABLE IF EXISTS scorecard.train;
CREATE TABLE IF NOT EXISTS scorecard.train(
serious_dlqin2yrs int,
revolving_utilization_of_unsecured_lines float,
age int,
number_of_time30_59days_past_due_not_worse int,
debt_ratio float,
monthly_income float NULL,
number_of_open_credit_lines_and_loans int,
number_of_times_90_days_late int,
number_real_estate_loans_or_lines int,
number_of_time60_89_days_past_due_not_worse int,
number_of_dependents float
);
-- load train and test data from CSV files
LOAD DATA LOCAL INFILE '/tmp/train.csv' INTO TABLE scorecard.train FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
Credit ScoreCard Modeling
You can have a glance at the training data by running the following SQL.
%%sqlflow
SELECT * FROM scorecard.train LIMIT 10;
Run the following command to start the credit scorecard modeling.
SELECT * FROM scorecard.train
TO TRAIN sqlflow_models.ScoreCard
LABEL serious_dlqin2yrs
INTO sqlflow_models.my_scorecard_model;
The above TRAIN
clause would output the scorecard as the following. The final total score is the sum of all scores based on the independent variable’s value. The target score is 600, meaning that a user with a score higher than 600 will grant the credit.
TARGET SCORE: 600
age (20.999, 38.0] 106.0
age (38.0, 47.0] 108.0
age (47.0, 56.0] 101.0
age (56.0, 65.0] 99.0
age (65.0, 89.0] 93.0
debt_ratio (-0.001, 0.132] 87.0
debt_ratio (0.132, 0.287] 103.0
debt_ratio (0.287, 0.472] 93.0
debt_ratio (0.472, 3.248] 118.0
debt_ratio (3.248, 15466.0] 101.0
monthly_income (-0.001, 3332.4] 106.0
monthly_income (3332.4, 5212.6] 108.0
monthly_income (5212.6, 5400.0] 100.0
monthly_income (5400.0, 8180.0] 100.0
monthly_income (8180.0, 208333.0] 98.0
number_of_dependents (-0.001, 2.0] 103.0
number_of_dependents (2.0, 8.0] 98.0
number_of_open_credit_lines_and_loans (-0.001, 4.0] 108.0
number_of_open_credit_lines_and_loans (4.0, 6.0] 101.0
number_of_open_credit_lines_and_loans (6.0, 9.0] 98.0
number_of_open_credit_lines_and_loans (9.0, 12.0] 97.0
number_of_open_credit_lines_and_loans (12.0, 46.0] 108.0
number_of_time30_59days_past_due_not_worse (-0.001, 98.0] 103.0
number_of_time60_89_days_past_due_not_worse (-0.001, 98.0] 103.0
number_of_times_90_days_late (-0.001, 98.0] 103.0
number_real_estate_loans_or_lines (-0.001, 1.0] 101.0
number_real_estate_loans_or_lines (1.0, 2.0] 102.0
number_real_estate_loans_or_lines (2.0, 9.0] 119.0
revolving_utilization_of_unsecured_lines (-0.001, 0.0191] 74.0
revolving_utilization_of_unsecured_lines (0.0191, 0.0867] 66.0
revolving_utilization_of_unsecured_lines (0.0867, 0.276] 70.0
revolving_utilization_of_unsecured_lines (0.276, 0.686] 94.0
revolving_utilization_of_unsecured_lines (0.686, 2340.0] 139.0