Bin Data to Create Credit Scorecards Using Binning Explorer
Create a credit scorecard using the Binning Explorer app. Use the
Binning Explorer to bin the data, plot the binned data information,
and export a creditscorecard
object. Then use the
creditscorecard
object with functions from Financial Toolbox™ to fit a logistic regression model, determine a score for the data,
determine the probabilities of default, and validate the credit scorecard model
using three different metrics.
Step 1. Load credit scorecard
data
into the MATLAB workspace.Step 3. Fine-tune the bins using manual binning in Binning Explorer.
Step 4. Export the
creditscorecard
object from Binning Explorer.Step 9. Validate the credit scorecard model using the CAP, ROC, and Kolmogorov-Smirnov statistic
Step 1. Load credit scorecard data
into the MATLAB workspace.
Use the CreditCardData.mat
file to load the data
into
the MATLAB® workspace (using a dataset from Refaat 2011).
load CreditCardData
disp(data(1:10,:))
CustID CustAge TmAtAddress ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance UtilRate status ______ _______ ___________ __________ _________ __________ _______ _______ _________ ________ ______ 1 53 62 Tenant Unknown 50000 55 Yes 1055.9 0.22 0 2 61 22 Home Owner Employed 52000 25 Yes 1161.6 0.24 0 3 47 30 Tenant Employed 37000 61 No 877.23 0.29 0 4 50 75 Home Owner Employed 53000 20 Yes 157.37 0.08 0 5 68 56 Home Owner Employed 53000 14 Yes 561.84 0.11 0 6 65 13 Home Owner Employed 48000 59 Yes 968.18 0.15 0 7 34 32 Home Owner Unknown 32000 26 Yes 717.82 0.02 1 8 50 57 Other Employed 51000 33 No 3041.2 0.13 0 9 50 10 Tenant Unknown 52000 25 Yes 115.56 0.02 1 10 49 30 Home Owner Unknown 53000 23 Yes 718.5 0.17 1
Step 2. Import the data
into Binning Explorer.
Open Binning Explorer from the MATLAB toolstrip: On the Apps tab, under
Computational Finance, click the app icon.
Alternatively, you can enter binningExplorer
on the
MATLAB command line. For more information on starting the Binning
Explorer from the command line, see Start from MATLAB Command Line Using Data or an Existing creditscorecard Object.
From the Binning Explorer toolstrip, select Import Data to open the Import Data window.
Under Step 1, select data
.
Under Step 2, optionally set the Variable
Type for each of the predictors. By default, the last column in
the data ('status'
in this example) is set to
'Response'
. All other variables are considered
predictors. However, in this example, because 'CustID'
(customer identification number) is not a useful predictor, set the
Variable Type column for 'CustID'
to
Do not include.
Note
If the input MATLAB table contains a column for weights
,
from the Step 2 pane, using the Variable
Type column, click the drop-down to select
Weights. For more information on using
observation weights with a creditscorecard
object,
see Credit Scorecard Modeling Using Observation Weights.
If the data contains missing values, from the Step 2 pane, set Bin missing data: to Yes. For more information on working with missing data, see Credit Scorecard Modeling with Missing Values.
Under Step 3, leave Monotone as the default initial binning algorithm.
Click Import Data to complete the import operation. Automatic binning using the selected algorithm is applied to all predictors as they are imported into Binning Explorer.
The bins are plotted and displayed for each predictor. By clicking to select an individual predictor plot from the Overview pane, the details for that predictor plot display in the main pane and in the Bin Information and Predictor Information panes at the bottom of the app.
Binning Explorer performs automatic binning for every predictor
variable, using the default 'Monotone'
algorithm with default
algorithm options. A monotonic, ideally linear trend in the Weight of Evidence
(WOE) is often desirable for credit scorecards because this translates into
linear points for a given predictor. WOE trends are visualized on the plots for
each predictor in Binning Explorer.
Perform some initial data exploration. Inquire about predictor statistics for
the 'ResStatus'
categorical variable.
Click the ResStatus plot. The Bin Information pane contains the “Good” and “Bad” frequencies and other bin statistics such as weight of evidence (WOE).
For numeric data, the same statistics are displayed. Click the CustIncome plot. The Bin Information is updated with the information about CustIncome.
Step 3. Fine-tune the bins using manual binning in Binning Explorer.
Click the CustAge predictor plot. Notice that bins 1 and 2 have similar WOEs, as do bins 5 and 6.
To merge bins 1 and 2, from the main pane, click Ctrl + click or Shift + click to multiselect bin 1 and 2 to display with blue outlines for merging.
On the Binning Explorer toolstrip, use the read-only display for the Edges text boxes to verify values for the edges of the selected bins to merge.
Click Merge to finish merging bins 1 and 2. The CustAge predictor plot is updated for the new bin information and the details in the Bin Information and Predictor Information panes are also updated.
Next, merge bins 4 and 5, because they also have similar WOEs.
The CustAge predictor plot is updated with the new bin information. The details in the Bin Information and Predictor Information panes are also updated.
Repeat this merge operation for the following bins that have similar WOEs:
For CustIncome, merge bins 3, 4 and 5.
For TmWBank, merge bins 2 and 3.
For AMBalance, merge bins 2 and 3.
Now the bins for all predictors have close-to-linear WOE trends.
Step 4. Export the creditscorecard
object from Binning Explorer.
After you complete your binning assignments, using Binning
Explorer, click Export and then click
Export Scorecard and provide a
creditscorecard
object name. The
creditscorecard
object (sc
) is saved
to the MATLAB workspace.
Step 5. Fit a logistic regression model.
Use the fitmodel
function
to fit a logistic regression model to the WOE data. fitmodel
internally
bins the training data, transforms it into WOE values, maps the response
variable so that 'Good'
is 1
,
and fits a linear logistic regression model. By default, fitmodel
uses
a stepwise procedure to determine which predictors belong in the model.
sc = fitmodel(sc);
1. Adding CustIncome, Deviance = 1490.8954, Chi2Stat = 32.545914, PValue = 1.1640961e-08 2. Adding TmWBank, Deviance = 1467.3249, Chi2Stat = 23.570535, PValue = 1.2041739e-06 3. Adding AMBalance, Deviance = 1455.858, Chi2Stat = 11.466846, PValue = 0.00070848829 4. Adding EmpStatus, Deviance = 1447.6148, Chi2Stat = 8.2432677, PValue = 0.0040903428 5. Adding CustAge, Deviance = 1442.06, Chi2Stat = 5.5547849, PValue = 0.018430237 6. Adding ResStatus, Deviance = 1437.9435, Chi2Stat = 4.1164321, PValue = 0.042468555 7. Adding OtherCC, Deviance = 1433.7372, Chi2Stat = 4.2063597, PValue = 0.040272676 Generalized Linear regression model: logit(status) ~ 1 + CustAge + ResStatus + EmpStatus + CustIncome + TmWBank + OtherCC + AMBalance Distribution = Binomial Estimated Coefficients: Estimate SE tStat pValue ________ _______ ______ __________ (Intercept) 0.7024 0.064 10.975 5.0407e-28 CustAge 0.61562 0.24783 2.4841 0.012988 ResStatus 1.3776 0.65266 2.1107 0.034799 EmpStatus 0.88592 0.29296 3.024 0.0024946 CustIncome 0.69836 0.21715 3.216 0.0013001 TmWBank 1.106 0.23266 4.7538 1.9958e-06 OtherCC 1.0933 0.52911 2.0662 0.038806 AMBalance 1.0437 0.32292 3.2322 0.0012285 1200 observations, 1192 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 89.7, p-value = 1.42e-16
Step 6. Review and format scorecard points.
After fitting the logistic model, the points are unscaled by default and come
directly from the combination of WOE values and model coefficients. Use the
displaypoints
function to
summarize the scorecard points.
p1 = displaypoints(sc); disp(p1)
Predictors Bin Points ____________ __________________ _________ 'CustAge' '[-Inf,37)' -0.15314 'CustAge' '[37,40)' -0.062247 'CustAge' '[40,46)' 0.045763 'CustAge' '[46,58)' 0.22888 'CustAge' '[58,Inf]' 0.48354 'ResStatus' 'Tenant' -0.031302 'ResStatus' 'Home Owner' 0.12697 'ResStatus' 'Other' 0.37652 'EmpStatus' 'Unknown' -0.076369 'EmpStatus' 'Employed' 0.31456 'CustIncome' '[-Inf,29000)' -0.45455 'CustIncome' '[29000,33000)' -0.1037 'CustIncome' '[33000,42000)' 0.077768 'CustIncome' '[42000,47000)' 0.24406 'CustIncome' '[47000,Inf]' 0.43536 'TmWBank' '[-Inf,12)' -0.18221 'TmWBank' '[12,45)' -0.038279 'TmWBank' '[45,71)' 0.39569 'TmWBank' '[71,Inf]' 0.95074 'OtherCC' 'No' -0.193 'OtherCC' 'Yes' 0.15868 'AMBalance' '[-Inf,558.88)' 0.3552 'AMBalance' '[558.88,1597.44)' -0.026797 'AMBalance' '[1597.44,Inf]' -0.21168
Use modifybins
to
give the bins more descriptive labels.
sc = modifybins(sc,'CustAge','BinLabels',... {'Up to 36' '37 to 39' '40 to 45' '46 to 57' '58 and up'}); sc = modifybins(sc,'CustIncome','BinLabels',... {'Up to 28999' '29000 to 32999' '33000 to 41999' '42000 to 46999' '47000 and up'}); sc = modifybins(sc,'TmWBank','BinLabels',... {'Up to 11' '12 to 44' '45 to 70' '71 and up'}); sc = modifybins(sc,'AMBalance','BinLabels',... {'Up to 558.87' '558.88 to 1597.43' '1597.44 and up'}); p1 = displaypoints(sc); disp(p1)
Predictors Bin Points ____________ ___________________ _________ 'CustAge' 'Up to 36' -0.15314 'CustAge' '37 to 39' -0.062247 'CustAge' '40 to 45' 0.045763 'CustAge' '46 to 57' 0.22888 'CustAge' '58 and up' 0.48354 'ResStatus' 'Tenant' -0.031302 'ResStatus' 'Home Owner' 0.12697 'ResStatus' 'Other' 0.37652 'EmpStatus' 'Unknown' -0.076369 'EmpStatus' 'Employed' 0.31456 'CustIncome' 'Up to 28999' -0.45455 'CustIncome' '29000 to 32999' -0.1037 'CustIncome' '33000 to 41999' 0.077768 'CustIncome' '42000 to 46999' 0.24406 'CustIncome' '47000 and up' 0.43536 'TmWBank' 'Up to 11' -0.18221 'TmWBank' '12 to 44' -0.038279 'TmWBank' '45 to 70' 0.39569 'TmWBank' '71 and up' 0.95074 'OtherCC' 'No' -0.193 'OtherCC' 'Yes' 0.15868 'AMBalance' 'Up to 558.87' 0.3552 'AMBalance' '558.88 to 1597.43' -0.026797 'AMBalance' '1597.44 and up' -0.21168
Points are scaled and are also often rounded. To round and scale the points,
use the formatpoints
function. For
example, you can set a target level of points corresponding to a target odds
level and also set the required points-to-double-the-odds (PDO).
TargetPoints = 500; TargetOdds = 2; PDO = 50; % Points to double the odds sc = formatpoints(sc,'PointsOddsAndPDO',[TargetPoints TargetOdds PDO]); p2 = displaypoints(sc); disp(p2)
Predictors Bin Points ____________ ___________________ ______ 'CustAge' 'Up to 36' 53.239 'CustAge' '37 to 39' 59.796 'CustAge' '40 to 45' 67.587 'CustAge' '46 to 57' 80.796 'CustAge' '58 and up' 99.166 'ResStatus' 'Tenant' 62.028 'ResStatus' 'Home Owner' 73.445 'ResStatus' 'Other' 91.446 'EmpStatus' 'Unknown' 58.777 'EmpStatus' 'Employed' 86.976 'CustIncome' 'Up to 28999' 31.497 'CustIncome' '29000 to 32999' 56.805 'CustIncome' '33000 to 41999' 69.896 'CustIncome' '42000 to 46999' 81.891 'CustIncome' '47000 and up' 95.69 'TmWBank' 'Up to 11' 51.142 'TmWBank' '12 to 44' 61.524 'TmWBank' '45 to 70' 92.829 'TmWBank' '71 and up' 132.87 'OtherCC' 'No' 50.364 'OtherCC' 'Yes' 75.732 'AMBalance' 'Up to 558.87' 89.908 'AMBalance' '558.88 to 1597.43' 62.353 'AMBalance' '1597.44 and up' 49.016
Step 7. Score the data.
Use the score
function to compute the
scores for the training data. You can also pass an optional
data
input to score
, for example,
validation data. The points per predictor for each customer are provided as an
optional output.
[Scores,Points] = score(sc); disp(Scores(1:10)) disp(Points(1:10,:))
528.2044 554.8861 505.2406 564.0717 554.8861 586.1904 441.8755 515.8125 524.4553 508.3169 CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance _______ _________ _________ __________ _______ _______ _________ 80.796 62.028 58.777 95.69 92.829 75.732 62.353 99.166 73.445 86.976 95.69 61.524 75.732 62.353 80.796 62.028 86.976 69.896 92.829 50.364 62.353 80.796 73.445 86.976 95.69 61.524 75.732 89.908 99.166 73.445 86.976 95.69 61.524 75.732 62.353 99.166 73.445 86.976 95.69 92.829 75.732 62.353 53.239 73.445 58.777 56.805 61.524 75.732 62.353 80.796 91.446 86.976 95.69 61.524 50.364 49.016 80.796 62.028 58.777 95.69 61.524 75.732 89.908 80.796 73.445 58.777 95.69 61.524 75.732 62.353
Step 8. Calculate the probability of default.
To calculate the probability of default, use the probdefault
function.
pd = probdefault(sc);
Define the probability of being “Good” and plot the predicted odds versus the formatted scores. Visually analyze that the target points and target odds match and that the points-to-double-the-odds (PDO) relationship holds.
ProbGood = 1-pd; PredictedOdds = ProbGood./pd; figure scatter(Scores,PredictedOdds) title('Predicted Odds vs. Score') xlabel('Score') ylabel('Predicted Odds') hold on xLimits = xlim; yLimits = ylim; % Target points and odds plot([TargetPoints TargetPoints],[yLimits(1) TargetOdds],'k:') plot([xLimits(1) TargetPoints],[TargetOdds TargetOdds],'k:') % Target points plus PDO plot([TargetPoints+PDO TargetPoints+PDO],[yLimits(1) 2*TargetOdds],'k:') plot([xLimits(1) TargetPoints+PDO],[2*TargetOdds 2*TargetOdds],'k:') % Target points minus PDO plot([TargetPoints-PDO TargetPoints-PDO],[yLimits(1) TargetOdds/2],'k:') plot([xLimits(1) TargetPoints-PDO],[TargetOdds/2 TargetOdds/2],'k:') hold off
Step 9. Validate the credit scorecard model using the CAP, ROC, and Kolmogorov-Smirnov statistic
The creditscorecard
object supports three
validation methods, the Cumulative Accuracy Profile (CAP), the Receiver
Operating Characteristic (ROC), and the Kolmogorov-Smirnov (KS) statistic.
For more information on CAP, ROC, and KS, see validatemodel
.
[Stats,T] = validatemodel(sc,'Plot',{'CAP','ROC','KS'}); disp(Stats) disp(T(1:15,:))
Measure Value ______________________ _______ 'Accuracy Ratio' 0.32225 'Area under ROC curve' 0.66113 'KS statistic' 0.22324 'KS score' 499.18 Scores ProbDefault TrueBads FalseBads TrueGoods FalseGoods Sensitivity FalseAlarm PctObs ______ ___________ ________ _________ _________ __________ ___________ __________ __________ 369.4 0.7535 0 1 802 397 0 0.0012453 0.00083333 377.86 0.73107 1 1 802 396 0.0025189 0.0012453 0.0016667 379.78 0.7258 2 1 802 395 0.0050378 0.0012453 0.0025 391.81 0.69139 3 1 802 394 0.0075567 0.0012453 0.0033333 394.77 0.68259 3 2 801 394 0.0075567 0.0024907 0.0041667 395.78 0.67954 4 2 801 393 0.010076 0.0024907 0.005 396.95 0.67598 5 2 801 392 0.012594 0.0024907 0.0058333 398.37 0.67167 6 2 801 391 0.015113 0.0024907 0.0066667 401.26 0.66276 7 2 801 390 0.017632 0.0024907 0.0075 403.23 0.65664 8 2 801 389 0.020151 0.0024907 0.0083333 405.09 0.65081 8 3 800 389 0.020151 0.003736 0.0091667 405.15 0.65062 11 5 798 386 0.027708 0.0062267 0.013333 405.37 0.64991 11 6 797 386 0.027708 0.007472 0.014167 406.18 0.64735 12 6 797 385 0.030227 0.007472 0.015 407.14 0.64433 13 6 797 384 0.032746 0.007472 0.015833
See Also
creditscorecard
| screenpredictors
| autobinning
| bininfo
| predictorinfo
| modifypredictor
| modifybins
| bindata
| plotbins
| fitmodel
| displaypoints
| formatpoints
| score
| setmodel
| probdefault
| validatemodel
| compactCreditScorecard
Related Examples
- Common Binning Explorer Tasks
- Credit Scorecard Modeling with Missing Values
- Feature Screening with screenpredictors
- Troubleshooting Credit Scorecard Results
- Credit Rating by Bagging Decision Trees
- Stress Testing of Consumer Credit Default Probabilities Using Panel Data
More About
- Overview of Binning Explorer
- About Credit Scorecards
- Credit Scorecard Modeling Workflow
- Monotone Adjacent Pooling Algorithm (MAPA)
- Credit Scorecard Modeling Using Observation Weights