This project implements the Monotone Optimal Binning (MOB) algorithm in SAS 9.4. We extend the algorithm so that it can be applied to both numerical and categorical data. To avoid creating too many bins, the p-value threshold is optimized iteratively, and three binning methods are provided so that users can discretize data more conveniently: `bins size first binning`, `monotonicity first binning`, and `chi merge binning`.
```
git clone https://github.com/cdfq384903/MonotonicOptimalBinning.git
```
- Upload the source code following the structure shown below.
Note: we made some modifications to the dataset `german_data_credit_cat.csv`. Details are shown below:
- Rename all columns
- Change the value of column `Cost Matrix(Risk)`:

| Types of Credit Risk | Original value | Revised value |
|---|---|---|
| Good Risk | 1 | 0 |
| Bad Risk | 2 | 1 |
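
As a small illustration of the recoding in the table above, the sketch below maps the original codes to the revised ones. It is written in Python rather than SAS purely for illustration; the name `recode_risk` is hypothetical and not part of this project.

```python
# Hypothetical helper: recode the original Cost Matrix(Risk) values
# (1 = good, 2 = bad) into the revised 0/1 target described above.
RISK_RECODE = {1: 0, 2: 1}

def recode_risk(values):
    """Return the revised target values; an unknown code raises KeyError."""
    return [RISK_RECODE[v] for v in values]

print(recode_risk([1, 2, 2, 1]))  # [0, 1, 1, 0]
```

With this coding, the mean of the target column equals the bad rate, which is what the binning constraints on "bads" refer to.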
Initialize parameters:
```sas
%let data_table = german_credit_card;
%let y = CostMatrixRisk;
%let x = AgeInYears CreditAmount DurationInMonth;
%let exclude_condi = < -99999999;
%let init_sign = auto;
%let min_samples = %sysevalf(1000 * 0.05);
%let min_bads = 10;
%let min_pvalue = 0.35;
%let show_woe_plot = 1;
%let lib_name = TMPWOE;
%let is_using_encoding_var = 1;
```
Run the `MainSizeFirstBining.sas` script:
```sas
%let min_bins = 3;
%let max_samples = %sysevalf(1000 * 0.4);
PROC DATASETS lib = TMPWOE kill; QUIT; RUN;
%init(data_table = &data_table., y = &y., x = &x., exclude_condi = &exclude_condi., init_sign = &init_sign.,
      min_samples = &min_samples., min_bads = &min_bads., min_pvalue = &min_pvalue.,
      show_woe_plot = &show_woe_plot.,
      is_using_encoding_var = &is_using_encoding_var., lib_name = &lib_name.);
%initSizeFirstBining(max_samples = &max_samples., min_bins = &min_bins., max_bins = 7);
%runMob();
```
SFB RESULT OUTPUT - `DurationInMonth`:

Note: The image above shows the WoE transformation result of the variable `DurationInMonth` after applying the `SFB Algorithm`. It clearly shows the monotonicity of the WoE values.
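
The WoE values plotted in such a chart can be reproduced by hand from the good and bad counts of each bin. The following is a minimal Python sketch, not the project's SAS implementation. It assumes the common convention WoE = ln(share of goods / share of bads); the project may use the opposite sign. The function name `woe_per_bin` and the counts are made up for illustration.

```python
import math

def woe_per_bin(goods, bads):
    """Weight of Evidence per bin, assuming WoE = ln(%good / %bad).

    goods[i] and bads[i] are the good/bad counts in bin i. Every bin must
    contain at least one good and one bad, otherwise the log is undefined.
    """
    total_good = sum(goods)
    total_bad = sum(bads)
    return [
        math.log((g / total_good) / (b / total_bad))
        for g, b in zip(goods, bads)
    ]

# Made-up counts for three bins (not taken from the German credit data).
woe = woe_per_bin(goods=[300, 250, 150], bads=[60, 100, 140])
print([round(w, 4) for w in woe])
```

Because the logarithm is undefined for a bin without bads, a constraint such as `min_bads` also keeps the WoE values well defined.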
SFB RESULT OUTPUT - `CreditAmount`:

Note: The image above shows the WoE transformation result of the variable `CreditAmount` after applying the `SFB Algorithm`. The WoE values are not monotonic because the `SFB Algorithm` gives priority to satisfying the bin-related restrictions.
Run the `MainMonotonicFirstBining.sas` script:
```sas
PROC DATASETS lib = TMPWOE kill; QUIT; RUN;
%init(data_table = &data_table., y = &y., x = &x., exclude_condi = &exclude_condi., init_sign = &init_sign.,
      min_samples = &min_samples., min_bads = &min_bads., min_pvalue = &min_pvalue.,
      show_woe_plot = &show_woe_plot.,
      is_using_encoding_var = &is_using_encoding_var., lib_name = &lib_name.);
%initMonotonicFirstBining();
%runMob();
```
MFB RESULT OUTPUT - `DurationInMonth`:

Note: The image above shows the WoE transformation result of the variable `DurationInMonth` after applying the `MFB Algorithm`. The WoE values are monotonic.
MFB RESULT OUTPUT - `CreditAmount`:

Note: The image above shows the WoE transformation result of the variable `CreditAmount` after applying the `MFB Algorithm`. The WoE values are monotonic, but this approach is likely to cause issues such as an excessive share of samples in a single bin, too few bins, or undersized bins.
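
Whether a sequence of per-bin WoE values is monotonic, as discussed in the notes above, can be checked mechanically. The Python sketch below is illustrative only. The function name `woe_direction` is hypothetical, and strict monotonicity is an assumption; the labels `asc` and `desc` follow the `woe_dir` output described later in this README.

```python
def woe_direction(woe):
    """Return "asc", "desc", or None for a list of per-bin WoE values.

    Assumes strict monotonicity; fewer than two bins yields None.
    """
    if len(woe) < 2:
        return None
    pairs = list(zip(woe, woe[1:]))
    if all(a < b for a, b in pairs):
        return "asc"
    if all(a > b for a, b in pairs):
        return "desc"
    return None

print(woe_direction([-0.8, -0.1, 0.4]))  # asc
print(woe_direction([0.5, -0.2, 0.1]))   # None
```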
Initialize parameters:
```sas
%let data_table = german_credit_card;
%let y = CostMatrixRisk;
%let x = Purpose;
%let max_bins_threshold = 30;
%let min_bins = 4;
%let max_bins = 6;
%let min_samples = 0.05;
%let max_samples = 0.4;
%let p_value_threshold = 0.35;
%let libName = TMPWOE;
```
Chi Merge Binning (CMB) is an automatic binning algorithm that uses the chi-squared test as its merging criterion. It is subject to the same restrictions as SFB and MFB on the number of bins, bin size, sample size, etc. Currently, CMB cannot handle ordered categorical variables.
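
To make the merging criterion concrete, here is a minimal Python sketch of a chi-squared test between two bins. It is not taken from the project's SAS code. It assumes a Pearson chi-squared test on the 2x2 table of bin by good/bad counts, without continuity correction, and that both bins together contain at least one good and one bad. The function name `chi2_pvalue_two_bins` and the counts are hypothetical.

```python
import math

def chi2_pvalue_two_bins(good_a, bad_a, good_b, bad_b):
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) on the 2x2 table of two bins by good/bad counts."""
    n = good_a + bad_a + good_b + bad_b
    row_a, row_b = good_a + bad_a, good_b + bad_b
    col_good, col_bad = good_a + good_b, bad_a + bad_b
    # Shortcut formula for a 2x2 table: n * (ad - bc)^2 / product of margins.
    stat = n * (good_a * bad_b - bad_a * good_b) ** 2 / (
        row_a * row_b * col_good * col_bad
    )
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))

P_VALUE_THRESHOLD = 0.35  # same value as p_value_threshold above

p_similar = chi2_pvalue_two_bins(80, 20, 78, 22)
p_different = chi2_pvalue_two_bins(90, 10, 50, 50)
print(p_similar > P_VALUE_THRESHOLD)    # True: candidates for merging
print(p_different > P_VALUE_THRESHOLD)  # False: kept apart
```

A p-value above the threshold means the two bins do not differ significantly in bad rate, so they are candidates for merging.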
Run the `MainChiMerge.sas` script:
```sas
%runChiMerge(dataFrame = german_credit_card, x = &x., y = &y.,
             max_bins_threshold = &max_bins_threshold.,
             min_bins = &min_bins., max_bins = &max_bins.,
             min_samples = &min_samples., max_samples = &max_samples.,
             p_value_threshold = &p_value_threshold.,
             libName = &libName.);
```
CMB OUTPUT RESULT:

The result of `CMB` is shown above. We can see that the `CMB Algorithm` merges the categorical variable `Purpose` in `german_credit_card` from 10 attributes into 6 groups.
`MFB Algorithm` macro example:
```sas
%init(data_table, y, x, exclude_condi, init_sign, min_samples, min_bads, min_pvalue,
      show_woe_plot, is_using_encoding_var, lib_name);
%initMonotonicFirstBining();
%runMob();
```
`SFB Algorithm` macro example:
```sas
%init(data_table, y, x, exclude_condi, init_sign, min_samples, min_bads, min_pvalue,
      show_woe_plot, is_using_encoding_var, lib_name);
%initSizeFirstBining(max_samples, min_bins, max_bins);
%runMob();
```
- `data_table`
  Default: None
  Suggestion: A training data set.
  The `data_table` argument defines the input data set. The data set must include all independent variables and the target variable (response variable). For example, in the `MainMonotonicFirstBining.sas` script you can pass `german_credit_card` as the given data set, which is a table created by the `%readCsvFile()` macro.
- `y`
  Default: None
  Suggestion: The column name of the response variable.
  The `y` argument defines the column name of the response variable. For example, in the `MainMonotonicFirstBining.sas` script you can pass `CostMatrixRisk`, which exists in the data set `german_credit_card`.
- `x`
  Default: None
  Suggestion: The column names of the variables on which to run the algorithm.
  The `x` argument defines the column names of the chosen variables. Multiple columns can be passed at once. For example, in the `MainMonotonicFirstBining.sas` script you can pass `AgeInYears CreditAmount DurationInMonth`, which all exist in the data set `german_credit_card`.
- `exclude_condi`
  Default: None
  Suggestion: The condition used to exclude observations of the variables.
  The `exclude_condi` argument defines the condition under which observations of the variables are excluded. For example, in the `MainMonotonicFirstBining.sas` script you can pass `< -99999999`, which means that the algorithm will exclude observations whose value of the variable is less than -99999999.
- `init_sign`
  Default: None
  Suggestion: Setting `init_sign` to `auto` automatically calculates the Pearson correlation to determine the relation between the `x` and `y` variables. If the Pearson correlation is greater than 0, the program treats it as a positive relation, which means that the greater `x` is, the higher the default rate (the mean of `y`) is.
- `min_samples`
  Default: None
  Suggestion: The minimum number of samples kept in each bin. Usually `min_samples` is suggested to be 5% of the total population.
  The `min_samples` argument defines the minimum number of samples kept in each bin. For example, in the `MainMonotonicFirstBining.sas` script you can pass `%sysevalf(1000 * 0.05)`, which means each bin must hold at least 5% of the total samples (1000 obs).
- `min_bads`
  Default: None
  Suggestion: The minimum number of positive events (default/bad in risk analysis) kept in each bin. Usually `min_bads` is suggested to be 1.
  The `min_bads` argument defines the minimum number of positive events kept in each bin. For example, in the `MainMonotonicFirstBining.sas` script you can pass 10, which means that each bin must contain at least 10 positive events.
- `min_pvalue`
  Default: None
  Suggestion: The minimum p-value threshold the algorithm uses to decide whether to merge two bins. A higher `min_pvalue` usually makes the algorithm merge bins less often.
  The `min_pvalue` argument defines the minimum p-value threshold. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.35, which means that the algorithm will merge two bins if the p-value of the statistical test (Z-Test) conducted between them is greater than 0.35. The threshold is decreased iteratively if no pair of bins yields a p-value greater than the given parameter while the final number of bins is still greater than `max_bins`.
- `show_woe_plot`
  Default: None
  Suggestion: Boolean (0, 1): whether to show the WoE plot while the MOB algorithm is running.
  The `show_woe_plot` argument defines whether the WoE plot is shown during the algorithm process. For example, in the `MainMonotonicFirstBining.sas` script you can pass 1, which means that SAS will show the WoE plot result for each given `x`.
- `is_using_encoding_var`
  Default: None
  Suggestion: Boolean (0, 1): whether to use the encoding variable table. If a column name (`x` or `y`) is too long for a SAS macro, we suggest enabling this parameter.
  The `is_using_encoding_var` argument defines whether the encoding variable table is used. For example, in the `MainMonotonicFirstBining.sas` script you can pass 1, which means the attribute names of the data will be replaced by encoded variable names.
- `lib_name`
  Default: None
  Suggestion: The library name used to store the output tables. If you have no preference, pass `work`, which is the temporary library in SAS.
  The `lib_name` argument defines the output library name for storing the tables created by the algorithm. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which is assigned to the given directory by `LIBNAME TMPWOE "/home/u60021675/output"`.
- `max_samples`
  Default: None
  Suggestion: Only used in the `%initSizeFirstBining()` macro. The maximum number of samples kept in each bin. Usually `max_samples` is suggested to be 40% of the population to avoid a serious concentration issue in WoE binning.
  The `max_samples` argument defines the maximum number of samples kept in each bin. For example, in the `MainSizeFirstBining.sas` script you can pass `%sysevalf(1000 * 0.4)`, which means each bin may hold at most 40% of the population.
- `min_bins`
  Default: None
  Suggestion: Only used in the `%initSizeFirstBining()` macro. The minimum number of bins kept in the final WoE summary output for each given `x`.
  The `min_bins` argument defines the minimum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainSizeFirstBining.sas` script you can pass `3`, which means the algorithm will create at least 3 bins for each given `x`.
- `max_bins`
  Default: None
  Suggestion: Only used in the `%initSizeFirstBining()` macro. The maximum number of bins kept in the final WoE summary output for each given `x`. Note that `max_bins` must be higher than `min_bins`.
  The `max_bins` argument defines the maximum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainSizeFirstBining.sas` script you can pass `7`, which means the algorithm will create at most 7 bins for each given `x`.
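
The merge decision described for `min_pvalue` above can be sketched as follows. This is an illustrative Python example, not the project's SAS code. It assumes a pooled, two-sided two-proportion z-test on the bad rates of two adjacent bins, and that the pooled bad rate is strictly between 0 and 1; the exact form of the Z-Test used by the macros may differ. The names `two_proportion_z_pvalue` and `should_merge` are hypothetical.

```python
import math

def two_proportion_z_pvalue(bads_a, n_a, bads_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test on the bad
    rates of two adjacent bins (assumed form of the test)."""
    rate_a, rate_b = bads_a / n_a, bads_b / n_b
    pooled = (bads_a + bads_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def should_merge(bads_a, n_a, bads_b, n_b, min_pvalue=0.35):
    """Merge when the bad rates are not significantly different."""
    return two_proportion_z_pvalue(bads_a, n_a, bads_b, n_b) > min_pvalue

print(should_merge(20, 100, 22, 100))  # True: similar bad rates
print(should_merge(10, 100, 50, 100))  # False: clearly different bad rates
```

Raising `min_pvalue` makes the condition harder to satisfy, so fewer pairs of bins are merged, which matches the suggestion given for that parameter.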
- The output files created by the MOB algorithm.
- The WoE summary result table created by the MOB algorithm.
`%printWithoutCname()` macro example:
```sas
%printWithoutCname(lib_name);
```
- `lib_name`
  Default: None
  Suggestion: The library assigned to store the WoE summary result.
  The `lib_name` argument defines the library assigned to store the WoE summary result. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means that the `%printWithoutCname()` macro will output the files and result table to the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.

The output of running the `%printWithoutCname()` macro. It shows the results for all discretized variables.
`%getIvPerVar()` macro example:
```sas
%getIvPerVar(lib_name, min_iv, min_obs_rate, max_obs_rate, min_bin_size, max_bin_size, min_bad_count);
```
- `lib_name`
  Default: None
  Suggestion: The library assigned to store the IV summary result.
  The `lib_name` argument defines the library assigned to store the IV summary result. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means that the `%printWithoutCname()` macro will output the files and result table to the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
- `min_iv`
  Default: None
  Suggestion: The minimum threshold of the information value (IV). Usually greater than 0.1.
  The `min_iv` argument defines the minimum threshold of the information value (IV). For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.1, which means the `%getIvPerVar()` macro will mark `is_iv_pass` as 1 if the IV is greater than 0.1.
- `min_obs_rate`
  Default: None
  Suggestion: The minimum threshold of the observation rate. `0.05` is usually chosen based on experience.
  The `min_obs_rate` argument defines the minimum threshold of the observation rate. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.05, which means the `%getIvPerVar()` macro will mark `is_obs_pass` as 1 if the observation rate is greater than 0.05 and lower than `max_obs_rate`.
- `max_obs_rate`
  Default: None
  Suggestion: The maximum threshold of the observation rate. `0.4` is usually chosen based on experience.
  The `max_obs_rate` argument defines the maximum threshold of the observation rate. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.4, which means the `%getIvPerVar()` macro will mark `is_obs_pass` as 1 if the observation rate is less than 0.4 and greater than `min_obs_rate`.
- `min_bin_size`
  Default: None
  Suggestion: The minimum threshold of the number of bins. Usually set at 3.
  The `min_bin_size` argument defines the minimum number of bins. For example, in the `MainMonotonicFirstBining.sas` script you can pass 3, which means the `%getIvPerVar()` macro will mark `is_bin_pass` as 1 if the number of bins is higher than 3 and lower than `max_bin_size`.
- `max_bin_size`
  Default: None
  Suggestion: The maximum threshold of the number of bins. Usually set at 6.
  The `max_bin_size` argument defines the maximum number of bins. For example, in the `MainMonotonicFirstBining.sas` script you can pass 10, which means the `%getIvPerVar()` macro will mark `is_bin_pass` as 1 if the number of bins is less than 10 and greater than `min_bin_size`.
- `min_bad_count`
  Default: None
  Suggestion: The minimum number of positive events (default/bad). Usually set at 1.
  The `min_bad_count` argument defines the minimum number of positive events; in risk analysis these are typically default or bad events. For example, in the `MainMonotonicFirstBining.sas` script you can pass 1, which means the `%getIvPerVar()` macro will mark `is_bad_count_pass` as 1 if the bad count is higher than 1.

The output of the `%getIvPerVar()` macro. It shows the IV information for all discretized variables.
- `iv`: the information value of each discretized variable.
- `is_iv_pass`: true (1) if the IV is higher than `min_iv`, otherwise false (0).
- `is_obs_pass`: true (1) if the observation rate is between `min_obs_rate` and `max_obs_rate`, otherwise false (0).
- `is_bad_count_pass`: true (1) if the bad count is higher than `min_bad_count`, otherwise false (0).
- `is_bin_pass`: true (1) if the number of bins is between `min_bin_size` and `max_bin_size`, otherwise false (0).
- `is_woe_pass`: true (1) if the WoE values are monotonic, otherwise false (0).
- `woe_dir`: `asc` if the WoE values show a monotone increasing pattern, `desc` if they show a monotone decreasing pattern. Otherwise, null is given.
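
The relationship between the per-bin counts, the IV, and some of the pass flags listed above can be sketched in a few lines. The Python example below is a hypothetical re-creation for illustration, not the `%getIvPerVar()` implementation. It assumes the standard definition IV = Σ (%good − %bad) · ln(%good / %bad), strict inequalities, and that the observation-rate and bad-count checks must hold for every bin. The function names and counts are made up.

```python
import math

def information_value(goods, bads):
    """IV = sum over bins of (%good - %bad) * ln(%good / %bad)."""
    total_good, total_bad = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        share_good, share_bad = g / total_good, b / total_bad
        iv += (share_good - share_bad) * math.log(share_good / share_bad)
    return iv

def iv_flags(goods, bads, min_iv=0.1, min_obs_rate=0.05, max_obs_rate=0.4,
             min_bad_count=1):
    """Hypothetical re-creation of three of the pass flags described above."""
    total = sum(goods) + sum(bads)
    obs_rates = [(g + b) / total for g, b in zip(goods, bads)]
    return {
        "is_iv_pass": int(information_value(goods, bads) > min_iv),
        "is_obs_pass": int(all(min_obs_rate < r < max_obs_rate for r in obs_rates)),
        "is_bad_count_pass": int(all(b > min_bad_count for b in bads)),
    }

# Made-up counts for three bins (not taken from the German credit data).
flags = iv_flags(goods=[300, 250, 150], bads=[60, 100, 140])
print(flags)
```

The IV is the same under either sign convention of WoE, because swapping goods and bads flips the sign of both factors in each term.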
`%printWoeBarLineChart()` macro example:
```sas
%printWoeBarLineChart(lib_name, min_iv);
```
- `lib_name`
  Default: None
  Suggestion: The library that holds the data used to print the WoE bar chart.
  The `lib_name` argument defines the library used to store the data for plotting. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means that the `%printWithoutCname()` macro will output the files and result table to the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
- `min_iv`
  Default: None
  Suggestion: The minimum threshold of the information value. Usually set higher than 0.1.
  The `min_iv` argument defines the minimum threshold of the information value. For example, in the `MainMonotonicFirstBining.sas` script you can pass 0.1, which means the `%printWoeBarLineChart()` macro will show the WoE bar chart of a variable if its IV is greater than 0.1.

The output of running the `%printWoeBarLineChart()` macro. It shows the WoE bar charts of the variables whose IV is greater than `min_iv`.
`%exportSplitRule()` macro example:
```sas
%exportSplitRule(lib_name, output_file);
```
- `lib_name`
  Default: None
  Suggestion: The library assigned to store the split rule exported by the macro.
  The `lib_name` argument defines the library assigned to store the split rule exported by the macro. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means that the `%printWithoutCname()` macro will output the files and result table to the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
- `output_file`
  Default: None
  Suggestion: The output file path to which the split rule will be exported.
  The `output_file` argument defines the output file path to which the split rule will be exported. For example, in the `MainMonotonicFirstBining.sas` script you can pass `/home/u60021675/output/`, which means the `%exportSplitRule()` macro will export the split rule to the "/home/u60021675/output/" directory. Note that you DON'T need to quote the directory path.

The output of the `%exportSplitRule()` macro.
`%cleanBinsDetail()` macro example:
```sas
%cleanBinsDetail(bins_lib);
```
- `bins_lib`
  Default: None
  Suggestion: The library that stores the files created during the algorithm process, which will be cleared at the end. We suggest using the same value assigned in the `%init()` macro.
  The `bins_lib` argument defines the library whose files will be cleared at the end. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means the bins summary files and exclude files will be deleted.

The output of running the `%cleanBinsDetail()` macro. It shows that the bins_summary and exclude files were deleted.
`CMB Algorithm` macro example:
```sas
%runChiMerge(dataFrame = german_credit_card, x = &x., y = &y.,
             max_bins_threshold = &max_bins_threshold.,
             min_bins = &min_bins., max_bins = &max_bins.,
             min_samples = &min_samples., max_samples = &max_samples.,
             p_value_threshold = &p_value_threshold.,
             libName = &libName.);
```
- `dataFrame`
  Default: None
  Suggestion: A training data set.
  The `dataFrame` argument defines the input data set. The data set must include all independent variables and the target variable (response variable). For example, in the `MainChiMerge.sas` script you can pass `german_credit_card` as the given data set, which is a table created by the `%readCsvFile()` macro.
- `y`
  Default: None
  Suggestion: The column name of the response variable.
  The `y` argument defines the column name of the response variable. For example, in the `MainChiMerge.sas` script you can pass `CostMatrixRisk`, which exists in the data set `german_credit_card`.
- `x`
  Default: None
  Suggestion: The column names of the variables on which to run the algorithm.
  The `x` argument defines the column names of the chosen variables. Multiple columns can be passed at once. For example, in the `MainChiMerge.sas` script you can pass `Purpose`, which exists in the data set `german_credit_card`.
- `max_bins_threshold`
  Default: None
  Suggestion: The maximum number of initial attributes a variable may have for the CMB algorithm to run.
  The `max_bins_threshold` argument defines the upper limit for running the CMB algorithm: if the number of initial unique attributes of the given `x` exceeds `max_bins_threshold`, the algorithm stops. For example, in the `MainChiMerge.sas` script you can pass `20`, which means that if the given `x` has more than 20 unique attributes, the algorithm will stop executing.
- `min_bins`
  Default: None
  Suggestion: The minimum number of bins kept in the final WoE summary output for each given `x`.
  The `min_bins` argument defines the minimum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainChiMerge.sas` script you can pass `3`, which means the algorithm will create at least 3 bins for each given `x`.
- `max_bins`
  Default: None
  Suggestion: The maximum number of bins kept in the final WoE summary output for each given `x`. Note that `max_bins` must be higher than `min_bins`.
  The `max_bins` argument defines the maximum number of bins kept in the final WoE summary output for each given `x`. For example, in the `MainChiMerge.sas` script you can pass `7`, which means the algorithm will create at most 7 bins for each given `x`.
- `min_samples`
  Default: None
  Suggestion: Integer or float: the minimum number of samples kept in each bin. Usually `min_samples` is suggested to be `5%` of the total population.
  The `min_samples` argument defines the minimum number of samples kept in each bin. If the given value is between 0 and 1 (0 < `min_samples` < 1), the program treats it as a proportion of the total population and calculates the corresponding number of samples. For example, in the `MainChiMerge.sas` script you can pass `0.05`, which means each bin must hold at least `5%` of the total samples, calculated automatically by the program. Alternatively, you can pass `%sysevalf(1000 * 0.05)`, which directly sets the minimum to 50 samples.
- `max_samples`
  Default: None
  Suggestion: Integer or float: the maximum number of samples kept in each bin. Usually `max_samples` is suggested to be 40% of the total population to avoid a serious concentration issue in WoE binning.
  The `max_samples` argument defines the maximum number of samples kept in each bin. For example, in the `MainChiMerge.sas` script you can pass `0.4`, which means each bin may hold at most `40%` of the total samples, calculated automatically by the program. Alternatively, you can pass `%sysevalf(1000 * 0.4)`, which directly sets the maximum to 400 samples.
- `p_value_threshold`
  Default: None
  Suggestion: The minimum p-value threshold the algorithm uses to decide whether to merge two bins. A higher `p_value_threshold` usually makes the algorithm merge bins less often.
  The `p_value_threshold` argument defines the minimum p-value threshold. For example, in the `MainChiMerge.sas` script you can pass `0.35`, which means that the algorithm will merge two bins if the p-value of the statistical test (Chi-Squared Test) conducted between them is greater than `0.35`. The threshold is decreased iteratively if no pair of bins yields a p-value greater than the given parameter while the final number of bins is still greater than `max_bins`.
- `libName`
  Default: None
  Suggestion: The library that will store the WoE summary result and other tables.
  The `libName` argument defines the library that will be loaded to show the IV summary result. For example, in the `MainMonotonicFirstBining.sas` script you can pass `TMPWOE`, which means that the `%printWithoutCname()` macro will output the files and result table to the `TMPWOE` library assigned by `LIBNAME TMPWOE "/home/u60021675/output";`.
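
The two ways of passing `min_samples` and `max_samples` described above (a proportion or an absolute count) can be illustrated with a short Python sketch. It is not the project's SAS code; the function name `resolve_sample_limit` is hypothetical, and whether the program rounds the resulting count is not stated in this README.

```python
def resolve_sample_limit(value, total_obs):
    """Interpret a min_samples/max_samples setting.

    A value strictly between 0 and 1 is treated as a share of the total
    population; any other value is taken as an absolute count.
    """
    if 0 < value < 1:
        return value * total_obs
    return value

# With 1000 observations, 0.05 and 50 describe the same minimum bin size.
print(resolve_sample_limit(0.05, 1000))
print(resolve_sample_limit(50, 1000))
print(resolve_sample_limit(0.4, 1000))
```

Passing a proportion keeps the configuration valid when the size of the training data changes, whereas an absolute count has to be recomputed by hand.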
- The output files created by the CMB algorithm.

The final output of the WoE binning result is stored in `woe_summary_<x>.sas7bdat`. Details are shown below:
SAS Studio 3.8 with SAS 9.4
- German Credit Risk Analysis: Beginner's Guide. (2022). Retrieved 9 June 2022, from Kaggle.
- Pavel Mironchyk and Viktor Tchistiakov. "Monotone optimal binning algorithm for credit risk modeling." (2017): 1-15.
- SAS OnDemand for Academics. (2022). Retrieved 9 June 2022.
- MOBPY : Monotonic-Optimal-Binning
- Darren Tsai(https://www.linkedin.com/in/darren-yucheng-tsai/)
- Denny Chen(https://www.linkedin.com/in/dennychen-tahung/)
- Thea Chan([email protected])