Scorecard Development Stages

Historical Data Preparation
Training and Validation Datasets
Characteristics Selection using Information Value
Logistic Regression for Scorecard Calculation
Automatic Variables Selection with Stepwise Methods
Basic Scorecard Quality Assessment
Reject Inference
Scorecard Scaling

Scorecard Development Stages
To develop a scorecard, we use historical data from the credit portfolio that have been cleaned of invalid records. Each record in the portfolio must carry one of two possible values that characterize the borrower’s credit behavior: “Good” or “Bad”.


After the historical data have been prepared, scorecard development requires the following steps:

  • Preparation of samples for training the scorecard;
  • Preparation and selection of borrower characteristics that are used to develop the scorecard (including binning);
  • Development of the scorecard;
  • Assessment of scorecard quality;
  • Implementation of the scorecard and its subsequent monitoring.

Since the goal is to obtain the best possible scorecard, this sequence of steps is typically repeated in cycles. The following changes can be made to improve the quality of the scorecard:

  • Changes in training and validation datasets;
  • Changes in defined transformations of borrower characteristics;
  • Changes in scorecard development methods.

After implementation, the developed scorecard is used in real time to support operational decision-making. To maintain its performance, the scorecard should be monitored periodically (monthly, quarterly or annually) and readjusted or recalibrated if necessary.

Historical Data Preparation

Historical data should include a set of characteristics and a target variable. All scorecard development methods quantify the relationship between the characteristics (input columns) and “Good/Bad” performance (target column).


Example of borrower characteristics. Scorecard characteristics are similar to those used in subjective expert judgment.


Characteristics whose use is not reasonable are excluded. For example, in the illustrated case the “Good/Bad” distribution does not depend on the Home Ownership characteristic, so that characteristic is excluded.


All borrowers should be marked in the target column as “Good” or “Bad” according to a defined rule. For example, borrowers who pay within 30 days are marked as “Good”, while borrowers with a delay of more than 90 days are marked as “Bad”.
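The example rule above can be sketched in a few lines. This is only an illustration: the function name and the field `max_days_past_due` are hypothetical, and treating delays between 31 and 90 days as indeterminate (returned as `None`) is an assumption, since the rule in the text does not say how to mark them.

```python
from typing import Optional

GOOD_MAX_DAYS = 30  # threshold taken from the example rule in the text
BAD_MIN_DAYS = 90   # threshold taken from the example rule in the text

def label_borrower(max_days_past_due: int) -> Optional[str]:
    """Return "Good", "Bad", or None for a borrower's worst delay in days."""
    if max_days_past_due <= GOOD_MAX_DAYS:
        return "Good"
    if max_days_past_due > BAD_MIN_DAYS:
        return "Bad"
    # Assumption: delays of 31 to 90 days are left unlabeled (indeterminate).
    return None

labels = [label_borrower(d) for d in (0, 30, 45, 91)]
```

Records that come back as `None` would need a business decision: either exclude them from development or extend the rule to cover them.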



Certain types of accounts need to be excluded from the dataset. For example, records of bank employees or VIP clients could be excluded.


Data Cleansing



Training & Validation Datasets
How do we know whether the scorecard will perform on new borrowers as well as it does on historical data?

To provide an answer to this question, the credit portfolio data are divided into two parts: training and validation.
The training dataset is used to train the scorecard, while the validation dataset is used exclusively for the purpose of validation.
As a rule, 80% of available data is used to train the scorecard, while the remaining 20% is used for the purpose of validation. If there is a large amount of data, the validation dataset can comprise as much as 50%.

If a certain category of borrowers is insufficiently represented in the credit portfolio, its distribution in the training and validation datasets must be specifically controlled, since this category must be proportionally represented in both datasets.
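A minimal sketch of such a controlled (stratified) split is shown below, using only the Python standard library. The function name, the record layout and the fixed random seed are assumptions made for illustration; the 80/20 proportion comes from the text.

```python
import random
from collections import defaultdict

def stratified_split(records, key, train_share=0.8, seed=42):
    """Split records so that each category given by key(record)
    is represented proportionally in both parts."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    train, valid = [], []
    for members in groups.values():
        members = members[:]          # do not reorder the caller's data
        rng.shuffle(members)
        cut = round(len(members) * train_share)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Hypothetical portfolio: 100 records, 10% of them "Bad".
portfolio = [{"id": i, "target": "Bad" if i % 10 == 0 else "Good"}
             for i in range(100)]
train, valid = stratified_split(portfolio, key=lambda r: r["target"])
```

Here the split is stratified by the target column; the same function can be keyed on any under-represented borrower category.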



What is binning

Binning means the process of transforming a numeric characteristic into a categorical one as well as re-grouping and consolidating categorical characteristics.

Why binning is required

  • Increases scorecard stability: some characteristic values occur rarely and lead to instability if not grouped together.
  • Improves quality: grouping attributes with similar predictive strength increases scorecard accuracy.
  • Makes it possible to understand the logical trends of “Good/Bad” deviations for each characteristic.
  • Prevents scorecard impairment that could otherwise result from rare reversal patterns and extreme values.
  • Prevents the overfitting (overtraining) that is possible with numerical variables.

Automatic binning

The most widely used automatic binning algorithm is Chi-merge. Chi-merge divides the range of a characteristic into intervals (bins) so that neighboring bins differ from each other as much as possible in their ratio of “Good” to “Bad” records. For visual cross-verification of automatic binning results, one can use WOE (Weight of Evidence) values (Fig. 1).
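The idea behind Chi-merge can be sketched as follows: repeatedly merge the pair of adjacent bins whose “Good”/“Bad” ratios are most alike (lowest chi-square statistic). This is a simplified illustration, not the implementation of any particular tool. Bins are represented only by hypothetical `(good, bad)` counts, and stopping at a fixed number of bins (`max_bins`) is an assumption; implementations often stop at a chi-square threshold instead.

```python
def chi2_pair(a, b):
    """Chi-square statistic for two adjacent bins given as (good, bad) counts."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for col in (0, 1):
            expected = sum(row) * (a[col] + b[col]) / total
            if expected > 0:
                chi2 += (row[col] - expected) ** 2 / expected
    return chi2

def chi_merge(bins, max_bins):
    """Merge the most similar adjacent bins until max_bins remain."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        chi = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = chi.index(min(chi))  # most similar neighbors
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins

# Hypothetical (good, bad) counts for five initial intervals, in value order.
initial_bins = [(90, 10), (88, 12), (70, 30), (72, 28), (40, 60)]
merged_bins = chi_merge(initial_bins, max_bins=3)
```

In this toy data the two pairs of near-identical neighbors are merged, leaving three bins with clearly different “Good”/“Bad” ratios.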

Analysis and manual correction of automatic binning

Sometimes due to particularities in data distribution automatic binning needs to be corrected manually.

The example below shows a range divided into 5 bins by automatic binning (Fig. 1); only one boundary needs to be adjusted manually.

For example, we manually move the second boundary of the range slightly to the left, from 5.02 to 4.94 (Fig. 2), and recalculate the WOE values.

As a result, we get a smooth, decreasing WOE curve, indicating a correct distribution of values across the ranges.

Sometimes, for easier analysis, automatic binning ranges should be adjusted to logical boundaries. For example, boundaries for Age or Job Time can be adjusted to integers.
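To check whether a binning yields a smooth WOE trend, the WOE of each bin can be recomputed after every manual adjustment. The sketch below assumes the common definition WOE = ln(share of “good” in the bin / share of “bad” in the bin); the text does not define WOE, and some tools use the opposite sign. The counts are made up for illustration.

```python
import math

def woe_per_bin(goods, bads):
    """WOE of each bin from per-bin counts of good and bad records."""
    total_good, total_bad = sum(goods), sum(bads)
    # Note: a bin with zero goods or zero bads would need special handling.
    return [math.log((g / total_good) / (b / total_bad))
            for g, b in zip(goods, bads)]

def is_monotonic(values):
    """True if the sequence never changes direction."""
    pairs = list(zip(values, values[1:]))
    return all(a >= b for a, b in pairs) or all(a <= b for a, b in pairs)

# Hypothetical counts for five bins, ordered by characteristic value.
goods = [400, 300, 150, 100, 50]
bads = [10, 15, 20, 25, 30]
woe = woe_per_bin(goods, bads)
trend_ok = is_monotonic(woe)
```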

Fig. 1 – Sharply-varied and illogical WOE graph after automatic binning


Fig. 2 – Smooth and logical WOE decline after manual correction


Characteristics Selection Using Information Value
Each of the available borrower characteristics contains information on his/her credit quality. It is only natural that some characteristics are less important than others for the purpose of assessing creditworthiness; for example, the borrower’s income bracket is more important than the borrower’s family status. How can we assess the rationality and effectiveness of the characteristics’ use in the process of developing the scorecard?

For that purpose, we use the Information Value (IV) criterion:

IV = Σ_i (DistrGood_i − DistrBad_i) × ln(DistrGood_i / DistrBad_i)

Here, for each category i of the selected borrower characteristic:
DistrGood_i is the share of all “good” borrowers that fall into the category;
DistrBad_i is the share of all “bad” borrowers that fall into the category.

Value of IV         Statistical strength
less than 0.02      a very weak statistical relation
0.02 – 0.1          a weak statistical relation
0.1 – 0.3           an average statistical relation
0.3 – 0.5           a strong statistical relation
greater than 0.5    an extremely strong statistical relation
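As a rough illustration, IV can be computed from per-category counts using the standard formula IV = Σ (DistrGood_i − DistrBad_i) × ln(DistrGood_i / DistrBad_i) and then mapped to the strength bands in the table. The counts below are invented, and the handling of exact boundary values (for example, IV equal to 0.1) is an assumption, since the table does not specify it.

```python
import math

def information_value(goods, bads):
    """IV of a characteristic from per-category good and bad counts."""
    total_good, total_bad = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        distr_good = g / total_good  # share of all goods in this category
        distr_bad = b / total_bad    # share of all bads in this category
        # Note: categories with zero goods or bads need special handling.
        iv += (distr_good - distr_bad) * math.log(distr_good / distr_bad)
    return iv

def iv_strength(iv):
    """Map an IV to the bands of the table (boundary handling assumed)."""
    if iv < 0.02:
        return "very weak"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "average"
    if iv <= 0.5:
        return "strong"
    return "extremely strong"

# Hypothetical characteristic with three categories.
iv = information_value(goods=[300, 400, 300], bads=[50, 30, 20])
strength = iv_strength(iv)
```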

IMPORTANT: Borrower characteristics with an extremely strong statistical relationship can nullify the contribution of other, less informative characteristics, so their use in scorecard development requires special attention.


In the example, the LTV (Loan to Value Ratio) characteristic has an extremely strong statistical relationship, while the indicator that characterizes the borrower’s family status potentially has no bearing on creditworthiness.

Logistic Regression for Scorecard Calculation

A statistical scorecard is the result of applying the logistic regression algorithm to the prepared dataset from the credit portfolio.



Logistic regression estimates the logarithm of the odds of a positive outcome of a loan.
Here, the odds are defined as the ratio of the probability of the positive outcome to the probability of the negative one.

The log-odds of the positive outcome are modeled as a linear combination of borrower characteristics, and the weight of each individual characteristic is estimated by the regression algorithm:
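The relationship between probability, odds and the weighted combination of characteristics can be illustrated numerically. In the standard logistic model, ln(odds) = b0 + b1·x1 + … + bn·xn. The weights, intercept and inputs below are invented for illustration (in practice they come from fitting the regression), and using WOE-coded inputs is an assumption.

```python
import math

def odds(p_good):
    """Odds of the positive outcome: P(good) / P(bad)."""
    return p_good / (1.0 - p_good)

def log_odds(intercept, weights, x):
    """Linear combination of characteristics, as in logistic regression."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))

def probability_good(z):
    """Inverse of the log-odds transform (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model with two WOE-coded characteristics.
intercept = 1.0
weights = [0.8, 0.5]
borrower = [0.4, -0.2]

z = log_odds(intercept, weights, borrower)
p = probability_good(z)
```

Taking `math.log(odds(p))` recovers `z`, which is the sense in which logistic regression "estimates the log-odds".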



Finally, after the scaling procedure has been performed, the resulting weights of borrower characteristics become the points of the scorecard:
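The text does not specify how the scaling is done. One widely used convention, shown here purely as an assumption, fixes a base score at given base odds and a number of "points to double the odds" (PDO), so that Score = Offset + Factor × ln(odds). All parameter values and model coefficients below are hypothetical.

```python
import math

def scaling_parameters(base_score, base_odds, pdo):
    """Factor and offset for Score = offset + factor * ln(odds)."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset

def attribute_points(woe, beta, intercept, n_characteristics, factor, offset):
    """Points for one attribute; intercept and offset are spread evenly
    over the characteristics so that the points sum to the total score."""
    return ((beta * woe + intercept / n_characteristics) * factor
            + offset / n_characteristics)

# Assumed scale: 600 points at odds of 50:1, 20 points double the odds.
factor, offset = scaling_parameters(base_score=600, base_odds=50, pdo=20)

# Hypothetical model: intercept, weights, and one borrower's WOE values.
intercept = 1.0
betas = [0.8, 0.5]
woes = [0.4, -0.2]

points = [attribute_points(w, b, intercept, len(betas), factor, offset)
          for w, b in zip(woes, betas)]
total_score = sum(points)
```

In a production scorecard the points would normally be rounded to integers, which this sketch omits.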

Automatic Variables Selection with Stepwise Methods

The statistical scorecard reflects the existing interrelations between different borrower characteristics. Its quality therefore depends directly both on the predictive power of the borrower characteristics in use and on the combination of these characteristics included in development.

When developing a scorecard, we can use either all of the available borrower characteristics,


or only those characteristics that are the most important from the point of view of forecasting the borrower’s creditworthiness:


When we select the most important characteristics, each of them is considered as a candidate for participating in the scorecard and can be included or excluded from the final set of characteristics.

Forward stepwise selection starts with an empty set of borrower characteristics, extending it with the most important characteristics.
Backward stepwise selection initially uses all borrower characteristics and excludes insufficiently meaningful ones.
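Forward stepwise selection can be sketched generically. The `evaluate` function is a placeholder for whatever quality measure is used (for instance, fitting a model on the candidate set and measuring its quality on validation data); here it is replaced by a toy lookup table so the example runs on its own. The function names, the `min_gain` stopping rule and all numbers are assumptions for illustration.

```python
def forward_stepwise(candidates, evaluate, min_gain=0.01):
    """Start from an empty set and repeatedly add the characteristic that
    improves the quality measure most, until the gain is below min_gain."""
    selected = []
    best_score = evaluate(selected)
    remaining = list(candidates)
    while remaining:
        scored = [(evaluate(selected + [c]), c) for c in remaining]
        top_score, top = max(scored)
        if top_score - best_score < min_gain:
            break
        selected.append(top)
        remaining.remove(top)
        best_score = top_score
    return selected

# Toy quality measure: each characteristic adds a fixed, made-up amount.
toy_gain = {"LTV": 0.30, "Income": 0.15, "Age": 0.05, "Family status": 0.001}

def toy_evaluate(subset):
    return sum(toy_gain[c] for c in subset)

chosen = forward_stepwise(toy_gain.keys(), toy_evaluate)
```

Backward selection follows the same pattern in reverse: start with all candidates and repeatedly drop the one whose removal costs the least quality.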

In every practical case, the selection of the method for the development of the scorecard depends on the number of available borrower characteristics and their forecasting ability.
In any case, the best reason for selecting one method or another is the quality of the resulting scorecard.

Basic Scorecard Quality Assessment

Basic scorecard quality assessment evaluates how well the scorecard identifies (classifies) “good” and “bad” borrowers, as well as the correctness of the risk distribution.

Classification quality assessment


The most common tools for assessing classification quality are the ROC curve and the Gini coefficient.
The higher the ROC curve lies and the larger the Gini coefficient is, the better the scorecard separates “good” from “bad” borrowers.
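For reference, the Gini coefficient is related to the area under the ROC curve (AUC) by Gini = 2 × AUC − 1. The sketch below computes AUC directly as the share of (good, bad) pairs in which the good borrower has the higher score, which is adequate for small samples; the scores and outcomes are invented.

```python
def auc(scores, is_good):
    """Share of (good, bad) pairs ranked correctly; ties count as half."""
    good = [s for s, g in zip(scores, is_good) if g]
    bad = [s for s, g in zip(scores, is_good) if not g]
    wins = sum(1.0 if g > b else 0.5 if g == b else 0.0
               for g in good for b in bad)
    return wins / (len(good) * len(bad))

def gini(scores, is_good):
    """Gini coefficient derived from AUC."""
    return 2.0 * auc(scores, is_good) - 1.0

# Hypothetical validation sample: scorecard scores and actual outcomes.
scores = [720, 680, 650, 600, 560, 500]
is_good = [True, True, False, True, False, False]

sample_auc = auc(scores, is_good)
sample_gini = gini(scores, is_good)
```

The pairwise computation is quadratic in sample size; for large portfolios a rank-based calculation is preferable.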

Confusion matrix


The confusion matrix shows the percentage (or the number) of:

  • correctly identified “good” borrowers (True Positive);
  • “good” borrowers mistakenly identified as “bad” ones (False Negative);
  • “bad” borrowers mistakenly identified as “good” ones (False Positive);
  • correctly identified “bad” borrowers (True Negative).

The confusion matrix shown corresponds to a scorecard that correctly identifies 96% of “good” borrowers and 78% of “bad” borrowers.
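The four cells can be counted as follows, with “Good” treated as the positive class as in the list above. The sample is constructed artificially so that it reproduces the 96% and 78% rates mentioned in the text; it is not the data behind the original figure.

```python
def confusion_matrix(actual, predicted):
    """Count TP, FN, FP, TN with "Good" as the positive class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == "Good":
            counts["TP" if p == "Good" else "FN"] += 1
        else:
            counts["TN" if p == "Bad" else "FP"] += 1
    return counts

# Artificial sample: 100 good and 50 bad borrowers.
actual = ["Good"] * 100 + ["Bad"] * 50
predicted = (["Good"] * 96 + ["Bad"] * 4      # predictions for the goods
             + ["Bad"] * 39 + ["Good"] * 11)  # predictions for the bads

cm = confusion_matrix(actual, predicted)
good_rate = cm["TP"] / (cm["TP"] + cm["FN"])  # share of goods identified
bad_rate = cm["TN"] / (cm["TN"] + cm["FP"])   # share of bads identified
```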

Risk distribution validation


A correct risk distribution shows a monotonic increase in the odds of the “good” outcome as the score increases. Only such distributions make it possible to formulate rules for working with borrowers based on their score, to apply risk-based pricing, etc.

IMPORTANT: The scorecard that does not demonstrate an increase in the odds of the “good” outcome must be re-developed, even if all the other quality indicators are acceptable.
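A simple way to check this requirement is to compute the good-to-bad odds in each score band, ordered from lowest to highest score, and verify that they rise from band to band. The band counts below are hypothetical, and requiring a strict increase (rather than allowing equal odds in neighboring bands) is an assumption.

```python
def band_odds(bands):
    """Good:bad odds for each score band given as (good, bad) counts."""
    return [good / bad for good, bad in bands]

def odds_increase(odds):
    """True if odds rise strictly from each band to the next."""
    return all(a < b for a, b in zip(odds, odds[1:]))

# Hypothetical (good, bad) counts per score band, lowest scores first.
valid_bands = [(120, 60), (300, 60), (500, 50), (700, 35)]
broken_bands = [(120, 60), (300, 60), (280, 70), (700, 35)]

valid_ok = odds_increase(band_odds(valid_bands))
broken_ok = odds_increase(band_odds(broken_bands))
```

In the second example the odds fall from 5 to 4 in the third band, so that scorecard would fail the check.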

Reject Inference


Reject Inference is a method of improving the quality of the scorecard based on the use of data contained in rejected loan applications.

When developing a scorecard, we normally use information on those borrowers who have previously been granted a loan (approved applicants). However, the number of potential customers is significantly higher and a correctly developed scorecard must be able to perform as expected in the context of the entire population of potential customers.


The behavior of new types of borrowers can significantly differ from the behavior of the borrowers included in our credit portfolio (approved applicants).

To improve our knowledge of potential borrowers, we can use information on those customers who applied for and were refused a loan (rejected applicants).

To develop a scorecard, we need to identify each borrower as either “good” or “bad”. However, no such information is available for rejected loan applications: we cannot tell for sure to which group a borrower would have belonged had he/she been granted a loan. Reject Inference methods are intended to provide the most accurate way to assign “good” or “bad” status to rejected applications, so that they can be included in the development set on which the scorecard is built.
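The text does not say which Reject Inference method is used. As one simple possibility, a "hard cutoff" approach scores each rejected application with a preliminary model built on approved applicants and marks it “Good” or “Bad” depending on a threshold. Everything in the sketch below (function names, the `ltv` field, the toy probability function and the 0.5 cutoff) is hypothetical.

```python
def infer_reject_labels(rejects, p_good, cutoff=0.5):
    """Assign inferred labels to rejected applications using the estimated
    probability of a good outcome from a preliminary model."""
    labeled = []
    for rec in rejects:
        target = "Good" if p_good(rec) >= cutoff else "Bad"
        labeled.append({**rec, "target": target, "inferred": True})
    return labeled

def toy_p_good(rec):
    """Stand-in for a preliminary scorecard built on approved applicants."""
    return 0.9 if rec["ltv"] < 0.8 else 0.3

approved = [
    {"id": 1, "ltv": 0.60, "target": "Good", "inferred": False},
    {"id": 2, "ltv": 0.90, "target": "Bad", "inferred": False},
]
rejects = [{"id": 3, "ltv": 0.70}, {"id": 4, "ltv": 0.95}]

# Development set: approved applicants plus rejects with inferred labels.
development_set = approved + infer_reject_labels(rejects, toy_p_good)
```

Keeping an `inferred` flag makes it possible to compare the scorecard built on the augmented set with the one built on approved applicants only.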


Scorecard Scaling

The final scorecard is produced by generating the final set of characteristics. Note that you are not limited to the characteristics selected in the preliminary scorecard: some characteristics may appear weaker or stronger after reject inference.

In addition to meeting the requirements of the statistical development procedure, the scorecard must be in line with business requirements, conform to common sense and, where possible, not contradict expert opinion.

For example:
In the process of scorecard development, the following points were obtained in the Age category:


This distribution contradicts what is common knowledge in some countries, namely that older customers are more reliable: here, borrowers of 40-45 years of age have the lowest score.

Expert analysis must find the reasons for this contradiction. If the cause is a flaw in the historical data, it must be addressed by expert correction of the score distribution.

For that purpose, 40-45-year-old borrowers are assigned a score of 14 (instead of 0). At the same time, to preserve the scale of the scorecard, the youngest borrowers are assigned the minimum score of 0 (instead of 4).
