Modification of the Random Forest Algorithm for Credit Scoring and Its Comparison with Gradient Boosting, Random Forest and CART


FEDERAL STATE EDUCATIONAL INSTITUTION

OF HIGHER EDUCATION

NATIONAL RESEARCH UNIVERSITY

HIGHER SCHOOL OF ECONOMICS

Saint Petersburg School of Economics and Management

Department of Management

Bachelor's thesis

Modification of the Random Forest Algorithm for Credit Scoring and Its Comparison with Gradient Boosting, Random Forest and CART

Karagezyan Vagram

In the field 38.03.02 Management

Educational programme "Management"

Saint Petersburg

2019

Abstract

Credit scoring is one of the oldest applications of analytics, in which financial institutions apply statistical analysis and machine learning models to assess the creditworthiness of potential borrowers. Throughout its history many techniques have been created for creditworthiness assessment, but creating new models and developing old ones still remains an important problem, because even a one-percent growth in the accuracy of scoring models can significantly increase the profit of financial institutions. Thus, the goal of this research paper is to develop one of the well-known machine learning models, called Random forest, and compare it with CART, Gradient boosting and the original Random forest on two scoring datasets.

The developed model demonstrates worse results than the other three models on one of the datasets, and on the other it outperforms only CART. Thus, the use of this model on these two particular datasets brings no benefit. Nevertheless, more comparisons on more datasets should be done to understand the effectiveness of implementing the model in the scoring field. Besides, some suggestions concerning further research and development of the new algorithm are made in this research paper.

Keywords: Credit-scoring, CART, Random forest, Gradient boosting

Table of contents

  • Introduction
  • 1. Theoretical Foundation
    • 1.1 History of credit-scoring
    • 1.2 CART algorithm
    • 1.3 Ensemble methods
    • 1.4 Random forest algorithm
    • 1.5 Gradient boosting algorithm
    • 1.6 Conclusion.
  • 2. Research question
  • 3. Methodology
    • 3.1 Objectives of the thesis
    • 3.2 New method description
    • 3.3 Datasets' description
    • 3.4 Criteria of comparison
    • 3.5 Cross-validation
  • 4. Results
    • 4.1 Acquaintance with the first dataset
    • 4.2 Acquaintance with the second dataset
    • 4.3 Models Construction and their results on the first dataset
    • 4.4 Models Construction and their results on the second dataset
  • 5. Conclusion
  • References
  • Appendix 1

Introduction

“Credit in recent years has turned into a quite significant component in everyday life. It allows accessing to resources today with an agreement to repay over a period of time, usually at regular intervals” (Marques, Garcia, & Sanchez, 2013). There are lots of forms of credit, such as auto loans, student loans, credit cards etc.

Actually, the main problem of any lender is to identify “bad” borrowers prior to granting credit (Vojtek, & Kocenda, 2006).

“The decision to give loans was traditionally based upon subjective judgement of human experts, using past experiences and some guiding principles. This method has a lot of significant disadvantages, among which are high training costs and frequent incorrect decisions” (Marques, Garcia, & Sanchez, 2013). “These shortcomings have led to a rise in more formal and accurate methods to assess the risk of default and credit scoring has become a primary tool to evaluate credit risk” (Marques, Garcia, & Sanchez, 2013).

“The aim of credit scoring is essentially to classify loan applicants into two classes, i.e., good payers (i.e., those who are likely to keep up with their repayments) and bad payers (i.e., those who are likely to default on their loans)” (Brown, & Mues, 2012).

“Actually, credit scoring is one of the oldest applications of analytics where lenders and financial institutions perform statistical analysis to assess the creditworthiness of potential borrowers to help them decide whether or not to grant credit” (Oskarsdottir, Bravo, Sarraute, Vanthienen, & Baesens, 2019). “Fair Isaac was founded in 1956 as one of the first analytical companies offering retail credit scoring services in the United States of America and its FICO score (ranging between 300 and 850) has been used as a key decision instrument by financial institutions, insurers and even employers” (Oskarsdottir, Bravo, Sarraute, Vanthienen, & Baesens, 2019).

The first corporate credit scoring models date back to the late sixties, with Edward Altman developing the Z-score model for bankruptcy prediction (Altman, 1968). Since then a wide range of classification techniques has been proposed in the credit scoring literature, including statistical techniques, such as linear discriminant analysis and logistic regression, and non-parametric models, such as k-nearest neighbors and decision trees.

"Actually, according to Henley and Hand, companies could make significant future savings if an improvement of only a fraction of a percent could be made in the accuracy of the credit scoring techniques implemented" (Henley, & Hand, 1997). The statement was made in 1997, but considering the growth of several types of bank loans in many countries, such as the United States of America (2019), it is still relevant.

In spite of all the models created, developing these models and creating new ones still remains a significant problem, because of the statement above. That is why the goal of this research paper is to develop one of the methods of data science, called Random forest, and compare it with other modifications of the decision tree algorithm on two credit-scoring datasets. The new model is a mixture of the k-nearest neighbors and Random forest algorithms. In this model, firstly the k nearest neighbors of each observation are identified. After that the differences between the independent variables of the observation and the independent variables of its first, second, ..., k-th nearest neighbors are measured. Using these variables and also the dependent variables of the nearest neighbors, the Random forest algorithm is constructed. The model has some advantages. One of the most important is the growth in the number of observations, so the algorithm can train better on the training data and demonstrate better results on the test data. Unfortunately, there are also some disadvantages. For example, if a dataset has 30,000 observations, then while constructing the model the number of observations increases and becomes k*30,000, and a lot of time is needed to train the model. So the increased number of observations is at the same time a very important advantage and a disadvantage. The second disadvantage is possible overfitting, which "is the phenomenon detected when a learning algorithm fits the training data set so well that noise and the peculiarities of the training data are memorized" (Allamy, & Rafiqul, 2014).

The comparison will show whether the model can give better results on credit-scoring data or not. Many other research papers about credit scoring techniques have been written, and the main difference of this research paper is the implementation of the new method.

To reach the objective of the research paper the following tasks have to be solved:

a) analyze appropriate literature about credit scoring and tree-based models;

b) create the algorithm of the new model;

c) implement the algorithm in the R programming language;

d) construct the models on two credit-scoring datasets and compare them on test data using the chosen criteria.

The results of the thesis could be relevant for banks and financial institutions that use credit scoring in their decision-making process, for analytical companies that deal with credit scoring, and also for data scientists, because the new model can be used not only on credit-scoring data but also on several other kinds of data where a classification problem has to be solved.

The structure of the thesis will be the following:

a) firstly, the history of credit scoring and some popular methods will be presented;

b) secondly, the research question will be stated;

c) thirdly, the data and methods used in the thesis will be described;

d) fourthly, the results will be described;

e) finally, conclusions about the comparison of the methods will be drawn.

The comparison will be made between the decision tree method and some of its modifications.

Decision tree (CART), Random forest, Gradient boosting and the new model, which will be constructed in the methodology part, are the methods used in the thesis. It also has to be mentioned that several criteria, such as accuracy, sensitivity and specificity, which will be described in the methodology part, will be used to compare all the methods.

If the new model works better on specific data, then companies which use credit scoring in their decision-making process can implement the model and increase their profit. The new model can also be used on several other kinds of data where a classification problem has to be solved.

1. Theoretical Foundation

1.1 History of credit-scoring.

"Credit risk is most simply defined as the potential that a bank borrower or counterparty will fail to meet its obligations in accordance with agreed terms" (Basel Committee on Banking Supervision, 1999). Before the twentieth century bankers had generally dealt with a relatively small client base, so at that time credit risk could be evaluated on an individual basis. By the first quarter of the twentieth century, the potential client base had expanded enormously, and therefore there was a radical change in credit risk evaluation.

One of the methods of credit risk assessment is credit rating. "Credit rating means an opinion regarding the creditworthiness of an entity, a debt or financial obligation, debt security, preferred share or other financial instrument, issued using an established and defined ranking system of rating categories" (Weissova, Kollar, & Siekelova, 2015).

Credit ratings are intended to analyze and evaluate the creditworthiness of corporate and sovereign issuers of debt securities. The other method of credit risk assessment, which unlike credit rating can also be used to assess the creditworthiness of private persons, is credit scoring.

“Credit scoring can be formally defined as a mathematical model for the quantitative measurement of credit” (Hanic, Dzelihodzic, & Dzelihodzic, 2013). “Another definition says that credit scoring is the set of decision models and their underlying techniques that aid lenders in the granting of consumer credit” (Hanic, Dzelihodzic, & Dzelihodzic, 2013). It is significant to mention that not only banks and financial institutions can use credit scoring, but also other companies seeking to analyze their customer risk.

“The scientific background to modern credit scoring is provided by the pioneering work of R. A. Fisher (1936), who devised a statistical technique called discriminant analysis to differentiate groups in a population through measurable attributes, when the common characteristics of the members of the group are unobservable” (Marquez, 2008). In 1941 D. Durand recognized that the approach of discriminant analysis could be used to distinguish between good and bad loans.

The first credit scorecard, which dealt with indirect lending to car buyers, was created by Fair Isaac and Co. for America Finance Inc.

"The route to success of credit scoring techniques was heavily mined, because establishing creditworthiness of a debtor using an automated process based on discriminant analysis was viewed as a frontal attack on conventional banking wisdom, painstakingly acquired through several millennia" (Marquez, 2008). Nevertheless, the complete acceptance of credit scoring took place with the issuance in the United States of America of the Equal Credit Opportunity Act and its amendments in 1975-76. It became illegal to reject a loan on the basis of gender, religion or race, unless the decision "was empirically derived and statistically based". But before that, in 1968, the first model using a multivariate approach, the Z-score model, had been presented by Edward Altman. The model is based on financial indicators, each of which has an appropriate weight, and the weighted sum of these indicators is the Z-score. The value of the Z-score tells to which zone a potential borrower belongs. Five important financial indicators were included in the first general formula of the Z-score:

a) working capital/total assets;

b) retained earnings/total assets;

c) earnings before interest and taxes/total assets;

d) market value of equity/book value of total liabilities;

e) sales/total assets.

The variables have the coefficients 0.012, 0.014, 0.033, 0.006 and 0.999 respectively. If the Z-score was less than 1.81, the potential borrower was in the distress zone; if the Z-score was between 1.81 and 2.99, the borrower was in the grey zone; and if the Z-score was more than 2.99, in the safe zone. It is significant to mention that there are special Altman formulas for the Z-score depending on the type of the company's business, and in the different models the boundaries of the zones are different.
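In its original 1968 specification (a standard reference form, added here for clarity; the first four ratios are expressed as percentages) the model can be written as

```latex
Z = 0.012 X_1 + 0.014 X_2 + 0.033 X_3 + 0.006 X_4 + 0.999 X_5 ,
```

where X1, ..., X5 are the five ratios listed above, Z < 1.81 indicates the distress zone, 1.81 <= Z <= 2.99 the grey zone and Z > 2.99 the safe zone.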

For instance, the Z-score for non-manufacturing industrials and emerging markets includes the following variables:

a) (current assets-current liabilities)/total assets;

b) retained earnings/total assets;

c) earnings before interest and taxes/total assets;

d) book value of equity/total liabilities.

The subsequent development of both statistical and artificial intelligence methods has made credit scoring more accurate. Nowadays many different credit scoring techniques exist, among which are such popular methods as classification trees, logistic regression, Random forest, Gradient boosting, neural nets, etc. Three of these algorithms, namely CART, Random forest and Gradient boosting, will be presented in this research paper.

1.2 CART algorithm


“A decision tree is a flowchart-like tree structure, where each internal node represents a test on an attribute, each branch represents an outcome of the test, class label is represented by each leaf node (or terminal node)” (Vayssieres, Plant, & Allen-Diaz, 2000).

"Classification and Regression Trees, introduced by Leo Breiman in 1984, in their turn belong to a family of algorithmic methods generating decision trees from a set of learning cases" (Vayssieres, Plant, & Allen-Diaz, 2000). They operate by recursive partitioning of the set into subsets that are more homogeneous in terms of the response variable. There are two very significant problems to solve while constructing an effective decision tree. The questions are the following:

a) how to find good splits;

b) when to stop, in order to avoid over-fitting the data.

The CART procedure considers all possible splits for all variables. It ranks each splitting rule on the basis of a goodness-of-split criterion reflecting the degree of homogeneity achieved in the child nodes. Homogeneity is assessed with an impurity function, which in the case of CART is the Gini index.

Its value is maximum for a node with equal class proportions and zero for a node which contains only one of the classes. Once the best split has been made to separate the data into two parts, the algorithm evaluates all possible splits in each subset. The process is repeated recursively until a node reaches a preset minimum number of cases or all of the node's cases belong to the same class.
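For a node with K classes the Gini index can be written as follows (standard definition, added here for clarity):

```latex
G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2},
```

where p_k is the proportion of cases of class k in the node; for two classes G reaches its maximum of 0.5 when both classes have equal proportions and equals 0 when the node is pure.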

CART has a lot of advantages. Firstly, it does not require the specification of any functional form because it is a non-parametric procedure. Secondly, variables can be chosen and used several times because at each stage CART selects the variable holding the most information for the part of the multivariate space it is currently working on. "This use of conditional information constitutes perhaps the most important advantage of CART" (Vayssieres, Plant, & Allen-Diaz, 2000). The other significant characteristic is robustness. In the evaluation of splits each of n cases has only a weight of one among n, so extreme values do not have undue leverage. As for outliers in the response variable, they are generally separated into their own nodes, where they no longer influence the rest of the tree.

The last important advantage of CART is that, compared to some other machine learning and statistical methods, it does not require much time to train. This advantage gives an opportunity to construct, on the basis of CART, an ensemble of models, which can be defined as "a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples" (Dietterich, 2000).

CART also has some disadvantages. One is that orthogonal partitions of the multivariate space are not always optimal. For instance, linear and simple curvilinear structures of the data may be obscured by the decision tree. A second disadvantage is that, as the tree grows, the identification of additional predictive factors becomes increasingly difficult, mainly because later splits are based on fewer cases than the initial ones. Thus, parametric methods generally give better results on small datasets compared to decision trees.

1.3 Ensemble methods

“A classifier ensemble (also referred to as committee of learners, mixture of experts, multiple classifier system) consists of a set of individually trained classifiers (base classifiers) whose decisions are combined in some way, typically by weighted or unweighted voting, when classifying new examples” (Marquez, Garcia, & Sanchez, 2012). “The main idea of ensemble classification is to take advantage of the base classifiers and avoid their weaknesses” (Xiao, Xiao, & Wang, 2016).

There are homogeneous (single-model) and heterogeneous (multi-model) ensembles. The main difference between these types is that in homogeneous ensembles the individual base learners are of the same type, each trained on a randomly generated training set, while in heterogeneous ensembles the individual base learners are of different types.

Ensemble methods can also be divided, via the design of their arbitrators, into linear ensembles, where the arbitrator combines the outputs of the base learners using a linear technique (for instance, averaging), and nonlinear ensembles, where no assumptions are made about the input that is given to the ensemble.

“It has been shown theoretically and experimentally that ensemble classification tends to be an effective methodology for improving the accuracy and stability of a single classifier” (Xiao, Xiao, & Wang, 2016).

One of the well-known ensemble methods is bagging. The main idea of bagging is the reduction of variance of a given base procedure, such as decision trees. The CART algorithm, which has already been described, has quite high variance, which means that if the training data is divided into two parts and a CART model is constructed on each of them, the results will probably be quite different. Taking many training sets from the population, constructing a model on each of them and averaging the models to get the final model could solve this problem, because for an average of n estimates, each with variance a, the variance of the average is a/n. In practice, however, this approach is problematic, because there is usually no access to a huge amount of training data. Instead, the bootstrap approach can be used, in which repeated sampling with replacement from the training data is done. In this way, instead of one training set, n bootstrap samples are obtained. After that CART models are trained on these bootstrap samples, and the final model is constructed by averaging in the case of a regression problem and by voting in the case of a classification problem. Usually deep CART models are trained on the bootstrap samples, because they have much less bias and much more variance, and, as has already been mentioned, the variance problem is solvable.
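The variance-reduction argument can be stated compactly (a standard result, added here for clarity): if each of the B models has variance sigma^2 and the models were built on independent training sets, then

```latex
\operatorname{Var}\left( \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{\,b}(x) \right) = \frac{\sigma^{2}}{B}.
```

In practice the bootstrap samples are not independent, so the reduction is smaller, but averaging still lowers the variance of the final model.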

Another well-known ensemble model is boosting. While bagging creates multiple copies of the original training data using the bootstrap, fits a separate decision tree to each copy and then combines the trees by averaging (in case of regression) or voting (in case of classification), boosting grows trees sequentially, using information from previously grown trees, and does not use bootstrap sampling. The boosting algorithm for regression consists of the following steps:

a) set the estimated function of the relationship between the dependent and independent variables to 0 and the residuals ri = yi for all observations in the training set;

b) for b = 1, 2, ..., B fit a tree with d splits to the residuals of the training data, update the estimated function by adding a shrunken version of the new tree, and update the residuals (see the update equations sketched below);

c) construct the final model, which is the sum of the constructed trees, each scaled by the shrinkage parameter λ.
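The updates in steps (b) and (c) can be written as follows, using the standard presentation of boosting for regression trees (the notation is added here and is not taken verbatim from the thesis):

```latex
\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^{\,b}(x), \qquad
r_i \leftarrow r_i - \lambda \hat{f}^{\,b}(x_i), \qquad
\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^{\,b}(x),
```

where f^b is the tree with d splits fitted at iteration b and λ is the shrinkage parameter.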

The construction of a boosting ensemble to solve classification tasks is approximately the same process as described above, but in the case of classification other loss functions are used, such as the logistic loss (Bernoulli loss), the Adaboost loss, etc. The main idea of the boosting algorithm is to learn slowly, unlike a single decision tree, which fits one large tree to the data, which amounts to fitting the data hard and potentially overfitting.

Boosting has three tuning parameters:

1) the number of trees B;

2) the shrinkage parameter λ. The parameter controls the rate at which boosting learns. Typical values are 0.01 and 0.001, and the right choice depends on the problem;

3) the number d of splits in each tree, which controls the complexity of the boosted ensemble.

1.4 Random forest algorithm

“Random forest algorithm can be defined as a group of un-pruned classification or regression trees, trained on bootstrap samples of the training data using random feature selection in the process of tree generation” (Brown, & Mues, 2012).

As can be guessed from the definition, in the Random forest algorithm a number of decision trees are built on bootstrapped training samples. But in the process of tree building, each time a split in a tree is considered, only a random sample of m predictors is considered as candidates for the split. This technique is used to reduce the correlation between the trees. For instance, if there were a few very strong predictors, these predictors would probably appear at the top of most trees, and thus the trees would look quite similar to each other.

Random forest algorithm has a lot of parameters, among which the most important are:

a) number of variables to possibly split at in each node;

b) minimal terminal node size;

c) fraction of observations to sample in each tree;

d) number of trees.
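These parameters map directly onto arguments of the randomForest() function from R's randomForest package. A minimal illustrative call (the formula, the data object and the parameter values are placeholders, not the settings used later in the thesis) could look like this:

```r
library(randomForest)

set.seed(1)
rf_model <- randomForest(
  Risk ~ .,                                 # placeholder formula: response and predictors
  data     = train_set,                     # placeholder training data
  mtry     = 3,                             # a) number of variables tried at each split
  nodesize = 10,                            # b) minimal terminal node size
  sampsize = floor(0.8 * nrow(train_set)),  # c) observations sampled for each tree
  ntree    = 500                            # d) number of trees
)
```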

One of the most important advantages of the Random forest algorithm compared to CART (Classification and Regression Trees), as has been mentioned above, is that it has much less variance, meaning that it will show approximately the same results on two different training datasets.

The Random Forest algorithm also has numerous other advantages, which make its application in the credit scoring field effective. Firstly, a lot of both empirical and theoretical research has shown that the algorithm has a high accuracy rate, which is actually the most important advantage (Tang, Cai, & Ouyang, 2018). Secondly, the Random Forest algorithm has a good tolerance of outliers and noise, meaning that a few strange observations will hardly influence the accuracy of the model. Besides, as has already been mentioned, the Random forest algorithm belongs to the bagging type of ensembles, which usually have less variance than the base model. Paying attention to this fact, it becomes clear that the algorithm has a low likelihood of overfitting. "The other advantage is the fact that multiple trees can be trained efficiently in parallel, so Random forest usually performs better than the Classification and Regression Trees algorithm with large datasets" (Tanaka, Kinkyo, & Hamori, 2016).

As all other algorithms, Random Forest also has some disadvantages. Firstly, a large number of trees can make the algorithm very slow and, consequently, ineffective for real-time predictions. The second very important disadvantage is connected with the interpretability of the model. "Actually, the interpretability is a significant component in the credit scoring field, because usually banking supervision authorities require banks to subject to comprehensive credit risk models, verifying the soundness of bank choices" (Yufei, Chuanzhe, YuYing, & Nana, 2017). "Besides, according to Lessmann, understandable models are needed to convince managers to shift from simple models to more complex, but at the same time more accurate ones, because the idea of building various models and then combining their results to make the final decision is counter-intuitive and can seem strange to managers" (Yufei, Chuanzhe, YuYing, & Nana, 2017). Actually, an indicator called variable importance exists, which helps to understand which variables are more significant, but it does not demonstrate any relationships between the variables used in model construction.

The last disadvantage which will be mentioned is the following. It has already been said that Random Forest uses bootstrap sampling during training. It is clear that in most cases the number of non-defaults in credit scoring datasets is much higher than the number of defaults. When bootstrap sampling is done, a situation can occur where in some bootstrap samples the number of defaults is so low that the final model has huge difficulties in identifying defaults and consequently has very low sensitivity.

1.5 Gradient boosting algorithm

“Gradient boosting is an ensemble algorithm that improves the accuracy of a predictive function through incremental minimization of the error term” (Brown, & Mues, 2012). It becomes clear from the definition that in gradient boosting the same approach as in other boosting models is used. “The main difference from other boosting models is that in Gradient boosting algorithm the new base learners are constructed to be maximally correlated with the negative gradient of the loss function, associated with the whole ensemble” (Natekin, & Alois, 2013).
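A compact way to state this (standard gradient boosting notation, added here for clarity): at iteration m the new base learner h_m is fitted to the pseudo-residuals

```latex
r_{im} = -\left[ \frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)} \right]_{f = \hat{f}_{m-1}},
\qquad
\hat{f}_m(x) = \hat{f}_{m-1}(x) + \nu\, h_m(x),
```

where L is the loss function associated with the whole ensemble and ν is the learning rate (shrinkage).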

"Actually, one of the most important advantages of the Gradient boosting algorithm is the fact that it has shown considerable success in both various machine learning challenges and practical applications" (Natekin, & Alois, 2013). Besides, like the Random forest algorithm, Gradient boosting does not make any assumptions about the type of relationship between variables, which makes it a quite useful tool for analyzing high-dimensional and fuzzy data.

One of the disadvantages of Gradient boosting, as of the Random forest algorithm, is that, depending on the size of the dataset and the chosen parameters, such as the number of trees, a lot of time can be needed to train the model. For Gradient boosting this problem is even more pressing than for Random forest, because while Random forest constructs trees in parallel, Gradient boosting does it sequentially.

The second disadvantage is connected with the interpretability of the model. Gradient boosting does not show the type of connection between the dependent variable and the independent ones, and, as has already been mentioned above, interpretability can be very important for the authorities of financial institutions when making decisions.

1.6 Conclusion

All three mentioned methods, as has been noted, have both advantages and disadvantages.

While CART (Classification and Regression Trees) is very easy to implement and quite well interpretable, but usually demonstrates worse accuracy than the two other models, Random forest and Gradient boosting are poorly interpretable and, depending on the size of the dataset, sometimes need a lot of time to train, but usually demonstrate better results. So the models can be used in different cases. For example, if there is a need to understand the type of relationships between the variables, CART can be used; but if the accuracy of the model is much more significant for a financial institution, Random forest or Gradient boosting will be more convenient to use.

It is also quite important to mention that, in spite of the fact that the new method proposed in this research paper is based on the Random forest algorithm, it does not neutralize its disadvantages. Moreover, it needs more time to train and does not show the types of relationships between the variables. The main goal of this method is to increase the accuracy of predictions. The model will be described, and both its advantages and disadvantages will be shown, in the methodology part.

2. Research question

As has already been discussed in the introduction and theoretical foundation parts, credit scoring has a very long history, and its starting point is 1941, when Durand recognized that the discriminant analysis technique could be used to identify bad borrowers (those with a high probability of default). 1968 is also a very important year in the history of credit scoring techniques, because in that year the first multivariate approach (Altman's Z-score) was created.

Further development of both statistical and machine learning techniques has led to the emergence of such popular methods as Classification and Regression Trees, the Random forest algorithm, Gradient boosting, Logistic regression, Neural nets, etc. Some of these techniques have been shown to be successful in forecasting credit scores (Hue, Hurlin, Tokpavi, & Dumitrescu, 2017).

Nevertheless, as has already been mentioned, an improvement of only a fraction of a percent in credit scoring techniques can give financial institutions an opportunity to significantly increase profit (Henley, & Hand, 1997). The statement was made in 1997, but given the growth of some types of loans in several countries, creating new models and modifying old ones still remains important. The goal of this research paper is dedicated to this problem and consists of two parts. The first part includes the modification of a very popular machine learning model, called Random Forest, and the second part includes the comparison of the modified model with CART (Classification and Regression Trees), the original Random forest and Gradient boosting. The main reason for setting such a goal is to understand the applicability of the modified model in the credit scoring field. Actually, the results of the comparison cannot definitively reject the possible applicability of the model in credit scoring, because demonstrating worse results on two datasets does not mean that the model will demonstrate worse results on all credit scoring datasets. Nevertheless, the results of this research paper can demonstrate some strengths and weaknesses, which can be very important for further development of the model and for making new comparisons on other datasets.

To reach the goal of this research paper, four tasks have to be solved. Firstly, appropriate literature about credit scoring and tree-based models has to be analyzed. Secondly, an algorithm of the modified model has to be created. Thirdly, the algorithm has to be implemented in the programming language R. The last step includes training the four mentioned models on two credit scoring datasets and comparing them using the chosen criteria. It is significant to mention that cross-validation will be used for the comparison. The definition of cross-validation, the number of folds and other important questions will be given in the next chapter, but it has to be mentioned that the main goal of using cross-validation folds is to neutralize the influence of a particular test set's specificities.

Two hypotheses are set:

a) the modified model will demonstrate better results on the first credit scoring dataset in terms of accuracy, sensitivity and specificity;

b) the modified model will demonstrate better results on the second credit scoring dataset in terms of accuracy, sensitivity and specificity.

The three mentioned criteria will be defined and described in the next chapter, but here it is important to mention why only one criterion is not used. Actually, in most cases credit scoring datasets contain many more non-defaults than defaults. Suppose a credit scoring dataset has 80 "no defaults" and 20 "defaults". A model which predicts no default for all borrowers reaches 80% accuracy without identifying a single default. It is clear that the implementation of such a model is meaningless. That is why using only one criterion for the evaluation of machine learning models, especially in the credit scoring field, can lead to wrong conclusions.

3. Methodology

3.1 Objectives of the thesis.

The main objective of this research paper consists of two parts:

a) by the use of some data transformations, modify a well-known machine learning algorithm, called Random forest;

b) compare the modified model with the original Random forest algorithm, Gradient boosting and CART (Classification and Regression Trees).

The first step towards the objectives of the research is to collect the credit scoring datasets on which the models will be constructed. The second step is the choice of criteria and of the type of test data (a randomly chosen test set or cross-validation), which will be used to assess and compare the performance of the models. After these two steps, the models have to be constructed on the chosen datasets and the comparison has to be made.

3.2 New method description.

The modified model, which will be compared to the other tree-based methods, is based on the Random Forest algorithm. While the original Random Forest predicts the dependent variable using the independent ones, the new model predicts the dependent variable using the differences between the independent variables of an observation and the independent variables of its nearest neighbors. Suppose we have the following credit scoring dataset with 4 observations, 2 independent variables and a dependent variable (0 for no default, 1 for default).

Table 1

An example of scoring dataset

ID   X1   X2   Y
1    20   3    0
2    27   5    0
3    23   2    1
4    17   8    0

The original Random Forest algorithm predicts Y using X1 and X2. The modified method uses another approach. Firstly, the number of nearest neighbors k is chosen; let k = 2. Secondly, the nearest neighbors are identified using only the independent variables (the method of nearest neighbors' identification is described further below). After that a new dataset is created which contains the differences between the independent variables of each observation and the independent variables of its 2 nearest neighbors.

The following table shows the IDs of the 2 nearest neighbors of each observation. Distances between observations are calculated using the Euclidean distance, but before the calculation the variables are standardized.

Table 2

Nearest neighbors of observations

ID   First nearest neighbor   Second nearest neighbor
1    3                        2
2    3                        1
3    1                        2
4    2                        1

After the identification of the nearest neighbors, a new dataset is created whose number of rows equals k multiplied by the number of rows of the original data (in this case 8) and whose number of columns equals the number of columns of the original data plus one (in this case 5). For this dataset the following table is obtained.

Table 3

Final dataset

ID   Differences of X1   Differences of X2   Y of the neighbor   Y
1    -3                  1                   1                   0
2    4                   3                   1                   0
3    3                   -1                  0                   1
4    -10                 3                   0                   0
1    -7                  -2                  0                   0
2    7                   2                   0                   0
3    -4                  -3                  0                   1
4    -3                  5                   0                   0

A Random forest model is trained on this new dataset. Here a question arises: if the model, using the differences between the independent variables of the observation with ID 1 and its first nearest neighbor together with the dependent variable of that neighbor, predicted 1 (default) for the first observation, and in the same way (using the second nearest neighbor) predicted 0 for the first observation, what would the final prediction be? That is why, after constructing the Random Forest on the dataset, a matrix is created with the same number of rows as in the original dataset and k columns (the number of nearest neighbors). At the intersection of the first row and the first column, the prediction for the first observation using its first nearest neighbor is written; at the intersection of the i-th row and the j-th column, the prediction for the i-th observation using its j-th nearest neighbor. After that the average of each row is calculated. This number is the predicted probability of default.
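A minimal sketch of this transformation and prediction scheme in R is given below. It assumes numeric predictors and uses the FNN and randomForest packages; the function and object names are illustrative and are not taken from the thesis code.

```r
library(FNN)            # get.knn() for nearest-neighbour search
library(randomForest)   # randomForest()

knn_rf_transform <- function(X, y, k) {
  Xs <- scale(X)                             # standardize before distance calculation
  nn <- FNN::get.knn(Xs, k = k)$nn.index     # indices of the k nearest neighbours of each row
  long <- lapply(seq_len(k), function(j) {
    data.frame(X - X[nn[, j], , drop = FALSE],   # differences of the independent variables
               y_neighbor = y[nn[, j]],          # dependent variable of the j-th neighbour
               y = y)                            # dependent variable of the observation itself
  })
  do.call(rbind, long)                       # k blocks of n rows each, stacked
}

# Toy data from Table 1
X <- data.frame(X1 = c(20, 27, 23, 17), X2 = c(3, 5, 2, 8))
y <- factor(c(0, 0, 1, 0))
k <- 2

train_long <- knn_rf_transform(X, y, k)
rf_fit     <- randomForest(y ~ ., data = train_long)

# One prediction per (observation, neighbour) pair, then average over the k neighbours
p_pairs   <- predict(rf_fit, train_long, type = "prob")[, "1"]
p_default <- rowMeans(matrix(p_pairs, ncol = k))   # probability of default per observation
```

For a test observation, the same transformation would be applied using its k nearest neighbors found in the training set, and the k resulting predictions would again be averaged.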

This model has a very important advantage. It gives an opportunity to significantly increase the number of observations, which, in its turn, can help the Random Forest algorithm train better.

Nevertheless, the model also has some disadvantages. One of the most important is the time the model needs to train. For example, for a credit scoring dataset with 30,000 observations and 10 nearest neighbors chosen for model construction, the number of rows of the new dataset on which the Random Forest has to be constructed becomes 300,000. Training a Random Forest or another ensemble model on such a huge dataset can be problematic. Paying attention to the fact that several financial institutions have credit scoring datasets with far more than 30,000 observations, this is actually a significant problem.

Besides, artificially increasing the number of observations can become a reason for overfitting, "which is a phenomenon, when a model has high accuracy for a classifier when evaluated on the training set but low accuracy when evaluated on a separate test set" (Subramanian, & Simon, 2013). The last problem which is essential to mention is the choice of k. Actually, a very high number of neighbors can be chosen first, and after that this number can be decreased to choose the k whose use in the model provides the best results (according to certain criteria) on test or cross-validation data. But, as has already been mentioned, a lot of time is needed to construct the modified model on big scoring datasets, so such a method of choosing the number of nearest neighbors may be time-consuming.

The results of the comparison (described in the following chapter) will better demonstrate the advantages and disadvantages of the model and also show whether the model can be implemented in the credit scoring field.

3.3 Datasets' description

Comparison of the methods will be done on two datasets. The first credit scoring dataset has 1000 observations, 9 independent variables and a dependent variable connected with the credit risk (Hofmann, 2000).

Among the independent variables are the age of the borrower in years, gender and job. A complete list of the independent variables and their explanations can be seen in the following table.

Table 4

Names and descriptions of the first dataset variables

Variable name      Explanation
Age                Age in years
Sex                Male or female
Job                Unskilled and non-resident, unskilled and resident, skilled, highly skilled
Housing            Own, rent or free
Saving accounts    Little, moderate, quite rich, rich
Checking account   In Deutsch Marks
Duration           In months
Purpose            Car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others
Risk               Good or bad

The second dataset, in its turn, consists of 30000 observations, 24 independent variables and a dependent variable (Lichman, 2013). Among the independent variables are age, marriage, education etc. Complete names and explanations can be seen in the following table.

Table 5

Names and descriptions of the second dataset variables

Variable name                Explanation
ID                           ID of each client
LIMIT_BAL                    Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX                          Male or female
EDUCATION                    Graduate school, graduate university, graduate high school, others
MARRIAGE                     Marital status (married, single or others)
AGE                          Age in years
PAY_0                        Repayment status in September 2005 (pay duly, payment delay for one month, payment delay for two months, ..., payment delay for nine months and above)
PAY_2                        Repayment status in August 2005 (scale same as above)
PAY_3                        Repayment status in July 2005 (scale same as above)
PAY_4                        Repayment status in June 2005 (scale same as above)
PAY_5                        Repayment status in May 2005 (scale same as above)
PAY_6                        Repayment status in April 2005 (scale same as above)
BILL_AMT1                    Amount of bill statement in September 2005 (NT dollar)
BILL_AMT2                    Amount of bill statement in August 2005 (NT dollar)
BILL_AMT3                    Amount of bill statement in July 2005 (NT dollar)
BILL_AMT4                    Amount of bill statement in June 2005 (NT dollar)
BILL_AMT5                    Amount of bill statement in May 2005 (NT dollar)
BILL_AMT6                    Amount of bill statement in April 2005 (NT dollar)
PAY_AMT1                     Amount of previous payment in September 2005 (NT dollar)
PAY_AMT2                     Amount of previous payment in August 2005 (NT dollar)
PAY_AMT3                     Amount of previous payment in July 2005 (NT dollar)
PAY_AMT4                     Amount of previous payment in June 2005 (NT dollar)
PAY_AMT5                     Amount of previous payment in May 2005 (NT dollar)
PAY_AMT6                     Amount of previous payment in April 2005 (NT dollar)
default.payment.next.month   Default payment (yes or no)

The second dataset, unlike the first one, is about credit cards, and in its case a possible default on the next monthly payment has to be predicted. In both cases the source of the datasets is Kaggle, a platform where data science competitions are held.

As can be seen, the two datasets have quite different numbers of observations and independent variables; thus the performance of the models will be compared in different situations, and the advantages, disadvantages and best conditions for using a certain model will be shown more clearly.

It is quite important to mention that the results of the comparison do not mean that the model which demonstrates better results here will also demonstrate better results on other credit scoring datasets; in other words, the results of this research paper cannot definitively reject the possible applicability of the modification in the credit scoring field. Besides, the modification creates a big opportunity for further research, because other models, such as Gradient boosting, Logistic regression, etc., can be used with the same data transformations.

3.4 Criteria of comparison.

Accuracy, sensitivity and specificity will be used to compare the models. A question may arise why not to use only accuracy and why the other two criteria are needed. Actually, in a lot of datasets where a classification problem has to be solved, there can be a situation when the number of observations of one class is much higher than of the other. In credit scoring datasets, usually the number of non-defaults is higher than the number of defaults. The two datasets which will be used for model construction are no exception. In such cases high accuracy alone does not mean that the model works well. For instance, suppose a credit scoring dataset has 100 observations, among which are 80 non-defaults and 20 defaults. A model which predicts no default for all the observations reaches 80% accuracy, but it is clear that the implementation of such a model is meaningless.

The three criteria can be described in terms of TP (true positive), TN (true negative), FN (false negative) and FP (false positive):

a) “accuracy=(TN+TP)/(TN+TP+FN+FP)*100%, meaning that accuracy is the percentage ratio of the number of correct assessments and the number of all assessments;

b) sensitivity=TP/(TP+FN), meaning that sensitivity is the ratio of the number of true positive assessment and the number of all positive assessments;

c) specificity=TN/(TN+FP), meaning that specificity is the ratio of the number of true negative assessments and the number of all negative assessments” (Rocha-Muniz, Befi-Lopes, & Schochat, 2014).

Speaking in terms of credit scoring, if a model demonstrates high sensitivity, it offers a greater probability of recognizing positive outcomes, which, in the case of the scoring datasets used in this research paper, are defaults. Higher specificity, in its turn, provides a greater probability of recognizing negative outcomes (in this case, non-defaults).

"A model with high specificity means a high level of rejected credit applicants and trying to minimize a credit risk, corresponds to a conservative credit policy, while a model with high sensitivity (high level of approved credit applicants) corresponds to a risky credit policy, trying to minimize the loss of economic benefit" (Garanin, Lukashevich, & Salkutsan, 2014).

3.5 Cross-validation

"Cross-validation is an estimator widely used to evaluate prediction errors" (Bergmeir, Costantini, & Benitez, 2014). In k-fold cross-validation partitioning of overall data into equal size k blocks takes place. K-1 blocks are used to train a model and the k-th block is used to measure the model performance. Thus, each of the k sets is used to train a model and once to measure forecast performance. Supposing to have a credit scoring dataset with 100 observations and choosing the number of k as 10, firstly 10 sets are created. From first to tenth observations build the first block, from eleventh to twentieth-the second block and so on. After that a model is constructed using as train data all the blocks besides first one and as test data-first block. The same thing is done, using as test data the second, the third and other sets and as train data, all the sets besides them. "By averaging over the k measures, the error estimate using cross-validation has a lower variance compared to an error estimate using only one training and test set and in this way, a more accurate evaluation of the generalization error can be obtained" (Bergmeir, Costantini, & Benitez, 2014).

The choice of the number of folds in cross-validation mostly depends on the amount of available data and the computational cost. Typical choice for k are 5 or 10 (Bergmeir, Costantini, & Benitez, 2014) and having two credit scoring datasets, in one case 5-fold cross-validation will be used to assess the performance of the models and in the second case-randomly chosen test data.
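A minimal 5-fold cross-validation loop of this kind might look as follows in R (the dataset name, the formula and the label of the default class are placeholders, and scoring_metrics() is the helper sketched in the previous section):

```r
library(caret)   # createFolds() for stratified fold indices, train() for model fitting

set.seed(1)
folds <- createFolds(german$Risk, k = 5)     # list of test-set indices, one element per fold

cv_results <- lapply(folds, function(test_idx) {
  train_set <- german[-test_idx, ]
  test_set  <- german[ test_idx, ]
  fit  <- train(Risk ~ ., data = train_set, method = "rf")   # any of the compared models
  pred <- predict(fit, newdata = test_set)
  scoring_metrics(test_set$Risk, pred, positive = "bad")     # "bad" assumed to mark default
})

Reduce(`+`, cv_results) / length(cv_results)  # average accuracy, sensitivity and specificity
```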

4. Results

In the results part of this research paper, firstly the connection between the independent and dependent variables will be described, secondly the packages used to construct the models will be presented, and finally the comparison between the models will be made.

As has been mentioned in the methodology part, one of the datasets has 1000 observations, 9 independent variables and a dependent variable. In 394 observations the variable "Checking account" is missing. Approximately the same situation holds for the variable "Saving accounts", which has 183 missing observations. Since the goal of this research paper is not connected with solving the problem of missing observations, these two variables are removed and are not included in the construction of the models.

4.1 Acquaintance with the first dataset

As has already been mentioned, the first dataset has 1000 observations and 7 usable independent variables (two are ignored). Below the connection between the independent variables and the default risk can be seen.

a) it can be seen from the graph (Figure 1) that the median age of good borrowers is higher than the defaulters';

b) among the female borrowers the number of bad ones is 109 and the number of good ones is 201, while among the male borrowers the numbers are 191 and 499 respectively. As a percentage, approximately 35% of female borrowers are bad, while among the males only about 27% are;

c) the variable "Job" has four different levels, and the percentage of defaulters among them is approximately 32%, 28%, 30% and 34% respectively;

d) the percentage of "bad" borrowers across the levels of the "Housing" variable is approximately 41%, 26% and 39% respectively;

Figure 1. Connection between age and risk of the borrowers

Figure 2. Connection between duration and risk

e) the variable "Purpose" has a lot of different levels, so only the levels with the highest and lowest percentages will be mentioned. The share of defaults is the highest among the borrowers who have taken credit for the vacation/other purpose (approximately 42%), while the lowest is among the borrowers who have taken credit for buying a radio/TV (less than 23%);

f) as can be guessed and seen from the graph, the median credit amount for "bad" borrowers (defaulters) is higher than for "good" borrowers.

Figure 3. Connection between credit amount and risk

4.2 Acquaintance with the second dataset

The connection between the variables of the second dataset and the default risk will be described in the same way: if the variable is categorical, the percentage of defaulters at its different levels will be mentioned; otherwise the connection will be shown graphically with the help of boxplots.

Figure 4. Connection between the amount of given credit and default risk

a) the percentage of defaulters in category 1 of the "SEX" variable (it is not mentioned in the description of the dataset which category is male and which is female) is more than 24%, while in category 2 it is about 20%;

b) the number of levels of the "EDUCATION" variable is higher than for the other variables, and in some levels the number of observations is too low; that is why only the highest and lowest percentages of defaulters among the levels with a sufficient number of observations will be mentioned. The category with the lowest percentage of defaulters is the seventh category (less than 0.06%), and the one with the highest (more than 25%) is the third category;

c) the variable "MARRIAGE" has four different levels, and the percentage of defaulters among them is 9%, 23%, 21% and 26% respectively;

d) the boxplots of the "AGE" variable for defaulters and non-defaulters are approximately the same, therefore the variable will probably not be very significant in the models;

e) the group of variables PAY_0, PAY_2, PAY_3, PAY_4, PAY_5 and PAY_6 are quite similar, as they show the repayment status in different months, and they have too many levels; that is why the percentage of "bad" borrowers is not reported for them. The same holds for the variables BILL_AMT1, ..., BILL_AMT6 and PAY_AMT1, ..., PAY_AMT6.

Figure 5. Connection between age and default risk

4.3 Models Construction and their results on the first dataset

The first dataset on which the models will be constructed is the one with 1000 observations, 9 independent variables and a dependent variable; as has already been mentioned, two of the variables will not be used. The models will be constructed in the programming language R. The "caret" package and its function "train" will be used. The main advantage of this function is that it optimizes some of the parameters of the models. For instance, when training the Random forest algorithm, while the number of trees is set before model training, the number of variables used in the construction of the separate trees is optimized to maximize accuracy. The function also has a disadvantage: if the number of variables and observations in a dataset is large, a lot of time can be needed to optimize the parameters. The first dataset used in this research paper, as has already been mentioned, is not too large, which is why this method is used here.
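A call to caret's train() of the kind described here might look as follows (the data object names, the fixed number of trees and the tuning grid are illustrative assumptions, not the exact settings of the thesis):

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)        # resampling used by train() for tuning

rf_fit <- train(Risk ~ ., data = german_train, method = "rf",
                ntree     = 500,                       # fixed before training
                tuneGrid  = data.frame(mtry = 1:6),    # candidate values optimized by train()
                trControl = ctrl)

cart_fit <- train(Risk ~ ., data = german_train, method = "rpart",
                  trControl = ctrl)                    # tunes the complexity parameter cp

gbm_fit <- train(Risk ~ ., data = german_train, method = "gbm",
                 trControl = ctrl, verbose = FALSE)    # tunes n.trees, depth and shrinkage

predict(rf_fit, newdata = german_test)                 # class predictions on a test set
```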

...
