Modification of the Random Forest Algorithm for Credit Scoring and Its Comparison with Gradient Boosting, Random Forest and CART

The paper analyzes the literature on credit scoring and tree-based models, designs the algorithm of the new model, implements it in the R programming language, constructs the models on two credit-scoring datasets and compares them on test data using chosen criteria.

Category: Management and labour relations
Type: degree thesis
Language: English
Date added: 01.12.2019
File size: 295.1 K


Actually, there is another important problem. When the function optimizes the model by maximizing accuracy on the train data, it can cause overfitting on the test data. Ideally, the four models being compared should have been optimized using cross-validation and only then compared, but as no package or function exists for the new model, this would be problematic and very time consuming. Thus, in the model construction process the train data (the original dataset without the observations from 1 to 200, or 201 to 400 and so on, depending on which validation set is used as test data) will in its turn be divided into two parts: 80% (in this case 640 observations) will be used to construct the model, and 20% (160 observations) will be used to check the accuracy and, accordingly, to optimize parameters. As has already been mentioned, the R programming language, the caret package and its train function will be used. The methods are the following:

a) “rpart” for the Classification tree algorithm;

b) “gbm” for Gradient boosting;

c) “rf” for the original Random forest algorithm;

d) “rf” for the new algorithm. As has been mentioned, the Random forest algorithm is used in the new model construction, but before it is applied some significant changes take place, which make the model different.
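The nested split described above can be sketched in base R as follows (the fold boundaries and the seed here are illustrative, not the exact ones used in the thesis):

```r
# Nested split used for tuning: fold 1 holds out observations 1-200 as
# validation data; of the remaining 800 training rows, 640 (80%) are used
# to construct the model and 160 (20%) to optimize parameters.
set.seed(1998)
train_idx <- setdiff(1:1000, 1:200)        # training rows for fold 1
tune_idx  <- sample(train_idx, 640)        # 80%: model construction
check_idx <- setdiff(train_idx, tune_idx)  # 20%: parameter optimization
length(tune_idx)   # 640
length(check_idx)  # 160
```

For the other folds the held-out block simply shifts (201-400, 401-600 and so on), while the 80/20 split inside the remaining 800 rows is redrawn each time.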

Before presenting the results, a new indicator, called AUC, will be described, because it is one of the most useful and popular indicators for model comparison. “AUC is the area under the receiver operating characteristic (ROC) curve, which in its turn is a technique for visualizing, organizing and selecting classifiers based on their performance” (Fawcett, 2006). “The ROC curve, in its turn, plots sensitivity as a function of commission error (1-specificity) as the threshold changes” (Lobo, Jimenez-Valverde, & Real, 2008). The steps of constructing such a curve will not be described: firstly, the description would be quite long, and secondly, the reader is assumed to be already familiar with the ROC curve.
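For intuition, the AUC can also be computed directly from ranks, without tracing the curve: it equals the probability that a randomly chosen defaulter receives a higher predicted default probability than a randomly chosen good borrower. A minimal base R sketch (the function name, scores and labels below are made up for illustration):

```r
# Rank-based AUC: probability that a random positive (defaulter) is
# scored higher than a random negative (good borrower).
auc <- function(scores, labels) {
  r <- rank(scores)                 # ranks of all predicted scores (ties averaged)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(1, 1, 1, 0, 0, 0, 0)                  # 1 = defaulter
scores <- c(0.9, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1)    # predicted default probabilities
auc(scores, labels)  # 0.9166667 (11 of the 12 positive-negative pairs are ordered correctly)
```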

On the first validation set the new model demonstrates an AUC of approximately 0.53, while the tree model shows 0.5, and the Gradient boosting and Random forest algorithms 0.68 and 0.65 respectively. To make these results more presentable, different thresholds will be used to identify good and bad borrowers and to show the accuracy, sensitivity and specificity of the four models.

For the threshold 0.5, the following results are obtained:

Table 6

Models' results on the first dataset for threshold 0.5

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.715     0            1
Gradient boosting     0.7       0.1053       0.9301
Random forest         0.71      0.386        0.8601
New algorithm         0.72      0.0175       1

Table 7

Models' results on the first dataset for threshold 0.6

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.715     0            1
Gradient boosting     0.7       0.3158       0.8531
Random forest         0.63      0.4561       0.6993
New algorithm         0.715     0            1

Table 8

Models' results on the first dataset for threshold 0.7

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.285     1            0
Gradient boosting     0.64      0.6667       0.6294
Random forest         0.585     0.5965       0.5804
New algorithm         0.715     0            1

Table 9

Models' results on the first dataset for threshold 0.8

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.285     1            0
Gradient boosting     0.42      0.9298       0.2168
Random forest         0.49      0.8772       0.3357
New algorithm         0.71      0.035        0.979

Table 10

Models' results on the first dataset for threshold 0.85

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.285     1            0
Gradient boosting     0.285     1            0
Random forest         0.405     0.9298       0.1958
New algorithm         0.705     0.0877       0.951

Table 11

Models' results on the first dataset for threshold 0.9

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.285     1            0
Gradient boosting     0.7       0.1228       0.9301
Random forest         0.285     1            0
New algorithm         0.685     0.1404       0.9021

Table 12

Models' results on the first dataset for threshold 0.925

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.285     1            0
Gradient boosting     0.285     1            0
Random forest         0.33      0.3509       0.0629
New algorithm         0.665     0.2281       0.8392

Table 13

Models' results on the first dataset for threshold 0.95

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.285     1            0
Gradient boosting     0.285     1            0
Random forest         0.29      1            0.007
New algorithm         0.625     0.2982       0.7552

Table 14

Models' results on the first dataset for threshold 0.975

Model                 Accuracy  Sensitivity  Specificity
Classification tree   0.285     1            0
Gradient boosting     0.285     1            0
Random forest         0.29      1            0.007
New algorithm         0.695     0.3509       0.8322

For the other cross-validation sets the tables will not be presented: there are five sets, and it is not feasible to demonstrate all three criteria for eight different thresholds for each of them. That is why only the AUC of the four models will be mentioned.

For the second validation set, containing the observations from 201 to 400, the models demonstrate the following AUC values:

a) Classification tree algorithm: 0.5;

b) Random forest algorithm: 0.71;

c) Gradient boosting algorithm: 0.67;

d) New algorithm: 0.5.

Finally, the following table is obtained, containing the average AUC of each model over the five validation folds.

Table 15

Average AUC of the models

Model                          AUC
Classification tree algorithm  0.52
Gradient boosting algorithm    0.67
Random forest algorithm        0.66
New algorithm                  0.49

It becomes clear that in terms of AUC the new model demonstrates worse results than the Classification tree, Gradient boosting and Random forest algorithms. But, as has already been mentioned, the number of defaulters in the dataset is much lower than the number of good borrowers, and this is one of the cases where different indicators need to be analyzed before drawing conclusions. For example, on the first validation set the Classification tree model, depending on the chosen threshold, identifies all the borrowers either as defaulters or as good borrowers. So in one case the accuracy is quite high (0.715) but the sensitivity is 0, and in the other case the accuracy is 0.285 and the specificity is 0. Clearly, there is no practical reason to use such a model in the credit scoring field. At the same time, using a threshold of 0.95, the new suggested method demonstrates an accuracy of 0.625, a sensitivity of approximately 0.3 and a specificity of more than 0.75. So, to draw conclusions, a table with the average sensitivity, specificity, accuracy and two new indicators, called precision and F-score, will be presented. The final threshold for each of the four models will be the one which maximizes the F-score (the thresholds will differ between the models), but first it is essential to define the new indicators. Precision is the proportion of correct positive identifications or, mathematically, the number of true positives divided by the sum of true positives and false positives. The F-score is the harmonic mean of sensitivity (recall) and precision. The five criteria (accuracy, sensitivity, specificity, precision and F-score) are presented for different thresholds in the following table:
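The five criteria above can all be computed from the confusion matrix. A base R sketch (the function name and variables are illustrative; here a borrower is classed as a defaulter when the predicted default probability reaches the threshold, which is the general idea rather than the exact convention used in the thesis):

```r
# Confusion-matrix criteria for a given threshold; "bad" (defaulter)
# is treated as the positive class, as in the text.
scoring_metrics <- function(prob_bad, actual, threshold) {
  pred <- ifelse(prob_bad >= threshold, "bad", "good")
  tp <- sum(pred == "bad"  & actual == "bad")
  fp <- sum(pred == "bad"  & actual == "good")
  tn <- sum(pred == "good" & actual == "good")
  fn <- sum(pred == "good" & actual == "bad")
  sens <- tp / (tp + fn)                                 # recall
  spec <- tn / (tn + fp)
  prec <- if (tp + fp == 0) 0 else tp / (tp + fp)
  f1   <- if (prec + sens == 0) 0 else 2 * prec * sens / (prec + sens)
  c(accuracy = (tp + tn) / length(actual),
    sensitivity = sens, specificity = spec,
    precision = prec, f_score = f1)
}

# Toy example: two defaulters, two good borrowers
scoring_metrics(prob_bad = c(0.9, 0.8, 0.2, 0.1),
                actual   = c("bad", "good", "bad", "good"),
                threshold = 0.5)   # every criterion equals 0.5 here
```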

Table 16

Models' results on the first dataset

Threshold  Model                Accuracy  Sensitivity  Specificity  Precision  F-score
0.5        Classification tree  0.7       0.0249       0.9899       0.1        0.0374
0.5        Gradient boosting    0.714     0.1954       0.9389       0.5643     0.2903
0.5        Random forest        0.704     0.1416       0.9594       0.6361     0.2316
0.5        New algorithm        0.7       0.0174       1            0          0
0.6        Classification tree  0.7       0.023        0.9899       0.1        0.0374
0.6        Gradient boosting    0.691     0.3564       0.837        0.487      0.4116
0.6        Random forest        0.693     0.2153       0.9021       0.5905     0.3155
0.6        New algorithm        0.7       0            1            0          0
0.7        Classification tree  0.507     0.4754       0.9899       0.1916     0.2731
0.7        Gradient boosting    0.63      0.5917       0.6528       0.4223     0.4928
0.7        Random forest        0.671     0.3778       0.7734       0.4769     0.4216
0.7        New algorithm        0.699     0            0.9986       0          0
0.8        Classification tree  0.3       1            0            0.3        0.4615
0.8        Gradient boosting    0.458     0.9017       0.2737       0.3495     0.5037
0.8        Random forest        0.601     0.5391       0.6358       0.3896     0.4523
0.8        New algorithm        0.689     0.0138       0.9785       0.18       0.0256
0.85       Classification tree  0.3       1            0            0.3        0.4615
0.85       Gradient boosting    0.364     0.9629       0.1105       0.3176     0.4777
0.85       Random forest        0.57      0.7733       0.514        0.39       0.5185
0.85       New algorithm        0.69      0.028        0.9731       0.25       0.0504
0.9        Classification tree  0.3       1            0            0.3        0.4615
0.9        Gradient boosting    0.323     1            0.0336       0.3077     0.4706
0.9        Random forest        0.5       0.8175       0.3699       0.3594     0.4993
0.9        New algorithm        0.678     0.0892       0.9229       0.3215     0.1397
0.925      Classification tree  0.3       1            0            0.3        0.4615
0.925      Gradient boosting    0.312     1            0.0175       0.304      0.4663
0.925      Random forest        0.461     0.869        0.291        0.3456     0.4945
0.925      New algorithm        0.672     0.0892       0.9229       0.3215     0.1397
0.95       Classification tree  0.3       1            0            0.3        0.4615
0.95       Gradient boosting    0.302     1            0.0029       0.3006     0.4622
0.95       Random forest        0.399     0.9173       0.1805       0.3246     0.4795
0.95       New algorithm        0.656     0.1221       0.8855       0.3187     0.1766
0.975      Classification tree  0.3       1            0            0.3        0.4615
0.975      Gradient boosting    0.3       1            0            0.123      0.2191
0.975      Random forest        0.342     0.9616       0.0911       0.3094     0.4682
0.975      New algorithm        0.635     0.1486       0.8443       0.2696     0.1916

Actually, this dataset is quite difficult to analyze, because even the five indicators mentioned cannot fully demonstrate the situation. For instance, the Classification tree algorithm predicts the same probability of default for all the observations. While on one validation set the chosen threshold predicts only “good borrowers”, giving a sensitivity of 0 and a specificity of 1, on another validation set it predicts only “bad borrowers” (defaulters), giving a sensitivity of 1 and a specificity of 0. On average the model shows quite good results compared to the other models, but it is clear that the implementation of such a model in the credit scoring field is meaningless. So the comparison will be done among the three other models: Gradient boosting, Random forest and the New algorithm. Choosing for each model the threshold which demonstrates the highest F-score, the following results are obtained:

Table 17

Final results of the models on the first dataset

Threshold  Model              Accuracy  Sensitivity  Specificity  Precision  F-score
0.8        Gradient boosting  0.458     0.9017       0.2737       0.3495     0.5037
0.85       Random forest      0.57      0.7733       0.514        0.39       0.5185
0.975      New algorithm      0.635     0.1486       0.8443       0.2696     0.1916

As can be seen from the table, the Gradient boosting and Random forest models demonstrate much better results in terms of sensitivity, precision and F-score, but the accuracy and specificity of the New algorithm are higher. However, a threshold can be chosen for Gradient boosting which demonstrates approximately the same specificity and a much higher sensitivity: with a threshold of 0.6 the sensitivity of the boosting model is 0.3564 and its specificity 0.837. The same holds for Random forest: with a 0.6 threshold its sensitivity is 0.2153 and its specificity 0.9021. So it is clear that on this dataset the new algorithm cannot compete with the two above-mentioned algorithms, but it is still more preferable than the Classification tree. More detailed conclusions will be drawn after the comparison of the models on the second dataset, in the conclusion section.

4.4 Models Construction and their results on the second dataset

In this unit the results of the comparison of the four mentioned models will be presented, but first some significant points should be mentioned. Firstly, in this case cross-validation will not be used, because of the size of the dataset. As has already been mentioned, the second dataset has 30000 observations, 24 independent variables and a dependent variable. Besides, the new algorithm multiplies the number of observations by the chosen k (the number of nearest neighbors), and constructing an ensemble model on such a big dataset takes a very long time, so the results will be compared only on randomly chosen test data.

Secondly, unlike the model construction on the first dataset, here the “caret” package and its “train” function are not used; instead, four different functions are used for the different models:

a) Classification tree: “rpart”;

b) Gradient boosting: “gbm”;

c) Random forest: “randomForest”;

d) New algorithm: “randomForest”.

The main reason for switching to these functions is model construction time. Unlike the “train” function, which optimizes some parameters of the model, these four functions take parameters that are set beforehand, which significantly decreases training time.

Thirdly, for this dataset the k nearest neighbors are identified in a different way. In the first case a matrix with the distances between all the observations was created, but here it is impossible to construct a matrix with 30000 rows and 30000 columns, so the “kd_tree” algorithm is used instead (Wehr & Radkowski, 2018). These are the key differences in the model construction process between the first and second datasets.
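The point of the kd-tree is only to avoid materializing the full 30000 x 30000 distance matrix; conceptually the query is unchanged. A brute-force base R equivalent that computes distances one query row at a time (illustrative only, and far slower than a kd-tree on large data; the function name is made up):

```r
# Finding the k nearest training neighbours of each query row without
# building an n-by-n distance matrix: one row of distances at a time.
knn_indices <- function(train, query, k) {
  t(apply(query, 1, function(q) {
    d <- sqrt(colSums((t(train) - q)^2))  # Euclidean distance to every training row
    order(d)[1:k]                         # row numbers of the k closest
  }))
}

train <- matrix(c(0, 0,
                  1, 0,
                  0, 1,
                  5, 5), ncol = 2, byrow = TRUE)   # 4 scaled observations
query <- matrix(c(0.1, 0), ncol = 2)               # 1 new observation
knn_indices(train, query, k = 2)  # rows 1 and 2 are the closest
```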

The results in this case will be presented differently from the first case. A table with several thresholds and several indicators will not be created, because the reader has already had an opportunity to become familiar with the threshold-choice process, so there is no need to construct such a huge table and describe the whole process in detail. Instead, only the final results of the four models (after the threshold choice) will be presented. It is significant to mention that different thresholds are used for different models: for example, with a threshold of 0.5 or higher the Gradient boosting model demonstrates zero sensitivity, which is why four thresholds were tested for this model (0.2, 0.225, 0.25, 0.275).

Finally, the following results are obtained:

Table 18

Final results of the models on the second dataset

Threshold  Model                Accuracy  Sensitivity  Specificity  Precision  F-score
0.5        Classification tree  0.8173    0.3144       0.9608       0.6929     0.4323
0.225      Gradient boosting    0.7799    0.5201       0.8538       0.5029     0.5114
0.6        Random forest        0.8154    0.4713       0.9133       0.6073     0.5307
0.75       New algorithm        0.5545    0.3984       0.5989       0.2203     0.2837

The Classification tree algorithm works much better on this dataset: as was mentioned for the first case, implementing such a model there is meaningless, but here it demonstrates results comparable to the others. For example, its accuracy and specificity are the highest for these thresholds. In terms of sensitivity the best model is Gradient boosting, while the F-score is highest for Random forest. Actually, Random forest and Gradient boosting demonstrate approximately the same results with these thresholds, so it is difficult to say which model is better in this case; depending on the situation and the goals, either of these models can be chosen. The situation is quite different for the other two models. It has been mentioned that the Classification tree demonstrates good results compared to the other models, but with the goal of high sensitivity the model actually fails: it can reach a sensitivity of 1, but the accuracy will then be too low. So, if a bank's management (with this dataset) is risk-tolerant and specificity is more important for it than sensitivity, this model can be used; otherwise its implementation is meaningless.

The new algorithm demonstrates much worse results here in terms of accuracy, sensitivity and specificity compared to the other methods, so it is clear that on this dataset the New algorithm is not implementable at all.

5. Conclusion

The goal of this research paper is to construct a new model in the R programming language and compare it with the Classification tree, Gradient boosting and Random forest algorithms according to chosen criteria. The models have been compared on two datasets, one of which has 1000 observations and the other 30000. In the first case 5-fold cross-validation is used, while in the second case randomly chosen test data is used, because cross-validation on such a huge dataset can be problematic. Firstly, some important characteristics of the proposed method will be mentioned; secondly, conclusions from the demonstrated results will be drawn; and finally, ways of developing the New model in further research will be presented.

The main disadvantage of the New model is that in some cases it needs a very long time to train. A good example is the second dataset analyzed in this research paper. Multiplying the number of observations by k (the number of nearest neighbors) produces a huge new dataset, on which training machine learning models, especially ensembles, is really time consuming.

The comparisons on the first dataset indicate that implementing the Classification tree algorithm (on this dataset) is meaningless. The New algorithm cannot compete with Random forest and Gradient boosting, because with certain thresholds they demonstrate better results in terms of some indicators and approximately the same in terms of others; still, it is much more preferable than the Classification tree. The last two algorithms work at approximately the same level.

The situation is approximately the same in the second case, but with one very significant difference. Here the New algorithm demonstrates the worst results and cannot compete even with the Classification tree algorithm. The Classification tree, in its turn, is worse than the last two algorithms, which demonstrate approximately the same results.

So, to conclude, Random forest and Gradient boosting work at the same level and are much better than the other models. Besides, the New algorithm demonstrates comparatively better results in the first case, and in the second case vice versa. Thus, on these two certain datasets the implementation of a model which needs much more time to train and demonstrates worse results is meaningless; but there are some quite significant points which have to be mentioned to fully conclude the results of this research paper, and which concern the further research and development of the New algorithm.

Firstly, it is clear that better or worse results on these two datasets do not mean that the model will be better or worse than others in general. Different machine learning models have different bias and variance. Models with low bias and high variance usually demonstrate good results on comparatively big datasets, while on small datasets they meet the problem of overfitting; the opposite holds for models with high bias and low variance. In other words, comparisons on other datasets have to be done to understand whether the New algorithm can be implemented in the credit scoring field or not.

Secondly, some empirical recommendations about the choice of the model parameter (the number of nearest neighbors) can be developed, depending on the number of observations, the independent variables and the dependent variable, thereby increasing the accuracy of the model.

Thirdly, some other model can be used as the base for the New algorithm. As has already been mentioned, Random forest is used as the base, which can actually be changed. For example, multiplying the number of observations by k (the number of nearest neighbors) can cause overfitting. Thus a model with less variance and more bias, such as Logistic regression, a Classification tree or Linear discriminant analysis, can serve as the base of the model.

Fourthly, not only the nearest neighbors can be used in the new model construction, but also the farthest. For instance, the same model can be constructed using the five nearest and the five farthest observations.

Besides, some other variables in model construction process can be used. For instance, a new dataset can be created with:

a) independent variables of an observation;

b) independent variables of its nearest neighbors;

c) the difference between them.

This can make the model even more time consuming, but at the same time it may improve it in terms of accuracy, sensitivity and other important indicators.
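Such an augmented design could be sketched in base R as follows (the function, the column-name prefixes and the toy data are hypothetical illustrations, not the thesis implementation):

```r
# Each augmented row combines an observation's own features, its nearest
# neighbour's features, and the difference between the two.
augment_with_neighbour <- function(x, nn_index) {
  nb <- x[nn_index, , drop = FALSE]   # features of each row's nearest neighbour
  out <- cbind(x, nb, x - nb)
  colnames(out) <- c(paste0("own_", colnames(x)),
                     paste0("nb_", colnames(x)),
                     paste0("diff_", colnames(x)))
  out
}

x <- matrix(c(1, 2, 3, 4), ncol = 2,
            dimnames = list(NULL, c("age", "amount")))
augment_with_neighbour(x, nn_index = c(2, 1))
# row 1: own (1, 3), neighbour (2, 4), difference (-1, -1)
```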

The last important development which can be done is the construction of a mathematical model of the New algorithm, which can better explain its advantages, its disadvantages and the fields where it is expedient to implement it.

It is also important to mention that the proposed method can be used not only in the credit scoring field, but also in other fields where a classification problem has to be solved. Besides, the New algorithm can be slightly changed and become a tool for solving regression problems.

References

1) Marques, A. I., Garcia, V., & Sanchez, J. S. (2013). A literature review on the application of evolutionary computing to credit scoring. Journal of the Operational Research Society, 64, 1384-1399.

2) Vojtek, M., & Kocenda, E. (2006). Credit Scoring Methods. Czech Journal of Economics and Finance, 56, 152-167.

3) Brown, L., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Experts Systems with Applications, 39, 3446-3453.

4) Oskarsdottir, M., Bravo, C., Sarraute, C., Vanthienen, J., & Baesens, B. (2019). The value of big data for credit scoring: Enhancing financial inclusion using mobile phone data and social network analytics. Applied Soft Computing Journal, 74, 26-39.

5) Altman, E. (1968). Financial Ratios, Discriminant Analysis and Prediction of Corporate Bankruptcy. The Journal of Finance, 23, 589-609.

6) Henley, W. E., & Hand, D. J. (1997). Construction of a k-nearest-neighbour credit-scoring system. IMA Journal of Management Mathematics, 8(4), 305-321.

7) (2019, April, 20). Commercial and Industrial Loans, All Commercial Banks [ACILACB]. Retrieved from https://fred.stlouisfed.org/series/ACILACB

8) Allamy, H. K., & Rafiqul, Z. K. (2014). Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study).

9) Basel Committee on Banking Supervision. (1999). Principles for the Management of Credit Risk. Basel.

10) Weissova, I., Kollar, B., & Siekelova, A. (2015). Rating as a Useful Tool for Credit Risk Measurement. Procedia Economics and Finance, 26, 278-285.

11) Hanic, A., Dzelihodzic, E. Z., & Dzelihodzic, A. (2013). Scoring Models of Bank Credit Policy Management. Economic Analysis, 46(1-2), 12-27.

12) Marquez, J. S. (2008). An introduction to Credit Scoring For Small and Medium Size Enterprises.

13) Vayssieres, M. P., Plant, R. E., & Allen-Diaz, B. H. (2000). Classification trees: An alternative non-parametric approach for predicting species distributions. Journal of Vegetation Science, 11(5), 679-694.

14) Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Multiple Classifier Systems. MCS 2000. Lecture Notes in Computer Science, 1857, 1-15.

15) Marques, A. I., Garcia, V., & Sanchez, J. S. (2012). Exploring the behaviour of base classifiers in credit scoring ensembles. Expert Systems with Applications, 39(11), 10244-10250.

16) Xiao, H., Xiao, Z., & Wang, Y. (2016). Ensemble classification based on supervised clustering for credit scoring. Applied Soft Computing, 43, 73-86.

17) Tang, L., Cai, F., & Ouyang, Y. (2018). Applying a nonparametric random forest algorithm to assess the credit risk of the energy industry in China. Technological Forecasting and Social Change.

18) Tanaka, K., Kinkyo, T., & Hamori, S. (2016). Random forests-based early warning system for bank failures. Economics Letters, 148, 118-121.

19) Yufei, X., Chuanzhe, L., YuYing, L., & Nana, L. (2017). A boosted decision tree approach using Bayesian hyper-parameter optimization for credit-scoring. Expert Systems with Applications, 78, 225-241.

20) Natekin, A., & Knoll, A. (2013). Gradient boosting machines, a tutorial. Frontiers in Neurorobotics.

21) Hue, S., Hurlin, C., Topkavi, S., & Dumitrescu, E. (2017). Machine Learning for Credit Scoring: Improving Logistic Regression with Non Linear Decision Tree Effects.

22) Subramanian, J., & Simon, R. (2013). Overfitting in prediction models - Is it a problem only in high dimensions? Contemporary Clinical Trials, 36, 636-641.

23) Rocha-Muniz, C. N., Befi-Lopes, D. M., & Schochat, E. (2014). Sensitivity, specificity, and efficiency of speech-evoked ABR. Hearing Research, 317, 15-22.

24) Garanin, D. A., Lukashevich, N. S., & Salkutsan, S. V. (2014). The Evaluation of Credit Scoring Models Parameters Using ROC Curve Analysis. World Applied Sciences Journal, 30(8), 938-942.

25) Bergmeir, C., Costantini, M., & Benitez, J. M. (2014). On the usefulness of cross-validation for directional forecast evaluation. Computational Statistics & Data Analysis, 76, 132-143.

26) Fawcett, T. (2006). Introduction to ROC analysis. Pattern Recognition Letters, 27, 861-874.

27) Lobo, J. M., Jimenez-Valverde, A., & Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17, 145-151.

28) Wehr, D., & Radkowski, R. (2018). Parallel kd-Tree Construction on the GPU with an Adaptive Split and Sort Strategy. International Journal of Parallel Programming.

29) Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

30) Hofmann, H. (2000). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Hamburg: University of Hamburg, Institute of Statistics and Econometrics.

Appendix 1

The development of the new model

# Downloading dataset and changing the class of some variables

library(readr)

library(dplyr)

##

## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':

##

## filter, lag

## The following objects are masked from 'package:base':

##

## intersect, setdiff, setequal, union

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(gbm)

## Loaded gbm 2.1.5

credit<-read_csv("C:\\Users\\User\\Downloads\\german_credit_data-2.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:

## cols(

## X1 = col_double(),

## Age = col_double(),

## Sex = col_character(),

## Job = col_double(),

## Housing = col_character(),

## `Saving accounts` = col_character(),

## `Checking account` = col_character(),

## `Credit amount` = col_double(),

## Duration = col_double(),

## Purpose = col_character(),

## Risk = col_character()

## )

str(credit)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 11 variables:

## $ X1 : num 0 1 2 3 4 5 6 7 8 9 ...

## $ Age : num 67 22 49 45 53 35 53 35 61 28 ...

## $ Sex : chr "male" "female" "male" "male" ...

## $ Job : num 2 2 1 2 2 1 2 3 1 3 ...

## $ Housing : chr "own" "own" "own" "free" ...

## $ Saving accounts : chr NA "little" "little" "little" ...

## $ Checking account: chr "little" "moderate" NA "little" ...

## $ Credit amount : num 1169 5951 2096 7882 4870 ...

## $ Duration : num 6 48 12 42 24 36 24 36 12 30 ...

## $ Purpose : chr "radio/TV" "radio/TV" "education" "furniture/equipment" ...

## $ Risk : chr "good" "bad" "good" "good" ...

## - attr(*, "spec")=

## .. cols(

## .. X1 = col_double(),

## .. Age = col_double(),

## .. Sex = col_character(),

## .. Job = col_double(),

## .. Housing = col_character(),

## .. `Saving accounts` = col_character(),

## .. `Checking account` = col_character(),

## .. `Credit amount` = col_double(),

## .. Duration = col_double(),

## .. Purpose = col_character(),

## .. Risk = col_character()

## .. )

credit<-as.data.frame(credit)

credit$`Checking account`<-NULL

credit$`Saving accounts`<-NULL

Credit_amount<-credit$`Credit amount`

credit$`Credit amount`<-NULL

credit$Credit_amount<-Credit_amount

# Note: assigning as.factor(...) to class() is an error; the columns
# themselves must be converted to factors
credit$Risk<-as.factor(credit$Risk)

credit$Sex<-as.factor(credit$Sex)

credit$Housing<-as.factor(credit$Housing)

credit$Purpose<-as.factor(credit$Purpose)

any(is.na(credit))

## [1] FALSE

# Dividing into train and test data

part<-201:1000

credit_train<-credit[part, ]

credit_test<-credit[-part, ]

# Constructing Classification tree model and measuring some indicators

set.seed(1998)

control<-trainControl(method="cv",number=2,p=0.8)

tree_model<-train(Risk~.-X1,data=credit_train,

method="rpart",

trControl=control)

tree_test_pr<-predict(tree_model,credit_test,type="prob")

tree_pr<-as.factor(ifelse(tree_test_pr$good>0.5,"good","bad"))

caret::confusionMatrix(tree_pr,as.factor(credit_test$Risk),mode="prec_recall")

## Confusion Matrix and Statistics

##

## Reference

## Prediction bad good

## bad 0 0

## good 57 143

##

## Accuracy : 0.715

## 95% CI : (0.6471, 0.7764)

## No Information Rate : 0.715

## P-Value [Acc > NIR] : 0.5356

##

## Kappa : 0

##

## Mcnemar's Test P-Value : 1.195e-13

##

## Precision : NA

## Recall : 0.000

## F1 : NA

## Prevalence : 0.285

## Detection Rate : 0.000

## Detection Prevalence : 0.000

## Balanced Accuracy : 0.500

##

## 'Positive' Class : bad

##

# Constructing Random forest model and measuring some indicators

forest_model<-train(Risk~.-X1, data=credit_train,

method="rf",

trControl=control)

forest_test_pr<-predict(forest_model,credit_test,type="prob")

forest_pr<-ifelse(forest_test_pr$good>0.5,"good","bad")

caret::confusionMatrix(as.factor(forest_pr),as.factor(credit_test$Risk),mode="prec_recall")

## Confusion Matrix and Statistics

##

## Reference

## Prediction bad good

## bad 22 23

## good 35 120

##

## Accuracy : 0.71

## 95% CI : (0.6418, 0.7718)

## No Information Rate : 0.715

## P-Value [Acc > NIR] : 0.5970

##

## Kappa : 0.2403

##

## Mcnemar's Test P-Value : 0.1486

##

## Precision : 0.4889

## Recall : 0.3860

## F1 : 0.4314

## Prevalence : 0.2850

## Detection Rate : 0.1100

## Detection Prevalence : 0.2250

## Balanced Accuracy : 0.6126

##

## 'Positive' Class : bad

##

# Constructing Gradient boosting model and measuring some indicators

boosting_model<-train(Risk~.-X1,data=credit_train,

method="gbm",

trControl=control)

boosting_test_pr<-predict(boosting_model,credit_test,type="prob")

boosting_pr<-as.factor(ifelse(boosting_test_pr$good>0.5,"good","bad"))

caret::confusionMatrix(boosting_pr,as.factor(credit_test$Risk),mode="prec_recall")

## Confusion Matrix and Statistics

##

## Reference

## Prediction bad good

## bad 6 5

## good 51 138

##

## Accuracy : 0.72

## 95% CI : (0.6523, 0.781)

## No Information Rate : 0.715

## P-Value [Acc > NIR] : 0.4733

##

## Kappa : 0.0928

##

## Mcnemar's Test P-Value : 1.817e-09

##

## Precision : 0.5455

## Recall : 0.1053

## F1 : 0.1765

## Prevalence : 0.2850

## Detection Rate : 0.0300

## Detection Prevalence : 0.0550

## Balanced Accuracy : 0.5351

##

## 'Positive' Class : bad

##

# Constructing new model

# Converting factor variables into dummy variables to measure the distance between observations

credit_new<-rbind(credit_train,credit_test)

credit_new$Sex<-ifelse(credit_new$Sex=="male",1,0)

table(credit_new$Job)

##

## 0 1 2 3

## 22 200 630 148

credit_new$job_zero<-ifelse(credit_new$Job=="0",1,0)

credit_new$job_one<-ifelse(credit_new$Job=="1",1,0)

credit_new$job_two<-ifelse(credit_new$Job=="2",1,0)

credit_new$job_three<-ifelse(credit_new$Job=="3",1,0)

credit_new$Job<-NULL

table(credit_new$Housing)

##

## free own rent

## 108 713 179

credit_new$housing_free<-ifelse(credit_new$Housing=="free",1,0)

credit_new$housing_own<-ifelse(credit_new$Housing=="own",1,0)

credit_new$housing_rent<-ifelse(credit_new$Housing=="rent",1,0)

credit_new$Housing<-NULL

table(credit_new$Purpose)

##

## business car domestic appliances

## 97 337 12

## education furniture/equipment radio/TV

## 59 181 280

## repairs vacation/others

## 22 12

credit_new$purpose_business<-ifelse(credit_new$Purpose=="business",1,0)

credit_new$purpose_car<-ifelse(credit_new$Purpose=="car",1,0)

credit_new$purpose_domestic<-ifelse(credit_new$Purpose=="domestic appliances",1,0)

credit_new$purpose_education<-ifelse(credit_new$Purpose=="education",1,0)

credit_new$purpose_equipment<-ifelse(credit_new$Purpose=="furniture/equipment",1,0)

credit_new$purpose_radio<-ifelse(credit_new$Purpose=="radio/TV",1,0)

credit_new$purpose_repairs<-ifelse(credit_new$Purpose=="repairs",1,0)

credit_new$purpose_others<-ifelse(credit_new$Purpose=="vacation/others",1,0)

credit_new$Purpose<-NULL
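The manual ifelse() coding above is transparent but verbose. As a sketch of a more compact alternative (illustrated on a small hypothetical data frame, not the credit data itself), base R's model.matrix() expands a factor into the same set of dummy columns in one call; caret::dummyVars() does the same for whole data frames:

```r
# One-step dummy coding with model.matrix() on hypothetical data;
# "- 1" removes the intercept so that every factor level gets its own column
df <- data.frame(Housing = factor(c("own", "rent", "free", "own")),
                 Amount  = c(1000, 2000, 1500, 3000))
dummies <- model.matrix(~ Housing - 1, data = df)
colnames(dummies)   # "Housingfree" "Housingown" "Housingrent"
df <- cbind(df["Amount"], as.data.frame(dummies))
```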

# Scaling the independent variables of the dataset

proc<-preProcess(credit_new[ ,-1],method=c("scale","center"))

credit_new<-predict(proc,credit_new)

# Finding distances between the observations

distance<-dist(credit_new[ ,-c(1,5)])

distance<-as.matrix(distance)

diag(distance)<-max(distance) # an observation must never be chosen as its own neighbor

distance[801:1000, ]<-max(distance) # test observations must never be chosen as neighbors

# Creating a matrix holding the row numbers of the 30 nearest neighbors of each observation

neighs<-matrix(NA,nrow=1000,ncol=30)

neighs<-as.data.frame(neighs)

for(i in 1:1000) {

for(j in 1:30) {

neighs[i,j]<-sort(distance[ ,i],index.return=TRUE)[[2]][j]

}

}
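The nested loop above calls sort() 30 times per column, once for every neighbor position, although a single ordering of a column already yields all 30 indices. The same matrix can therefore be built with one order() per column. The sketch below demonstrates the idea on a small synthetic distance matrix (hypothetical data); on the real data the same line with k = 30 replaces the nested loops:

```r
set.seed(1)
n <- 10; k <- 3
pts  <- matrix(rnorm(n * 2), ncol = 2)
dmat <- as.matrix(dist(pts))
diag(dmat) <- max(dmat)          # an observation is never its own neighbor

# order() sorts each column once and returns the indices of the k nearest rows
neighs <- as.data.frame(t(apply(dmat, 2, function(d) order(d)[1:k])))
dim(neighs)                      # n rows, k columns
```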

# Constructing a new dataset with the differences between the independent variables of each observation and those of its nearest neighbors

credit_new$X1<-NULL

vec<-seq(1,630,21)

data_pr<-matrix(NA,nrow=1000,ncol=630)

data_pr<-as.data.frame(data_pr)

for(i in vec) {

# k=(i+20)/21 is the neighbor position; columns i..(i+18) hold the differences between each observation's features and those of its k-th nearest neighbor

data_pr[1:1000,i:(i+18)]<-credit_new[1:1000,-4]-credit_new[neighs[1:1000,(i+20)/21],-4]

# column i+19: the neighbor's Risk; column i+20: the observation's own Risk

data_pr[1:1000,i+19]<-credit_new[neighs[1:1000,(i+20)/21],4]

data_pr[1:1000,i+20]<-credit_new[1:1000,4]

}
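To make the layout of data_pr explicit: for the k-th nearest neighbor (k = 1, ..., 30), the first 19 columns of the k-th block hold the feature differences, column 20 of the block the neighbor's Risk, and column 21 the observation's own Risk. A toy illustration of the same pairing idea on hypothetical data (two features, one neighbor per observation):

```r
# Toy version of one 21-column block: feature differences plus both labels
toy   <- data.frame(age = c(25, 40, 30), amount = c(1000, 5000, 1200),
                    Risk = c("good", "bad", "good"))
neigh <- c(3, 1, 1)   # hypothetical nearest-neighbor row numbers

block <- data.frame(d_age      = toy$age    - toy$age[neigh],
                    d_amount   = toy$amount - toy$amount[neigh],
                    Risk_neigh = toy$Risk[neigh],
                    Risk       = toy$Risk)
block  # the model learns Risk from the differences and Risk_neigh
```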

colnames(data_pr)<-rep(as.character(1:21),30)

# Creating two datasets (train and test) to construct the new model

pr_train<-data_pr[1:800, ]

pr_test<-data_pr[801:1000, ]

# Stacking the 30 blocks of 21 columns on top of each other
data_pr_train<-do.call(rbind,lapply(vec,function(i) pr_train[ ,i:(i+20)]))

data_pr_test<-do.call(rbind,lapply(vec,function(i) pr_test[ ,i:(i+20)]))

names<-c(colnames(credit_new[ ,-4]),"Risk_neigh", <...

