Factors of successful protection from pressure on business
Concept and economic essence of property rights. Justification and development of a model for protecting business against possible damage to business activities caused by the influence of various external and internal market factors and economic conditions.
Category | Economic and mathematical modelling |
Type | graduate thesis |
Language | English |
Date added | 11.08.2020 |
File size | 5.0 M |
Variable | Used (1 = kept, 0 = dropped) | Comment
Pure profit 2011 | |
Pure profit 2012 | |
Pure profit 2013 | |
Pure profit 2014 | |
Pure profit 2015 | |
Pure profit 2016 | |
Pure profit 2017 | |
Pure profit 2018 | |
Enterprise status | 0 | Recoded to target variable
Administrative position | 1 |
Administrative connections | 1 |
In political party | 1 |
Political party | 0 | Too few values
In association or SRO | 1 |
Association SRO name | 0 | Too many distinct values
Case publications | 1 |
Application topic | 0 | Next four variables are used instead
Criminal prosecution | 1 |
Capture | 1 |
Corruption | 1 |
Barriers | 1 |
Application date | 0 | Not related to the research topic
Application year | 0 | Not related to the research topic
Have court case | 1 |
Is guilty | 1 |
Reviewed by BAC | 1 |
Max BAC stage | 1 |
Max BAC stage, grouped | 1 |
Supported by BAC public council | 1 |
Reaction not passed by the applicant | 1 |
Reaction not passed by BAC | 1 |
Reaction consultation | 1 |
Reaction target letters control | 1 |
To ombudsman | 1 |
Attachment 5
R-Studio Output
Import data and define the original set of variables.
# Libraries used throughout this appendix
library(readxl)   # read_excel
library(ggpubr)   # ggqqplot, ggboxplot
library(caret)    # dummyVars
library(writexl)  # write_xlsx (used in commented-out exports below)
# library(Amelia) # missmap (used only in the commented-out call below)
cop_data <- data.frame(read_excel("cop_data.xlsx"))
dataset <- cop_data[c(
"region_code_spark",
"federal_districts", "largest_fed_districts",
"macro_okved_code", "macro_okved_code_group",
"spark_web_site", "spark_stock_ticket",
"company_age_till_2020", "company_age_including_liquidation",
"age_till_application_date",
"n_employees_upperbound", "n_employees_added",
"authorized_capital",
"administrative_position", "administrative_connections",
"in_political_party",
"in_association_or_sro",
"case_publications",
"criminal_prosecution", "capture", "corruption", "barriers",
"have_court_case", "is_guilty",
"reviewed_by_bac", "max_bac_stage", "supported_by_bac_public_council",
"reaction_not_passed_by_applicant", "reaction_not_passed_by_bac",
"reaction_consultation",
"reaction_target_letters_control", "to_ombudsman",
"is_working", "target_light_clear",
"target_light_extended", "target_strong_extended"
)]
Part 1. Data preparation and feature engineering.
Check for missing data. The missingness-map plot rendered poorly, so the call is commented out and a screenshot is attached instead.
# missmap(dataset, col=c("blue", "white"), legend=T, margins = c(7,7))
Share of missing data per variable (fraction of rows):
# Targets
sum(is.na(dataset$target_light_clear))/nrow(dataset)
## [1] 0.253112
sum(is.na(dataset$target_light_extended))/nrow(dataset)
## [1] 0.186722
sum(is.na(dataset$target_strong_extended))/nrow(dataset)
## [1] 0.186722
# other variables
sum(is.na(dataset$n_employees_upperbound))/nrow(dataset)
## [1] 0.1037344
sum(is.na(dataset$authorized_capital))/nrow(dataset)
## [1] 0.03319502
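The same check can be run over every column at once; a minimal sketch (output omitted here):
# Share of missing values for every variable in one pass (a sketch)
miss_share <- sort(colMeans(is.na(dataset)), decreasing = TRUE)
miss_share[miss_share > 0]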
Target variable distributions:
table(dataset$is_working)
##
## 0 1
## 255 227
table(dataset$target_light_clear)
##
## 0 1
## 184 176
table(dataset$target_light_extended)
##
## 0 1
## 184 208
table(dataset$target_strong_extended)
##
## 0 1
## 301 91
From here the data preparation step begins. There are two types of variables: categorical and continuous. Let me start with the continuous variables and prepare them for the analysis.
There are three variables on a continuous scale: age, size (number of employees) and authorized capital.
Age. For age I have three variables: “company age until 2020” (not including liquidation dates), “company age until liquidation” and “company age until application”. Let's take a look at each of them.
# company_age_till_2020
summary(dataset$company_age_till_2020)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.50 12.12 16.50 17.06 22.00 31.00
boxplot(dataset$company_age_till_2020, main ='Company age until 2020')
hist(dataset$company_age_till_2020, main ='Company age until 2020', xlab ='Age until 2020')
# company_age_including_liquidation
summary(dataset$company_age_including_liquidation)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.50 15.00 15.35 20.88 31.00
boxplot(dataset$company_age_including_liquidation, main ='Company age until closed')
hist(dataset$company_age_including_liquidation, main ='Company age until closure', xlab ='Age until closure')
# age_till_application_date
summary(dataset$age_till_application_date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.00 9.00 10.26 15.00 23.00
boxplot(dataset$age_till_application_date, main ='Company age until application')
hist(dataset$age_till_application_date, main ='Company age until application', xlab ='Age until application')
The age data look fine: no outliers, though the variables are probably not normally distributed. Let's check normality formally.
# company_age_till_2020
hist(dataset$company_age_till_2020, main ='Company age until 2020')
ggqqplot(dataset$company_age_till_2020)
shapiro.test(dataset$company_age_till_2020)
##
## Shapiro-Wilk normality test
##
## data: dataset$company_age_till_2020
## W = 0.96384, p-value = 1.644e-09
# company_age_including_liquidation
hist(dataset$company_age_including_liquidation, main ='Company age until liquidation')
ggqqplot(dataset$company_age_including_liquidation)
shapiro.test(dataset$company_age_including_liquidation)
##
## Shapiro-Wilk normality test
##
## data: dataset$company_age_including_liquidation
## W = 0.96839, p-value = 1.107e-08
# age_till_application_date
hist(dataset$age_till_application_date, main ='Company age until application')
ggqqplot(dataset$age_till_application_date)
shapiro.test(dataset$age_till_application_date)
##
## Shapiro-Wilk normality test
##
## data: dataset$age_till_application_date
## W = 0.9583, p-value = 1.94e-10
Since the null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, for all three variables the data provide enough evidence that the distribution is not normal.
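For reference, the three tests above can be condensed into a single call; a sketch returning the same p-values:
# Shapiro-Wilk p-values for the three age variables in one pass (a sketch)
age_vars <- c("company_age_till_2020", "company_age_including_liquidation",
              "age_till_application_date")
sapply(age_vars, function(v) shapiro.test(dataset[[v]])$p.value)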
Since the distributions are not normal, we cannot compare means; it is better to use a non-parametric test, the Kruskal-Wallis rank sum test, which works with medians. Among the three variables, `age_till_application_date' is the only one that can be used in the analysis (the other two would let us “look into the future”), so it is reasonable to run the Kruskal-Wallis test on this variable against all target variables.
# is_working
kruskal.test(x = dataset$age_till_application_date, g = dataset$is_working)
##
## Kruskal-Wallis rank sum test
##
## data: dataset$age_till_application_date and dataset$is_working
## Kruskal-Wallis chi-squared = 6.1061, df = 1, p-value = 0.01347
test <-dataset[c("is_working", "age_till_application_date")]
ggboxplot(test, x ="is_working", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="is_working",
ylim =c(0,60))
# Here I had doubts because of the graphical representation: the visual
# difference is small.
# So I double-checked with the Mann-Whitney-Wilcoxon test (usually
# used for two-group comparisons),
# but the result repeated: at the .05 significance level, we can conclude that the enterprise ages, split by 'currently working' status, come from nonidentical populations.
wilcox.test(dataset$age_till_application_date ~dataset$is_working)
##
## Wilcoxon rank sum test with continuity correction
##
## data: dataset$age_till_application_date by dataset$is_working
## W = 25176, p-value = 0.01348
## alternative hypothesis: true location shift is not equal to 0
# target_light_clear
test_df <-dataset[!is.na(dataset$target_light_clear),]
kruskal.test(x = test_df$age_till_application_date, g = test_df$target_light_clear)
##
## Kruskal-Wallis rank sum test
##
## data: test_df$age_till_application_date and test_df$target_light_clear
## Kruskal-Wallis chi-squared = 1.4822, df = 1, p-value = 0.2234
test <-test_df[c("target_light_clear", "age_till_application_date")]
ggboxplot(test, x ="target_light_clear", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="target_light_clear",
ylim =c(0,60))
# target_light_extended
test_df <-dataset[!is.na(dataset$target_light_extended),]
kruskal.test(x = test_df$age_till_application_date, g= test_df$target_light_extended)
##
## Kruskal-Wallis rank sum test
##
## data: test_df$age_till_application_date and test_df$target_light_extended
## Kruskal-Wallis chi-squared = 0.93575, df = 1, p-value = 0.3334
test <-test_df[c("target_light_extended", "age_till_application_date")]
ggboxplot(test, x ="target_light_extended", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="target_light_extended",
ylim =c(0,60))
# target_strong_extended
test_df <-dataset[!is.na(dataset$target_strong_extended),]
kruskal.test(x = test_df$age_till_application_date, g = test_df$target_strong_extended)
##
## Kruskal-Wallis rank sum test
##
## data: test_df$age_till_application_date and test_df$target_strong_extended
## Kruskal-Wallis chi-squared = 0.0092569, df = 1, p-value = 0.9234
test <-test_df[c("target_strong_extended", "age_till_application_date")]
ggboxplot(test, x ="target_strong_extended", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="target_strong_extended",
ylim =c(0,60))
So we can conclude that the median company age at application differs between groups only for the target variable describing whether the enterprise is still working in 2020.
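For reference, the four comparisons above can be reproduced in one loop; a sketch (the per-target boxplots are omitted):
# Kruskal-Wallis of age at application against each target (a sketch)
targets <- c("is_working", "target_light_clear",
             "target_light_extended", "target_strong_extended")
for (t in targets) {
  keep <- !is.na(dataset[[t]])
  res <- kruskal.test(x = dataset$age_till_application_date[keep],
                      g = dataset[[t]][keep])
  cat(t, ": p-value =", signif(res$p.value, 4), "\n")
}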
Size. Now observe the size variable (number of employees), which is highly skewed.
# number of missing data
sum(is.na(dataset$n_employees_upperbound))
## [1] 50
plot(density(dataset$n_employees_upperbound[!is.na(dataset$n_employees_upperbound)]), main ='№ employees excluding missing', xlab ='№ of employees')
hist(dataset$n_employees_upperbound[!is.na(dataset$n_employees_upperbound)], main ='№ employees excluding missing', xlab ='№ of employees')
# after imputing missing values with 5
sum(is.na(dataset$n_employees_added))
## [1] 0
plot(density(dataset$n_employees_added), main ='№ employees imputing missing by 5', xlab ='№ of employees')
hist(dataset$n_employees_added, main ='№ employees imputing missing by 5', xlab ='№ of employees')
Both variables are clearly skewed regardless of whether the missing data are imputed. So this is definitely a case for categorization, and there are practical premises for the cut-offs.
dataset$category_by_size_missing <-
ifelse(dataset$n_employees_upperbound <=15, 'Micro',
ifelse(dataset$n_employees_upperbound >=16&dataset$n_employees_upperbound <=100, 'Small',
ifelse(dataset$n_employees_upperbound >=101&dataset$n_employees_upperbound <=250, 'Medium','Big'
)))
dataset$category_by_size_added <-
ifelse(dataset$n_employees_added <=15, 'Micro',
ifelse(dataset$n_employees_added >=16&dataset$n_employees_added <=100, 'Small',
ifelse(dataset$n_employees_added >=101&dataset$n_employees_added <=250, 'Medium','Big'
)))
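The same recoding can be written with cut(), avoiding the nested ifelse; a sketch that yields identical categories for integer employee counts:
# Equivalent size classes via cut() (a sketch)
size_class <- function(x) {
  as.character(cut(x, breaks = c(-Inf, 15, 100, 250, Inf),
                   labels = c("Micro", "Small", "Medium", "Big")))
}
table(size_class(dataset$n_employees_added))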
But can we impute the missing data with 5? Does this distort the overall proportions in our data?
# It is interesting how the overall proportions change in the Micro class:
# in each case, imputing 5 over missing data inflates the negative class
# share in the 'Micro' size category.
# Thus, replacing missing data with 5 is not correct in this case.
# true
xtabs(~dataset$is_working +dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$is_working Big Medium Micro Small
## 0 12 12 148 35
## 1 25 15 138 47
# replaced
xtabs(~dataset$is_working +dataset$category_by_size_added)
## dataset$category_by_size_added
## dataset$is_working Big Medium Micro Small
## 0 12 12 196 35
## 1 25 15 140 47
# true
xtabs(~dataset$target_light_clear +dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$target_light_clear Big Medium Micro Small
## 0 14 8 112 23
## 1 17 16 97 37
# replaced
xtabs(~dataset$target_light_clear +dataset$category_by_size_added)
## dataset$category_by_size_added
## dataset$target_light_clear Big Medium Micro Small
## 0 14 8 139 23
## 1 17 16 106 37
# true
xtabs(~dataset$target_light_extended +dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$target_light_extended Big Medium Micro Small
## 0 14 8 112 23
## 1 20 19 110 50
# replaced
xtabs(~dataset$target_light_extended +dataset$category_by_size_added)
## dataset$category_by_size_added
## dataset$target_light_extended Big Medium Micro Small
## 0 14 8 139 23
## 1 20 19 119 50
# true
xtabs(~dataset$target_strong_extended +dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$target_strong_extended Big Medium Micro Small
## 0 22 17 173 57
## 1 12 10 49 16
# replaced
xtabs(~dataset$target_strong_extended +dataset$category_by_size_added)
## dataset$category_by_size_added
## dataset$target_strong_extended Big Medium Micro Small
## 0 22 17 205 57
## 1 12 10 53 16
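The distortion is easier to see as row proportions than as raw counts; a sketch for the first target:
# Row shares before vs. after imputing 5 over missing sizes (a sketch)
round(prop.table(xtabs(~dataset$is_working + dataset$category_by_size_missing), margin = 1), 2)
round(prop.table(xtabs(~dataset$is_working + dataset$category_by_size_added), margin = 1), 2)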
Looking at the cross-tabs, the `Micro' category holds the biggest share, and the other classes are definitely a minority. So I decided to create some extra size features to play around with and check which one performs better with each target.
xtabs(~dataset$is_working+dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$is_working Big Medium Micro Small
## 0 12 12 148 35
## 1 25 15 138 47
xtabs(~dataset$target_light_clear+dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$target_light_clear Big Medium Micro Small
## 0 14 8 112 23
## 1 17 16 97 37
xtabs(~dataset$target_light_extended+dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$target_light_extended Big Medium Micro Small
## 0 14 8 112 23
## 1 20 19 110 50
xtabs(~dataset$target_strong_extended+dataset$category_by_size_missing)
## dataset$category_by_size_missing
## dataset$target_strong_extended Big Medium Micro Small
## 0 22 17 173 57
## 1 12 10 49 16
# Micro vs. all other categories
dataset$category_by_size_melse<-ifelse(dataset$category_by_size_missing == "Micro", "Micro", "Else")
table(dataset$category_by_size_melse)
##
## Else Micro
## 146 286
# Small + Micro vs. Medium + Big
dataset$category_by_size_2_cat<-
ifelse(dataset$category_by_size_missing == "Micro"|dataset$category_by_size_missing == "Small", "Small", "Big")
table(dataset$category_by_size_2_cat)
##
## Big Small
## 64 368
Authorized capital. The situation with authorized capital is similar to the size variable: the distribution is highly skewed.
test_df <-dataset[!is.na(dataset$authorized_capital),]
hist(test_df$authorized_capital, main ='Companies authorized capital distribution',
xlab ='Size of the authorized capital')
So I decided to categorize this variable too, here into three approximately equal groups.
dataset$auth_capital_group <-
ifelse(dataset$authorized_capital <=10000, 'under_10k',
ifelse(dataset$authorized_capital>=10001&dataset$authorized_capital <=210000, 'under_210k', 'over_210k'))
table(dataset$auth_capital_group)
##
## over_210k under_10k under_210k
## 150 188 128
# it even has an association with is_working, but associations come at a later step...
chisq.test(xtabs(~dataset$is_working +dataset$auth_capital_group))
##
## Pearson's Chi-squared test
##
## data: xtabs(~dataset$is_working + dataset$auth_capital_group)
## X-squared = 9.4727, df = 2, p-value = 0.008771
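An alternative would be to cut at empirical terciles rather than fixed thresholds; a sketch (the fixed cut-offs above are what the analysis actually uses):
# Data-driven tercile grouping of authorized capital (a sketch)
brks <- quantile(dataset$authorized_capital, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
# unique() guards against tied quantiles in heavily skewed data
table(cut(dataset$authorized_capital, breaks = unique(brks), include.lowest = TRUE))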
The next step of data preparation is working with the categorical variables. Despite these variables being categorical, some of them need checking in order to simplify further analysis and reduce the chance of spurious associations (i.e. small groups with a high chance of `winning' or `losing').
Region. The problem with this variable is that it has too many categories, so analysing it later might be problematic. To deal with this, I have prepared several variants of this variable to test.
Too many categories (and small ones too!) will cause problems later.
table(dataset$region_code_spark)
##
## 5 6 7 11 17 18 22 27 28 29 30 31 32 33 35 37 40 41
## 1 1 1 2 1 5 5 4 1 1 7 5 4 5 5 3 4 2
## 43 44 46 47 48 49 51 53 55 56 57 58 62 65 67 68 69 71
## 2 2 5 3 4 1 1 3 3 8 3 4 2 6 5 1 3 3
## 72 76 79 80 82 89 91 102 103 113 116 121 123 124 125 126 134 136
## 7 3 3 1 1 4 6 19 1 2 13 11 19 5 7 9 5 9
## 138 142 152 154 159 161 163 164 173 174 178 196 750 799
## 6 5 15 8 3 14 7 3 2 10 21 8 36 103
First, a grouping by federal districts, with separate categories for Moscow and Moscow region. This variant looks much better.
table(dataset$federal_districts)
##
## Caucasus Central Far_East Moscow
## 12 61 26 103
## Moscow_region North_West Saint_Petersburg Siberian
## 36 21 21 33
## South Urals Volga
## 46 29 94
The second option is an even coarser grouping of neighbouring regions: Saint_Petersburg + North_West = North_West; South + Caucasus = South; Far_East + Siberian = Far_Siberia; again with separate categories for Moscow and Moscow region.
table(dataset$largest_fed_districts)
##
## Central Far_Siberia Moscow Moscow_region North_West
## 61 59 103 36 42
## South Urals Volga
## 58 29 94
OKVED activity. The same situation holds for OKVED activity: the initial categorization has a few groups that are small enough to cause concern.
table(dataset$macro_okved_code)
##
## administrative Building Culture_sport
## 7 83 4
## Education energy_gas_steam Financial_insurance
## 3 10 23
## Health Hotels_catering Information
## 5 4 14
## manufacturing mining Other_services
## 80 10 5
## real_estate rural Science
## 43 24 47
## Trading Transportation water_supp
## 98 18 4
So I prepared a second option, where all OKVED codes with counts below 15 are recoded to the `other_categories' group.
table(dataset$macro_okved_code_group)
##
## Building Financial_insurance manufacturing
## 83 23 80
## other_categories real_estate rural
## 66 43 24
## Science Trading Transportation
## 47 98 18
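The grouped variable came precomputed in the data; a generic lumping rule like the following sketch reproduces the same idea:
# Recode levels with fewer than 15 observations to 'other_categories' (a sketch)
lump_rare <- function(x, min_n = 15, other = "other_categories") {
  cnt <- table(x)
  ifelse(x %in% names(cnt)[cnt >= min_n], as.character(x), other)
}
table(lump_rare(dataset$macro_okved_code))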
Business against corruption stage.
The final variable I changed was `max_bac_stage', which depicts the maximal stage the observed application passed in the “Business against corruption” (BAC) procedure.
Although the 4th and 6th stages hold the most observations, there is no big difference between neighbouring stages; for instance, the 3rd and 4th stages are both about obtaining an expert resolution on the case. So I decided to recode these stages into more meaningful, larger groups and see which variable performs better.
table(dataset$max_bac_stage)
##
## 0 1 2 3 4 5 6
## 12 40 35 28 247 29 91
dataset$cop_stage <-ifelse(dataset$max_bac_stage <=2, 'Information_collection',
ifelse(dataset$max_bac_stage ==3|dataset$max_bac_stage ==4, 'Resolution',
'Council_discussion'))
table(dataset$cop_stage)
##
## Council_discussion Information_collection Resolution
## 120 87 275
Part 2. Relationship discovery.
# Helper functions: upper-triangular matrices of pairwise chi-square
# statistics and p-values for all column pairs
chisqmatrix_stat <- function(x) {
  names <- colnames(x); num <- length(names)
  m <- matrix(nrow = num, ncol = num, dimnames = list(names, names))
  for (i in 1:(num - 1)) {
    for (j in (i + 1):num) {
      m[i, j] <- chisq.test(x[, i], x[, j])$statistic
    }
  }
  return(m)
}
chisqmatrix_pval <- function(x) {
  names <- colnames(x); num <- length(names)
  m <- matrix(nrow = num, ncol = num, dimnames = list(names, names))
  for (i in 1:(num - 1)) {
    for (j in (i + 1):num) {
      m[i, j] <- chisq.test(x[, i], x[, j])$p.value
    }
  }
  return(m)
}
Since I have already checked the difference in medians for the continuous variable (age), in this section I generate a matrix of chi-square statistics for the categorical variables.
The first step is without dummies; in the second, categorical variables with multiple levels are checked by creating dummies.
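Since every variable here is tested against a single target, a leaner helper would also do; a sketch equivalent to taking one column of the matrix:
# p-values of every predictor against one target (a sketch)
chisq_vs_target <- function(df, target) {
  preds <- setdiff(colnames(df), target)
  sapply(preds, function(v) chisq.test(table(df[[v]], df[[target]]))$p.value)
}
# e.g. chisq_vs_target(is_working_cs, "is_working") once is_working_cs is built below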
Target: is_working
# check for distributions
is_working_vars <-c(
"federal_districts",
"largest_fed_districts",
#"macro_okved_code",
"macro_okved_code_group",
"spark_web_site",
"spark_stock_ticket",
"category_by_size_missing",
"category_by_size_melse",
"category_by_size_2_cat",
"administrative_position",
"administrative_connections",
"in_political_party",
"in_association_or_sro",
"case_publications",
"criminal_prosecution",
"capture", "corruption", "barriers",
"have_court_case", "is_guilty", "reviewed_by_bac",
"max_bac_stage", "supported_by_bac_public_council",
"reaction_not_passed_by_applicant", "reaction_consultation",
"reaction_target_letters_control", "to_ombudsman",
"reaction_not_passed_by_bac",
"auth_capital_group",
"cop_stage",
"is_working")
is_working_cs <-dataset[is_working_vars]
# this is a very long output; it just returns an xtabs of each variable
# against the target
#for (i in 1:length(is_working_vars)){
# print(xtabs(~is_working_cs$is_working + is_working_cs[,i]))
#}
# variables to worry about: spark_stock_ticket, reaction_not_passed_by_bac (somehow, still low counts)
xtabs(~dataset$reaction_consultation)
is_working_cs_mat_stat =chisqmatrix_stat(is_working_cs)
is_working_cs_mat_stat <-format( data.frame(is_working_cs_mat_stat)["is_working"], scientific = F)
is_working_cs_mat_pval =chisqmatrix_pval(is_working_cs)
is_working_cs_mat_pval <-format( data.frame(is_working_cs_mat_pval)["is_working"], scientific = F)
is_working_cs_df <-data.frame(c(is_working_cs_mat_stat, is_working_cs_mat_pval))
rownames(is_working_cs_df) <-rownames(is_working_cs_mat_stat)
colnames(is_working_cs_df) <-c("Statistic","P-value")
#write_xlsx(data.frame(is_working_cs_df), 'is_working_cs.xlsx')
is_working_cs_df[2]
## P-value
## federal_districts 0.2097364820098783
## largest_fed_districts 0.1491242995084314
## macro_okved_code_group 0.0000004029985171
## spark_web_site 0.0000000001211396
## spark_stock_ticket 0.0952582185988386
## category_by_size_missing 0.0992954794302832
## category_by_size_melse 0.0332232176140461
## category_by_size_2_cat 0.0945626696786416
## administrative_position 0.0545086826433797
## administrative_connections 0.1339695931231736
## in_political_party 0.0049908143773374
## in_association_or_sro 0.0000046940502687
## case_publications 0.3562620121957124
## criminal_prosecution 0.3424585386949022
## capture 0.0605566407878402
## corruption 0.6580199336591293
## barriers 0.0657458127196991
## have_court_case 0.7558063380989922
## is_guilty 0.8370032224854111
## reviewed_by_bac 0.7171329654303231
## max_bac_stage 0.4826337816347389
## supported_by_bac_public_council 0.2403666393970288
## reaction_not_passed_by_applicant 0.0029202062366615
## reaction_consultation 0.1563821496506542
## reaction_target_letters_control 0.8901658295767594
## to_ombudsman 0.0278554894020524
## reaction_not_passed_by_bac 0.6128206928637800
## auth_capital_group 0.0087706387308923
## cop_stage 0.6333066056508621
## is_working NA
# check: looks like the code worked ok
chisq.test(xtabs(~dataset$is_working +dataset$macro_okved_code_group))
##
## Pearson's Chi-squared test
##
## data: xtabs(~dataset$is_working + dataset$macro_okved_code_group)
## X-squared = 44.792, df = 8, p-value = 4.03e-07
chisq.test(xtabs(~dataset$is_working +dataset$in_political_party))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xtabs(~dataset$is_working + dataset$in_political_party)
## X-squared = 7.8828, df = 1, p-value = 0.004991
It is time to check for categorical data with several levels.
is_working_dummies <-is_working_cs[c("federal_districts", "largest_fed_districts", "macro_okved_code_group", "max_bac_stage", "cop_stage", "category_by_size_missing", "category_by_size_melse", "category_by_size_2_cat" )]
is_working_dummies$max_bac_stage <-as.factor(is_working_dummies$max_bac_stage)
dums <-dummyVars(" ~ .", data = is_working_dummies)
is_working_dums <-data.frame(predict(dums, newdata = is_working_dummies))
is_working_dums$is_working <-is_working_cs$is_working
is_working_dums_pval =chisqmatrix_pval(is_working_dums)
is_working_dums_pval <-format( data.frame(is_working_dums_pval)["is_working"], scientific = F)
is_working_dums_stat =chisqmatrix_stat(is_working_dums)
is_working_dums_stat <-format( data.frame(is_working_dums_stat)["is_working"], scientific = F)
is_working_dums_df <-data.frame(c(is_working_dums_stat, is_working_dums_pval))
rownames(is_working_dums_df) <-rownames(is_working_dums_stat)
colnames(is_working_dums_df) <-c("Statistic","P-value")
#write_xlsx(data.frame(is_working_dums_df), 'is_working_cs_dums.xlsx')
is_working_dums_df[2]
## P-value
## federal_districtsCaucasus 0.27897424106
## federal_districtsCentral 0.83224346507
## federal_districtsFar_East 0.61213940228
## federal_districtsMoscow 1.00000000000
## federal_districtsMoscow_region 0.59160589429
## federal_districtsNorth_West 0.04971145510
## federal_districtsSaint_Petersburg 0.04971145510
## federal_districtsSiberian 0.72908740098
## federal_districtsSouth 0.50155081889
## federal_districtsUrals 1.00000000000
## federal_districtsVolga 0.60749515839
## largest_fed_districtsCentral 0.83224346507
## largest_fed_districtsFar_Siberia 0.44993068298
## largest_fed_districtsMoscow 1.00000000000
## largest_fed_districtsMoscow_region 0.59160589429
## largest_fed_districtsNorth_West 0.00267739138
## largest_fed_districtsSouth 1.00000000000
## largest_fed_districtsUrals 1.00000000000
## largest_fed_districtsVolga 0.60749515839
## macro_okved_code_groupBuilding 0.53145177236
## macro_okved_code_groupFinancial_insurance 0.02246213833
## macro_okved_code_groupmanufacturing 1.00000000000
## macro_okved_code_groupother_categories 0.01243103093
## macro_okved_code_groupreal_estate 0.00008810458
## macro_okved_code_grouprural 0.07828050203
## macro_okved_code_groupScience 1.00000000000
## macro_okved_code_groupTrading 0.00006265461
## macro_okved_code_groupTransportation 1.00000000000
## max_bac_stage.0 0.61921498154
## max_bac_stage.1 0.43926554687
## max_bac_stage.2 0.48552370132
## max_bac_stage.3 0.09244717585
## max_bac_stage.4 0.88017735354
## max_bac_stage.5 1.00000000000
## max_bac_stage.6 1.00000000000
## cop_stageCouncil_discussion 1.00000000000
## cop_stageInformation_collection 0.40992816476
## cop_stageResolution 0.58180770101
## category_by_size_missingBig 0.07191971039
## category_by_size_missingMedium 0.86181236403
## category_by_size_missingMicro 0.03322321761
## category_by_size_missingSmall 0.35175519303
## category_by_size_melseElse 0.03322321761
## category_by_size_melseMicro 0.03322321761
## category_by_size_2_catBig 0.09456266968
## category_by_size_2_catSmall 0.09456266968
## is_working NA
# check the high p-values for possible errors in the code
# looks ok: the observed values are pretty close to the expected ones
xtabs(~is_working_dums$macro_okved_code_groupScience +is_working_dums$is_working)
## is_working_dums$is_working
## is_working_dums$macro_okved_code_groupScience 0 1
## 0 230 205
## 1 25 22
tst =chisq.test(xtabs(~is_working_dums$macro_okved_code_groupScience +is_working_dums$is_working))
tst$observed
## is_working_dums$is_working
## is_working_dums$macro_okved_code_groupScience 0 1
## 0 230 205
## 1 25 22
tst$expected
## is_working_dums$is_working
## is_working_dums$macro_okved_code_groupScience 0 1
## 0 230.13485 204.86515
## 1 24.86515 22.13485
xtabs(~is_working_dums$largest_fed_districtsMoscow +is_working_dums$is_working)
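For reference, the same dummy expansion can be done in base R without caret; a sketch, not used in the analysis above:
# Base-R equivalent of dummyVars: one 0/1 indicator column per factor level (a sketch)
dums_base <- do.call(cbind, lapply(names(is_working_dummies), function(v) {
  f <- factor(is_working_dummies[[v]])
  m <- model.matrix(~ 0 + f)      # one column per level, no reference level dropped
  colnames(m) <- paste0(v, levels(f))
  m
}))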