Factors of successful protection from pressure on business

Concept and economic essence of property rights. Justification and development of a model for protecting a business against possible damage to its activities caused by the influence of various external and internal market factors and economic conditions.

Category: Economic and mathematical modelling
Type: diploma thesis
Language: English
Date added: 11.08.2020
File size: 5.0 MB


Pure profit 2011
Pure profit 2012
Pure profit 2013
Pure profit 2014
Pure profit 2015
Pure profit 2016
Pure profit 2017
Pure profit 2018
Enterprise status                       0   Recoded to target variable.
Administrative position                 1
Administrative connections              1
In political party                      1
Political party                         0   Too few values
In association or SRO                   1
Association SRO name                    0   Too many distinct values
Case publications                       1
Application topic                       0   Next four variables are used instead
Criminal prosecution                    1
Capture                                 1
Corruption                              1
Barriers                                1
Application date                        0   Information is not related to the research topic
Application year                        0   Information is not related to the research topic
Have court case                         1
Is guilty                               1
Reviewed by BAC                         1
Max BAC stage                           1
Max BAC stage, grouped                  1
Supported by BAC public council         1
Reaction not passed by the applicant    1
Reaction not passed by «BAC»            1
Reaction consultation                   1
Reaction target letters control         1
To ombudsman                            1

Attachment 5

R-Studio Output

Import data and define the original set of variables.

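library(readxl)    # read_excel()
library(ggpubr)    # ggqqplot(), ggboxplot()
library(caret)     # dummyVars()
# library(Amelia)  # missmap(), needed only for the commented-out call below
# library(writexl) # write_xlsx(), needed only for the commented-out exports

cop_data <- data.frame(read_excel("cop_data.xlsx"))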

dataset <-cop_data[c(

"region_code_spark",

"federal_districts", "largest_fed_districts",

"macro_okved_code", "macro_okved_code_group",

"spark_web_site", "spark_stock_ticket",

"company_age_till_2020", "company_age_including_liquidation",

"age_till_application_date",

"n_employees_upperbound", "n_employees_added",

"authorized_capital",

"administrative_position", "administrative_connections",

"in_political_party",

"in_association_or_sro",

"case_publications",

"criminal_prosecution", "capture", "corruption", "barriers",

"have_court_case", "is_guilty",

"reviewed_by_bac", "max_bac_stage", "supported_by_bac_public_council",

"reaction_not_passed_by_applicant", "reaction_not_passed_by_bac",

"reaction_consultation",

"reaction_target_letters_control", "to_ombudsman",

"is_working", "target_light_clear",

"target_light_extended", "target_strong_extended"
)]

Part 1. Data preparation and feature engineering.

Check for missing data. The plot rendered badly here, so the call is commented out and a screenshot is attached.

# missmap(dataset, col=c("blue", "white"), legend=T, margins = c(7,7))

Which variables have missing data, and in what proportion

# Targets

sum(is.na(dataset$target_light_clear))/nrow(dataset)

## [1] 0.253112

sum(is.na(dataset$target_light_extended))/nrow(dataset)

## [1] 0.186722

sum(is.na(dataset$target_strong_extended))/nrow(dataset)

## [1] 0.186722

# other variables

sum(is.na(dataset$n_employees_upperbound))/nrow(dataset)

## [1] 0.1037344

sum(is.na(dataset$authorized_capital))/nrow(dataset)

## [1] 0.03319502
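
The same check can be run over every column at once instead of one variable at a time. The sketch below is only an illustration: the `toy` data frame and its column names are made up and stand in for `dataset`.

```r
# Toy data frame standing in for `dataset` (values are illustrative only).
toy <- data.frame(
  target      = c(1, NA, 0, 1),
  n_employees = c(10, NA, NA, 250),
  capital     = c(10000, 50000, 10000, 1e6)
)

# colMeans() over the logical NA mask gives the proportion missing per column.
missing_share <- colMeans(is.na(toy))

# Keep only the columns that actually have gaps, largest share first.
missing_share <- sort(missing_share[missing_share > 0], decreasing = TRUE)
print(round(missing_share, 3))
```

Applied to the real data, `colMeans(is.na(dataset))` should reproduce the proportions printed above for the targets, the number of employees and the authorized capital.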

Target variables distributions

table(dataset$is_working)

##
## 0 1
## 255 227

table(dataset$target_light_clear)

##
## 0 1
## 184 176

table(dataset$target_light_extended)

##
## 0 1
## 184 208

table(dataset$target_strong_extended)

##
## 0 1
## 301 91

The data preparation step begins here. There are two types of variables: categorical and continuous. Let me start with the continuous variables and prepare them for analysis.

There are three variables on a continuous scale: age, size (number of employees) and authorized capital.

Age. There are three age variables: “company age until 2020” (not taking liquidation dates into account), “company age until liquidation” and “company age until application”. Let's take a look at each of them.

# company_age_till_2020
summary(dataset$company_age_till_2020)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.50 12.12 16.50 17.06 22.00 31.00

boxplot(dataset$company_age_till_2020, main ='Company age until 2020')

hist(dataset$company_age_till_2020, main ='Company age until 2020', xlab ='Age until 2020')

# company_age_including_liquidation
summary(dataset$company_age_including_liquidation)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.50 15.00 15.35 20.88 31.00

boxplot(dataset$company_age_including_liquidation, main = 'Company age until closure')

hist(dataset$company_age_including_liquidation, main ='Company age until closure', xlab ='Age until closure')

# age_till_application_date
summary(dataset$age_till_application_date)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.00 9.00 10.26 15.00 23.00

boxplot(dataset$age_till_application_date, main ='Company age until application')

hist(dataset$age_till_application_date, main = 'Company age until application', xlab = 'Age until application')

The age data look fine: there are no outliers, but the variables are probably not normally distributed.

# company_age_till_2020
hist(dataset$company_age_till_2020, main ='Company age until 2020')

ggqqplot(dataset$company_age_till_2020)

shapiro.test(dataset$company_age_till_2020)

##
## Shapiro-Wilk normality test
##
## data: dataset$company_age_till_2020
## W = 0.96384, p-value = 1.644e-09

# company_age_including_liquidation
hist(dataset$company_age_including_liquidation, main ='Company age until liquidation')

ggqqplot(dataset$company_age_including_liquidation)

shapiro.test(dataset$company_age_including_liquidation)

##
## Shapiro-Wilk normality test
##
## data: dataset$company_age_including_liquidation
## W = 0.96839, p-value = 1.107e-08

# age_till_application_date
hist(dataset$age_till_application_date, main ='Company age until application')

ggqqplot(dataset$age_till_application_date)

shapiro.test(dataset$age_till_application_date)

##
## Shapiro-Wilk normality test
##
## data: dataset$age_till_application_date
## W = 0.9583, p-value = 1.94e-10

The null hypothesis of the Shapiro-Wilk test is that the data are normally distributed. All three p-values are far below 0.05, so for all three variables the data give enough evidence to reject normality.

Because the distributions are not normal, comparing means with a parametric test is not appropriate. It is better to use a non-parametric test, the Kruskal-Wallis rank sum test, which compares groups by ranks. Of the three age variables, only 'age_till_application_date' can be used in the analysis, because the other two incorporate information from after the application (they “look into the future”). It is therefore reasonable to test this variable against each of the target variables.

# is_working
kruskal.test(x = dataset$age_till_application_date, g = dataset$is_working)

##
## Kruskal-Wallis rank sum test
##
## data: dataset$age_till_application_date and dataset$is_working
## Kruskal-Wallis chi-squared = 6.1061, df = 1, p-value = 0.01347

test <-dataset[c("is_working", "age_till_application_date")]
ggboxplot(test, x ="is_working", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="is_working",
ylim =c(0,60))

# The boxplot shows only a small visual difference, which made me doubt the result,
# so I double-checked with the Mann-Whitney-Wilcoxon test (usually used to compare two groups).
# The result is the same: at the .05 significance level we can conclude that the ages of
# working and non-working enterprises come from non-identical populations.
wilcox.test(dataset$age_till_application_date ~dataset$is_working)

##
## Wilcoxon rank sum test with continuity correction
##
## data: dataset$age_till_application_date by dataset$is_working
## W = 25176, p-value = 0.01348
## alternative hypothesis: true location shift is not equal to 0
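
The near-identical p-values (0.01347 and 0.01348) are expected: with two groups, the Kruskal-Wallis test coincides with the Wilcoxon rank sum test under the normal approximation, and the small gap comes from the continuity correction that `wilcox.test()` applies by default. A minimal sketch on simulated data (not the thesis data; the sample sizes and rates are arbitrary assumptions):

```r
set.seed(1)
# Simulated, skewed "age" values for two groups (not the thesis data).
age   <- c(rexp(60, rate = 1 / 9), rexp(60, rate = 1 / 12))
group <- factor(rep(c(0, 1), each = 60))

kw <- kruskal.test(age ~ group)
# Normal approximation without continuity correction: same p-value as Kruskal-Wallis.
mw_plain <- wilcox.test(age ~ group, exact = FALSE, correct = FALSE)
# With continuity correction the p-value is slightly larger.
mw_corr  <- wilcox.test(age ~ group, exact = FALSE, correct = TRUE)

print(c(kruskal            = kw$p.value,
        wilcoxon_plain     = mw_plain$p.value,
        wilcoxon_corrected = mw_corr$p.value))
```

So the second test is a consistency check on the first rather than independent evidence.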

# target_light_clear
test_df <-dataset[!is.na(dataset$target_light_clear),]
kruskal.test(x = test_df$age_till_application_date, g = test_df$target_light_clear)

##
## Kruskal-Wallis rank sum test
##
## data: test_df$age_till_application_date and test_df$target_light_clear
## Kruskal-Wallis chi-squared = 1.4822, df = 1, p-value = 0.2234

test <-test_df[c("target_light_clear", "age_till_application_date")]
ggboxplot(test, x ="target_light_clear", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="target_light_clear",
ylim =c(0,60))

# target_light_extended
test_df <-dataset[!is.na(dataset$target_light_extended),]
kruskal.test(x = test_df$age_till_application_date, g= test_df$target_light_extended)

##
## Kruskal-Wallis rank sum test
##
## data: test_df$age_till_application_date and test_df$target_light_extended
## Kruskal-Wallis chi-squared = 0.93575, df = 1, p-value = 0.3334

test <-test_df[c("target_light_extended", "age_till_application_date")]
ggboxplot(test, x ="target_light_extended", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="target_light_extended",
ylim =c(0,60))

# target_strong_extended
test_df <-dataset[!is.na(dataset$target_strong_extended),]
kruskal.test(x = test_df$age_till_application_date, g = test_df$target_strong_extended)

##
## Kruskal-Wallis rank sum test
##
## data: test_df$age_till_application_date and test_df$target_strong_extended
## Kruskal-Wallis chi-squared = 0.0092569, df = 1, p-value = 0.9234

test <-test_df[c("target_strong_extended", "age_till_application_date")]
ggboxplot(test, x ="target_strong_extended", y ="age_till_application_date",
palette =c("#00AFBB", "#E7B800"),
ylab ="age_till_application_date", xlab ="target_strong_extended",
ylim =c(0,60))

We can conclude that the median company age at application differs between groups only for the target variable that records whether the company was still working in 2020 (is_working); for the other three targets the difference is not significant.

Size. The size variable is highly skewed.

# number of missing data
sum(is.na(dataset$n_employees_upperbound))

## [1] 50

plot(density(dataset$n_employees_upperbound[!is.na(dataset$n_employees_upperbound)]), main ='№ employees excluding missing', xlab ='№ of employees')

hist(dataset$n_employees_upperbound[!is.na(dataset$n_employees_upperbound)], main ='№ employees excluding missing', xlab ='№ of employees')

# after imputing missing values with 5
sum(is.na(dataset$n_employees_added))

## [1] 0

plot(density(dataset$n_employees_added), main ='№ employees imputing missing by 5', xlab ='№ of employees')

hist(dataset$n_employees_added, main ='№ employees imputing missing by 5', xlab ='№ of employees')

Both variables are clearly skewed, whether or not the missing data are imputed. This is therefore a case for categorization, and there are practical grounds for choosing the thresholds.

dataset$category_by_size_missing <-
ifelse(dataset$n_employees_upperbound <=15, 'Micro',
ifelse(dataset$n_employees_upperbound >=16&dataset$n_employees_upperbound <=100, 'Small',
ifelse(dataset$n_employees_upperbound >=101&dataset$n_employees_upperbound <=250, 'Medium','Big'
)))

dataset$category_by_size_added <-
ifelse(dataset$n_employees_added <=15, 'Micro',
ifelse(dataset$n_employees_added >=16&dataset$n_employees_added <=100, 'Small',
ifelse(dataset$n_employees_added >=101&dataset$n_employees_added <=250, 'Medium','Big'
)))
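
The same thresholds can also be written with `cut()`, which avoids the nested `ifelse()` calls and keeps the categories as an ordered set of factor levels. This is only a sketch of an alternative, not the code used in the analysis; `size_category` is a hypothetical helper name.

```r
# Same thresholds as the nested ifelse() calls: <=15, 16-100, 101-250, >250.
size_category <- function(n_employees) {
  cut(n_employees,
      breaks = c(-Inf, 15, 100, 250, Inf),
      labels = c("Micro", "Small", "Medium", "Big"),
      right  = TRUE)  # intervals are closed on the right: (15, 100] etc.
}

# Illustrative head counts, including the boundary values and a missing one.
x <- c(5, 15, 16, 100, 101, 250, 251, NA)
print(data.frame(n_employees = x, category = size_category(x)))
```

For whole-number head counts the two approaches agree. They differ for fractional values such as 15.5, which the nested `ifelse()` version would send to 'Big' and `cut()` would place in 'Small'.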

But can we impute the missing data with 5? Does this distort the overall proportions in the data?

# It is interesting how the proportions change in the 'Micro' class:
# for every target, imputing 5 for the missing values increases the
# proportion of the negative class in the 'Micro' size category.
# Thus, replacing missing data with 5 is not appropriate in this case.

# true
xtabs(~dataset$is_working +dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$is_working Big Medium Micro Small
## 0 12 12 148 35
## 1 25 15 138 47

# replaced
xtabs(~dataset$is_working +dataset$category_by_size_added)

## dataset$category_by_size_added
## dataset$is_working Big Medium Micro Small
## 0 12 12 196 35
## 1 25 15 140 47

# true
xtabs(~dataset$target_light_clear +dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$target_light_clear Big Medium Micro Small
## 0 14 8 112 23
## 1 17 16 97 37

# replaced
xtabs(~dataset$target_light_clear +dataset$category_by_size_added)

## dataset$category_by_size_added
## dataset$target_light_clear Big Medium Micro Small
## 0 14 8 139 23
## 1 17 16 106 37

# true
xtabs(~dataset$target_light_extended +dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$target_light_extended Big Medium Micro Small
## 0 14 8 112 23
## 1 20 19 110 50

# replaced
xtabs(~dataset$target_light_extended +dataset$category_by_size_added)

## dataset$category_by_size_added
## dataset$target_light_extended Big Medium Micro Small
## 0 14 8 139 23
## 1 20 19 119 50

# true
xtabs(~dataset$target_strong_extended +dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$target_strong_extended Big Medium Micro Small
## 0 22 17 173 57
## 1 12 10 49 16

# replaced
xtabs(~dataset$target_strong_extended +dataset$category_by_size_added)

## dataset$category_by_size_added
## dataset$target_strong_extended Big Medium Micro Small
## 0 22 17 205 57
## 1 12 10 53 16
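
The distortion can be quantified directly from the printed tables. The sketch below re-enters the two `is_working` cross-tabs by hand (counts copied from the output above) and shows where the 50 imputed rows ended up.

```r
# Counts copied from the is_working cross-tabs above (rows: is_working 0/1).
sizes <- c("Big", "Medium", "Micro", "Small")
observed <- matrix(c(12, 12, 148, 35,
                     25, 15, 138, 47),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(is_working = c("0", "1"), size = sizes))
imputed <- matrix(c(12, 12, 196, 35,
                    25, 15, 140, 47),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(is_working = c("0", "1"), size = sizes))

# Rows that had a missing size and landed in each cell after imputation.
added <- imputed - observed

# Share of is_working = 0 within the Micro category, before and after.
micro_share_before <- observed["0", "Micro"] / sum(observed[, "Micro"])
micro_share_after  <- imputed["0", "Micro"]  / sum(imputed[, "Micro"])

print(added)
print(round(c(before = micro_share_before, after = micro_share_after), 3))
```

Of the 50 rows with a missing employee count, 48 have is_working = 0 and only 2 have is_working = 1, so the missingness is strongly related to the target. Imputing a single value pushes all of them into 'Micro' and raises the share of the negative class there from about 0.52 to about 0.58.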

Looking at the cross-tabs, the 'Micro' category is by far the largest; the other classes are clearly a minority. So I decided to create some extra size features to experiment with and check which one performs better with each target.

xtabs(~dataset$is_working+dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$is_working Big Medium Micro Small
## 0 12 12 148 35
## 1 25 15 138 47

xtabs(~dataset$target_light_clear+dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$target_light_clear Big Medium Micro Small
## 0 14 8 112 23
## 1 17 16 97 37

xtabs(~dataset$target_light_extended+dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$target_light_extended Big Medium Micro Small
## 0 14 8 112 23
## 1 20 19 110 50

xtabs(~dataset$target_strong_extended+dataset$category_by_size_missing)

## dataset$category_by_size_missing
## dataset$target_strong_extended Big Medium Micro Small
## 0 22 17 173 57
## 1 12 10 49 16

# Micro vs. all other categories
dataset$category_by_size_melse<-ifelse(dataset$category_by_size_missing == "Micro", "Micro", "Else")
table(dataset$category_by_size_melse)

##
## Else Micro
## 146 286

# Small + Micro categories vs. Medium + Big
dataset$category_by_size_2_cat<-
ifelse(dataset$category_by_size_missing == "Micro"|dataset$category_by_size_missing == "Small", "Small", "Big")
table(dataset$category_by_size_2_cat)

##
## Big Small
## 64 368

Authorized capital. The situation with authorized capital is similar to that of the size variable: the distribution is highly skewed.

test_df <-dataset[!is.na(dataset$authorized_capital),]

hist(test_df$authorized_capital, main ='Companies authorized capital distribution',
xlab ='Size of the authorized capital')

So I decided to categorize this variable too, here into three groups of roughly equal size.

dataset$auth_capital_group <-
ifelse(dataset$authorized_capital <=10000, 'under_10k',
ifelse(dataset$authorized_capital>=10001&dataset$authorized_capital <=210000, 'under_210k', 'over_210k'))
table(dataset$auth_capital_group)

##
## over_210k under_10k under_210k
## 150 188 128

# there is even an association with is_working, but associations are left for later steps...
chisq.test(xtabs(~dataset$is_working +dataset$auth_capital_group))

##
## Pearson's Chi-squared test
##
## data: xtabs(~dataset$is_working + dataset$auth_capital_group)
## X-squared = 9.4727, df = 2, p-value = 0.008771

The next step of data preparation deals with the categorical variables. Although these variables are already categorical, some of them need checking in order to simplify further analysis and reduce the risk of spurious associations (that is, small groups with a high chance of a 'win' or a 'loss').

Region. The problem with this variable is that it has too many categories, many of them very small, which will cause problems in later analysis. To deal with this, I prepared several variants of the variable in order to test them.

table(dataset$region_code_spark)

##
## 5 6 7 11 17 18 22 27 28 29 30 31 32 33 35 37 40 41
## 1 1 1 2 1 5 5 4 1 1 7 5 4 5 5 3 4 2
## 43 44 46 47 48 49 51 53 55 56 57 58 62 65 67 68 69 71
## 2 2 5 3 4 1 1 3 3 8 3 4 2 6 5 1 3 3
## 72 76 79 80 82 89 91 102 103 113 116 121 123 124 125 126 134 136
## 7 3 3 1 1 4 6 19 1 2 13 11 19 5 7 9 5 9
## 138 142 152 154 159 161 163 164 173 174 178 196 750 799
## 6 5 15 8 3 14 7 3 2 10 21 8 36 103

The first variant groups regions by federal district, with separate categories for Moscow and the Moscow region. This variant looks much better.

table(dataset$federal_districts)

##
## Caucasus Central Far_East Moscow
## 12 61 26 103
## Moscow_region North_West Saint_Petersburg Siberian
## 36 21 21 33
## South Urals Volga
## 46 29 94

The second option is an even coarser grouping that merges neighbouring districts: Saint_Petersburg + North_West = North_West; South + Caucasus = South; Far_East + Siberian = Far_Siberia. Moscow and the Moscow region remain separate categories.

table(dataset$largest_fed_districts)

##
## Central Far_Siberia Moscow Moscow_region North_West
## 61 59 103 36 42
## South Urals Volga
## 58 29 94

OKVED activity. The same applies to the OKVED activity code. The initial categorization has several groups that are small enough to cause concern.

table(dataset$macro_okved_code)

##
## administrative Building Culture_sport
## 7 83 4
## Education energy_gas_steam Financial_insurance
## 3 10 23
## Health Hotels_catering Information
## 5 4 14
## manufacturing mining Other_services
## 80 10 5
## real_estate rural Science
## 43 24 47
## Trading Transportation water_supp
## 98 18 4

So I prepared a second option, in which all OKVED codes with fewer than 15 observations are recoded into the 'other_categories' group.

table(dataset$macro_okved_code_group)

##
## Building Financial_insurance manufacturing
## 83 23 80
## other_categories real_estate rural
## 66 43 24
## Science Trading Transportation
## 47 98 18
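
The recoding code itself is not shown in this output, so the following is only a sketch of how such lumping could be done. It rebuilds the OKVED vector from the counts printed above; `lump_rare` is a hypothetical helper, not a function from the original script.

```r
# Rebuild the OKVED vector from the counts printed above.
counts <- c(administrative = 7, Building = 83, Culture_sport = 4,
            Education = 3, energy_gas_steam = 10, Financial_insurance = 23,
            Health = 5, Hotels_catering = 4, Information = 14,
            manufacturing = 80, mining = 10, Other_services = 5,
            real_estate = 43, rural = 24, Science = 47,
            Trading = 98, Transportation = 18, water_supp = 4)
okved <- rep(names(counts), times = counts)

# Hypothetical helper: lump levels with fewer than `min_n` rows together.
lump_rare <- function(x, min_n = 15, other = "other_categories") {
  freq <- table(x)
  rare <- names(freq)[freq < min_n]
  ifelse(x %in% rare, other, x)
}

grouped <- table(lump_rare(okved))
print(grouped)
```

With a threshold of 15 this gives the same nine groups as the table above, including 66 observations in 'other_categories'.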

Business against corruption stage.

The final variable I changed was 'max_bac_stage', which records the highest stage the application reached in the “Business against corruption” procedure.

Although the 4th and 6th stages are the largest, adjacent stages do not differ much in substance. For instance, the 3rd and 4th stages are both about getting an expert resolution on the case. So I decided to recode the stages into larger, more meaningful groups to see which variable performs better.

table(dataset$max_bac_stage)

##
## 0 1 2 3 4 5 6
## 12 40 35 28 247 29 91

dataset$cop_stage <-ifelse(dataset$max_bac_stage <=2, 'Information_collection',
ifelse(dataset$max_bac_stage ==3|dataset$max_bac_stage ==4, 'Resolution',
'Council_discussion'))

table(dataset$cop_stage)

##
## Council_discussion Information_collection Resolution
## 120 87 275

Part 2. Relationships discovery.

# some code that can help me
# Upper-triangular matrix of chi-squared statistics for every pair of columns
chisqmatrix_stat <- function(x) {
  names <- colnames(x); num <- length(names)
  m <- matrix(nrow = num, ncol = num, dimnames = list(names, names))
  for (i in 1:(num - 1)) {
    for (j in (i + 1):num) {
      m[i, j] <- chisq.test(x[, i], x[, j])$statistic
    }
  }
  return(m)
}

# Upper-triangular matrix of chi-squared test p-values for every pair of columns
chisqmatrix_pval <- function(x) {
  names <- colnames(x); num <- length(names)
  m <- matrix(nrow = num, ncol = num, dimnames = list(names, names))
  for (i in 1:(num - 1)) {
    for (j in (i + 1):num) {
      m[i, j] <- chisq.test(x[, i], x[, j])$p.value
    }
  }
  return(m)
}
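
Only the column for the target is used from these matrices later on, so a lighter variant could test each predictor against the target once and return the statistic and the p-value together. The sketch below is an assumption about how that might look, not the author's code; `chisq_vs_target` and the simulated `toy` data are hypothetical.

```r
# Hypothetical helper: test every column of `data` against one target column.
chisq_vs_target <- function(data, target) {
  predictors <- setdiff(colnames(data), target)
  rows <- lapply(predictors, function(v) {
    tst <- chisq.test(data[[v]], data[[target]])
    data.frame(variable  = v,
               statistic = unname(tst$statistic),
               p_value   = tst$p.value)
  })
  do.call(rbind, rows)
}

# Simulated binary data (not the thesis data): `a` drives the target, `b` is noise.
set.seed(42)
toy <- data.frame(a = rbinom(200, 1, 0.5), b = rbinom(200, 1, 0.5))
toy$target <- ifelse(runif(200) < 0.8, toy$a, 1 - toy$a)

res <- chisq_vs_target(toy, "target")
print(res)
```

This runs one test per predictor instead of one per pair of columns, which matters little at this data size but keeps the statistic and p-value from the same call.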

Since the difference in medians for the continuous age variable has already been checked, in this section I generate a matrix of chi-squared statistics and p-values for the categorical variables.

The first step uses the variables as they are; the second checks the categorical variables with multiple levels by creating dummies.

Target: is_working

# check for distributions
is_working_vars <-c(
"federal_districts",
"largest_fed_districts",
#"macro_okved_code",
"macro_okved_code_group",
"spark_web_site",
"spark_stock_ticket",
"category_by_size_missing",
"category_by_size_melse",
"category_by_size_2_cat",
"administrative_position",
"administrative_connections",
"in_political_party",
"in_association_or_sro",
"case_publications",
"criminal_prosecution",
"capture", "corruption", "barriers",
"have_court_case", "is_guilty", "reviewed_by_bac",
"max_bac_stage", "supported_by_bac_public_council",
"reaction_not_passed_by_applicant", "reaction_consultation",
"reaction_target_letters_control", "to_ombudsman",
"reaction_not_passed_by_bac",
"auth_capital_group",
"cop_stage",
"is_working")
is_working_cs <-dataset[is_working_vars]

# this is a very long output, it just returns xtabs for each variable and a target
#for (i in 1:length(is_working_vars)){
# print(xtabs(~is_working_cs$is_working + is_working_cs[,i]))
#}

# variables to worry about: spark_stock_ticket, reaction_not_passed_by_bac (somehow, still low counts)

xtabs(~dataset$reaction_consultation)

is_working_vars <-c(
"federal_districts",
"largest_fed_districts",
#"macro_okved_code",
"macro_okved_code_group",
"spark_web_site",
"spark_stock_ticket",
"category_by_size_missing",
"category_by_size_melse",
"category_by_size_2_cat",
"administrative_position",
"administrative_connections",
"in_political_party",
"in_association_or_sro",
"case_publications",
"criminal_prosecution",
"capture", "corruption", "barriers",
"have_court_case", "is_guilty", "reviewed_by_bac",
"max_bac_stage", "supported_by_bac_public_council",
"reaction_not_passed_by_applicant", "reaction_consultation",
"reaction_target_letters_control", "to_ombudsman",
"reaction_not_passed_by_bac",
"auth_capital_group",
"cop_stage",
"is_working")

is_working_cs <-dataset[is_working_vars]

is_working_cs_mat_stat =chisqmatrix_stat(is_working_cs)
is_working_cs_mat_stat <-format( data.frame(is_working_cs_mat_stat)["is_working"], scientific = F)
is_working_cs_mat_pval =chisqmatrix_pval(is_working_cs)
is_working_cs_mat_pval <-format( data.frame(is_working_cs_mat_pval)["is_working"], scientific = F)

is_working_cs_df <-data.frame(c(is_working_cs_mat_stat, is_working_cs_mat_pval))
rownames(is_working_cs_df) <-rownames(is_working_cs_mat_stat)
colnames(is_working_cs_df) <-c("Statistic","P-value")
#write_xlsx(data.frame(is_working_cs_df), 'is_working_cs.xlsx')
is_working_cs_df[2]

## P-value
## federal_districts 0.2097364820098783
## largest_fed_districts 0.1491242995084314
## macro_okved_code_group 0.0000004029985171
## spark_web_site 0.0000000001211396
## spark_stock_ticket 0.0952582185988386
## category_by_size_missing 0.0992954794302832
## category_by_size_melse 0.0332232176140461
## category_by_size_2_cat 0.0945626696786416
## administrative_position 0.0545086826433797
## administrative_connections 0.1339695931231736
## in_political_party 0.0049908143773374
## in_association_or_sro 0.0000046940502687
## case_publications 0.3562620121957124
## criminal_prosecution 0.3424585386949022
## capture 0.0605566407878402
## corruption 0.6580199336591293
## barriers 0.0657458127196991
## have_court_case 0.7558063380989922
## is_guilty 0.8370032224854111
## reviewed_by_bac 0.7171329654303231
## max_bac_stage 0.4826337816347389
## supported_by_bac_public_council 0.2403666393970288
## reaction_not_passed_by_applicant 0.0029202062366615
## reaction_consultation 0.1563821496506542
## reaction_target_letters_control 0.8901658295767594
## to_ombudsman 0.0278554894020524
## reaction_not_passed_by_bac 0.6128206928637800
## auth_capital_group 0.0087706387308923
## cop_stage 0.6333066056508621
## is_working NA

# check: the code appears to have worked correctly
chisq.test(xtabs(~dataset$is_working +dataset$macro_okved_code_group))

##
## Pearson's Chi-squared test
##
## data: xtabs(~dataset$is_working + dataset$macro_okved_code_group)
## X-squared = 44.792, df = 8, p-value = 4.03e-07

chisq.test(xtabs(~dataset$is_working +dataset$in_political_party))

##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xtabs(~dataset$is_working + dataset$in_political_party)
## X-squared = 7.8828, df = 1, p-value = 0.004991

It is time to check for categorical data with several levels.

is_working_dummies <-is_working_cs[c("federal_districts", "largest_fed_districts", "macro_okved_code_group", "max_bac_stage", "cop_stage", "category_by_size_missing", "category_by_size_melse", "category_by_size_2_cat" )]

is_working_dummies$max_bac_stage <-as.factor(is_working_dummies$max_bac_stage)
dums <-dummyVars(" ~ .", data = is_working_dummies)
is_working_dums <-data.frame(predict(dums, newdata = is_working_dummies))
is_working_dums$is_working <-is_working_cs$is_working

is_working_dums_pval =chisqmatrix_pval(is_working_dums)
is_working_dums_pval <-format( data.frame(is_working_dums_pval)["is_working"], scientific = F)
is_working_dums_stat =chisqmatrix_stat(is_working_dums)
is_working_dums_stat <-format( data.frame(is_working_dums_stat)["is_working"], scientific = F)

is_working_dums_df <-data.frame(c(is_working_dums_stat, is_working_dums_pval))
rownames(is_working_dums_df) <-rownames(is_working_dums_stat)
colnames(is_working_dums_df) <-c("Statistic","P-value")
#write_xlsx(data.frame(is_working_dums_df), 'is_working_cs_dums.xlsx')
is_working_dums_df[2]

## P-value
## federal_districtsCaucasus 0.27897424106
## federal_districtsCentral 0.83224346507
## federal_districtsFar_East 0.61213940228
## federal_districtsMoscow 1.00000000000
## federal_districtsMoscow_region 0.59160589429
## federal_districtsNorth_West 0.04971145510
## federal_districtsSaint_Petersburg 0.04971145510
## federal_districtsSiberian 0.72908740098
## federal_districtsSouth 0.50155081889
## federal_districtsUrals 1.00000000000
## federal_districtsVolga 0.60749515839
## largest_fed_districtsCentral 0.83224346507
## largest_fed_districtsFar_Siberia 0.44993068298
## largest_fed_districtsMoscow 1.00000000000
## largest_fed_districtsMoscow_region 0.59160589429
## largest_fed_districtsNorth_West 0.00267739138
## largest_fed_districtsSouth 1.00000000000
## largest_fed_districtsUrals 1.00000000000
## largest_fed_districtsVolga 0.60749515839
## macro_okved_code_groupBuilding 0.53145177236
## macro_okved_code_groupFinancial_insurance 0.02246213833
## macro_okved_code_groupmanufacturing 1.00000000000
## macro_okved_code_groupother_categories 0.01243103093
## macro_okved_code_groupreal_estate 0.00008810458
## macro_okved_code_grouprural 0.07828050203
## macro_okved_code_groupScience 1.00000000000
## macro_okved_code_groupTrading 0.00006265461
## macro_okved_code_groupTransportation 1.00000000000
## max_bac_stage.0 0.61921498154
## max_bac_stage.1 0.43926554687
## max_bac_stage.2 0.48552370132
## max_bac_stage.3 0.09244717585
## max_bac_stage.4 0.88017735354
## max_bac_stage.5 1.00000000000
## max_bac_stage.6 1.00000000000
## cop_stageCouncil_discussion 1.00000000000
## cop_stageInformation_collection 0.40992816476
## cop_stageResolution 0.58180770101
## category_by_size_missingBig 0.07191971039
## category_by_size_missingMedium 0.86181236403
## category_by_size_missingMicro 0.03322321761
## category_by_size_missingSmall 0.35175519303
## category_by_size_melseElse 0.03322321761
## category_by_size_melseMicro 0.03322321761
## category_by_size_2_catBig 0.09456266968
## category_by_size_2_catSmall 0.09456266968
## is_working NA
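
The dummy columns above come from `dummyVars()` in the caret package. For a single factor, base R's `model.matrix()` produces the same kind of 0/1 columns with the same naming pattern; a minimal sketch on made-up rows (the `toy` data frame is illustrative only):

```r
# Made-up rows using the cop_stage levels from this analysis.
toy <- data.frame(
  cop_stage  = factor(c("Resolution", "Council_discussion",
                        "Information_collection", "Resolution")),
  is_working = c(1, 0, 1, 0)
)

# "~ 0 + cop_stage" drops the intercept so every level gets its own 0/1 column.
dummies <- as.data.frame(model.matrix(~ 0 + cop_stage, data = toy))
dummies$is_working <- toy$is_working
print(dummies)
```

Note that with several factors in one formula, `model.matrix()` keeps all levels only for the first factor, so this shortcut would have to be applied one variable at a time.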

# check the high p-values for possible errors in the code
# looks OK: the observed values are very close to the expected ones
xtabs(~is_working_dums$macro_okved_code_groupScience +is_working_dums$is_working)

## is_working_dums$is_working
## is_working_dums$macro_okved_code_groupScience 0 1
## 0 230 205
## 1 25 22

tst =chisq.test(xtabs(~is_working_dums$macro_okved_code_groupScience +is_working_dums$is_working))
tst$observed

## is_working_dums$is_working
## is_working_dums$macro_okved_code_groupScience 0 1
## 0 230 205
## 1 25 22

tst$expected

## is_working_dums$is_working
## is_working_dums$macro_okved_code_groupScience 0 1
## 0 230.13485 204.86515
## 1 24.86515 22.13485
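
The expected counts, and the p-value of exactly 1 reported for this dummy, can be reproduced by hand. Each expected count is row total × column total / grand total. For 2x2 tables `chisq.test()` applies Yates' continuity correction by default, and when every observed count is within 0.5 of its expected value the corrected statistic is zero. The sketch below re-enters the table from the output above.

```r
# 2x2 table copied from the output above (rows: Science dummy, cols: is_working).
obs <- matrix(c(230, 205,
                 25,  22),
              nrow = 2, byrow = TRUE,
              dimnames = list(science = c("0", "1"), is_working = c("0", "1")))

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(round(expected, 5))

# With Yates' correction (the default for 2x2 tables) the statistic shrinks to ~0.
with_yates    <- chisq.test(obs)
without_yates <- chisq.test(obs, correct = FALSE)
print(c(yates = with_yates$p.value, plain = without_yates$p.value))
```

This is why several dummies in the table above show a p-value of 1.00000000000: it reflects the continuity correction on a nearly independent 2x2 table, not an error in the helper functions.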

xtabs(~is_working_dums$largest_fed_districtsMoscow +is_working_dums$is_working)

...
