Maximizing the Effectiveness of Competitions on the Kaggle Platform

Development of tools aimed at maximizing the effectiveness of competitions on Kaggle. Validation of a new modified knn algorithm, which allows participants of Kaggle competitions to obtain more accurate results for classification problems.

Category: Programming, computers and cybernetics
Type: Diploma thesis
Language: English
Date added: 01.12.2019
File size: 611.7 K


However, we acknowledge that the phenomenon of the platform having such a large share of “passive” users is interesting in itself, and the search for the reasons behind this passiveness could serve as material for further work.

Figure 1 illustrates the distribution of users by their performance tiers. The scale for user performance was defined by Kaggle itself and has been kindly provided in the database on its web site; it varies from a minimum of zero to a maximum of five. As can be observed from the figure, only a small number of users have high performance tiers. However, these are the users who are active on the platform, and they are of primary concern in our research, since we want to understand which factors are most attractive to the best users with the highest performance rankings. Ideally, one could argue that users whose activity score is just two cannot be considered the “best”. However, since the number of users with high rankings such as four or five is quite limited, as can be observed, we decided to enlarge our sample and also include users with a ranking of two or more.

Overall, against roughly three million users at the moment of the analysis, Kaggle provides just 1498 competitions. For the purposes of this work, we have analyzed only the competitions of the top users. First of all, we would like to underline that out of the 318 unique competitions gathered in our sample, quite a few have only one contestant. Obviously, these competitions have more than a single contestant in reality; however, in the sample that we gathered, only one person from it happened to be involved.

Figure 2. The top 10 competitions by the frequency of occurrence

The frequencies of the competitions contain some other interesting information. For example, from Figure 2 one can see that among the top ten competitions by frequency there is one observation that, even by eye, has attracted considerably more participants than the others. In fact, this competition has almost twice the number of contestants of the nearest second one. It is the “UCI Math77B: collaborative Filtering” competition. Although the number of occurrences of this competition is high, that does not imply that it must have the biggest number of contestants. A high frequency of occurrence only means that the competition has been suggested as a potential choice to a large number of users. However, if we look at this competition's information, we can see that it has only 68 total competitors, which is dramatically far from the best indicator of 8391 contestants held by the “Home Credit Default Risk” competition. The fact that “UCI Math77B: collaborative Filtering” has been suggested to over 1500 users and only 68 of them decided to participate speaks to the great inefficiency of this competition. Since “Home Credit Default Risk” stands out so much from the other competitions by the number of contestants, we stopped at it to get a general understanding and to see whether there is something extraordinary about it that helped it attract so much attention from the community of data scientists. In fact, we found nothing particularly special. For example, the competition offers only a seventy thousand dollar reward, which is not so high in comparison with other competitions held on Kaggle, which can offer one hundred thousand or even more than a million dollars.
Moreover, the reward was divided among the first three places, with the first-place winner receiving only thirty-five thousand dollars (Home Credit Default Risk, n.d.). In addition, the competition's timeline appears to be quite short, so we cannot state that it managed to attract more contestants because of a long duration. Nevertheless, while searching the internet we found some articles on this particular competition and even some video classes and other recordings that use it (Соревнование Kaggle Home Credit Default Risk -- анализ данных и простые предсказательные модели, 2018). We think that these factors could potentially have contributed to the popularity of the competition.

Speaking about the timeline of the competitions, the metadata provided by Kaggle suggests that the longest competitions were hosted mostly by universities or data science organizations. In fact, five out of the ten longest competitions were held by various universities, among which the Moscow Institute of Physics and Technology (MIPT) has the biggest number of competitions: two.

Naively, before conducting the analysis we had expected to see a linear relationship between the number of participants and the length of a competition's timeline. Nevertheless, as Figure 3 illustrates, hardly any relationship exists. It can also be noticed that there is a large density of points in the interval between 0 and 5 thousand, which basically means that a considerable number of competitions had durations of up to 5 thousand hours. Even if we zoom into this cloud of points and restrict ourselves to this subset of the sample, there is no clear relationship whatsoever between the duration of a competition and the number of users it manages to acquire.

Figure 3. The relationship between the number of contestants and the competition's timeline

On the other hand, an inverted U-shaped, left-skewed relationship can be spotted between the log-scaled amount of the reward offered to participants and the number of contestants. In Figure 4 we show the relationship between the number of participants and the log of the reward amount, since on the original scale the density of points is very high near zero and the remaining points look like outliers.

Figure 4. The relationship between the number of contestants and the competition's reward

Now we will look at the contestants of the competitions. Here, too, we can make several interesting remarks. For example, we have looked at the distribution of the number of competitions that contestants decided to participate in. In this case, there is one contestant who has joined considerably more competitions than the others: 61 competitions. However, assessing this contestant's level of success, we can state that he does not have a very high score; in fact, his score is quite moderate.

We have also calculated the scores of all the other contestants, based on the place they managed to reach in the competition. We took one minus the fraction of the place the contestant achieved over the number of people who participated in the competition. Therefore, a score close to one means that the contestant has been successful and has reached a high place in the competition's leaderboard. Figure 5 illustrates the resulting scores of the contestants.
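The score computation described above can be sketched in a few lines:

```python
def contestant_score(place, n_participants):
    """Relative success score: one minus the contestant's place divided by
    the number of participants. Values near 1 mean a top leaderboard finish,
    values near 0 mean finishing at the bottom."""
    return 1 - place / n_participants

# A contestant who placed 25th out of 100 participants gets a score of 0.75.
print(contestant_score(25, 100))  # 0.75
```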

Figure 5. The distribution of the scores among the contestants

We also found out from our database that the majority of users preferred to participate in competitions on their own, without forming any teams.

To sum up the discussion of our sample, we provide a table with the frequency of occurrences of gold, silver and bronze medals that the users managed to acquire. This will serve as one of the sources for determining the level of skillfulness of the users.

Table 1. The distribution of contestants' medals

Medals  Comp.   Comp.   Comp.   Kernel  Kernel  Kernel  Disc.   Disc.   Disc.
        Gold    Silver  Bronze  Gold    Silver  Bronze  Gold    Silver  Bronze
0       24822   22371   19788   25166   25404   25039   25277   24991   20341
1       574     600     3688    238     -       238     -       284     2387
>=2     8       2433    1928    -       -       127     127     2+127   2676

4.2 The implementation of the models

At this point in the work, after getting some initial sense of our sample, we are ready to conduct the analysis. First of all, we deal with the task of identifying the factors that may influence a contestant's choice when deciding whether to participate in a competition or not. In order to do so, we conducted a choice-based conjoint analysis. As we have already mentioned, choice-based conjoint analysis assumes under its hood a multinomial logistic regression, which is what we have done for this section. Table 2 illustrates the output of the analysis. Here and below the significance codes are as follows: “***” for 0, “**” for 0.01, “.” for 0.05.
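In a multinomial logit of this kind, the probability of choosing an alternative among the presented set is the softmax of the alternatives' utilities. A minimal sketch of that choice rule (the utility values below are hypothetical, for illustration only):

```python
import math

def choice_probabilities(utilities):
    """Multinomial-logit choice rule: P(i) = exp(V_i) / sum_j exp(V_j),
    where V_i is the utility of alternative i."""
    exps = [math.exp(v) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical competitions shown to a user; the one with the
# highest utility gets the highest choice probability.
probs = choice_probabilities([1.2, 0.4, -0.3])
print(probs)
```

The probabilities always sum to one, so raising one competition's utility necessarily lowers the choice probabilities of its alternatives.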

Table 2. The output of the model

Coefficients          Estimate     Std. Error   Z value    Pr(>|z|)
NumPrizes              1.552e-01   1.887e-02      8.223    < 2e-16 ***
RewardTypeEUR         -1.525e+01   9.615e+01     -0.159    0.873949
RewardTypeJobs        -2.396e+00   1.409e-01    -17.003    < 2e-16 ***
RewardTypeKnowledge   -6.165e+00   1.992e-01    -30.955    < 2e-16 ***
RewardTypeKudos       -1.492e+01   1.613e+02     -0.092    0.926305
RewardTypeSwag        -2.789e+00   1.787e-01    -15.606    < 2e-16 ***
RewardTypeUSD         -2.027e+00   1.060e-01    -19.120    < 2e-16 ***
RewardQuantity        -8.282e-07   9.974e-08     -8.304    < 2e-16 ***
Score                  7.683e-01   7.009e-02     10.962    < 2e-16 ***
dateDiff               4.692e-05   7.389e-06      6.350    2.15e-10 ***
BanTeamMergersTrue     5.875e-01   7.074e-02      8.305    < 2e-16 ***
HasKernelsTrue         3.590e-01   7.195e-02      4.989    6.06e-07 ***
seasonEnabledspring    8.004e-01   4.851e-02     16.500    < 2e-16 ***
seasonEnabledsummer    3.903e-01   5.424e-02      7.195    6.22e-13 ***
seasonEnabledwinter    1.991e-01   5.366e-02      3.711    0.000206 ***

From the output, it can be seen that several factors were found significant at the 0.05 significance level. The coefficients of the variables are the most interesting part for us, because they make the model well interpretable. In our case the coefficients correspond to changes in the log odds, since there are no interaction terms. Negative coefficients indicate a decrease in the log odds associated with the predictor, all other terms being fixed, and, vice versa, positive ones indicate an increase. The further a coefficient is from zero, the stronger the conclusions we can draw about the outcomes of the model. For example, a single glance at Table 2 lets us notice that the number of rewards for the competition, encoded as “NumPrizes”, has a positive coefficient. According to the model, a one-unit increase in the number of places that guarantee a prize for contestants is expected to change the log odds by 0.1552. Converting this into the odds, we can take the exponent of 0.1552 and observe that the odds ratio is approximately 1.17. We can also easily express the odds ratio as a percentage: a ratio of 1.17 says that a unit increase in “NumPrizes” gives a 17% increase in the odds of participating in the competition.
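The log-odds-to-odds conversion for the “NumPrizes” estimate from Table 2 can be checked in a couple of lines:

```python
import math

beta_numprizes = 0.1552           # the "NumPrizes" estimate (1.552e-01)
odds_ratio = math.exp(beta_numprizes)
print(round(odds_ratio, 2))       # 1.17: each extra prize place raises the odds by ~17%
```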

The majority of our variables are represented as factors, that is, they have different levels. To understand the output of the model, it is important to realize how the model deals with this kind of variable instead of plain numeric ones. The observer should check the reference (ground) level of the factor, that is, the level against which the model conducted its computations. For example, “HasKernelsTrue” is a factor variable and has a positive coefficient. In this case, the reference level with which it has been compared is “HasKernelsFalse”, which, though not represented in the output of the model, is there, hidden behind the curtain. The positive coefficient of “HasKernelsTrue” tells us that if a competition offers Kernels, its odds of attracting a contestant are higher by 43% than those of competitions that do not offer Kernels.

At first, before carrying out any analysis, based on our understanding and personal experience we thought that, when choosing competitions, users generally pay the most attention to the subject of the competition, whether it is personally interesting to them or not, and to the amount of money the competition offers its contestants. However, as can be seen from Table 2, the amount of money, coded in the “RewardQuantity” variable, though significant in our model, has a negative coefficient. Under the interpretation of the model that we have developed, this means that a unit increase in the variable is expected to multiply the odds ratio by about 0.99. This fact would have been quite surprising for us had we not done the exploratory analysis of the data: after conducting it and getting acquainted with the sample, we had started to expect a result like this one. Actually, Figure 4 had already given us this sense of the sample. However, the question of why this is the case, why users are reluctant to participate in competitions that pay high rewards, remains somewhat mysterious to us, though we have some thoughts about it. As can be seen from the model output in Table 2, the other variable, “NumPrizes”, which represents the number of places that guarantee contestants a reward, has a positive coefficient, indicating that the more places guarantee prizes, the more eager people are to participate in the competition. This can explain the way users behave on the platform. We think that, as rational actors, people seek an opportunity to maximize their profit.
However, as frequently happens in society, there are not so many people who are ready to invest their money and effort into very risky projects that can in an instant make them either very happy, in our case by winning the competition reward, or very sad, by failing to reach the positions in the leaderboard that guarantee winning. Therefore, people are more willing to take part in competitions where their chances of winning are higher than in those that offer more money but at the same time are supposed to be harder to accomplish.

We also had a look at the characteristics of the users themselves and the way they can be connected to the final choice they make when choosing a competition. Table 3 shows the results of the analysis.

Table 3. The output of the model

Coefficients       Estimate     Std. Error   Z value    Pr(>|z|)
Comp_rating        -5.175e-05   4.339e-05     -1.193    0.23294
Comp_highestRank    4.416e-04   1.104e-04      4.022    6.29e-05 ***
Comp_gold          -1.180e+00   9.750e-02    -12.103
Comp_silver        -3.977e-01   4.554e-02     -8.733
Comp_bronze        -1.939e-01   2.324e-02     -8.340
Kernel_gold        -3.089e-02   2.047e-01     -0.151    0.88005
Kernel_bronze      -3.254e-01   1.079e-01     -3.014    0.00258 **
Disc_gold           NA          NA             NA       NA
Disc_silver        -3.406e-01   1.278e-01     -2.665    0.0770 **
Disc_bronze         8.222e-02   1.543e-02      5.330    9.83e-08 ***

Interpreting the results, we can highlight some interesting things. First of all, it can be seen that one of the coefficients is not defined because of singularities. The variable named “Disc_gold”, which encodes the number of gold discussion medals that the user managed to get, has all its coefficients reported as “NA”, not available. The reason is an almost perfect linear relationship between this variable and some other one in our model. This is indeed the case: if we look at Table 1, we can see that there are only three distinct values for this variable, with the majority of them being zero, which is very similar to the variable named “Kernel_bronze”. Plotting these variables, we saw that there is indeed a linear relationship; moreover, if we compute the correlation between the two, the coefficient is equal to 0.95, which indicates a strong linear relationship. Another fact that can be highlighted is that every new unit of gold, silver or bronze medal points in the competitions lessens the odds of participation. We think this can be partially explained by the great sparsity of the matrix in our sample: as Table 1 illustrates, most of the values in almost all these variables are zeros, and we believe this can influence the output of the model.
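The singularity can be reproduced on synthetic data: two almost identical sparse columns are nearly perfectly correlated, and a design matrix with a duplicated column loses rank, which is exactly the situation in which a regression package reports NA coefficients. A sketch with simulated medal counts (not our actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.poisson(0.05, size=1000).astype(float)   # sparse medal counts, mostly zeros
x2 = x1 + rng.normal(0, 0.05, size=1000)          # almost a copy of x1

corr = np.corrcoef(x1, x2)[0, 1]
print(round(corr, 2))  # close to 1: near-perfect collinearity

# A design matrix with an exactly duplicated column is rank deficient,
# so the normal equations have no unique solution.
X = np.column_stack([np.ones(1000), x1, x1])
print(np.linalg.matrix_rank(X))  # 2, not 3
```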

As a logical continuation of the choice-based conjoint analysis, we have developed a method that allows us to determine the probability of a contestant's participation given the set of characteristics of the competition. Moreover, the method can easily be expanded to include not only the features of the competitions but also the features of the users while computing the probabilities. This gives us an opportunity to assess the effects of user-specific variables on the outcome for varying levels of competition-specific variables.

Thus, the method works as follows: given the output of the logistic regression and new input data with the characteristics, we multiply the coefficients of the regression by the variables and obtain as a result the utility formed by the characteristics. Finally, using the logit formula, one can easily calculate the probability of participation given the set of features. This can be done for any set of features as long as they are consistent with the model.
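A minimal sketch of this computation; the intercept and feature values below are hypothetical, chosen only to illustrate the dot-product-plus-logit mechanics:

```python
import math

def participation_probability(coefficients, features):
    """Utility is the dot product of fitted coefficients and the
    competition's feature values; the logit link maps it to a probability."""
    utility = sum(b * x for b, x in zip(coefficients, features))
    return 1 / (1 + math.exp(-utility))

# Hypothetical intercept, plus the NumPrizes and HasKernels estimates,
# applied to a competition with 3 prize places and kernels enabled.
p = participation_probability([-2.0, 0.1552, 0.359], [1, 3, 1])
print(p)
```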

Turning to the second task of this section, we now discuss the classification model that we have developed. As was already explained in the previous part of the work, we took the pure k nearest neighbors algorithm and tried to modify it to increase its efficiency; therefore, we call it the modified knn method. The results of the modified knn are quite important not only in the context of this work but also in general, because potentially it can be used anywhere a researcher is trying to solve a classification problem.

For the purposes of this work, we have tested our modified model on the following classification task: to classify whether a contestant will be successful in a competition given his or her set of features. We decided to define the “successfulness” of a contestant by the eponymous variable “Score”. We have already explained that “Score” records the relative position that the contestant managed to reach in the competition; its values vary in the range [0, 1], and the higher the score, the better the contestant performed.

Quite subjectively, we took as the threshold of “success” a score greater than or equal to 0.75 and encoded such observations as 1 and the others as 0. The features upon which we try to determine success are the set of individual prizes in the competitions, kernels and discussion sections, as well as the numbers of followers and followings. Figure 6 demonstrates the results of our modified model as well as the pure knn model applied to this dataset: it illustrates the accuracy values that the two models managed to achieve for different values of the number of neighbors k.
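The binarization of the “Score” variable can be sketched as follows (the score values here are made up for illustration):

```python
# Encode "success" as 1 when the relative score reaches the 0.75 threshold.
scores = [0.91, 0.74, 0.75, 0.30, 0.88]
labels = [1 if s >= 0.75 else 0 for s in scores]
print(labels)  # [1, 0, 1, 0, 1]
```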

However, as we have already mentioned, we are not going to constrain ourselves to accuracy alone, as it can be quite misleading. Therefore, we decided to also compute the sensitivity and the specificity of the models and compare them with each other. To do so, we constructed two confusion matrices from the outputs of the models. Table 4 shows the results for the pure knn model.

Figure 7. The comparison of accuracies of the two models

Accordingly, we have computed that the sensitivity of the model is equal to approximately 0.48 and the specificity to almost 0.979.

Table 4. The confusion matrix of the knn

          Fail   Success
Fail      4362   495
Success   92     455

In our case, the sensitivity can be interpreted as the probability of a positive output of the model given that the contestant has actually been successful, while the specificity is the probability of a negative output given that the contestant has failed in the competition. Doing the same calculations for our modified knn method on Table 5, the sensitivity in this case was equal to 0.72, while the specificity did not change considerably, being equal to 0.973.

Table 5. The confusion matrix of the modified knn

          Fail   Success
Fail      4337   265
Success   117    685
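The sensitivity and specificity figures above can be reproduced from the two confusion matrices, assuming rows hold the model's predictions and columns the actual outcomes:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Pure knn (Table 4): 455 successes caught, 495 missed,
# 4362 failures correctly rejected, 92 false alarms.
print(sensitivity_specificity(455, 495, 4362, 92))   # ≈ (0.48, 0.979)

# Modified knn (Table 5):
print(sensitivity_specificity(685, 265, 4337, 117))  # ≈ (0.72, 0.973)
```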

Summing up, we can state that overall our modified model performed better than the simple knn, showing better accuracy as well as better sensitivity and a comparable specificity. We can also state that the objective of the paper was successfully met, because we managed to identify the factors which can contribute to attracting the most active and highest-scored users of the platform.

Conclusion

At this stage of the work, we can state that the objectives of the work have been successfully met. We managed to provide tools that contribute to the development of a set of recommendations aimed at maximizing the effectiveness of the competitions on the Kaggle platform. Moreover, we have also implemented a modified knn method with enhanced performance to make solutions to classification tasks more precise.

At the beginning of the work we identified for ourselves the following five tasks:

1) Retrieve relevant data from web resources.

2) Create a model and implement it on the gathered datasets.

3) Compare the model performance with the other benchmarking model performances.

4) Identify the core features of the competitions and their relationships with the contestant's choice.

5) Determine whether the contestants will be successful at the competition or not.

We are glad to state that all of these tasks have been accomplished. We wrote a web crawler from scratch, which gives us the opportunity to retrieve the necessary information about skillful users of the platform from the web. This was the starting point of our work, as it is the single vital tool that makes the accomplishment of this work realistic. Then we performed some manipulations with the data to make it suitable for further analysis. After all the preparations were finished, we conducted our analysis, through which we identified the relationship between the core features of the competitions and the contestants' choice of participation in them. This gave us the opportunity to understand which features can contribute to the choice of participants, and how. Based on the results of this analysis, companies can develop a new model of user attraction by enabling the features that are expected to increase the odds of participation of new skillful contestants. Moreover, we also suggested a classification method that, based on the set of features of a contestant, lets us determine whether he or she will achieve success or not.

Overall, summing up the whole research, we hope that the work will be helpful for companies that are going to launch new competitions, as they will gain the necessary understanding of which features can be neglected and which ones are necessary in order to succeed. However, we also emphasize the theoretical value of the work, as it introduces a new, modified model which can find its implementation in basically all classification tasks, especially those where researchers previously preferred to use simple knn.

However, the work also has its limitations. It would be very helpful to try the modified model on some other datasets and assess its performance. Also, the speed of the model is a primary concern; some further work must be carried out in that direction.

Another possible direction for further research would be the acquirement of a dataset that contains the features of the users at the exact moment of the start of the competition, for every single choice they made. This would allow one to get a more accurate image of the decision-making situation and to assess the probabilities of participation more accurately.

Reference list

1) Соревнование Kaggle Home Credit Default Risk -- анализ данных и простые предсказательные модели [The Kaggle Home Credit Default Risk competition: data analysis and simple predictive models]. (2018, June 19). Retrieved from https://habr.com/ru/post/414613/

2) Meta Kaggle. (2019). Retrieved from Kaggle: https://www.kaggle.com/kaggle/meta-kaggle

3) Adhikari, A., & DeNero, J. (n.d.). Computational and Inferential Thinking. Retrieved from https://www.inferentialthinking.com/chapters/intro

4) Akinsola, J. E. (2017). Supervised Machine Learning Algorithms: Classification and Comparison. International Journal of Computer Trends and Technology (IJCTT), 128-138.

5) Bafandeh, S. I., & Bolandraftar, M. (2013). Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background. Int. Journal of Engineering Research and Applications, 605-610.

6) Beyer, K. R. (1997, December 28). When Is "Nearest Neighbor" Meaningful? Retrieved from https://www.researchgate.net/publication/2845566_When_Is_Nearest_Neighbor_Meaningful

7) Bhattacharya, A. (2018, May 18). Introduction to Kaggle for Beginners in Machine Learning and Data Science! Retrieved from https://medium.com/datadriveninvestor/introduction-to-kaggle-for-beginners-in-machine-learning-and-data-science-865199d7ead2

8) Boeing, G., & Waddell, P. (2016). New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings. Journal of Planning Education and Research.

9) Chapter 1. Bootstrap Method. (n.d.). Retrieved from http://www.math.ntu.edu.tw/~hchen/teaching/LargeSample/notes/notebootstrap.pdf

10) Dobney, S., Carlos, O., & Revilla, M. (2017). More realism in conjoint analysis: the effect of textual noise and visual style. International Journal of Market Research.

11) Facebook Recruiting Competition. (n.d.). Retrieved from Kaggle: https://www.kaggle.com/c/FacebookRecruiting

12) Geek Out: Crowdsourcing Data Science With Kaggle. (2017, February 22). Retrieved from https://www.business.com/articles/crowdsourcing-data-science-with-kaggle/

13) Good, P. I. (2011). Analyzing the Large Number of Variables in Biomedical and Satellite Imagery. John Wiley & Sons.

14) Hauser, J. R. (n.d.). Note on Conjoint Analysis. Retrieved from MIT: http://www.mit.edu/~hauser/Papers/NoteonConjointAnalysis.pdf

15) Hess, S., Daly, A., & Batley, R. (2018). Revisiting consistency with random utility maximisation: theory and implications for practical work. Theory and Decision, 181-204.

16) Home Credit Default Risk. (n.d.). Retrieved from Kaggle: https://www.kaggle.com/c/home-credit-default-risk

17) Hundert, M. (2009). Advantages and disadvantages of the use of conjoint analysis in consumer preferences research. Folia Oeconomica Stetinensia, 347-357.

18) Islam, M. J.-A. (2010). Investigating the Performance of Naive- Bayes Classifiers and K- Nearest Neighbor Classifiers. Journal of Cases on Information Technology (JCIT), 133-137.

19) Kaggle. (n.d.). Retrieved from Kaggle: https://www.kaggle.com/

20) Kang, X., Zhang, G., Ou, X., Guo, L., Bing, T., & Wang, J. (2018). KNN-Based Representation of Superpixels for Hyperspectral Image Classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 4032 - 4047.

21) Kotz, S., Campbell, R. B., Balakrishnan, N., Vidakovic, B., & Johnson, N. L. (2004). Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.

22) Li, B. (2011). The multinomial logit model revisited: A semi-parametric approach in discrete choice analysis. Transportation Research Part B: Methodological, 461-473.

23) Maldonado, S., Montoya, R., & Weber, R. (2014). Advanced conjoint analysis using feature selection via support vector machines. European Journal of Operational Research.

24) Ventresca, M. J., & Mohr, J. W. (2001, May). Archival Research Methods. Retrieved from UCSB: http://www.soc.ucsb.edu/faculty/mohr/classes/soc4/summer_08/pages/Resources/Readings/Ventresca%20&%20Mohr.pdf

25) Rachinger, M., Rauter, R., Müller, C., Vorraber, W., & Schirgi, E. (2018). Digitalization and its influence on business model innovation. Journal of Manufacturing Technology Management.

26) Rao, V., & Pilli, L. (2014). Conjoint Analysis for Marketing Research in Brazil. Revista Brasileira de Marketing.

27) Singh, H. (2018, June 18). Statistics That Prove IoT will become Massive from 2018. Retrieved from http://customerthink.com/statistics-that-prove-iot-will-become-massive-from-2018/

28) Sperandei, S. (2014). Understanding logistic regression analysis. Biochemia medica.

29) Steiner, M., & Meißner, M. (2018). A User's Guide to the Galaxy of Conjoint Analysis and Compositional Preference Measurement. Marketing ZFP, 3-25.

30) Sunasra, M. (2017, November 11). Performance Metrics for Classification problems in Machine Learning. Retrieved from https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b

31) Voleti, S., Srinivasan, V., & Pulak, G. (2017). An approach to improve the predictive power of choice-based conjoint analysis. International Journal of Research in Marketing, 325-335.

32) Wu, X. (2007). Top 10 algorithms in data mining. Knowledge and Information Systems, 1-37.

Appendix 1

Table 5. Explanation of the variable name encoding in the sample

id - A unique id given by the website to every user
Trial - The number of decisions made by the user
Selection - The competition number selected by the user from all the alternatives at the given trial
Alt - The alternatives available to the user while choosing a competition
compName - The competition's name
choice - A binary analog of Selection; indicates whether the competition was chosen among the alternatives
Comp_solo - The number of competitions the user entered without forming a team
Ranking - The user's position in the leaderboard of the given competition
Comp_level - The user's skill level in the competitions section
Comp_rating - The current rank accumulated by the user in competitions
Comp_highestRank - The highest rank the user has achieved in competitions
Comp_gold - Number of gold medals in competitions
Comp_silver - Number of silver medals in competitions
Comp_bronze - Number of bronze medals in competitions
Kernel_level - The user's skill level in the kernels section
Kernel_rating - The current rank accumulated by the user in kernels
Kernel_gold - Number of gold medals in kernels
Kernel_silver - Number of silver medals in kernels
Kernel_bronze - Number of bronze medals in kernels
Disc_level - The user's skill level in the discussions section
Disc_rating - The current rank accumulated by the user in discussions
Disc_gold - Number of gold medals in discussions
Disc_silver - Number of silver medals in discussions
Disc_bronze - Number of bronze medals in discussions
Followers - Number of followers of the given user
Following - Number of users the given user follows
Subtitle - The subtitle of the competition
CompetitionTypeId - The competition's id by its type
HostName - The name of the competition's hosting company
EnableDate - The date the competition was launched
DeadlineDate - The date the competition expired
HasKernels - Binary indicator of whether kernels were allowed in the competition
OnlyAllowKernelSubmissions - Binary indicator of whether the competition accepts only kernel submissions
HasLeaderboard - Binary indicator of whether the competition keeps a leaderboard
EvaluationAlgorithmAbbreviation - The abbreviation of the competition's evaluation algorithm
EvaluationAlgorithmName - The full name of the evaluation algorithm
ValidationSetValue - The range of the allowed outputs for the competition
MaxDailySubmissions - The maximum number of daily submissions for the competition
MaxTeamSize - The maximum allowed team size for the competition
BanTeamMergers - Binary indicator of whether team mergers are banned
EnableTeamModels - Binary indicator of whether team models are allowed
RewardType - A factor variable with values "USD", "EUR", "Knowledge", "Kudos", "Swag", "None"
RewardQuantity - The amount of money the competition awards to the winners
NumPrizes - The number of leaderboard places guaranteed to win a reward
TotalCompetitors - The total number of competitors the competition attracted
TotalSubmissions - The total number of submissions the competition received
seasonEnabled - The season of the year in which the competition was launched
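The choice variable above is described as a binary analog of Selection. As an illustration of how it can be derived from the long-format choice data, here is a minimal sketch with made-up rows; the column names follow Table 5, but the values are hypothetical:

```python
import pandas as pd

# Toy long-format choice data: each row is one alternative shown to a user
# at a given trial; Selection holds the competition number the user picked.
df = pd.DataFrame({
    "id":        [1, 1, 1, 1],
    "Trial":     [1, 1, 1, 1],
    "Alt":       [101, 102, 103, 104],   # competitions offered
    "Selection": [103, 103, 103, 103],   # the one actually chosen
})

# choice is 1 for the alternative that matches the selection, 0 otherwise
df["choice"] = (df["Alt"] == df["Selection"]).astype(int)
print(df["choice"].tolist())  # [0, 0, 1, 0]
```

In this encoding every trial contributes exactly one row with choice equal to 1, which is the form discrete-choice estimation routines typically expect.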

Appendix 2

The web crawler

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json
import time
import datetime
import re
import csv

# Load the users table and keep only the active ones (performance tier > 1)
users = pd.read_csv("D:\\HDDREQ\\Desktop\\New folder\\Users.csv")
users = users[users.PerformanceTier > 1]
userNames = users.iloc[:, 1]

# Draw a reproducible random sample of 5000 users
np.random.seed(7)
idx = np.random.choice(list(userNames.index), size=5000, replace=False)
userNames = userNames[idx]

class KaggleInfo(object):
    meta = ["userId", "userName", "country", "competitionsSummary", "scriptsSummary",
            "discussionsSummary", "following", "followers"]
    ActivityInfo = ["totalResults", "rankPercentage", "rankOutOf", "rankCurrent", "rankHighest",
                    "totalGoldMedals", "totalSilverMedals", "totalBronzeMedals", "highlights"]
    highlights = ["title", "medal", "score", "scoreOutOf"]

    def __init__(self):
        self.results = []

    def getUser(self, index, data):
        time.sleep(5)  # be polite to the server between requests
        url = "https://www.kaggle.com/{0}".format(data[index])
        print("Getting user {0}...".format(data[index]))
        r = requests.get(url)
        if r.elapsed > datetime.timedelta(seconds=8):
            raise TimeoutError("The response is too slow")
        soup = BeautifulSoup(r.content, "html.parser")
        # Cut the JSON part out of the script tag that embeds the profile data
        script = soup.findAll("script")[19].text
        end = re.search("false}\\);performance", script).span()[1] - 13
        user = json.loads(script[77:end])
        return user

    def getUserInfo(self, user):
        lst = []
        for m in self.meta:
            if m in ["competitionsSummary", "scriptsSummary", "discussionsSummary"]:
                for j in self.ActivityInfo:
                    if j != "highlights":
                        val = user[m][j]
                        lst.append("None" if val is None else val)
                    else:
                        for k in range(len(user[m][j])):
                            for l in self.highlights:
                                val = user[m][j][k][l]
                                lst.append("None" if val is None else val)
            elif m in ["following", "followers"]:
                lst.append(user[m]["count"])
            else:
                lst.append(user[m])
        return lst

    def __writeCSV(self):
        with open('C:\\Users\\User\\Desktop\\data.csv', 'w', newline='') as f:
            w = csv.writer(f)
            for t in self.results:
                w.writerows(t)

    def main(self, lowerBound, upperBound):
        for i in idx[lowerBound:upperBound]:
            try:
                user = self.getUser(i, userNames)
                self.results.append([self.getUserInfo(user)])
            except TimeoutError:
                self.results.append([0])
        self.__writeCSV()
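The core trick in getUser above is pulling a JSON object out of an inline script tag with a regular expression. The same idea can be exercised offline on a synthetic snippet (the script text below is made up for illustration; in the crawler, BeautifulSoup first locates the right script tag on the real page):

```python
import json
import re

# A synthetic inline-script string imitating profile data embedded in a page
script = 'Kaggle.State.push({"userName": "demo", "followers": {"count": 3}});performance();'

# Grab the JSON object between push( and the closing );
match = re.search(r"push\((\{.*?\})\);", script)
user = json.loads(match.group(1))
print(user["userName"], user["followers"]["count"])  # demo 3
```

Parsing the extracted substring with json.loads, rather than slicing fields out by hand, keeps the crawler robust to reordered keys inside the embedded object.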

Appendix 3

The development of the new model

library(utils)

# Some parts of the code were inspired by (Adhikari & DeNero, n.d.)

distance <- function(row1, row2) {
  sqrt(sum((row1 - row2)^2))
}

all_distance <- function(data, new_point) {
  data <- data[, -which(names(data) == "class")]
  distance_from_point <- function(row) {
    distance(new_point, row)
  }
  apply(data, 1, distance_from_point)
}

table_with_distances <- function(data, new_point) {
  data$distance <- all_distance(data, new_point)
  data
}

closest <- function(data, new_point, k) {
  with_dist <- table_with_distances(data, new_point)
  with_dist <- with_dist[order(with_dist$distance), ]
  with_dist[1:k, ]
}

majority <- function(topK) {
  ones <- sum(topK$class == 1)
  zeros <- sum(topK$class == 0)
  if (ones > zeros) 1 else 0
}

# Compute the modified majority by implementing the technique described in
# the work: when the top-k vote is close to a tie, fall back to the bootstrap
majority_bts <- function(data, point, k, topK) {
  NumZeros <- count_zero(topK$class)
  if (NumZeros == round(nrow(topK) / 2) | NumZeros == round(nrow(topK) / 2) + 1) {
    make_bootsp_own(data, point, k)
  } else {
    majority(topK)
  }
}

make_bootsp_own <- function(data, point, k) {
  result <- vector("numeric")
  d <- data
  for (i in 1:100) {
    topK <- closest(d, point, k)
    idk <- sample(1:nrow(d), nrow(d), replace = TRUE)
    result[i] <- majority(topK)
    d <- d[idk, ]  # resample the data for the next bootstrap round
  }
  if (mean(result) > 0.57) 1 else 0
}

classify <- function(data, new_point, k) {
  closestK <- closest(data, new_point, k)
  majority_bts(data, new_point, k, closestK)
}

# Supplementary functions to evaluate the accuracy

count_zero <- function(vec) {
  length(vec) - sum(vec != 0)
}

count_equal <- function(vec1, vec2) {
  count_zero(vec1 - vec2)
}

evaluate_accuracy <- function(training, test, k) {
  test_attr <- test[, -which(names(test) == "class")]
  classify_testrow <- function(row) {
    classify(training, row, k)
  }
  apply(test_attr, 1, classify_testrow)
}

# Evaluate the model

tmp <- read.csv("C:\\Users\\User\\Documents\\AFinal.csv")
idk <- sample(seq(1:nrow(tmp)))
test <- tmp[idk[20001:25404], ]
training <- tmp[idk[1:20000], ]

accur_mine <- vector("numeric", 10)
for (k in 5:14) {
  print(k)
  RESULT_loop_1 <- evaluate_accuracy(training, test, k)
  accur_mine[k - 4] <- count_equal(RESULT_loop_1, test$class) / nrow(test)
}
table(RESULT_loop_1, test$class)

# Compare with the standard knn implementation
library(class)
tr <- cbind(sapply(training[, 1:15], jitter), training$class)
tst <- cbind(sapply(test[, 1:15], jitter), test$class)
tr <- as.data.frame(tr)
tst <- as.data.frame(tst)
colnames(tr)[16] <- "class"
colnames(tst)[16] <- "class"
accur <- vector("numeric", 10)
for (i in 5:15) {
  print(i)
  pr <- knn(tr, tst, tr$class, k = i)
  tab <- table(pr, tst$class) ...
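For readers who prefer Python, the tie-breaking idea behind majority_bts can be sketched as follows. This is a minimal re-implementation, not the thesis code itself, and the near-tie condition is simplified: when the k nearest neighbours split roughly in half, the training set is resampled with replacement many times, the vote is repeated, and the mean vote decides the class.

```python
import numpy as np

def knn_vote(X, y, point, k):
    """Return the majority class among the k nearest neighbours of point."""
    dist = np.sqrt(((X - point) ** 2).sum(axis=1))
    top = y[np.argsort(dist)[:k]]
    return int(top.sum() > k / 2)

def knn_bootstrap(X, y, point, k, n_boot=100, threshold=0.57, seed=0):
    """Modified kNN: near a tie, repeat the vote on bootstrap resamples."""
    dist = np.sqrt(((X - point) ** 2).sum(axis=1))
    top = y[np.argsort(dist)[:k]]
    ones = int(top.sum())
    # Clear majority: use the plain kNN vote
    if abs(ones - (k - ones)) > 1:
        return int(ones > k - ones)
    # Near-tie: bootstrap the training set and average the repeated votes
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        votes.append(knn_vote(X[idx], y[idx], point, k))
    return int(np.mean(votes) > threshold)

# Tiny synthetic check: two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_bootstrap(X, y, np.array([0.05, 0.05]), k=3))  # 0
print(knn_bootstrap(X, y, np.array([5.05, 5.05]), k=3))  # 1
```

The 0.57 acceptance threshold mirrors the constant used in make_bootsp_own above; in a borderline region the bootstrap vote thus has to be clearly in favour of class 1 before that label is assigned.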

