Maximizing the Effectiveness of Competitions on the Kaggle Platform
Development of tools aimed at maximizing the effectiveness of Kaggle competitions, and validation of a new modified kNN algorithm that allows participants of Kaggle competitions to obtain more accurate results in classification problems.
Subject: Programming, computers and cybernetics
Type: diploma thesis
Language: English
Date added: 01.12.2019
File size: 611.7 K
However, we acknowledge that the phenomenon of the platform having such a big percentage of "passive" users is itself interesting, and the search for the reasons for such passiveness can serve as material for other works.
Figure 1 illustrates the distribution of the users by their performance tiers. The scale for user performance was determined by Kaggle itself and has been kindly provided in the database on their web site; it varies from a minimum of zero to a maximum of five. As can be observed from the image, only a small share of users have high performance tiers. However, these are the users who are active on the platform, and they are of primary concern in our research, since we want to understand which factors are most attractive for the best users with the highest performance rankings. Ideally, one could argue that users whose activity score is just two cannot be considered the "best". However, since the number of users with high rankings such as four or five is quite limited, we decided to enlarge our sample and also include those users who have a ranking of two or more.
Overall, for all three million users at the moment of our analysis, Kaggle provides just 1498 competitions. For the purposes of this work, we have analyzed only the competitions of the top users. First of all, we would like to underline that out of the 318 unique competitions that we gathered in our sample, quite a few have only one contestant. Obviously, these competitions have more than a single contestant in reality; however, in the sample that we gathered, only one of our users happened to be involved in them.
Figure 2. The top 10 competitions by the frequency of occurrence
The frequencies of the competitions carry some other interesting information. For example, from Figure 2 one can see that among the top ten competitions by frequency there is one observation that, even by eye, clearly has significantly more participants than the other ones. In fact, this competition has almost twice the number of contestants of the nearest second one. It is the "UCI Math77B: collaborative Filtering" competition. Though the number of occurrences of this competition is high, that does not imply that it must have the biggest number of contestants. In fact, a high frequency of occurrence means only that the competition has been suggested as a potential choice option to a big number of users. However, if we look at this competition's information, we can see that it has only 68 total competitors, which is dramatically far from the best indicator of 8391 contestants held by the "Home Credit Default Risk" competition. The fact that "UCI Math77B: collaborative Filtering" has been suggested to over 1500 users and only 68 of them decided to participate in it speaks to the big inefficiency of this competition. Since "Home Credit Default Risk" stands out so much from the other competitions by the number of contestants, we stopped at it to get some general understanding and see whether there is something extraordinary about it that helped it attract so much attention from the community of data scientists. In fact, we found nothing particularly unusual. For example, the competition offers only a seventy-thousand-dollar reward, which is not so high in comparison with other competitions held on Kaggle, which can offer one hundred thousand or even more than a million dollars in rewards.
Moreover, the seventy-thousand-dollar reward was divided between the first three places, and the first-place winner received only thirty-five thousand dollars (Home Credit Default Risk, n.d.). In addition, it seems to be the case that the competition timeline was quite short. Therefore, we cannot state that the competition managed to attract more contestants because of a long duration. Nevertheless, while searching on the internet we found some articles on this particular competition and even some video classes and other recordings that use it (Kaggle Home Credit Default Risk Competition: Data Analysis and Simple Predictive Models, 2018). We think that these factors could potentially have contributed to the popularity of the competition.
Speaking about the timeline of the competitions, we can mention that, judging from the metadata provided by Kaggle, the top competitions by duration were hosted mostly by universities or data science organizations. In fact, five out of the ten longest competitions were held by various universities, among which the Moscow Institute of Physics and Technology (MIPT) has the biggest number of competitions: two.
Naively, before conducting the analysis we had expected to see a linear relationship between the number of participants and the length of a competition's timeline. Nevertheless, as Figure 3 illustrates, there hardly exists any relationship. It can also be noticed that there is a big density of points in the interval between 0 and 5 thousand, which basically means that a considerable number of competitions had a duration of under 5 thousand hours. Even if we zoom into this cloud of points and restrict ourselves to this subset of our sample, there is no clear relationship whatsoever between the duration of a competition and the number of users it manages to acquire.
Figure 3. The relationship between the number of contestants and the competition's timeline
On the other hand, an inverted U-shaped, left-skewed relationship can be spotted between the log-scaled amount of reward offered to the participants and the number of contestants of the competitions. In Figure 4 we have shown the relationship between the number of participants and the log of the reward amount, since on the original scale the density of the points is very high near zero and the other points look like outliers.
Figure 4. The relationship between the number of contestants and the competition's reward
Now we will look at the contestants of the competitions. Here we can make several interesting remarks too. For example, we have looked at the distribution of the number of competitions the contestants decided to participate in. In this case, there is one contestant who has joined considerably more competitions than the others: 61 competitions. However, assessing the level of success of this contestant, we can state that he does not have a very high score. In fact, the score of this contestant is quite moderate.
We have also calculated the scores of all the other contestants, based on the place they managed to reach in the competition. In this case, we took one minus the ratio of the place that the contestant got to the number of people who participated in the competition. Therefore, if the score is high, near one, that means that the contestant has been successful and reached a high place in the leaderboard table of the competition. Figure 5 illustrates the resulting scores of the contestants that we have calculated.
Figure 5. The distribution of the scores among the contestants
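The score computation described above can be sketched in a couple of lines (the function name is ours, for illustration only):

```python
def contestant_score(place, n_participants):
    """Relative performance: one minus the contestant's place divided by
    the number of participants; values near 1 mean a top finish."""
    return 1 - place / n_participants

# A contestant placed 5th out of 100 gets a score of 0.95.
print(contestant_score(5, 100))  # 0.95
```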
We also found out from our database that the majority of the users preferred to participate in the competitions on their own, without forming any teams.
To sum up the discussion of our sample, we provide the table with the frequency of occurrence of the gold, silver and bronze medals that users managed to acquire. This will serve as one of the sources for determining the level of skillfulness of the users.
Table 1 The distribution of contestants' medals

| Number of medals | Competition's Gold | Competition's Silver | Competition's Bronze | Kernel's Gold | Kernel's Silver | Kernel's Bronze | Discussion's Gold | Discussion's Silver | Discussion's Bronze |
| 0 | 24822 | 22371 | 19788 | 25166 | 25404 | 25039 | 25277 | 24991 | 20341 |
| 1 | 574 | 600 | 3688 | 238 | - | 238 | - | 284 | 2387 |
| >=2 | 8 | 2433 | 1928 | - | - | 127 | 127 | 129 | 2676 |
4.2 The implementation of the models
At this point in the work, after getting some initial sense of our sample, we are ready to conduct the analysis. First of all, we deal with the task of identifying the factors that may influence a contestant's choice of whether to participate in a competition or not. To do so, we conducted a choice-based conjoint analysis. In fact, as we have already mentioned, choice-based conjoint analysis assumes under the hood a multinomial logistic regression, which is what we have fitted for this section. Table 2 illustrates the output of the analysis. Here and below the significance codes follow the R convention: "***" for p < 0.001, "**" for p < 0.01, "*" for p < 0.05, "." for p < 0.1.
Table 2 The output of the model

| Coefficients | Estimate | Std. Error | Z value | Pr(>|z|) |
| NumPrizes | 1.552e-01 | 1.887e-02 | 8.223 | < 2e-16 *** |
| RewardTypeEUR | -1.525e+01 | 9.615e+01 | -0.159 | 0.873949 |
| RewardTypeJobs | -2.396e+00 | 1.409e-01 | -17.003 | < 2e-16 *** |
| RewardTypeKnowledge | -6.165e+00 | 1.992e-01 | -30.955 | < 2e-16 *** |
| RewardTypeKudos | -1.492e+01 | 1.613e+02 | -0.092 | 0.926305 |
| RewardTypeSwag | -2.789e+00 | 1.787e-01 | -15.606 | < 2e-16 *** |
| RewardTypeUSD | -2.027e+00 | 1.060e-01 | -19.120 | < 2e-16 *** |
| RewardQuantity | -8.282e-07 | 9.974e-08 | -8.304 | < 2e-16 *** |
| Score | 7.683e-01 | 7.009e-02 | 10.962 | < 2e-16 *** |
| dateDiff | 4.692e-05 | 7.389e-06 | 6.350 | 2.15e-10 *** |
| BanTeamMergersTrue | 5.875e-01 | 7.074e-02 | 8.305 | < 2e-16 *** |
| HasKernelsTrue | 3.590e-01 | 7.195e-02 | 4.989 | 6.06e-07 *** |
| seasonEnabledspring | 8.004e-01 | 4.851e-02 | 16.500 | < 2e-16 *** |
| seasonEnabledsummer | 3.903e-01 | 5.424e-02 | 7.195 | 6.22e-13 *** |
| seasonEnabledwinter | 1.991e-01 | 5.366e-02 | 3.711 | 0.000206 *** |
From the output it can be seen that several factors were found significant at the 0.05 significance level in our model. The coefficients of the variables are the most interesting part for us, because they make the model well interpretable. In our case the coefficients correspond to changes in the log odds, since there are no interaction terms. A negative coefficient indicates a decrease in the log odds associated with the predictor, all other terms being fixed; conversely, a positive one indicates an increase. The bigger the coefficient, that is, the further away it is from zero, the stronger the conclusions we can draw from the model. For example, a single glance at Table 2 lets us notice that the number of rewards for the competition, encoded as "NumPrizes", has a positive coefficient. According to the model, a one-unit increase in the number of places that guarantee a prize for contestants is expected to change the log odds by 0.1552. Converting this into odds, we take the exponent of 0.1552 and observe that the odds ratio is approximately 1.17. We can also easily express the odds ratio as a percentage: a ratio of 1.17 says that a unit increase in "NumPrizes" gives roughly a 17% increase in the odds of participating in the competition.
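The log-odds-to-odds conversion above is a one-liner; as a sanity check, with the "NumPrizes" estimate from Table 2:

```python
import math

beta_num_prizes = 0.1552  # "NumPrizes" estimate from Table 2 (1.552e-01)

# exp(beta) is the multiplicative change in the odds per one-unit increase.
odds_ratio = math.exp(beta_num_prizes)
pct_change = (odds_ratio - 1) * 100

print(round(odds_ratio, 2))   # 1.17
print(round(pct_change, 1))   # about a 16.8% increase in the odds
```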
The majority of our variables in the model are represented as factors, that is, they have different levels. For understanding the output of the model, it is important to realize how the model deals with this kind of variable instead of plain numeric variables. To do so, the reader should check the base (reference) level of the factor, that is, the level against which the model conducted its computations. For example, "HasKernelsTrue" is a factor variable and has a positive coefficient. In this case, the reference level with which it has been compared is "HasKernelsFalse", which, though not shown in the output of the model, is implicitly there. The positive coefficient of the "HasKernelsTrue" variable tells us that if a competition offers Kernels, its odds of attracting a contestant are higher by 43% than those of competitions that do not offer Kernels.
At first, before carrying out any analysis, according to our understanding and personal experience we thought that, while choosing competitions, users generally pay the most attention to the subject of the competition, whether it is personally interesting to them or not, and to the amount of money the competition offers to its contestants. However, as can be seen from Table 2, the amount of money, coded in the "RewardQuantity" variable, though significant in our model, has a negative coefficient. Under the interpretation of the model that we have developed, it means that a unit increase in the variable is expected to change the odds ratio by a factor of 0.99. This fact would have been quite surprising for us had we not done the exploratory analysis of the data. After conducting the exploratory data analysis and getting acquainted with the sample, we started to expect a result like this one; actually, Figure 4 had already given us this sense of the sample. However, the question of why this is the case, why users are reluctant to participate in the competitions that pay high rewards, remains somewhat mysterious for us, though we have some thoughts about it. As can be seen from the model output in Table 2, the other variable, "NumPrizes", which represents the number of places that guarantee that contestants will earn rewards, has a positive coefficient, indicating that the more places guarantee prizes, the more eager people are to participate in the competitions. This can explain the way users behave on the platform: we think that, as rational actors, people are seeking an opportunity to maximize their profit.
However, as frequently happens in society, there are not so many people who are ready to invest their money and effort into very risky projects that can in an instant make them either very happy, in our case by winning the competition reward, or very sad by failing to reach the leaderboard positions that guarantee winning. Therefore, people are more willing to be part of a competition where their chances of winning are higher than of one that offers more money but at the same time is supposed to be more complicated to accomplish.
We also had a look at the characteristics of the users themselves and the way these can be connected with the final choice the users make while choosing a competition. Table 3 shows the results of the analysis.
Table 3 The output of the model

| Coefficients | Estimate | Std. Error | Z value | Pr(>|z|) |
| Comp_rating | -5.175e-05 | 4.339e-05 | -1.193 | 0.23294 |
| Comp_highestRank | 4.416e-04 | 1.104e-04 | 4.022 | 6.29e-05 *** |
| Comp_gold | -1.180e+00 | 9.750e-02 | -12.103 | < 2e-16 *** |
| Comp_silver | -3.977e-01 | 4.554e-02 | -8.733 | < 2e-16 *** |
| Comp_bronze | -1.939e-01 | 2.324e-02 | -8.340 | < 2e-16 *** |
| Kernel_gold | -3.089e-02 | 2.047e-01 | -0.151 | 0.88005 |
| Kernel_bronze | -3.254e-01 | 1.079e-01 | -3.014 | 0.00258 ** |
| Disc_gold | NA | NA | NA | NA |
| Disc_silver | -3.406e-01 | 1.278e-01 | -2.665 | 0.00770 ** |
| Disc_bronze | 8.222e-02 | 1.543e-02 | 5.330 | 9.83e-08 *** |
Interpreting the results, we can highlight some interesting things. First of all, it can be seen that one of the coefficients is not defined because of singularities. The variable named "Disc_gold", which encodes the number of gold discussion medals that the user managed to get, has all its coefficients reported as "NA", not available. The reason is that there is an almost perfect linear relationship between this variable and some other one in our model. And this is indeed the case: if we look at Table 1, we can see that there are only three distinct values for this variable, with the majority of them being zero, which is very similar to the variable named "Kernel_bronze". Checking the plot of these variables, we saw that there is indeed a linear relationship. Moreover, if we compute the correlation between these two variables, we will see that the coefficient is equal to 0.95, which indicates a strong linear relationship. Another thing that can be highlighted is that every new unit of gold, silver or bronze medal points in the competitions lessens the odds of participation. We think that this can be partially explained by the big sparsity of the matrix that we have in our sample. This can once again be seen from Table 1, which illustrates that most of the values of almost all these variables are zeros, and we think that this can influence the output of the model.
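The collinearity check mentioned above is just a Pearson correlation between the two medal-count columns. A minimal sketch, with a toy pair of nearly collinear vectors standing in for the real "Disc_gold" and "Kernel_bronze" columns:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy counts only: two almost-collinear medal-count vectors give r close to 1.
print(round(pearson_r([0, 0, 1, 2, 5], [0, 1, 1, 2, 5]), 2))  # 0.98
```

A coefficient this close to 1 is exactly what makes the design matrix singular and produces the "NA" rows in the regression output.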
As a logical continuation of the choice-based conjoint analysis, we have developed a method that allows us to determine the probability of a contestant's participation given the set of characteristics of the competition. Moreover, the method can easily be expanded to include in the probability computation not only the features of the competitions but also the features of the users. This gives us an opportunity to assess the effects of user-specific variables on the outcome for varying levels of competition-specific variables.
Thus, the method works as follows: given the output of the logistic regression and new input data of the characteristics, we multiply the coefficients of the regression by the variables and obtain the utility formed by the characteristics. Finally, using the logit formula, one can easily calculate the probability of participation given the set of features. This can be done for any set of features as long as they are consistent with the model.
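The two steps just described, a dot product followed by the logit link, can be sketched as follows. The coefficient and feature values below are purely illustrative, not taken from the fitted model:

```python
import math

def participation_probability(coefficients, features):
    """Utility = dot product of regression coefficients and the feature
    vector; the logistic (logit) link maps the utility to a probability."""
    utility = sum(b * x for b, x in zip(coefficients, features))
    return 1 / (1 + math.exp(-utility))

# Hypothetical values for illustration only:
coefs = [0.1552, 0.7683, 0.3590]  # e.g. NumPrizes, Score, HasKernelsTrue
feats = [3, 0.9, 1]               # 3 prize places, score 0.9, kernels enabled
print(round(participation_probability(coefs, feats), 3))
```

A zero utility maps to a probability of exactly 0.5, which is a convenient sanity check for the implementation.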
Turning to the second task of this section, we will now discuss the classification model that we have developed. As was already explained in the previous part of the work, we took the pure k-nearest-neighbors algorithm and tried to modify it to increase its efficiency; therefore, we call it a modified kNN method. The results of the modified kNN are quite important not only in the context of this work but also in general, because it can potentially be used wherever a researcher is trying to solve a classification problem.
For the purposes of this work, we have tested our modified model on the following classification task: we tried to classify whether a contestant will be successful in a competition given his or her set of features. We decided to define the "successfulness" of the contestant by his score, using the eponymous variable "Score". We have already explained that the variable "Score" records the relative position that the contestant managed to reach in the competition. The values of "Score" vary in the range [0, 1]; the higher the score, the better the contestant performed in the competition.
Quite subjectively, we took as the threshold value of "success" a score greater than or equal to 0.75 and encoded these observations as 1 and the others as 0. The features upon which we try to determine success are the set of individual prizes in the competition, kernel and discussion sections, as well as the number of followers and the number of followings. Figure 6 demonstrates the results of our modified model as well as the pure kNN model applied to this dataset: it illustrates the accuracy values that the two models managed to achieve for different values of the number of neighbors, k.
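As a reference point, the pure kNN baseline against which the modified method is compared can be sketched in a few lines (this is the standard algorithm, not our modification; names and the toy data are ours):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Plain k-nearest-neighbors: majority vote among the k training
    points closest to x in Euclidean distance."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Tiny illustration: features could be (followers, gold medals), labels 0/1.
X = [(0, 0), (1, 0), (10, 4), (12, 5)]
y = [0, 0, 1, 1]
print(knn_predict(X, y, (11, 4), k=3))  # 1, the majority of the 3 nearest
```

In practice the features should be scaled before computing distances, since kNN is sensitive to the units of each feature.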
However, as we have already mentioned, we are not going to constrain ourselves to accuracy alone, as it can be quite misleading. Therefore, we decided to also compute the sensitivity and the specificity of the models and compare them with each other. To do so, we constructed two confusion matrices from the outputs of the models. Table 4 shows the results for the pure kNN model.
Figure 6. The comparison of the accuracies of the two models
Accordingly, we computed the sensitivity of the model to be approximately 0.48 and the specificity to be almost 0.979.
Table 4 The confusion matrix of the kNN (rows: predicted; columns: actual)

| | Fail | Success |
| Fail | 4362 | 495 |
| Success | 92 | 455 |
The sensitivity can be interpreted in our case as the probability of a positive output of the model given that the contestant has been successful. Meanwhile, the specificity is the probability of a negative output of the model given that the contestant has failed in the competition. Doing the same calculations on Table 5, we get the results of our modified kNN method: the sensitivity in this case equals 0.72, while the specificity did not change considerably, being equal to 0.973.
Table 5 The confusion matrix of the modified kNN (rows: predicted; columns: actual)

| | Fail | Success |
| Fail | 4337 | 265 |
| Success | 117 | 685 |
Summing up, we can state that overall our modified model performed better than the simple kNN, showing better accuracy as well as better sensitivity and a comparable specificity. We can also state that the objective of the paper was successfully met, because we managed to identify the factors that can contribute to the attraction of the most active and highest-scored users of the platform.
Conclusion
At this stage of the work, we can state that the objectives of the work have been successfully met. We managed to provide tools to contribute to the development of a set of recommendations aimed at the maximization of the effectiveness of the competitions at the Kaggle platform. Moreover, we have also implemented a modified knn method with enhanced performance to make the solutions to classification tasks more precise.
At the beginning of the work we have identified for ourselves the following five tasks:
1) Retrieve relevant data from web resources.
2) Create a model and implement it on the datasets.
3) Compare the model performance with the other benchmarking model performances.
4) Identify the core features of the competitions and their relationships with the contestant's choice.
5) Determine whether the contestants will be successful at the competition or not.
We are glad to state that all of these tasks have been accomplished. We wrote a web crawler from scratch, which gives us an opportunity to retrieve from the web the necessary information about the skillful users of the platform. This was the starting point of our work, as it is the single vital tool that makes the accomplishment of this work realistic. Then we performed some manipulations with the data to make it suitable for further analysis. After all the preparations were finished, we conducted our analysis, through which we identified the relationship between the core features of the competitions and the contestants' choice to participate in them. This gave us an opportunity to understand which features can contribute to the choice of participants and how. Based on the results of this analysis, companies can develop a new model of user attraction by enabling the features that are expected to increase the odds of participation of new skillful contestants. Moreover, we also suggested a classification method that, based on the set of features of a contestant, lets us determine whether he or she will achieve success or not.
Overall, summing up the whole research, we hope that the work will be helpful for companies that are going to launch a new competition, as they will get the necessary understanding of which features can be neglected and which ones are necessary in order to succeed. However, we also emphasize the theoretical value of the work, as it introduces a new, modified model, which can find its implementation in basically all classification tasks, especially those where researchers previously preferred to use simple kNN.
However, the work also has its limitations. It would be very helpful to try the modified model on some other datasets and assess its performance. Also, the speed of the model is a primary concern; some further work must be carried out in that direction.
Another possible direction of further research would be the acquisition of a dataset that contains the features of the users right at the moment of the start of the competition, for every single choice that they made. This would allow us to get a more accurate picture of the decision-making situation and to assess the probabilities of participation more accurately.
Reference list
1) Kaggle Home Credit Default Risk Competition: Data Analysis and Simple Predictive Models. (2018, June 19). Retrieved from https://habr.com/ru/post/414613/
2) Meta Kaggle. (2019). Retrieved from Kaggle: https://www.kaggle.com/kaggle/meta-kaggle
3) Adhikari , A., & DeNero, J. (n.d.). Computational and Inferential Thinking. Retrieved from https://www.inferentialthinking.com/chapters/intro
4) Akinsola, J. E. (2017). Supervised Machine Learning Algorithms: Classification and Comparison. International Journal of Computer Trends and Technology (IJCTT), 128-138.
5) Bafandeh , S. I., & Bolandraftar, M. (2013). Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background. Int. Journal of Engineering Research and Applications, 605-610.
6) Beyer, K. R. (1997, December 28). When Is "Nearest Neighbor" Meaningful? Retrieved from https://www.researchgate.net/publication/2845566_When_Is_Nearest_Neighbor_Meaningful
7) Bhattacharya, A. (2018, May 18). Introduction to Kaggle for Beginners in Machine Learning and Data Science! Retrieved from https://medium.com/datadriveninvestor/introduction-to-kaggle-for-beginners-in-machine-learning-and-data-science-865199d7ead2
8) Boeing, G., & Waddell, P. (2016). New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings. Journal of Planning Education and Research.
9) Chapter 1. Bootstrap Method. (n.d.). Retrieved from http://www.math.ntu.edu.tw/~hchen/teaching/LargeSample/notes/notebootstrap.pdf
10) Dobney, S., Carlos, O., & Revilla, M. (2017). More realism in conjoint analysis: the effect of textual noise and visual style. International Journal of Market Research.
11) Facebook Recruiting Competition. (n.d.). Retrieved from Kaggle: https://www.kaggle.com/c/FacebookRecruiting
12) Geek Out: Crowdsourcing Data Science With Kaggle. (2017, February 22). Retrieved from https://www.business.com/articles/crowdsourcing-data-science-with-kaggle/
13) Good, P. I. (2011). Analyzing the Large Number of Variables in Biomedical and Satellite Imagery. John Wiley & Sons.
14) Hauser, J. R. (n.d.). Note on Conjoint Analysis. Retrieved from MIT: http://www.mit.edu/~hauser/Papers/NoteonConjointAnalysis.pdf
15) Hess, S., Daly, A., & Batley, R. (2018). Revisiting consistency with random utility maximisation: theory and implications for practical work. Theory and Decision, 181-204.
16) Home Credit Default Risk. (n.d.). Retrieved from Kaggle: https://www.kaggle.com/c/home-credit-default-risk
17) Hundert, M. (2009). ADVANTAGES AND DISADVANTAGES OF THE USE OF CONJOINT ANALYSIS IN CONSUMER PREFERENCES RESEARCH. Folia Oeconomica Stetinensia, 347-357.
18) Islam, M. J.-A. (2010). Investigating the Performance of Naive- Bayes Classifiers and K- Nearest Neighbor Classifiers. Journal of Cases on Information Technology (JCIT), 133-137.
19) Kaggle. (n.d.). Retrieved from Kaggle: https://www.kaggle.com/
20) Kang, X., Zhang, G., Ou, X., Guo, L., Bing, T., & Wang, J. (2018). KNN-Based Representation of Superpixels for Hyperspectral Image Classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 4032 - 4047.
21) Kotz , S., Campbell , R. B., Balakrishnan , N., Vidakovic , B., & Johnson, N. L. (2004). Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.
22) Li, B. (2011). The multinomial logit model revisited: A semi-parametric approach in discrete choice analysis. Transportation Research Part B: Methodological, 461-473.
23) Maldonado, S., Montoya, R., & Weber, R. (2014). Advanced conjoint analysis using feature selection via support vector machines. European Journal of Operational Research.
24) Mark J. Ventresca, J. W. (2001, May). Archival Research Methods. Retrieved from ucsb: http://www.soc.ucsb.edu/faculty/mohr/classes/soc4/summer_08/pages/Resources/Readings/Ventresca%20&%20Mohr.pdf
25) Rachinger, M., Rauter, R., Müller, C., Vorraber, W., & Schirgi, E. (2018). Digitalization and its influence on business model innovation. Journal of Manufacturing Technology Management.
26) Rao, V., & Pilli, L. (2014). Conjoint Analysis for Marketing Research in Brazil. Revista Brasileira de Marketing.
27) Singh, H. (2018, June 18). Statistics That Prove IoT will become Massive from 2018. Retrieved from http://customerthink.com/statistics-that-prove-iot-will-become-massive-from-2018/
28) Sperandei, S. (2014). Understanding logistic regression analysis. Biochemia medica.
29) Steiner, M., & Meißner, M. (2018). A User's Guide to the Galaxy of Conjoint Analysis and Compositional Preference Measurement. Marketing ZFP, 3-25.
30) Sunasra, M. (2017, November 11). Performance Metrics for Classification problems in Machine Learning. Retrieved from https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b
31) Voleti, S., Srinivasan, V., & Pulak, G. (2017). An approach to improve the predictive power of choice-based conjoint analysis. International Journal of Research in Marketing, 325-335.
32) Wu, X. (2007). Top 10 algorithms in data mining. Knowledge and Information Systems, 1-37.
Appendix 1
Table 5 The explanation of the variables' names encoding in the sample

Encoding of the variables | Explanation
id | A unique id given by the website to every user
Trial | The number of decisions made by the user
Selection | The competition number selected by the user from all the alternatives at the given trial
Alt | The alternatives available to the user while choosing a competition
compName | The competition's name
choice | A binary analogue of Selection: indicates whether the competition was chosen among the alternatives or not
Comp_solo | The number of competitions the user entered without forming a team
Ranking | The user's position in the leaderboard of the given competition
Comp_level | The user's skill level in the competitions section
Comp_rating | The current rank accumulated by the user in competitions
Comp_highestRank | The highest ranking the user has achieved in competitions
Comp_gold | Number of gold medals in competitions
Comp_silver | Number of silver medals in competitions
Comp_bronze | Number of bronze medals in competitions
Kernel_level | The user's skill level in the kernels section
Kernel_rating | The current rank accumulated by the user in kernels
Kernel_gold | Number of gold medals in kernels
Kernel_silver | Number of silver medals in kernels
Kernel_bronze | Number of bronze medals in kernels
Disc_level | The user's skill level in the discussions section
Disc_rating | The current rank accumulated by the user in discussions
Disc_gold | Number of gold medals in discussions
Disc_silver | Number of silver medals in discussions
Disc_bronze | Number of bronze medals in discussions
Followers | Number of followers of the given user
Following | Number of users the given user follows
Subtitle | The subtitle of the competition
CompetitionTypeId | The competition's id by its type
HostName | The name of the company hosting the competition
EnableDate | The date the competition was launched
DeadlineDate | The date the competition expired
HasKernels | Binary indicator of whether kernels were allowed in the competition
OnlyAllowKernelSubmissions | Binary indicator of whether the competition accepts only kernel submissions
HasLeaderboard | Binary indicator of whether the competition keeps a leaderboard
EvaluationAlgorithmAbbreviation | The abbreviation of the competition's evaluation algorithm
EvaluationAlgorithmName | The full name of the evaluation algorithm
ValidationSetValue | The range of allowed outputs for the competition
MaxDailySubmissions | The maximum number of daily submissions for the competition
MaxTeamSize | The maximum allowed team size for the competition
BanTeamMergers | Binary indicator of whether team mergers are banned
EnableTeamModels | Binary indicator of whether team models are allowed
RewardType | A factor variable with values "USD", "EUR", "Knowledge", "Kudos", "Swag", "None"
RewardQuantity | The amount of money the competition gives as a reward to the winners
NumPrizes | The number of leaderboard places guaranteed to win a reward
TotalCompetitors | The total number of competitors the competition managed to acquire
TotalSubmissions | The total number of submissions the competition managed to get
seasonEnabled | The season of the year in which the competition was launched
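The relation between `Selection`, `Alt` and the binary `choice` variable can be illustrated with a minimal Python sketch; the function name and the toy values below are ours and do not appear in the original pipeline:

```python
def choice_indicator(selection, alternatives):
    """Return a 0/1 vector marking which of the alternative competitions
    was actually selected by the user at a given trial."""
    return [1 if alt == selection else 0 for alt in alternatives]

# A user facing four alternative competitions picks competition number 12:
flags = choice_indicator(12, [3, 12, 7, 45])
print(flags)  # [0, 1, 0, 0]
```

Exactly one entry of the vector is 1 for every trial in which a choice was made, which is what makes `choice` usable as the dependent variable of a discrete-choice model.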
Appendix 2
The web crawler
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json
import time
import datetime
import re
import csv


users = pd.read_csv("D:\\HDDREQ\\Desktop\\New folder\\Users.csv")

# Keep only the active users (performance tier above 1)
users = users[users.PerformanceTier > 1]
userNames = users.iloc[:, 1]

# Draw a reproducible random sample of 5000 user names
np.random.seed(7)
idx = np.random.choice(list(userNames.index), size=5000, replace=False)
userNames = userNames[idx]


class KaggleInfo(object):
    meta = ["userId", "userName", "country", "competitionsSummary", "scriptsSummary",
            "discussionsSummary", "following", "followers"]
    ActivityInfo = ["totalResults", "rankPercentage", "rankOutOf", "rankCurrent",
                    "rankHighest", "totalGoldMedals", "totalSilverMedals",
                    "totalBronzeMedals", "highlights"]
    highlights = ["title", "medal", "score", "scoreOutOf"]

    def __init__(self):
        self.results = []

    def getUser(self, index, data):
        # Be polite to the server and fail fast on slow responses
        time.sleep(5)
        url = "https://www.kaggle.com/{0}".format(data[index])
        print("Getting user {0}...".format(data[index]))
        r = requests.get(url)
        if r.elapsed > datetime.timedelta(seconds=8):
            raise TimeoutError("The response is too slow")
        soup = BeautifulSoup(r.content, "html.parser")
        # The profile data sits as a JSON object inside the 20th inline script tag
        script = soup.findAll("script")[19].text
        end = re.search("false}\\);performance", script).span()[1] - 13
        user = json.loads(script[77:end])
        return user

    def getUserInfo(self, user):
        lst = []
        for m in self.meta:
            if m in ["competitionsSummary", "scriptsSummary", "discussionsSummary"]:
                for j in self.ActivityInfo:
                    if j != "highlights":
                        val = user[m][j]
                        lst.append("None" if val is None else val)
                    else:
                        for entry in user[m][j]:
                            for l in self.highlights:
                                val = entry[l]
                                lst.append("None" if val is None else val)
            elif m in ["following", "followers"]:
                lst.append(user[m]["count"])
            else:
                lst.append(user[m])
        return lst

    def __writeCSV(self):
        with open('C:\\Users\\User\\Desktop\\data.csv', 'w', newline='') as f:
            w = csv.writer(f)
            for t in self.results:
                w.writerows(t)

    def main(self, lowerBound, upperBound):
        for i in idx[lowerBound:upperBound]:
            try:
                user = self.getUser(i, userNames)
                self.results.append([self.getUserInfo(user)])
            except TimeoutError:
                self.results.append([0])
        self.__writeCSV()
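The slicing indices in `getUser` (script tag 19, character offset 77) are specific to the page layout at the time of crawling; the underlying idea — locating a JSON payload embedded in an inline script and cutting it out with a regular expression — can be reproduced on a synthetic example. The HTML-like string below is fabricated purely for illustration:

```python
import json
import re

# A fabricated inline script in the style of the profile pages the crawler parses
script_text = ('Kaggle.State.push({"userName": "demo", '
               '"followers": {"count": 3}, "canEdit": false});performance.now();')

# Cut out the JSON object: start right after "push(" and stop just before ");performance"
start = script_text.index("push(") + len("push(")
end = re.search(r"\);performance", script_text).start()
user = json.loads(script_text[start:end])
print(user["followers"]["count"])  # 3
```

Anchoring the cut on `);performance` rather than on a fixed offset makes the extraction tolerant to small changes in the payload length, which is why the crawler searches for that pattern instead of hard-coding the end index.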
Appendix 3
The development of the new model
library(utils)

# Some parts of the code were inspired by (Adhikari & DeNero, n.d.)

# Euclidean distance between two rows
distance <- function(row1, row2){
  return( sqrt( sum( (row1 - row2)^2 ) ) )
}

# Distances from every row of the data to a new point
all_distance <- function(data, new_point){
  data = data[, -which(names(data) == 'class')]
  distance_from_point <- function(row){
    return( distance(new_point, row) )
  }
  return( apply(data, 1, distance_from_point) )
}

table_with_distances <- function(data, new_point){
  data$distance = all_distance(data, new_point)
  return(data)
}

# The k nearest neighbours of the new point
closest <- function(data, new_point, k){
  with_dist = table_with_distances(data, new_point)
  with_dist = with_dist[order(with_dist$distance), ]
  return(with_dist[1:k, ])
}

# Plain majority vote among the k nearest neighbours
majority <- function(topK){
  ones = sum(topK$class == 1)
  zeros = sum(topK$class == 0)
  if(ones > zeros){
    1
  }else{
    0
  }
}

# Compute the modified majority by implementing the technique described in the work:
# if the neighbourhood is (almost) evenly split, resolve the tie with a bootstrap vote
majority_bts <- function(data, point, k, topK){
  NumZeros = count_zero(topK$class)
  if(NumZeros == round(nrow(topK)/2) | NumZeros == round(nrow(topK)/2) + 1){
    make_bootsp_own(data, point, k)
  }else{
    majority(topK)
  }
}

# Repeat the vote on 100 iterative bootstrap resamples of the training data
make_bootsp_own <- function(data, point, k){
  result <- vector("numeric")
  d <- data
  for (i in 1:100) {
    topK <- closest(d, point, k)
    idk <- sample(1:nrow(d), nrow(d), replace = TRUE)
    result[i] <- majority(topK)
    d <- d[idk, ]
  }
  if (mean(result) > 0.57) {
    return(1)
  } else {
    return(0)
  }
}

classify <- function(data, new_point, k){
  closestK = closest(data, new_point, k)
  return( majority_bts(data, new_point, k, closestK) )
}

# Supplementary functions to evaluate the accuracy

count_zero <- function(vec){
  res = length(vec) - sum(vec != 0)
  return(res)
}

count_equal <- function(vec1, vec2){
  return( count_zero(vec1 - vec2) )
}

evaluate_accuracy <- function(training, test, k){
  test_attr = test[, -which(names(test) == 'class')]
  classify_testrow <- function(row){
    return( classify(training, row, k) )
  }
  res = apply(test_attr, 1, classify_testrow)
  return(res)
}
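The same modified-kNN idea — fall back to a bootstrap vote whenever the k nearest neighbours are split (almost) evenly — can also be sketched in Python. This is a minimal illustration, not the thesis implementation: the toy data, function names, and the `seed` parameter are ours, while the 100 resamples, the tie condition, and the 0.57 threshold follow the R code above:

```python
import math
import random

def closest(data, point, k):
    """Rows of data are ((features), class); return the k nearest by Euclidean distance."""
    return sorted(data, key=lambda row: math.dist(row[0], point))[:k]

def majority(top_k):
    """Plain majority vote over the neighbours' class labels."""
    ones = sum(cls for _, cls in top_k)
    return 1 if ones > len(top_k) - ones else 0

def classify(data, point, k, trials=100, threshold=0.57, seed=7):
    """Majority vote, unless the neighbourhood is (almost) evenly split;
    then re-vote on iterative bootstrap resamples and average the votes."""
    top_k = closest(data, point, k)
    zeros = sum(1 for _, cls in top_k if cls == 0)
    if zeros not in (round(k / 2), round(k / 2) + 1):
        return majority(top_k)
    rng = random.Random(seed)
    votes = []
    d = list(data)
    for _ in range(trials):
        votes.append(majority(closest(d, point, k)))
        d = [rng.choice(d) for _ in range(len(d))]  # bootstrap resample
    return 1 if sum(votes) / trials > threshold else 0

# Toy training set: ((features), class)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
        ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
print(classify(data, (5.2, 5.2), 3))  # 1 -- unanimous class-1 neighbourhood, no tie-break
```

On a clear-cut neighbourhood the classifier behaves exactly like standard kNN; only near-ties trigger the more expensive bootstrap step, which keeps the extra cost confined to the ambiguous points.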
# Evaluate the model

tmp <- read.csv("C:\\Users\\User\\Documents\\AFinal.csv")
idk <- sample(1:nrow(tmp))
test <- tmp[idk[20001:25404], ]
training <- tmp[idk[1:20000], ]

# Accuracy of the modified kNN for k = 5..14
accur_mine = vector('numeric', 10)
for (k in 5:14) {
  print(k)
  RESULT_loop_1 = evaluate_accuracy(training, test, k)
  res = count_equal(RESULT_loop_1, test$class) / nrow(test)
  accur_mine[k - 4] = res
}
table(RESULT_loop_1, test$class)

# Benchmark against the standard kNN from the class package;
# jitter breaks ties between identical points
library(class)
tr <- cbind(sapply(training[, 1:15], jitter), training$class)
tst <- cbind(sapply(test[, 1:15], jitter), test$class)
tr <- as.data.frame(tr)
tst <- as.data.frame(tst)
colnames(tr)[16] <- "class"
colnames(tst)[16] <- "class"
accur = vector('numeric', 11)
for (i in 5:15) {
  print(i)
  pr <- knn(tr, tst, tr$class, k = i)
  tab <- table(pr, tst$class) ...
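The accuracy stored in `accur_mine` is simply the share of predictions that match the true labels, which the confusion table produced by `table(...)` summarises per class. The relationship can be shown with a small Python example; the counts below are invented for illustration:

```python
# A hypothetical 2x2 confusion table: keys are (predicted, actual) pairs
confusion = {(0, 0): 50, (0, 1): 10,   # predicted 0: 50 correct, 10 wrong
             (1, 0): 5,  (1, 1): 35}   # predicted 1: 5 wrong, 35 correct

correct = confusion[(0, 0)] + confusion[(1, 1)]  # diagonal of the table
total = sum(confusion.values())
accuracy = correct / total
print(accuracy)  # 0.85
```

Reading the off-diagonal cells separately also reveals whether the classifier errs mostly on one class, which a single accuracy number hides.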