
FEDERAL STATE EDUCATIONAL INSTITUTION OF HIGHER EDUCATION

NATIONAL RESEARCH UNIVERSITY

HIGHER SCHOOL OF ECONOMICS

Saint Petersburg School of Economics and Management

Department of Management

Maximizing the Effectiveness of Competitions at Kaggle Platform

Iskandaryan Sargis

In the field 38.03.02 Management

Educational programme 'Management'

Reviewer

Position, degree

Initials Last name

Supervisor

PhD, Associate Professor, Department of Management

E.A. Antipov

Saint Petersburg

2019

Abstract

This paper explores the factors that affect the interaction between Kaggle's users and the competitions posted on the Kaggle website. The paper's objective is to develop tools aimed at maximizing the effectiveness of competitions on Kaggle, where effectiveness is measured by the number of high-ranked users a competition manages to attract. To achieve this objective, the paper uses such widely adopted methods as choice-based conjoint analysis and k nearest neighbors (knn). In the course of the research, the paper also develops a new modified knn algorithm that allows researchers to obtain more accurate results for classification tasks. Apart from the theoretical interest connected with the introduction of a new classification method, the work is also of practical interest to companies that seek to increase their effectiveness on the platform. To that end, the paper collected an extensive dataset about Kaggle's competitions and their participants. After the research is carried out on this dataset, the paper outlines the factors that are most influential in attracting users from the standpoint of competition hosts.

Keywords: Kaggle, conjoint analysis, k nearest neighbors, modified knn.


Table of contents

Introduction

1. Theoretical foundation

1.1 Kaggle benefits for the companies

1.2 Acquaintance with the techniques used in the work

2. Statement of the research question

2.1 The objective and the tasks

3.Research methods

3.1 Getting the data

3.2 The development of the methods

4. Description of the results

4.1 Exploratory data analysis

4.2 The implementation of the models

Conclusion

Reference list

Appendix 1

Appendix 2

Appendix 3

Introduction

The paper is an attempt to develop tools that will allow companies to maximize the effectiveness of competitions on the popular data science platform Kaggle. Since its foundation, Kaggle has been known as a place to solve various data science and machine learning problems. The range of features that the platform offers its users is quite impressive. A user can take part in competitions hosted by organizations interested in modern machine learning and data science solutions to their problems. Apart from that, users can join the discussions section, where they can share ideas about their interpretation of a problem and receive appropriate feedback. Moreover, there is a section called Kernels, which is fully dedicated to making the user's interaction with the platform more accessible. This section serves as an online script editor, so data scientists are not obliged to install programming languages such as R or Python on their local machines; instead, they can write and discuss code right on the platform (Bhattacharya, 2018). These are the features that made Kaggle so popular among the community of data scientists.

However, over the years the platform has become a handy tool for organizations too. There are numerous companies and even governmental agencies that actively participate in the platform by running various competitions. The reasons why these companies decided to become part of the Kaggle community vary from organization to organization. Without much thought, anyone familiar with the nature of the platform can outline the main gain for competition hosts: companies receive solutions to the problems they pose to the community. Afterward, they can integrate these solutions into their daily tasks or even use them in the development of a new product. From the hosts' perspective, gathering the best solutions to their tasks can be extremely useful because it helps them get the best out of the platform. Nevertheless, the possibilities of the platform for companies are not limited to the provision of solutions from the community. There are other fascinating options too: for instance, companies can use the platform to search for and hire new employees. All of this, in combination with the fact that no similar research has been done on the topic of maximizing the effectiveness of competitions, justifies the relevance of our work. The results of the paper can be used by companies to optimize their expenses on launching competitions, since they will be able to concentrate on the factors that attract the most active users when launching a competition and obtain the "best" results for their problems. Moreover, there are numerous factors that can affect the quality of the participants a company can gather. For example, the amount of the monetary reward given to the people who reach the first to third positions, or the level of difficulty of the tasks the company proposes. Even such a factor as the company's brand can affect the perception of data scientists and influence their decision to participate in the event.

Therefore, it is crucial to understand how all these, and many other available features, can influence the popularity and success of a competition. This is the problem that the paper is trying to tackle. To do so, we gathered the platform's metadata collected from the Kaggle website, including users' characteristics and the features of the competitions themselves.

Based on the above-mentioned information, the paper sets the following goal:

The development of the methods that will help companies to maximize the effectiveness of the competitions.

Due to the specificity of the goal and the nature of the research, the work has features of both explanatory and exploratory studies, since the paper attempts to identify the relationships between various features of the competitions and the participants' level of involvement, their quality, and knowledge. At the same time, the paper will attempt to develop a method for determining whether, given a user's set of features, he or she will manage to succeed in a competition or not.

To accomplish the goal the paper has highlighted, an extensive study must be conducted. The research will have several steps that can be grouped into five separate tasks. The essence of the first task is the retrieval of the data. In our case, we will get the data from the web. Hence, we wrote a program that is able to scrape all the necessary information and provide it to us. Since, as already mentioned, the research will be done on datasets acquired from the Kaggle website, it has been decided to use the archival research type. Archival research refers to analysis applied to texts and documents, including electronic databases generated by organizations, which is exactly our case (Mark J. Ventresca, 2001). In our case, the main database of the research can be divided into two parts. The first consists of the data that has been collected by Kaggle and rests on their servers. The second part will be retrieved from secondary sources, meaning we will use our web scrapers to extract information about the users of Kaggle, including the number of competitions they participated in, their titles, the number of kernels, etc.

The second task is somewhat more ambitious because it assumes the creation of the model that will be used in the analysis to classify the users within competitions. Though the model will not be invented from scratch, it will be a new modification of the well-known k nearest neighbors (knn) model. My colleague and I already have successful experience in modifying this method; however, whereas our earlier modification was made specifically to solve regression-type problems, now the paper will attempt to change the task completely and solve a classification problem. The main difficulty of the task is that it demands a deep understanding of the underlying techniques of the model and a realization of its limitations. We can state that only once the model has been created will the research start.

The third task of the paper is the identification of the best technique that can be used for the solution of this specific problem on our specialized dataset. For these purposes, the modified and unmodified knn models will be used in our research. All of them will be aimed at the same, single goal, and after thorough research the results will be compared. The best method will be summarized and suggested as the most "effective" way to meet the goal of the paper. This model will be used in the fifth task to identify whether a contestant will succeed or not. Finally, the main objective of the fourth task will be the determination of the core characteristics of the competitions. This will be the vital, last step of the paper to complete the realization of the general goal of the research. Overall, the research will tend to answer such questions as what influences the effectiveness of competitions, how much, and how.

1. Theoretical foundation

1.1 Kaggle benefits for the companies

The 21st century is a period of rapid changes. The biggest companies are moving towards digitalization by either creating digital products or trying to be actively present in the digital world. The main reason for digital transformation is the exploration of ongoing digital opportunities, which offer possibilities for the development of new products and services (Rachinger, Rauter, Müller, Vorraber, & Schirgi, 2018). From our point of view, it is also a way to keep up with changing societal preferences and stay in touch with customers. This opinion is supported by the fact that already in 2019 almost 27 billion devices will be used by people all over the world (Singh, 2018). Hence, to be successful, companies, and especially innovative ones, don't hesitate to use various digital channels of communication with customers, starting with internet advertisements on Google search pages and the usage of social networks like Facebook and Instagram, and ending with more specialized, targeted platforms that can not only help them contact potential customers and make a statement about their innovative approaches but also offer other benefits. As one such platform, in this paper we will discuss Kaggle. As the company itself writes on its website, Kaggle is the place to do data science projects (Kaggle, n.d.). However, it is also a place for companies who host competitions to share their projects and ideas, try to improve some part of the company's inner processes, and make a statement about their innovative nature as they go digital and use digital resources to improve themselves. From this perspective, the platform can be considered a place for targeted advertising by the hosting companies, as Kaggle has a big audience of various types of users: novice programmers, experts, professors, students, etc. Although it was founded only in 2010, Kaggle has managed to form a community of enthusiasts from all over the world, in more than 100 countries (Geek Out: Crowdsourcing Data Science With Kaggle, 2017).

Concerning other benefits that companies can gain from a presence on the Kaggle platform, they can use it for recruitment purposes. Managers from the HR department can identify the users that perform best on the platform and offer them an actual job in the company. In these terms, the platform can be mutually beneficial both for companies, as they can acquire the best minds, and for users, as they will be hired. Kaggle realizes this opportunity too, as it provides a service on its website where companies can get access to 3 million Kaggle users and choose the ones they find the best. Some of the biggest digital companies in the world use this feature of the platform in their recruitment process. From our point of view, the way Facebook organizes this process is the most interesting. It is an example of one of the best practices, as it uses the full capability of the platform (Facebook Recruiting Competition, n.d.). Facebook launches competitions on Kaggle the prize of which is an interview for a position at Facebook. Using competitions as a tool to choose the best candidates, they not only solve their problem with vacant positions but also gain insights from multiple data scientists on the projects they are interested in.

Another interesting feature because of which companies should be interested in hosting competitions on Kaggle is the nature of the platform itself. Here one must keep in mind that it is a "home for data scientists": a big concentration of data scientists who are willing to apply their theoretical knowledge in a practical field. Companies can use these resources to solve the tasks they have and to improve the services or products they provide. Especially now, when machine learning has become an important component of the digital world, the chance to use such a platform is invaluable.

To sum up, companies are interested in Kaggle because it is a place to advertise the company itself and promote it as an innovative one; some digitally oriented companies can benefit from the platform as a place to find the most suitable employees; and, last but not least, it is a place to actually solve the inner project problems that companies face while launching a product or service. However, at least in the final two cases, the companies face a problem. The essence of the problem is the identification of the so-called "best". Companies will prefer to optimize their expenses when using such platforms, as they will want to be sure that the best data scientists are working on their projects and that the best data scientists have been hired to work in the company. However, it is not so simple to determine how to attract the best possible candidates who will be well suited to do the job. Essentially, the paper addresses this problem. It attempts to find and provide a mechanism which can solve the problem of attracting the most suitable contestants. While tackling the problem, the paper will conduct research where a machine learning model and its new modification will be used. The results from the implementations of these models will be compared and analyzed, and the best model will be provided as a possible solution to the problem.

We think that, in order to get deeper into the essence of the research, the reader must be familiar with some concepts of machine learning that the paper is going to use. This will be helpful in order to fully understand the work that this paper is conducting. Therefore, in the remaining part of this chapter, we describe the techniques that have been used in our research, compare them, and highlight the main subtle points of their usage. Moreover, where necessary, the main weaknesses and strengths of the techniques will be provided too.

1.2 Acquaintance with the techniques used in the work

While launching a competition, hosting companies are seeking to get the best possible solutions, thus increasing the effectiveness of the platform. In this paper, we will measure the effectiveness of these activities by the strength of the users that the company can attract for its purposes. Therefore, to make the implementation of the goal of the work possible, we are going to carry out the analysis using various machine learning models. The task that we have can in its nature be represented as a classification task, because essentially the paper is going to classify the users by their choices, by the decisions they have made, and by the available options that they had. To solve this classification task, we will use the following methods: conjoint analysis, the k nearest neighbors algorithm, as well as our newly developed, modified knn method. All these methods and the underlying techniques they use belong to the family of supervised machine learning algorithms. Generally, though it is necessary to mention that there are other types of algorithms too, here we distinguish two families of machine learning models: supervised and unsupervised. The main difference between these families is the way the researcher's task has been formulated. In the case of supervised methods, the researcher should find a function that, based on a number of example input-output observations, maps an input vector to an output value (Akinsola, 2017). Meanwhile, under unsupervised methods, there is no assumption that the researcher has prior knowledge about the vector of outputs. As already mentioned, our case assumes the usage of models from the family of supervised methods. Hereafter, in the remaining part of this section of the paper, we will discuss the techniques that we used in the work.

Conjoint analysis is a well-known method that has gained substantial popularity over the decades and is widely used in marketing research. It was originally developed back in the 1960s with the contribution of, to our mind, one of the greatest statisticians of the time, Tukey. He and Luce proposed a model whose purpose is to determine which features of products and services meet the preferences of the customers (Dobney, Carlos, & Revilla, 2017). Generally, conjoint analysis assumes the conduction of some sort of survey to collect the necessary information about the characteristics that companies are willing to test. By testing one must understand the determination of the influence of those characteristics on the consumers' decision-making process (Hauser). As an example of characteristics that might be of interest to a company, one can take such elements as a new design color on a website or even a new button on a TV remote control. Though there are numerous forms of conjoint analysis, in this study we will use choice-based conjoint analysis, because we are analyzing the choices of the users based on the parameters of the competitions and the relationship of those parameters to the final decision-making. It is remarkable that 80% of conjoint analysis studies apply this method in their research (Steiner & Meißner, 2018). As we just mentioned, we will not be an exception in this case, as our research is also going to make a choice-based conjoint analysis. Though, our case will be somewhat different from what has been done so far, as it will not be a "sterile" experiment that provides some options to the respondents and asks them to choose various combinations of features according to their tastes. Instead, we will delegate this task to the real, outer world. Thus, we think that this will allow us to deal with the problem of the distortion of respondents' perception, when they start to mark the choices regardless of their interest in them but based on some subjective psychological factors (Hundert, 2009). For example, the case when a person starts to mark the features randomly because of boredom or laziness is a vivid example of such psychological behavior.

A disadvantage of the choice-based analysis model that we face is that the researcher usually does not have unlimited access to all the variables that can influence the process of decision formulation; therefore, the model has to deal with just a limited number of variables. In this case, we fall under this constraint. However, like almost any other researcher, we have to put up with this fact, as we cannot deal with the problem because of the scarcity of resources and time.

The choice-based conjoint analysis uses a multinomial logistic model with heterogeneous parameters under its hood (Voleti, Srinivasan, & Pulak, 2017). Logistic regression is a lot like ordinary least-squares regression; however, unlike the latter, it does not make any assumptions about the continuity of the output. That is, logistic regression can be used to solve classification tasks, where the researcher is willing to distinguish between different categories. For example, one can consider the case when it is needed to predict whether a basketball player will make a shot depending on the distance from the basket. The underlying core idea of logistic regression is the same as in the case of least-squares regression: minimization of the loss function. However, in the case of logistic regression, due to its nature, the estimate is constrained between the values of 0 and 1. That is, it can take as an argument any number from the whole real line, but the corresponding values cannot exceed 1. Because of the usage of multinomial logistic regression, choice-based conjoint analysis allows the researcher to interpret the output of the model, which is exactly what we are after: a well-interpretable model (Sperandei, 2014). Frankly speaking, a considerable number of papers have been written discussing various issues connected to logistic regression; therefore, we think it is better to limit the discussion of this model here.
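
As a minimal illustration of this property, the following Python sketch (our own illustration, not code from the study) shows how the logistic function squashes any real-valued linear predictor into the interval between 0 and 1:

```python
import numpy as np

def logistic(z):
    # Maps any real number to the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear predictors taking any values on the real line
z = np.array([-10.0, -1.5, 0.0, 2.3, 10.0])
print(logistic(z))  # every output lies strictly between 0 and 1
```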

It is understandable that, having been used for such a long time, the way the analysis is conducted has been affected by modifications from numerous authors. There have been numerous experiments with the identification of the estimates that can establish the significance of product features for customers; for instance, a new way of assessing the relevance of characteristics has been introduced that uses sophisticated methods such as support vector machines (SVM) (Maldonado, Montoya, & Weber, 2014). Though this kind of approach can significantly help the researcher to make more accurate predictions, the usage of SVM has a significant disadvantage for our purposes, which is that the results of SVM are not well interpretable; therefore, we cannot make further discussions of the outputs of the model. Another reason why we are restricted in the usage of this technique is that it is well recommended for surveys; however, our case is a bit different. For the purposes of our study, no surveys have been conducted, but other methods of data collection have been used; more will be said about that in the upcoming section of the paper.

As we promised, we are not going to constrain our research to performing only a choice-based conjoint analysis. Another method that we will use is the k nearest neighbors algorithm. The algorithm has tremendous popularity among researchers, partly because of its flexibility and partly because of its simplicity of implementation. The knn algorithm was even included in a list of the top 10 most popular data mining techniques (Xindong, 2007). The number of fields where the knn algorithm finds its implementation is astonishing, starting from music recommendation systems and finishing with the representation of superpixels for images (Kang, et al., 2018). At the same time, the idea behind knn is quite intuitive. Given a set of data points and a query point in an n-dimensional space, the algorithm finds the closest points to this query point and computes the estimate of our interest based on the estimates received from those nearest points (Beyer, Ramakrishnan, Goldstein & Shaft, 1997). The flexibility of the algorithm essentially lies in the fact that it can be successfully used for the resolution of both regression and classification problems. In fact, my colleague and I, in the previous year, specifically in 2018, took on the responsibility of modifying the algorithm to make it more accurate for the solution of regression tasks. Anticipating some further parts of the work that will be discussed in the next section, we developed a new modified knn algorithm to deal with the solutions of classification tasks too.
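
To make this intuition concrete, here is a minimal sketch of a plain knn classifier (our own Python/NumPy illustration; the data points and the value of k are hypothetical and do not come from the study's dataset):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the classes of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical two-dimensional data with two classes
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [1.0, 0.9], [0.8, 1.0]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, query=np.array([0.85, 0.9]), k=3))  # predicts class 1
```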

One of the crucial factors that makes it possible for us to turn this idea into real life is the fact that knn, unlike other widely used models, makes relatively few assumptions about the underlying observations. Specifically, it does not make any assumptions about the distribution of the observations, thus giving the researcher an opportunity to use the model even when he or she does not have any prior knowledge about the nature of the observations (Islam, 2010). All of the above gives us the opportunity to use the model for our analysis. However, like every other model, this one also has its disadvantages. The main limitation that we are going to address in this paper is the tendency of knn to pay too much attention to irrelevant features (Bafandeh & Bolandraftar, 2013). Though using techniques such as putting appropriate weights on the observations can help to partially cope with this problem, we are going to attempt to improve this method even further by handling the cases when the model faces almost evenly distributed data classes.

To sum up the section, we would like to mention that our choice of these two models has been justified by the nature of our task, the popularity of the models and, as a consequence, the existence of very mature programming tools that make interaction with the models significantly less painful. In particular, we will use these tools when dealing with the knn method. The modified knn will be developed and carried out from scratch on our own.

2. Statement of the research question

2.1 The objective and the tasks

As was already mentioned in the introduction, the objective that we decided to tackle is the development of tools aimed at maximizing the effectiveness of the competitions on the Kaggle platform. To justify the choice of such a specific goal, it can be noted that nobody has done extensive research on this topic or in this field. Moreover, the popularity of the platform and the fact that it can serve companies as a key to solutions for various data science and machine learning tasks make the prospect of such work even more attractive. Eventually, the companies who are eager to use the results of the analysis that the paper suggests will be able to optimize their resources and identify the most vital characteristics of the competitions in order to acquire the best specialists of the platform.

The research question of the paper is the following: “Are there any specific competition's characteristics that can have a significant influence on the choice of the users, keeping all other factors equal?” To answer this question and reach the objective of the paper we are going to solve the following five tasks:

1) Retrieve relevant data from web resources.

2) Develop the modified model and implement it on the collected datasets.

3) Compare the model's performance with the performance of the benchmark models.

4) Identify the core features of the competitions and their relationships with the contestants' choices.

5) Determine whether the contestants will be successful at the competition or not.

Specifically, as can be seen from the order of these tasks, we will get the answer to our research question at the end of the analysis, after getting all the data, developing a model, and comparing the outputs. Strictly speaking, we could omit the third point from our tasks and still manage to answer the question. However, in that case, the answers that we get might not be the most "optimal", as they would not be evaluated on the best model. Therefore, we will not get rid of the third task.

The main steps that will be necessary to complete the tasks vary from one to another. To solve the first task of data retrieval, we wrote a web scraper in the Python programming language. Web scraping is just another name for web data extraction; it is a data scraping technique used for extracting data from websites (Boeing & Waddell, 2016). It is a quite popular way of data collection, used in those cases when the web page of interest does not provide an interaction mechanism such as an application programming interface (API) to make it easier for developers to get the necessary information. Though the Kaggle website has an API, we found it very limited in its capabilities. Therefore, we decided to write our own program to deal with this task.
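
As a rough illustration of this approach (not the actual scraper used in the study), the following Python sketch fetches a page and extracts text from it with the requests and BeautifulSoup libraries; the URL and the CSS selector are hypothetical placeholders:

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    # Download the raw HTML of the page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Parse the HTML and pull out the elements of interest
    soup = BeautifulSoup(response.text, "html.parser")
    # The selector below is a placeholder; a real scraper would target
    # the specific markup of the pages being collected
    return [tag.get_text(strip=True) for tag in soup.select("div.item")]

if __name__ == "__main__":
    print(scrape_page("https://example.com/competitions"))
```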

The second task is a lot more difficult, since it demands not only the ability to write a program in some programming language but also a deep understanding of the techniques that we are going to use to meet the objective of the paper. Therefore, in the theoretical part of the work we attempted to give at least a general acquaintance with the methods that we are going to use and develop further in the research section. Since it is a very ambitious task to try to tune all the models that we are going to use for the purposes of the research, we are limiting ourselves to only one method: knn. Another problem on our way to bringing the second task to life is the fact that we must not only know the restrictions of the methods described in the theoretical part, but we should also be able to cope with them and suggest new ways to stretch these limits.

Naturally, here comes the third task and the need to choose some metric upon which we will analyze the performances of the models. If we were solving a regression problem, we would use such metrics as the residual sum of squares (RSS). The RSS is quite popular because it is easily computable and understandable. The name of the RSS comes from the fact that it is the sum of the squared differences between the so-called "real" and predicted values over the n observations. However, in our case, the usage of such a metric is inappropriate, because in the case of classification tasks the "real" and predicted values are Boolean; that is, we would get a sum of a bunch of zeros if the prediction is right and a sum of a bunch of ones if it is not. This kind of technique to assess the performance of the model is not very informative. There are other metrics that are used specifically for classification-type tasks. This paper is going to use a metric called accuracy. However, we will not stop solely on the accuracy of the model and will give the whole output of it with the help of a confusion matrix. The confusion matrix represents a table of the predicted and the true variables. It makes it convenient to compare the model output and see, for example, how many true positives, or truly predicted variables, we have. It also gives an opportunity to assess not just the accuracy of the predictions but also to compute the sensitivity and specificity of the model (Sunasra, 2017).
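
A minimal sketch of how such an evaluation might look in Python with scikit-learn (the label vectors here are made up for illustration and are not results from the study):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted binary class labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Unpack the confusion matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(accuracy, sensitivity, specificity)
```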

After the analysis has been conducted, in the fourth and fifth tasks we will identify those characteristics of the competitions and users that contribute most to their general popularity and success. For our purposes, we are going to assess the contribution of the competitions' features to the popularity of the competitions, and the users' success in competitions given their own set of characteristics. As mentioned earlier, to do so we are going to use choice-based conjoint analysis, which was described in more detail in the theoretical part of the paper, along with knn and our new modified knn.

Overall, the methods that we are going to use should be enough to fill the gap and complete the objective of the paper. It is possible to make this kind of claim because the models were specifically chosen for the implementation of the task of the paper. Though it is no surprise that more sophisticated methods of supervised machine learning exist, and they could be used here too, we are constraining ourselves to these models, since they are easy to implement, quite popular and, if we are speaking about choice-based conjoint analysis, very well interpretable.

3. Research methods

3.1 Getting the data

In order to start our research and solve the problem, we must first have a database to work with. The database that has been used in the study can be divided into two categories: primary and secondary data. The primary data comes from the Kaggle web page (Kaggle, 2019). There, an interested person can find an extensive amount of information about the website. It is, essentially, metadata of all Kaggle activities.

The process of retrieval of the secondary data was the hardest one. Though Kaggle provides an API, which can be used to extract some information, it has somewhat limited capabilities for our purposes. Hence, unfortunately, the official Kaggle API turned out to be useless for further work. Nevertheless, we found a way out of this situation: we wrote a program in the Python programming language which is intended to extract all the meta information that we are interested in. In particular, we were interested in the characteristics of the users of Kaggle. As a result of a very long process of data retrieval, we obtained a database that can finally be useful for our research. The database contains such information as the level of competence in all the sections: competitions, discussions, and kernels. Moreover, it contains data about the users' achievements: rankings and medals, once again from all sections of the platform. Last but not least, it has information about what competitions the users have been involved in and how successful they have been in these competitions. The level of success in a competition is determined by the place that the user managed to reach on the competition's leaderboard.

Due to the technique by which the whole database was collected, it was initially separated into small datasets. These were joined with each other to make the subsequent manipulations easier. Since, as was mentioned in the theoretical part, one of the methods that the paper will use is the choice-based conjoint analysis model, we altered the shape of the data to make it more suitable for this type of analysis. The conjoint analysis assumes that the dataset will have a categorical output variable for every user, so we inserted a column in the table which describes whether the user chose to take part in the competition or not. The issue with this approach is that there are more than a hundred competitions that start at different times, and as we are modeling the choice of the competition from the user's perspective, we cannot just take all the competitions and assign to them a binary indication of participation. Instead, for every user we should take only those competitions that had been launched before the user made up his or her mind to join the competition and had not been closed by the time he or she made the final decision. For example, if a user is involved in event "A", and it started on March 5, then we should look only at the events that started on March 5 or earlier and whose expiry date is later than March 5.
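
A simplified sketch of this filtering step is given below (our own Python/pandas illustration; the column names user_id, competition_id, joined_at, start_date and end_date are hypothetical placeholders, not the actual variable names of the dataset):

```python
import pandas as pd

def build_choice_set(entry, competitions):
    # Keep only competitions that were open at the moment the user decided to join:
    # started on or before the decision date and not yet finished at that date
    open_mask = (competitions["start_date"] <= entry["joined_at"]) & \
                (competitions["end_date"] > entry["joined_at"])
    alternatives = competitions.loc[open_mask].copy()
    alternatives["user_id"] = entry["user_id"]
    # Binary choice indicator: 1 for the competition actually entered, 0 otherwise
    alternatives["chosen"] = (alternatives["competition_id"] == entry["competition_id"]).astype(int)
    return alternatives
```

Applying such a function to every participation record and concatenating the results yields the "long" choice data described next.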

Another adjustment that we must make to meet the preconditions of conjoint analysis is that we must allocate the data in the so-called "long" format. This kind of format assumes that the data is arranged in a way which shows the choice the user made, and next to it each alternative that he or she had. So, commonly, an outside observer may notice duplication of information across rows in our dataset. This structure will help us to capture all the possible variants for the users and simulate their decision-making while choosing a competition.

At the end of the subsection, we would like to mention that Appendix 1 illustrates the encodings of the variables used in our sample and explains their meaning.

3.2 The development of the methods

After the data collection process has been successfully completed, we will focus on the remaining tasks of the paper, specifically on the modification of the model so that it can be used for classification purposes. The model will be developed by us using the programming languages R and Python. The choice of these languages is justified by the fact that they are tremendously popular in the data science community. They can, either directly or via additional libraries, provide the whole necessary range of statistical software to deal with the task that we have.

The methods that the paper is going to use in the research are the following: conjoint analysis, k nearest neighbors, and its modification. The first is more a tool for marketing analysis than the others; it was suggested by one of the greatest statisticians of the 20th century, Tukey. Four types of conjoint analysis tools are distinguished: the traditional, choice-based, adaptive and self-explicated conjoint analysis (Rao & Pilli, 2014). While some types of conjoint analysis assume the ranking of the parameters by the respondents, such as the adaptive conjoint type, we do not have an opportunity to allow ourselves this luxury; therefore, we will use another, massively popular type of conjoint analysis: choice-based conjoint analysis. However, the choice-based conjoint usually assumes that the researcher is going to observe the choices that the respondents make from the available variants in a questionnaire. In our case, we will not have any kind of questionnaire, nor will we conduct a survey; instead, we will delegate the task of the creation of a questionnaire to the real world. More concretely, we will observe the real-world decisions of the users, and we will analyze the experience that they had while choosing whether to participate in one competition or prefer another. The usage of conjoint analysis will let us assess the relative significance of the factors that the users rely on while making choices among various features. The task that conjoint analysis is trying to solve is the following: it tries to maximize the demand for the product by calculating the probability that the customer will buy the predefined amount of the product. Conjoint analysis, like any other statistical tool, is aimed at dealing with uncertainty while making predictions. One of the sources of uncertainty that almost all researchers are facing, and so will we, is the scarcity of information about the customer. For example, it is quite common that the researcher, while conducting choice modeling, does not have complete knowledge about the agent, his or her preferences, age, income, philosophy or attitude. As a result of such limitations, the models must deal with randomness to acknowledge the uncertainty.

Despite the statistical nature of the model, it has found its justification also from the sociological perspective; thus, the model receives more thorough attention from market researchers. The sociological part of the model is based on the maximization of random utility. The computation of the utility of the product came to include a random element, which is referred to as "noise" or "error" (Hess, Daly, & Batley, 2018). During the development of the theory and practice of choice modeling, various researchers have made different assumptions about the distribution of the error terms. Based on those assumptions, numerous approaches have been evaluated, the most popular of which assumes the usage of the multinomial logit model, which treats the error as if it belongs to the Gumbel distribution, widely used to model the maximum of samples of different distributions (Gumbel, 1954). The choice-based conjoint analysis is exactly based on this idea and uses a logistic regression model under its hood. Thus, the logit model has three underlying assumptions, one of which is the fact that the error should follow the Gumbel distribution. The other two are independence and equal variability (Li, 2011).
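
In this standard formulation (our own restatement of the textbook multinomial logit, not a formula taken from the thesis), the probability that user n chooses alternative i from the choice set C_n is

\[ P_{n}(i) = \frac{\exp(V_{ni})}{\sum_{j \in C_n} \exp(V_{nj})} \]

where V_ni is the deterministic part of the utility, typically a linear combination of the alternative's attributes, and the Gumbel-distributed error term is what gives rise to this closed-form expression.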

One more method that we are going to use in this paper is the k nearest neighbors algorithm. For the sake of clarity, it must be mentioned that we will not just satisfy ourselves by running the algorithm and getting the results. Indeed, we will be more creative in relation to this algorithm and will make an attempt to modify it to get more rigorous results in comparison with the simple knn.

The problem with knn that we are going to address is that it performs very well if the query point we are going to classify is "close" to one of its so-called class representatives in the training database. However, knn pays too much attention to redundant data. If the point is situated in a place where the classes meet each other or, even worse, overlap one another, the point can very frequently be misclassified, even if this overlap occurs as a result of being "close" to the outliers of one of the classes. We decided to try to alter this situation by making some modifications to the way the model works in these cases. The model that we wrote is as follows:

1) First, we compute the distance between the query point and the remaining points. A table with all these distances is formed to make further computations convenient.

2) In the second step of the algorithm, we wrote a function to compute the k closest points to the query point. To do so, we use the table with distances formed in the previous step of the model. So far everything has been done just like in a normal knn. The changes come at the third stage of the model.

3) The third step of the model is the most interesting. In this step, we compute the majority of the representatives of the particular classes among the k nearest points. For this purpose, we use both the simple way to calculate the majority and our modified one. The essence of the simple way of calculating the majority is the following: the algorithm computes the number of neighbors belonging to the different classes and just compares these numbers for each of the classes. The class whose number of neighbors is bigger is assigned to the query point. Meanwhile, in our altered version of the computation, we specifically tackle the points that are situated at the overlap of the classes. For that kind of points, we have figured out that the algorithm does not work well because of the intersection or lack of neighbors of the true class that the point belongs to. For example, in a situation when we are trying to classify whether a person has a disease or not, if the point is around the meeting place of the two classes, there can be a situation when the true class is "no", that is, he or she does not have the disease, but out of the taken k nearest neighbors k/2+1 have the disease. Hence, misclassification will occur. To reduce the risk of misclassification we did the following: we took the closest neighbors of the query point and resampled from them n times with replacement; then, during every iteration, we computed the majority of the classes of the neighbors and assigned it to a vector of all the n trials. After the iteration cycle has been completed, we compute the mean of the resulting vector and compare it to some threshold value. After that comparison, it becomes obvious which class we should assign to the query point. A code sketch of this procedure is given right after this list.
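
The following Python sketch is our reading of the procedure described above; the number of resamples n and the threshold value are illustrative choices, not values reported in the thesis, and binary class labels 0/1 are assumed:

```python
import numpy as np

def modified_knn_predict(X_train, y_train, query, k=5, n_resamples=200, threshold=0.5, seed=0):
    # Step 1: distances from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 2: labels of the k nearest neighbors
    neighbor_labels = y_train[np.argsort(distances)[:k]]
    # Step 3: bootstrap the neighborhood: resample the k labels with replacement
    # n times, record the majority class of each resample, and average the results
    rng = np.random.default_rng(seed)
    majorities = np.empty(n_resamples)
    for i in range(n_resamples):
        sample = rng.choice(neighbor_labels, size=k, replace=True)
        majorities[i] = 1 if sample.sum() > k / 2 else 0
    # Compare the mean of the resampled majorities with the threshold
    return 1 if majorities.mean() > threshold else 0
```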

Essentially, the technique that we are going to use at this stage of the model development is inspired by a combination of the knn and bootstrap methods. Bootstrap is an important concept in the range of modern statistical tools (Kotz, Campbell, Balakrishnan, Vidakovic, & Johnson, 2004). To understand what this method assumes under its hood, it is enough to mention the other name of bootstrap, that is, resampling. The resampling technique originally proposed by Efron has been widely used for validation purposes; then it received its fame in the precision estimation and sample-size determination fields (Good, 2011). What bootstrap does is very similar to the algorithm upon which we implemented our method. It is a Monte Carlo-style process of independent sampling with replacement to make inference about the estimate that we are interested in. In our case, such an estimate is the query point's relation to a particular class. Though, for every sample, one can observe that the sample size is predefined and some points are repeated from sample to sample, since the resampling is carried out with replacement, every time the estimate will slightly differ from the previous one. The core idea is that the relative frequency of the sampled distributions of our estimates will be close enough to the real parameter, so that we can make inference based on that estimate (Chapter 1. Bootstrap Method, n.d.). We are using this idea to make sure that we can classify using our estimates from the model.

It is essential to use the combination of these two approaches because it gives some advantages in relation to the way the model performs. In particular, it positively affects the speed of the model, because if the model were using only our modified approach, its computational cost would be too high and it could hardly ever be useful.

Summing up this section, we would like to mention the metrics upon which the comparison of the simple knn and our modified model will be conducted. There are several metrics that are commonly used to evaluate the results of classification. One of the most standard ways to do that is the computation of the specificity and sensitivity of the outcomes. The usage of several metrics instead of just a single metric like accuracy is needed because, frequently, while solving some classification tasks, high accuracy alone is not a very good metric of the model's performance. To illustrate why relying on accuracy alone can be misguiding, imagine a situation when the researcher is trying to classify whether a ball is red or blue. If the number of red balls in the sample is considerably higher than that of the blue balls, he can naively classify every ball as red and have very high accuracy in the training sample. However, it is obvious that this solution is not favorable and is terrifically naïve. Certainly, not every ball is red. Therefore, it is necessary to count both the specificity and the sensitivity of the model, which will let us avoid this kind of situation.
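
For reference, the two measures are defined in the usual way (our restatement in terms of the confusion-matrix counts TP, TN, FP and FN, not a formula quoted from the thesis):

\[ \text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP} \]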

4. Description of the results

4.1 Exploratory data analysis

At this point of the work, we are going to introduce the primary results of the research with some complementary comments, to give a flavor of the data that has been used in the research and to arrive at the final results. Thus, the first part of this section will be an exploratory data analysis (EDA) and the second part of it will be the implementation of the models.

We drew a sample by taking a simple random sample of the general population that we obtained from the Kaggle website. The usage of this sampling technique has advantages that we really seek. Firstly, it assures that the sample will be drawn without replacement, which means that we won't have repeated observations and the sampling will be done uniformly. Thus, basically, each of the observations is equally likely to be included in the final sample. The sample that we drew contains sixty-three variables and more than twenty-five thousand observations. Apparently, the big number of observations can partly be explained by the fact that we allocate them by rows. That is, if a person participates in several competitions, all of them are written in new rows along with all the available alternatives. As was already mentioned in the previous section, this was done for conventional purposes, to be assured that the choice-based conjoint analysis technique that we are using will work flawlessly. However, even without the allocation by rows, our sample is quite big: it contains five thousand unique observations, that is, users in our case.
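
A one-line illustration of such sampling in Python/pandas (the dataframe and the sample size below are placeholders, not the study's actual data):

```python
import pandas as pd

# Hypothetical dataframe of unique Kaggle users
users = pd.DataFrame({"user_id": range(100000)})

# Simple random sample without replacement: every user is equally likely to be drawn
sample = users.sample(n=5000, replace=False, random_state=42)
```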

The first thing that strikes us straight away is that, keeping in mind the popularity of the Kaggle platform among data scientists and machine learning enthusiasts, the number of active users can be considered somewhat low. We found out that, in fact, after some filtering of the general population, out of more than three million users only approximately nine thousand were "active" users. Here, by "active" we mean that they have a performance tier of 2 or more. The remaining users are mostly the type of people who registered on the platform for some reason and have either zero or at most two or three activities on the platform. By activities in this case we mean not only the number of competitions that they joined but also other forms of activity on the platform, such as discussions and kernels. Hence, we categorize all of them as so-called "passive" users that are of no big interest for our purposes; therefore, we will omit further evaluation of this topic in the work.

Figure 1. The distribution of users' performances in the general population

...
