Methods of machine learning for censored demand prediction

Econometric approaches to modeling censored demand provide a tool for obtaining consistent and unbiased parameter estimates. Neglecting data censoring when building a forecast is a significant shortcoming of demand analysis with machine learning methods.

Subject area: Economic and mathematical modeling
Document type: thesis
Language: English
Date added: 23.09.2018
File size: 192.4 K


Posted at http://www.allbest.ru


Introduction

Keywords: econometric, forecast, demand

The grocery retail market has been under the close scrutiny of economists over the past few decades. This high level of interest has several motivations: first of all, the availability of large volumes of detailed data on the purchases of each individual consumer; secondly, the heterogeneity of behavioral patterns of both sellers and consumers, which is extremely interesting and useful to analyze; thirdly, the vastness of the research questions arising in this market and requiring an empirical solution; and, finally, the opportunity to test in this market many of the econometric techniques and machine learning methods that have appeared in recent years. At the moment, there are many studies examining in detail different features of the grocery retail market. Perhaps the most significant part of this research is devoted to the demand for different groups of products in this market. Demand studies, in turn, can be divided into two large groups: the search for and explanation of factors that act as demand determinants, and the prediction of future sales volume, prices, etc. In this study, we focus on demand prediction and, based on the sales data of a regional food store chain, try to build a model with comparatively good predictive power.

The studies that preceded this work investigated various issues related to demand modeling and, accordingly, used different methods and approaches. The basic approach to demand modeling is considered to be the discrete choice models, which appeared in the middle of the last century and were based on the theory of the Random Utility Function (Lancaster, 1966). This approach was long perceived as the main one because it took into account the heterogeneity of consumers' choice in product-differentiated markets. Among the most commonly used models of this approach is the family of logit models: from the binary logit to the nested (Guadagni & Little, 1983) and mixed (Nevo, 2001) logits. But, despite the power of these models, they did not solve the problem of continuous choice. Therefore, the focus of researchers later shifted to discrete-continuous choice models, which take into consideration the fact that the quantity or volume of consumed goods may be continuous rather than discrete (Bhat, 2005; Richards, Gomez & Pofahl, 2012).

For a comparatively long time, exclusively econometric approaches occupied the leading position in the analysis of food market demand. But the situation changed dramatically when, in the late 90's, Nielsen and IRI Marketing Research began to collect individual data on the purchases of retail chain visitors. Advances in individual data availability drew researchers' attention to machine learning methods. Investigators of «big data» analysis have revealed the huge potential of machine learning methods for working with data sets that are massive both in the number of observations and in the number of predictors (Richards & Bonnet, 2016). A number of scientists, including Agrawal & Schorling (1996), Varian (2014) and Bajari, Nekipelov, Ryan & Yang (2015), showed the greater predictive power of machine learning methods compared to the traditional econometric approach. Therefore, today, when solving the problem of demand prediction, analysts often give preference to machine learning.

However, despite the significant breakthrough made by scientists in demand estimation with machine learning methods, there are still many gaps whose filling can improve the predictive quality of models. One of these gaps is accounting for the censoring of demand. To date, there are a number of works devoted to censored demand prediction using traditional econometric approaches, as well as several studies on demand forecasting (without censoring) using machine learning methods; at the same time, there are no works that combine censoring and machine learning methods. Therefore, in our work, we fill this gap by constructing an ensemble model of censored demand using machine learning methods and empirically checking its predictive properties on the data of a retail food chain.

The study is conducted on data provided by the Perm regional grocery chain «Semya». One product category - pasta - is selected for analysis. The choice of this product category is justified by the high frequency of purchases of this product, the breadth of the assortment and a number of other reasons. The initial data from the «Semya» sales represent the full information on pasta purchases over six calendar years: from December 1, 2009 to January 31, 2014.

So, the aim of this study is the construction of a censored demand ensemble model which has relatively better predictive power than existing models that do not take censoring into account, using real data on the grocery retail market. To achieve this aim, it is necessary to cope with the following objectives:

1) To give a firm basis to the research, having revised the current theory of demand estimation;

2) To perform data analysis and, based on it and the literature review, make inferences about the most suitable variables to use in the model;

3) To emphasize the peculiarities of the data and to justify the necessity of accounting for censoring in the econometric model;

4) To build an econometric model of demand prediction, applying machine learning methods;

5) To extend the constructed models by adding censoring of the dependent variable;

6) To draw conclusions about the appropriateness of accounting for censoring by comparing the predictive power of models with and without censoring.

The remainder of the paper is organized as follows. The first part is devoted to a review of relevant literature for this study. The second section provides a detailed description of the data and its preliminary analysis, which explains the need to use the chosen methodology. In the third part the models are presented and the methodology of their evaluation is described step by step. The fourth section discusses the results of models' evaluation. Finally, the last part is devoted to the main conclusions, where the limitations of the model are also mentioned.

1. Literature review

Significant expansion of the food market, the increased availability of data on purchases, as well as achievements in econometrics and computer science - all these factors have caused a real revolution in demand modeling over the last few decades. The most attractive areas for empirical testing of the improving models have been retail (offline and online), healthcare, tourism and others (Richards & Bonnet, 2016).

All current approaches to estimating the demand function can be divided into two groups: explanation/description and forecasting/prediction. The main task of explanatory models is the search for and interpretation of various effects that affect demand; as for forecast models, their main aim is the most accurate prediction of prices, sales volume and purchase probability, which are of greatest interest to researchers. While in descriptive approaches the preference is mainly given to econometric instruments, in forecasting models the leading role nowadays belongs to machine learning methods. Thanks to enormous computing power, as well as the ability to work with millions of observations and thousands of explanatory variables, machine learning methods make it possible to obtain more accurate predictions of demand. Further, let us say a few words about the evolution of models for demand estimation and explain why machine learning methods are today preferred by many researchers engaged in demand prediction.

1.1 Discrete Choice Models

When studying demand in grocery retail, it is essential to use models that are able to capture the heterogeneity of consumer choice in differentiated-product markets (Richards & Bonnet, 2016). Discrete-choice models are considered to be among the best for solving this problem. Discrete choice models are based on the theory of the Random Utility Function (Lancaster, 1966). The new consumer theory differs from the classical one in that consumers derive utility from individual characteristics of the product; in other words, they demand a specific set of characteristics. If each product unit can be represented by its taste, shape, net weight and other characteristics, then each package of goods is a unique assembly of components, and the rational consumer chooses the optimal combination offered by the retailers. Let us further consider the development of discrete choice models in more detail.

Thus, the basic concepts of consumer demand models for goods and services were laid back in 1974 by Daniel McFadden (McFadden, 1974). McFadden introduced the approach with the premise of a random distribution of preferences between individuals. In view of its ease of interpretation and its ability to reduce the problems of high data dimensionality, McFadden's approach quickly became the main approach in demand analysis and served as the basis for a large number of discrete choice models.

Among the first who successfully applied McFadden's approach to the grocery retail market and adapted it for marketing purposes were Guadagni and Little (1983). They created the multinomial logit model and applied it to the estimation of individual and store-level demand for regular ground coffee. The proposed model yielded a quantitative picture of loyal customers and of those prone to switching.

Later the same authors to some extent modernized their own solution to the problem for the regular coffee market, using the nested logit model (Guadagni & Little, 1998). In general, the nested logit, as well as the multinomial logit, generalized extreme value models and some other kinds of logit models, represent particular cases of the mixed logit model. In such specifications, restrictions were imposed on the random variables describing consumer preferences. These types of models were created to combat the imperfections of the logit specification proposed by McFadden (one of which, for example, was the independence of irrelevant alternatives problem). So, improvements to the initial logit model made it possible to analyze consumer demand more properly and solve a number of applied problems. Yet, the fundamental assumption of the discreteness of the choice made by consumers proved untenable for a class of problems (Richards & Bonnet, 2016).

1.2 Discrete-Continuous Choice Models

For many products, particularly for food and beverages, the process of choosing goods is more correctly described as a discrete-continuous process, rather than simply discrete. For example, when consumers buy a piece of cheese, a bottle of water, meat or fruit, they decide for themselves what volume of each product to purchase, and the weight of each individual unit of the product can be any value from a continuous set. In other words, when the consumer buys a discrete number of units from the proposed set, but the amount (or weight) of each unit can vary continuously, the consumer faces a discrete-continuous choice.

From the point of view of the data for which the problem of discrete-continuous choice is econometrically solved, they often look like this: a large number of zeroes for non-purchased alternatives and continuous amounts for purchased ones (Richards & Bonnet, 2016). Such data are called censored. The early models of discrete-continuous choice with a censored dependent variable are associated with such researchers as Heckman (1979) and Hanemann (1984). Both approaches were indirectly based on utility-maximization theory. Thus, Heckman proposed a two-step estimation approach, in which at the first step the probability of non-zero consumption (the participation equation) is estimated. At the second step, the demand equation (the selection equation) is estimated with a correction for the potential covariance of the unobserved components in the equations of participation and choice. As for Hanemann, he worked out an approach based on the Kuhn-Tucker conditions for utility maximization in the presence of corner solutions. Despite the fact that the Hanemann model contained a simplifying assumption about the consumer's single choice from a continuous set, his method, as well as Heckman's, provided the impetus for the further development of discrete-continuous choice models and drew researchers' attention to the problem of data censoring. Further, we will discuss the development of models for censored data in more detail, since data of this type are used in the framework of our research.
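The two-step logic described above can be sketched on simulated data. This is only an illustration under invented assumptions (the selection and demand equations, the error correlation of 0.5 and the sample size are all made up for the demo), not the estimator used later in the thesis: a probit participation equation is estimated first, the inverse Mills ratio is computed, and it is added as a regressor in the second-step OLS.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 5000

# Invented data-generating process: z shifts participation only (an exclusion
# restriction), x enters demand, and the two error terms are correlated (0.5).
x = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)

participate = (0.5 + 1.0 * z + 0.7 * x + u[:, 0]) > 0   # participation equation
y_star = 1.0 + 2.0 * x + u[:, 1]                        # demand equation, true slope 2
y = np.where(participate, y_star, 0.0)                  # demand observed only if bought

# Step 1: probit of participation on (z, x) by maximum likelihood.
def probit_nll(b):
    p = norm.cdf(b[0] + b[1] * z + b[2] * x)
    return -np.sum(np.where(participate, np.log(p + 1e-12), np.log(1 - p + 1e-12)))

b_hat = minimize(probit_nll, x0=np.zeros(3), method="BFGS").x
index_hat = b_hat[0] + b_hat[1] * z + b_hat[2] * x

# Inverse Mills ratio evaluated at the estimated probit index.
imr = norm.pdf(index_hat) / norm.cdf(index_hat)

# Step 2: OLS of y on x and the IMR over the selected (purchasing) sample only.
sel = participate
X2 = np.column_stack([np.ones(sel.sum()), x[sel], imr[sel]])
beta2 = np.linalg.lstsq(X2, y[sel], rcond=None)[0]
print(round(beta2[1], 2))  # slope on x, consistent for the true value 2.0
```

Here z plays the role of an exclusion restriction: it shifts the probability of purchase but not demand itself, which is what identifies the Mills-ratio correction.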

1.3 Censored demand models

As was mentioned earlier, Heckman was one of the first who noted the need for a special analysis of censored data to avoid biased estimates. His method assumed censoring of the dependent variable, required distributional assumptions on the dependent variable or the error term, and was sensitive to the chosen type of distribution. Heckman's method, with additions and improvements, has long been used by researchers as one of the main methods for censored demand estimation.

One more basic approach to demand with censoring (developed even earlier than Heckman's) was proposed by Tobin (1958). Tobin introduced a model that was a combination of a probit model and multiple regression. Later, Tobit regression was used as an independent method to assess the demand for different groups of goods (Harris & Shonkwiler, 1994; Dong et al., 2003), as well as part of a system of simultaneous demand equations for explanation and prediction (for example, Perali & Chavas, 2000; Dong et al., 2004).
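Tobin's combination of a probit and a regression can be illustrated by maximizing the standard type-I Tobit log-likelihood on simulated, left-censored data. The data-generating process and parameter values below are assumptions for the demo, not the thesis model.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=n)
y_star = 0.5 + 1.5 * x + rng.normal(size=n)   # latent demand, true slope 1.5
y = np.maximum(y_star, 0.0)                   # observed sales, left-censored at zero

def tobit_nll(params):
    b0, b1, log_s = params
    s = np.exp(log_s)                          # reparameterize to enforce sigma > 0
    mu = b0 + b1 * x
    censored = y <= 0
    # Censored observations contribute P(y* <= 0); the rest contribute the density.
    ll = np.where(censored,
                  norm.logcdf(-mu / s),
                  norm.logpdf((y - mu) / s) - np.log(s))
    return -ll.sum()

res = minimize(tobit_nll, x0=np.zeros(3), method="BFGS")
b0_hat, b1_hat, _ = res.x
print(round(b1_hat, 2))  # close to the true slope 1.5, unlike OLS on the censored y
```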

At the beginning of the 21st century, nonparametric methods for censored demand estimation became widely disseminated (Bester & Hansen, 2009; Hoderlein & White, 2012; Matzkin, 2012). Despite the fact that nonparametric models made it possible to weaken the distributional assumptions on the dependent variable, estimation with several regressors was notable for a slow rate of convergence and required large computational power (Ozhegov & Ozhegova, 2017). Semiparametric approaches were also used to estimate demand with censoring (Khan & Powell, 2001; Chernozhukov, Fernandez-Val & Kowalski, 2015). Both papers provided an idea of quantile regression estimation. The main advantage of these approaches was that, by using different quantiles, it became possible to take into account the heterogeneity of effects; the main drawback, in turn, was the following: quantile regression was suitable only for explanatory, but not predictive, purposes because the quantile value is not observable in out-of-sample data.

When modeling demand in retail, in particular when predicting sales volumes, it is necessary to notice that the analyzed data are often censored on the left: among all the observations there is a large number of zero sales. Therefore, to obtain consistent and unbiased estimates, it is necessary to take the censoring of the data into account. In the case of daily pasta sales in the «Semya» data, more than 60% of sales observations are equal to zero, which is why we pay special attention to models with censoring. Why can we not just drop the zero observations and work only with positive sales volumes? If we do this, we will face the endogenous sample selection problem, which again leads to inconsistent and biased estimates (Söderbom, 2009).
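The cost of ignoring the zeros can be seen in a small simulation (all numbers are invented for the illustration): with left-censored sales, a plain OLS slope is attenuated toward zero whether the zero observations are dropped or kept, which is exactly why a censoring-aware model is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
price = rng.normal(size=n)
y_star = 1.0 - 2.0 * price + rng.normal(size=n)  # latent demand, true price slope -2
y = np.maximum(y_star, 0.0)                      # observed sales, censored at zero

def ols_slope(xv, yv):
    X = np.column_stack([np.ones(len(xv)), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0][1]

positive = y > 0
slope_dropped = ols_slope(price[positive], y[positive])  # zeros dropped: selection bias
slope_kept = ols_slope(price, y)                         # zeros kept: censoring bias

# Both estimates are pulled toward zero relative to the true slope of -2.
print(round(slope_dropped, 2), round(slope_kept, 2))
```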

1.4 Demand modeling in retail with aggregate data

Speaking about demand modeling in retail, it should be noted that for a relatively long time exclusively econometric approaches dominated this area. First of all, this was due to the fact that the data available for analysis were aggregated. Data sets consisted of the market shares occupied by a particular brand, sales volumes, average prices, etc. The most distinguished approach to estimating demand functions on aggregated sales data for differentiated products was proposed by Berry, Levinsohn and Pakes (1995). Their study considered the U.S. automobile market. Using information on the annual sales volumes of each car model, the average market sales price and the characteristics of the cars, the authors estimated the parameters of the individual utility function of the average household, as well as the contribution of each vehicle characteristic to the marginal cost function.

The further development of multiple choice models on aggregated data was reflected in the introduction of heterogeneity in consumer tastes by observable and unobservable characteristics (Nevo, 2001). In his paper, Nevo examined the U.S. ready-to-eat cereals market. The author put forward a model for demand estimation which had a more complex structure of the utility function from the consumption of each alternative than all multidimensional logit models existing at the time of writing. The utility function proposed by Nevo took into consideration the observed and unobservable characteristics of goods, as well as the heterogeneity of consumers in their tastes, which, in turn, depended on the observed and unobservable characteristics of consumers. The specification of the utility function was also complemented by the «zero alternative», that is, the inclusion of the consumer's ability to buy nothing.

Despite the fact that the breakthrough made by Berry et al. (1995) and Nevo (2001) was truly significant, there were still some problems with aggregate demand modeling in the late 20th century. The core issues for the researchers of that time were the endogeneity of marketing activities (in particular, price) and consumer heterogeneity. One of those who tried to cope with these challenges for aggregated data was Chintagunta (2001). Before demonstrating his new approach to estimating aggregated demand, Chintagunta explained the difficulties with using logit models. First of all, he underlined the impropriety of the IIA restriction, because each individual brand in the choice set is more similar to some brands and less similar to others. Secondly, he noted that the purchase incidence decision is often not equivalent to the brand choice decision; therefore, the use of a «zero alternative» in the model might not be entirely appropriate. Thirdly, even when accounting for the distinction between the purchase incidence and brand choice decisions, it was necessary to impose assumptions for computing the share of the unobserved «zero alternative». So, in order to cope with these problems, Chintagunta proposed the probit model instead of the commonly used logit models. As the main advantage of his approach, the author emphasized the avoidance of the IIA property. Moreover, Chintagunta highlighted the probit model's ability to distinguish between the purchase and the brand choice through the general covariance structure assumed for the utilities of the alternatives. The author conducted his research on the shampoo product category.
As a result of the study, Chintagunta concluded that it is necessary to account for endogeneity and consumer heterogeneity even after allowing for a non-IIA specification at the individual consumer level, and showed that the range of estimated elasticities was larger for the probit specification than for the logit one. As for the limitations of Chintagunta's study, they were the following: firstly, the appropriateness of the proposed approach only for categories where consumers make single-unit purchases; secondly, the extreme complexity of estimating a probit model in the case of a large number of alternatives.

Thus, despite the great success of the studies of Nevo (2001), Berry et al. (1995), Chintagunta (2001) and other researchers, some econometric problems were not resolved due to the aggregated nature of the data. Therefore, for the further development of demand studies, the need to use individual customer data emerged.

1.5 Demand modeling in retail with individual data

In the late 90s of the last century, Nielsen and IRI Marketing Research initiated the collection of microdata about consumers of retail outlets and their purchases (Richards et al., 2016). Now the data represented information about the price, time and other details of each customer's purchases.

Thus, economists, marketers and other specialists working with retail data got the opportunity to improve the descriptive and predictive properties of demand estimation models many times over. The use of individual data in studies made it possible to observe and analyze individual consumer choice. And the consideration of individual demand, in turn, made models richer and «more realistic».

This was especially useful for demand prediction models. Further we will focus on predictive models of demand.

1.6 Predictive demand models

Speaking of the econometric approaches for demand prediction that are most often used to work with retail microdata, the methods can be conditionally divided into the following groups: judgmental, extrapolation, and causal methods (Ali et al., 2009). Often the data used in such approaches are time series. Judgmental forecasting methods are often used in cases with a lack of historical data, when a new product is launched, when a competitor enters the market and in some other special situations (Hyndman & Athanasopoulos, 2014). It should be noticed that judgmental forecasts have several significant limitations. Thus, for the most part, judgmental methods are subjective and suffer from anchoring (the tendency to converge to an initial reference point). As for extrapolation, the methods vary from traditional moving averages and the Box-Jenkins methods, which allow identifying and extrapolating models of time trends, seasonality and autocorrelation, to more complicated methods, such as Vector Auto Regression (Ali et al., 2009). Causal forecasting methods are represented by models in which explanatory variables are considered the causes of the results. For example, in Cooper et al. (1999), short-term forecasts of promotions were obtained through the use of information on store- and chain-specific historical performance.
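As a minimal illustration of the extrapolation family, a moving-average forecast predicts the next value from the mean of the last few observations. The window length and the toy sales series below are arbitrary choices for the example, not figures from the data.

```python
import numpy as np

def moving_average_forecast(series, window=7):
    """Forecast the next value as the mean of the last `window` observations."""
    series = np.asarray(series, dtype=float)
    if len(series) < window:
        raise ValueError("series shorter than window")
    return float(series[-window:].mean())

# Toy daily sales with a weekly pattern; a 7-day window averages out the seasonality.
sales = [5, 6, 7, 9, 12, 15, 4] * 4
forecast = moving_average_forecast(sales, window=7)
print(forecast)  # mean of the last full weekly cycle
```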

In general, we should note that econometric approaches to forecasting demand were quite inflexible, always required the fulfillment of various assumptions on the error or dependent variable distribution, and their predictive properties were often far from ideal. So, with the advances in the availability of detailed data on purchases, many researchers changed their preferences in favor of machine learning methods. Machine learning methods have better predictive properties than traditional econometric approaches - this fact has been repeatedly proven by a number of researchers. For example, Agrawal and Schorling (1996) compared neural networks and multinomial logit models in predicting a brand's share in product categories and found that neural networks work better; Varian (2014) showed that regression trees are comparatively better than logistic regression for larger datasets and also demonstrated some advantages of such methods as bagging, bootstrapping and boosting over traditional econometric approaches; and, finally, Bajari et al. (2015) compared the predictive power of a number of traditional econometric models and machine learning methods, and came to an unambiguous conclusion about the superiority of the latter. Next, we consider the methods of machine learning in more detail.

1.7 Machine Learning Methods

Methods of machine learning (ML methods) that are widely used nowadays presuppose the construction of a model according to a principle that would be optimal by some criterion in each subspace of the data. ML methods assume heterogeneity of objects, as well as the unobservability or obscurity of the source of heterogeneity for the researcher (Ozhegov & Ozhegova, 2017). All methods can be divided into two groups: methods that reduce the number of parameters to be estimated (variable selection methods), and methods that help to find the best-fitting specification form (model selection methods). In turn, variable selection methods can also be divided into three groups, depending on the type of data. For sparse data, LASSO regression is preferable; for dense data, Ridge regression; and if we do not know exactly what kind of data we are working with, it is better to use Elastic net or Support Vector Machine (SVM) methods (Bajari et al., 2015). As for model selection methods, the most commonly used are the following: Boosting, Regression trees, Bagging, Random Forest and others. In our case, we will work with such ML methods as Ridge regression, LASSO regression and Random Forest.

1) LASSO regression.

LASSO regression refers to the shrinkage methods because of its effectiveness in shrinking the size of the predictor set. LASSO minimizes an objective function that includes a penalty for many large regression coefficients (Richards & Bonnet, 2016):

Σ_i (y_i − x_i'β)² + λ·Σ_j |β_j| → min over β (1)

where λ is a penalty parameter: when λ is equal to zero, the model is converted to the usual linear regression, whose estimates are obtained with the OLS method; when λ is significantly greater than zero, all parameter estimates are shrunk to zero. For the selection of λ, cross-validation is often used: as λ goes from infinity to 0, more fit is brought into the model, thereby reducing bias and increasing variance.
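A minimal sketch of the LASSO objective in practice, assuming scikit-learn is available: LassoCV picks the penalty λ (called alpha there) by cross-validation on simulated data in which only 3 of 30 predictors truly matter. All numbers are invented for the illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 500, 30
X = rng.normal(size=(n, p))

# Sparse truth: only the first three predictors affect the outcome.
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(size=n)

# The penalty parameter is chosen by 5-fold cross-validation.
model = LassoCV(cv=5, random_state=0).fit(X, y)
n_nonzero = int(np.sum(model.coef_ != 0))
print(model.alpha_, n_nonzero)  # chosen penalty and the size of the selected set
```

Typically most of the 27 irrelevant coefficients are set exactly to zero, which is the variable-selection property described above.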

2) Ridge regression.

The problem addressed by Ridge regression is similar to LASSO's: suppress the «bad» factors in the model and rely on the important ones. To do this, the penalty function is slightly modified: the penalty is now not the sum of absolute values, but the sum of squares. In this case, the model is severely penalized for large coefficients:

Σ_i (y_i − x_i'β)² + λ·Σ_j β_j² → min over β (2)

Note that, unlike LASSO, the squared penalty shrinks coefficients toward zero but does not set them exactly to zero, so Ridge does not literally remove factors from the model.
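The ridge objective has a closed-form solution, which makes the shrinkage easy to verify with plain NumPy (the data here are simulated and the λ value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam*I)^(-1) X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

b_ols = ridge(X, y, 0.0)      # lam = 0 reproduces OLS
b_ridge = ridge(X, y, 50.0)   # positive lam shrinks all coefficients

# The coefficient vector shrinks toward zero, but no entry becomes exactly zero.
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols), bool(np.all(b_ridge != 0)))
```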

3) Random Forest.

The Random Forest model solves several problems at once: it selects factors, chooses the best-fitting functional form and deals with the heterogeneity of observations (by constructing different regressions on subsamples). Random Forest is a model selection method based on the tree method.

The tree method splits the initial sample space into subsamples, choosing at each stage the split on one of the predictors that minimizes the prediction error of the tree.

As for the Random Forest, its principle is as follows: it constructs an ensemble of trees on a large number of random samples drawn from the training sample, with its own random subset of factors considered for each tree, and averages the trees' predictions; cross-validation can be used to tune the structure of the trees.
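A short sketch of the ensemble idea, assuming scikit-learn: a Random Forest fits a simulated nonlinear demand surface with an interaction, with no functional form specified by the researcher. The surface and sample sizes are invented for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n = 2000
X = rng.uniform(-2, 2, size=(n, 3))
# Nonlinear surface with an interaction: hard for a linear model, easy for trees.
y = 2.0 * np.sin(X[:, 0]) + (X[:, 1] > 0) * X[:, 2] + rng.normal(scale=0.3, size=n)

# Each tree is grown on a bootstrap sample with a random subset of factors
# considered at every split; predictions are averaged across the ensemble.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[:1500], y[:1500])

r2 = rf.score(X[1500:], y[1500:])  # out-of-sample R^2 on the held-out 500 points
print(round(r2, 2))
```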

These machine learning methods have been considered in more detail because we take the study of Bajari et al. (2015), in which exactly these methods are used, as the methodological basis of our work.

2. Data for the study

The data used within the framework of the research are provided by the Perm regional grocery chain «Semya». «Semya» has been operating on the territory of the Perm region since 2002. From its founding to the present day, the number of chain stores has increased from 14 to 82, including 15 outlets located in such regional towns as Berezniki, Kungur, Chusovoy, Chernushka, Krasnokamsk, Dobryanka, Lysva, Solikamsk and Chaikovskii.

Over the history of its existence, «Semya» has implemented several marketing projects, among which are: the creation of its own trademarks «Family Choice» and «Malosemeika»; the realization of the program for working with European suppliers without intermediaries, «Pryamye postavki»; the development of its own mobile app «Semya Mobile», etc. Besides, «Semya» was one of the first regional grocery chains to invite its customers to become members of a loyalty program. The program implies a discount of up to 7% for discount card holders. Also, every two weeks the grocery chain updates the list of products that are discounted for all store visitors. Usually a discount is granted for more than 200 trade items of different product categories, from bakery to household chemicals.

For the analysis, we have chosen one food category - pasta. This category has been singled out for several reasons: first of all, pasta is included in the compulsory list of socially important food products; it is stored for a long time and is characterized by an attractive price, high nutritional value, and speed and simplicity of preparation (Balashova & Mizhuyeva, 2016). Secondly, pasta is a daily demand food product, so its purchase is relatively frequent (Balashova & Mizhuyeva, 2016), which provides a large number of observations for analysis. Thirdly, the category «Pasta» is characterized by the breadth of the assortment in «Semya» stores. Therefore, we can take into account a large number of characteristics in our analysis.

The initial data from the «Semya» sales system included the full information on pasta purchases in Perm and the towns of the region over six calendar years: from December 1, 2009 to January 31, 2014 (Table 1). Note that, due to some technical limitations, further analysis was conducted on a random sample of 800,000 observations drawn from the 4,758,471 purchases made in Perm. Appendix 1 presents a comparative table of the main characteristics, as well as the results of multiple t-tests for the key variables of the initial set and the analyzed sample. Based on these, it can be concluded that the random sample is representative.

Table 1. Purchases of pasta in regional stores of «Semya» chain

City          Frequency   Share of total
Perm            4758471       76.98%
Berezniki        413570        6.69%
Kungur           339623        5.49%
Chusovoy         182078        2.95%
Dobryanka        172813        2.80%
Lysva            148728        2.41%
Solikamsk         72008        1.16%
Krasnokamsk       48835        0.79%
Chernushka        45679        0.74%

An observation in the data represents a stock keeping unit (henceforth SKU) that was available in a certain store on a specific date. It was known how many units of a single item were purchased every day and at what price they were sold. Also, with the use of the product catalog, a number of physical characteristics for each SKU were restored. Thus, for each purchase, not only the price and sales volume were utilized, but also such characteristics as the colour and shape of the pasta, the flour type, the volume and type of packaging, the country of origin, and the brand name. In addition, for each observation, the format of the store where the purchase was made was determined, as well as whether the product participated in a discount promotion. Next, we consider the descriptive statistics for all listed characteristics in more detail.

As mentioned earlier, during the analyzed period, 4,758,471 purchases of pasta were made in Perm, 800,000 of which randomly fell into the study sample. From 2009 to 2012, the frequency of purchases increased annually, but in 2013 the figures dropped significantly relative to the previous year. Such a decline in sales can be explained by the closure of a number of «Semya» stores during 2013. Nevertheless, in 2014, sales increased again and almost returned to the figures of 2012 (Table 2).

Table 2. Descriptive statistics on the time of pasta purchases

| Year | Mean | St. deviation | Frequency | Share of total |
|---|---|---|---|---|
| 2009 | 1.078 | 1.807 | 98023 | 12% |
| 2010 | 0.865 | 1.545 | 123058 | 15% |
| 2011 | 0.681 | 1.339 | 136610 | 17% |
| 2012 | 0.626 | 1.287 | 159507 | 20% |
| 2013 | 0.786 | 1.437 | 135197 | 17% |
| 2014 | 0.772 | 1.433 | 147605 | 19% |

Figures 1 and 2 show the dynamics of pasta purchases in all «Semya» stores by months within each year (Figure 1) and by months aggregated over the entire period under review (Figure 2). As can be seen, in all years except 2013 sales trended upward from the beginning to the end of the year. Notably, sales dipped every February, which can be explained by the smaller number of days in this month. In general, the dynamics of sales were similar across years, that is, there were certain monthly patterns in consumers' purchases. Therefore, our future econometric models should control, in particular, for months and years.

Fig. 1. Dynamics of pasta purchases in «Semya» stores by years

Fig. 2. Dynamics of pasta purchases in «Semya» from 2009 to 2014

We also assumed that demand varied by day of the week. To test this assumption, we constructed charts for single months in different stores. One such graph (for one representative store in April 2012) is shown in Figure 3. The graph reveals intraweek seasonality: peaks correspond to weekends, while the lowest sales volumes occur on Mondays or Tuesdays. In addition to accounting for weekly demand shocks, we decided to create a dummy variable for holidays. Table 3 shows the results of the t-test comparing sales on holiday and non-holiday days, which indicate the need to control for the holiday variable.
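The holiday comparison above can be reproduced with a standard two-sample t-test. The sketch below is illustrative only: it uses synthetic Poisson sales counts (the group sizes and rates are made up, not the thesis data) and computes Welch's t-statistic, which does not assume equal variances in the two groups:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical daily per-SKU sales volumes for the two groups;
# sizes and Poisson rates are illustrative only.
holiday = rng.poisson(lam=0.9, size=3000).astype(float)
regular = rng.poisson(lam=0.6, size=30000).astype(float)

def welch_t(a, b):
    """Welch's two-sample t-statistic (unequal variances allowed)."""
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

t_stat = welch_t(holiday, regular)
print(round(t_stat, 2))
```

A large positive t-statistic, as in Table 3, indicates that mean sales differ between holiday and non-holiday days.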

Fig. 3. Intramonth dynamics of pasta purchases (Example: April 2012)

Table 3. Results of the t-test for sales on holiday and non-holiday days

| | # of observations | Mean | St. error |
|---|---|---|---|
| Holiday | 770071 | 0.784 | 0.002 |
| Non-holiday | 29929 | 0.716 | 0.008 |
| t-statistic | | 8.345 | |

Continuing the analysis of the spatiotemporal characteristics of purchases, we describe the types of stores where the purchases were made. In total, from 2009 to 2014 pasta purchases were recorded in 33 stores of the following formats: discounter, small, middle, large and hypermarket. The greatest number of purchases in the analyzed period (70%) was made in small stores, while the largest average one-time purchase volume was observed in discounters (Table 4).

Table 4. Descriptive statistics on the stores of pasta purchases

| Format | Mean | St. deviation | Frequency | Share of total |
|---|---|---|---|---|
| Hyper | 1.765 | 2.142 | 27372 | 3% |
| Large | 1.016 | 1.639 | 129894 | 16% |
| Middle | 0.753 | 1.317 | 56367 | 7% |
| Small | 0.561 | 1.119 | 560422 | 70% |
| Discounter | 3.394 | 2.830 | 25945 | 4% |

In addition to the characteristics of pasta sales described above, such properties as the brand, the shape and colour of the pasta, the type and weight of the package, the country of origin and the type of flour were collected. In total, 38 brands were represented in the analyzed data. The shares of most of them over the period under review ranged from 0.01% to 5%, but there were also leading brands, such as «Makfa» with a share of 18.27%, «Granmulino» (9.35%), «Pasta Zara» (8.12%), «Gallina Blanca» (7.19%) and «Ameria» (5.66%) (Table 5). Almost two thirds of the purchased brands were Russian and about one third were Italian; Chinese, German, Kazakh and Vietnamese pasta together accounted for about 5% of sales (Table 6).

Table 5. Descriptive statistics on the pasta brands

| Brand | Mean | St. deviation | Frequency | Share of total |
|---|---|---|---|---|
| Makfa | 1.304 | 1.803 | 151757 | 18.27% |
| Granmulino | 0.650 | 1.348 | 78484 | 9.35% |
| Pasta Zara | 0.380 | 0.907 | 66235 | 8.12% |
| Gallina Blanca | 0.332 | 0.762 | 60449 | 7.19% |
| Ameria | 0.930 | 1.601 | 46863 | 5.66% |
| Maltagliati | 0.521 | 1.116 | 44782 | 5.44% |
| Divella | 0.281 | 0.707 | 34797 | 4.77% |
| Uvelka | 1.143 | 1.899 | 35779 | 4.35% |
| Semgarnir | 0.866 | 1.383 | 33224 | 3.96% |
| Soledoro | 0.951 | 1.535 | 10232 | 3.64% |
| Makstory | 0.495 | 0.987 | 27526 | 3.27% |
| Malosemeyka | 1.699 | 2.050 | 27421 | 3.26% |
| Arrighi | 1.128 | 1.768 | 24153 | 2.92% |
| Rummo | 0.316 | 0.891 | 16278 | 2.44% |
| Grand Pasta | 0.308 | 0.677 | 18737 | 2.22% |
| SunBonsai | 0.208 | 0.560 | 15732 | 1.87% |
| Nobrand | 0.180 | 0.496 | 9900 | 1.84% |
| Tomadini | 0.596 | 1.091 | 12458 | 1.48% |
| Rollton | 0.652 | 1.147 | 10860 | 1.28% |
| Shebekenskie | 0.767 | 1.349 | 9976 | 1.19% |
| Dobrodeya | 0.332 | 0.665 | 7312 | 0.87% |
| SenSoy | 0.389 | 0.747 | 7297 | 0.86% |
| Garofalo | 0.187 | 0.561 | 4580 | 0.83% |
| DeCecco | 0.478 | 0.760 | 521 | 0.69% |
| Zalezione | 0.312 | 0.879 | 4906 | 0.62% |
| Smak | 4.899 | 2.902 | 3190 | 0.53% |
| ShaHeNodles | 0.215 | 0.615 | 3469 | 0.42% |
| Barilla | 0.693 | 1.046 | 2233 | 0.38% |
| Vnuk | 0.490 | 0.831 | 2999 | 0.37% |
| Kammy | 0.336 | 0.626 | 2732 | 0.33% |
| Business Lunch | 0.952 | 1.669 | 1565 | 0.18% |
| ExtraM | 0.828 | 1.238 | 1483 | 0.17% |
| Longkou | 0.407 | 0.873 | 599 | 0.07% |
| 3Glocken | 0.288 | 0.522 | 417 | 0.05% |
| Makmaster | 0.280 | 0.569 | 396 | 0.05% |
| Saratov | 4.606 | 3.122 | 340 | 0.05% |
| KingLion | 2.375 | 1.944 | 128 | 0.02% |
| Souzpishprom | 5.333 | 3.132 | 15 | 0.00% |

Table 6. Descriptive statistics on the pasta country of origin

| Country | Frequency | Share of total |
|---|---|---|
| Russia | 518504 | 64.81% |
| Italy | 241350 | 30.17% |
| China | 29700 | 3.28% |
| Vietnam | 7297 | 0.91% |
| Kazakhstan | 2732 | 0.34% |
| Germany | 417 | 0.05% |

Now we turn to the physical characteristics of the pasta. As to weight, purchases were most frequent for medium packages of 400 to 500 grams, while the largest average sales volumes were observed for 450-gram and 800-gram packs. Note also that almost all purchased items (97.11%) came in packets and only 2.89% in boxes (Table 7). By colour, the best-selling pasta was uncoloured (more than 97% of all sales); by shape, penne, fusilli and spaghetti (about 11% each); by type of flour, wheat (approximately 96%) (Table 8).

Table 7. Descriptive statistics on the pasta packages

| Weight (g) | Mean | St. deviation | Frequency | Share of total |
|---|---|---|---|---|
| 150 | 0.39 | 0.87 | 599 | 0.07% |
| 200 | 0.43 | 1.16 | 8698 | 1.09% |
| 250 | 0.27 | 0.63 | 59347 | 7.42% |
| 300 | 0.35 | 0.86 | 29145 | 3.64% |
| 350 | 0.62 | 1.18 | 4858 | 0.61% |
| 400 | 1.01 | 2.31 | 172607 | 21.58% |
| 450 | 1.39 | 2.79 | 167011 | 20.88% |
| 500 | 0.68 | 1.67 | 285274 | 35.66% |
| 600 | 0.74 | 1.78 | 4967 | 0.62% |
| 700 | 0.93 | 2.27 | 2472 | 0.31% |
| 800 | 1.39 | 2.33 | 58454 | 7.31% |
| 950 | 0.71 | 1.08 | 6520 | 0.82% |
| 1000 | 0.60 | 0.86 | 48 | 0.01% |

| Type of package | Mean | St. deviation | Frequency | Share of total |
|---|---|---|---|---|
| Packet | 0.91 | 2.10 | 776880 | 97.11% |
| Box | 0.50 | 1.41 | 23120 | 2.89% |

Table 8. Descriptive statistics on the pasta colour, form and type of flour

| Colour of pasta | Frequency | Share of total |
|---|---|---|
| Without colour | 780801 | 97.60% |
| Multi | 16083 | 2.01% |
| Green | 2480 | 0.31% |
| Black | 402 | 0.05% |
| Red | 319 | 0.04% |

| Form of pasta | Frequency | Share of total |
|---|---|---|
| Penne | 94001 | 11.75% |
| Fusilli | 91920 | 11.49% |
| Spaghetti | 91600 | 11.45% |
| Stringozzi | 86320 | 10.79% |
| Fettuccine | 68805 | 8.60% |
| Sedani | 65766 | 8.22% |
| Lumache | 41200 | 5.15% |
| Conchiglie | 34237 | 4.28% |
| Tortiglioni | 31521 | 3.94% |
| Tagliatelle | 31532 | 3.94% |
| Rotelle | 28880 | 3.61% |
| Boccoli | 28321 | 3.54% |
| Radiatori | 21760 | 2.72% |
| Bucatini | 13682 | 1.71% |
| Farfalle | 13358 | 1.67% |
| Fettuccine | 12400 | 1.55% |
| Fiori | 12004 | 1.50% |
| Lasagna | 10401 | 1.30% |
| Lagane | 8488 | 1.06% |
| Scialatelli | 6163 | 0.77% |
| Canestrini | 6001 | 0.75% |
| Alfabeto | 1679 | 0.21% |

| Type of flour | Frequency | Share of total |
|---|---|---|
| Wheat | 766241 | 95.78% |
| White rice | 22321 | 2.79% |
| Bean | 5278 | 0.66% |
| Buckwheat | 3276 | 0.41% |
| Brown rice | 1603 | 0.20% |
| Starch | 562 | 0.07% |
| Rye | 400 | 0.05% |
| Soybeans | 317 | 0.04% |

Since the aim of our research is demand modeling, we proceed to the description of the integral components of demand: sales volume and price. The histogram of sales volume frequencies shows that 99% of all one-time sales did not exceed 10 units. It should also be noted that more than 60% of recorded sales were zero, so when modeling demand on such data it is necessary to take censoring into account, which is what we do in this work (Figure 4).

As to the price, the frequency histogram demonstrates that the average price of a sold pasta package was 48 rubles, while the major share of packs was purchased at prices of up to 100 rubles (Figure 5). It is also worth emphasizing that 353600 pasta packages (44.2% of all observations) were sold as part of promotional campaigns.

Fig. 4. Frequency histogram of pasta purchases

Fig. 5. Frequency histogram of average price per purchased pasta package

Thus, having conducted the preliminary data analysis, we noted the variability of the category under study across various characteristics, made several assumptions about the inclusion of a number of variables in the model, and confirmed the necessity of using censored demand models.

It should also be noted that before constructing the models all variables were standardized, since some machine learning methods (in particular, Ridge and LASSO regressions) work correctly only when this condition is satisfied.
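A minimal sketch of this standardization step, scaling each column to zero mean and unit variance so the Ridge/LASSO penalty treats all regressors equally (the two columns here are hypothetical stand-ins for actual regressors):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative feature matrix: two regressors on very different scales,
# e.g. log price and package weight in grams (names are hypothetical).
X = np.column_stack([rng.normal(4.0, 0.3, 500),
                     rng.normal(500.0, 150.0, 500)])

# Standardize each column: subtract the mean, divide by the st. deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

After this transformation the penalty terms in Ridge and LASSO no longer depend on the arbitrary units of the original variables.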

3. Methodology

According to the literature review, machine learning methods cope better with demand prediction (in particular, in grocery retailing), because they produce better out-of-sample fits than linear models without loss of in-sample fit quality (Bajari et al., 2015). Therefore, in order to achieve the most accurate prediction, we use three machine learning methods and one traditional econometric approach, linear regression, as a baseline model. We partially follow the algorithm described in Bajari et al. (2015), modifying it by adding the stages of estimating censored models. The main steps of the empirical part of the study are as follows:

1) Split the data randomly into three groups for subsequent cross-validation: 25% of the data falls into the test sample, 15% into the validation sample, and 60% into the training sample.

2) Construct a dummy variable for observation censoring: d_jmt = 1 if the sales volume of the j-th SKU purchased in store m on day t is greater than zero, and d_jmt = 0 otherwise.

3) Train a probit model on the training set, where the dummy variable created in the previous step is taken as the dependent variable and the variables described in equation (3) are used as regressors.

4) Based on the probit model, obtain the predicted probabilities of a non-zero sale and classify observations using a given threshold.

5) Split the training set into «censored» and «uncensored» groups by this threshold.

6) Train a model on the continuous («uncensored») part of the training set obtained in the previous step.

7) Combine the predictions from the models of steps (3) and (6): if the censoring dummy predicted by model (3) is 0, or the prediction of the continuous part of demand by model (6) is below 0, then the predicted demand is 0; otherwise the prediction equals that of model (6). Calculate the RMSE on the test set for the given threshold and choose the optimal threshold by cross-validated RMSE.

8) Obtain predictions on the test set from the various classes of prediction models (linear regression, LASSO, Ridge, random forest).

9) Train an ensemble model on the predictions from the various model classes and obtain the weight of each model in the final ensemble.

10) Calculate the RMSE on a second test set for the final ensemble model and for the individual predictive models.

The algorithm without censoring is the same, except that the threshold used to separate censored observations is set to 0 by default, which amounts to assuming that all observations are uncensored.
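The two-stage logic of steps (2)-(7) can be sketched as follows. This is a minimal illustration on synthetic censored data, not the thesis implementation: the thesis estimates a probit in R, while here a logistic regression from scikit-learn serves as a stand-in for the first stage, and the threshold is fixed at 0.5 rather than tuned by cross-validated RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))                      # stand-in regressors
latent = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
y = np.where(latent > 0, latent, 0.0)            # demand censored at zero

# Stage 1: classify zero vs positive sales (logistic stand-in for probit).
clf = LogisticRegression().fit(X, (y > 0).astype(int))
p_positive = clf.predict_proba(X)[:, 1]

# Stage 2: regression trained only on the uncensored (positive) part.
pos = y > 0
reg = LinearRegression().fit(X[pos], y[pos])
y_cont = reg.predict(X)

# Combine: predict zero if the classifier says "no sale" at the threshold,
# or the continuous prediction is negative; otherwise use the regression.
threshold = 0.5          # fixed here; tuned by cross-validated RMSE in the thesis
y_hat = np.where((p_positive < threshold) | (y_cont < 0), 0.0, y_cont)

rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(round(rmse, 3))
```

The combination rule guarantees non-negative demand predictions, which a single linear model on the full censored sample does not.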

The entire empirical part of the work is conducted in RStudio, an open-source environment for data analysis, using the programming language R.

Below, we consider the above steps in detail.
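The ensemble step (9) amounts to regressing the target on the out-of-sample predictions of the individual models; the fitted coefficients are the model weights. A minimal sketch on simulated predictions (the three error levels are arbitrary stand-ins for prediction models of different quality):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y = rng.normal(size=n)

# Hypothetical out-of-sample predictions from three model classes
# (e.g. linear regression, LASSO, random forest), simulated here as
# noisy versions of the target with different error levels.
preds = np.column_stack([y + rng.normal(0, s, n) for s in (0.5, 0.8, 1.2)])

# Ensemble step: regress the target on the individual predictions;
# the least-squares coefficients are the weights of the final ensemble.
weights, *_ = np.linalg.lstsq(preds, y, rcond=None)
ensemble = preds @ weights

rmse_ensemble = np.sqrt(np.mean((y - ensemble) ** 2))
rmse_best = min(np.sqrt(np.mean((y - preds[:, k]) ** 2)) for k in range(3))
print(weights.round(3), round(rmse_ensemble, 3), round(rmse_best, 3))
```

Because the models' errors are not perfectly correlated, the weighted combination achieves a lower RMSE than the best single model.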

3.1 Econometric models of the demand function

In order to construct the predictive models at step (6), we apply several regression methods.

1) Linear Regression Model

The linear regression is a typical model for demand estimation: it approximates demand with a linear function. In our research, the model specification is the following:

q_jmt = X_jmt β + ε_jmt, (3)

where:

q_jmt is the volume of purchases of the j-th SKU in store m on day t;

X_jmt is the matrix of attributes, including the log of price, product characteristics, promotional indicators and time attributes (dummies for month, year, intraweek seasonality and holidays);

ε_jmt is an idiosyncratic shock specific to each product, market and time.

The model is estimated by ordinary least squares. The linear regression serves as the baseline specification for the machine learning methods discussed below.
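A minimal numerical sketch of estimating equation (3) by OLS; the regressors and coefficient values are hypothetical stand-ins for the actual attribute matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
# Illustrative design matrix standing in for X_jmt: an intercept,
# log price, a promo dummy and a month dummy (all hypothetical).
X = np.column_stack([np.ones(n),
                     rng.normal(3.8, 0.4, n),      # log price
                     rng.integers(0, 2, n),        # promo dummy
                     rng.integers(0, 2, n)])       # month dummy
beta_true = np.array([5.0, -1.2, 0.8, 0.3])        # demand falls in price, rises on promo
y = X @ beta_true + rng.normal(0, 0.5, n)          # eq. (3) with normal shocks

# OLS estimate: beta_hat minimizes the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat.round(2))
```

With a correctly specified linear model, the OLS estimates recover the true coefficients up to sampling noise.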

2) Ridge and LASSO Regressions

As the next model specifications, we use Ridge and LASSO regressions. Ridge regression belongs to the so-called dense models: if we sort all coefficients in descending order of magnitude, we find that quite a lot of factors strongly affect the dependent variable. In our study, we assume the presence of a rather large number of factors (product characteristics, store and time attributes) that affect demand, so the use of ridge regression seems reasonable.
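The contrast between the two penalties can be illustrated on synthetic data (the thesis fits these models in R; here scikit-learn is used, and all sizes and penalty values are illustrative): ridge shrinks every coefficient but keeps them all nonzero, while LASSO drives the irrelevant ones exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
n, p = 1000, 20
X = rng.normal(size=(n, p))          # already standardized features
# Only the first three features truly matter (a sparse setting).
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(0, 1.0, n)

ridge = Ridge(alpha=10.0).fit(X, y)  # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # sets small coefficients exactly to zero

# Count nonzero coefficients in each fitted model.
print(np.sum(np.abs(ridge.coef_) > 1e-8), np.sum(np.abs(lasso.coef_) > 1e-6))
```

This is why ridge suits the dense setting described above, whereas LASSO doubles as a variable-selection device when only a subset of the attributes drives demand.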

To select a set of imp...

