Methods of machine learning for censored demand prediction
Econometric approaches to modeling censored demand as a tool for obtaining consistent and unbiased parameter estimates. Neglect of data censoring when building a forecast as a significant shortcoming of demand analysis by machine learning methods.
Category | Economic and mathematical modeling |
Type | diploma thesis |
Language | English |
Date added | 23.09.2018 |
File size | 192,4 K |
According to the estimation results, the number of pasta packs purchased is positively influenced by brand uniqueness, package weight, unusual shapes and colours of pasta and, surprisingly, the country of origin (Chinese pasta appears more attractive than Italian pasta, which was used as the base category). As for the time attributes, holidays have a negative influence on the purchase volume, and fewer pasta purchases are made in all years and months compared with April 2014 (the base category). More importantly, the model that accounts for censorship has better explanatory properties: the adjusted R² of the censored linear regression is higher than that of the model without censorship accounting. However, the predictive power of the censored model appears to be worse: its prediction quality, measured by the RMSE, is lower than that of the model without censoring. Such a result is counterintuitive, since accounting for the zero values that make up the majority of observations should, on the contrary, increase the predictive power of the model.
The reason probably lies in the threshold: by default it is equal to 0,5, but does such a cut-off give the minimum prediction error?
To test this, we run a loop that chooses the best threshold using the RMSE value as the optimization criterion.
The results of the loop are presented in Table 10.
Table 10. RMSE values for different censorship thresholds
Threshold | 0,00 | 0,05 | 0,10 | 0,15 | 0,20 | 0,25 | 0,30 | 0,35 | 0,40 | 0,45 | 0,50
RMSE | 1,245 | 1,240 | 1,240 | 1,237 | 1,234 | 1,232 | 1,233 | 1,236 | 1,242 | 1,252 | 1,266
Threshold | 0,55 | 0,60 | 0,65 | 0,70 | 0,75 | 0,80 | 0,85 | 0,90 | 0,95 | 1,00
RMSE | 1,288 | 1,314 | 1,347 | 1,380 | 1,413 | 1,450 | 1,491 | 1,526 | 1,586 | 1,647
Based on the results in Table 10, we can conclude that the best cut-off point, at which the minimum RMSE is reached, is 0,25. Note that at this threshold the predictive power of the linear regression with censorship accounting (RMSE=1,232) is higher than that of the linear model that does not take censorship into account (RMSE=1,254). Thus, we have found the optimal cut-off point for the linear model; a similar procedure is then carried out for the three remaining models (Ridge regression, LASSO regression and Random Forest).
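The threshold search can be sketched as follows. This is a minimal illustration on simulated data, not the code used in the study. It assumes that the first stage returns, for every observation, the probability of a non-zero sale and that an observation is predicted as zero when this probability falls below the cut-off; the names `best_threshold`, `p_positive` and `y_hat` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def best_threshold(p_positive, y_hat, y_true, grid=None):
    """Grid-search the cut-off that minimizes the RMSE.

    p_positive : first-stage probability that the sale is non-zero (assumed input)
    y_hat      : second-stage prediction of the sales volume
    y_true     : observed sales volume
    """
    if grid is None:
        grid = np.round(np.arange(0.0, 1.0001, 0.05), 2)  # 0.00, 0.05, ..., 1.00
    p_positive = np.asarray(p_positive, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    scores = {}
    for c in grid:
        # Observations below the cut-off are treated as censored: predict 0.
        y_pred = np.where(p_positive >= c, y_hat, 0.0)
        scores[float(c)] = rmse(y_true, y_pred)
    best = min(scores, key=scores.get)
    return best, scores

# Simulated data (purely illustrative): many zero sales, crude second stage.
rng = np.random.default_rng(0)
n = 5000
p = rng.uniform(size=n)                       # stand-in for probit probabilities
sold = rng.uniform(size=n) < p
y = np.where(sold, rng.poisson(2, size=n) + 1, 0).astype(float)
y_hat = np.full(n, 3.0)                       # stand-in for a regression forecast

best, scores = best_threshold(p, y_hat, y)
print("best cut-off:", best, "RMSE:", round(scores[best], 3))
```

At a cut-off of 0 nothing is censored, and at a cut-off of 1 every prediction is zero, so the two ends of the grid bracket the two-stage forecasts in between.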
After evaluating the parameters of the basic linear model, the actual dependent variable is fitted on the training set with each of the four models (linear regression, ridge regression, LASSO and Random Forest). Then, for every model, the measure of prediction quality (RMSE) is calculated for the training and test subsamples (Tables 11, 12). It is worth noting that the initial expectation of a lower RMSE for censored models was confirmed: each model with censorship accounting has greater predictive power than its counterpart without it.
Table 11. Root Mean Square Error (RMSE) for model specifications without censorship accounting and the models' weights in the ensemble model
Model | RMSE | Weight in the linear combined model
Linear regression | 1,254 | 1%
Ridge regression | 1,253 | 6%
LASSO regression | 1,242 | 20%
Random Forest | 0,916 | 73%
Table 12. Root Mean Square Error (RMSE) for model specifications with censorship accounting and the models' weights in the ensemble model
Model | RMSE | Weight in the linear combined model
Linear regression | 1,233 | 13%
Ridge regression | 1,249 | 8%
LASSO regression | 1,230 | 12%
Random Forest | 0,904 | 67%
According to the RMSE results, the Random Forest model provides the greatest predictive power in both cases, with and without censorship accounting. This result is explained, first of all, by the Random Forest mechanism itself: it solves the minimization problem internally many times and returns the best variant as the final result. Besides, Random Forest replaces missing values with column medians; owing to subsampling and the randomness of the grown trees, this procedure does not significantly affect accuracy, but it allows us to take into account a larger number of observations and variables.
The next step after RMSE estimation is to determine the weight of each model in the final ensemble model. For this purpose the validation set is used: the values of the dependent variable predicted by the four models are treated as regressors in a constrained linear regression, where the actual sales volume is the dependent variable. The constraints are as follows: first, the parameter estimates must sum to one (since they will later be used as weights); second, each parameter must be positive (for the same reason). The estimated weights are presented in Tables 11 and 12 as the models' weights in the combined model. Random Forest receives the largest weight in both ensembles (without and with censorship accounting) owing to its good performance.
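A constrained regression of this kind can be sketched as follows. This is an illustration on simulated predictions, not the code used in the study. The solver (SLSQP from `scipy.optimize.minimize`) and the function name `ensemble_weights` are assumptions, and the positivity constraint is implemented as a lower bound of zero.

```python
import numpy as np
from scipy.optimize import minimize

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def ensemble_weights(preds, y):
    """Least-squares weights subject to w_k >= 0 and sum(w) = 1.

    preds : (n_obs, n_models) validation-set predictions, one column per model
    y     : (n_obs,) observed sales volume on the validation set
    """
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    k = preds.shape[1]

    def mse(w):
        resid = y - preds @ w
        return float(np.mean(resid ** 2))

    result = minimize(
        mse,
        x0=np.full(k, 1.0 / k),                    # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,                   # non-negative weights
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x

# Simulated validation set: four forecasts of differing accuracy (illustrative).
rng = np.random.default_rng(1)
y = rng.gamma(2.0, 1.0, size=2000)
noise_sd = [1.2, 1.2, 1.1, 0.6]                    # the last model is the most accurate
preds = np.column_stack([y + rng.normal(0.0, s, size=y.size) for s in noise_sd])

w = ensemble_weights(preds, y)
y_ens = preds @ w                                  # weighted ensemble forecast
print("weights:", np.round(w, 3), "ensemble RMSE:", round(rmse(y, y_ens), 3))
```

In the simulation, as in Tables 11 and 12, the most accurate model receives the largest weight. Since any single model is itself a feasible set of weights, the combined forecast cannot do worse than the best individual model on the data used to fit the weights.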
The final stage is the evaluation of two ensemble models (without and with censorship accounting). The sales volumes predicted by the four models (Linear regression, Ridge regression, LASSO regression and Random Forest), weighted by the parameters of the constrained linear model built on the validation sample, become the regressors of the ensemble models. The weights of each model in the final ensemble, as well as the RMSE of the combined models, are presented in Table 13.
Table 13. Root Mean Square Error (RMSE) for ensemble models with and without censorship accounting and the models' weights in the ensemble model
Model | Without censorship accounting | With censorship accounting
Linear regression | 3% | 24%
Ridge regression | 32% | 27%
LASSO regression | 33% | 23%
Random Forest | 32% | 26%
RMSE (Test sample) | 0,902 | 0,877
Interpreting the results in Table 13, we can conclude that the predictive power of the model with censoring is higher than that of the model without censorship accounting. This result confirms our initial hypothesis: using machine learning techniques in conjunction with censorship accounting increases the predictive power of the model and thereby improves the results of the study.
Conclusion
Demand estimation in retail is well developed in the academic literature; nevertheless, there are still some gaps and contentious issues that generate debate among researchers. In particular, the potential of machine learning methods for censored demand prediction in the food industry has not been studied so far. This study is an attempt to fill this gap. Building on previous demand studies reporting that machine learning methods have more predictive power (Varian, 2014; Bajari et al., 2015) and that allowing for data censorship leads to more consistent and less biased estimates (Tobin, 1958), we assume that a model combining machine learning methods with censoring will perform better than either approach alone. The models we focus on in this paper include Linear regression as the baseline model, Ridge regression, LASSO regression and Random Forest.
In this paper we analyze the demand for one product category (pasta) using purchase data provided by a regional retail food chain. The initial data contain full information on pasta purchases for 6 years: from December 1, 2009 to January 31, 2014. The analyzed sample consists of 800000 observations. Since more than 60% of pasta sales are equal to zero, one needs to account for demand censorship.
We propose an estimator for demand prediction that exploits the potential of machine learning methods while also accounting for data censorship. The estimator is based on comparing the prediction accuracy of machine learning models with and without censorship accounting and combining them into constrained linear ensemble models. Censoring is carried out by a specially developed algorithm. In the first stage, an optimal cut-off point, which separates censored from uncensored observations, is chosen using the minimum RMSE as the optimization criterion. Then a probit model classifies the observations into censored and uncensored ones. Next, censored observations are assigned a value of 0, and uncensored ones are used to train the model. After that, the predictive accuracy of the model, expressed by the RMSE, is determined. Finally, the algorithm is repeated for all models. It should be noted that each censored model taken separately (Linear regression, Ridge regression, LASSO regression and Random Forest) has better predictive properties than the same model without censorship accounting, and combining the models via weighted linear regression improves the prediction accuracy even further. Thus, the prediction error of the ensemble model with censoring is 0,877, while that of the ensemble without censorship is 0,902.
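The censoring algorithm can be written compactly. The notation is introduced here for illustration only and does not come from the original text: $\hat{p}_i$ is the probit-predicted probability that observation $i$ has non-zero sales, $\hat{\gamma}$ are the probit coefficients, $\hat{f}_m$ is model $m$ trained on uncensored observations, $c$ is the cut-off and $\mathcal{C}$ is the grid of candidate cut-offs. The direction of the inequality is an assumption; it matches the case in which every forecast is zero at a cut-off of 1.

```latex
\hat{p}_i = \Phi\!\left(x_i^{\top}\hat{\gamma}\right), \qquad
\hat{y}_{i,m}(c) = \mathbf{1}\{\hat{p}_i \ge c\}\,\hat{f}_m(x_i), \qquad
c_m^{*} = \arg\min_{c \in \mathcal{C}}
\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_{i,m}(c)\bigr)^{2}}
```

The search for $c_m^{*}$ is repeated separately for each of the four models $m$.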
All in all, our approach shows that accounting for demand censorship makes model predictions more accurate; the use of linear ensemble models makes it possible to select the most powerful models automatically and produces the best prediction accuracy. Since the research is based on real data from a retail food chain, we can assert that the result has not only theoretical but also practical significance. The results can be used by the retail chain to set the optimal price for goods with different characteristics and in various time periods, as well as for optimal inventory management (although the latter requires information about the costs of storing products).
Although some significant results were obtained in this paper, there are issues that limit the research, as well as directions for its further development. First of all, it seems obvious that not all factors affecting the sales volume of pasta are included as explanatory variables. We had no information, for example, about the size of discounts on purchased packages or about the choice sets offered to consumers in each store. Neglecting these factors causes an endogeneity problem, so the estimates may be inconsistent. Another possible cause of price endogeneity is the simultaneous formation of demand and price, which arises from the cross-sectional data structure. In this case, the method of instrumental variables should help to obtain consistent estimates (Tsyplakov, 2007). The cost of the pasta package in other regional stores of the chain (outside Perm) can be taken as the instrumental variable. These indicators correlate with the price and do not change under the influence of demand shocks, that is, they are unrelated to the random error. Secondly, the research is conducted for only one product category (pasta); therefore, when applying the developed model to other products, it is necessary to take their specificity into account and adjust the model. Thirdly, we have data on purchases in only one retail chain, so information on the recency and frequency of purchases that customers made elsewhere remains unobservable, and the estimates can therefore be somewhat biased. Finally, due to technical limitations, we were able to use only 800000 observations, which is less than 20% of the initial data. It can therefore be assumed that with a larger sample even more accurate parameter estimates can be obtained and the predictive power of the models can be enhanced.
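The instrumental-variables idea mentioned above can be illustrated with a two-stage least squares sketch on simulated data. Everything in it is an assumption made for illustration: the data-generating process, the variable names and the use of a single instrument (the price in another region). Censoring is ignored for simplicity, and the naive second-stage standard errors would not be valid.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage_least_squares(y, price, instrument):
    """2SLS with one endogenous regressor, one instrument and an intercept."""
    n = y.size
    Z = np.column_stack([np.ones(n), instrument])
    # Stage 1: project the endogenous price on the instrument.
    price_hat = Z @ ols(Z, price)
    # Stage 2: regress demand on the fitted price.
    X_hat = np.column_stack([np.ones(n), price_hat])
    return ols(X_hat, y)

# Simulated market: a local demand shock moves both the local price and sales.
rng = np.random.default_rng(2)
n = 20000
cost_shock = rng.normal(size=n)        # common cost shifter across regions
demand_shock = rng.normal(size=n)      # unobserved local demand shock
other_region_price = 50 + 5 * cost_shock + rng.normal(size=n)
price = 50 + 5 * cost_shock + 3 * demand_shock + rng.normal(size=n)
true_slope = -0.10
sales = 10 + true_slope * price + demand_shock

b_ols = ols(np.column_stack([np.ones(n), price]), sales)
b_iv = two_stage_least_squares(sales, price, other_region_price)
print("OLS slope:", round(b_ols[1], 3), "2SLS slope:", round(b_iv[1], 3))
```

In this simulation the OLS slope is pulled towards zero because the price is correlated with the demand shock, while the 2SLS slope stays close to the true value, since the instrument shares only the cost component of the price.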
One possible direction for further research is the use of other machine learning methods, such as SVM or boosting, which are currently considered to be among the best methods of demand prediction.
Another possible direction is an attempt to solve the endogeneity problem by using data with a panel structure or by implementing instrumental variables.
References
1. Balashova M.V., Mizhueva S.A. A study of consumer preferences and assortment policy in the pasta market of the city of Astrakhan // Vestnik AGTU. 2016. No. 1. P. 74-81. (In Russian.)
2. Tsyplakov A. An excursion into the world of instrumental variables // Quantile. 2007. No. 2. P. 21-47. (In Russian.)
3. Agrawal D., Schorling C. Market share forecasting: An empirical comparison of artificial neural networks and multinomial logit model // Journal of Retailing. 1996. No. 72(4). P. 383-407.
4. Ali Ö. G., Sayin S., Woensel T., Fransoo J. SKU demand forecasting in the presence of promotions // Expert Systems with Applications. 2009. No. 36(10). P. 12340-12348.
5. Bajari B. P., Nekipelov D., Ryan S. P., Yang M. Machine Learning Methods for Demand Estimation // National Bureau of Economic Research. 2015. No. w20955.
6. Berry S.T., Levinsohn J., Pakes A. Automobile prices in market equilibrium // Econometrica. 1995. No. 63(4). P. 841-890.
7. Bhat C.R. A multiple discrete-continuous extreme value model: formulation and application to discretionary time-use decisions // Transportation Research Part B: Methodological. 2005. No. 39(8). P. 679-707.
8. Chernozhukov V., Fernandez-Val I., Kowalski A. E. Quantile regression with censoring and endogeneity // Journal of Econometrics. 2015. No. 186(1). P. 201-221.
9. Chintagunta P. K. Endogeneity and Heterogeneity in a Probit Demand Model: Estimation Using Aggregate Data // Marketing Science. 2001. No. 4(20). P. 442-456.
10. Cooper L.G. PromoCast™: A New Forecasting Method for Promotion Planning // Marketing Science. 1999. No. 18. P. 301-316.
11. Dong D., Gould B., Kaiser H. Food demand in Mexico: An application of the Amemiya-Tobin approach to the estimation of a censored food system // American Journal of Agricultural Economics. 2004. No. 86(4). P. 1094-1107.
12. Dong. D. S., Kaiser T.M., Harry M. Modeling the Household Purchasing Process Using a Panel Data Tobit Model // Research Bulletin 03-07. National Institute for Commodity Promotion Research and Evaluation, Department of Applied Economics and Management, Cornell University. 2003.
13. Guadagni P. M., Little J. D. C. A Logit Model of Brand Choice Calibrated on Scanner Data // Marketing Science. 1983. No. 27(1). P. 29-48.
14. Guadagni P. M., Little J. D. C. When and what to buy: A nested logit model of coffee purchase // Journal of Forecasting. 1998. No. 27. P. 303-326.
15. Hanemann W.M. Discrete/continuous models for consumer demand // Econometrica. 1984. No. 53. P. 541-561.
16. Harris T.R., Shonkwiler J.S. Application of Maximum Likelihood to a Bivariate Two-Limit Tobit Model for Estimation of Rural Retail Sales Potential // The Review of Regional Studies. 1994. No. 24(2). P. 143-159.
17. Heckman J. J. Sample selection bias as a specification error // Econometrica. 1979. No. 48. P. 153-161.
18. Khan S., Powell J. L. Two-step estimation of semiparametric censored regression models // Journal of Econometrics. 2001. No. 103. P. 73-110.
19. Lancaster K.J. A new approach to consumer theory // Journal of Political Economy. 1966. No. 74. P. 132-156.
20. Matzkin R. L. Identification in nonparametric limited dependent variable models with simultaneity and unobserved heterogeneity // Journal of Econometrics. 2012. No. 166(1). P. 106-115.
21. McFadden D. Conditional logit analysis of qualitative choice behavior // Frontiers in Econometrics. 1974. P. 105-142.
22. Nevo A. Measuring Market Power in the Ready-to-Eat Cereal Industry // Econometrica. 2001. No. 69(2). P. 307-342.
23. Ozhegov E. M., Ozhegova A. Regression tree model for analysis of demand with heterogeneity and censorship // HSE Working Papers BRP 174/EC/2017. National Research University Higher School of Economics. 2017.
24. Perali F., Chavas J. Estimation of Censored Demand Equations from Large Cross-Section Data // American Journal of Agricultural Economics. 2000. No. 82. P. 1022-1037.
25. Richards T. J., Bonnet C. Models of Consumer Demand for Differentiated Products // Toulouse School of Economics Working Paper. 2016. No. 16(741).
26. Richards T.J., Gomez M. I., Pofahl G.F. A multiple-discrete/continuous model of price promotion // Journal of Retailing. 2012. No. 88(2). P. 206-225.
27. Söderbom M. Sample selection bias. Estimation of nonlinear models with panel data // Applied Econometrics. 2009. No. 15. P. 1-14.
28. Tobin J. Estimation of Relationships for Limited Dependent Variables // Econometrica. 1958. No. 26(1). P. 24-36.
29. Varian H. R. Big data: New tricks for econometrics // Journal of Economic Perspectives. 2014. No. 28(2). P. 3-27.
Appendix
Table 14. Comparison of the initial data set and the analyzed sample by descriptive statistics of key variables
Variable | Initial data: Mean | Initial data: St. deviation | Initial data: Frequency | Initial data: Share of total | Analyzed sample: Mean | Analyzed sample: St. deviation | Analyzed sample: Frequency | Analyzed sample: Share of total
Sales volume | 0,782 | 1,467 | 4512233 | x | 0,782 | 1,466 | 800000 | x
Average price | 48,101 | 24,092 | 4512233 | x | 48,074 | 24,088 | 800000 | x
Weight | 464,474 | 130,884 | 4512233 | x | 464,576 | 130,967 | 800000 | x
Time attributes (by sales volume)
2009 | 1,071 | 1,797 | 551828 | 12% | 1,078 | 1,807 | 98023 | 12%
2010 | 0,869 | 1,551 | 696275 | 16% | 0,865 | 1,545 | 123058 | 15%
2011 | 0,686 | 1,344 | 770557 | 17% | 0,681 | 1,339 | 136610 | 17%
2012 | 0,627 | 1,291 | 898022 | 20% | 0,626 | 1,287 | 159507 | 20%
2013 | 0,787 | 1,436 | 760083 | 17% | 0,786 | 1,437 | 135197 | 17%
2014 | 0,770 | 1,436 | 835468 | 18% | 0,772 | 1,433 | 147605 | 19%
The most purchased brands (by sales volume)
Makfa | 1,311 | 1,809 | 857577 | 19,00% | 1,304 | 1,803 | 151757 | 18,27%
Granmulino | 0,646 | 1,340 | 441885 | 9,79% | 0,650 | 1,348 | 78484 | 9,35%
PastaZara | 0,381 | 0,909 | 373794 | 8,28% | 0,380 | 0,907 | 66235 | 8,12%
GallinaBlanca | 0,332 | 0,774 | 341934 | 7,58% | 0,332 | 0,762 | 60449 | 7,19%
Ameria | 0,934 | 1,615 | 263641 | 5,84% | 0,930 | 1,601 | 46863 | 5,66%
Table 15. Results of multiple t-test for key variables in the initial dataset and the random sample
Variable | p-value (Ha: diff<0) | p-value (Ha: diff≠0) | p-value (Ha: diff>0)
Sales volume | 0,590 | 0,819 | 0,410
Average price | 0,823 | 0,353 | 0,177
Weight | 0,261 | 0,522 | 0,739