Comparison of Machine Learning Algorithms in Demand Prediction Problem
Abstract
This study is intended to cover the major issues of applying econometric and machine learning techniques to a daily demand prediction problem. The current study focuses mainly on validating theoretical inferences about the advantages of these techniques with an empirical approach; therefore, we pay special attention to the description and implementation of the methods. The analysis includes the following prediction models: linear regression, support vector regression, random forest, gradient boosting, and an ensemble of these models. At the same time, we examine different accuracy metrics: quantile error and mean absolute error. The purpose of the paper is achieved via a comparison of the models' predictive power on data from a bakery retail chain. Further research in this area could push forward a deeper analysis of the compatibility between predictive techniques and retail daily sales.
Introduction
A huge amount of collected transaction data in retail chains gives an opportunity to employ it in a wide variety of different business problems: demand prediction, optimal assortment, planning of supplies, inventory management, labor scheduling, and others. Demand forecasting is a basis and the most significant input for successful retail chain yield (Liu, Bhattacharyya, Sclove, Chen & Lattyak, 2001). Demand forecasting is the cornerstone of a data-based revenue management approach.
In our work, we explore the food retail demand prediction problem based on point of sales (POS) transaction data. POS data is information from consumer's purchases including which customer bought what products, at what prices, and when and where a transaction took place.
After defining the business objective, we can transform it into an analytics goal. With this type of data, exploring demand means analyzing daily sales at the SKU (stock-keeping unit) level. Therefore, the study aims to construct a daily sales per SKU predictive model that provides the best forecast in terms of prediction accuracy. That is, the research question of the work is: which prediction algorithm provides the highest accuracy in the 1-day-ahead retail sales prediction problem?
Following the classical steps of the predictive modeling process (data preparation, model selection, validation method selection), we answer the research question by completing the following objectives:
select the input variables that are important to include;
analyze data specifics (absence of sales);
select the techniques suitable for the research;
choose a prediction accuracy metric.
The key result of the work is a comparison of different predictive models and different evaluation metrics, based both on theoretical conclusions and on the quantitative results of model evaluation. Actual retail daily sales data are taken from a Russian retail bakery chain (5 restaurants) and presented at SKU level for each restaurant over 2019.
The research is of primary significance to production managers, marketing managers, logistics specialists, and financiers in retail trade networks and restaurant chains. Our findings may be used to get a better understanding of how to implement forecasting methods in order to receive sales prediction and how to coordinate different departments' work flows more efficiently. Another important point is that the study contributes to the development of theoretical works concerning characteristic features of different prediction models by providing a wider comparative analysis of techniques.
The structure of the paper is organized as follows. The first section presents a literature review of the retail demand determinants and a comparative analysis of prediction techniques. The second part is devoted to a detailed data description, including graphical and statistical analysis, which allows us to choose the appropriate methodology. The third section describes the methods and techniques used in learning and evaluating the prediction models. Finally, the last part discusses the results of the study: the comparison of models and methods.
Literature review
1.1 Demand Determinants
A considerable amount of literature on retail demand determination provides a basis for the present study. This section is devoted to theoretical studies as well as empirical ones in the field. We address three crucial aspects: firstly, the process of selecting input variables for a model; secondly, an extensive discussion of different prediction techniques; and lastly, methods for comparing forecast accuracy. The interrelation between these aspects stresses the need for a multi-disciplinary literature analysis, as the different elements are discussed in different fields of science: economics, management, marketing, and machine learning. We pay special attention to research in the fields of economics and machine learning, while we examine solely the key papers in the fields of marketing and revenue management. Table 1 represents the key points of the literature review and the corresponding field.
Table 1
Key approaches in the literature review
Approach | Field of science
Retail sales specialty | Revenue Management
Retail demand determinants | Economics, Revenue Management
Sales time-series analysis | Economics
Sales predictive models | Economics, Machine Learning
Model accuracy evaluation | Machine Learning
To begin with, we examine the first two points about independent variables, following the approach of Lasek, Cercone and Saunders (Lasek, Cercone & Saunders, 2016), which consists in identifying the major overall demand determinants and selecting from them the ones most appropriate to the research task. One of the key papers devoted to food retail demand determinants in the field of revenue management is the review article of Kimes and co-authors (Kimes, Chase, Choi, Lee & Ngonzi, 1998). The authors present evidence suggesting that competitive points, location, significant changes in pricing policy or product management, and macroeconomic indicators do not contribute to short-term sales and, as a result, do not affect short-term forecast accuracy either. Conversely, temporary changes in prices, holidays, time characteristics, weather, regular promotions, and lagged and aggregate sales are the most significant demand determinants.
We provide a summary explanation of the relationship between the chosen factors and food retail demand. Time characteristics and holidays allow us to account for seasonality (Ehrenthal, Honhon & Woensel, 2014); for instance, intra-week seasonality manifests itself in the difference between consumers' demand on weekdays and weekends. The same logic applies to the steep surge in demand on holidays (Weatherford, Kimes & Scott, 2001). In the described empirical works, depending on the forecast time horizon, the number of the day in a week, the number of the day in a month, the number of a month, the number of a year, and some composite variables appear as examples of time characteristics.
Weather conditions have a significant influence on retail sales by affecting consumers' behavior, in particular the willingness to go into a store or a restaurant (Nenni, Giustiniano & Pirolo, 2013). The authors emphasize that weather-based prediction accuracy is strongly dependent on the accuracy of the weather forecast itself. We sidestep this obstacle in the analysis because we investigate historical weather data with actual weather conditions; nevertheless, it is a limitation for implementing the predictive model in practice.
Regular promotions and temporary changes in price have a great impact on demand, in accordance with the law of demand. Lastly, it is exceedingly important to consider lagged sales and aggregate sales factors. Lagged sales are sales whose value comes from an earlier point in time (Jin, Williams, Tokar & Waller, 2015), for example, sales a week or a month ago, which additionally reflect sales patterns. Aggregate sales are the sum, average or another aggregation function over lagged sales in a specific period in the past. It is important to note that the last group of determinants (lagged and aggregate sales) is specified at the stage of data analysis using autocorrelation techniques. The mentioned authors recommend conducting autocorrelation analysis and paying attention to sales with lags of 7, 14 and 21 days, as well as month and year lags. As for aggregates, it is possible to check some arguable smoothing parameters using standard autocorrelation analysis and classical aggregation functions.
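A minimal sketch of how such lagged and aggregate sales variables could be constructed from a daily SKU-level panel with pandas; the column names, window lengths and the synthetic data are illustrative assumptions, not the exact features of the original dataset.

```python
import pandas as pd

# Illustrative daily sales panel: one row per (store, sku, date).
sales = pd.DataFrame({
    "store": ["A"] * 60,
    "sku": ["bun_01"] * 60,
    "date": pd.date_range("2019-01-14", periods=60, freq="D"),
    "qty": range(60),
})
sales = sales.sort_values(["store", "sku", "date"])

group = sales.groupby(["store", "sku"])["qty"]

# Lagged sales: values of the same series 7, 14 and 21 days earlier.
for lag in (7, 14, 21):
    sales[f"qty_lag_{lag}"] = group.shift(lag)

# Aggregate sales: mean and sum over the previous 7 days
# (shift(1) keeps the aggregation strictly in the past).
sales["qty_mean_7"] = group.transform(lambda s: s.shift(1).rolling(7).mean())
sales["qty_sum_7"] = group.transform(lambda s: s.shift(1).rolling(7).sum())

print(sales.tail())
```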
We would like to conclude this point by mentioning that the provided list of factors influencing food retail sales may be updated and expanded in other papers through the consideration of additional information such as assortment management or loyalty program characteristics. Analysis of such information goes beyond the scope of this work. Therefore, we take into consideration solely the following six groups of demand determinants: temporary changes in prices, holidays, time characteristics, weather, regular promotions, and lagged and aggregate sales.
1.2 Prediction techniques
In the second part of the literature review we address classical econometric models as well as recent machine learning methods, and we start with methods that are popular but nevertheless inappropriate for our work. The mentioned paper of Lasek and co-authors (Lasek et al., 2016) includes the first three estimation methods: smoothing models, ARIMA models, and association rules models. The major reason for not including smoothing and ARIMA models in the further main analysis is both their simplicity and their inability to reveal crucial complex patterns. Previous researchers (Darbellay & Slama, 2000) used these models in similar tasks either to demonstrate the increment in prediction accuracy of other models or as baseline models. It is important to stress that ARIMA-type models can be appropriate for demand forecasting over a longer time horizon. The next group of methods, association rules, is designed from the assortment manager's point of view and is based not on historical sales data but on price elasticity analysis.
The next examined methods are the regression tree, CART and CHAID (Bozkir & Sezer, 2011). All of them are simple decision tree concepts, which are inferior to the methods discussed later according to a recent machine learning paper (Tanizaki, Hoshino, Shimmura & Takenaka, 2019). Decision tree models are quite powerful in revealing non-linear interrelations between independent variables and are robust to outliers; however, they are unstable (Zhang & Suganthan, 2014). To address this shortcoming, we examine more complex tree-based models: the random forest model (RF) and the gradient boosting tree-based model (GB). They are widely used in time-series forecasting problems thanks to the decision tree advantages described above (Qiu, Zhang, Ren, Suganthan & Amaratunga, 2014). The algorithms of both models and an explanation of the differences in their evaluation process are presented in the methodology section.
Having commented on the prediction methods that are least suitable, we turn to the remaining ones appropriate for the research. First, we define a naive model as a baseline method: a model used as a reference point for comparing how well another model (typically a more complex one) is performing. Then we include the classical model from demand prediction. The ordinary least squares linear model (LM) appears in almost all analyzed papers, as the model is distinguished by the interpretability of its results. Another classical model in the machine learning field is support vector regression (SVR). There are a variety of reasons for selecting SVR as a forecasting method, for instance, the ability to reveal complex non-linear relations between variables by selecting a kernel function, and its generally high performance (Bajari, Nekipelov, Ryan & Yang, 2015). Summarized advantages and disadvantages of the models (LM, SVR, RF and GB) are presented in table 2.
Table 2
Comparison of models
Model | Advantages | Disadvantages | Source
LM | It is possible to logically formulate and interpret the model results based on the relationship between sales and the independent variables; fast training time; simple to use. | A variety of assumptions about the independent variables; an extremely low ability to reveal and account for relationships that the researcher does not anticipate. | (Lasek et al., 2016)
SVR | Structural risk minimization is the basic concept of the method; the model provides good results in demand prediction tasks. | Requires a preliminary analysis to choose the kernel function; sensitive to data scaling. | (Qiu et al., 2014); (Bajari & Nekipelov, 2015)
RF | The model demonstrates high performance in many fields of research; ability to aggregate patterns; robustness to outliers. | Averaging of results; impossibility of extrapolating revealed patterns to new data. | (Zhang & Suganthan, 2014)
GB | Advantages of RF; more robust to overfitting than RF; better generalization than RF. | Disadvantages of RF; harder to fit than RF; long training time. | (Zhang & Suganthan, 2014)
1.3 Accuracy metric
The last point of the literature review is devoted to the question of choosing the most appropriate evaluation metric. The problem of choosing the most suitable accuracy metric for sales forecasting is relevant for the economics and management fields (Flores, 1986) as well as for machine learning. The relevance for analysts and data scientists is that the metric has a significant influence on the choice of the prediction method with the best predictive power. The role of economists and managers is to provide a qualitative description of the error evaluation process. The key point is that an appropriate (unbiased) accuracy metric results in the choice of the best prediction method, which is what practitioners are interested in (Yu, Lu & Stander, 2003).
The main idea of accuracy evaluation in the sales prediction problem consists in measuring deviations of the predicted value of sales from the actual one. From the modelling point of view, such a deviation is a forecasting error, and it is minimized by the prediction model. The quality of the model is then evaluated as a function over the errors, which is called an accuracy metric. The metric provides a quantitative measure of model quality.
The authors of a well-known paper compared the classical evaluation metrics for time-series analysis (Armstrong & Collopy, 1992) and showed that the most preferred metric in terms of reliability, outlier protection, and sensitivity differs depending on the depth and properties of the data. However, the most intriguing general conclusion of the paper is that the root mean squared error, a widespread and basic method for comparing forecast accuracy, is inferior to other metrics (for instance, the mean absolute error and the median relative absolute error) in all reviewed cases. The authors of more recent papers (Willmott & Matsuura, 2005; Hyndman & Koehler, 2005) came to the same conclusion for the case of high-dimensional data while analyzing the advantages of time-series model performance evaluation.
Another group of researchers investigates asymmetric metrics instead of classical symmetric ones. For instance, it is shown with the example of bicycles that short-run sales forecasting using classical symmetric metrics has serious negative consequences (Dopke, Fritsche & Siliverstovs, 2009). The key result is that symmetric metrics produce a bias: the average prediction differs from the actual value, so the expected forecasts are also biased and, therefore, may be improved. Similar conclusions are provided for the market of energy resources (Auffhammer, 2007). The discussion of the advantages of asymmetric metrics is based on the irrelevance of the classical metric assumption.
The core assumption of symmetric metrics is that the costs of overprediction and the costs of underprediction are equal (Yu, Lu & Stander, 2003). However, this is quite a strong requirement that often is not met: the larger the difference between the costs of overprediction and underprediction, the larger the bias of the metric. The nature of the difference between underprediction and overprediction costs is discussed from the manager's standpoint (Wang, Webster & Suresh, 2009). It is proved that under classical attitude-to-risk assumptions, all else being equal, the optimal order quantity is lower than the expected one due to behavioral features.
For all of these reasons, we analyze metric issues in detail using qualitative economic and management conclusions along with business insights. For this, we refer to the newsvendor problem, which analyzes manager behavior in the sales forecasting problem from the operations management point of view. The concept is that if too much is ordered, stock is left over, but if too little is ordered, sales are lost. In the sales prediction problem, the first case is overprediction, which generates costs of writing off or producing (crucial for short-lived commodities); the second case is underprediction, which generates underestimated opportunity costs.
Based on the Schweitzer and Cachon paper (2000), we analyze how managers actually make ordering decisions according to a suggested sales forecast. The study suggests that the order quantity is chosen by solving an expected profit maximization problem in which underestimated opportunity costs and costs of producing are balanced. Let q be the order quantity of one SKU, which the manager chooses according to the sales forecast, and let p and c be the selling price and the cost of producing each unit, respectively. Then, if s, the actual value of sales in the next period, is unknown to the manager in advance, we face three alternatives depending on the ratio of q and s; equation 1 shows the profit (π) for each of the three cases.

π(q, s) = (p − c)·s − c·(q − s),          if q > s,
π(q, s) = (p − c)·s,                      if q = s,        (1)
π(q, s) = (p − c)·s − (p − c)·(s − q),    if q < s,

The equation shows that when q and s are not equal, profit is lower than in the case q = s by some amount. These amounts are the costs of underestimation and overestimation. They are constructed as the product of the order error (|q − s|) and the appropriate constant: the marginal value of sales (p − c) or the unit cost c. The paper shows that if (p − c) is greater than c, then the optimal choice of q is higher than the expected value of sales even when the latter is known, due to the different costs of over- and under-ordering. This theoretically proves that in the sales prediction problem a symmetric metric, firstly, provides a biased prediction and, secondly, does not correspond to actual manager behavior, which is in line with the discussed model according to the authors' experiments.
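A small numeric illustration of equation 1, with made-up price and cost values: overprediction is penalised at the unit cost c, underprediction at the margin p − c.

```python
def profit(q, s, p=100.0, c=30.0):
    """Newsvendor profit for order quantity q and realised sales s."""
    sold = min(q, s)           # cannot sell more than was ordered or demanded
    return p * sold - c * q    # revenue minus production cost of the whole order

p, c, s = 100.0, 30.0, 10          # illustrative price, cost and actual demand
ideal = profit(10, s)              # q = s: (p - c) * s = 700
over = ideal - profit(12, s)       # ordering 2 units too many costs 2 * c = 60
under = ideal - profit(8, s)       # ordering 2 units too few costs 2 * (p - c) = 140
print(ideal, over, under)          # 700.0 60.0 140.0
```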
In our study we use the most widespread quantile metric as an example of an asymmetric metric, which allows us to choose the relative costs of overprediction and underprediction. We also evaluate the accuracy of prediction using the absolute error, which agrees with the described theoretical papers and with the majority of empirical works, in order to make a more reasonable comparison of results. A deeper analysis of metrics and loss functions is presented in the methodology section.
We briefly summarize the main issues of the literature review section. According to the economics and management literature analysis, we selected six major groups of input variables and explained their possible influence on retail sales: temporary changes in prices, holidays, time characteristics, weather, regular promotions, and lagged and aggregate sales. Then we found the models most appropriate to the prediction problem via economics and machine learning literature analysis: LM, SVR, RF, GB, and an ensemble. Finally, we showed why it is important to evaluate and compare model performance using special accuracy metrics (for instance, the mean quantile error metric) rather than only classical symmetric metrics (for instance, the mean absolute error metric).
2. Data
For the purpose of model estimation, we collect sales data from the bakery chain operating in Russia. We analyze sales in top-5 bakeries located in Saint-Petersburg from 14 January 2019 up to 31 December 2019 without gaps.
Data has the structure of an unbalanced panel due to the differences in the assortment between days and bakeries. The dataset contains over seven hundred unique stock-keeping units (SKU) resulting in 251283 daily observations. The unit of observation is the volume of sales of a stock-keeping unit (SKU) in a bakery in a day.
We analyze the data in the following way. Firstly, we check if there were significant changes in the assortment or consumer behavior, secondly we analyze the function of the sales distribution to get a deeper understanding of the dependent variable, and finally, we study the nature of relationship between daily sales and major demand determinants described above.
The assortment in the reviewed bakeries is mainly represented by 5 categories: buns, drinks, cakes, bread, and main courses. Changes in their shares of sales in the bakeries during the year are presented in table 3.
Table 3
Shares of revenue among categories (in %)
Category | Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec
Bread | 10% | 10% | 11% | 11%
Buns | 11% | 15% | 17% | 16%
Cakes | 20% | 22% | 23% | 23%
Drinks | 35% | 29% | 26% | 25%
Main courses | 12% | 16% | 16% | 19%
Other | 11% | 9% | 7% | 7%
The table shows that the bakery assortment did not change significantly during the considered period, except that the share of the category `Buns' increased from 11% to 16% during the year among all cafes, and the share of `Drinks' decreased from 35% to 25%. As for the other categories, there are no significant changes in shares, so we can suggest that sales were stationary in terms of the ratio of category shares and assume that the structure of the assortment did not change significantly. That is the reason for not including period fixed effects for the "category" variable.
It is important to note that SKUs in the category "Drinks" account for about 30% of overall sales; however, the category is excluded from further analysis. The main reason is that a forecast for this category has little practical application due to the specifics of producing drinks in bakeries: the semi-finished products have a long shelf life and therefore are not the object of the study. There is no need for accurate prediction in this category.
Thus, in the further analysis we exclude sales of SKUs in the category "Drinks". We have shown that the structure of sales does not change significantly, so we may use the data in further work without any transformation. The next step is to analyze the dependent variable in detail. For this, we turn to the graphical analysis of sales. Figure 1 depicts the histogram of the sales distribution over the whole period together with the corresponding normal distribution left-censored at the threshold of 0, as in the actual sales data. The normal distribution density function has the parameters (mean and deviation) of the actual sales distribution.
Fig.1 Sales Distribution
The figure shows the density function is left censored. That is to say, the sales in all bakeries do not take the value less than 0. The challenge may be considered using special techniques, for instance, Tobit model instead of OLS estimation, or appropriate activation function in a neural network. However, the figure shows that the censorship is small enough and therefore does not have a significant influence on the following calculations and implications of the estimated models.
Another important point in the figure is that the distribution does not have the shape of a Gaussian distribution but rather resembles a lognormal one: there is a rapid increase of the density at small volumes of sales and a slow decrease of the density at larger volumes. In order to get unbiased estimates in some models, we test the hypothesis that the logarithm of sales is normally distributed using a chi-squared test. The null hypothesis of normality is rejected for both distributions because of the data censorship; however, the test statistic for the logarithmic transformation decreases from 51004 to 12614, which indicates a higher similarity to the normal distribution. Therefore, the dependent variable in our study is the logarithm of daily sales. The next step is to verify inferences from the previous papers discussed in the literature review. We pay special attention to the following variables: price, time characteristics, and lagged sales.
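A hedged sketch of the kind of chi-squared goodness-of-fit check described above, comparing binned sales (raw and log-transformed) with a normal distribution fitted to the same data; the synthetic sample and the number of bins are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for daily SKU sales: roughly log-normal, non-negative.
sales = np.maximum(rng.lognormal(mean=2.0, sigma=0.8, size=10_000), 0)

def chi2_vs_normal(x, bins=50):
    """Chi-squared statistic of x against a normal fit with x's mean and std."""
    observed, edges = np.histogram(x, bins=bins)
    cdf = stats.norm(loc=x.mean(), scale=x.std()).cdf(edges)
    expected = np.diff(cdf) * len(x)
    mask = expected > 0                      # skip empty bins to avoid division by zero
    return ((observed[mask] - expected[mask]) ** 2 / expected[mask]).sum()

print(chi2_vs_normal(sales))            # large statistic: far from normal
print(chi2_vs_normal(np.log1p(sales)))  # much smaller: closer to normal
```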
Average price dynamics across all bakeries (without weighting by sales volume), shown in Figure 2, suggest that there was no significant change in pricing strategy but rather a steady growth of the average price across all bakeries we investigate.
Fig.2 The average price dynamics
Nevertheless, a more detailed analysis reveals that the significant change in the average price is explained by the growth of "Other" category sales rather than by a trend component in the price time series. Table 4 shows the average price for all categories, weighted (W) and not weighted (NW) by sales.
Table 4
Average price among categories (in rubles)
Category | Jan-Mar NW | Jan-Mar W | Apr-Jun NW | Apr-Jun W | Jul-Sep NW | Jul-Sep W | Oct-Dec NW | Oct-Dec W
Bread | 123 | 115 | 121 | 113 | 124 | 113 | 127 | 115
Buns | 99 | 97 | 98 | 97 | 100 | 100 | 100 | 99
Cakes | 222 | 162 | 215 | 172 | 238 | 186 | 245 | 190
Drinks | 139 | 141 | 138 | 143 | 139 | 145 | 141 | 149
Main courses | 153 | 159 | 176 | 178 | 174 | 174 | 189 | 188
Other | 13 | 9 | 24 | 23 | 95 | 111 | 99 | 113
The table shows that there were no significant changes in price and in the structure of sales in the "Bread", "Buns" and "Drinks" categories. The growth of non-weighted prices in the categories "Cakes" and "Main courses" caused the growth of the weighted price. Therefore we conclude that the sales structure within these categories and consumer preferences are stable, which makes it possible to estimate one model over the whole period.
We can be sure that temporal changes in mean prices are quite uncommon and can be neglected, except for the "Other" category. The growth of the price in this category is explained neither by an increasing number of marketing activities nor by increasing demand in the second half of the year. It is a consequence of a change in the system of price and discount accounting in the category; therefore, there is a structural change in the category "Other", and that is the reason for excluding it from further analysis. Table 5 shows the dispersion of the average price from week to week for a representative SKU in the remaining categories.
Table 5
Between and within variation (in % from total) of average price among categories
Category | Price "Between" variation | Price "Within" variation | Price variation
Bread | 14% | 0% | 100%
Buns | 73% | |
Cakes | 13% | |
Main courses | 0% | |
The table depicts fluctuations of the average price in each category around some average value, as the price variation within a category is relatively small and insignificant in comparison to the price variation between different categories. This allows us to neglect changes in price over time (from week to week) in further analysis, since according to the data these changes are related to marketing activity that is unobservable for the researcher but known to the manager. The considered bakery chain does not provide discounts on specific SKUs but provides personal discounts; analysis of such a pricing strategy is out of the scope of the research. Therefore we follow the assumption that the price does not change during the month or during the year.
As a result of data analysis, we exclude from the dataset SKUs related to the categories “Drinks” and “Other”. The remaining dataset contains 112368 observations in four categories.
The following part of the work is devoted to the analysis of key time characteristics. According to the literature review, we take into consideration the weekday and the number of the day in a month. We show why it is necessary to include such factors in the predictive model using autocorrelation and correlation techniques.
Previous researchers (Bozkir & Sezer, 2011) emphasized the significant impact of lagged variables in the sales prediction problem. The issue is theoretically explained by similar guest flows on the same weekdays and by specific consumer behavior patterns for some days of the month. In order to analyze the importance of different lagged variables, we estimate a partial autocorrelation function (PACF). The panel structure of the dataset allows us to estimate values of the PACF by evaluating the linear regression presented in equation 2.
y_{i,t} = α_0 + α_1·y_{i,t−1} + α_2·y_{i,t−2} + … + α_k·y_{i,t−k} + ε_{i,t},    (2)
where:
y_{i,t} - the volume of sales of SKU i in period t;
ε_{i,t} - an idiosyncratic shock of SKU i in period t;
k - the maximum lag.
In the estimation process, the coefficient α_k equals the corresponding component of the PACF, which is defined as an element of the covariance matrix or as the coefficient of multiple correlation. According to the nature of the data and the short time period, we suggest that the maximum significant lag equals 30 (the number of days in a month). Figure 3 depicts the partial autocorrelation function of daily sales.
Fig. 3 Daily sales partial autocorrelation function
The bars in the figure represent values of the PACF; the dotted lines show confidence intervals for the PACF values. Confidence intervals are calculated using a t-test for the regression presented in equation 2 with a 99% confidence level. PACF values between the lines reflect lags that are insignificant at the 1% significance level; for instance, the hypothesis of an insignificant relation to current sales is not rejected at the 1% significance level for all lagged sales with a lag of 22 days or more. That is the reason for not including these lags in further analysis.
The figure shows that the lagged volumes of sales with lags from 1 to 4 days have the biggest impact on current sales. This is in line with the fact that patterns in consumer activity do not change rapidly but are maintained for at least several days. As for the volumes of sales with lags of 7, 14, 21 and 28 days, they reflect week-to-week patterns and emphasize the importance of including the weekdays and such lagged variables in the model, due to similar sales patterns on the same weekdays. In the further analysis we include the following lags of the dependent variable (in days): 1, 2, 3, 4, 7, 14, 21.
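A sketch of the regression-based PACF estimation from equation 2 on a toy panel: for each maximum lag k, the coefficient on the k-th lag from an OLS fit including all lags 1..k gives the partial autocorrelation at that lag. The column names and the synthetic panel are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
# Toy panel: 20 SKUs observed over 200 days with a weak lag-7 dependence.
frames = []
for sku in range(20):
    y = 5 + rng.normal(size=200)
    y[7:] += 0.3 * y[:-7]
    frames.append(pd.DataFrame({"sku": sku, "t": range(200), "y": y}))
panel = pd.concat(frames, ignore_index=True)

def panel_pacf(df, max_lag=30):
    """Partial autocorrelations estimated as the last OLS coefficient."""
    values = {}
    for k in range(1, max_lag + 1):
        lags = pd.concat(
            {f"lag_{j}": df.groupby("sku")["y"].shift(j) for j in range(1, k + 1)},
            axis=1,
        )
        data = pd.concat([df["y"], lags], axis=1).dropna()
        model = sm.OLS(data["y"], sm.add_constant(data.drop(columns="y"))).fit()
        values[k] = model.params[f"lag_{k}"]
    return pd.Series(values)

print(panel_pacf(panel, max_lag=10).round(2))
```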
Another important group of factors influencing daily sales is weather. We collect information about the average temperature within a day, its maximum and minimum values during the day, the humidity level, pressure and wind speed. In order to avoid collinearity problems, which occur in linear estimators due to the closeness of some explanatory variables, we should carefully select uncorrelated weather variables. Figure 4 contains a heatmap showing correlation coefficients between the variables. Each cell contains the corresponding Pearson correlation coefficient, which ranges from -1 to 1; the closer to 0, the lower the correlation between variables.
Fig. 4. Correlation heatmap for weather variables
The figure represents two important issues: firstly, an extremely high correlation between the average temperature in a day and its maximum and minimum values, as is expected; secondly, the absence of a strong correlation between the other weather characteristics. As a result, the linear estimators include the average temperature during a day, the humidity level, pressure and wind speed. At the same time, the non-linear estimators additionally include the minimum and maximum temperature in a day, as these variables contain a piece of additional information. The logic of the decision is that a collinearity problem may occur and make the linear prediction model unstable, with low predictive power.
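A minimal sketch of the correlation screening for the weather variables; the column names and synthetic values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
temp = rng.normal(11, 10, size=365)
weather = pd.DataFrame({
    "temp_avg": temp,
    "temp_min": temp - rng.uniform(1, 5, size=365),   # strongly tied to the average
    "temp_max": temp + rng.uniform(1, 5, size=365),
    "humidity": rng.uniform(17, 100, size=365),
    "pressure": rng.normal(1011, 11, size=365),
    "wind_speed": rng.uniform(0, 13, size=365),
})

corr = weather.corr()                 # pairwise Pearson correlations
print(corr.round(2))

# Keep one temperature column for the linear estimators,
# drop the near-collinear min/max values.
linear_features = corr.columns.drop(["temp_min", "temp_max"])
print(list(linear_features))
```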
Additionally, several heuristic weather characteristics are included in the non-linear estimator analysis: compound weather characteristics and lagged weather characteristics. A compound characteristic relates the current value of a weather characteristic to its value several days earlier and is calculated using the formula presented in equation 3.

r_{i,t,j} = f(x_t, x_{t−j}),    (3)
where:
r_{i,t,j} - the compound variable for SKU i in period t calculated with the weather lag j;
x_t - the current value of the weather characteristic;
x_{t−j} - the value of the weather characteristic j days ago.
Correlation analysis shows a statistically significant correlation between these variables and the dependent variable. It is important to note that the compound variables are correlated with each other and with lagged sales by construction; therefore they are not included in the linear estimators, as explained before. We include compound variables for four key weather characteristics with the lag of 20 periods (days).
Another group of weather characteristics is lagged weather characteristics. They also give additional information for the sales prediction problem according to general time series properties (Bouktif et al., 2018). After a similar partial autocorrelation analysis we conclude that several lagged values should be included. As a result, we include in the non-linear estimators 37 lagged characteristics of temperature, 17 lagged values of humidity and pressure, and 20 lagged values of wind speed. Descriptive statistics of the key independent variables used in all prediction models are presented in table 6.
Table 6
Descriptive statistics
Variable | Mean | St.d. | Min | 1 quartile | Median | 3 quartile | Max
Temperature | 11.0 | 9.7 | -19.0 | 3.8 | 11.7 | 19.2 | 29.6
Humidity | 71.0 | 17.1 | 17.0 | 59.0 | 73.0 | 85.0 | 100.0
Wind speed | 4.5 | 2.1 | 0.2 | 3.0 | 4.4 | 6.0 | 13.0
Pressure | 1011 | 11 | 974 | 1005 | 1012 | 1019 | 1039
Weekday | 3.0 | 2.0 | 0 | 1 | 3 | 5 | 6
Day | 16.9 | 8.7 | 1 | 9 | 18 | 24 | 31
Temp ratio | 0.5 | 4.8 | -274.3 | 0.4 | 0.5 | 0.6 | 1184.2
Temp min ratio | 0.5 | 2.9 | -379.3 | 0.4 | 0.5 | 0.6 | 281.7
Temp max ratio | 0.6 | 4.5 | -224.5 | 0.4 | 0.5 | 0.6 | 1148.0
Humidity ratio | 0.5 | 0.2 | 0.0 | 0.5 | 0.5 | 0.5 | 2.7
Wind speed ratio | 0.5 | 0.3 | 0.0 | 0.4 | 0.5 | 0.6 | 2.7
Pressure ratio | 0.5 | 0.2 | 0.0 | 0.5 | 0.5 | 0.5 | 2.7
In this part of the paper we explore the data and reveal the key patterns marked in the literature review. We show why it is possible not to include the price of items and the assortment characteristics in the model. After that we analyze the sales distribution and get a deeper understanding of the nature of the data. Finally, we study the relationship between daily sales and the major demand determinants: weather characteristics, time characteristics, and lagged values of sales. We show the steps of the variable selection procedure for lagged variables and weather characteristics and present the key steps in constructing compound characteristics that provide additional information for the analysis.
3. Methodology
3.1 General algorithm
In this section we describe the techniques used to reach the aim of the work. The comparison of prediction models consists in consistently calculating the out-of-sample accuracy of each model and ordering the models by its value: the lowest error corresponds to the model with the highest predictive power, that is to say, the most suitable for implementation in practice. The prediction process includes three general stages after data collection and preparation: tuning the hyperparameters of a model if needed, estimating the model parameters, and creating a forecast (Bouktif et al., 2018). According to the literature review, we compare the predictive power of the following model types: linear regression, support vector regression, random forest and gradient boosting. The final ensemble method includes the previous models as input variables. The resulting set of models consists of the different models combined with 2 different loss functions.
Firstly, in this part we discuss the overall steps of making a prediction in detail. Secondly, we discuss the loss functions and the corresponding accuracy metrics, which show how close the prediction is to the actual values. Thirdly, we describe and discuss some properties of the chosen prediction models.
Starting with the prediction process, we introduce some specific definitions. Importantly, we should divide the dataset into three parts, one for each step: a cross-validation dataset for hyperparameter tuning, a training dataset for model evaluation, and a hold-out dataset for making predictions. These parts should not overlap and can be obtained as observations at successive time points. Following the Bergmeir and Benítez (2012) approach, we define 40% of observations as the validation part (with the subsequent 5-fold cross-validation), the next 40% as the training part and the remaining 20% as the test part.
The prediction process starts with tuning a model. Different models have their specific characteristics, and there are some cases when the researcher can set the specific values; however, in the majority of cases they are not known and therefore are set empirically. The parameters of the model that should be set before the learning process are called hyperparameters. We search for optimal hyperparameters via a special procedure - cross-validation. The idea is to learn models with the same hyperparameters on different datasets and average the measure of prediction quality. The result is more stable and representative than a single score obtained on the whole dataset.
We explain the cross-validation technique using the 5-fold cross-validation example, as it is the most suitable for the data. Firstly, the validation dataset is split into 5 parts with an equal number of observations. After that, 4 parts are used for training a model, and the remaining part is used for score calculation. The step is repeated 5 times with different testing parts, and the overall average score is calculated. The procedure is repeated for different hyperparameters, so the model with optimal hyperparameters has the lowest error score.
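A sketch of the chronological 40/40/20 split and the 5-fold cross-validation for one candidate hyperparameter setting; the feature matrix, the model and the scoring function are placeholders, not the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8))                  # stand-in feature matrix, ordered in time
y = X[:, 0] + rng.normal(size=1000)

n = len(X)
cv_end, train_end = int(0.4 * n), int(0.8 * n)
X_cv, y_cv = X[:cv_end], y[:cv_end]                            # 40%: hyperparameter tuning
X_train, y_train = X[cv_end:train_end], y[cv_end:train_end]    # 40%: final model fit
X_test, y_test = X[train_end:], y[train_end:]                  # 20%: hold-out evaluation

model = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(
    model, X_cv, y_cv, cv=KFold(n_splits=5, shuffle=False),
    scoring="neg_mean_absolute_error",
)
print(scores.mean())   # average out-of-fold error for this hyperparameter setting
```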
Classical cross-validation methods are constructed for cross-sectional data without any dependencies between observations, but this assumption is false for panel data, where a strict relationship between objects and periods exists. Overcoming the obstacle requires some technical analysis. Bergmeir and Benítez (2012) analyzed the problem of cross-validation for time series and showed that in large enough datasets the main properties of cross-validated accuracy results (mainly, robustness) are maintained, and the dependencies in the data due to the time structure of observations do not prevent using classical cross-validation techniques in time series and panel data analysis. The empirical fact is supported by a statistical explanation; hence, in the further analysis it is possible to implement the cross-validation procedure for the selected econometric and machine learning models without any restrictions.
The last but not least point that should be mentioned is the way of selecting hyperparameters. The most popular approaches are grid search, random search, and Bayesian optimization. The last method is out of the scope of the research due to its computational complexity for ensemble models and neural networks. We base the choice of the approach on the well-known paper by Bergstra and Bengio (2012). The authors provide plenty of evidence that random search takes precedence over grid search. Empirical evidence comes from a comparison in a large study with different models, as well as theoretical evidence from a statistical analysis under classical assumptions. Another significant implication is that random search is effective in cases when some hyperparameters do not matter much.
To sum up, at the first stage after data preprocessing we perform the cross-validation procedure for each model containing any hyperparameters. We use 5-fold cross-validation on a dataset with the size of 40% of the original one. The hyperparameter candidates are drawn using a random search technique, and the hyperparameters corresponding to the lowest out-of-sample error are selected.
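A hedged sketch of random search over hyperparameters with scikit-learn's RandomizedSearchCV; the parameter distributions and the number of iterations are illustrative assumptions, and the loss name assumes a recent scikit-learn version.

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; the concrete ranges used in the study may differ.
param_distributions = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(2, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.5, 0.5),          # fractions between 0.5 and 1.0
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(loss="absolute_error", random_state=0),
    param_distributions,
    n_iter=30,                               # number of random configurations tried
    cv=5,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
# search.fit(X_cv, y_cv)                     # X_cv, y_cv: the 40% validation part
# print(search.best_params_, -search.best_score_)
```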
3.2 Comparison techniques
The next issue is devoted to the model evaluation and quality estimation process. As we show in the literature review section, an asymmetric accuracy metric gives unbiased forecasts and allows us to estimate the unbiased quality of the model in the retail sales prediction problem. We verify this result empirically by comparing forecasts created under different accuracy metrics: the quantile metric and the mean absolute error metric, as the latter is the most frequently used metric in the prediction problem and its advantages over other symmetric metrics are proved empirically and theoretically for similar problems.
In order to get a better understanding of model quality, we compare the predictive power of models using the mean absolute percentage error (MAPE) as an example of a mean absolute error accuracy metric. MAPE reflects the averaged absolute error of the model as a percentage of the average value of the target variable. MAPE is calculated according to equation 4.

MAPE = (100% / m) · Σ_{i=1..m} |y_i − ŷ_i| / ȳ,    (4)
where:
m - the total number of objects;
y_i - the actual sales of object i in the period;
ŷ_i - the predicted sales of object i;
ȳ - the average value of actual sales.
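A small sketch of the MAPE variant described in equation 4 (mean absolute error expressed as a percentage of the average actual value); normalising by the mean rather than by each individual observation is how we read the definition above.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute error as a percentage of the average actual sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.abs(y_true - y_pred).mean() / y_true.mean()

print(mape([10, 20, 30], [12, 18, 33]))  # 11.67: average miss of ~2.3 units vs a mean of 20
```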
For the aim of comparing the quantile and mean absolute error accuracy metrics, we evaluate the models using different loss functions depending on the chosen accuracy metric. The MAPE metric corresponds to the mean absolute error (L1) loss function, which minimizes the absolute errors in the process of model evaluation, in other words, the differences between the predicted values of sales and the actual ones. Turning to the quantile accuracy metric, it is necessary to refer to the newsvendor problem presented in the literature review section. We showed that the cost of overpredicting one unit equals the cost of producing the unit, c, while the cost of underprediction equals the underestimated opportunity cost (p − c). We represent this information in figure 5.
Fig. 5. Loss functions
The figure depicts the accuracy metric we want to use and shows the different slopes of the function on different intervals of the predicted variable. Therefore, we define the mean quantile error (MQE) accuracy metric according to equation 5.

MQE = (1/m) · Σ_{i=1..m} [ (p − c)·(y_i − ŷ_i)·I(y_i ≥ ŷ_i) + c·(ŷ_i − y_i)·I(y_i < ŷ_i) ],    (5)
where:
I(·) - the indicator function, equal to 1 if the condition in brackets holds and 0 otherwise.
The next step is to define an appropriate loss function for MQE. Such a loss function must minimize errors (e) in the evaluation process and result in minimizing MQE. In equation 6 we show the quantile loss function, which meets both requirements.

L_t(y_i, ŷ_i) = t·(y_i − ŷ_i)·I(y_i ≥ ŷ_i) + (1 − t)·(ŷ_i − y_i)·I(y_i < ŷ_i),    (6)

Let t be ((p − c)/p) and e be (y_i − ŷ_i); then the loss function above transforms into the form presented in equation 7.

L_t(e) = t·e·I(e ≥ 0) − (1 − t)·e·I(e < 0) = (1/p)·[(p − c)·e·I(e ≥ 0) − c·e·I(e < 0)],    (7)

Minimizing the quantile loss function leads to the first-order condition t·P(e > 0) = (1 − t)·P(e < 0), that is, the prediction ŷ_i equals the t-quantile of the sales distribution, which guarantees the minimum of MQE for a given t. Therefore both the accuracy metric and the loss function provide an unbiased prediction and an unbiased measure of model quality under the assumption of an asymmetric accuracy metric.
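A sketch of the quantile (pinball) loss from equations 6-7 with t = (p − c)/p; minimizing its mean is equivalent, up to the constant factor p, to minimizing the MQE in equation 5. The price and cost values are illustrative.

```python
import numpy as np

def mean_quantile_error(y_true, y_pred, t):
    """Average pinball loss: underprediction weighted by t, overprediction by 1 - t."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.where(e >= 0, t * e, (t - 1) * e))

p, c = 100.0, 30.0
t = (p - c) / p                       # 0.7: underprediction is penalised more heavily
y_true = np.array([10, 20, 30])
print(mean_quantile_error(y_true, y_true - 2, t))  # underprediction by 2: 2 * 0.7 = 1.4
print(mean_quantile_error(y_true, y_true + 2, t))  # overprediction by 2:  2 * 0.3 = 0.6
```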
In addition, we calculate another metric of model quality: the economic effect. The metric shows the gain in financial benefit realized when using the model's forecasts in comparison to the baseline, the current system of forecasting. We calculate the financial benefit as economic profit, which equals accounting profit minus the opportunity costs that occur in the case of underprediction.
It is worth noting that the forecasting baseline is unique for each business and depends on the current system of sales prediction. The basic rule of forecasting in practice is orientation towards sales 7 days ago, so we use it as the baseline for further comparison of model prediction power.
As we aim at the most accurate forecast in terms of quantity, regardless of price and cost, we may use the assumption of an equal price and cost for all SKUs. Finally, we obtain the formula for calculating retail economic profit. In order to make the metric more illustrative, we normalize it to one month of operation.
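A hedged sketch of the economic-effect comparison: the economic profit (accounting profit minus lost sales valued at the margin) of model-based orders versus the 7-days-ago baseline, under the simplifying equal price/cost assumption; all numbers are illustrative and the forecast is a stand-in.

```python
import numpy as np

def economic_profit(order, actual, p=100.0, c=30.0):
    """Accounting profit minus the opportunity costs of lost sales."""
    order, actual = np.asarray(order), np.asarray(actual)
    sold = np.minimum(order, actual)
    accounting = p * sold - c * order
    lost_sales = (p - c) * np.maximum(actual - order, 0)
    return accounting.sum() - lost_sales.sum()

rng = np.random.default_rng(4)
actual = rng.poisson(20, size=30)          # one month of daily sales for one SKU
baseline = np.roll(actual, 7)              # naive rule: order what sold 7 days ago
model = actual + rng.integers(-2, 3, 30)   # stand-in for model-based forecasts

effect = economic_profit(model, actual) - economic_profit(baseline, actual)
print(effect)   # monthly gain from using the model instead of the baseline
```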
In this part of the work we describe two loss functions (the mean absolute error and quantile loss functions) and two accuracy metrics (MAPE and MQE), which we use in order to compare model prediction power and show the difference between using quantile and absolute loss functions. That is to say, each model is evaluated twice with different loss functions, and after that, for each loss function, two accuracy metrics are calculated. Additionally, we calculate the net economic effect of implementing the forecasting methods over the traditional way of forecasting used in practice.
3.3 Prediction techniques
In this section we provide an extensive description of chosen methods with some technical details where it is possible.
The first method is a linear regression model estimated by ordinary least squares. The model is estimated under the assumption that sales are determined by some input variables and that the relation in equation 8 holds.

y_{i,t} = β_0 + β_1·x_{1,i,t} + … + β_p·x_{p,i,t} + γ_1·z_{1,i} + … + γ_q·z_{q,i} + ε_{i,t},    (8)
where:
p - the number of independent variables which vary over time;
q - the number of independent variables which do not vary over time.

Taking into account the panel structure of the data and the lagged variables used as predictors, the formula is specified as presented in equation 9.

y_{i,t} = β_0 + α_1·y_{i,t−1} + … + α_k·y_{i,t−k} + β_1·x_{1,i,t} + … + β_p·x_{p,i,t} + γ_1·z_{1,i} + … + γ_q·z_{q,i} + ε_{i,t},    (9)
Such an equation is an example of a linear dynamic panel model, as it includes autoregressive components (lagged values of the target variable). It is proved (Arellano & Bond, 1988) that for a dynamic panel model, estimating the equation by ordinary least squares leads to endogeneity and to inconsistent and biased coefficient estimates due to collinearity and a high correlation with the errors of the previous periods. Special instrumental variables methods are suggested in the paper; by construction, they replace the information about the lagged variables with less informative variables. That is to say, the suggested solution for eliminating endogeneity leads to a decrease in the explanatory and predictive power of the model due to excluding some information about the independent variables.
Nevertheless, in the prediction task, unlike the explanatory one, the statistical properties of the coefficients are unimportant. The predictive power of the model depends on the relevance of the input variables and the validity of the estimation method. Therefore, we may estimate the dynamic specification by ordinary least squares without instrumental variables.
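A minimal sketch of estimating the dynamic specification in equation 9 by ordinary least squares with statsmodels; the feature names, the dummy encoding of the weekday and the synthetic data are illustrative assumptions, not the study's exact design matrix.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
df = pd.DataFrame({
    "log_sales": rng.normal(3, 1, n),
    "temp_avg": rng.normal(11, 10, n),
    "weekday": rng.integers(0, 7, n),
})
for lag in (1, 7, 14):                              # autoregressive terms from equation 9
    df[f"log_sales_lag_{lag}"] = df["log_sales"].shift(lag)
df = df.dropna()

X = sm.add_constant(pd.get_dummies(df.drop(columns="log_sales"),
                                   columns=["weekday"], drop_first=True, dtype=float))
ols = sm.OLS(df["log_sales"], X).fit()
print(ols.params.head())
```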
Support vector regression (SVR) suggests another approach to estimating model coefficients. Unlike least squares methods, SVR avoids an explicit specification of the regression equation (Cristianini, 2000). It estimates an optimal margin around the predicted values within which observations are treated as close enough to the actual values, given the input variables of the model.
Firstly, the input variables of the model are transformed according to the chosen kernel function, so that further calculations are carried out in the space of the new variables. The next step is the estimation of the hyperplane parameters in the new space, which correspond to the average value of the target variable and a margin in the initial space. SVR uses the following rule for estimating the best hyperplane parameters: the sum of distances between the hyperplane and the observation points is minimized, ignoring every point inside the margin. That is to say, these points are considered correct regression predictions without estimation errors. The points outside the margin (far away from the hyperplane) are not estimated correctly and are not representative for the current regression estimate, so they are penalized in the objective function (Bajari & Nekipelov, 2015).
A more accurate SVR estimate is achieved through the correct choice of hyperparameters, the parameters of the model chosen before the evaluation process. We describe and tune the following hyperparameters: the kernel function and its parameters, the regularization parameter C, and the tolerance.
We test three kernel functions: linear, radial basis ("Gaussian"), and polynomial. Generally, the type of kernel function can be chosen based on the type of relation between the variables if it is known. The linear kernel function corresponds to the absence of any transformation of the input variables and does not have additional parameters. The Gaussian kernel function produces a Gaussian transformation and has the gamma parameter, which is responsible for the standard deviation of the transformed distribution of input variables. The polynomial kernel function uses the gamma parameter with the same meaning and the maximum degree of the polynomial transformation. In hyperparameter tuning we check degrees from 2 up to 4 and three gamma values: 0.1, 1, 10.
The regularization parameter C is responsible for the trade-off between the margin width and the errors of observations outside the margin. In order to avoid overfitting it is necessary to restrict the importance of the margin size in the objective function and strike a balance between the number of observations inside the margin and the quality of the estimated model. We check the following values of C: 0.01, 0.1, 1.
The tolerance hyperparameter corresponds to the absolute width of the margin. Its optimal value depends on the data specifics and should be selected. We check the following values of the tolerance: 0.0001, 0.001, 0.01.
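A sketch of the SVR setup with the hyperparameter values listed above; scaling is added because, as noted in table 2, SVR is sensitive to feature scale. Reading the paper's "tolerance" as the width of the epsilon-tube (scikit-learn's epsilon parameter) is our assumption, and the search itself is illustrative.

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

pipeline = Pipeline([
    ("scale", StandardScaler()),   # SVR is sensitive to the scale of inputs
    ("svr", SVR()),
])

param_distributions = {
    "svr__kernel": ["linear", "rbf", "poly"],
    "svr__gamma": [0.1, 1, 10],
    "svr__degree": [2, 3, 4],          # only used by the polynomial kernel
    "svr__C": [0.01, 0.1, 1],
    "svr__epsilon": [0.0001, 0.001, 0.01],   # assumed mapping of the "tolerance" values
}

search = RandomizedSearchCV(pipeline, param_distributions, n_iter=30,
                            cv=5, scoring="neg_mean_absolute_error", random_state=0)
# search.fit(X_cv, y_cv)   # X_cv, y_cv: the 40% validation part described above
```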
The next prediction techniques are the random forest and the gradient tree boosting model. Both models are ensemble learning methods that use regression tree learners with different estimation frameworks.
The building of a regression tree is an iterative process. Each regression tree in the ensemble is trained on randomly sampled observations of the training dataset. For each tree, at each stage (at each node), a random subset of input variables is selected (the share of sampled variables is the feature fraction hyperparameter). After that, the model compares different splits of the selected variables using different thresholds and chooses the best threshold and the corresponding variable. After the best split there are two subsamples, which become two new nodes of the tree. The comparison of split quality is made using the accuracy metrics defined before: the quantile metric and the mean absolute error metric. The average value in a leaf is the predicted value.
The process of creating new nodes ends according to different rules. The first rule stops further splits after the maximum depth of the tree is reached (the maximum depth hyperparameter). The second rule restricts a new split by the minimum number of observations in a potential new node (the minimum samples in leaf hyperparameter). Finally, the tree provides subsamples split by the most appropriate input variables at the most appropriate thresholds.
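A sketch of the two tree ensembles with the two loss functions used in the study (absolute error and quantile loss, with t = 0.7 as an illustrative value); scikit-learn's gradient boosting supports the quantile objective directly, while for the random forest only the absolute-error criterion is shown. All hyperparameter values are placeholders for the ones found by random search.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Random forest with the absolute-error split criterion (L1 loss).
rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=8,                 # maximum depth hyperparameter
    min_samples_leaf=20,         # minimum samples in leaf hyperparameter
    max_features=0.5,            # feature fraction sampled at each split
    criterion="absolute_error",
    random_state=0,
)

# Gradient boosting minimising the quantile (pinball) loss with t = 0.7.
gb = GradientBoostingRegressor(
    loss="quantile",
    alpha=0.7,                   # the quantile t = (p - c) / p, illustrative value
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    random_state=0,
)
# rf.fit(X_train, y_train); gb.fit(X_train, y_train)
```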