Time series prediction using reinforcement learning


NATIONAL RESEARCH UNIVERSITY

HIGHER SCHOOL OF ECONOMICS

Faculty of Business and Management

MASTER'S THESIS

Field of study: Business Informatics

Degree programme: Big Data Systems

Time series prediction using reinforcement learning

Obalyaeva Ekaterina

Academic Supervisor

Sergey Lisitsyn

Moscow 2019

Contents

Annotation

Chapter 1. Introduction

1.1 Problem statement

1.2 Research objectives

1.3 Research significance

1.4 Research structure

Chapter 2. Time series

2.1 Introduction

2.2 Chaotic time series

2.3 Estimation of embedding dimension and delay

2.4 Time series models

2.4.1 Prophet library

2.5 Recurrent neural networks

2.5.1 LSTM

Chapter 3. Reinforcement learning

3.1 Introduction

3.2 Main elements of reinforcement learning

3.3 Optimistic initial values

Chapter 4. Data and tools

4.1 Sunspots time series

4.2 CORN time series

4.3 Technological stack

Chapter 5. Implementation description

5.1 Reinforcement learning domain

5.1.1 Implementation methodology

5.2 Time series cross-validation

5.3 Metrics

5.4 Comparison with other methods

5.5 Experiment 1. Sunspots dataset

5.6 Experiment 2. CORN prices dataset

Chapter 6. Experiment results

6.1 Experiment 1. Sunspots dataset

6.1.1 Prophet predictions

6.1.2 Reinforcement learning dimension delay estimation

6.2 Experiment 2. CORN prices dataset

6.2.1 Prophet predictions

6.2.2 Reinforcement learning dimension delay estimation

Chapter 7. Summary and conclusion

7.1 Summary

7.2 Conclusion

Bibliography

Annotation

The idea of learning to act optimally and to adapt to changes in a time series input can be addressed by means of reinforcement learning. These techniques have many practical applications in finance, healthcare, and industry. Reinforcement learning has recently demonstrated outstanding results in learning to play Atari games. A similar set of methods and algorithms can be applied to improve neural network time series predictions. In the reinforcement learning setting, the agent learns to choose the best embedding dimension and time delay possible. This work includes studying the essentials of reinforcement learning, research on the application of reinforcement learning to improving time series prediction, and an analysis of the influence of several types of reward functions on the prediction outcome.

Chapter 1. Introduction

1.1 Problem statement

Time series prediction and analysis are crucial tasks for the modern world. Every problem that contains time dependencies can be treated as a time series problem. There is a variety of areas of application of time series methods, ranging from stock market price prediction and weather forecasting to predicting the temperature of a spacecraft engine.

With the growth of possibilities in big data processing and storage, there appears a vast horizon of tasks that now can be completed better than ever before. Cloud technologies give researchers easy and often free access to high-performance computers with GPUs.

Time series models are commonly associated with regression models such as autoregression and moving average models. These techniques allow getting impressive and highly interpretable results for several problems, both linear and nonlinear.

Artificial neural networks are at present a very commonly used tool for different types of predictions and problems. Using neural networks for time series prediction allows obtaining arguably higher-quality predictions than the classic models, especially for nonlinear problems. The lack of interpretability and the high computational cost are justified by impressive prediction results.

The most common idea in time series prediction is to use the delay method. That means that the previous values of the sequence explain the current values of the time series. The problem with these techniques is that the embedding dimension and delay values have to be introduced and estimated to achieve high-quality results.

There are plenty of techniques that can be used to estimate dimension and delay values. One of the estimation methods is based on reinforcement learning. Reinforcement learning is a fast-growing area of machine learning whose idea is to build algorithms that can learn from experience.

A reinforcement learning method for estimating the embedding dimension and delay is the central focus of the current research. This method is used to push the boundary of the quality of neural network predictions further.

1.2 Research objectives

Currently, there are only a few studies on reinforcement learning for time series dimension and delay estimation. There is no established methodology for applying these techniques. Moreover, there is no research on the different reward functions applied to this problem. The current research objectives include the following points:

1. Studying the reinforcement learning technique for dimension embedding and time delay estimation.

2. Analysis of the influence of different reward functions on the outcome of the learning.

3. Methodology proposition for reinforcement learning application to dimension delay estimation.

1.3 Research significance

Theoretical significance

1. This research studies the applicability of the reinforcement learning technique to dimension and delay estimation. Two types of new reward functions were investigated and analyzed.

2. The proposed methodology of the dimension delay estimation could help to improve reinforcement solutions of time series prediction problems.

Practical significance

1. Based on the result of this paper, it is possible to build or improve trading algorithms with the proposed methodology.

2. The studied method could potentially be beneficial to other domains of time series prediction by introducing the goal-oriented reward function.

1.4 Research structure

The first chapter of the thesis is an introduction to the problem. It contains a problem statement, research objectives, and research significance both to the theoretical and practical applications.

The second chapter is an overview of the methodologies that are currently used in time series prediction and of different approaches to the problem.

Reinforcement methodologies are discussed in the third chapter, where the comprehensive description of the applied methods is provided.

The fourth chapter is dedicated to the datasets and application tools that are used in the current research.

The implementation description and the methodology proposition are provided in the fifth chapter.

Results of the research experiments and their explanations are provided in the sixth chapter.

The conclusion of the thesis is expressed in the seventh chapter.

Chapter 2. Time series

2.1 Introduction

A time series is a sequence of data points taken at equally spaced points in time. Examples of time series are countless: the number of sunspots, daily prices of different financial instruments, the number of customers in a shop. Time series are used in plenty of different areas, including econometrics, astronomy, engineering, weather forecasting, and financial forecasting, as well as any domain that involves temporal measurements [1].

Time series analysis comprises methods that can be used for extracting meaningful statistics and characteristics from time series data to explain or understand the underlying process. One of the popular areas is regression analysis. Regression analysis is used to understand if two or more independent processes are correlated and the values of one series affect the values of another time series.

Time series forecasting methods are used to construct a model to predict future values based on previous observations. Time series prediction methods differ from other prediction methods because the data has a natural ordering. There are time series methods, where time ordering is an essential part, and cross-sectional methods, where there is no particular time ordering and time is simply incorporated into the dataset. Both kinds of methods may be used effectively in a variety of tasks [2].

The stochastic models reflect the fact that the observations which are closer to the current state are more meaningful than observations which are further apart. Time series models make use of value ordering, and the values of a given period can be expressed using values from the previous periods.

Time series analysis methods are divided into non-parametric methods and parametric ones. The parametric methods are based on the hypothesis that the underlying process can be approximated using some finite number of parameters. Parametric methods include moving average and autoregressive models. In these approaches, the model estimates the stochastic process. On the other hand, non-parametric approaches estimate covariance and spectrum of the underlying process without any structure investigation.

Linear models for time series are a variety of types of autoregressive (AR), moving average (MA), and integrated (I) models. All those models predict future values based on a linear combination of previous data points. The ARIMA model, which combines those three types of models and can be extended with seasonality (SARIMA), is widely applied to different types of problems [3].

There are also models that can represent non-linear behavior in time series, such as heteroskedasticity, which means that the variance of the time series changes over time. These models are called autoregressive conditional heteroskedasticity (ARCH) models, and they include a wide range of different implementations. Neural networks can also be used in time series prediction tasks [4].

2.2 Chaotic time series

Chaotic dynamical systems are frequently observed in the real world. Chaotic systems are ubiquitous and can be understood through time series analysis [5]. The behavior of the stock market, tornado, turbulence, and weather are a few examples of chaotic systems, which have a significant influence on human beings.

It is believed that Poincaré did the first works on chaos in the late 19th century. He studied the three-body problem in the case when one of the bodies is negligibly small compared to the other two and found out that the behavior of this system cannot be explained precisely. Later, in 1963, Lorenz introduced the "butterfly effect," one of the critical concepts of chaos theory, while studying weather prediction. The main conceptions of chaos theory were formulated in “Period three implies chaos” by Li and Yorke in 1975 [6].

The most striking feature of a chaotic system is the unpredictability of its future. Mainly it is caused by the “butterfly effect,” or, in other words, by sensitivity to initial conditions. A chaotic system can be detected with the Lyapunov exponent value, which characterizes the rate of separation of infinitesimally close trajectories. If the Lyapunov exponent is positive, then two points tend to move apart over time at an exponential rate. If it is negative, then points converge exponentially quickly, and a zero value corresponds to a bifurcation. A positive Lyapunov exponent does not, in general, prove chaos, but it can be used as a signal of it [7, 8].

It is an essential concept that a chaotic system sometimes may be characterized by the presence of attracting sets, or attractors, in the phase space. These are bounded subsets to which regions of initial conditions of nonzero phase space volume asymptote as time increases. Once a trajectory of the system enters the attractor, it stays in it forever if there is no change in the external environment. The attractor is an invariant set, which means that it is an image of itself under time evolution. This property is called ergodicity [7].

For practical tasks, it is challenging to obtain the dynamic equations needed to reconstruct the phase space. For such cases, there is an assumption that the observed time series comes from the attractor of the unknown system and contains information about the attractor. Then there are methods to estimate the properties of the attractor, such as its dimension and its degree of sensitivity to initial conditions. Takens suggested one of those methods, called the delay-coordinate embedding technique [9]. He showed that, under some general conditions, the system can be reconstructed from the time series.

Takens's embedding theorem states that a scalar sequence of measurements s(t) from a generic dynamic system includes all the information required to reconstruct the state space completely. In particular, there exist a scalar m (the embedding dimension), a scalar τ (the time delay), and a function f such that

$$s(t + \tau) = f\big(s(t),\, s(t - \tau),\, \ldots,\, s(t - (m - 1)\tau)\big).$$

Takens proved that if m ≥ 2d + 1, where d is the dimension of the chaotic attractor, there exists a one-to-one correspondence between the reconstructed and the original state space [9, 10].

Reconstruction of the phase space and estimation of the attractor dimension is significant in prediction tasks. The underlying real-world system is usually unknown, so the dimension and delay have to be estimated from the time series data.

2.3 Estimation of embedding dimension and delay

A lower bound on the embedding dimension is given by Takens's theorem. To have a high-quality representation of the dynamical system, the embedding dimension should also be large enough. However, very high embedding dimensions, which can be essential for some systems, can be very computationally expensive, and it can be necessary to reduce the dimensionality after estimation [11].

The embedding dimension can be estimated using the False Nearest Neighbors (FNN) method [12]. It examines the fraction of nearest neighbors as a function of the embedding dimension. The minimum embedding dimension is found when most of the nearest neighbors do not move apart significantly in the next higher-dimensional embedding. That means that the algorithm eliminates "false neighbors," which are points that lie close together due to projection and are separated in a higher embedding dimension. A false nearest neighbor can be determined as any neighbor for which

$$\frac{\left| x_{i + m\tau} - x_{j + m\tau} \right|}{R^{(m)}_{ij}} > R_{tol},$$

where i and j are the times of the reference point and its neighbor, R_tol is a threshold, and R^(m)_ij is the distance between the two points in the phase space with embedding dimension m. The criterion is used together with a loneliness tolerance threshold to eliminate those points which are nearest neighbors without being close to the reference point.

One of the main drawbacks of this algorithm is the fact that the quality of the results significantly depends on the choice of the threshold, which is required for the stopping criteria. Another drawback of the method is that the number of FNN increases in the case of noisy data [13].
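To make the criterion concrete, below is a minimal Python sketch of the FNN fraction computation; the delay-embedding helper, the brute-force neighbor search, and the threshold value r_tol are illustrative assumptions rather than the implementation used in this thesis.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Delay vectors [x_t, x_{t+tau}, ..., x_{t+(m-1)tau}] as rows (x is a 1-D array)."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(m)])

def fnn_fraction(x, m, tau, r_tol=15.0):
    """Fraction of false nearest neighbors when going from dimension m to m + 1."""
    emb = delay_embed(x, m, tau)            # embedding in dimension m
    emb_next = delay_embed(x, m + 1, tau)   # embedding in dimension m + 1
    n = len(emb_next)                       # points available in both embeddings
    false_count = 0
    for i in range(n):
        dists = np.linalg.norm(emb[:n] - emb[i], axis=1)
        dists[i] = np.inf                   # exclude the point itself
        j = int(np.argmin(dists))           # nearest neighbor in dimension m
        r_m = dists[j]
        extra = abs(emb_next[i, -1] - emb_next[j, -1])   # growth along the new coordinate
        if r_m > 0 and extra / r_m > r_tol:
            false_count += 1
    return false_count / n
```

The minimal embedding dimension is then the smallest m for which this fraction drops close to zero.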

Takens's theorem doesn't provide any information about the choice of the time delay, so the delay also has to be estimated from the data.

One of the methods to estimate the time delay is through mutual information. The idea is to find the delay which corresponds to the first minimum of the mutual information. To calculate the mutual information, we need to create a histogram for the probability distribution of the data. The mutual information for time delay τ can be calculated as

$$I(\tau) = \sum_{i, j} p_{ij}(\tau) \ln \frac{p_{ij}(\tau)}{p_i\, p_j},$$

where p_i is the probability that a value of the series falls into the i-th bin of the histogram and p_ij(τ) is the probability that x(t) falls into bin i while x(t + τ) falls into bin j. As long as the histogram is fine enough, the mutual information doesn't depend on the particular choice of the histogram.

There are some other methods, such as the autocorrelation function [15], which can be used to estimate the time delay. The time delay estimate is crucial for attractor reconstruction; however, the mutual information method doesn't always give consistent results [16].
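A minimal sketch of the histogram-based mutual information estimate and the first-minimum rule described above; the number of bins and the candidate delay range are illustrative assumptions.

```python
import numpy as np

def mutual_information(x, tau, bins=16):
    """Histogram estimate of the mutual information between x(t) and x(t + tau)."""
    x1, x2 = x[:-tau], x[tau:]
    joint, _, _ = np.histogram2d(x1, x2, bins=bins)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    nonzero = p_xy > 0
    return float(np.sum(p_xy[nonzero] *
                        np.log(p_xy[nonzero] / np.outer(p_x, p_y)[nonzero])))

def first_minimum_delay(x, max_tau=50, bins=16):
    """Delay at the first local minimum of the mutual information curve."""
    mi = [mutual_information(x, tau, bins) for tau in range(1, max_tau + 1)]
    for k in range(1, len(mi) - 1):
        if mi[k] < mi[k - 1] and mi[k] < mi[k + 1]:
            return k + 1                    # index k corresponds to delay k + 1
    return int(np.argmin(mi)) + 1
```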

2.4 Time series models

Different modifications of the autoregressive moving average model (ARMA) are the most popular class of linear models. These models include purely autoregressive (AR) and purely moving-average (MA) models as particular cases. The main idea of this model is to replicate linear relationships between the lagged variables. Usually, autoregressive integrated moving average models (ARIMA), as a generalization of ARMA, are used [17].

An autoregressive model of order p (AR(p)) is defined as

$$X_t = \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t,$$

where φ_1, ..., φ_p are the model parameters and ε_t is white noise.

The autoregressive model is a linear regression model without an intercept, which expresses the current value through past values. This model is easy to implement and is arguably one of the most popular time series models. The autoregressive model may also be made non-linear to represent non-linear relationships between the current value and past lagged ones.

A moving average model of order q (MA(q)) is defined as

$$X_t = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i},$$

where θ_1, ..., θ_q are the model parameters and ε_t, ε_{t-1}, ... are white noise error terms.

Moving average models represent a process as a moving average of white noise. These models are more difficult to implement; however, moving average models are theoretically tractable [18].

The autoregressive moving average (ARMA) model is a combination of the two described models. ARMA(p, q) is defined as

$$X_t = \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}.$$

ARMA models are widely used to approximate a variety of stationary processes, due to their flexibility in predictions.

The drawback of ARMA models is that they are useful only for stationary problems. Real-world data frequently has a trend and often shows some periodic or seasonal behavior. By differencing the time series and extracting the trend, it is possible to reduce the nonstationary components of the data. For those cases, when the data needs to be differenced, the autoregressive integrated moving average (ARIMA) model is introduced. The order of the ARIMA model is denoted with three variables (p, d, q), where p and q come from the ARMA process and d is the integration order.

A time series X_t is called an ARIMA(p, d, q) process if (1 − B)^d X_t is a stationary ARMA(p, q) process, where B is the backshift operator, which is defined as B X_t = X_{t−1}.

Even though linear models are popular, easy to explain, and comparatively easy to implement, some problems require non-linear dependencies. Non-linearity can be introduced by adding non-linear variables into regression models [19].
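For reference, a short sketch of fitting an ARIMA model with the statsmodels library (the thesis experiments themselves rely on Prophet and LSTM); the toy series and the order (2, 1, 1) are arbitrary illustrative choices.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# toy series: a noisy random walk with drift, just to have something to fit
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.1, 1.0, size=300))

model = ARIMA(y, order=(2, 1, 1))     # p=2 AR lags, d=1 differencing, q=1 MA lag
fitted = model.fit()
forecast = fitted.forecast(steps=10)  # predict the next 10 points
```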

2.4.1 Prophet library

One of the libraries which combine a nonlinear fitting approach with a Bayesian approach is the Prophet library. Facebook developed the Prophet library and distributes it as open-source software for Python and R.

In essence, the Prophet library implements a decomposable time series model, which contains three main components: trend, seasonality, and holidays. The following equation combines all three components:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon_t.$$

The trend function g(t) explains non-periodic behavior. The seasonal component s(t) is responsible for modeling periodic changes associated with weekly and annual seasonality. The function h(t) represents holidays and any anomalous occasions during the modeling period. There is an assumption that the error term ε_t is normally distributed. This specification is similar to a generalized additive model (GAM). It means that the task is framed as a curve-fitting problem, which differs from standard generative ARIMA approaches and has its benefits.

The main advantage of the proposed approach is flexibility in choosing multiple seasonality periods. Also, the Prophet library doesn't require interpolating missing values, and the fitting process is fast in comparison with ARIMA models.

Weekly seasonality is modeled using dummy variables, while the annual seasonality is modeled by a Fourier series. The trend is a piecewise linear or logistic function. The linear case is straightforward; the logistic function of the form g(t) = C / (1 + exp(−k(t − m))) allows modeling growth with saturation, where the growth rate decreases as the value approaches the capacity C. A typical example is the growth of an application or site audience.

Among other things, the library can select the optimal points of trend change based on historical data. However, they can also be set manually (for example, if you know the release dates of the new functionality, which greatly influenced the key indicators) [20].

Before using the Prophet library, it is recommended to apply the Box-Cox transformation [21]:

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \ln y, & \lambda = 0, \end{cases}$$

where the parameter λ is estimated using the profile likelihood function. The Box-Cox transformation helps to stabilize the variance of the time series. After the algorithm is applied, the inverse Box-Cox transformation is used to go back to the initial values [22].
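A minimal sketch of this pipeline with the fbprophet API and scipy's Box-Cox helpers, assuming a DataFrame with the ds/y columns that Prophet expects; the file name and the 11-year custom seasonality (which mirrors the sunspot experiment described later) are assumptions.

```python
import pandas as pd
from scipy.stats import boxcox
from scipy.special import inv_boxcox
from fbprophet import Prophet

df = pd.read_csv("sunspots.csv")            # assumed columns: ds (date), y (value)
df["y"], lam = boxcox(df["y"] + 1.0)        # shift by 1 to keep values strictly positive

model = Prophet(yearly_seasonality=False, weekly_seasonality=False)
model.add_seasonality(name="solar_cycle", period=11 * 365.25, fourier_order=5)
model.fit(df)

future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)
forecast["yhat"] = inv_boxcox(forecast["yhat"], lam) - 1.0   # back to the original scale
```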

2.5 Recurrent neural networks

Neural networks, in general, are a popular tool in different machine learning domains. There are plenty of different types of them, which can be applied to time series problems as well as to cross-sectional ones. Recurrent neural networks (RNNs) are networks for sequential data. Time series data can be regarded as sequential, along with voice and text data. The main feature of recurrent neural networks is that not only the input data is used in the training process but also previous outputs. That means that some form of memory is introduced in RNNs. The idea of memory fits the time series prediction task well because there is an assumption that the time series values are connected.

Long short-term memory and the gated recurrent unit are the two popular, efficient RNN models [23, 24, 25]. LSTM is discussed further in the current research.

2.5.1 LSTM

Long short-term memory (LSTM) is a special kind of recurrent neural network architecture capable of learning long-term dependencies. LSTMs are specifically designed to avoid the problem of long-term dependency. Memorizing information for long periods is their typical behavior and not something that they struggle to learn.

Any recurrent neural network has the form of a chain of repeating modules of a neural network. In a typical RNN, the structure of the module is straightforward; for example, it can be a single layer with a tanh (hyperbolic tangent) activation function. The LSTM structure also resembles a chain, but the modules look different. Instead of a single neural network layer, they contain four layers, and these layers interact in a special way.

On figure 1, the main structure of the LSTM neural network is shown. The key component of the LSTM is the cell state -- a horizontal line that runs along the top of the circuit.

Figure 1. LSTM architecture

The state of the cell resembles a conveyor belt. It passes directly through the whole chain, participating only in a few linear transformations. Information can easily flow through it without changing. Moreover, LSTM can remove information from the cell state; this process is governed by structures called gates (filters). Gates allow information to pass through based on certain conditions. They consist of a sigmoidal neural network layer and a pointwise multiplication operation. The sigmoidal layer returns numbers from zero to one, which indicate how much of each block of information should be passed further along. Zero, in this case, means "let nothing through," and one means "let everything through." There are three such gates in LSTM to protect and control the cell state.

The first step in LSTM is to determine what information can be thrown out of the cell state. This decision is made by a sigmoidal layer called the "forget gate layer." This layer returns a number from 0 to 1 for each number from the state of the cell, where one means "fully preserve" and zero means "completely drop."

The next step is to decide which new information may be stored in the cell state. This step consists of two parts. First, a sigmoidal layer called the "input gate layer" determines which values should be updated. The tanh layer then builds a vector of candidate values that can be added to the cell state.

Afterward, it is time to replace the old state of the cell with the new state: the decisions on how much information to forget and how much to update are applied, so at this step the corresponding multiplication and addition are performed.

Finally, the decision on what information we want to receive at the output has to be made. The output is based on the cell state with the gates applied to it. First, we apply a sigmoidal layer that decides what information from the cell state may proceed. Then the cell state values pass through the tanh layer to obtain output values in the range from −1 to 1 and are multiplied with the output values of the sigmoidal layer, which allows outputting only the required information.
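For reference, the gates described above are usually written as the following standard LSTM update equations, where σ is the sigmoid function, ⊙ denotes element-wise multiplication, x_t is the input, h_t the output, and C_t the cell state:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) &&\text{(input gate)}\\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) &&\text{(candidate values)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{(new cell state)}\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) &&\text{(output)}
\end{aligned}
$$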

Chapter 3. Reinforcement learning

3.1 Introduction

Behind reinforcement learning lies the idea that we can learn by interacting with the environment. When a child plays with its toys, it learns through observing the consequences of its actions. The knowledge about cause and effect and about the sequences of actions that lead to the goal is essential to the natural learning process. Learning through interaction is a fundamental concept underlying nearly all principles of learning. Reinforcement learning is a computational approach to learning through interaction [26].

Reinforcement learning is a machine learning subsection that studies how an agent should act in an environment to achieve a goal. A learning agent must be able to examine the state of the environment and make the right decisions to achieve the desired goal. The environment is usually formulated as a Markov decision process (MDP) [27, 28], which captures three aspects: action, sensation, and goal.

Reinforcement learning differs from supervised learning, where an external supervisor labels the training set and each example has the right action which the system should take in that situation. The objective of such learning is to generalize over all pairs of situations and actions and to act correctly in situations not present in the training set. Supervised learning could be implemented for learning from interaction; however, it is difficult to obtain examples that are representative enough of all possible situations in which an agent has to act.

Reinforcement learning also does not belong to the unsupervised machine learning category, which is about finding patterns, hidden structure, and behaviors in unlabelled data. Since reinforcement learning deals with unlabelled data, it may seem to be an unsupervised learning problem; however, the goal of reinforcement learning is to maximize the reward instead of trying to find a hidden pattern or structure.

One of the challenges that differentiate reinforcement learning from other types of algorithms is the trade-off between exploration and exploitation [29, 30]. From the agent's point of view, to obtain many rewards, it must take actions that it has already tried before, because that is an effective way to gain large rewards. On the other hand, to make the right decisions and discover new useful actions, the agent must explore the environment. In terms of the reward, the agent sometimes must prefer an unknown, possibly less effective action instead of exploiting actions which it has experienced before. The difficulty is that neither exploration nor exploitation can be used solely without failing at the job. The agent must balance between those two paths to succeed at the task.

Reinforcement learning applies to various fields, such as robotics, elevator control, telecommunications, and checkers [31, 32]. Also, it can be beneficial in engineering and scientific disciplines, since reinforcement learning integrates easily with optimization, statistics, and other mathematical subjects. For example, the ability of some reinforcement learning methods to learn with parameterized approximators addresses the classical "curse of dimensionality" in operations research and control theory [33].

3.2 Main elements of reinforcement learning

The main elements of a reinforcement learning system are a policy, a reward signal, a value function, and sometimes a model of the environment.

On figure 2, the fundamental process of reinforcement learning is represented. The agent at time step t, being in state S_t, takes action A_t; the environment, in reaction to the taken action, returns the reward R_{t+1} and the next state S_{t+1}.

A policy defines how the agent reacts to a given state at a given time. A policy maps states of the environment to actions that the agent takes when it is in those states. According to the given policy, the agent picks an action at each moment. The policy may be a function or a lookup table, and, in general, policies may be stochastic.

Figure 2. Reinforcement learning process

A reward signal defines whether the taken actions lead to the goal or not. At each time step, after the action has been taken, the agent receives a reward. The agent aims to maximize the total reward it gets in the long run. The reward distinguishes good actions from bad ones for the agent in each situation. The reward may be used to alter the current policy: for example, if the agent gets a low reward after the action selected by the policy, then the policy needs to be changed so that some other action is picked in this situation.

Whereas the reward indicates what is good in the current state, the value function indicates what could be beneficial in the long run. The value of the current state is calculated as the sum of rewards which the agent is likely to receive in the future, starting from that state. Some states can have low immediate rewards but high values if they are likely to lead to other states with high rewards. The values have to be estimated to achieve higher future rewards, and efficient estimation of state values is one of the essential tasks of reinforcement learning.

Behind the model of the environment lies the idea of mimicking the behavior of the environment to predict the next state or reward. A model is vital for planning, which makes it possible to evaluate possible future actions in each situation. The methods which use models are called model-based methods, while model-free methods use only a trial-and-error technique to gain information about the environment.

The main objective of the agent is to maximize the total reward it receives in the long run. If R_{t+1}, R_{t+2}, R_{t+3}, ... is the sequence of rewards received after time step t, then the goal is to maximize the expected return. The expected discounted return is defined as

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where γ is a parameter, 0 ≤ γ ≤ 1, called the discount factor. The discount factor defines how valuable the rewards received in the future are; the most valuable reward is the one received immediately. The agent with γ = 0 is called myopic and maximizes only the current reward; on the other hand, the agent with γ close to 1 becomes more farsighted and takes future rewards more seriously. For episodic processes, when the agent-environment interaction has a finite number of stages ending in a terminal state at time T, the return can be defined as

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}.$$

The value function is a function of a state (or of a state-action pair), and it is an estimate of how good it is for the agent to be in a given state or to perform a given action in it. If the agent is following policy π at time t, then π(a | s) is the probability that A_t = a if S_t = s. When the agent starts in state s under a policy π and follows it after that, then v_π(s) is the value of state s under policy π and can be defined by

$$v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right]$$

for all s, where E_π[·] is the expected value of the random variable given that the agent follows policy π, and t is any time step.

Similarly, q_π(s, a), the value of taking action a in state s under policy π, which is called the action-value function, can be defined by

$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s,\, A_t = a \right].$$

The most important property of the value function is that for any policy π and any state s

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right].$$

This equation is called the Bellman equation for v_π and expresses the relationship between the value of a state and the values of its successor states. The value function v_π is the unique solution to its Bellman equation.

During the reinforcement learning process, the task is to find the best policy possible to maximize the reward. Moreover, there is always at least one policy that is better than or equal to all other policies, called the optimal policy, which corresponds to the optimal state-value and action-value functions:

$$v_*(s) = \max_{\pi} v_\pi(s), \qquad q_*(s, a) = \max_{\pi} q_\pi(s, a).$$

For the exploration process, the ε-greedy policy is defined as a policy that takes the greedy (most profitable) action with probability 1 − ε and a random action with probability ε. This policy allows the algorithm to explore the environment better.
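A minimal sketch of ε-greedy action selection over a tabular action-value estimate; the Q array shape and the epsilon value are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

# example: a 4-state, 3-action value table, all zeros
Q = np.zeros((4, 3))
action = epsilon_greedy(Q, state=2, epsilon=0.1)
```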

3.3 Optimistic initial values

One of the crucial aspects of the discussed reinforcement learning algorithms is the initial action-value estimate. In statistical terms, methods which are based on any action-value function estimation are biased by their initial estimates. In practice this bias is not a problem and can even help the algorithm to perform better. The downside is that the initial values have to be picked by the user instead of simply being set to zero. The advantage of the optimistic initial estimate is that it lets the algorithm encode prior knowledge about the level of the reward values which can be expected [37].

Moreover, optimism in the initial values makes the algorithm explore more: if the received reward is less than the initial estimate, the agent moves to other actions to explore. Even if only greedy actions are selected, the agent does a fair amount of exploration.
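A short sketch of optimistic initialization for a tabular Q matrix; the state/action counts and the optimistic constant are assumptions, and the constant should simply exceed the largest reward expected in the task.

```python
import numpy as np

# illustrative sizes: discretized volatility levels x (dimension, delay) pairs
n_states, n_actions = 5, 15
optimistic_value = 5.0        # assumption: deliberately above any realistic reward

# Every action initially looks better than it really is, so even a mostly
# greedy agent tries each action at least once before the estimates settle.
Q = np.full((n_states, n_actions), optimistic_value)
```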

Chapter 4. Data and tools

4.1 Sunspots time series

The sunspots time series is an accessible benchmark dataset [38]. The dataset is openly distributed by the Royal Observatory of Belgium [39].

The dataset contains the monthly average number of sunspots. Spots on the Sun are areas generated by a strong magnetic field. The spots are visible on the Sun's surface because they are a lot cooler than the surrounding area. Sunspots appear in cycles and typically last for several days, but sometimes they stay for years. The magnetic field of these regions of the Sun is a lot stronger than the Earth's magnetic field. The spots may appear singly or in groups. The sunspot cycle varies both in size and in length, which makes it harder to predict and analyze [40].

The number of sunspots is correlated with solar activity. A larger number of sunspots corresponds to higher activity, while the minimum activity is related to very few spots. The Earth's biosphere, space weather, and technology are profoundly affected by solar activity [41].

In this paper, the analysis of the number of sunspots from 1947 till 1992 was performed. On figure 3, the chosen period is shown.

The record of the number of visible sunspots has been collected by different researchers and techniques. Also, various attempts to predict the number of sunspots have been made. On the sunspots dataset, dimension and delay estimation is frequently performed, as well as different types of neural network predictions [42, 43, 44].

Figure 3. Sunspot time series

4.2 CORN time series

For the trading extensions of the algorithms, the CORN prices were picked. The daily prices of CORN were extracted with the Alpha Vantage API [45]. CORN prices were gathered for 5 years and were adjusted for inflation in the commodity sector.
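As an illustration of the data collection step, here is a hedged sketch of pulling daily prices from the documented Alpha Vantage REST endpoint with the requests library; the API key placeholder is an assumption, and the inflation adjustment described above is not shown.

```python
import pandas as pd
import requests

API_KEY = "YOUR_ALPHA_VANTAGE_KEY"          # placeholder: a free key from alphavantage.co
params = {
    "function": "TIME_SERIES_DAILY",        # documented daily prices endpoint
    "symbol": "CORN",
    "outputsize": "full",
    "apikey": API_KEY,
}
payload = requests.get("https://www.alphavantage.co/query", params=params).json()

raw = payload["Time Series (Daily)"]        # documented key of the daily endpoint
prices = (
    pd.DataFrame.from_dict(raw, orient="index")
    .rename(columns={"4. close": "close"})
    .astype(float)
    .sort_index()
)
prices.index = pd.to_datetime(prices.index)
daily_close = prices["close"]               # series used in the experiments
```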

Commodity price predictions could be useful for illustrating and analyzing the economic situation. Moreover, taking into consideration that the methods analyzed in this research are based on the previous values of the time series, commodity prices are a suitable choice, as they have relatively stable behavior compared to other areas of stock prices.

On figure 4, the real prices (blue) and the inflation-adjusted prices (orange) are shown.

Figure 4. CORN prices

4.3 Technological stack

All code was written in Python 3.6, using Jupyter Notebook in Google Colab as a framework. Google Colaboratory is a cloud service based on Jupyter Notebooks. The main feature of Google Colab is that it provides free-of-charge access to a GPU. For neural network prediction tasks, the usage of a GPU helps to decrease the runtime significantly [46]. Google Colab is an easy-to-use platform, which is fully configured for deep-learning tasks.

For the neural network model, Keras running on top of the TensorFlow backend was applied. Keras is a high-level neural network API, which is a convenient tool for designing neural networks in Python. For the Prophet implementation, the Facebook Prophet API was applied, and for other statistical calculations, Scikit-learn was used.

Chapter 5. Implementation description

The purpose of the paper is to study the performance of the reinforcement learning method on time series prediction tasks. The goal is to estimate the embedding dimension and delay for LSTM prediction using reinforcement learning. There are several elements of reinforcement learning, addressed in the previous chapter, that have to be defined. There is a paper about reinforcement learning in this domain that proposes an algorithm called the reinforcement-learning-based dimension and delay estimator (RLDDE); the current implementation is similar to the one proposed in that paper [47].

5.1 Reinforcement learning domain

1. Action. The action is a pair of embedding dimension and time delay to be estimated. Before the learning process starts, the ranges of the dimension and the delay have to be set. This particular step is important because if the ranges of dimensions and delays are set incorrectly, it is impossible to find the optimal values. The lower boundary of the dimension and delay can be as low as desired; however, when setting the upper boundary, it is necessary to consider the smallest length of the dataset. The ranges for the dimension and delay have to be set in correspondence with the dataset domain and its properties. When the ranges are set, all different combinations of (dimension, delay) values form the action space.

2. State. For the state, the standard deviation of the training dataset is used. The standard deviation potentially represents the volatility of the data, and changes in the standard deviation can be used as a signal that something in the process has changed. In this implementation, different pairs of dimension and delay are defined for different values of the standard deviation of the time series. The standard deviation is a continuous variable, which is discretized so that the state space for this implementation is finite.

3. Episodes. Reinforcement learning is commonly performed in an episodic manner. For example, in game playing, one game can represent one episode. Some processes naturally have episodes, but in the current situation we have to define the episodes ourselves. At the beginning of each episode, the length of the sliding window is picked randomly. The window slides from the beginning of the time series to the end of it. The observations from the window are divided into the test and train datasets; for the train part, the state is calculated, and the action is picked according to the policy. In the next step, the window slides, and if the next state of the training dataset differs from the previous one, the Q matrix is updated. On figure 4, the episodic process is illustrated.

Figure 4. Episodes

Figure 5. Reinforcement learning process

4. Policy. The ε-greedy policy is picked for this task because it allows us to explore the environment and converge to the optimal solution at the same time.

To summarize the reinforcement learning process for dimension and delay detection process, the idea of the structure is represented in figure 5.
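To make the described domain more concrete, here is a hedged sketch of the action space, the state discretization, and one episode of the sliding-window procedure. The bin edges, the window lengths, the plain tabular Q-learning update (the experiments in chapter 6 use double Q-learning), and the score_fn helper, which is assumed to train the LSTM with a given (dimension, delay) pair and return the test RMSE, are illustrative assumptions rather than the thesis code.

```python
import itertools
import numpy as np

# action space: all (dimension, delay) pairs within the chosen ranges
dimensions = range(1, 6)
delays = range(1, 4)
actions = list(itertools.product(dimensions, delays))

# state: index of the bin into which the training-window standard deviation falls
std_bins = np.array([5.0, 15.0, 30.0, 60.0])       # assumed bin edges
n_states = len(std_bins) + 1

def state_of(window):
    return int(np.digitize(np.std(window), std_bins))

def run_episode(series, Q, score_fn, alpha=0.1, gamma=0.9, epsilon=0.1, rng=None):
    """One episode: slide a randomly sized window over the series and update Q.

    score_fn(train, test, m, tau) is assumed to train the LSTM with embedding
    dimension m and delay tau and to return the RMSE on the test part.
    """
    rng = rng or np.random.default_rng()
    window = int(rng.integers(100, 300))            # random window length per episode
    horizon = window // 4                           # test part of the window
    for start in range(0, len(series) - window, horizon):
        chunk = series[start:start + window]
        train, test = chunk[:-horizon], chunk[-horizon:]
        s = state_of(train)
        if rng.random() < epsilon:                  # epsilon-greedy action choice
            a = int(rng.integers(len(actions)))
        else:
            a = int(np.argmax(Q[s]))
        m, tau = actions[a]
        reward = 1.0 / score_fn(train, test, m, tau)        # inverse test RMSE reward
        next_train = series[start + horizon:start + window]  # training part of the next window
        s_next = state_of(next_train) if len(next_train) else s
        # plain tabular Q-learning update (the experiments use double Q-learning)
        Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])

# Q = np.zeros((n_states, len(actions)))   # or optimistic initial values, see chapter 3
```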

5.1.1 Implementation methodology

Overall the procedure of learning is divided into these steps:

Step 1. Select the part of the dataset for reinforcement learning. For study purposes, the dataset is divided into two parts: the first part is for reinforcement learning, and the second part is for tests.

Step 2. On the first part of the dataset, tune the parameters of the model; it has to perform well and have stable behavior on datasets of different lengths with random dimension-delay pairs.

Step 3. Using the selected model, perform the reinforcement learning algorithm on the first dataset. As a result, there must be a policy which allows picking optimal dimension-delay pairs according to the standard deviation of the training set.

Step 4. Using the second part of the dataset, evaluate the obtained policy. At this stage the policy is greedy and selects the best known pair for each training dataset during the time series cross-validation procedure.

5.2 Time series cross-validation

For all experiments, the cross-validation technique for time series prediction was applied. The time series cross-validation differs from standard cross-validation, because the data cannot be shuffled, and the splits must be done according to the time series structure of the data [48]. The main idea of cross-validation splits is illustrated in figure 6.

Figure 6. Time series cross-validation
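A minimal sketch of this splitting scheme using scikit-learn's TimeSeriesSplit; the placeholder series and the number of splits are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(400)                    # placeholder series of 400 observations
tscv = TimeSeriesSplit(n_splits=3)         # expanding training window, ordered test part

for fold, (train_idx, test_idx) in enumerate(tscv.split(series), start=1):
    train, test = series[train_idx], series[test_idx]
    print(f"split {fold}: train={len(train)} obs, test={len(test)} obs")
```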

5.3 Metrics

For evaluation of the quality of the obtained predictions, two metrics were used. The first metric is RMSE, the root-mean-square error. This metric is frequently used to measure the differences between predicted and actual values. RMSE is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2},$$

where y_i and ŷ_i are the actual and predicted values, respectively.

The other metric which was used is SMAPE, the symmetric mean absolute percentage error. This metric represents percentage errors and is calculated as

$$\mathrm{SMAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left( |y_i| + |\hat{y}_i| \right) / 2}.$$
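Both metrics are straightforward to implement; a short sketch following the formulas above:

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def smape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(100.0 * np.mean(np.abs(y_pred - y_true) / denom))
```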

5.4 Comparison with other methods

For each cross-validation pair, the FNN algorithm was used for embedding dimension estimation and the mutual information criterion for time delay estimation. Both algorithms were described in the second chapter.

5.5 Experiment 1. Sunspots dataset

The dataset was divided into two parts: the first part was used for the reinforcement learning algorithm, and the second part of the dataset for testing and comparison with other methods.

For this dataset prediction with Prophet was performed.

5.6 Experiment 2. CORN prices dataset

The dataset was divided into two parts: the first part was used for the reinforcement learning algorithm, and the second part of the dataset for testing and comparison with other methods. The influence of different reward functions was analyzed.

Prophet predictions were made as well for comparison.

Chapter 6. Experiment results

6.1 Experiment 1. Sunspots dataset

6.1.1 Prophet predictions

Prophet on Sunspots data shows poor prediction quality.

For the three cross-validation parts, shown in figure 4, the following results were obtained using the Prophet algorithm with no parameter tuning:

Prophet on the sunspots dataset

Split | Number of observations | Dataset type | RMSE   | SMAPE
1     | 67                     | test         | 125.57 | 113.17%
1     | 69                     | train        | 19.02  | 19.95%
2     | 67                     | test         | 45.39  | 46.58%
2     | 136                    | train        | 30.21  | 28.35%
3     | 67                     | test         | 99.08  | 84.06%
3     | 203                    | train        | 70.79  | 67.24%

Figure 5. Trend and seasonality Prophet.

Because the seasonality of the sunspot number is not yearly and is approximately 11 years, the Prophet with custom seasonality was applied as shown in figure 6.

Figure 6. Prophet added seasonality.

The obtained results are shown in table 7 below. Even though the results are better than in the previous scenario, the quality of the predictions is relatively bad as it is shown in figure 7.

Table 7. Prophet on the sunspots dataset with added seasonality

Split | Number of observations | Dataset type | RMSE   | SMAPE
1     | 67                     | test         | 124.64 | 111.77%
1     | 69                     | train        | 15.81  | 14.10%
2     | 67                     | test         | 47.84  | 48.88%
2     | 136                    | train        | 27.95  | 27.57%
3     | 67                     | test         | 101.18 | 85.78%
3     | 203                    | train        | 68.81  | 64.20%

Figure 7. Prophet predictions

6.1.2 Reinforcement learning dimension delay estimation

The LSTM neural network with 50 neurons, 300 epochs, the Adam optimizer, and early stopping was used. This configuration was chosen because it showed the best stability and quality results on the first part of the dataset. On figure 8, the code listing is shown.

Figure 8. LSTM implementation
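Since the listing is only available as an image, here is a hedged Keras sketch matching the configuration described above (a single LSTM layer with 50 units, the Adam optimizer, early stopping, up to 300 epochs); the input shaping, batch size, validation split, and patience are assumptions.

```python
from keras.callbacks import EarlyStopping
from keras.layers import Dense, LSTM
from keras.models import Sequential

def build_and_fit(X_train, y_train, epochs=300):
    """X_train shape: (samples, dimension, 1) - delay-embedded windows."""
    model = Sequential()
    model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Dense(1))
    model.compile(loss="mean_squared_error", optimizer="adam")
    stopper = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
    model.fit(X_train, y_train, epochs=epochs, batch_size=32,
              validation_split=0.1, callbacks=[stopper], verbose=0)
    return model
```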

The double Q-learning algorithm was applied and run for 350 iterations. As a reward function, the inverse RMSE on the test set was used. On figure 9, the histogram of the test RMSE on the first and the last 100 iterations is shown. The ε-greedy policy means that in the last iterations some actions may still be picked randomly, which is why we can see the long right tail on the figure.

Figure 9. Sunspots RMSE test histogram

Overall, the histogram shows that during the reinforcement learning process the algorithm works correctly and maximizes the reward by minimizing the RMSE on the test dataset.

After the reinforcement stage, we obtain the Q matrix (the sum of the two matrices obtained in double Q-learning). Now we can pick the actions greedily and obtain the optimized results, which are shown in table 8. The first column represents the number of observations in the test and train sets, respectively. The LSTM was trained 10 times, and the fourth column is the standard deviation of the RMSE obtained on the test dataset.
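For reference, a hedged sketch of a tabular double Q-learning update that maintains the two tables whose sum is used above; the learning rate and discount factor are illustrative assumptions.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, reward, s_next, alpha=0.1, gamma=0.9, rng=None):
    """One double Q-learning step: update one table using the other as evaluator."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        best = int(np.argmax(Q1[s_next]))
        Q1[s, a] += alpha * (reward + gamma * Q2[s_next, best] - Q1[s, a])
    else:
        best = int(np.argmax(Q2[s_next]))
        Q2[s, a] += alpha * (reward + gamma * Q1[s_next, best] - Q2[s, a])

# after training, actions are picked greedily from the sum of the two tables:
# best_action = int(np.argmax((Q1 + Q2)[state]))
```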

For comparison, the results of FNN (False Nearest Neighbors) for embedding dimension estimation and MI (Mutual Information) for time delay estimation are provided.

Table 8. Reinforcement estimation results for sunspots

(N_test, N_train) | Method   | RMSE test | Std. | SMAPE  | (dimension, delay)
(62, 64)          | Reinf.   | 36.65     | 0.11 | 55.39% | (5, 1)
(62, 64)          | FNN + MI | 40.40     | 0.12 | 55.76% | (2, 1)
(67, 136)         | Reinf.   | 28.79     | 0.14 | 23.64% | (4, 1)
(67, 136)         | FNN + MI | 29.19     | 0.13 | 23.85% | (3, 1)
(64, 200)         | Reinf.   | 22.13     | 0.17 | 25.78% | (4, 1)
(64, 200)         | FNN + MI | 22.20     | 0.17 | 26.44% | (3, 1)

Overall the prediction quality of both variants is much better than the Prophet predictions. For each cross-validation section, the results of reinforcement estimation are better than FNN + MI results.

6.2 Experiment 2. CORN prices dataset

Before any prediction methods are applied, two peaks at the beginning of the dataset, which can be observed on figure 4, are replaced with half of the sum of the previous and next values. All predictions are made for inflation-adjusted prices.

6.2.1 Prophet predictions

Prophet predictions were performed on the three cross-validation datasets. The quality of the predictions this time is better than in the sunspots case. The results are shown in table 9. The growth of the RMSE with the growth of the number of observations in the train dataset signals that a simpler Prophet model has better performance in this case.

Table 9. Prophet on the CORN dataset

Split | Number of observations | Dataset type | RMSE  | SMAPE
1     | 314                    | test         | 5.74  | 19.55%
1     | 316                    | train        | 0.30  | 0.85%
2     | 314                    | test         | 9.22  | 36.41%
2     | 630                    | train        | 0.49  | 1.41%
3     | 314                    | test         | 11.41 | 48.88%
3     | 944                    | train        | 0.46  | 1.48%

On figure 10, the prediction results are shown. The Prophet model automatically detected day-of-week and monthly seasonality, which could make sense for the corn prices. The seasonality is shown in figure 11.

Figure 10. Prophet results CORN

Figure 11. Seasonality in CORN dataset

6.2.2 Reinforcement learning dimension delay estimation

The LSTM neural network with 50 neurons, 300 epochs, Adam optimizer, and early stopping was used. This configuration was chosen because it showed the best stability and quality results on the first part of the dataset.

The dataset was divided into two parts, and the first part was used for the reinforcement learning estimation. For the reinforcement learning process, different reward functions were analyzed. The reward column in the following tables denotes the trading reward on the test sample.

Inverse RMSE reward

In table 11, the inverse RMSE was taken as the reward. This reward function could be used to get the most accurate results possible in terms of RMSE; however, it does not, in general, say anything about the results of the trading process. The LSTM was trained 10 times, so the fourth and sixth columns of the table represent the obtained standard deviations of the RMSE and the reward, respectively.

Table 11. Reinforcement estimation results for CORN prices (inverse RMSE reward)

(N_test, N_train) | Method   | RMSE test | Std. RMSE | reward | Std. reward | (dimension, delay)
(151, 152)        | Reinf.   | 1.33      | 0.17      | 0.21   | 0.12        | (2, 3)
(151, 152)        | FNN + MI | 0.94      | 0.33      | 0.62   | 0.46        | (1, 3)
(151, 309)        | Reinf.   | 0.84      | 0.19      | 0.43   | 0.32        | (2, 3)
(151, 309)        | FNN + MI | 0.86      | 0.28      | 0.47   | 0.52        | (3, 1)
(151, 469)        | Reinf.   | 0.52      | 0.09      | 0.81   | 0.28        | (3, 1)
(151, 469)        | FNN + MI | 0.54      | 0.13      | 0.72   | 0.31        | (4, 1)

For the first cross-validation section, the results in the table are better for the FNN + MI method. However, it seems that the reinforcement algorithm makes more stable predictions than the FNN + MI algorithm.

For the second and third parts, both the RMSE and the reward are better for the reinforcement method.

Figure 12. RMSE test on CORN dataset histogram with inverse RMSE reward

Overall, through the iterations of reinforcement learning, the reward and the RMSE on the test set improve, as shown in figures 12 and 13.

Figure 13. reward on CORN dataset histogram with inverse RMSE reward

Simple trading reward. In table 12, the result of using the reward from table 4 in the previous chapter is shown. The idea is that the goal of the learning process is to achieve the best reward in terms of monetary gain, so the accuracy is not as essential in this situation.

Table 12. Reinforcement estimation results for CORN prices (simple trading reward)

(N_test, N_train) | Method   | RMSE test | Std. RMSE | reward | Std. reward | (dimension, delay)
(151, 152)        | Reinf.   | 1.41      | 0.17      | 0.21   | 0.12        | (2, 3)
(151, 152)        | FNN + MI | 0.93      | 0.32      | 0.62   | 0.46        | (1, 3)
(151, 309)        | Reinf.   | 0.84      | 0.19      | 0.43   | 0.33        | (2, 3)
(151, 309)        | FNN + MI | 0.86      | 0.27      | 0.47   | 0.52        | (3, 1)
(151, 469)        | Reinf.   | 0.65      | 0.04      | 0.91   | 0.09        | (2, 3)
(151, 469)        | FNN + MI | 0.54      | 0.13      | 0.72   | 0.31        | (4, 1)

Figure 15. RMSE test on CORN dataset histogram with trading reward

This type of reward does not seem to work for this dataset. For the first two cross-validation sections, the selected pairs are the same as the previous time, so there is no particular improvement in the reward. Even though for the last part the algorithm achieves a better reward, figure 15 suggests that through the reinforcement iterations there is no improvement in the reward.

Trading with multiplication reward

In table 13, the results of using reward with trading and multiplication (table 5) are shown.

Table 13. Reinforcement estimation results for CORN prices (trading reward with multiplication)

(N_test, N_train) | Method   | RMSE test | Std. RMSE | reward | Std. reward | (dimension, delay)
(151, 152)        | Reinf.   | 1.16      | 0.07      | 0.93   | 0.06        | (2, 2)
(151, 152)        | FNN + MI | 0.94      | 0.32      | 0.26   | 0.04        | (1, 3)
(151, 309)        | Reinf.   | 1.21      | 0.04      | 0.58   | 0.02        | (2, 2)
(151, 309)        | FNN + MI | 0.86      | 0.28      | 0.57   | 0.01        | (3, 1)
(151, 469)        | Reinf.   | 0.64      | 0.01      | 0.89   | 0.02        | (2, 3)
(151, 469)        | FNN + MI | 0.54      | 0.13      | 0.80   | 0.01        | (4, 1)

According to the results in table 13, the reinforcement algorithm gets better results in terms of the reward for each cross-validation section. In terms of the RMSE on the test dataset, the FNN + MI method is more accurate.

Figure 16. reward on CORN dataset histogram with a trading reward with multiplication

Figure 17. RMSE test on CORN dataset histogram with a trading reward with multiplication

Chapter 7. Summary and conclusion

7.1 Summary

During the current research, the reinforcement learning technique was studied and applied to embedding dimension and time delay estimation for a benchmark nonlinear time series and a stock market prediction problem. Three different types of reward functions were applied, and their influence on the result was analyzed. A methodology of reinforcement learning application was proposed and applied.

7.2 Conclusion

The main advantage of the reinforcement algorithm is that it is very flexible with respect to the dataset properties and allows using those properties to benefit the predictions, for example, when it is required to train the model on datasets of very inconsistent sizes or to process data with regime behavior.

Moreover, it is possible to set a different goal to the learning process rather than always train for the best accuracy possible. Real-world problems rarely aim to predict the best accuracy possible. The majority of the prediction problems have some other goal. Using the Reinforcement algorithm, it is possible to shift the predictions closer to the desired value by using a custom reward function.

The main drawback of the reinforcement algorithm is that it is very computationally expensive. This problem is solvable, for example, in this research with Google Colab, but with the growth of the volume of data it may become crucial. In comparison with other methods like FNN, the reinforcement method requires a significant volume of computations. The idea that there ...

