Predicting stock market returns using text mining of online news headlines

The main advantages of unstructured text data and their use in politics and economics. A study of the connection between online news headlines and financial market activity. The use of linear regression models to predict log-returns.

Category: Finance, money and taxes
Type: course paper
Date added: 28.08.2016


National Research University - Higher School of Economics

Eskerhan Dzhantamirov,

Vladimir Pyrlik

Saint-Petersburg

Contents

Introduction

1. Literature review

2. Methodology

2.1 Data

2.1.1 Textual data

2.1.2 Market data

2.1.3 Feature selection

3. Modeling strategy

3.1 Description of the models

4. Results

Conclusion

Bibliography

Annexes

Abstract

Introduction

Nowadays, with the development of sophisticated and powerful computer systems, scholars can use not only prepared and well-structured data but also data that, at first glance, contain no valuable information or do not allow any to be retrieved.

For example, researchers have recently shown increased interest in a non-trivial type of information: unstructured textual data. One important advantage of unstructured textual data is its low cost. Data are usually collected by specialists for a specific goal, and such surveys, whether questionnaires or interviews, are not cheap. With the development of the internet and the spread of mobile phones, however, enormous amounts of data have become available, with applications in a large variety of fields; one interesting application is sentiment analysis. For instance, Junqué de Fortuny, De Smedt, Martens, & Daelemans (2012) reviewed 68,000 online news articles concerning media coverage of, and votes for, different political parties. Via opinion mining and detection of the sentiment of each article, they visualized the changes in the tone of articles published in 2011 in Flemish newspapers during the political crisis in Belgium, in order to analyze differences in reporting at key political moments throughout the year.

However, unstructured textual data can be used not only in politics but also in economics and finance. Wang and co-workers (Wang, Huang, & Wang, 2013) used the annual reports of three major Chinese companies and three US companies to predict and measure the change in the share value of these companies.

The authors made the novel assumption that the textual part of a financial report contains no less valuable information than its numeric part. This seems to be a reliable and innovative approach to capturing the dominant mood of financial market players, especially considering the results of (Bollen, Mao, & Zeng, 2011) and (Lampenius & Zickar, 2005), which provide solid evidence that financial decisions are driven to a considerable extent by emotion and mood, which in turn leads to predictable changes in the value of a company. The particular focus of this research is explained not so much by the absence of papers on the Russian market as by the fact that works making predictions based on headlines are very few worldwide.

The aim of our work is to broaden current knowledge about the connection between online news headlines and the financial market. Specifically, we hope to find empirical evidence that news headlines released on the Russian news portal RBC influence the Russian MICEX index and, if such influence is found, to estimate it. This hypothesis is motivated, first, by the lack of studies based on short texts (headlines in particular). Second, in most cases scholars have conducted their research on the global market (Geva & Zahavi, 2014; Harris, 1999; Junqué de Fortuny, De Smedt, Martens, & Daelemans, 2014) and on global market indices (e.g. S&P 500, NASDAQ, DJIA, NYSE, Nikkei 225), while local markets (such as the Russian one) are usually left unattended. This seems to be one of the serious shortcomings of current studies in text mining and market prediction, mainly because local financial markets often belong to developing economies and have great potential for growth (according to the Moscow Exchange website, http://moex.com/ru/Report/2012/#page_5_1, the total volume of trading in all Russian markets amounted to 369.7 trillion rubles).

This paper is divided into five sections. The first section gives a brief overview of articles that describe and use unstructured data and provides historical background. The second section reviews the current literature in this field. The third section examines the data and the proposed methodology. The fourth section looks at the modeling strategy, and the fifth section explains and discusses the experimental and evaluation results.

1. Literature review

Most developed countries are, to varying degrees, driven by the forces of the market economy. Financial markets are therefore considered the heart of any market economy: they accelerate the turnover of capital, stimulate the growth of trade and industry, speed up technological progress and create additional profit. Consequently, analyzing financial markets and predicting the changes taking place in them are important problems not only for agents directly employed in this or related fields but also for the state as a whole.

Enric Junqué de Fortuny and his colleagues have suggested a solution to this problem (Junqué de Fortuny et al., 2014). Using a corpus of over 671,751 articles that comprises all articles published in the online versions of all major Flemish newspapers in 2007, they tried to predict stock price movements more accurately by including indicators of irrationality alongside the traditional trading model features. The authors tested several models (56 models on 7 different lags, using bag-of-words, sentiment indicator and technical indicator approaches).

When assessing model performance they faced a problem: different models showed different usefulness across the four evaluation metrics used in the study (accuracy, area under the curve (AUC), return and Sharpe ratio). However, it was possible to find a combination of features and lags (full-text information and 4-minute lags) for a bag-of-words model that provides the most stable and reasonable results (according to its superior AUC and average Sharpe ratio).

Even though the authors found some evidence that the model performs better than random, they argue that models should be verified using several metrics and similar techniques, because it is often very difficult to capture this slight effect using aggregate measures (e.g. accuracy or statistical tests). One possible explanation for such heterogeneous results is the substantial amount of noise in textual data; since it is rather difficult to remove all of it, the authors caution researchers against using a single measure to validate trading models.

Another recent work worth attention in this field is the paper by Qing Li and his colleagues (Li et al., 2014). Even though this article might seem to differ little from the previous one (in both, the authors try to predict stock prices based on news), it takes a much more complex approach:

1) The authors assume that the influence of news articles on stocks has two sides:

a) emotion - investors are affected by the tone of the news (e.g. optimistic expectations about the end of a recession);

b) event - investors adjust their behavior in response to a certain event (e.g. the lower stock return of Apple after the news about the health problems of its CEO Steve Jobs);

2) As a result of this dual effect of the news, the authors used two different data sets:

a) a corpus that contains the discussion threads on the CSI 100 companies from January 1, 2011 to December 31, 2011 from two premier financial discussion boards in China, www.sina.com and www.eastmoney.com;

b) a corpus that contains 124,470 financial news articles released from January 1, 2011 to December 31, 2011 and related to the 100 companies listed in the China Securities Index (CSI 100), as the event data;

3) The focus is on domain-specific sentiment analysis of words rather than on general-domain opinion analysis. The novelty of this approach is that a word that is generally emotionless can count as typically emotional in the finance realm (e.g. the word "bull").

Using these sophisticated techniques, the researchers found convincing evidence that financial news influences the stock market. Firm-specific information has a particular effect on investor activity, while news article sentiment leads to changes in investors' decision making.

The studies highlighted above have one thing in common: they are all based on the assumption that the entire text of an article, report or publication contains important information for prediction. Recently, however, an alternative and slightly more complicated point of view has appeared.

It states that the most important information can be extracted not from the article itself but from its title. Khadjeh Nassirtoussi, Aghabozorgi, Ying Wah, & Ngo (2015) verified this fairly new hypothesis. We review this work in detail because it is one of the first in this field and because studies with such assumptions are extremely scarce. In this paper the authors attempted to predict the direction of movement (up or down) of a currency pair (the dollar against the euro) within one day (this is connected with the small size of the texts and with the fact that the effect of news often shows within one or two hours of its release).

This paper can be called rather innovative for several reasons:

1) The choice of news. The researchers consciously deviated from the traditional approach to choosing news for prediction: they did not use any type of text categorization apart from focusing only on financial breaking news (to avoid noise);

2) The size of the text. The use of such short texts has both benefits and drawbacks. On the one hand, there is less noise in such texts, as headlines are usually more meaningful and concise. On the other hand, the standard method of determining a word's significance by its repetition in the text cannot be used;

3) The prediction timeline. The authors implemented a system that predicts directional movements of the market 1 hour after the end of a 2-hour interval containing the news headlines released within it. This interval approach ensures that all breaking news, even items released slightly later, is taken into consideration, so no valuable information is lost.

However, the main contribution of this research to text mining and market prediction is the introduction of a novel multi-layer algorithm that includes heuristic-hypernym feature selection, sentiment integration and targeted feature reduction.

Thanks to these advantages, the authors achieved an impressive prediction accuracy of 83.3% in some cases. This can be considered a significant result since, as the authors mention, accuracy for binary decisions (for example, up and down) is in most cases in the range of 50% to 70%.

As mentioned above, the article (Khadjeh Nassirtoussi et al., 2015) is a good example of research using short texts, in this particular case news headlines. Even though the authors proposed a relatively novel approach to forecasting intraday directional movements, their research has some limitations and problems.

The first and most important limitation of this work is that dictionaries such as WordNet exist for only a very limited number of languages (Chinese, French, German, Hindi, Japanese, Brazilian Portuguese, Russian, Spanish, Swahili), and most of them are still at the development stage (for instance, the Russian one, http://wordnet.ru). Thus, it is impossible or at least rather problematic to replicate the study of Khadjeh Nassirtoussi, Aghabozorgi, Ying Wah, & Ngo for other languages at this stage, although the approach has great potential for development in the future.

2. Methodology

2.1 Data

Data for the research were collected from different sources, as two types of data are needed:

a) Textual data (headlines);

b) Financial data.

Since these two datasets differ significantly, different approaches are needed for their retrieval, processing and interpretation.

2.1.1 Textual data

For our research the textual data, namely headlines, were collected from the website of the news agency RBC (RosBusinessConsulting, http://www.rbc.ru), the first and only business channel in Russia. There are several significant reasons in favor of this particular agency:

1) Since this is a business channel, there is no need for overly scrupulous news filtering (we can be certain that there is no general news, for instance, in the finance-related news section);

2) It has a rather big news archive (RBC has almost ten times more headlines than LENTA.ru, the news source that we used in previous research) and a small interval between news releases.

The data were obtained by parsing, and only news from the economics and finance sections was selected; the data were also deliberately sorted and filtered to prevent the emergence of "noise" in the final stages of the study. Data retrieval was carried out with the help of the commercial information agency Integrum (http://www.integrum.ru), which specializes in the analysis of mass and social media and provides access to the news archives of different information portals, including RBC. In total, 502,000 headlines were obtained for the period from 16.11.2009 to 06.10.2015; the total number of unique words in them is 50,046.

Before further description and analysis, the data were pre-processed. The date and time of each news item were separated from the headline to make it easier to synchronize them with the market data at later stages of the research.

At the next step, with the help of the morphological analyzer pymorphy2 for the Python programming language (https://pymorphy2.readthedocs.org/en/latest/), extra characters and stop words were removed. The list of stop words was gathered manually from different sources: the list of common stop words in various languages for Python (https://pypi.python.org/pypi/stop-words/2014.5.26), the list of stop words from the Ranks NL website (http://www.ranks.nl/stopwords/russian) and the “tm” package for R. Words were then normalized (for instance, verbs were converted to the infinitive, nouns to the nominative case, and so on for all parts of speech), and headlines were split into individual words and converted to lowercase. As a result, the data took the form of a structured table as follows:

Table 1 - Headlines after primary processing

No. | Date | Time | Headlines
1 | 16.11.2009 | 15:37 | цб рф ожидать увеличение корпоративный кредитный портфель банк рф i квартал
2 | 16.11.2009 | 15:39 | чистый убыток возродить банкротство general motors июль сентябрь составить миллион доллар
3 | 16.11.2009 | 15:45 | метр маргеловый саммит атэс принять решение присоединиться к антикризисный мера разрабатывать g
4 | 16.11.2009 | 15:46 | мурманский морской торговый порт месяц увеличить переработка груз миллион
5 | 16.11.2009 | 15:46 | цб рф объесть заключённый сделка итог аукцион прямой репо срок превысить миллиард миллион рубль
6 | 16.11.2009 | 15:48 | российский фондовый индекс вырасти
7 | 16.11.2009 | 15:49 | признать центр телек нарушить закон защита конкуренция
8 | 16.11.2009 | 15:52 | россия евросоюз подписать меморандум механизм ранний предупреждение сфера энергетика
9 | 16.11.2009 | 15:52 | афганистан сформировать новое подразделение борьба коррупция
10 | 16.11.2009 | 15:54 | улюкай банка втб потребоваться новый допэмиссия акция дополнение
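The pre-processing steps described above (removing extra characters and stop words, lowercasing, tokenizing and normalizing) can be sketched as follows. This is a minimal illustration and not the authors' script: the stop-word set is a tiny hypothetical sample, and the `normalize` argument stands in for a morphological analyzer. With pymorphy2 installed, one could pass `lambda w: morph.parse(w)[0].normal_form`, where `morph = pymorphy2.MorphAnalyzer()`.

```python
import re

# Hypothetical, deliberately tiny stop-word sample; the paper merged several published lists.
STOP_WORDS = {"в", "на", "и", "по", "за", "с"}

# Keep runs of Latin or Cyrillic letters; digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[a-zа-яё]+")


def preprocess(headline, normalize=lambda word: word, stop_words=STOP_WORDS):
    """Lowercase a headline, split it into word tokens, drop stop words, normalize the rest."""
    tokens = TOKEN_RE.findall(headline.lower())
    return [normalize(tok) for tok in tokens if tok not in stop_words]


if __name__ == "__main__":
    print(preprocess("Российские фондовые индексы выросли на 2%"))
```

The default `normalize` leaves words unchanged, so the sketch runs without any third-party package; swapping in a real analyzer changes only that one argument.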

At the next step after the initial processing, certain headlines were deleted because they were editor's notes or carried little information but recurred so often that they could bias our results, as is the case with the last three lines of Table 2:

Table 2 - Example of headlines lacking important information

No. | Date | Time | Headlines
1 | 02.06.2014 | 17:15 | ммвб сэлт средневзвешенный курс доллар закрытие торг инструмент usdrub tod мск рубль
2 | 02.06.2014 | 17:26 | энерго альянс обратиться арбитраж требование признать результат конкурс продажа пнк недействительный
3 | 02.06.2014 | 17:39 | правительство одобрить продление вступление программа госсофинансирование пенсия конец
4 | 02.06.2014 | 17:55 | начало москва получить миллиард рубль счёт земельный имущественный торг
5 | 02.06.2014 | 18:00 | результат торг adr gdr биржа мир
6 | 02.06.2014 | 18:07 | разбиться мурманский область вертолёт застраховать компания росгосстрах миллион рубль
7 | 02.06.2014 | 18:19 | фзв июнь начать приём заявление получение компенсация клиент украинский банк лишить лицензия
8 | 02.06.2014 | 18:25 | фондовый индекс мир гринвич
9 | 02.06.2014 | 19:55 | фондовый индекс мир гринвич
10 | 02.06.2014 | 20:25 | фондовый индекс мир гринвич

After these headlines were deleted, their total number decreased from 502,000 to 470,000, and a visual evaluation of the data became possible: the minimum number of words in a headline is 1, the maximum is 25, and the average headline length is about 9 words.

Figure 2 - histogram of probability densities

The histogram shows the distribution of headlines by the number of words in them. As can be seen, the distribution is close to normal, although it has some asymmetry toward the left side.

Figure 3 - wordcloud (words with the frequency >5000)

Figure 3 shows the most common words in headlines: the larger and bolder a word is drawn, the more frequent it is among the headlines. Note that this picture shows the most popular words before any pre-processing of the text, since even stop words and numbers have not been removed from the data. For example, the most popular word, “РФ” (the abbreviation for the Russian Federation), is found in 16.4% of all headlines, and the second most frequent word, “Рубль” (the Russian ruble), in 23.5% of cases.

However, these words are not the final terms; to determine them, a text clustering algorithm is applied in a later part of the study. Finally, the next figure shows the 50 most common words along with their frequencies.

Figure 4 - The 50 most popular words with their frequencies

2.1.2 Market data

As the numerical data, historical values of the MICEX index were collected from the investment-analytical resource Finam.ru (http://www.finam.ru). A total of 1483 daily values of the MICEX index for the period from 16 November 2009 to 16 October 2015 were extracted.

The primary analysis showed that the minimum closing value of the MICEX index was 1197, the maximum 1860 and the average 1495.

Figure 5 - Distribution of the daily index values

For the purposes of further research we also introduce a new variable, the log-return: r (for daily data) and r2 (for hourly data), which is the difference of the logarithms of consecutive MICEX index values.
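As a small illustration of this definition, the log-return at time t is r_t = ln(P_t) - ln(P_{t-1}), where P_t is the index value. A minimal sketch, with the function name `log_returns` chosen here for illustration:

```python
import math


def log_returns(prices):
    """Return r_t = ln(P_t) - ln(P_{t-1}) for each pair of consecutive prices."""
    return [math.log(curr) - math.log(prev) for prev, curr in zip(prices, prices[1:])]


if __name__ == "__main__":
    closes = [1341.84, 1349.47, 1345.64]  # first hourly closes shown in Table 3
    print(log_returns(closes))
```

A series of n prices yields n - 1 returns, and the returns sum to the log of the ratio of the last price to the first.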

Figure 6 - Daily dynamics of the MICEX index

Figure 7 - Daily dynamics of the log-returns

Figure 8 - The empirical distribution of daily log-returns

Figure 5 shows the distribution of the index values at closing time. At first glance it is approximately normal, as in the case of the text data, but the graph in fact has a bimodal shape. The explanation is that the index is a non-stationary time series that passed through several different regimes during this period.

Figure 7 illustrates the daily dynamics of the log-returns of the MICEX index. Judging by Figure 7, this is a stationary time series, because there is a long-term steady-state level. To obtain convincing evidence, we also conducted formal stationarity tests:

1) According to the KPSS test for level stationarity, the null hypothesis that the time series is stationary cannot be rejected (p-value of at least 0.1);

2) According to the augmented Dickey-Fuller test, the time series is stationary: the unit-root hypothesis is rejected at the 0.01 significance level.
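For reference, the two tests have opposite null hypotheses, which is why they complement each other. In their standard textbook forms (the exact specifications used for the results above, such as the lag order p, are not stated in the text):

```latex
% Augmented Dickey-Fuller regression with p lagged differences
\Delta r_t = \alpha + \gamma\, r_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta r_{t-i} + \varepsilon_t,
\qquad H_0:\ \gamma = 0 \ \text{(unit root)}, \quad H_1:\ \gamma < 0 .

% KPSS level-stationarity model: a random-walk component plus stationary noise
r_t = \mu_t + u_t, \qquad \mu_t = \mu_{t-1} + v_t, \quad v_t \sim (0, \sigma_v^2),
\qquad H_0:\ \sigma_v^2 = 0 \ \text{(level stationarity)} .
```

Rejecting the ADF null while failing to reject the KPSS null is the combination that points to a stationary series.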

Figure 8 shows the empirical distribution of daily log-returns. The distribution has a slightly heavier left tail, which is due to the sharp falls typical of the market during periods of decline.

Figure 9 - Distribution of hourly values of the MICEX index

Since the Moscow Exchange works only on weekdays, no index observations are available for weekends and holidays, which resulted in a desynchronization of the dates in the market and text data. To resolve this problem, headlines published on weekends or public holidays were removed from the data, as there are no market data to synchronize them with.

Then hourly historical values of the MICEX index were obtained from the same investment-analytical resource. The total number of observations was 15,777 for the same period of time.

The primary analysis showed that the minimum closing value of the MICEX index was 1193, the maximum 1863 and the average 1488.

Figure 10 - The empirical distribution of hourly log-returns

Figure 9 shows the distribution of the hourly closing values of the index; as can be seen, the data are approximately normally distributed.

In the hourly data, desynchronization occurred not only with the dates but also with the hours of the MICEX index values. Since the Moscow Exchange works only 9 hours a day, from 11:00 to 19:00, and only on weekdays, all news published on weekends and holidays was discarded, and news released after 19:00 was attributed to the index value at 11:00 of the next day.
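The synchronization rule described above can be sketched as a small function. Only two rules come from the text: weekend news is discarded, and news released after 19:00 is matched with the 11:00 observation of the next day. Everything else is an assumption made for this illustration: evening news on a Friday is rolled over to Monday, news before 11:00 is matched with the opening observation, news during trading hours is rounded up to the next full hour, and public holidays are ignored. The function names are hypothetical.

```python
from datetime import datetime, timedelta

OPEN_HOUR, CLOSE_HOUR = 11, 19  # trading hours stated in the text


def next_weekday(day):
    """Advance a date until it falls on Monday..Friday."""
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return day


def trading_slot(ts):
    """Map a news timestamp to the hourly index observation it is matched with.

    Returns None for weekend news, which the paper discards.
    Public holidays are not handled in this sketch.
    """
    if ts.weekday() >= 5:
        return None
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    close = day.replace(hour=CLOSE_HOUR)
    if ts > close:
        # After the close: first observation of the next trading day.
        # Skipping over the weekend is an assumption of this sketch.
        return next_weekday(day + timedelta(days=1)).replace(hour=OPEN_HOUR)
    if ts.hour < OPEN_HOUR:
        # Before the open: assumption, matched with the opening observation.
        return day.replace(hour=OPEN_HOUR)
    # During trading: assumption, rounded up to the next full hour.
    floored = ts.replace(minute=0, second=0, microsecond=0)
    return floored if floored == ts else floored + timedelta(hours=1)


if __name__ == "__main__":
    print(trading_slot(datetime(2014, 6, 2, 20, 25)))
```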

2.1.3 Feature selection

At this stage we describe the pre-processing of the text, during which specific operations turn textual information into a numeric representation. The first part of pre-processing is feature extraction, during which we eliminated words that were too general or carried no valuable information (e.g. stop words). The next step is feature selection, whose main goal is to lower the dimensionality of the feature set. For this purpose we applied two filters: 1) we created a corpus that contained only those words that appeared in the whole corpus more than 5 times; 2) from this new corpus we chose the 200 most frequent words, which together appeared 1,713,858 times, almost 40% of all word occurrences.
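The two frequency filters can be sketched with the standard library as follows. The function name and the toy input are illustrative assumptions; the thresholds (more than 5 occurrences, top 200 words) are the ones given in the text.

```python
from collections import Counter


def select_vocabulary(tokenized_headlines, min_count=5, top_n=200):
    """Keep words seen more than `min_count` times, then the `top_n` most frequent of them.

    Returns the selected words and the share of all word occurrences they cover.
    """
    counts = Counter(word for headline in tokenized_headlines for word in headline)
    frequent = {word: c for word, c in counts.items() if c > min_count}
    top = Counter(frequent).most_common(top_n)
    total = sum(counts.values())
    coverage = sum(c for _, c in top) / total if total else 0.0
    return [word for word, _ in top], coverage


if __name__ == "__main__":
    toy = [["рубль", "рф"]] * 6 + [["банк"]] * 2
    print(select_vocabulary(toy, min_count=5, top_n=2))
```

The returned coverage corresponds to the "almost 40% of all word occurrences" figure reported for the real corpus.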

The next phase of the study required the extraction of topics. By a topic we mean a more general word whose meaning covers several other words. Figure 3 shows the most common words in headlines: the larger and bolder the word, the more common it is. Among the 200 selected words there were certain “words” that had no meaning or had been reduced to an incorrect form during text normalization, so they were deleted (for example, “i” or “к”). To the 180 words that were left we assigned topics (12 in total), each of which is a vector of words related to similar content. Examples of topics can be seen in Annex 1.

Even though we assigned topics manually, there are methods for automating this process, for example LDA topic modelling, an algorithm for discovering the abstract "topics" that occur in a collection of documents. This technique has been used successfully in works similar to ours and showed good results (Jin et al., 2013); its main disadvantage, however, is its demand for computing power. We tried to apply it to our task, which was specific because of the extremely short texts (sometimes shorter than tweets). Even so, we were able to obtain reliable results on a short period of time using “twitter lda” (topic modeling for tweets). For the long period it was not so easy: the only way to obtain reliable topics was iterative repetition until the predictive models stopped improving (one iteration took 8 to 10 hours on a computer with 4 GB of RAM and a 2.3 GHz CPU), which proved infeasible given our lack of computing power.
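A manual word-to-topic assignment of this kind can be turned into numeric features by counting, for each hour, how many headline words fall into each topic. The sketch below uses two hypothetical topics with made-up member words (the paper's 12 topics are listed in its Annex 1), and it assumes that the column N in the resulting table is the number of headlines in the hour, which the text does not state explicitly.

```python
# Hypothetical topics and member words, for illustration only.
TOPICS = {
    "Topic.1": {"банк", "кредит", "цб"},
    "Topic.2": {"рубль", "доллар", "курс"},
}


def topic_counts(tokens, topics=TOPICS):
    """Count how many tokens of one headline belong to each topic."""
    return {name: sum(tok in words for tok in tokens) for name, words in topics.items()}


def hourly_features(headlines_by_hour, topics=TOPICS):
    """Sum topic counts over all headlines matched with the same hour.

    N is taken to be the number of headlines in that hour (an assumption).
    """
    rows = {}
    for hour, headlines in headlines_by_hour.items():
        row = {"N": len(headlines), **{name: 0 for name in topics}}
        for tokens in headlines:
            for name, count in topic_counts(tokens, topics).items():
                row[name] += count
        rows[hour] = row
    return rows


if __name__ == "__main__":
    sample = {"2009-11-16 12:00": [["цб", "рф", "банк"], ["рубль", "вырасти"]]}
    print(hourly_features(sample))
```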

3. Modeling strategy

Thus, we obtained a new dataset (see the example in Table 3) by merging the two datasets (textual and market data). The new dataset records which features appeared in the textual data for every hour of trading on the market. Our main hypothesis is that including textual data improves the predictive power of a model compared with models that do not use textual data. To test this hypothesis, our corpus is divided into three parts in a ratio of 60% to 20% to 20%. The first 60% of the corpus is the training sample on which we train the models; the next 20% is the validation sample on which different models are compared to reveal the best one; and the last 20% of the corpus is used to determine the predictive power of the best model.

Table 3 - Sample of the new data set with textual topic features

No. | DateTime | CLOSE | N | Topic.1 | Topic.2 | Topic.3 | Topic.4
1 | 2009-11-16 12:00:00 | 1341.84 | 16 | 3 | 10 | 10 | 1
2 | 2009-11-16 13:00:00 | 1349.47 | 23 | 7 | 15 | 17 | 0
3 | 2009-11-16 14:00:00 | 1345.64 | 23 | 7 | 10 | 12 | 3
4 | 2009-11-16 15:00:00 | 1347.31 | 21 | 11 | 12 | 9 | 6
5 | 2009-11-16 16:00:00 | 1351.11 | 24 | 15 | 6 | 11 | 11
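The 60% / 20% / 20% division described before Table 3 can be sketched as a split of the time-ordered rows without shuffling, so that each later sample lies strictly after the earlier ones. Treating the split as chronological is an assumption based on the wording "first", "next" and "last"; the function name is hypothetical.

```python
def chronological_split(rows, train=0.6, validation=0.2):
    """Split time-ordered rows into train / validation / test parts without shuffling."""
    n = len(rows)
    train_end = int(n * train)
    val_end = train_end + int(n * validation)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


if __name__ == "__main__":
    train_part, val_part, test_part = chronological_split(list(range(10)))
    print(len(train_part), len(val_part), len(test_part))
```

Keeping the order intact avoids training on observations that come after those used for evaluation.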

3.1 Description of the models

Three types of models are used in this work: a linear regression model, SVM (support vector machine regression) and random forest. Each model is also estimated with two sets of factors: 10 lags only, and 10 lags together with the manually determined text features.

1) A general linear model with stepwise model selection by AIC (Akaike information criterion) is used as the baseline model and predicts the log-returns of MICEX. It is built semi-automatically by successively adding and removing variables based solely on the AIC, and it was chosen as the baseline because of its generality and the simplicity of comparing it with other models. The first linear model includes only lags, whereas the second includes both lags and text features;

2) The support vector machine can be applied not only to classification problems but also to regression; one of the reasons in favor of this model is the relatively small time needed to train it on large datasets. Another advantage of the support vector machine is that it avoids the difficulties of using linear functions in a high-dimensional feature space, and the optimization problem is transformed into a dual convex quadratic program. In the regression case the loss function penalizes only errors greater than a certain threshold ε. Such loss functions usually lead to a sparse representation of the decision rule, giving significant algorithmic and representational advantages. As with the general linear model, there are two SVM models: the first includes only lags, and text features are added in the second;

3) Random forest is a somewhat more sophisticated and versatile machine learning method capable of performing both regression and classification tasks. Like SVM, random forest can handle large datasets and, more importantly, it has built-in regularization, which reduces the risk of overfitting. However, the ability to discover more complex dependencies comes at the cost of longer fitting time and greater computational complexity. As with the two models above, we provide two different sets of features to the random forest model.
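All three models share the same inputs: lagged log-returns, optionally extended with text features. A minimal sketch of how such a design matrix could be built is given below. The function name is hypothetical, and the choice to attach the text features observed at time t-1 to the row that predicts the return at time t is an assumption of this sketch, since the exact timing is not specified in the text.

```python
def lagged_design(returns, n_lags=10, extra_features=None):
    """Build (X, y) where each row of X holds the previous `n_lags` returns.

    `extra_features`, if given, must contain one feature list per return; the
    features observed at time t-1 are appended to the row that predicts r_t.
    """
    X, y = [], []
    for t in range(n_lags, len(returns)):
        row = [returns[t - k] for k in range(1, n_lags + 1)]  # r_{t-1}, ..., r_{t-n_lags}
        if extra_features is not None:
            row += list(extra_features[t - 1])
        X.append(row)
        y.append(returns[t])
    return X, y


if __name__ == "__main__":
    X, y = lagged_design([1, 2, 3, 4, 5], n_lags=2)
    print(X, y)
```

The models without text features simply omit `extra_features`; the resulting X and y can then be passed to any regression routine.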

We use the mean absolute percentage prediction error (MAPPE, referred to below as MAPE) both for training the models and for forecasting, as a measure of prediction accuracy. The choice of MAPE is partly explained by the fact that we compare the results not only of different models within one period but also of one model across different time periods, and a percentage error is one of the few measures that allow such comparisons. Moreover, Spyros Makridakis in his article (Jin et al., 2013) claims MAPE to be the standard metric with the best characteristics among the various accuracy criteria. In addition, it can be used both for evaluating large-scale empirical studies and for presenting specific results.

Finally, MAPE is calculated as:

\[ \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| , \]

where A_t is the actual value and F_t is the forecast value.

The difference between A_t and F_t is divided by the actual value A_t. The absolute value of this ratio is summed over every forecast point in time and divided by the number of fitted points n. Multiplying by 100 turns it into a percentage error. The smaller the MAPE, the closer the predicted values are to the actual values; that is, a smaller value indicates a better predictor.
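The calculation described above can be written as a short function. This is a sketch with a hypothetical function name; raising an error when an actual value is zero is a choice made here, since the text does not say how such points were treated.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


if __name__ == "__main__":
    print(mape([100.0, 200.0], [110.0, 180.0]))
```

In the example both forecasts miss by 10% of the actual value, so the error is 10%.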

4. Results

The results show that including textual data as one of the forecasting factors did not improve the results of the general linear model; an improvement is observed only for certain specifications of the SVM model.

More specifically, as can be seen from Table 4, a decrease in MAPE is observed only for the second test sample and only when 7 to 10 lags are included, with decreases of 0.24, 0.23, 0.15 and 0.17 percentage points, respectively.

Full evaluation results for the different regressions are gathered in Annex A2.

Table 4 - Results of SVM evaluation for the models with 7 to 10 lags (MAPE, %)

Method | N of lags | Without textual features: Train sample | Test sample 1 | Test sample 2 | Including textual features: Train sample | Test sample 1 | Test sample 2
svmLinear | 7 | 28.85 | 16.88 | 32.60 | 28.83 | 16.96 | 32.36
svmLinear | 8 | 28.85 | 16.88 | 32.60 | 28.83 | 16.95 | 32.37
svmLinear | 9 | 28.84 | 16.92 | 32.60 | 28.81 | 17.00 | 32.45
svmLinear | 10 | 28.84 | 16.94 | 32.61 | 28.81 | 17.01 | 32.44

Conclusion

The main aim of this work was to find evidence in favor of the hypothesis stated by Khadjeh Nassirtoussi et al. (2015): that headlines themselves contain a lot of valuable information and that including them in forecasting models helps to improve the models' predictive power. The results confirm this hypothesis only to some extent, since a small decrease in MAPE was observed only for certain specifications of the SVM model. Nevertheless, there are several possible ways to improve the results and the research itself:

1) The weakest part of the research is feature selection. All the features were extracted manually on a rule-based basis. There are different automated and semi-automated techniques, such as LDA topic modeling, that extract features on a formal mathematical basis and show good results in numerous studies. It would be fruitful to overcome the current barriers to their implementation and apply them to data for the Russian market;

2) Another possible way to enhance the results is to choose additional models that allow careful tuning of their parameters;

3) For the sake of reliability, it would be useful to evaluate model performance with several different metrics rather than MAPE alone, which would make the conclusions more robust.

Overall, the research has partially reached its goals and has shown that future work in this area has potential, given certain adjustments to the proposed technique.
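As an illustration of the third point, the sketch below computes three common error metrics side by side. It is a minimal example under stated assumptions: the function names are hypothetical, the sample values are made up for demonstration and are not results of this study, and skipping zero actual values in MAPE is one possible convention rather than the one necessarily used here.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Observations with a zero actual value are skipped (an assumption made
    for this sketch), because the ratio is undefined for them.
    """
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]
    if not terms:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(terms) / len(terms)


# Made-up log-returns, for demonstration only.
actual = [0.010, -0.020, 0.005, 0.015]
predicted = [0.008, -0.015, 0.006, 0.012]

print(f"MAE  = {mae(actual, predicted):.5f}")
print(f"RMSE = {rmse(actual, predicted):.5f}")
print(f"MAPE = {mape(actual, predicted):.2f}%")
```

Because MAPE divides by the actual value, it can become very large when log-returns are close to zero, which is one reason to report it together with scale-based metrics such as MAE and RMSE.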

Bibliography

1. Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1-8. http://doi.org/10.1016/j.jocs.2010.12.007

2. Geva, T., & Zahavi, J. (2014). Empirical evaluation of an automated intraday stock recommendation system incorporating both market data and textual news. Decision Support Systems, 57, 212-223. http://doi.org/10.1016/j.dss.2013.09.013

3. Harris, L. E. (1999). The Information-Content of the Limit Order Book: Evidence from NYSE Specialist Actions.

4. Jin, F., Self, N., Saraf, P., Butler, P., Wang, W., & Ramakrishnan, N. (2013). Forex-Foreteller: Currency Trend Modeling using News Articles. KDD Demo, 1470-1473. http://doi.org/10.1145/2487575.2487710

5. Junqué de Fortuny, E., De Smedt, T., Martens, D., & Daelemans, W. (2012). Media coverage in times of political crisis: A text mining approach. Expert Systems with Applications, 39(14), 11616-11622. http://doi.org/10.1016/j.eswa.2012.04.013

6. Junqué de Fortuny, E., De Smedt, T., Martens, D., & Daelemans, W. (2014). Evaluating and understanding text-based stock price prediction models. Information Processing & Management, 50(2), 426-441. http://doi.org/10.1016/j.ipm.2013.12.002

7. Khadjeh Nassirtoussi, A., Aghabozorgi, S., Ying Wah, T., & Ngo, D. C. L. (2015). Text mining of news-headlines for FOREX market prediction: A Multi-layer Dimension Reduction Algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306-324. http://doi.org/10.1016/j.eswa.2014.08.004

8. Lampenius, N., & Zickar, M. (2005). Development and validation of a model and measure of financial risk-taking. The Journal of Behavioral Finance, 6(3), 129-143. http://doi.org/10.1207/s15427579jpfm0603

9. Li, Q., Wang, T., Gong, Q., Chen, Y., Lin, Z., & Song, S. (2014). Media-aware quantitative trading based on public Web information. Decision Support Systems, 61, 93-105. http://doi.org/10.1016/j.dss.2014.01.013

10. Wang, B., Huang, H., & Wang, X. (2013). A support vector machine based MSM model for financial short-term volatility forecasting. Neural Computing and Applications, 22(1), 21-28. http://doi.org/10.1007/s00521-011-0742-z

Annex A1

For example, the variable topic.russia counts how many of the total number of words are related to the Russia topic.

Table A1 - Topics and their composition

| Topic | Words |
|---|---|
| topic.russia | ('россия', 'москва', 'рф', 'область', 'российский', 'страна', 'московский') |
| topic.currency | ('доллар', 'евро', 'курс', 'валютный') |
| topic.volume | ('миллиард', 'миллион', 'цена', 'результат', 'итог', 'уровень', 'сумма', 'триллион', 'размер', 'крупный', 'объем') |
| topic.banks | ('цб', 'банка', 'банк', 'кредит', 'кредитный', 'ставка') |
| topic.world | ('сша', 'украина', 'европа', 'мир', 'мид', 'отношения', 'япония', 'сирия', 'китай') |
| topic.market | ('торг', 'рынок', 'индекс', 'акция', 'дивиденд', 'фондовый', 'аукцион', 'биржа', 'рбк', 'мсфо', 'репо', 'облигация', 'пункт', 'рсб', 'фьючерс', 'торговый', 'финансовый', 'своп', 'размещение') |
| topic.movement | ('вырасти', 'составить', 'снизиться', 'рост', 'увеличить', 'начало', 'завершить', 'превысить', 'снижение', 'увеличиться', 'повысить', 'достигнуть', 'повышение', 'изменение', 'сократить', 'упасть', 'выше') |
| topic.production | ('нефть', 'золото', 'газпром', 'нефтяной', 'газа', 'строительство', 'добыча', 'brent', 'самолет', 'поставка', 'автомобиль', 'роснефть') |
| topic.company | ('чистый', 'прибыль', 'компания', 'совет', 'директор', 'организация', 'продажа', 'данные', 'производство', 'проект', 'акционер', 'сделка', 'получить', 'задолженность', 'группа', 'рейтинг', 'работа', 'убыток', 'открыться', 'покупка', 'разместить') |
| topic.government | ('дело', 'путин', 'правительство', 'глава', 'госдума', 'медведев', 'суд', 'президент', 'закон', 'власть', 'подписать', 'срок', 'официальный', 'решение', 'законопроект', 'бюджет', 'одобрить', 'задержать', 'соглашение', 'развитие', 'минфин', 'министр', 'утвердить', 'уголовный', 'мэр', 'обеспечить', 'военный', 'иск', 'центр', 'программа', 'денежный') |
| topic.period | ('квартал', 'месяц', 'планировать', 'принять', 'полугодие', 'неделя', 'рассмотреть', 'число', 'новое', 'годовой', 'впервые', 'конец', 'первое') |
| topic.other | ('объесть', 'погибнуть', 'операция', 'средство', 'объявить', 'провести', 'связь', 'вопрос', 'проведение', 'дать', 'условие', 'система', 'пожар', 'направить', 'взрыв', 'начать', 'произойти', 'серия', 'пострадать', 'общий', 'отметка', 'состояние', 'считать', 'участие') |
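A rule-based feature of this kind can be sketched as a simple dictionary lookup. The code below is an illustration, not the implementation used in the study: the function name `topic_features` and the `normalize` flag are hypothetical, only two of the twelve dictionaries are reproduced, and the input is assumed to be a headline that has already been lowercased and lemmatized.

```python
from collections import Counter

# Two of the twelve topic dictionaries from Table A1 (lowercase lemmas).
TOPICS = {
    "topic.russia": {"россия", "москва", "рф", "область",
                     "российский", "страна", "московский"},
    "topic.currency": {"доллар", "евро", "курс", "валютный"},
}


def topic_features(lemmas, topics=TOPICS, normalize=True):
    """Count topic words in a list of lemmas.

    With normalize=True the count is divided by the total number of words,
    giving a share; with normalize=False the raw count is returned.
    Which of the two the study used is an assumption left open here.
    """
    counts = Counter()
    for lemma in lemmas:
        for name, words in topics.items():
            if lemma in words:
                counts[name] += 1

    total = len(lemmas)
    features = {}
    for name in topics:
        value = counts[name]
        if normalize:
            value = value / total if total else 0.0
        features[name] = value
    return features


# A made-up, already lemmatized headline used only as an example.
headline = ["курс", "доллар", "вырасти", "москва"]
print(topic_features(headline))
print(topic_features(headline, normalize=False))
```

Keeping the dictionaries as sets makes each lookup cheap, and adding the remaining ten topics only requires extending `TOPICS`.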

Annex A2

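Table A2 - MAPE for different specifications of the models

MAPE is reported here as a fraction rather than in percent. Judging by the values that coincide with Table 4, the column codes read as follows: the first digit is 0 for models without textual features and 1 for models including them; the second digit is 0 for the train sample, 1 for test sample 1 and 2 for test sample 2.

| No. | Method | N of lags | MAPE for 0.0 | MAPE for 0.1 | MAPE for 0.2 | MAPE for 1.0 | MAPE for 1.1 | MAPE for 1.2 |
|---|---|---|---|---|---|---|---|---|
| 1 | lmStepAIC | 1 | 0.2887 | 0.1684 | 0.3258 | 0.2890 | 0.1703 | 0.3274 |
| 2 | svmLinear | 1 | 0.2886 | 0.1686 | 0.3259 | 0.2884 | 0.1693 | 0.3273 |
| 3 | lmStepAIC | 2 | 0.2887 | 0.1684 | 0.3258 | 0.2890 | 0.1703 | 0.3274 |
| 4 | svmLinear | 2 | 0.2886 | 0.1685 | 0.3259 | 0.2884 | 0.1693 | 0.3272 |
| 5 | lmStepAIC | 3 | 0.2887 | 0.1684 | 0.3258 | 0.2890 | 0.1703 | 0.3274 |
| 6 | svmLinear | 3 | 0.2886 | 0.1685 | 0.3259 | 0.2884 | 0.1693 | 0.3272 |
| 7 | lmStepAIC | 4 | 0.2887 | 0.1686 | 0.3259 | 0.2890 | 0.1704 | 0.3273 |
| 8 | svmLinear | 4 | 0.2886 | 0.1685 | 0.3259 | 0.2884 | 0.1694 | 0.3271 |
| 9 | lmStepAIC | 5 | 0.2888 | 0.1688 | 0.3260 | 0.2891 | 0.1706 | 0.3273 |
| 10 | svmLinear | 5 | 0.2886 | 0.1685 | 0.3259 | 0.2884 | 0.1694 | 0.3271 |
| 11 | lmStepAIC | 6 | 0.2889 | 0.1690 | 0.3260 | 0.2891 | 0.1707 | 0.3274 |
| 12 | svmLinear | 6 | 0.2886 | 0.1686 | 0.3259 | 0.2884 | 0.1694 | 0.3272 |
| 13 | lmStepAIC | 7 | 0.2889 | 0.1690 | 0.3260 | 0.2891 | 0.1707 | 0.3274 |
| 14 | svmLinear | 7 | 0.2885 | 0.1688 | 0.3260 | 0.2883 | 0.1696 | 0.3236 |
| 15 | lmStepAIC | 8 | 0.2889 | 0.1690 | 0.3260 | 0.2891 | 0.1707 | 0.3274 |
| 16 | svmLinear | 8 | 0.2885 | 0.1688 | 0.3260 | 0.2883 | 0.1695 | 0.3237 |
| 17 | lmStepAIC | 9 | 0.2888 | 0.1700 | 0.3261 | 0.2889 | 0.1715 | 0.3277 |
| 18 | svmLinear | 9 | 0.2884 | 0.1692 | 0.3260 | 0.2881 | 0.1700 | 0.3245 |
| 19 | lmStepAIC | 10 | 0.2888 | 0.1700 | 0.3261 | 0.2889 | 0.1715 | 0.3277 |
| 20 | svmLinear | 10 | 0.2884 | 0.1694 | 0.3261 | 0.2881 | 0.1701 | 0.3244 |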

Abstract

With the rapid spread of the Internet around the world, large amounts of free unstructured data have emerged and, with them, a new field of research: data science. This unstructured data can be used in many fields, from politics and medicine to finance. The main aim of this research is to predict intraday movements of the MICEX index based on the text of finance-related news headlines from the RBC news company.

The data was collected and processed, and features on 12 topics were then extracted from it. Using these features as factors in several regression models, we tried to predict the log-return of the MICEX index. The simple linear model showed no improvement; however, for the slightly more complex SVM model there are specifications in which including the textual data yields better results.

The results of the study will hopefully be useful to investors who intend to invest in different types of stocks and to other economic agents involved in the Russian financial market.

Keywords: machine learning, financial market prediction, topic modeling, intraday market data, autoregressive model, support vector regression, RBC, MICEX.
