Prediction of short-term stock price response to news
A study of financial markets from a machine learning perspective. A natural language processing approach. Implementation of an event study for selecting news. Construction of a prediction model. The influence of the news background on share prices.
Category | Finance, money and taxes |
Type | master's thesis |
Language | English |
Date added | 15.09.2020 |
PERM BRANCH OF THE FEDERAL STATE AUTONOMOUS EDUCATIONAL INSTITUTION OF HIGHER EDUCATION
"NATIONAL RESEARCH UNIVERSITY HIGHER SCHOOL OF ECONOMICS"
Faculty of Economics, Management and Business Informatics
Final qualification work - MASTER'S THESIS
Prediction of short-term stock price response to news
By a student of the Master's programme "Finance"
in the field of study 38.04.08 Finance and Credit
Pepelyaev Bogdan Andreevich
Perm 2020
Abstract
This work is dedicated to the study of financial markets from the perspective of machine learning and the models of portfolio theory. It consists of several parts covering the event study, the main models of portfolio theory and machine learning methods; in addition, it includes a Natural Language Processing (language) model, which is the novel element of the approach. The role of the event study is to select the news background. In the course of the work we examine and compare standard approaches to predicting stock prices that have been investigated earlier, as well as models that are descriptive and necessary for the final part of the study. The literature review gives a detailed account of these models and their origin; it begins with linear models and ends with models that have a large hypothesis space. As a result of the work we create a new model that includes all parts of the study and shows relevant results. The hypotheses about the importance of language models and the significance of news for stock behavior are confirmed.
Table of Contents
Introduction
1. Theoretical background
2. Research design
3. Methodology
3.1 Database
3.2 CAPM
3.3 Fama and French three factors
3.4 Fama and French five factors
3.5 BERT
3.6 Simple regressions
3.7 Random Forest
3.8 SVM
3.9 Neural Networks
4. Results
4.1 Data preparation
4.2 Evaluation of models
Conclusion
References
Introduction
It is widely known that stock prices are hard to predict. Many authors have supported hypotheses about the low predictive power of historical market data. Many asset pricing tools appeared in the 20th century, but the first powerful one was the CAPM. The main idea of the model is a descriptive analysis of the market. This method of asset pricing is hard to test empirically, and researchers later created a modified model, the Fama and French three-factor model. Despite some disadvantages, these models can be used to implement the Event-study approach. More relevant asset pricing models are available nowadays, and we describe and compare them in the following parts.
Machine learning has become a significant part of life: machines help to simplify our life in different areas. One branch of machine learning is Natural Language Processing (NLP), which helps people with different kinds of text tasks. In our research NLP is used to extract a sentiment rating from news, tweets and other publications. We need these ratings for the general model.
Our research is intended to improve the understanding of stock behavior. It consists of three parts: the first is the Event-study approach, the second is the natural language processing approach, and the third covers regressions and machine learning.
The last part of the research is a mix of regressions and machine learning. Its final step is a prediction task, which can be solved by these methods. When we describe a prediction problem, we should define what we predict and which methods we use for it. If we classify «up», «stay» and «down» signals, it is one type of research, and it also requires its own hypothesis. If we classify volatility impulses, it is another type of research. Hence, different types of final output have different implications.
The research question: Can a combination of Event-study, NLP and regressions predict better than models that are based on historical data?
Relevance: Stock market prediction has been an object of study for many decades, but given its innate complexity, dynamism and chaotic nature, it has proven to be a very difficult task. Financial time series are among the most difficult areas of forecasting, because they have many endogenous factors and features. That makes predicting the future behavior of stock prices very hard. By excluding factors from the model, introducing new ones and creating combined models, we simplify research in this area. The fact that stocks are a source of profit is relevant too. Even if the predictions of our model are not significantly better than those of standard models, the research will provide a new base model for future researchers.
Tasks:
• Describe and compare asset pricing models;
• Implement an event study for searching and classifying news;
• Parse and collect data, which consist of three parts for different models;
• Formulate the hypotheses of the research;
• Create and evaluate an NLP model for sentiment;
• Construct the general model for predictions and tune its parameters for better forecasting ability.
1. Theoretical background
When using the Event study method, first of all, it is necessary to determine what will be understood as an event in the framework of a specific study. For example, if it is necessary to identify the reaction of stock prices to the recommendations of analysts, an event may be the appearance in the news feed of information on a change in the recommendation. For other research purposes, the event may be an announcement of a change in the size of dividend payments, publication of a company's financial statements, a merger or acquisition announcement, split of shares, natural disaster and other phenomena, the effect of which must be checked and evaluated in accordance with the hypotheses.
Further, it is assumed that events will be divided according to whether they give the market a signal - positive or negative. A positive event is the announcement of an increase in dividend payments, and a negative event is an announcement of a decrease.
In the case of analysts' recommendations, at first glance a positive event may be the release of a "buy" recommendation, and a negative one the release of a "sell" recommendation. However, a repeat of a recommendation is not news for the market and therefore should not affect it. Thus, a positive event is an upgrade of the recommendation: from "sell" to "hold", from "sell" to "buy", or from "hold" to "buy". Accordingly, any downgrade of the recommendation may be considered a negative event.
The next stage of the Event study is the choice of the event window, that is, the time period during which stock quotes will be observed. In many works devoted to the reaction of stock prices to analysts' recommendations, the event window is 31 days: 15 days before the information on the revision of the recommendation appears, the event day itself and 15 days after it. The period before the change of recommendation has to be considered because this event is predictable and may affect quotes even before it occurs. The period after the change of recommendation is important for assessing how quickly quotes react to new information, that is, for assessing the efficiency of the market.
If, for research purposes, it is necessary to evaluate the possible drift of quotations in the direction of the published revision of the recommendation, then wider event windows are considered. When analyzing the impact of relatively rare and very significant events for the company, such as a merger, acquisition or restructuring, an event window of several years can be used.
Next, the actual stock return observed on each day of the event window is calculated. Since quotes cannot take negative values, daily stock returns are traditionally calculated as logarithmic returns, which corresponds to the assumption of lognormally distributed prices.
The next step is to calculate for each day of the event window the “normal” stock return, the return that would most likely have been observed if the event had not occurred. The simplest and most often used method of calculating normal returns is to take the average observed return over a certain period of time before the event window. When studying the effect of analysts' recommendations on capitalization, a 120-day estimation period preceding the event window is most often used. However, this method assumes that the normal return does not change over time, which is only weakly consistent with reality. One can avoid this premise by assuming that the normal returns on shares of various issuers are the same and equal to the return on the market portfolio. One can avoid both assumptions by assuming, as in the CAPM, a linear relationship between the market return and the return on the issuer's security that is constant over time. In this case, the regression coefficients of the security's return on the market portfolio return are estimated by OLS over the estimation period.
Some foreign studies also take into account the dependence of normal stock returns on the size of the company and the ratio of the book value of assets to capitalization; such a model is the Fama and French three-factor model (FF3) or one of its updated versions. Including these factors makes it possible to identify a change in stock returns caused precisely by a change in recommendation, and not by the analyst's ability to predict a rise or fall in quotes based on the characteristics of the company. Regarding the choice of a model for normal returns, it is worth noting that the simplest model with an average is acceptable for event analysis, and the results obtained with multifactor models do not differ significantly from those obtained with the mean model and the one-factor CAPM. The next step is to calculate the “abnormal” return. Event analysis is based on the assumption that the actual observed stock return at each point in time is equal to the sum of the “normal” and “abnormal” returns. In summary, the approach tests the ability to predict the strength of an event's effect, but it does not predict the price movement entirely. It is necessary to understand which model is best suited to estimating normal returns, since its output is then passed on to the next stage of the approach.
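The event-study steps described above (log returns, a normal return estimated by OLS on the market return, and the abnormal return as the difference) can be sketched as follows. This is a minimal illustration on synthetic data, not the implementation used in this work: the function names, the random series and the size of the simulated news shock are all assumptions, while the 120-day estimation period and the 31-day event window follow the text.

```python
import numpy as np

def log_returns(prices):
    """Daily logarithmic returns from a 1-D sequence of prices."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

def market_model_abnormal_returns(stock_ret, market_ret, est_len):
    """Fit R_stock = alpha + beta * R_market by OLS on the first `est_len`
    observations (estimation period) and return the abnormal returns
    for the remaining observations (event window)."""
    stock_ret = np.asarray(stock_ret, dtype=float)
    market_ret = np.asarray(market_ret, dtype=float)
    X = np.column_stack([np.ones(est_len), market_ret[:est_len]])
    (alpha, beta), *_ = np.linalg.lstsq(X, stock_ret[:est_len], rcond=None)
    normal = alpha + beta * market_ret[est_len:]   # "normal" return
    abnormal = stock_ret[est_len:] - normal        # actual minus normal
    return alpha, beta, abnormal

# Synthetic returns (assumption): 120 estimation days + 31 event-window days.
rng = np.random.default_rng(0)
market_ret = rng.normal(0.0, 0.01, size=151)
stock_ret = 0.0002 + 1.2 * market_ret + rng.normal(0.0, 0.002, size=151)
stock_ret[135] += 0.05  # hypothetical news shock on the event day

alpha, beta, abnormal = market_model_abnormal_returns(
    stock_ret, market_ret, est_len=120
)
car = abnormal.cumsum()  # cumulative abnormal return over the event window
```

In this sketch the event day is the middle observation of the window (index 15 of 31), so the largest abnormal return is expected there; `car` shows whether the effect persists after the event.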
Asset pricing models offer forecasts of an investment's expected return, a key factor in assessing the value of an asset or portfolio. However, the results of empirical tests of asset pricing models have been disappointing. For example, the Capital Asset Pricing Model is "one of the two or three major contributions academic work made to financial managers after the middle in the 20th century", yet CAPM's test results are "highly disappointing." Roll (1988) finds that with "all explanatory variables" included, less than 40 percent of the monthly return volatility of the average stock can be explained for a sample of the largest firms. At the same time, Roll (1988) finds some firms with “impressive explanatory power” and suggests an in-depth study of those firms “for insight.” Fama and French (1992) find no explanatory power for the CAPM beta, even when beta is used as the single explanatory variable in a cross-sectional test on 25 portfolios sorted by market capitalization and book-to-market ratio. They claim that the general use of the CAPM in evaluating portfolio efficiency and calculating the cost of capital should be discontinued. The CAPM is a one-period, static, steady-state equilibrium model with assumptions of rational expectations, equal incentives for investment, homogeneous knowledge of investment opportunities, and the same understanding of the characteristics of investment returns for all investors.
Main assumptions of the CAPM. All investors:
1. Have homogeneous expectations
2. Can lend and borrow unlimited amounts at the risk-free interest rate
3. Can short any asset and hold any fraction of an asset
4. Plan to invest over the same time horizon
5. Care only about the expected return and the volatility of their investments
Main predictions of the CAPM. All investors:
1. Always combine a risk-free asset with the market portfolio
2. Agree on the expected return and the expected variance of the market portfolio and of each asset
3. Agree on the expected market risk premium (MRP) and on the beta of each asset, agree that the market portfolio lies on the minimum-variance frontier and is mean-variance efficient, and expect returns on their investments in line with their betas.
In the real world, by contrast:
1. Investors have heterogeneous expectations
2. Investors do not care only about expected return and volatility
3. Investors also care about jumps, crashes and bankruptcies
4. Investors use different betas for the same share
5. Investors hold different portfolios
6. Investors assume different market risk premia
7. The market risk premium is not the difference between the expected return on the market portfolio and the risk-free rate.
As we can see, the CAPM is not a suitable baseline model for our approach. Moreover, many authors have demonstrated that it performs poorly in different periods. Fama and French (1992) “shot down” the model with their research. Despite that, it was the first model that could be used to find an abstract, workable formula for expected returns. To explore the issue further, we next describe a more complex model.
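For reference, the relation that the CAPM implies between the expected return of asset $i$ and its beta can be written as below. The notation ($R_f$ for the risk-free rate, $R_M$ for the return on the market portfolio) is chosen here for consistency with the factor regressions discussed in this chapter and is an assumption of this note.

$$E[R_i] = R_f + \beta_i \left( E[R_M] - R_f \right), \qquad \beta_i = \frac{\operatorname{Cov}(R_i, R_M)}{\operatorname{Var}(R_M)}$$

The term $E[R_M] - R_f$ is the market risk premium referred to in the lists above.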
The Fama and French three-factor model is derived from the CAPM, which showed a large number of inconsistencies. The appropriateness of the CAPM was questioned because of a significant size effect: companies with a low market value were found to have higher returns, after taking their beta coefficients into account, than large companies of the same sample. Besides this shortcoming, a model with only one factor cannot take into account the book-to-market (B/M) effect, that is, the positive relationship between B/M and the average return on shares. These errors were first documented by Banz (1986), who demonstrated the size inconsistency, and Stattman (1980), who explained the weakness of the concept with respect to B/M.
The Fama and French models can be decomposed into components. In addition, the model helps to analyze portfolio performance, the effects of active management, portfolio composition and forecasts of potential income. The following components give a condensed view of the "FF" model:
· Market premium (beta)
· Size premium (size)
· Management impact (alpha)
· Risk-free return
· Value premium (value)
· Random error
In the paper with the catchy title "Luck versus Skill in the Cross-Section of Mutual Fund Returns", the authors describe their study of funds with the help of the three-factor "FF" model and the four-factor "Carhart" model. The base regression is:
$$R_{it} - R_{ft} = a_i + b_i (R_{Mt} - R_{ft}) + s_i SMB_t + h_i HML_t + m_i MOM_t + e_{it}$$
As can be seen from the description, sampling follows a non-parametric approach and additionally uses the cross-section of the panel data. The terms of the regression are:
· $R_{it}$ - return on fund $i$ in month $t$
· $R_{ft}$ - risk-free rate
· $R_{Mt}$ - market return
· $SMB_t$ - size premium (small minus big)
· $HML_t$ - value premium (high minus low book-to-market)
· $MOM_t$ - momentum factor (price anomaly effect)
· $a_i$ - average return left unexplained by the benchmark model (alpha)
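The four-factor regression above can be estimated by ordinary least squares. The sketch below is only an illustration under stated assumptions: the factor series are synthetic random numbers rather than real MKT-RF, SMB, HML and MOM data, and the function name `fit_factor_model` is hypothetical.

```python
import numpy as np

def fit_factor_model(excess_ret, factors):
    """OLS of a fund's excess return (R_it - R_ft) on factor returns.

    excess_ret: array of shape (T,)
    factors:    array of shape (T, K), e.g. columns [MKT-RF, SMB, HML, MOM]
    Returns (alpha, loadings), where loadings has shape (K,).
    """
    excess_ret = np.asarray(excess_ret, dtype=float)
    factors = np.asarray(factors, dtype=float)
    X = np.column_stack([np.ones(len(factors)), factors])  # add intercept
    coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
    return coef[0], coef[1:]

# Synthetic monthly data (assumption), generated with known loadings.
rng = np.random.default_rng(1)
T = 600
factors = rng.normal(0.0, 0.03, size=(T, 4))
true_loadings = np.array([1.0, 0.4, -0.2, 0.1])  # b_i, s_i, h_i, m_i
excess_ret = 0.001 + factors @ true_loadings + rng.normal(0.0, 0.005, size=T)

alpha, loadings = fit_factor_model(excess_ret, factors)
```

Dropping the last column of `factors` gives the three-factor version of the same regression.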
In their study the authors aimed to demonstrate that, by adding costs back to fund returns in the model, one can see whether returns are undervalued or overestimated, which in turn reveals the actual alpha without distorting assumptions. The Fama and French model emerged after the inconsistencies and complexities of the first model had been clarified; it divides stocks by ratios and market capitalisation. In addition to the size grouping, the authors introduced the ratio of book value to market value. After these adjustments it became apparent that the coefficients of the model's components are robust to additions. The classic Fama and French model sorts shares into three book-to-market categories containing 30%, 40% and 30% of the sample, and by size into small and big firms. The criteria for forming the portfolios are therefore market value and the book-to-market ratio:
· S / L - small market value, small ratio of book value to market value;
· S / M - small market value, average ratio of book value to market value;
· S / H - small market value, large ratio of book value to market value;
· B / L - large market value, small ratio of book value to market value;
· B / M - large market value, average ratio of book value to market value;
· B / H - large market value, large ratio of book value to market value.
Apart from forming the portfolios, two model parameters can be derived from them: SMB and HML. SMB captures the risk related to company size: it is the difference between the average returns of the small-cap and the large-cap portfolios with comparable book-to-market ratios. HML captures the value premium: it is the difference between the average returns of the portfolios with a high and with a low ratio of book value to market value.
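Given the returns of the six portfolios listed above, SMB and HML can be computed as simple differences of averages. The sketch below follows the standard construction of Fama and French (1993), stated here as background rather than as the exact procedure of this work; the one-period returns in the example are made-up numbers.

```python
import numpy as np

def smb_hml(p):
    """Compute SMB and HML from the six size/book-to-market portfolios.

    p maps the labels "S/L", "S/M", "S/H", "B/L", "B/M", "B/H"
    to return arrays of equal length.
    """
    r = {k: np.asarray(v, dtype=float) for k, v in p.items()}
    small = (r["S/L"] + r["S/M"] + r["S/H"]) / 3.0
    big = (r["B/L"] + r["B/M"] + r["B/H"]) / 3.0
    high = (r["S/H"] + r["B/H"]) / 2.0
    low = (r["S/L"] + r["B/L"]) / 2.0
    return small - big, high - low  # (SMB, HML)

# Hypothetical one-period returns for the six portfolios.
example = {
    "S/L": [0.01], "S/M": [0.02], "S/H": [0.03],
    "B/L": [0.00], "B/M": [0.01], "B/H": [0.02],
}
smb, hml = smb_hml(example)
```

With these inputs the small portfolios average 0.02 and the big ones 0.01, so SMB is 0.01; the high book-to-market portfolios average 0.025 and the low ones 0.005, so HML is 0.02.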
The Fama and French three-factor model has some significant advantages over the standard CAPM. The critical advantage is its better descriptive statistics; hence, we have a better baseline model for the approach. Despite these obvious findings, we will compare the models on different metrics and add some options to them.
Nowadays there is a further modification of the three-factor Fama and French model: the five-factor model, which adds two variables. RMW (the profitability factor) is a measure of operating profitability, and CMA (the investment factor) is based on the change in the book value of total assets from the beginning to the end of the previous period divided by the previous period-end book value of total assets. Although in many cases the FF5 model performs better, it is not suited to every situation. Fama and French (2017) analyzed international markets and found that CMA is a redundant investment factor for Europe, Japan and the Asia Pacific. Fama and French also found that the performance of the new factors differs between small and large stocks and between regions. In addition, Guo (2017) found that the profitability factor significantly improves the description of average returns, while investment-related patterns in average returns are weak in China's stock market. Fama and French (2015) use data from July 1963 to December 2013 to check the performance of the five-factor model for the US market.
Their results indicate that the five-factor model does better than the Fama and French three-factor model (1993). However, the five-factor model fails to capture the low average returns on small stocks whose returns behave like those of firms that invest heavily despite low profitability. They also demonstrate that the performance of the model is not sensitive to the way its factors are defined. Their results also suggest that the value factor (HML) becomes redundant once the two additional factors are included. Regarding specification, our final model includes all variables that have any positive or negative impact on the target. Our approach is not aimed at a precise fit; it relies on common knowledge to achieve better descriptive and predictive performance. To put it in a nutshell, the five-factor Fama and French model is more competitive than the previous models.
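For reference, the five-factor regression of Fama and French (2015) extends the three-factor specification with the two factors described above. It is written here in the notation of the four-factor regression given earlier in this chapter, as background rather than as the exact specification estimated in this work; the loadings $r_i$ and $c_i$ are the sensitivities to the profitability and investment factors.

$$R_{it} - R_{ft} = a_i + b_i (R_{Mt} - R_{ft}) + s_i SMB_t + h_i HML_t + r_i RMW_t + c_i CMA_t + e_{it}$$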
Models of portfolio theory are very important in the research, but it also consists of other, no less significant parts: the second part is Natural Language Processing and the third is Machine Learning. Natural Language Processing helps with the sentiment analysis of large text data based on tweets, news and bank reports. Sentiment analysis of news involves identifying and describing the 'emotional state' conveyed in the text. In opinion mining, the sentiment word dictionary plays a crucial role in building linguistic resources that classify sentiment polarity, quantify the breadth of sentiment, and discriminate between feelings. In particular, a stock domain-specific dictionary has been developed that demonstrated greater accuracy in forecasting stock market changes than general sentiment dictionaries. Similarly, using NLP, we extract sentiment words from the press, measure a sentiment score and mine opinions. Kim, Jeong and Ghani (2014) used simple NLP methods for their approach, starting from the word2vec method. They demonstrated the advantages of NLP in markets and showed how a model based only on news can predict the market price. Although the article does not provide back-testing over a full window, the impact of NLP is explicit.
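The idea of a dictionary-based sentiment score mentioned above can be illustrated with a very small sketch. It is not the dictionary or the scoring rule of any cited study: the toy lexicon, its word polarities and the averaging rule are all assumptions made for illustration.

```python
import re

# Hypothetical toy lexicon; a real stock-domain dictionary would be far larger.
LEXICON = {
    "gain": 1, "beat": 1, "upgrade": 1, "growth": 1,
    "loss": -1, "miss": -1, "downgrade": -1, "lawsuit": -1,
}

def sentiment_score(text, lexicon=LEXICON):
    """Average polarity of the lexicon words found in `text`.

    Returns a value in [-1, 1]; 0.0 if no lexicon word occurs.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

positive = sentiment_score("Analysts upgrade Apple after earnings beat")
negative = sentiment_score("Company reports loss and faces lawsuit")
```

A score of this kind, aggregated per stock and per day, is one simple way to turn news text into a numeric feature for a regression or a machine learning model.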
Another significant article, with more complex back-testing, was written by Ding, Zhang, Liu and Duan (2015). They improved the basic NLP approach for big-data models. Their evaluation involves two state-of-the-art financial-news-based stock market prediction systems. They mimic actual stock trading by imitating, in a simplified way, the actions of a day trader who uses the model. If the model indicates that an individual stock price will rise the next day, the fictional trader buys $10,000 worth of that stock at the opening price. After buying, the trader holds the stock for a single day. If the stock makes a profit of 2% or more during the holding period, the trader sells immediately. Otherwise, the trader sells the stock at the closing price at the end of the day. The same technique is used for shorting if the model predicts a fall in a particular stock price. With developments in NLP techniques, various researchers have found that financial reports can have a drastic impact on the share price of a company: Cutler (1998), Tetlock (2008), Luss and d'Aspremont (2012), Wang and Hua (2014). Moreover, Ding, Zhang, Liu and Duan (2015) showed that deep learning is useful for event-driven stock price movement prediction by proposing a novel neural tensor network for learning event embeddings and using a deep convolutional neural network to model the combined influence of long-term and short-term events on stock price movements. The article is a comprehensive review of NLP and some machine learning methods, but it lacks a stable statistical model, which we need in our research as a base point.
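The simulated trading rule described above can be sketched for a single day as follows. This is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the daily high and low are used to decide whether the 2% profit level was reached during the day, and the exit is assumed to happen exactly at that level.

```python
def simulate_day_trade(signal, open_price, high, low, close,
                       capital=10_000.0, take_profit=0.02):
    """Profit in dollars of one simulated day trade.

    signal: +1 if a rise is predicted (go long), -1 if a fall is
            predicted (go short), 0 for no trade.
    """
    if signal == 0:
        return 0.0
    shares = capital / open_price  # position bought/sold at the open
    if signal > 0:
        target = open_price * (1 + take_profit)
        # Sell at the target if it was reached, otherwise at the close.
        exit_price = target if high >= target else close
        return shares * (exit_price - open_price)
    target = open_price * (1 - take_profit)
    # Cover the short at the target if it was reached, otherwise at the close.
    exit_price = target if low <= target else close
    return shares * (open_price - exit_price)

# Long trade that reaches the 2% target during the day.
profit_hit = simulate_day_trade(+1, open_price=100.0, high=103.0, low=99.0, close=101.0)
# Long trade that never reaches the target and is closed at a loss.
profit_miss = simulate_day_trade(+1, open_price=100.0, high=101.0, low=98.5, close=99.0)
```

Summing the result over all days and stocks gives the back-test profit of a model's signals.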
We can reasonably assume that NLP methods will have a significant impact on the general model; the back-testing results support this. In order to build the general model on top of the previous parts, we need to create a machine learning wrapper. The wrapper depends on the results of the models. Basic machine learning algorithms include regressions, clustering, random forests, support vector machines and many others. We use only modern methods that have shown successful predictive results. Yoo, Kim and Jan (2005) demonstrated the differences between simple regressions, SVMs and NNs in their research. The article explores machine learning methods for forecasting the market. Stock market forecasting is seen as a difficult task within financial time series forecasting. The authors review recent advances in stock market forecasting models and discuss their advantages and drawbacks. They also analyze numerous world events and the challenges they pose for stock market forecasts. They conclude that integrating event information into a prediction model plays a very important role in more accurate forecasting.
Consequently, a specific event weighting method and a reliable event extraction scheme are required to obtain better results in the prediction of financial time series. First of all, the authors describe simple regressions with a linear relationship between variables. Traditional statistical models are widely used in economics for time series prediction. Many researchers have claimed that NNs substantially outperform traditional statistical methods. NNs showed the capability to predict market movement correctly 92% of the time, while the Box-Jenkins approach, also known as ARIMA, achieved only a 60% accuracy rate. The standard ARIMA model cannot cover all possible hypotheses in the data and does not capture general and local patterns sufficiently, because it is a linear model. Support Vector Machines are more comprehensive than ARIMA. SVM is a specific type of learning algorithm characterized by capacity control of the decision function, the use of kernel functions and the sparsity of the solution. Based on the principle of structural risk minimization, which estimates a function by minimizing an upper bound of the generalization error, SVM is very resistant to over-fitting and eventually achieves high generalization performance. The authors ran a market simulation in which a simple greedy strategy allowed their model to generate more profit than standard algorithms.
Another article about machine learning methods for stocks was written by Sreelekshmy, Vinayakumar, Gopalakrishnan, Menon and Soman (2017). They propose a deep learning-based formalization for stock price prediction. Deep neural network architectures are known to be capable of detecting hidden patterns and of making predictions. The CNN model is able to detect changes in patterns and is identified by the authors as the best model for the proposed methodology. For prediction, it uses only the information available at a given moment. Although the other two models considered are used in many other kinds of time-dependent data analysis, in this case they do not outperform the CNN architecture. For the LSTM model the authors would need more reliable variables and a non-standard or custom optimization function, which would give normal errors and a good recalculation of the model.
All articles provide a unique knowledge background for the general model. Modern machine learning algorithms allow different kinds of market prediction, such as volatility, «up» or «down» signals and exact price predictions. The evolution of portfolio theory models yields a more successful descriptive model, which will be the base model in the methodology. NLP can improve the base model or base regression with text mining or opinion mining based on a sentiment index. One issue of the general model is overfitting. Overfitting is a common problem for NNs: such models perform well on training data but generalize poorly and have weak power on real data. Dropout is a technique for addressing this problem, described by Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov (2014). Dropout reduces overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently. The term "dropout" refers to dropping out units in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections. Units are dropped from the NN at random at every training step. The authors demonstrated the results of dropout on the MNIST set (a basic dataset for every deep learning specialist), where the error of the model decreased by about a factor of two. Dropout is necessary in our approach because of the large NNs and time-sorted data; otherwise the results would not be reliable.
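The mechanics of dropout described above can be shown in a few lines. The sketch below uses the "inverted" variant, which rescales the surviving units during training so that nothing has to change at inference time; this is a common equivalent formulation and an assumption of this example, not necessarily the variant used in the cited paper or in this work.

```python
import numpy as np

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout.

    During training each unit is zeroed with probability `p_drop` and the
    survivors are scaled by 1 / (1 - p_drop); at inference time the input
    is returned unchanged.
    """
    if not training or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop  # True for units that are kept
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(42)
activations = np.ones((1000, 50))          # hypothetical layer output
dropped = dropout(activations, p_drop=0.5, rng=rng)
at_inference = dropout(activations, p_drop=0.5, rng=rng, training=False)
```

Because a fresh random mask is drawn at every training step, each step effectively trains a different thinned network, which is the source of the regularizing effect.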
The unresolved question is which combination of models performs best in predicting stocks. Every article has advantages and disadvantages that allow the models to be analyzed side by side, but the authors did not reach a common conclusion. From the theoretical background we have learned that non-linear models perform better than linear models, that NLP has a significant impact on stock predictions, and that custom or newer machine learning models are better than older ones; the same holds for newer models of portfolio theory.
The research's base theory consists of:
1. Portfolio theory, implemented with Event-study
2. Linear regressions
3. Non-linear regressions
4. Natural Language Processing as a separate direction of machine learning
5. Machine Learning methods (not NNs)
6. Deep Learning methods
The general model combines all these methods in one model, and the parameters of each model have their own weights, which helps to predict stocks. The main advantage of the general model is NLP, that is, a sentiment index based on BERT. We will test the sentiment index in the general model and report all important descriptive results.
2. Research design
First of all, we need to analyze the models of portfolio theory. We will develop the general model on a stable base model such as the Fama and French five-factor or three-factor model. This is important for the approach, because reliable descriptive statistics are the key to predicting stocks.
The second component is the event-study model for news classification. In this part we can get by with a simple model. The task is to rank news by the price deviation associated with it, and it does not require a complex model. A simple model allows us to obtain volatility, up or down signals and accurate directions of price movements. The real task of the event study in this work is to serve as a search system for training the general model and for obtaining its predictions.
The third component is Natural Language Processing, which has a significant impact on our approach. Following the Theoretical Background, we can assume that NLP based on BERT can raise the general model to a new level of prediction. Authors have demonstrated that different NLP methods give different results, but we will test only one, BERT. Many authors have shown BERT to be a reliable NLP method.
The fourth component is a machine learning model, or wrapper, for the general model. We will test simple models, pure machine learning models and deep models and compare them. Meanwhile, the overfitting problem is addressed by the Dropout technique. We chose this method in order to obtain a better-performing and more generalized model whose results hold on real data.
Besides these four components, we need to formulate hypotheses. The first: Can NLP or a sentiment index improve models based on historical data?
The second: Is the type of the final machine learning wrapper important for predictions?
Both hypotheses are important for the research. They complement each other and make the approach unique. If one of them turns out not to be significant, it will not be a serious problem, since the approach will still show a new result in the complex modeling of this type of stock prediction. If neither of them is significant, we will know that the area has some endogenous factors, which we or other researchers could find in future works.
All of the methods are suitable for the approach. In the Theoretical Background we showed important techniques from different areas for constructing the new general model, and all of these techniques are reliable and up to date. In addition to these techniques, we will demonstrate custom fine-tuning for better results. It is not part of the Theoretical Background, but it is a significant part of the Results and Conclusion.
3. Methodology
3.1 Database
Data was collected from various sources. The first is the New York Stock Exchange, through the CRSP (Center for Research in Security Prices) database, from 1963 to 2019. Based on the NYSE data, abnormal returns are estimated with the five-factor FF model. In addition to the model, the data was used for a comparative analysis of an individual stock market.
The next part of the data covers Apple, Google, Amazon and Microsoft from 2006 to 2019 and Facebook from 2012 to 2019, and is used to study abnormal returns through sentiment analysis. To increase the descriptive ability of the complex model, we took data on LIBOR rates for the period from 2006 to 2019.
The next part of the data contains the values of USDX from 2006 to 2019, an index of the strength of the US dollar against six major currencies, which also adds descriptive ability to the final model.
Statements from the banking sector, large financial companies and Twitter were taken as the basis for sentiment analysis. The base consists of typical online publications made as news is broadcast or published. In addition to the usual base for sentiment analysis, a base of positive and negative statements was parsed.
The DJI dataset was not parsed. It consists of financial reports and contains positive, negative and neutral news.
In the workflow part, the data are concatenated into one complex dataset. We split the dataset for the general model into training, validation and test samples. This method is accepted by the machine learning community as simple and reliable.
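The split into training, validation and test samples can be sketched as follows. The text does not state the proportions or whether the rows are shuffled, so the fractions below and the choice to keep chronological order are assumptions; `chronological_split` is a hypothetical helper name.

```python
def chronological_split(rows, train_frac=0.7, val_frac=0.15):
    """Split date-ordered rows into train, validation and test parts without shuffling.

    The default fractions are illustrative assumptions, not values from the study.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    # Earlier rows train the model; the most recent rows are held out for testing.
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]
```

Keeping the order means the test sample always lies after the training sample in time, which avoids leaking future prices into training.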
The data consist of:
1. CRSP NYSE - 1963-2019, 14150 observations
2. Apple, Google, MSFT, FB, AMZN - 2006-2019, 3370 observations
3. LIBOR - 2006-2019, 3370 observations
4. USDX index - 2006-2019, 3370 observations
5. News data:
a. DJI - 140 000 observations
b. FPB - 213 000 observations
c. Tweets - 120 000 observations
We did not keep NA values in the complex dataset, and we did not replace them with average values or similar news. The main problem with the data is parsing the news and labeling it into a regular dataset. It took about 138 hours to parse all the data. The next problem is the biased distribution of news, but it is not critical. News items have a regular distribution over the timeline, and we usually have 5 positive, 5 negative and 5 neutral news items at each point of the timeline.
Despite the uniqueness of the data, we have some limitations:
1. The dataset is sufficient, but not large.
2. Some news items do not have a clear sentiment and could be either positive or neutral, or either negative or neutral. This problem could be solved with 3 million or more labeled news items.
3. There is too much news about different companies. If we train the model on a specific company, we drop irrelevant news about other companies for better performance.
4. Some companies have extremely positive or extremely negative news, which has an unpredictable effect on stocks.
5. We cannot include insider news, for two reasons. The first is paid sources. The second is closed domains or private forums.
3.2 CAPM
The CAPM equation is used to measure an asset's expected return. It is based on the concept of systematic risk, for which investors must be compensated in the form of a risk premium. A risk premium is a rate of return above the risk-free rate. Investors demand a higher risk premium when they take on riskier investments. This may seem to contradict common sense: the investor should be rewarded for the risk taken when putting money into a company. The model's logic, however, rests on the fact that the investor diversifies: although the investments in the portfolio have different risk profiles, losses from one asset can often be offset by income from another, which significantly lowers the real level of risk the investor accepts.
A simple model for describing the market:
$ER_i = R_f + \beta_i (R_m - R_f)$, where:
· $ER_i$ - expected return of the investment
· $R_f$ - risk-free rate
· $\beta_i$ - beta of the investment
· $(R_m - R_f)$ - market risk premium
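The CAPM relation above can be checked numerically with a short sketch. The function names are hypothetical, and estimating beta as the covariance with the market divided by the market variance is the standard definition rather than a procedure described in this work.

```python
def capm_expected_return(risk_free, beta, market_return):
    """Expected return ER_i = R_f + beta_i * (R_m - R_f)."""
    return risk_free + beta * (market_return - risk_free)

def estimate_beta(asset_returns, market_returns):
    """Beta as covariance(asset, market) divided by variance(market)."""
    n = len(market_returns)
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    return cov / var
```

For example, with a 2% risk-free rate, a beta of 1.5 and an 8% market return, the market risk premium is 6% and the expected return is 2% + 1.5 × 6% = 11%.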
3.3 Fama and French three factors
This model extends the previous one with two additional factors. In explaining market prices, the three-factor model takes a different approach: investors are concerned about three separate risk factors. Fama and French established this in their article. In fact, they found that investors care about many different risks in the real world. However, the risks that are systematically priced and that together best explain performance and prices are market, size and value. Investors' returns reflect the cost of a company's capital. Even in the secondary market, the cost of a firm's capital is best assessed by the price of its securities.
Small businesses have to pay more for capital when they borrow or issue securities on the capital markets. Distressed companies with weak prospects, bad financial results, erratic profits and poor management also pay more. Small and distressed companies have lower share prices to compensate investors for this risk. Fama and French found that the measure with the most predictive strength is the ratio of a stock's net book value to its share price (B/M). B/M is a significant factor in their approach.
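$R_{it} - R_{ft} = a_i + b_i (R_{Mt} - R_{ft}) + s_i SMB_t + h_i HML_t + e_{it}$, where:
· $R_{it}$ - return on fund i in month t
· $R_{ft}$ - risk-free rate
· $R_{Mt}$ - market return
· $SMB_t$ - size factor; the median market value is used to divide all firms into small and big groups
· $HML_t$ - value factor; percentiles of the B/M ratio are used to divide firms into groups
· $a_i$ - intercept, the average return relative to the benchmark
· $b_i$, $s_i$, $h_i$ - factor loadings; $e_{it}$ - residual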
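As a minimal sketch of how the three-factor regression above could be estimated by ordinary least squares, assuming NumPy is available and that the excess returns and factor series are already aligned by date. The function name and the dictionary keys are hypothetical.

```python
import numpy as np

def fit_three_factor(excess_ret, mkt_excess, smb, hml):
    """OLS estimate of (a_i, b_i, s_i, h_i) in the three-factor regression.

    excess_ret is R_it - R_ft and mkt_excess is R_Mt - R_ft, one value per period.
    """
    # Design matrix: a column of ones for the intercept, then the three factors.
    X = np.column_stack([np.ones(len(mkt_excess)), mkt_excess, smb, hml])
    y = np.asarray(excess_ret, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(("alpha", "beta_mkt", "s_smb", "h_hml"), coef))
```

The five-factor model can be fitted the same way by appending the RMW and CMA series as two further columns of the design matrix.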
3.4 Fama and French five factors
The current FF model adds two further factors. Fama and French extended their model to five factors. Compared with the initial three factors, the current model introduces the idea that companies reporting higher future earnings have higher stock returns, a factor called profitability. The fifth factor, referred to as investment, relates to internal investment and returns, suggesting that firms that channel profits into large growth projects are likely to suffer losses in the stock market. The Fama and French five-factor model has the best descriptive statistics compared with CAPM and the three-factor Fama and French model, but not in all aspects.
• MKT (Market Risk) - The market factor is equal to the value-weighted returns of all shares minus the risk-free rate;
• SMB (The size factor) - The median market value is used to divide all firms into small and big groups;
• HML (The value factor) - Percentiles of the B/M ratio are used to divide firms into groups;
• RMW (The profitability factor) - The measure of operating profitability;
• CMA (The investment factor) - Calculated as the change in the book value of total assets from the beginning to the end of the previous period divided by the previous end book value of total assets;
• RF (Risk-free) - the return on a risk-free asset.
We have chosen 5FF for its best descriptive statistics; its factors are included in the final data.
3.5 BERT
BERT is a model for NLP tasks. One of our tasks is to extract sentiment from news in order to predict price activity. The model helps to understand the context of a news item and to classify it for forecasting. It also helps to recognize patterns in the data.
Masked Language Model (MLM) - some of the words in the input are masked out, and the model conditions on the context in both directions to predict the masked words. Before word sequences are fed into BERT, 15 percent of the words in each sequence are selected and replaced with a mask token. The model then attempts to predict the original value of the masked words, based on the context given by the other, unmasked words.
Next Sentence Prediction (NSP) - BERT learns relationships between sentences. During training the model receives pairs of sentences as input and learns to determine whether the second sentence in the pair is the sentence that follows the first in the original text.
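The masking step of MLM can be illustrated with a simplified sketch. It is not the pre-training code of BERT: here every selected token becomes `[MASK]`, whereas the original BERT recipe also leaves some selected tokens unchanged or swaps them for random tokens. The function name and the `seed` argument are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens and return (masked_tokens, labels).

    labels maps each masked position to the original token the model must recover.
    Simplification: every selected token is replaced with [MASK].
    """
    rng = random.Random(seed)
    masked, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            labels[i] = tok
    return masked, labels
```

During pre-training the loss is computed only at the positions stored in `labels`, so the model has to use the unmasked words on both sides to reconstruct the hidden ones.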
3.6 Simple regressions
Regression analysis is a method for modeling measured data and studying their properties. The data consist of pairs of values of the dependent variable (response variable) and the independent variable (explanatory variable). The regression model is a function of the independent variable and parameters, with a random variable added. The model parameters are adjusted so that the model best approximates the data. The criterion for the quality of the approximation (objective function) is usually the squared error: the sum of the squares of the differences between the model values and the values of the dependent variable over all values of the independent variable. Regression analysis is used to predict, analyze time series, test hypotheses and identify hidden relationships in data.
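The least-squares criterion described above can be made concrete for the one-variable case. This is a minimal sketch with hypothetical function names; the closed-form slope and intercept are the standard ones for simple linear regression.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_squared_error(x, y, slope, intercept):
    """Objective function: sum of squared differences between model and data."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
```

Any other choice of slope and intercept gives a larger value of `sum_squared_error` on the same data, which is what "best approximates" means under this criterion.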
3.7 Random Forest
Random Forest is an ensemble of decision trees. In a regression problem their answers are averaged. In a classification problem the decision is made by majority vote. The difference from simple regressions is that RF combines many outputs or votes. The decision tree is the intuitive basic unit of the random forest algorithm. We can think of it as a series of yes or no questions about the input. Ultimately, the questions lead to the prediction of a particular class. This is an interpretable model, since decisions are made the way people make them: we ask questions about the available data until we reach a decision.
The basic idea of the decision tree is to form queries with which the algorithm accesses the data. The tree forms nodes that contain a large number of samples belonging to the same class. The algorithm tries to find parameters with similar values.
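How a forest combines its trees can be shown with a small sketch. The "trees" below are hand-written yes/no rules standing in for fitted decision trees, and the feature names `sentiment` and `mkt_rf` are invented for illustration.

```python
from collections import Counter
from statistics import mean

def forest_predict(trees, x, task="classification"):
    """Combine the predictions of individual trees for one input x."""
    predictions = [tree(x) for tree in trees]
    if task == "regression":
        return mean(predictions)  # regression: average the tree outputs
    # classification: the class with the most votes wins
    return Counter(predictions).most_common(1)[0][0]

# Three toy "trees", each a single yes/no question about the input.
toy_trees = [
    lambda x: "up" if x["sentiment"] > 0 else "down",
    lambda x: "up" if x["mkt_rf"] > 0 else "down",
    lambda x: "up" if x["sentiment"] > 0.5 else "down",
]
```

For an input with mildly positive sentiment and a positive market return, two of the three toy trees vote "up", so the forest predicts "up" even though one tree disagrees.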
3.8 SVM
Support-vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. Nevertheless, we can use the SVR modification for regression. The choice depends on our approach in the last step of the workflow.
3.9 Neural Networks
NNs are a non-standard and relevant method for classification and regression tasks. A major difference between simple methods and NNs is the high dimensionality of the hypothesis space; we can also account for the time dimension with LSTM or GRU networks. The activation function of the last layer of the NN varies and also depends on the approach. Tuning hyper-parameters for better forecasting and classification is an additional research task.
All of the methods have advantages and disadvantages. Despite the large number of models, we can compare them and choose the most relevant one for the approach. NNs have a critical advantage in performance and generalization, but a serious problem with overfitting. SVMs do not have problems with overfitting; however, they have no advantages in local forms. Simple regressions are standard models with many assumptions and low predictive power. The difference between CAPM, the three-factor and the five-factor Fama and French models may turn out to be insignificant. We test all features of the model in the next part of the research.
4. Results
4.1 Data preparation
First of all we need to tokenize the FPB data, which contains a large amount of unstructured text. This step is preprocessing for the modeling in the research. The tools are:
· Python 3.6;
· Csv module;
· Re module;
· Pandas module;
· Nltk module;
Cleaning process:
• Remove duplicates from data;
• Get dummies for label column;
• Remove whitespaces;
• HTML decoding;
• Remove strings gaps;
• Apply stemming and lemmatization;
• Select the required companies and create additional pandas frames for them - MSFT, AAPL, AMZN, FB, Google;
• Choose the starting year to train on and the sent tokenizer from the module;
• Choosing from 5FF:
o MKT, SMB, HML, RMW, CMA, DATE;
• Choosing from LIBOR:
o 1 month, 3 months and date;
• Choosing from USDX:
o Close values and date;
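The first few cleaning steps listed above (duplicates, whitespace, HTML decoding) can be sketched with the standard library alone. This is an illustration under assumptions, not the script used in the work: the function names are hypothetical, and stemming, lemmatization and company filtering are left out.

```python
import html
import re

def clean_text(raw):
    """Decode HTML entities, strip leftover tags and collapse whitespace."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)  # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace and line breaks
    return text.strip()

def clean_corpus(records):
    """Clean (text, label) pairs, dropping empty strings and duplicates in order."""
    seen, cleaned = set(), []
    for raw, label in records:
        text = clean_text(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append((text, label))
    return cleaned
```

Duplicates are detected after cleaning, so two news items that differ only in spacing or markup are treated as the same item.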
After the first step we have a cleaned database, ready to be transformed for the models. The second step in the workflow is to create the BERT model. The tools are:
• Python 3.6;
• Pytorch module;
• Sklearn module;
• Pandas module;
• Torch module;
• Fastai module;
• Wrapper for BERT;
• Pre-trained BERT model;
We tokenize with the maximum sequence length and use a data bunch to create a separate token for every word. After that we train BERT on data without news about the target company in the period before the one we have chosen. In our case this is data from 2016 to 2019, which is divided into training and test data. An open question here is whether to randomize the data or not, because sequential patterns in the news can affect forecasting. This is tested in the final part. We use sequence classification with labels. Our research has 2 or 3 labels; we start with 2 labels, positive and negative. We add a loss function on the logits, the embedder, pooler and encoder, then train, test and tune the model. In the end we obtain 2- or 3-class predictions for the news, with sentiments between -1 and 1. We can treat sentiments between -0.1 and 0.1 as neutral to create 3 classes of labels.
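The mapping from a sentiment score in [-1, 1] to two or three classes can be written as a short sketch. The function name and the treatment of the exact boundary values are assumptions; the neutral band of 0.1 is the one given above.

```python
def sentiment_label(score, neutral_band=0.1):
    """Map a sentiment score in [-1, 1] to 'negative', 'neutral' or 'positive'.

    Scores inside [-neutral_band, neutral_band] are neutral (boundary handling
    is an assumption). With neutral_band=0 only a score of exactly 0 is neutral,
    which approximates the two-class setup.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"
```

For example, a score of 0.05 is neutral under the three-class scheme but positive when the band is set to zero.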
The third step is concatenating all data into one dataset. The final data consist of 5FF, X_company (MSFT, Google, FB, AAPL, AMZN), USDX, LIBOR and dated sentiments from the BERT model. The target variable is X_company and the other variables are independent variables. The models are linear regression, random forest, support vector machine, standard NNs and custom NNs.
At the end of the workflow we add one more layer with 3 classes for another direction of predictions. The first class is low-volatility news, with daily returns from -0.3 to 0.3 percent. The second class is medium-volatility news, with daily returns from -1 to -0.3 or from 0.3 to 1 percent. The third class is high-volatility news, with daily returns below -1 or above 1 percent. The return is calculated as the current price minus the previous price, divided by the previous price.
We now have a clean final dataset and are ready to evaluate models. The evaluation starts from a baseline, which shows a basic approach for time series data, and we compare our models against it.
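The three volatility classes can be expressed directly from the daily return. This is a minimal sketch with hypothetical function names; assigning the exact boundary values (0.3 and 1 percent) to the lower class is an assumption, since the text does not say which side they belong to.

```python
def daily_return_pct(price_today, price_prev):
    """Daily return in percent: (current price - previous price) / previous price."""
    return (price_today - price_prev) / price_prev * 100.0

def volatility_class(return_pct):
    """Classify a daily return (in percent) as low, medium or high volatility."""
    size = abs(return_pct)
    if size <= 0.3:
        return "low"
    if size <= 1.0:
        return "medium"
    return "high"
```

A move from 100.0 to 99.5 is a return of -0.5 percent and therefore falls into the medium class; the sign of the move does not affect the class.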
4.2 Evaluation of models
Evaluation of models starts with time series checks on the dataset. First, we need to examine trend and seasonality and make sure our data is stationary. There are three basic criteria for a series to be classified as stationary:
1. The mean of the time series should not be a function of time. It should be constant.
2. The variance of the time series should not be a function of time.
3. The covariance of the ith term and the (i+m)th term should not be a function of time.
We check stationarity as follows; our approach would not be reliable without this step:
1. The Dickey-Fuller test is used to check the stationarity of the series.
2. The null hypothesis of the test is that the time series is not stationary.
3. The alternative hypothesis is that the time series is stationary.
Figure 1 - AAPL prices
As we can see, the series is not stationary: it has seasonality and a trend. The price rose from 2008 to 2020.
The test statistic is larger than the critical value and the p-value is greater than 5%, so we cannot reject the null hypothesis of non-stationarity; the data shows an increasing trend. We therefore need to make the data stationary by removing the trend and seasonality.
Figure 2- Stationary AAPL
We can see in Figure 2 that the test statistic is less than the critical value and the p-value is less than 5%. In other words, we are confident that the trend is almost removed by taking log prices. The mean of the time series can be stabilized with shift differencing, which leaves a residual with random variation. The seasonal order is 4. We are ready to fit a simple regression such as ARIMA to the data, but it is not necessary.
The next step in the evaluation is comparing portfolio theory models. The comparison table is important for the general model, and it lets us draw conclusions about the usefulness of the newer models.
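The log transform followed by shift differencing can be sketched in a few lines. The function name and the `lag` parameter are hypothetical; a lag of 1 gives ordinary log differences, and a lag of 4 would correspond to the seasonal order mentioned above. The Dickey-Fuller test itself is not reproduced here.

```python
import math

def log_difference(prices, lag=1):
    """Return log(p_t) - log(p_{t-lag}) for a list of positive prices."""
    logs = [math.log(p) for p in prices]
    # The first `lag` observations have no earlier value to subtract.
    return [logs[t] - logs[t - lag] for t in range(lag, len(logs))]
```

For a price that grows by a constant percentage each period, the log-differenced series is constant, so the exponential trend in the raw prices disappears.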
Table 1
Model | R-squared | Adj. R-squared | Observations | AIC | BIC
CAPM | 0.711 | 0.711 | 2879 | 3150 | 3150
3FF | 0.714 | 0.715 | 2879 | 3147 | 3147
5FF | 0.719 | 0.719 | 2879 | 3142 | 3145
Full data (libor, usdx) | 0.832 | 0.832 | 2879 | 2849 | 2853
Full data plus sentiment index | 0.915 | 0.915 | 2879 | 2798 | 2803
As we can see from Table 1, newer portfolio theory models have some advantages, but they are not significant. We can construct the general model with CAPM, 3FF or 5FF, and the difference between the results would not be critical. When we added LIBOR and USDX, the model became more reliable according to the statistics. The best model is the full data model with the sentiment index. In summary, we obtain a better base model with OLS estimation than the standard portfolio theory models, even though the model is only linear. As for predictions with OLS models, the authors in the review demonstrated that such models do not have significant predictive power even with a lag of one. In the next step of the evaluation we show predictive statistics (RMSE, MAE or SMAPE) for the best OLS-estimated model and compare it with machine learning models, which are also trained on the full data with sentiments.
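The three predictive statistics named above can be computed as in the following sketch. SMAPE has several variants in the literature; the one used here divides the absolute error by the mean of the absolute actual and predicted values, which is an assumption, since the text does not state which variant is applied.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0:  # a pair of zeros contributes no error
            total += abs(a - p) / denom
    return 100.0 * total / len(actual)
```

RMSE and MAE are in the units of the target, so they are only comparable between models fitted to the same company, while SMAPE is scale-free and can be compared across companies.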
Table 2 - correlations
Cross-corr table | Company_close | Mkt-RF | SMB | HML | RMW | CMA | libor_1M | risk_premium | usd_index_close | Bert_index
Company_close | 1.00 | 0.01 | -0.03 | -0.03 | -0.02 | -0.03 | 0.32 | -0.08 | 0.33 | -0.21
Mkt-RF | 0.01 | 1.00 | 0.24 | 0.34 | -0.40 | -0.19 | -0.05 | -0.01 | -0.00 | 0.19
SMB | -0.03 | 0.24 | 1.00 | 0.11 | -0.30 | 0.00 | -0.02 | 0.02 | -0.03 | 0.07
HML | -0.03 | 0.34 | 0.11 | 1.00 | -0.40 | 0.29 | -0.00 | -0.03 | -0.02 | 0.11
RMW | -0.02 | -0.40 | -0.30 | -0.40 | 1.00 | 0.03 | 0.04 | 0.01 | -0.01 | -0.08
CMA | -0.03 | -0.19 | 0.00 | 0.29 | 0.03 | 1.00 | 0.00 | -0.00 | -0.03 | -0.02
libor_1M | 0.32 | -0.05 | -0.02 | -0.00 | 0.04 | 0.00 | 1.00 | 0.10 | 0.17 | 0.08
risk_premium | -0.08 | -0.01 | 0.02 | -0.03 | 0.01 | -0.00 | 0.10 | 1.00 | 0.16 | 0.20
usd_index_close | 0.33 | -0.00 | -0.03 | -0.02 | -0.01 | -0.03 | 0.17 | 0.16 | 1.00 | -0.12
Bert_index | -0.21 | 0.19 | 0.07 | 0.11 | -0.08 | -0.02 | 0.08 | 0.20 | -0.12 | 1.00
We can see in Table 2 that Company_close has a strong (by market standards) relationship with our new sentiment index. The predictors from the portfolio models do not have significant values in the table. LIBOR and USDX also have a strong relationship with Company_close. In this research the strength of a relationship is measured by absolute value.
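Ranking predictors by the absolute value of their correlation with the target can be sketched as follows. The helper names are hypothetical, and the data passed in is expected to be a dictionary of equally long numeric columns; the standard Pearson coefficient is assumed, since the text does not name the correlation measure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_by_abs_corr(columns, target):
    """Sort the other columns by |correlation with target|, strongest first."""
    scores = {name: pearson(columns[target], values)
              for name, values in columns.items() if name != target}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Under this ranking a correlation of -0.21 counts as stronger than one of 0.01, which is how a negatively correlated sentiment index can still be treated as an informative predictor.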
Table 3
Variable |
Prob.of variable ... |