Social media sentiment application for financial market trading strategies
Abstract
This study tests the hypothesis that stock price movements on financial markets can be predicted using quantified human sentiment indices gathered from the Twitter social network via Python text parsing functions. Machine learning algorithms, trained on 2017-2018 stock prices and sentiment indices, failed to predict the capitalization changes of Apple Inc., Tesla Inc., and Walmart Inc.: the best result, a logistic regression, yielded 55% accuracy, and Granger causality tests showed no correlation between the human sentiment indices and the price movements of the stocks under consideration.
Introduction
Over the past decade, social media developed from a mere communication tool into the fabric of human sentiments and opinions, dictating crucial political decisions, changing the economies of whole countries, and influencing the lives of ordinary people more than ever before. Information became a source of power, and social media concentrate, define, and multiply that power. Today «Facebook Inc.» alone aggregates 2.3 billion unique monthly active users - 30% of the global population - 66% of whom use several of the company's products daily, for 58 minutes per day on average (https://blog.hootsuite.com/facebook-statistics/, https://www.statista.com/statistics/947869/facebook-product-mau/). At the same time, Twitter became a conduit of political will, with US president Donald J. Trump openly commenting on and defending his presidential decisions in front of the masses, allowing him to increase social pressure on the desired aspects of the agenda - such as the trade war with China (https://twitter.com/realDonaldTrump/status/969525362580484098) or the nuclear disarmament of North Korea (https://twitter.com/realDonaldTrump/status/1109143448634966020) - and to win political points.
It is always interesting to trace how social media power is created. Aside from politics, the sentiments of the masses often define economics, and this paper will demonstrate the ways in which human sentiments can predict and influence fluctuations on one of the most crucial economic institutions itself - the financial market. The practical part of the work will test whether extraction of abnormal financial gains on stock markets is possible via trading strategies based on aggregated evaluation of signals from social media, and will try to develop passive trading rules based on the dependence of historical stock prices on social media sentiment indices.
In the theory of the late 20th century, financial markets and stock prices were expected to follow the Efficient Market Hypothesis (EMH), developed by Eugene Fama in 1970 in his article «Efficient Capital Markets: A Review of Theory and Empirical Work», which suggests that the value of traded assets, stocks included, at a given time is just a proxy of human expectations regarding the assets' future performance, based on the available information. In other words, the theoretical framework implies that there cannot exist an option to gain abnormal risk-adjusted returns via stock trading without insider information. Moreover, the article further suggests that there is no way to prove that a stable profitable trading strategy exists, due to the joint hypothesis problem: in theory there is no way to distinguish whether the market is inefficient and a positive trading strategy based on some finite number of factors exists, or whether the model itself is incorrect because the very finiteness of the factors under consideration omits important roots.
In practice, however, within ten years of the Efficient Market Hypothesis formulation, in the 1980s Warren Buffett, leading Berkshire Hathaway, established himself as one of the leaders of the US financial market via investments and stock trading, with his net fortune around 350-380 million dollars by 1982 (https://dariusforoux.com/the-power-of-compounding/). Today his net fortune amounts to 86.6 billion dollars and keeps growing, making him one of the most influential men on Earth. The example of Warren Buffett is not unique, but far more important is that it suggests the EMH fails in at least some respects, and abnormal returns are indeed possible without insider information.
Behavioral psychology partially suggests the explanation: deviations of the stock market from the efficient state can be partially attributed to the cognitive biases that are intrinsic to human behavior. Humans tend to exaggerate, make mistakes, develop different interpretations of the same events, process the given information in various scenarios, overreact, and remain subject to emotions, stress, and external biological factors. Cognitive biases force some stock market participants to act irrationally and ignore undervalued stocks. Subsequently, consistently rational investors who are to a lesser extent subject to behavioral biases tend to outperform those who fail to withstand emotional pressure; these traders gain from value stocks that the market overlooked. These topics were extensively researched in «Judgment Under Uncertainty: Heuristics and Biases» (1982) by Daniel Kahneman and his colleagues (https://www.its.caltech.edu/~camerer/Ec101/JudgementUncertainty.pdf).
Even though the debates around market efficiency are still open, the existing evidence of at least partial incorrectness of the EMH is strong and so far remains unrefuted. But if behavioral psychology suggests that human cognitive bias is one of the major reasons outperforming traders beat the market, then algorithms that aggregate human sentiments should be unable to secure abnormal returns: the machine will collect the cognitive biases of the majority and lose to the minority able to withstand cognitive traps.
On the other hand, the market is governed by humans and their expectations. If the expectations of the majority, in terms of market value or market power, coincide regarding some asset's future price, the expectations become a reality. Therefore, a sufficiently large cognitive bias can become common market reality, and algorithms that succeed in aggregating and predicting those sentiments will benefit once the asset reflects the predicted prices.
It is especially interesting to see whether the latter or the former scenario dictates market behavior in a given case. This paper constructs several strategies that will test the possibility of earning positive abnormal risk-adjusted returns by simulating trading of real stocks of several well-known companies based on aggregated human sentiment. I will use word analysis of Twitter posts as a proxy for sentiment in social media and will try to construct indices that predict whether to sell or to buy stocks at a given time.
Twitter today carries more than half a billion posts a day, many of them highly emotional expressions of human sentiment (https://www.internetlivestats.com/twitter-statistics/). The goal is to find tweets that mention the desired company, evaluate the words they consist of, subdivide those words into groups associated with distinct levels of «bad» or «good» sentiment, and construct indices of human sentiment. Application of these indices as factor models that predict stock prices should produce a trading strategy whose performance will then be evaluated.
1. Literature review
For as long as financial markets have existed, the major goal of their players has been to predict market movements, exploit their fluctuations for economic profit, and control future risks and financial threats.
Eugene Fama, in his 1965 article «The Behavior of Stock-Market Prices», treats the goal of predicting future stock prices as basic for his research, claiming that stock prices follow a random walk, with each subsequent price level being a random draw from some common distribution. In practice this meant that using past stock price changes to predict future movements is of no value, as it cannot yield a satisfactory result. These concepts were further developed by E. Fama five years later in «Efficient Capital Markets: A Review of Theory and Empirical Work», where the concept of market efficiency receives its development: stock market prices are efficient in that they incorporate all available information. New information spreads quickly and becomes immediately incorporated into prices, meaning that neither technical nor fundamental analysis can produce a reliable prediction of future prices that secures a higher return for the trader than a randomly chosen portfolio of comparable risk.
The concept of market efficiency and the random-walk model of stock prices received further consideration in the following decades. The anomaly of negative stock returns at the beginning of the week was considered by Frank Cross in his 1973 work «The Behavior of Stock Prices on Fridays and Mondays». His main result points out a significant difference in average daily stock returns on Mondays compared to the preceding Fridays, as well as differing price distributions on those days. As a general rule, prices declined on Mondays following a rise on the previous trading days, which was regarded as evidence against the results of E. Fama's work on market efficiency.
Furthermore, Kenneth R. French, in his 1980 work «Stock returns and the weekend effect», found negative average Monday returns in the 1953-1977 daily returns of the S&P composite portfolio, compared to positive average returns on the other four days of the week.
Aside from the calendar effect, various other deviations from the efficient market hypothesis were evidenced during the late 20th century. The results of studies on mean reversion - the tendency of asset prices to revert to a long-run mean - were highly contradictory: J. Poterba and H. Summers failed to reject random-walk behavior of stock prices («Mean reversion in stock prices: evidence and implications», 1988), while showing positive autocorrelation of returns over short horizons and negative autocorrelation over long ones.
Nowadays there are at least six major known anomalies, besides the calendar effect, that persist on the global financial market and go against the EMH:
1. «Size effect» - firms with lower capitalization are likely to outperform industry leaders with higher capitalization;
2. «New Year effect» - stocks that underperformed in the fourth quarter of the prior year tend to outperform the market in January;
3. «Low Book Value effect» - stocks with lower-than-average price-to-book ratios demonstrate higher returns than a random portfolio would predict;
4. «Omitted Securities effect» - less liquid stocks with lower trading volume that investors overlook tend to outperform once traders subject them to closer analysis, allowing the first investors to secure profits;
5. «Reversals» - the firm's business cycle is transferred onto stock prices as a cycle of performance swinging from highest to lowest and back, allowing investors to anticipate declines after price rises and rises after declines;
6. «Dogs of the Dow» - a profitable trading strategy built by selecting a portfolio from the Dow Jones Industrial Average with certain features, such as the securities with the highest yield over the horizon under consideration or, equivalently, several securities with the lowest absolute stock price.
Empirical tests suggest that applying these phenomena as trading strategies has produced positive abnormal risk-adjusted returns, providing evidence against the EMH. Further critique of the EMH was summarized by Burton G. Malkiel in his 2003 work «The Efficient Market Hypothesis and Its Critics».
The global debate over the EMH and the evidence against it is still open, and this paper does not intend to provide the final argument that resolves it. The goal of this work is rather to find and exploit a passive trading strategy that will, despite the evidence for the EMH, earn abnormal returns by exploiting human behavioral cognitive biases that the EMH leaves unaccounted for. Extending the evidence base against market efficiency remains a secondary goal.
Cognitive biases are systematic behavioral delusions that are common to humans and result in wrong assessments of event likelihood and subsequent irrational behavior. They are widely considered by Daniel Kahneman and Amos Tversky in «Judgment Under Uncertainty: Heuristics and Biases» (1982). The authors identify three heuristics as the main sources of cognitive biases:
1. Availability - estimating probabilities via verbal assessment based on one's own experience and the past events that come to mind;
2. Anchoring - assessing values by starting from an initial value (the anchor) and adjusting it based on personal beliefs and experience;
3. Representativeness - insensitivity, during evaluation, to prior probabilities, sample size, and the predictability of the events under assessment; the illusion of validity; misconceptions of regression and chance, etc.
Purely biological factors are intrinsic to humans, and traders are no exception, so the effect of cognitive biases on trading performance is natural. The financial market assumes money bets, which in turn feeds the propensity towards irrationality, with loss aversion, lookback tendency, recency bias, and anchoring bias being only the major sources of irrational cognitive behavior.
In this regard, an attempt to exploit human cognitive biases in a trading strategy seems promising when it draws on human sentiments from social networks - places where humans tend to react and communicate in the most impulsive way. Quite a few papers review the application of human sentiment to trading. This paper applies the experience of Wenbin Zhang and Steven Skiena («Trading Strategies to Exploit Blog and News Sentiment», 2010), who arrived at a trading strategy generating positive and relatively stable returns by performing sentiment analysis on blogs and news using a natural language processing (NLP) system.
Additionally, in «Stock Movement Prediction from Tweets and Historical Prices» (2018), Yumo Xu and Shay Cohen introduced several prediction models based on Twitter data and technical analysis, reaching an accuracy of 58.23% and demonstrating the effectiveness of a deep generative approach to stock price prediction based on social media data.
Especially relevant for this research is the work of Alessandra Cretarola, Gianna Figà-Talamanca, and Marco Patacca, «A sentiment-based model for the bitcoin: theory, estimation and option pricing» (2017). The researchers aggregated «Google» search engine output as a sentiment approximation to predict variations in Bitcoin capitalization, showing that the market behavior of the cryptocurrency is closer to a volatile stock asset than to conventional currencies.
A major reference role for the current research belongs to «Stock prediction using Twitter sentiment analysis» (2010) by Anshul Mittal and Arpit Goel and «Sentiment Analysis of Twitter Data for Predicting Stock Market Movements» (2016) by V. Pagolu, K. Challa, and G. Panda.
A. Mittal and A. Goel introduced advanced neural networks over tweets, yielding more than 75% accuracy in predicting market price fluctuations; the high accuracy is achieved through advanced neural networks and a newly developed data validation method. However, that accuracy had previously been surpassed by J. Bollen, H. Mao, and Xiao-Jun Zeng in «Twitter mood predicts the stock market» (2010), where a groundbreaking 87% accuracy is reached by collecting the same DJIA data and searching for correlation with the emotional sentiment of Twitter posts in the preceding periods. A similar cross-validation technique for the gathered data is used, along with a self-organizing fuzzy neural network (SOFNN). The mood evaluation criteria subdivided words between six cognitive dimensions, allowing extremely accurate evaluation of the sentiment in each given post, which, coupled with advanced machine learning techniques, explains the high accuracy rates. The techniques and results of these studies were the major references for the practical trading strategy created in this study.
Separate acknowledgement must be granted to the paper «Cryptocurrency market efficiency analysis based on social media sentiment» (2018) and in particular to its author, Oleg Kheyfets, my advisor on important references, on papers about machine learning techniques, and on computing resources for parsing texts from Twitter. In his paper Oleg solves the problem of Bitcoin capitalization prediction with a successful application of logistic regression, random forest, and Yandex CatBoost models, which find their application in this paper as well.
Among minor references, the works of David Garcia and Frank Schweitzer («Social signals and algorithmic trading of Bitcoin», Royal Society Open Science, 2015) and of Tianyu Ray Li, Anup S. Chamrajnagar, Xander R. Fong, Nicholas R. Rizik, and Feng Fu («Sentiment-Based Prediction of Alternative Cryptocurrency Price Fluctuations Using Gradient Boosting Tree Model», 2018) were used. These papers supplied general sentiment analysis techniques for the present paper, along with examples of gradient boosting machine learning models, which were applied in the current study. The scientists reviewed the predictability of cryptocurrency capitalization, taking Twitter posts as the major data source.
In conclusion, debates around the Efficient Market Hypothesis (EMH) across various papers show contradictory results, with no final point in the dispute being settled. Empirical evidence of multiple successful trading careers and of numerous stock fluctuation anomalies on financial markets implies that the generally accepted theory omits at least several limitations. Some behavioral specialists believe that cognitive biases may influence the human sentiments that drive market deviations from the theoretical framework. Advances in machine learning have allowed researchers in recent years to aggregate data on human sentiments from various online sources and construct predictive algorithms for stock price movements that yielded a groundbreaking accuracy of 70-80%. This study intends to add scientific value to the market efficiency dispute by extending the empirical evidence on stock price prediction, using advanced machine learning techniques to conduct human sentiment analysis as an explanatory factor for stock price movements.
Keywords: social media, financial trading
2. Problem formulation
Problem
This work aims to establish correlations between the stock price movements of several leading United States companies and quantified human sentiment gathered from social media datasets directly associated with these companies, and to expand the empirical evidence on financial market predictability for the further scientific dispute on the efficient market hypothesis.
Motivation
In terms of motivation, this work is driven by several factors:
1. Self-education
Completion of the desired work scope demands application of Python packages for data aggregation, parsing, training of machine learning models, and evaluation of statistics along with hypothesis testing. As part of personal preparation, courses were completed on general Python programming and on Python for data science at DataCamp https://www.datacamp.com/ and Codecademy https://www.codecademy.com/learn/learn-python. Additionally, a half-year offline course on machine learning at «Netology» https://netology.ru/programs was finished. This background made it possible to gain experience in modern analysis techniques and to satisfy a personal desire for in-demand knowledge. The work is intended as a practical application of the knowledge accumulated during the bachelor program combined with the results of self-education.
2. Scientific evidence
This work extends the scientific evidence on the predictability of stock price movements based on social media sentiment, adding more ground for the global dispute on financial market efficiency and the behavioral aspects of global economics. As scientific thought develops, this work may serve at least as a partial reference.
3. Professional development
Extensive knowledge and practical experience with modern analytical techniques remain a distinct professional advantage. This trend will certainly continue, and this work is intended to establish myself as a capable professional in the field of advanced analytics in economics. Moreover, my daily professional routine assumes the use of modern analytical tools, proficiency in which supports superior career paths and increased value for the employer.
4. Commercial interests
Successful creation of an autonomous, stable, and sufficiently accurate trading strategy opens possibilities for commercial trading on real financial markets for real economic profit. Development of an online trading bot for high-frequency trading can be seen as a further development of this topic.
Contribution
The strictly scientific contribution of this paper lies mainly in the further development of the current economic background in the aspects of market efficiency and stock market prediction, along with the advancement of modern machine learning techniques in economic analysis. The overall contribution of the work can be subdivided into five major areas:
1. Data gathering tool
The paper develops a universal data gathering algorithm that can be reused for academic research in practically every branch of scientific interest that involves online platforms as sources of information. The algorithm is not limited to parsing Twitter posts and can easily be adapted for gathering data from other social networks, which can, for example, substantially ease academic research for students.
2. Market efficiency dispute
The results of this research do not claim to resolve the global dispute on the efficient market hypothesis, given the selectivity of the applied research methods. However, the paper provides a source of additional evidence on financial market predictability and points out market efficiency anomalies present even on established modern stock markets.
3. Study of financial trading
The final goal of this research is the establishment of a high-frequency trading program. The results of this survey may be used by future researchers for testing various high-frequency trading hypotheses and for further study of financial markets, via direct application of the research methods and algorithms and through exploitation of the existing outputs for in-depth interpretation of their own results.
4. Machine learning application
The completed analysis further expands the fields of machine learning application in economics and finance and sets a precedent for revisiting earlier theoretical results with modern analytical tools from the machine learning spectrum, the results of which may change modern economic thought.
5. Popularization of data science
Hopefully, the results of this survey will inspire future generations of students to learn data science and modern analytical techniques and will increase the analytical power of future research papers, developing the global scientific community.
In conclusion, the goal of this paper is to test the correlation of selected financial assets with human sentiment evaluated by machine learning text processing techniques trained on Twitter posts, extending the empirical evidence on financial market predictability and developing ground for the global efficient market hypothesis dispute.
3. Methodology
This work comprises multiple steps to construct a stock price movement prediction model based on quantitative evaluation of human sentiment from the texts of Twitter posts. In particular:
1. Choice of stocks
This paper uses stock prices of three extremely popular companies on the US financial market, to make sure that the first attempts at predicting stock movements are sufficiently backed by social media attention and, therefore, by data for sentiment analysis. Based on this, stocks of Apple Inc. (NASDAQ: AAPL) https://www.apple.com/, Tesla Inc. (NASDAQ: TSLA) https://www.tesla.com/ and Walmart Inc. (NYSE: WMT) https://www.walmart.com/ were chosen, as these companies produce goods and services that are widely known and whose performance attracts mass attention in the social media environment.
2. Stock prices aggregation
Data on these stocks was collected from the financial sites «investing.com» https://ru.investing.com/equities/tesla-motors-historical-data and «finam.ru» https://www.finam.ru/profile for the period 31.12.2016 - 31.12.2018, i.e. two complete consecutive years. The two-year period was chosen to guarantee enough data for the prediction models, and complete years were taken to account equally for seasonal effects in stock price fluctuations. Closing prices were collected for each trading day in the period under consideration.
3. Twitter text collection
For gathering Twitter posts, a Python algorithm was applied. The «Pandas» and «Selenium» libraries allowed tweets to be gathered by a pool of keywords, including «Apple», «Tesla», and «Walmart», directly from the Twitter website for the period under consideration, and the found posts were then aggregated into a single research corpus.
4. Twitter data normalization
To make the textual data from Twitter workable, rare words and «stop» words are removed. The remaining word pool is then structured, ordered, and assigned differentiating weights depending on the importance of a particular word in each post under consideration for a given company.
5. NLP model training
At this stage, a Natural Language Processing (NLP) model is created using Python libraries. The model is trained on the weights of the obtained word pool to predict the binary change in stock price, i.e. an increase or decrease of the asset price relative to the previous day's closing value. The accuracy of the obtained social media sentiment models is evaluated and the best-fitting variant is chosen.
6. Correlation tests
Time series of the obtained social media indices and of actual stock price movements are formed for the three chosen companies, and the dependence between these variables is assessed via the Granger causality test.
7. Drawing conclusions
The results of the empirical tests are interpreted in accordance with conventional statistical norms and standards.
Finally, the study attempts to establish a trading strategy based on the social sentiment indices and to test its effectiveness.
4. Data
Gathering data on stock price movements is of no scientific novelty and is omitted in this section; a sufficient explanation of how the stock price values were obtained and processed is provided in the methodology section (Section 3).
The process of searching for relevant data in social networks is widely covered in DataCamp online educational resources and is in some cases predetermined by the source code of the sites, Twitter included. Based on the algorithmic solutions provided in the DataCamp courses «Importing data in Python, part 1 & 2», an algorithm was developed that scrolls through Twitter posts with predetermined hashtags at given dates and extracts their text, taking initial inputs such as the date and search keywords from the relevant script blocks. For the period under consideration (31.12.2016 - 31.12.2018) the algorithm collected 436,390 Twitter posts.
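A minimal sketch of such a scrolling collector is shown below. It assumes Selenium with a Chrome driver; the search URL pattern and the CSS selector are illustrative placeholders, not the exact markup used in the study, since Twitter's page structure changes over time.

```python
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By


def collect_tweets(query, since, until, max_scrolls=50):
    """Scroll a Twitter search page and harvest post texts."""
    driver = webdriver.Chrome()
    driver.get(f"https://twitter.com/search?q={query}"
               f"%20since%3A{since}%20until%3A{until}&f=live")
    texts = set()
    for _ in range(max_scrolls):
        # 'div.tweet-text' is a placeholder selector; the live page
        # markup must be inspected and substituted here.
        for el in driver.find_elements(By.CSS_SELECTOR, "div.tweet-text"):
            texts.add(el.text)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give newly loaded posts time to render
    driver.quit()
    return pd.DataFrame({"text": sorted(texts)})


corpus = collect_tweets("Tesla", "2016-12-31", "2018-12-31")
print(len(corpus))
```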
As illustrated by Chart 1, the data aggregation algorithm delivered a relatively stable collection of posts for each month, with minor deviations within 15-22 thousand tweets per month: December 2018 represents the absolute minimum with 15.1 thousand tweets, and September 2017 the absolute maximum with 22.1 thousand. The scope of this survey does not include searching for the reasons behind these deviations, which may vary, since human activity on Twitter is driven by countless factors.
The next important step was to analyze tweet length by the number of characters. An average tweet is 100 to 150 characters long; 140 characters was the hard limit until late 2017, when the maximum post length was doubled to 280. The data from the graph supports this evidence. However, for some reason, the set contained several tweets whose content length exceeded the 2018 maximum; those tweets were removed from the analysis.
Further steps involved preparing the dataset for NLP-model training. At this stage the tweets were the selected posts, within the date and length ranges, containing the required hashtags. To prepare the data for machine learning, tweets were split into the words and objects they contained. Pictures, links, GIFs, and other non-textual content were removed. Moreover, a list of English interjections was aggregated, and the words from it were removed from the text set too, as they bear little to no expressive power. Finally, a lemmatization procedure was applied. Lemmatization, or word normalization, is a procedure that reduces words to their base dictionary form. This step is essential for the subsequent machine learning, as it creates a uniform data environment for the models. As a result, words lost their plural forms and verbs their inflectional suffixes: e.g., «cars» becomes «car» and «selling» becomes «sell». In Python, these functions are available through the Natural Language Toolkit (NLTK) library; guidance on NLTK is available in DataCamp tutorials.
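The sketch below illustrates this normalization pipeline with NLTK. The regular expressions and the exact filtering rules are illustrative choices, not necessarily those used in the study.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")   # one-time downloads of the word lists
nltk.download("wordnet")

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def normalize(tweet):
    """Lower-case, strip links/mentions, drop stop words, lemmatize."""
    text = re.sub(r"http\S+|@\w+|#", " ", tweet.lower())
    tokens = re.findall(r"[a-z]+", text)   # keep alphabetic tokens only
    return [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP]


print(normalize("Apple's new stores are opening! https://t.co/xyz"))
# plural nouns collapse to their base form: 'stores' -> 'store'
```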
At this stage, the obtained list of words cannot yet be interpreted by algorithms, as there is no quantified measure of word relevance. The procedure called TF-IDF solves this problem. It is a statistical measure that quantifies the importance of each word in a given document within a collection of documents: a word's weight grows the more frequently it appears in the document and shrinks the more frequently it appears across the whole collection. In our case a single tweet represents a document. Multiple variations of the TF-IDF measure exist; this study uses the following version:
1. «TF» stands for «Term Frequency» and counts a given term's appearances in the post, treating all terms under consideration as equally important:
TF(i) = (Number of times term «i» appears in a document) / (Total number of terms in the document)
2. «IDF» stands for «Inverse Document Frequency» and measures the importance of a term: weights of more frequent terms are decreased, while, on the contrary, rare terms are scaled up:
IDF(i) = log(Total # of tweets / # of tweets with term «i»)
The implementation of TF-IDF techniques is widely covered in «Introduction to Information Retrieval» (2008) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Practical application in Python is possible via open TF-IDF implementations.
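As one such open implementation, scikit-learn's TfidfVectorizer reproduces this weighting scheme (with smoothing applied by default); a small sketch on toy one-line "documents":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["apple stock rise strong earnings",
          "apple fall weak iphone demand",
          "tesla record delivery number"]

vectorizer = TfidfVectorizer()            # tf * idf, smoothed by default
X = vectorizer.fit_transform(tweets)      # sparse (tweets x terms) matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))               # per-tweet weight of every term
```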
Finally, the tweet data was split into two distinct parts: words from the first 19 months of Twitter data were used for training and tuning the machine learning models, whereas the last 5 months of the sample were reserved for testing. Splitting the time horizon under consideration in an approximately 4:1 train-to-test proportion is widely accepted as good practice in applied machine learning.
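Because the data is a time series, the split must be chronological rather than random; a minimal sketch, assuming a DataFrame with a datetime column:

```python
import pandas as pd

# Toy frame standing in for the dated tweet dataset.
df = pd.DataFrame({"date": pd.date_range("2017-01-01", "2018-12-31", freq="D")})
cutoff = pd.Timestamp("2018-08-01")       # end of month 19 of 24

train = df[df["date"] < cutoff]           # first 19 months: training
test = df[df["date"] >= cutoff]           # last 5 months: testing
print(len(train), len(test))
```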
5. Modelling
Setup
For the goal of trading strategy development, stock price fluctuations may be considered as a sequence of rises and falls in an asset's price relative to the previous period. In our case, the binary choice is whether the stock price at a given day's close is higher or lower than the previous day's close. This study uses two models that have proved effective in earlier work («Stock prediction using Twitter sentiment analysis» (2010) by Anshul Mittal and Arpit Goel, achieving 75% accuracy; «Twitter mood predicts the stock market» (2010) by J. Bollen, H. Mao and Xiao-Jun Zeng, achieving 87% accuracy; and «Cryptocurrency market efficiency analysis based on social media sentiment» (2018) by Oleg Kheyfets, achieving 70% accuracy): conventional logistic regression and random forest. The setup is identical for both models:
1. The binary dependent variable «Y» takes two distinct values: 1 if the price of the given asset has risen from the previous day's closing mark, and 0 otherwise.
2. The set of explanatory variables «X(i)» is represented by words, with the TF-IDF procedure assigning each word a value in each tweet according to its quantified local importance.
The obtained matrix comprises millions of entries, and the goal of the NLP models is to demonstrate the highest precision on this set.
Logistic regression
For the purposes of this study, logistic regression is taken as the baseline due to its extensive modern application in scientific work. The model has several forms, but for stock price prediction we use simple logistic regression with a binary categorical response. Its advantage over random forests and artificial neural networks is that it outputs both a relevance measure for each predictor and its direction of association, whereas the other models only point out the more important predictors with no direction of association. In Python, logistic regression is available via the sklearn.linear_model module.
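A toy sketch of this baseline follows; the texts and the 0/1 labels are illustrative placeholders, not data from the study:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["strong earnings beat", "weak demand miss",
         "record deliveries", "recall and lawsuit"]   # toy tweets
y = [1, 0, 1, 0]                                      # 1 = next-day price rise

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))   # in-sample predictions on the toy set

# coef_ carries both strength and sign: positive weights mark
# "bullish" terms, negative weights "bearish" ones.
print(dict(zip(vec.get_feature_names_out(), clf.coef_[0].round(2))))
```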
Random forest
By definition, random forests (random decision trees) are a machine learning method for classification and regression tasks that constructs multiple decision trees on the training data and outputs the class selected by the majority of trees (for classification) or the mean prediction of the individual trees (for regression).
A decision tree (also called a classification or regression tree) is a decision support tool used in statistics and data analysis for predictive modelling.
A random forest structure comprises two object types:
1. «Leaves», which contain values of the objective function.
2. «Branches» or «stems», which comprise explanatory variables and split to differentiate objective function values.
Each route from the root to a certain leaf classifies a scenario with a distinct value.
The final product of multiple iterations is a model that predicts the value of the target variable based on several input variables. Appendix chart #1 illustrates a simplified random forest scheme.
Spheres of random forest application spread well beyond regression analysis. The procedure is known for yielding highly accurate results thanks to the multiple decision trees involved in the decision-making process. However, random forests often suffer from overfitting: overly complicated trees fit extremely irregular patterns, reducing bias at the cost of high variance.
At the same time, the model often requires time-consuming operations, with each and every decision tree in the scheme voting on each decision, and the results of the ensemble voting are often hard to interpret.
Application of a random forest classification model in Python is possible via the open sklearn.ensemble module. Tutorials on the scikit-learn modules used here are available at the DataCamp online portal.
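A matching toy sketch, using the same illustrative data as the logistic regression example above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["strong earnings beat", "weak demand miss",
         "record deliveries", "recall and lawsuit"]
y = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# Fewer or shallower trees run faster but, as noted above, cost accuracy.
forest = RandomForestClassifier(n_estimators=100, max_depth=10,
                                random_state=0).fit(X, y)
print(forest.predict(X))

# feature_importances_ ranks terms, but gives no direction of association.
print(forest.feature_importances_.round(2))
```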
Precision evaluation
The results of the classification models are evaluated using a three-part metric, which is common for the endpoint performance assessment of classifiers. In particular, this metric comprises:
1. Classification accuracy
This is the conventional definition of accuracy: the percentage of correct predictions among the overall number of predictions. It is a basic metric that is frequently misinterpreted as high absolute accuracy: it is sensitive to the number of samples in each class and overestimates absolute accuracy when false-positive and/or false-negative results carry a high cost.
2. Logarithmic loss
Frequently abbreviated as Log-Loss, this function penalizes estimation models for false classifications. For N samples belonging to M classes, the logarithmic loss follows the formula:
LogLoss = -(1/N) * Σ(i=1..N) Σ(j=1..M) Y(ij) * log(P(ij)),
where Y(ij) indicates whether sample i belongs to class j, and P(ij) is the predicted probability of sample i belonging to class j. The absolute accuracy of the classifier is greater for lower values of the Log-Loss function.
3. F1-score
This metric is the harmonic mean of precision and recall. Precision is the number of correct positive results divided by the number of positive results predicted by the classifier, whereas recall is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The goal of the classifier is to maximize the F1-score, which mathematically is expressed as:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
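All three metrics are available in scikit-learn; a short sketch on toy predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, log_loss

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]             # hard class labels from the model
y_prob = [0.8, 0.3, 0.4, 0.9, 0.6]   # predicted P(price rises)

print(accuracy_score(y_true, y_pred))   # share of correct predictions
print(log_loss(y_true, y_prob))         # penalizes confident misclassifications
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```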
6. Empirical evidence
Classification results
The classification modelling produced the following metrics:
Logistic regression:
1. Accuracy: 55.54%
2. Log loss: 15.43
3. F1-measure: 0.50
Random forest:
1. Accuracy: 44.29%
2. Log loss: 19.24
3. F1-measure: 0.44
The results of the random forest technique are worse than those of the conventional logistic regression. Close examination of the random forest output suggests model overfitting on the training set, a common problem for random forest procedures. During model training, insufficiency of computational resources forced me to decrease the tree depth and the number of decision trees in order to complete the calculations. As mentioned earlier, the random forest technique is demanding in terms of time and computational resources. Further steps must be to increase the number of decision trees and to dramatically reduce the overall vocabulary, which at the moment of the study comprised 256,322 unique words. Evaluating each word made the process extremely time consuming, which in turn forced the reduction in the number of trees and subsequently led to model overfitting.
On the other hand, the accuracy of the logistic regression is close to the results of Yumo Xu and Shay B. Cohen (2018), whose stock prediction study reached 57-58% accuracy, which can be seen as a satisfactory result.
Since no additional or more powerful computational resources were available, I was unable to improve the random forest accuracy and proceeded with logistic regression for the further analysis. The logistic regression yielded significant results under the likelihood-ratio test, with a p-value tending to zero; the significance of the model was therefore assumed.
Time series
We now proceed to the prediction of stock movements. The study of asset price correlation is subdivided among the time series of the three companies under consideration: Apple, Tesla, and Walmart. For each company, stock data is aggregated along with the Twitter word span in the form of time series, and a separate logistic regression is applied per company. Sentiment is evaluated by the formula: Sentiment = (# of positively labeled tweets) / (total # of tweets for the given day). A similar technique was applied in «Trading Strategies to Exploit Blog and News Sentiment» by Zhang and Skiena and in «Cryptocurrency market efficiency analysis based on social media sentiment» by Oleg Kheyfets, which makes it common practice.
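A minimal sketch of this daily index, assuming a DataFrame with one row per tweet and the classifier's binary label:

```python
import pandas as pd

# Toy frame: 'label' is the classifier's output (1 = positively labeled tweet).
tweets = pd.DataFrame({
    "date":  ["2018-09-03", "2018-09-03", "2018-09-03", "2018-09-04"],
    "label": [1, 0, 1, 1],
})

# Positive share per day, i.e. {# positive tweets} / {total tweets that day}.
sentiment = tweets.groupby("date")["label"].mean()
print(sentiment)   # 2018-09-03 -> 0.6667, 2018-09-04 -> 1.0
```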
Since three companies are analyzed simultaneously, I will describe the general approach first and then proceed with the calculations for each company's time series.
In general, the challenge is to test the time series of a given company's price and of its sentiment for stationarity, and to modify them where non-stationarity is present. Therefore, for each company we evaluate two time series, one for price and one for sentiment, replicating the following steps (a Python sketch of the pipeline follows the list):
1. Plot the autocorrelation and partial autocorrelation functions for both the price and the sentiment series, to get a basic understanding of the degree of stationarity.
2. Interpret information criteria for both series to construct the best process representation, using the Akaike information criterion (AIC) and the Bayesian (Schwarz) information criterion (BIC).
3. Conduct the Augmented Dickey-Fuller (ADF) stationarity test with the parameters obtained from the information criteria, to reach a final verdict on stationarity for both series, or bring a process to stationarity via differencing and lag application.
4. For the two resulting stationary series of a single company, on price and on sentiment, apply the Granger causality test to quantify the Granger-causality metrics and assess the dependency between the two adjusted series.
5. Draw and summarize conclusions on the results.
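The sketch below walks through steps 1-3 with statsmodels on a toy random-walk series; the candidate model orders mirror those compared in the tables that follow.

```python
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))   # toy random walk (non-stationary)

plot_acf(series)                           # step 1: visual inspection
plot_pacf(series)

for order in [(1, 0, 0), (2, 0, 0), (1, 0, 1)]:   # step 2: AR(1), AR(2), ARMA(1,1)
    fit = ARIMA(series, order=order).fit()
    print(order, round(fit.aic, 1), round(fit.bic, 1))

stat, pvalue = adfuller(series)[:2]        # step 3: ADF test
print(stat, pvalue)
if pvalue > 0.05:                          # unit root not rejected:
    print(adfuller(np.diff(series))[:2])   # difference once and retest
```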
Apple sentiment series
We start with the Apple sentiment time series. Information from the graph alone is not enough to select a specification, so candidate process representations are compared:
Chart # 3: Autocorrelation and partial autocorrelation functions for Apple sentiment time series
In accordance with the AIC and BIC criteria, the AR(1) model shows the lowest information criteria and is the best process representation for Apple sentiment.
Model (Apple sent.) | AIC | BIC
AR(1) | -295.167 | -287.947
AR(2) | -294.431 | -284.803
ARMA(1,1) | -294.269 | -284.642
The Augmented Dickey-Fuller (ADF) test yields a t-statistic of -6.89 with a p-value of 1.36e-09, which is vanishingly close to zero, allowing us to reject the null hypothesis and conclude that the series is stationary.
Apple stock price series
The Apple price time series, however, exhibited stationary behavior only after differencing.
Chart # 4: Apple stock price dynamics in USD/stock for 09-12.2018:
Chart # 5: Autocorrelation and partial autocorrelation functions for Apple stock price time series
The autocorrelation and partial autocorrelation graphs suggest AR(1) behavior, which is further confirmed by the information criteria.
Model (Apple price) | AIC | BIC
AR(1) | 493.523 | 500.743
AR(2) | 495.409 | 505.037
ARMA(1,1) | 495.377 | 505.004
However, the ADF test based on the assumed specification produced a non-stationary result, with a t-statistic of -0.3001 and a p-value of 0.925. After first differencing, the ADF test yielded a t-statistic of -7.5429 and a p-value of 3.34e-11, which allowed the null hypothesis to be rejected and the lag-1 differenced stock price to be claimed stationary.
Tesla sentiment series
The study then proceeds with the Tesla sentiment time series, stationarity of which was obtained with a 2-lag specification.
Chart # 6: Tesla sentiment index dynamics for 09-12.2018:
Chart # 7: Autocorrelation and partial autocorrelation functions for Tesla sentiment time series
For these series, moving average models demonstrated a better fit. Overall, the MA(1) model yielded the lowest information criteria, making it the best representation of the series.
Model (Tesla sent.) | AIC | BIC
MA(1) | -223.456 | -216.235
MA(2) | -222.074 | -212.447
AR(2) | -222.418 | -212.791
At the same time, the ADF test rejected the null hypothesis only in the 2-lag specification: the t-statistic and p-value for the 2-lag ADF test were -6.3392 and 2.78e-16 respectively, confirming stationarity.
Tesla stock price series
For the Tesla stock price series, only the differenced 2-lag ADF test confirmed stationarity.
Chart # 8: Tesla stock price dynamics in USD/stock for 09-12.2018:
Chart # 9: Autocorrelation and partial autocorrelation functions for Tesla stock time series
The autocorrelation and partial autocorrelation graphs suggest an AR process as the best fit for the Tesla stock price, which the data confirms: AR(2) is chosen by the information criteria.
Model (Tesla price) | AIC | BIC
AR(2) | 654.518 | 664.145
ARMA(2,1) | 655.386 | 667.420
ARMA(1,2) | 655.538 | 667.572
The simple 2-lag ADF test failed to reject the null hypothesis (t-statistic = -6.339, p-value = 2.781), which is why the differenced ADF test was applied; it successfully confirmed stationarity with a p-value of 9.26e-22, effectively zero.
Walmart sentiment series
Finally, time series on Walmart have been considered.
Chart # 10: Walmart sentiment index dynamics for 09-12.2018:
Chart # 11: Autocorrelation and partial autocorrelation functions for Walmart sentiment series:
For these series, low-order models demonstrated the better fit. Overall, the AR(0) model yielded the lowest information criteria, making it the best representation of the series.
Model (WMT sent.) | AIC | BIC
AR(0) | -237.580 | -232.766
ARMA(1,1) | -237.190 | -227.563
AR(1) | -236.968 | -229.748
The 2-lag ADF test resulted in a t-statistic of -7.786 and a p-value of 8.17e-12, confirming stationarity via rejection of the null hypothesis.
Walmart price series
Finally, the Walmart price series has been considered; its stationarity was confirmed only by the differenced 2-lag ADF test.
Chart # 12: Walmart stock price dynamics in USD/stock for 09-12.2018:
Chart # 13: Autocorrelation and partial autocorrelation functions for Walmart price series
The calculated information criteria suggest that ARMA(2,1) is the best fit for the process.
Model (WMT price) | AIC | BIC
ARMA(2,1) | 284.162 | 292.196
ARMA(2,2) | 285.956 | 300.396
AR(1) | 285.644 | 296.864
The simple 2-lag ADF stationarity test failed to reject the null hypothesis, with a t-statistic of -1.533 and a p-value of 0.516, whereas the differenced 2-lag ADF test rejected it (t-statistic = -8.717, p-value = 3.46e-14).
Granger causality test. Theory
To answer the question of statistical dependence between the time series data sets, the Granger causality test was applied. The test is widely used by econometricians to measure whether one time series helps predict future values of another, thereby creating a proxy for causality.
For stationary time series Y and X, the null hypothesis is that X does not Granger-cause Y.
To test this, the proper lagged values of Y to include in a univariate autoregression of Y are found:
Yt = b0 + b1*Yt-1 + b2*Yt-2 + b3*Yt-3 + ... + bn*Yt-n + e(t)
After that, the autoregression is augmented by including lagged values of X:
Yt = b0 + b1*Yt-1 + b2*Yt-2 + ... + bn*Yt-n + b'p*Xt-p + ... + b'q*Xt-q + e(t),
where p is the shortest and q is the longest lag length for which the lagged value of X is significant.
One retains in this regression all lagged values of X that are individually significant according to their t-statistics, provided that collectively they add explanatory power to the regression according to an F-test (whose null hypothesis is that the X terms jointly add no explanatory power). The null hypothesis is not rejected if and only if no lagged values of X are retained in the regression.
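A minimal sketch of this test with statsmodels; the two series below are random stand-ins for the differenced price and the sentiment index:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
data = pd.DataFrame({
    "price_diff": rng.normal(size=400),   # stand-in for the differenced price
    "sentiment":  rng.normal(size=400),   # stand-in for the sentiment index
})

# H0 at each lag: the second column does not Granger-cause the first.
results = grangercausalitytests(data[["price_diff", "sentiment"]], maxlag=12)
# Each lag reports an F-test; p-values above 0.05 fail to reject H0,
# the outcome reported for all three stocks in the table below.
```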
Practice
The practical results of the Granger causality test suggest that the chosen approach to stock movement prediction via evaluation of social media sentiment, and/or the applied techniques, were insufficient to establish the desired correlations.
VAR models were constructed for up to 12 lags, and the Granger test was conducted on their basis. The statistical results are aggregated in the following table:
Stock | Best-fit VAR model lags | F-statistic | P-value
Apple | 1 | 0.6792 | 0.4124
Tesla | 6 | 0.2662 | 0.9506
Walmart | 6 | 0.2662 | 0.9506
As the table shows, all three Granger causality tests failed to reject the null hypothesis that sentiment does not Granger-cause price movements in the modified time series. It is therefore assumed that no correlation between the obtained stock movements and the quantified Twitter sentiments was established. In other words, this particular study, with its limitations, selective methods, and restricted sample structure, found no predictive power of social sentiment over stock price movements. To construct a trading strategy based on human sentiment, the approach and the limitations of the study would have to be different.
Conclusion
This research brings together available knowledge on extracting and evaluating human sentiment and on building stock price movement prediction models on its basis. First, historical stock data was gathered from online trading platforms for two complete consecutive years, 2017-2018. A purpose-built Python algorithm then parsed 436,390 Twitter posts directly from the social network website and structured the obtained data. After that, the pool of posts was cleaned of non-informative objects such as pictures and GIFs; words with little to no expressive power - English stop words, punctuation, prepositions, and pronouns - were also excluded from the text span. The obtained set of 256,322 words was normalized and weighted according to relative relevance in each particular tweet under consideration. The final span was subdivided into separate training and testing parts. Based on the training results, two prediction models (a random forest and a logistic regression) were tested for predicting the chosen companies' stock movements, with their accuracy and precision evaluated. Logistic regression achieved the better result, with 55% accuracy, and was chosen to create the social media sentiment index.
The actual tests of the obtained models used three pairs of time series: the prices of each of the three companies under consideration and the sentiment indices corresponding to each of them. Each time series was modified to meet the stationarity requirements, and the Granger causality test was applied to assess the degree of correlation; the results failed to reject the hypothesis of no correlation for each of the time series under consideration.