Stock Market Forecasting Model From Multi News Data Source Using a Two-Level Learning Algorithm

Stock prediction holds the attention of a large part of the community. The emergence of new indicators, mostly extracted from the web, makes this research domain challenging and in continuous evolution. The present work addresses the question of how to model financial news from multiple data sources for the purpose of forecasting stock movement. We combined different news sources to enhance the accuracy of stock movement prediction. Data are collected from four financial news websites and processed individually by a Support Vector Machine (SVM) algorithm; the outputs are then aggregated using an Artificial Neural Network (ANN). Experiments were conducted, and the results show that the designed two-level SVC&ANN learning algorithm achieves better accuracy than simple news analysis models using a single information source.


Introduction
Since the correct prediction of market movement is rewarding for investors and traders, they are in permanent search of new models, systems and indicators. Research evolved from simple historical price analysis to volume analysis, financial result reports, technical indicators, event analysis and finally social network analysis. However, in an efficient market the stock evolution is unpredictable. Fama distinguishes three forms of market efficiency [1]:
- Weak efficiency: available information contains only historical prices and volumes.
- Semi-strong efficiency: contains all public information, including annual reports and dividend announcements.
- Strong efficiency: besides the public information, it is enriched with private information.
In his view, none of the forms above allows abnormal profits, and no one can beat the market. He claims that the price already includes all available information; hence operators have the same degree of information. Since Fama supposes that operators are rational, they will accordingly make the same moves and try to buy and sell the same shares. This would result in the dissolution of the market, and no exchange would be possible.
Some scientists refute this supposition and consider two categories of market agents: informed and non-informed. The former acquire information at a certain cost and act according to this knowledge, whereas uninformed agents decrypt data from the price. The price therefore does not reflect all available information; otherwise informed agents would no longer pay for information acquisition. Others demonstrate that there is a fundamental value towards which the market price generally trends. This phenomenon is known as mean reversion. Poterba explains that if the market price and the fundamental value diverge, speculative forces eliminate the difference [2]. This makes the market predictable and gives opportunities to stockholders. Some academics have put the different forms of market efficiency under test, such as Poterba and Summers, who showed autocorrelations in the weak form (historical data), especially for short-term predictions [3]. For dividend publications, Charest showed that significant residuals were observed in the month following dividend changes [4]. He concluded that the market was slow in digesting dividend information. Symmetrically, El Bousty et al. showed that the best prediction accuracy was reached four days after information publication [5].
The efficiency of the market depends roughly on the information flow and on human behavior. Any failure in the information process induces a market efficiency anomaly. Behavioral finance theory claims that decision-making depends on human psychology. An individual's cognitive biases and psychological and heuristic variables determine his behavior and strategy [6]. These parameters may induce irrational behavior, and market forces cannot compensate for human failure. Hence, human irrationality may be another source of market inefficiency.
All the work done so far disproving market efficiency gives grounds for the development of trading strategies, in particular the fundamental and technical approaches. The first is based on the analysis of available information about the company's strength, the markets and the economy in general. It is interested in evaluating the company's real value according to a set of macro- and micro-economic parameters. The technical analysis, on the other hand, mainly compiles the evolution of historical prices and trading volumes to estimate values and trends. It is largely based on chart analysis.
Machine learning algorithms are used to enhance the accuracy of these strategies. Powerful algorithms are combined with technical and/or fundamental indicators, and huge amounts of data are analyzed thanks to the pooling of cloud resources. The first uses of machine learning in financial prediction dealt with structured data such as prices, volumes and financial reports. Now all information, which is often unstructured, is combined to empower the forecasting. This raises more complex challenges, especially how to process text and even non-written information to uncover hidden knowledge.
This work is an attempt to build a model for forecasting shares from multiple news data sources using a two-level machine learning model. The motivation behind this work is twofold: first, to compare the accuracy of a single news data source model with a multi news data source model; second, to inspect the behaviour of our designed model when varying the number of data sources.
The rest of this work is organized as follows. Section 2 introduces previous research on predicting stocks through text analysis. Section 3 presents the method used for retrieving and preparing datasets. Section 4 describes the proposed model. Section 5 depicts the realized experiments. Section 6 shows the results. Finally, Section 7 concludes the contribution of this research work.

Related Work
Stock prediction through machine learning algorithms is an active research area. Some research exploits historical trading data (open, high, low and close prices) in stock predictions. Jigar Patel et al. computed ten technical parameters and compared the performance of four machine learning classifiers [7]. Results show that random forest outperforms Artificial Neural Network (ANN), Support Vector Machine (SVM) and naïve Bayes algorithms. David M. Q. Nelson et al. used the Long Short-Term Memory (LSTM) algorithm to predict future trends of stock prices from historical prices and some technical analysis indicators [8]. Volume can also be useful in stock movement forecasting, as shown by Edson Kambeu [9]: the trading volume of the third previous day influences the current stock market index movement at the Botswana Stock Exchange. All the above techniques exploit structured data only, but as web information grows continuously, there is an increasing need for handling unstructured data too. Studies showed that data extracted from the web have a great influence on the direction of stock movements [10]. Indeed, news events and social media data were inspected in multiple studies to unlock insights hidden within texts [5,11].
The observed influence of news events and social media data on stock movements, along with technical prediction methods, led to advanced research combining two or more of these techniques. Nisal Wadug and Upkesha Ganeguda [12] suggested a four-component model for better results in stock prediction. The Keyword Extraction Module extracts macroeconomic indicators from published financial reports using crawlers or OCR technology. The second component, the Incident Mining Module, focuses on gathering data from newspapers and social media. News relevance is measured through the Impact Analysis Module; Google Trends can be used to determine the importance of extracted events. The last component is the Performance Isolation Module, which attempts to separate the exact performance of a company from external effects. A line of studies has demonstrated that multi-source information outperforms predictions based on a single source [13], including the work conducted by Xiadong Li et al., who compared a news-based model, a historical-price-based model, a naïve combination of these two sources and a Multiple Kernel Learning (MKL) combination [14]. MKL performs better on most tests, and the accuracy of the news-based model is similar to that of the naïve combination model. Aparna Nayak combined all available data about selected companies, especially historical data, news and tweets [15]. First, a continuous trend pattern is computed from the last three days' prices (1 if the trend of the last three days continues in the same direction, 0 otherwise); then the volume variation is compared to the trend of the same day and a volume variation pattern is established. These two patterns are combined with the polarity extracted from news and tweets. The daily predictions achieved about 70% accuracy using a Boosted Decision Tree model.
Although news events and tweets have proven their influence on stock price movements, they are rarely inspected alone, without combination with historical data or technical indicators. One of the few studies examining the impact of published events on stock evolution is the work by Aditi Kaushal and Prerit Chaudhry [16]. They scraped news articles linked to the AAPL share from the Reuters website and assigned each a value of 1 or -1 depending on the sentiment reflected by the article (positive/negative). They obtain a general sentiment for a specific day by summing all sentiment values of that day. The authors believe that not all news has the same impact on the stock, so they consider another parameter, the magnitude of the news, calculated by multiplying the Google Trends value of 'Apple' over a specific period by the general sentiment computed previously. The forecasting of the current day's movement involves the magnitude values of the last 14 days, with the influence of previous days decreased exponentially using a decay function. These values are fed into SVM, Logistic Regression and Naïve Bayes algorithms; comparison showed that the best accuracy is obtained with the SVM algorithm.
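The decay-weighting idea described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code: the function name, the 14-day window handling and the decay constant are assumptions.

```python
import numpy as np

def decayed_magnitude(daily_magnitudes, decay=0.8):
    # Combine the last 14 daily magnitude values into one feature,
    # exponentially down-weighting older days (decay=0.8 is an assumption).
    window = np.asarray(daily_magnitudes[-14:], dtype=float)
    # The most recent day gets weight decay**0 = 1; the oldest decays most.
    weights = decay ** np.arange(len(window) - 1, -1, -1)
    return float(np.dot(window, weights))
```

The resulting scalar would then be one of the features fed to the classifier, so that yesterday's news counts more than news from two weeks ago.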
Most research focuses on analyzing news articles scraped from a single data source or on combining news analysis with other indicators (historical data, tweets, financial reports, technical indicators); none tries to aggregate multiple news data sources to predict stock movements accurately. This article suggests a different approach for predicting stock movement by aggregating news collected from four different data sources using an Artificial Neural Network.

Data Retrieval
In this work we considered articles collected from four different Moroccan economic journals between 15 September 2014 and 20 February 2019. The extraction process is depicted in Figure 1. We developed a Python utility that extracts article links from the journals' websites and then fetches the news articles available at those links. Each article is inspected in order to identify the companies it talks about, and only articles linked to one of the shares listed on the Casablanca Stock Exchange are kept. In parallel, we collected historical prices for those stocks from the Investing.com website. At the end of this process, we constructed a data frame for each economic journal that contains the corpus of collected articles, the title of each article, its date of publication, the concerned stock and the trend on that day (1 for up and -1 for down). The shape of one of the journal data frames is presented in Table 1. The final input data frame is constructed by aggregating the previous four data frames: each row contains the date, the concerned share, the trend column and the titles and corpuses of the articles published that day for this share (Table 2). We observed that for some events only one journal published an article, whereas for other events two or more articles were published. Publication of the news in only one journal could be considered a signal of non-relevance of that event. Hence, all events that were published in a single journal were eliminated.
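The aggregation and filtering steps above can be sketched with pandas as follows. This is a minimal sketch: the column names 'date', 'stock', 'trend', 'title' and 'corpus' are assumptions about the data frame layout, not the exact schema used.

```python
import pandas as pd

def aggregate_journals(frames):
    # Outer-merge the per-journal data frames on (date, stock, trend),
    # keeping each journal's article columns side by side.
    merged = frames[0]
    for i, df in enumerate(frames[1:], start=2):
        merged = merged.merge(df, on=["date", "stock", "trend"],
                              how="outer", suffixes=("", f"_{i}"))
    # Count how many journals actually published a corpus for each event
    # and drop events covered by a single journal only.
    corpus_cols = [c for c in merged.columns if c.startswith("corpus")]
    coverage = merged[corpus_cols].notna().sum(axis=1)
    return merged[coverage >= 2].reset_index(drop=True)
```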

Data Pre-Processing
Before we can get insights from the extracted articles, we must first preprocess them. The preprocessing phase consists of tokenizing, removing stop words and vectorizing. During this phase each data source is treated separately (each journal is represented by a column in the aggregated input data source), as shown in Figure 2.

Text Tokenization
In order to apply any natural language processing technique, text is generally split into tokens (words in our case); POS tagging, named entity recognition or any other technique can then easily be applied. The main goal of this operation is to eliminate punctuation and to store each word separately.
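A minimal tokenizer of the kind described here might look like this. It is a sketch, not our production utility; the regex character class assumes French-language articles.

```python
import re

def tokenize(text):
    # Lowercase the article and keep only alphabetic runs, which
    # eliminates punctuation and stores each word separately.
    return re.findall(r"[a-zàâçéèêëîïôöùûü]+", text.lower())
```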

Stop Word Removal
The pre-processing continues with removing stop words from the corpus. Numbers, words composed of few characters (fewer than three) and words present in most articles are all removed from the token list. This step is most useful when applying a bag-of-words technique, since it reduces the feature space.
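The filtering rules above can be sketched as follows. The function name, the document-frequency threshold and the argument layout are assumptions for illustration.

```python
def filter_tokens(tokens, stop_words, doc_freq, n_docs, max_df=0.9):
    # Drop stop words, numbers, tokens shorter than three characters
    # and tokens appearing in most articles (max_df is an assumption).
    return [t for t in tokens
            if t not in stop_words
            and not t.isdigit()
            and len(t) >= 3
            and doc_freq.get(t, 0) / n_docs <= max_df]
```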

Vectorization
Vectorization is the process of counting the occurrences of words from the feature space. Each article corresponds to an occurrence vector, and the matrix is made by regrouping all these vectors. The input matrix X for a data source is X = [x_ij], where x_ij is the number of occurrences of word j in article i; X has as many rows as articles and as many columns as words in the feature space for that data source.
The analysis of a published article alone is not enough to accurately predict the evolution of stocks; event relevance is also decisive. The impact of events is usually measured through Google Trends [13,16]. However, Google Trends only allows us to evaluate the search frequency of a word (a brand in the context of our research): it evaluates the potential of a price change but not the direction of the movement. In other words, any mistake in assessing the published article implies a wrong prediction. This is precisely what prompted us to develop an original model for quantifying the importance of published events: news impact is no longer estimated through Google Trends but is evaluated by the number of data sources publishing the event.
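An occurrence matrix of this kind can be built with scikit-learn's CountVectorizer, as sketched below on toy sentences (not our actual corpus).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy "articles"; in the real pipeline each journal's corpus is
# vectorized separately, so each source gets its own feature space.
articles = ["la bourse monte fortement",
            "la bourse baisse",
            "le titre monte"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(articles)
# X has one row per article and one column per word in the feature space.
print(X.shape)
```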

Designed Model
As presented previously, we considered four different journals for testing this model; the approach can be extended to any number of data sources.

Figure 3. Designed Model
From the data processing step, we generate four different matrices Xj, one per data source (the article columns in the aggregated data source). The trend column in the aggregated input data corresponds to the desired output vector y, whose entries are 1 (up) or -1 (down).
Our model is built from two levels (Figure 3). The first level inspects the correlation between the input matrix Xj and the trend y for each journal. The SVM algorithm is widely used for text analysis; hence we opted for it to predict the stock movement from the published articles. This process is applied to each journal in order to forecast the stock movement separately. This technique acts as a confirmation of the price change direction: a stock is more likely to go up when most journals estimate a positive change in the price. The four binary (1 or -1) values obtained from the first level are fed into an artificial neural network that corrects the SVM predictions (Figure 3).
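The two levels can be sketched with scikit-learn as follows. This is a minimal sketch: the SVC and MLP hyperparameters are placeholders, not the tuned values used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def train_two_level(X_sources, y):
    # Level 1: one SVC per news source predicts the trend (+1/-1).
    svms = [SVC(kernel="rbf").fit(X, y) for X in X_sources]
    level1 = np.column_stack([clf.predict(X)
                              for clf, X in zip(svms, X_sources)])
    # Level 2: an ANN aggregates the per-source binary predictions.
    ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                        random_state=0).fit(level1, y)
    return svms, ann

def predict_two_level(svms, ann, X_sources):
    level1 = np.column_stack([clf.predict(X)
                              for clf, X in zip(svms, X_sources)])
    return ann.predict(level1)
```

Here the ANN sees only one binary column per journal, which is what lets sources that agree reinforce each other while an unreliable source can be down-weighted.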

Experiments
The aim of our work is first to inspect whether the use of multiple web news sources enhances the accuracy of stock predictions, and second to compare the reaction of our designed algorithm to different numbers of web news sources (two, three and four). Hence, we compared the accuracy of the single-data-source model with the four-data-source model, and then checked the evolution of the accuracy when varying the number of data sources. A part of the data is used for tuning the algorithms: different values of the C, gamma, kernel, alpha and maximum-iterations parameters are tried, and the best parameters are retained for the rest of the experiments.
For training and testing the algorithm, the cross-validation approach was applied. Data were split into five folds; each time, four are used for training and the fifth is fed into the algorithm for testing. In addition, for the second experiment we also tried an 80%-20% split (80% for training and 20% for testing) and a 70%-30% split.
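The tuning and evaluation steps can be sketched as follows. Synthetic data stands in for the real feature matrices, and the grid values are assumptions, not the exact ranges we searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for one journal's feature matrix and trend labels.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.where(X[:, 0] + 0.1 * rng.randn(100) > 0, 1, -1)

# Try several C / gamma / kernel values and retain the best combination.
param_grid = {"C": [0.1, 1, 10],
              "gamma": ["scale", 0.01, 0.1],
              "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# Evaluate the retained parameters with 5-fold cross-validation.
scores = cross_val_score(search.best_estimator_, X, y, cv=5)
print(search.best_params_, scores.mean())
```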

Experimental Results
The first results showed that the accuracy of stock predictions from a single data source is around 60% (Table 3). This is in line with the results obtained in [5]. The aggregation of the four data sources considerably enhanced the accuracy: it reaches 86% for the 80/20 experiment. The use of four different web journals for prediction helps in measuring the relevance of a published article; news published about a brand in four journals is more likely to influence the stock movement than news published in a single journal. Besides, in text and sentiment analysis the way an article is written (style, words, event or opinion article, etc.) is decisive for forecasting. Indeed, this technique confronts the predictions from the four web sources, and the most likely prediction is selected through a trained ANN. From the second experiment we observed that the accuracy increases with the incorporation of each new data source: it rose from 81% for two sources to 86% for the four journals (Table 5).

Conclusions
This work built a relevant model for forecasting the directional movement of shares from financial news articles. To our knowledge, it is the only paper predicting the stock market in a parallel mode from four different web journals. We reached an accuracy of 86%, a 26-point improvement over the single-data-source model, which also compares favourably with most similar works, whose accuracy typically lies between 50% and 70% [17]. For training and testing the produced model, we scraped four Moroccan financial web journals. The number of extracted articles about any single Moroccan company is too low to feed a machine learning algorithm; hence, we considered all articles corresponding to Casablanca Stock Exchange companies. This can have an impact on the performance of the model, and accuracy could even be better if all articles were linked to a single company, as demonstrated by [18]. The first level of the designed model inspected the correlation between the vectorized articles and the shares' price movements. The vectorization process was based on the bag-of-words technique, which generally leads to sparsity and high-dimensionality problems. The use of a financial dictionary could considerably decrease the sparsity and the dimension of the feature space. Bag of words also abstracts away the semantic load of words and contextual issues, so the integration of a sentiment analysis component into this model is necessary and could enhance its performance. This model is naturally not limited to news articles; it can also be fed with tweets and social media posts, so a multi-type data source can be evaluated in subsequent work.