### 1 Introduction

This work is about Artificial Neural Networks (ANN) and their applications to financial time series forecasting. We use two types of algorithms, backpropagation (BP) and resilient backpropagation (RBP), to produce the weights needed for prediction. The final scientific objective is to use the network weights to estimate measures of relative importance. One of the main difficulties when using non-parametric methods such as ANN is the interpretation and meaning of the weights (parameters) obtained. Though interpretation is difficult due to the nature and purpose of machine learning methods, we intend to offer some conclusions on the importance of the variables used for prediction. In this respect, ANN analysis is our method for learning which variables are most relevant for forecasting.

The most common architectures for prediction in time series are single-layer and multi-layer perceptron feed-forward networks. When choosing the activation function it is common to settle on a sigmoid type, which is the standard when the prediction lies in the range between zero and one. The simplest and most common learning rule for forecasting is the *error-correction* type. But perhaps the most important parameter needed to feed our analysis is the *learning rate*, which determines how well the updates in the network perform.

When using the traditional Backpropagation algorithm we must do some preliminary work in order to choose the learning rate that best fits our model. But this can be a time-consuming process and there is no assurance that the network will work well. One way around this problem is to use a different algorithm that determines this learning rate endogenously. We decided to use the Resilient Backpropagation (RBP) algorithm, which offers a simple and heuristic method to find the network weights without first determining the learning rate.

The RBP algorithm is an improvement on traditional Neural Networks trained with the backpropagation algorithm, first proposed by Riedmiller and Braun [14] in 1993. Riedmiller developed a flexible algorithm to tackle the main problems in traditional Backpropagation, especially the vanishing gradient problem and the need for cross-validation analysis to estimate the learning rate. The new algorithm allowed for weight backtracking and a heuristically adjustable learning rate that improved prediction.

The history of Artificial Neural Networks arguably began in the 1940s, when McCulloch and Pitts [12] first proposed the idea of simulating neuronal activity using mathematical logic. But it was not until the early 1960s that the idea that machines can learn was first explored by Rosenblatt [15] with the creation of the Perceptron Algorithm. The possibilities of Artificial Intelligence were recognized for the first time when this algorithm was tested on an IBM 704 computer. Although there were great expectations about artificial intelligence at that time, computer technology was not advanced enough for Artificial Neural Networks to work at their full potential. A big leap forward in ANN research was P. J. Werbos's 1974 unpublished PhD dissertation "Beyond regression: new tools for prediction and analysis in the behavioural sciences", where he first proposed backpropagation to train Neural Networks. A detailed explanation of Backpropagation can also be found in Werbos [17]. The backpropagation algorithm was indeed a breakthrough that allowed the effective use of the gradient descent method in the training of ANN.

The development of the Neocognitron by Fukushima and Miyake [4] inspired the creation of Convolutional Neural Networks (CNN), which are Deep Neural Networks (DNN) with multiple layers where some hidden layers, called convolutional layers, perform a convolution that connects to the next hidden layer of neurons. This type of neural network is often used in pattern recognition problems. In 1982 John Hopfield, with his paper Hopfield [6], invented the Hopfield Network with the purpose of modelling human memory. This later became known as a Recurrent Neural Network, in which time lapses between hidden layers and neurons are important for modelling the human learning process and memory. Several other types of DNN with different variations and architectures have been created in recent years. Deep learning is currently a very dynamic field with great possibilities.

On the side of financial time series analysis, Sapankevych and Sankar [16] surveyed Support Vector Machines (SVM) with a focus on time series prediction. Kim [11] analyses financial time series using SVM on the Korean composite stock exchange market, compares the SVM results with Artificial Neural Network (ANN) models, and finds that the SVM slightly outperforms the ANN models. Huang et al. [7] predicted the Japanese Stock Exchange index using SVM and concluded that SVM has a better hit ratio than neural networks. Kara et al. [10] is a similar analysis of the Istanbul stock index using ANN and SVM. Contrary to the previous works on financial time series, this work concluded that the Neural Networks performed better than SVM, with a higher hit ratio. Cao and Tay [3] analysed futures contracts in the Chicago Mercantile Market using SVM, backpropagation NN and Regularized Radial Basis Function NN. Their results also show that SVM outperforms backpropagation NN and performs similarly to Radial Basis Function NN.

This work shows the basic formulation of Artificial Neural Networks and their practical application to time series. We introduce financial forecasting using the Resilient Backpropagation algorithm (RBP), which was proposed by Martin Riedmiller and Heinrich Braun in Riedmiller and Braun [13] in 1992. They published their work the next year in Riedmiller and Braun [14]. This algorithm tries to solve the problem of the learning rate, especially in noisy data. Igel and Hüsken [9] also work on RBP and explain the weight backtracking technique.

As mentioned before, the final scientific objective is to measure which features are important for prediction across whole financial markets. Huang et al. [8] include different measures for assessing the relative importance of each feature in the neural network prediction. The two importance measures found in the literature are the Garson and Yoon contribution measures, but we noticed that the two are not highly correlated. The interpretation and comparison of the two measures is not straightforward and, on average, the correlation between them is about 0.54 in our analysis. Our hypothesis is that these contribution measures are not well suited to describe the relative importance of each feature. We decided to construct a simple measure in order to describe the importance of each feature variable in the best ANN model obtained.

Given the above scientific objective, we do not focus on model selection techniques. Although prediction depends on the network architecture and other technical choices such as the learning rule or the activation function, the main purpose is to observe which features are better for prediction. This information is already embedded in the data and its complexity. Although the determination of the *best* model is important for prediction (e.g. via evolutionary algorithms), we decided to leave this subject for future research.

In this work we use ANN for binary classification with the objective of predicting ups and downs in stock exchange indexes. In the first part of this work we introduce Neural Networks and the Resilient Backpropagation algorithm. In the second part, we use data from six stock exchange markets (Hong Kong, Japan, Germany, Europe50, Canada and Mexico indexes) in order to predict the ups and downs of the stock indexes. The final part of this work includes an analysis of the relative importance of the feature variables using different contribution measures.

### 2 Artificial Neural Networks

#### 2.1 The theory

Artificial Neural Networks using the Backpropagation algorithm are a traditional method for classification and forecasting. Though several versions of Deep Neural Networks (DNN) are now popular and powerful tools for analysis, the backbone behind all architectures of Neural Networks continues to be the gradient descent method used in the Feedforward and Backpropagation algorithms. Both ANN and DNN have a wide range of important commercial applications. There have been numerous efforts to design artificial neural networks based on Von Neumann's architecture, trying to produce intelligent programs that mimic biological neural networks. Neurons are very special cells in the human brain, interconnected with each other and responding to stimuli through chemical and electric reactions over connections called synapses. The idea of ANN is to simulate the neuronal stimulus process and let these neurons learn by themselves.

ANN can perform complex classification problems. For a simple binary classification, the idea is to construct a decision function $h\left(x\right)$ to approximate the outcome or label $y$ (which is binary, either zero or one). The decision can be interpreted as a simple weighted function, a linear combination of, let's say, two features ${x}_{1}$ and ${x}_{2}$ :

$$h\left(x\right)={\theta}_{1}{x}_{1}+{\theta}_{2}{x}_{2}+b$$

Which can also be written in the form:

$$h\left(x\right)={\theta}^{T}x+b$$

The main objective is to find the vector of weights $\theta $ and the parameter $b$ (bias) that may be used to describe and predict the outcome $y$ . A neural process is a biological process that describes how a neural cell learns. A neural cell processes information from the stimuli it receives and then uses a series of synapses to pass on the new information already processed. An ANN tries to reproduce the same process, using input nodes to receive information, one hidden layer to process the information using an activation function, and then output nodes where the processed information is received. Given an output $y$ we must train our network to *learn* and reproduce these output values. At the end, the weights $\theta $ will be obtained and used for prediction.

The process of training an ANN will depend on the activation function we want to use as well as the method to find the appropriate weights recursively. Usually, we initiate training of the ANN with random weights and then apply the weights to every data point, passing information on to a hidden layer where it is processed by an activation function. The weights ${\theta}^{T}$ are recalculated iteratively until convergence is achieved. It is desirable that the values we get from the decision function $h\left(x;\theta ,b\right)$ lie in a small range, for example between 0 and 1. One way to achieve this is to map the decision function $h\left(x;\theta ,b\right)$ into a new function that will give an output between 0 and 1:

To represent $g$ as a new function of $z\left(x\right)={\theta}^{T}x+b$ , we may start with a sigmoid function of the type (although there are other types of functions that may be useful to represent $g$ ):

$$g\left(z\right)=\frac{1}{1+{e}^{-z}}$$

This is akin to a logistic regression function with $g\left(z\right)\in \left[\mathrm{0,1}\right]$ . We call this the *activation function* because this function will process the information and put forward an output. This activation function will be in every neuron, in every hidden layer of the neural network.

The decision function will approximate the label $h\left({x}^{\left(i\right)};\theta ,b\right)\approx {y}^{\left(i\right)}$ for every data point $i=\mathrm{1,...,}n$ . The idea is to use all the past data to learn the values $\theta $ , $b$ and to approximate the values of $y$ (supervised learning). We may try to minimize the sum of squared errors (learning rule) as our *objective function* or *loss function* as follows:

$$J\left(\theta ,b\right)=\sum_{i=1}^{n}{\left(h\left({x}^{\left(i\right)};\theta ,b\right)-{y}^{\left(i\right)}\right)}^{2}$$

The crucial step is to minimize the error function ${(h\left({x}^{\left(i\right)};\theta ,b\right)-{y}^{\left(i\right)})}^{2}$ . The main task will be to obtain the parameters $\theta $ and $b$ in such a way that the error is minimized. One optimization technique is to use iteration in order to approach the optimum values of $\theta $ and $b$ . One such optimization method is the *stochastic gradient descent* algorithm:

$${\theta}_{i}:={\theta}_{i}-\alpha \text{\Delta}{\theta}_{i},\qquad b:=b-\alpha \text{\Delta}b$$

Where $\text{\Delta}{\theta}_{i}$ and $\text{\Delta}b$ are the gradients of the parameters. The important task in this method is to find the gradients $\text{\Delta}{\theta}_{i}$ and $\text{\Delta}b$ in order to iterate until convergence. The term stochastic comes from the idea of initializing the algorithm with random weights, usually between zero and one. The idea is to start with random weights during iteration in order to avoid local minima. Another problem with this method is the determination of the learning rate $\alpha $ of the network. If $\alpha $ is too small the algorithm is very slow, and if too large it can bring about large fluctuations. Additional analysis is needed to find $\alpha $ , for example using Cross-Validation analysis or optimization.

By computing partial derivatives of the error function with respect to the parameters, the gradients become:

$$\text{\Delta}{\theta}_{i}=\sum_{j=1}^{n}2\left(h\left({x}^{\left(j\right)};\theta ,b\right)-{y}^{\left(j\right)}\right){g}^{\prime}\left({z}^{\left(j\right)}\right){x}_{i}^{\left(j\right)},\qquad \text{\Delta}b=\sum_{j=1}^{n}2\left(h\left({x}^{\left(j\right)};\theta ,b\right)-{y}^{\left(j\right)}\right){g}^{\prime}\left({z}^{\left(j\right)}\right)$$

The stochastic gradient descent algorithm allows us to learn the decision function $h\left(x;\theta ,b\right)$ by computing the above gradients by iteration. First we must set random values for $\theta $ and $b$ and use the initial values ${x}^{i}$ and ${y}^{i}$ to compute the derivatives 9, 10 and 11. Once the parameters ${\theta}_{1}$ , ${\theta}_{2}$ and $b$ have been computed, we use these values in 6, 7 and 8 and then go back to the input values ${x}^{i}$ and ${y}^{i}$ again. This iteration continues until convergence, when the decision function $h\left({x}^{\text{*}};{\theta}^{\text{*}},{b}^{\text{*}}\right)$ is obtained and used for prediction. To predict $y$ we multiply the parameters ${\theta}^{\text{*}}$ by $x$ , add the term ${b}^{\text{*}}$ and pass the result through the sigmoid function. In this simple network the inputs represent the first layer $l$ , followed by a single neuron (perceptron) in the next layer $l+1$ where the activation function processes the information and passes it on to the final output in the final layer $L$ .
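As an illustration of the loop just described, the following is a minimal sketch (not the code used in this work) of gradient descent for a single sigmoid unit with squared-error loss; the toy data and function names are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(x, y, alpha=0.5, epochs=2000, seed=0):
    """Gradient descent for a single sigmoid unit with squared-error loss."""
    rng = np.random.default_rng(seed)
    theta = rng.random(x.shape[1])        # random initial weights in (0, 1)
    b = rng.random()                      # random initial bias
    for _ in range(epochs):
        h = sigmoid(x @ theta + b)        # forward pass
        # dJ/dz for J = sum (h - y)^2 with sigmoid activation
        delta = 2 * (h - y) * h * (1 - h)
        theta -= alpha * (x.T @ delta)    # update weights
        b -= alpha * delta.sum()          # update bias
    return theta, b

# toy usage: learn the (linearly separable) OR rule
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 1], float)
theta, b = train_perceptron(x, y)
pred = (sigmoid(x @ theta + b) > 0.5).astype(int)
```

After training, thresholding the sigmoid output at 0.5 recovers the binary labels, which is the same mechanism used for the hit-ratio evaluation later in this work.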

The gradient descent algorithm is a key feature of an ANN. Although more sophisticated algorithms are being developed, gradient descent is still the core method in ANN. There is also the disadvantage of the vanishing gradient, when weights are too small and drive the gradient to zero. Perhaps the vanishing gradient problem was the main disadvantage of the ANN and also the main motivation to develop more sophisticated networks.

Another idea is to separate the data into smaller problems and to solve each problem separately. For example, in our binary classification problem, some data with a label equal to zero may form a single cluster between two separated clusters of data labelled one. Now we need two decision functions with more parameters, and we need to construct a neural network with two neurons. The idea is to build a decision function of decision functions so as to predict the label $h\left(\left({h}_{1}\left(x\right),{h}_{2}\left(x\right)\right);\omega ,c\right)\approx y$ . We can use the previous gradient descent method as before, but now we have a more complex structure.

What we are building now is a neural network where the hidden layer that stores the activation function $g\left(z\right)$ has two neurons. Both inputs now go through the whole network; the parameters $\omega $ and $\theta $ are now the weights for the decision functions, while the biases are $b$ and $c$ . The new decision function is:

$$h\left(x\right)=g\left({\omega}_{1}{h}_{1}\left(x\right)+{\omega}_{2}{h}_{2}\left(x\right)+c\right)$$

In the above decision function all parameters and biases must be found at the same time using gradient descent. We are iterating forward, which means that the iteration to update the parameters must go back to each input data point in the training sample $\left\{{x}^{i},{y}^{i}\right\}$ in layer $l$ until convergence is accomplished. Networks that iterate this way are called *feedforward* Neural Networks.

#### 2.2 Backpropagation algorithm

Another way to learn is to use the *backpropagation algorithm* (BP). But first let us consider the possibility of more complex ANN architectures. We may add more neurons but also additional hidden layers to our Neural Networks, in what is commonly known as the Multilayer Perceptron Network. BP requires that once the *feedforward* process has been completed and we have arrived at the output layer $L$ , we go back over the entire network to perform a *backward* pass for all layers $l=L-\mathrm{1,}L-\mathrm{2,...,2}$ . In order to perform a backward pass we must redefine our variable $z$ and the decision function. The function $z$ is now:

$${z}^{\left(l\right)}={\theta}^{\left(l\right)T}{h}^{\left(l-1\right)}+{b}^{\left(l\right)}$$

And the decision function becomes:

$${h}^{\left(l\right)}=g\left({z}^{\left(l\right)}\right)$$

The first decision function is just the layer of inputs ${h}^{\left(1\right)}=x$ , which is used to update the next decision function ${h}^{\left(2\right)}=g\left({\theta}^{\left(2\right)T}{h}^{\left(1\right)}+{b}^{\left(2\right)}\right)$ and so forth until we get the output $h\left(x\right)={h}^{\left(L\right)}=g\left({\theta}^{\left(L\right)T}{h}^{\left(L-1\right)}+{b}^{\left(L\right)}\right)$ , which is a scalar. Because we are now using multiple neurons, we can understand ${h}^{\left(l\right)}$ as a vector, just like the input vector $x$ and the parameters $\theta $ and $b$ . We must proceed in a similar way as when we obtained the gradients in 9, 10 and 11. First we must find the derivative of the function $g\left(z\right)$ with respect to the parameters $\theta $ and $b$ . For the vector of weights:

$$\frac{\partial g\left({z}^{\left(l\right)}\right)}{\partial {\theta}^{\left(l\right)}}=\frac{\partial g\left({z}^{\left(l\right)}\right)}{\partial {z}^{\left(l\right)}}\cdot \frac{\partial {z}^{\left(l\right)}}{\partial {\theta}^{\left(l\right)}}$$

And for the bias:

$$\frac{\partial g\left({z}^{\left(l\right)}\right)}{\partial {b}^{\left(l\right)}}=\frac{\partial g\left({z}^{\left(l\right)}\right)}{\partial {z}^{\left(l\right)}}\cdot \frac{\partial {z}^{\left(l\right)}}{\partial {b}^{\left(l\right)}}$$

The second part of the above derivatives, $\partial {z}^{\left(l\right)}/\partial {\theta}^{\left(l\right)}$ and $\partial {z}^{\left(l\right)}/\partial {b}^{\left(l\right)}$ , comes from 13:

$$\frac{\partial {z}^{\left(l\right)}}{\partial {\theta}^{\left(l\right)}}={h}^{\left(l-1\right)},\qquad \frac{\partial {z}^{\left(l\right)}}{\partial {b}^{\left(l\right)}}=1$$

To find the first part of the above derivatives 15 and 16, we define ${\delta}^{\left(l\right)}=\partial g\left({z}^{\left(l\right)}\right)/\partial {z}^{\left(l\right)}$ . Because ${z}^{\left(l+1\right)}={\theta}^{\left(l+1\right)T}{h}^{\left(l\right)}+{b}^{\left(l+1\right)}$ , then:

$$\frac{\partial {z}^{\left(l+1\right)}}{\partial {h}^{\left(l\right)}}={\theta}^{\left(l+1\right)}$$

and since ${h}^{\left(l\right)}=g\left({z}^{\left(l\right)}\right)$ , then:

$$\frac{\partial {h}^{\left(l\right)}}{\partial {z}^{\left(l\right)}}={g}^{\prime}\left({z}^{\left(l\right)}\right)$$

Now we can obtain the derivative ${\delta}^{\left(l\right)}$ :

$${\delta}^{\left(l\right)}=\left({\theta}^{\left(l+1\right)}{\delta}^{\left(l+1\right)}\right)\odot {g}^{\prime}\left({z}^{\left(l\right)}\right)$$

Where $\odot $ denotes the Hadamard product. Now we can find the updates for the gradient descent and perform the backpropagation:

$$\text{\Delta}{\theta}^{\left(l\right)}={\delta}^{\left(l\right)}{h}^{\left(l-1\right)},\qquad \text{\Delta}{b}^{\left(l\right)}={\delta}^{\left(l\right)}$$
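The feedforward and backward passes just described can be sketched in code. The following toy implementation (illustrative only, not the code used in this work) trains a network with one hidden layer by backpropagation, with squared-error loss and sigmoid activations; all names and the toy data are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(x, y, hidden=4, alpha=1.0, epochs=5000, seed=1):
    """One-hidden-layer network trained by backpropagation (sketch)."""
    rng = np.random.default_rng(seed)
    W1, b1 = rng.random((x.shape[1], hidden)), rng.random(hidden)
    W2, b2 = rng.random((hidden, 1)), rng.random(1)
    for _ in range(epochs):
        # feedforward pass: h1 = g(z1), h2 = g(z2)
        h1 = sigmoid(x @ W1 + b1)
        h2 = sigmoid(h1 @ W2 + b2)
        # backward pass: delta terms via the chain rule;
        # h*(1-h) is g'(z) for the sigmoid, * acts as the Hadamard product
        d2 = (h2 - y) * h2 * (1 - h2)
        d1 = (d2 @ W2.T) * h1 * (1 - h1)
        # gradient descent updates: delta^(l) times h^(l-1)
        W2 -= alpha * h1.T @ d2; b2 -= alpha * d2.sum(axis=0)
        W1 -= alpha * x.T @ d1;  b1 -= alpha * d1.sum(axis=0)
    return W1, b1, W2, b2

# toy usage: learn the (linearly separable) AND rule
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [0], [0], [1]], float)
W1, b1, W2, b2 = train_backprop(x, y)
pred = (sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
```

The two `d` terms are the $\delta $ quantities of the derivation: the output-layer delta is propagated back through the weights and multiplied elementwise by the derivative of the activation.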

#### 2.3 Resilient Backpropagation Neural Networks (RBP)

The backpropagation algorithm allows the network to learn and obtain the parameters $\theta $ and $b$ in a more refined way. The main drawback of both feedforward and backpropagation is that some gradients may become zero, so some neurons may deactivate themselves. In rare cases the data and initial values produce the deactivation of neurons, weakening the network and decreasing its predictive power, in what is known as the *vanishing* gradient problem. This is the main motivation to develop the more complex architectures called *Deep* Neural Networks or DNN.

Another algorithm commonly used in ANN is the heuristic *Resilient Backpropagation* algorithm (RBP). This algorithm is a slightly different version of Backpropagation: instead of using the magnitude of the gradients $\text{\Delta}{\theta}_{1}$ , $\text{\Delta}{\theta}_{2}$ and $\text{\Delta}b$ , we use their sign. The purpose of this modification, first proposed by Riedmiller and Braun [14] in 1993 and applied by Anastasiadis et al. [2] in 2005, was to adapt the learning rate over the entire neural network. The RBP converges faster than the common BP algorithm but keeps its more general attributes. Günther and Fritsch [5] produced an efficient algorithm to estimate the RBP with weight backtracking.

With the Backpropagation algorithm we have seen that the weights are updated following the general form:

$${\theta}_{ij}^{\left(l\right)}:={\theta}_{ij}^{\left(l\right)}-\alpha \text{\Delta}{\theta}_{ij}^{\left(l\right)}$$

Where ${\theta}_{ij}^{\left(l\right)}$ denotes the weight from neuron $j$ to neuron $i$ in layer $l$ . In the previous section we found that the update amount $\text{\Delta}{\theta}_{ij}^{\left(l\right)}$ was $\text{\Delta}{\theta}^{\left(l\right)}={\delta}^{\left(l\right)}{h}^{\left(l-1\right)}$ (we drop the transpose superscript T to simplify the exposition). We know that this is also:

$$\text{\Delta}{\theta}_{ij}^{\left(l\right)}=\frac{\partial E}{\partial {\theta}_{ij}^{\left(l\right)}}$$

where $E$ denotes the error (loss) function.

The RBP algorithm proposes that the update be performed with the sign of the derivative rather than its size, as follows:

$$\text{\Delta}{\theta}_{ij}^{\left(l\right)}={\delta}_{ij}^{\left(l\right)}\cdot \mathrm{sign}\left(\frac{\partial E}{\partial {\theta}_{ij}^{\left(l\right)}}\right)$$

Another important change is that the update parameter ${\delta}^{\left(l\right)}$ is changed to ${\delta}_{ij}^{\left(l\right)}$ in the following form:

$${\delta}_{ij}^{\left(l\right)}=\begin{cases}\min\left({\eta}^{+}\cdot {\delta}_{ij}^{\left(l-1\right)},{\delta}_{max}\right) & \text{if}\ \frac{\partial {E}^{\left(l-1\right)}}{\partial {\theta}_{ij}}\cdot \frac{\partial {E}^{\left(l\right)}}{\partial {\theta}_{ij}}>0\\ \max\left({\eta}^{-}\cdot {\delta}_{ij}^{\left(l-1\right)},{\delta}_{min}\right) & \text{if}\ \frac{\partial {E}^{\left(l-1\right)}}{\partial {\theta}_{ij}}\cdot \frac{\partial {E}^{\left(l\right)}}{\partial {\theta}_{ij}}<0\\ {\delta}_{ij}^{\left(l-1\right)} & \text{otherwise}\end{cases}$$

Where $0<{\eta}^{-}<1<{\eta}^{+}$ are pre-set values. The value of ${\delta}_{ij}^{\left(l\right)}$ is updated in every step according to the change of sign of the derivative in 24 and the above definition. When the sign changes, it means that a minimum has been missed, so the network applies either ${\eta}^{-}\cdot {\delta}_{ij}^{\left(l-1\right)}$ or ${\delta}_{min}$ , whichever is larger. The reader may notice that this process also eliminates the need to determine the learning rate $\alpha $ exogenously, as in the traditional backpropagation algorithm, because the parameter $\eta $ is a close adaptive substitute.

The method of weight backtracking is also based on heuristics, and the idea is to keep using previous weights for updating (some weights only). For example, if:

$$\frac{\partial {E}^{\left(l-1\right)}}{\partial {\theta}_{ij}}\cdot \frac{\partial {E}^{\left(l\right)}}{\partial {\theta}_{ij}}\ge 0$$

the update follows the sign rule above.

But if it is less than zero, we revert using the previous update:

$$\text{\Delta}{\theta}_{ij}^{\left(l\right)}=-\text{\Delta}{\theta}_{ij}^{\left(l-1\right)}$$

This implementation trick avoids updating the learning rate, thereby avoiding the *otherwise* option above. The advantage of using the RBP algorithm is that it reduces computation while achieving similar, if not better, precision. It is very useful when the data contains noise, which means that it performs well when applied to financial time series data sets. In the next section we will present some results using RBP neural networks.
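To make the heuristic concrete, here is an illustrative sketch of the RBP step-size adaptation (a simple variant without weight backtracking) applied to a single sigmoid unit. The values ${\eta}^{-}=0.5$ and ${\eta}^{+}=1.2$ follow common defaults from the literature; all names and the toy data are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rprop_step(grad, prev_grad, step, eta_minus=0.5, eta_plus=1.2,
               step_min=1e-6, step_max=50.0):
    """One per-weight RPROP update: grow the step while the gradient sign
    is stable, shrink it when the sign flips (a minimum was overshot)."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # zero the gradient after a sign flip so the next step is not adapted again
    grad = np.where(sign_change < 0, 0.0, grad)
    return -np.sign(grad) * step, grad, step

def train_rprop(x, y, epochs=200, seed=0):
    """Single sigmoid unit (bias folded into the weights) trained with RPROP."""
    rng = np.random.default_rng(seed)
    xb = np.hstack([x, np.ones((len(x), 1))])     # append a bias column
    w = rng.random(xb.shape[1])
    step = np.full_like(w, 0.1)                   # initial per-weight step sizes
    prev_grad = np.zeros_like(w)
    for _ in range(epochs):
        h = sigmoid(xb @ w)
        grad = xb.T @ (2 * (h - y) * h * (1 - h)) # squared-error gradient
        update, prev_grad, step = rprop_step(grad, prev_grad, step)
        w += update
    return w

# toy usage: learn the OR rule without choosing a learning rate
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 1], float)
w = train_rprop(x, y)
pred = (sigmoid(np.hstack([x, np.ones((len(x), 1))]) @ w) > 0.5).astype(int)
```

Note that only the sign of the gradient enters the update; the per-weight step sizes play the role of the learning rate and adapt themselves during training.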

### 3 Time Series Forecasting

#### 3.1 The data

The first task in this work is to forecast a time series using binary classification with ANN methods. A basic classification would be to describe the behaviour of a stock or stock index in order to predict its movement. Predicting stock prices is important because we want to decide whether to buy or sell a stock, or to predict the ups and downs of a price index. In this case we define a label $y$ with binary values $\left(\mathrm{0,1}\right)$ , with zero for a drop in the stock price compared to the day before, and one for an increase in the stock price. We can use, for example, the closing prices of stocks or stock indexes to construct this label. This is a practical way to observe movements in stock prices, as we need to know whether prices will go up or down for very practical decision making. For example, we may decide to sell if a stock price is likely to go down, or buy when the price is likely to go up. Broadly speaking, trading in the stock market is based on these simple decisions, and a good trader usually uses, among other tools, technical analysis to decide whether it is time to sell or buy stock. Although we are not applying this binary classification problem to any specific stock, the underlying principle of classification remains the same.

The next question is defining the features that will be used to predict the movements in stock prices. In other words, we need the matrix of features **x** that will help to define the label **y**. We decided to use some technical analysis concepts as in Kim [11], most of them taken from Achelis [1]. Technical analysis indicators will be our features matrix **x**. An experienced trader may read the concepts in table 1 and, along with additional information, try to predict changes in stock prices. These features are mostly ratios of prices, moving averages, or both.

Table 1.

Selected features and formulas

Table 1 shows twelve well-known technical indicators for trading. These are constructed from simple market data such as the closing price (CP), lowest low price (LL), highest high price (HH), and high (H) and low (L) prices during the trading day or period, and it is very common to use Moving Averages (MA) in their construction. Stochastic oscillators such as the stochastic $\text{\%}K$ and $\text{\%}D$ are used by traders to know whether a stock is overbought or oversold. Momentum is used to detect changes in the price trend, while the Rate of Change (ROC) measures the speed of the ascent/descent of every new trend. The Williams' Accumulation/Distribution indicator measures market pressure, advising to sell when Williams' A/D is low and stock prices are high, and to buy when Williams' A/D is high and market prices are low. Disparity5 is simply the ratio of the closing price to the 5-day Moving Average, and the OSCP or Price Oscillator is the growth of the moving average. The Commodity Channel Index (CCI) is used as a leading indicator to observe the strength of a stock, that is, whether it is overbought or oversold. The Relative Strength Index (RSI) is used as a leading indicator to observe the historical strength of a stock using the ups and downs in historical closing prices. The WilliamsR is a momentum indicator that reflects how the close price compares with the highest price.

All features in table 1 are associated with the prices of the stocks and are used to interpret the trends of stock market prices. The entire data set for a given stock market index will be the label **y** and the feature matrix **x** that describes the label. All technical analysis indicators will be used to classify our label in both directions, ups and downs, for the entire stock market index. There are dozens more technical analysis indicators that could be used, but we aim to use some of the most popular ones that have also been applied in similar research.

This section contains an empirical analysis using the RBP algorithm to predict time series, particularly changes in stock price indexes. We chose to predict changes in six major European, Asian and North American stock market indexes: the European STOXX50, which contains blue-chip stocks from the 50 best performing companies in leading sectors in Europe; the DAX, which contains 30 blue-chip German companies; the Nikkei stock exchange index; the Hang Seng index, the stock exchange index of Hong Kong; the Canadian Toronto Stock Exchange index; and the Mexican Stock Exchange index IPC.

We decided to use daily data for each stock exchange from January 2000 to June 2019, fewer than five thousand daily observations in each market. Compared with the same data from the 1980s and 1990s, the period of analysis covers higher-frequency trading and contains sharp financial crashes, perhaps due to the new trading methods using electronic platforms and the availability of information online. Financial markets are now more competitive as communication technology has improved along with capital mobility. Table 5 in the appendix contains the summary statistics for the six markets on closing, high, low and open market prices.

#### 3.2 Estimation

With the information on the average prices in each market, we first constructed our matrix of features $x$ that was used to describe the label **y**, where **y** are the changes in close price. The idea is to approximate the daily changes in close prices in each stock market. In this case, the classification problem will be to assign a label of 1 if the close price index at time $t$ , $C{P}_{t}$ , is higher than the previous day's $C{P}_{t-1}$ , and 0 otherwise, so that $y\in \left\{\mathrm{0,1}\right\}$ . Our data points $x$ will be the features that help us to find the weights $w$ and the bias $b$ .

Because ANN is a supervised machine learning method, we ask the network to find the best way to predict **y** using the twelve technical analysis features constructed from the indicators in table 1 (matrix $x$ ). This matrix will serve as the input data, and the main objective of the ANN algorithm is to find patterns in $x$ so as to approximate the vector **y**. Because the weights produced by the ANN are sensitive to the scale of each feature, we must normalize the features matrix $x$ and transform it into a matrix with values in the range $\left(\mathrm{0,1}\right)$ , so that features with large values do not overwhelm features with small values.
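The label construction and the normalization described above can be sketched in a few lines (a minimal illustration; the function names are our own, and the closing-price series stands in for the market data):

```python
import numpy as np

def make_label(close):
    """y_t = 1 if CP_t > CP_(t-1), else 0 (the first day has no label)."""
    close = np.asarray(close, float)
    return (close[1:] > close[:-1]).astype(int)

def minmax_normalize(X):
    """Rescale each feature column of X into the (0, 1) range."""
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

# toy usage on four closing prices
label = make_label([100.0, 101.5, 100.8, 102.0])   # -> [1, 0, 1]
```

Min-max scaling keeps all features on a comparable footing, so a feature measured in index points cannot dominate one measured as a ratio.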

Once the whole data set with **y** and $x$ has been constructed, the next step is to divide the data set into training and test sets. The training set will be used by each algorithm to learn, and the second set will test whether the predicted values match the data points. In our analysis, we used the first 70% of the data points as the training set and the remainder as the test set. In the appendix the reader will find the summary statistics for each feature **x** and the label **y** by stock market.
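Because this is time series data, the 70/30 split is chronological rather than random; a sketch (the function name is our own):

```python
def chrono_split(X, y, train_frac=0.7):
    """First 70% of observations for training, the remainder for testing."""
    n_train = int(len(y) * train_frac)
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

# toy usage on ten observations
X = list(range(10))
y = list(range(10))
X_tr, y_tr, X_te, y_te = chrono_split(X, y)
```

Splitting by time rather than at random prevents information from future trading days leaking into the training set.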

The only thing left to clarify is the estimation of the hit ratio for each prediction. After running each of the ANN models, we get the predicted values on the test data as a new data set with predicted values for the label $\widehat{y}$ . One way to evaluate the performance of the ANN predictions is to construct a hit ratio. Because the predicted values are real numbers and our test data set has a binary label, we must transform the predicted values into binary outcomes. We set a threshold of 0.5, which means that any predicted value higher than 0.5 will be set to 1 (close price was higher than the previous day) and otherwise to 0 (close price was lower than the previous day). The prediction results are matched in the form:

$${\widehat{y}}^{\left(i\right)}=\begin{cases}1 & \text{if}\ h\left({x}^{\left(i\right)}\right)>0.5\\ 0 & \text{otherwise}\end{cases}$$

The prediction performance is measured using a hit ratio, defined by:

$$\mathrm{Hit}=\frac{1}{n}\sum_{i=1}^{n}1\left({y}^{\left(i\right)}={\widehat{y}}^{\left(i\right)}\right)$$

This hit ratio is the percentage of correct matches where $y$ = $\widehat{y}$ . It is a simple coefficient that can be used to compare the performance (prediction) of each neural network model.
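The thresholding and hit-ratio computation amount to a few lines (a sketch; the function name is our own):

```python
import numpy as np

def hit_ratio(y_true, y_prob, threshold=0.5):
    """Share of test points where the thresholded prediction equals the label."""
    y_hat = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(np.asarray(y_true) == y_hat))

# toy usage: two of four thresholded predictions match the labels
ratio = hit_ratio([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])   # -> 0.5
```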

The main part of the empirical analysis requires using ANN to predict time series. We trained different single-layer networks using the traditional Backpropagation and Resilient Backpropagation Neural Network algorithms. At first, single-layer neural networks were constructed with 6, 12, 18 and 24 neurons each, using standard logistic and error functions. Later we trained multi-layer networks with 6 and 12 neurons in three hidden layers. The results are shown in table 2.

Table 2.

Financial Forecasting using Artificial Neural Networks

One disadvantage of the ANN is that the cost of training increases as the architecture becomes more complex. As the number of neurons and hidden layers increases, the required training time grows. On the other hand, ANN with backpropagation may obtain better performance due to more flexible updating. Table 2 contains the hit ratios for ANN models with different architectures. The upper part contains the hit ratios using traditional backpropagation with a learning rate of 0.1, while the lower part contains the hit ratios using resilient backpropagation with weight backtracking.

With the only exception being the Europe50 index, we find larger hit ratios using resilient backpropagation. This does not mean that we cannot achieve better results with traditional backpropagation, but to do so we would need to find the best learning rate and architecture, and this would require additional statistical analysis to decide the correct learning rate, as each model is different.

On the other hand, resilient backpropagation has a flexible and heuristic way to choose the learning rate and update the gradient for a better descent. The reader may notice that there is little room for improvement in each model using backpropagation, as we use a single learning rate for every model. However, resilient backpropagation has room for improvement, as the learning is controlled during convergence. The estimates in table 2 will vary as we choose different activation functions, learning rates and gradient methods for updating, but we decided to leave model selection for future research.

### 4 Contribution measures

This work focuses not only on financial forecasting using ANN but also
offers a descriptive analysis on the overall performance of the features
used for prediction. This is an important issue because we need
information on the relative relevance of each feature in the learning
process. We know that each feature was normalized when constructing the
matrix **x**, so we may be able to apply some indicators on similar data and obtain some comparable results.

Table 3.

Correlation coefficients: Garson, Yoon and Trapezoid Contribution Measures

Index | Garson vs. Yoon | Garson vs. Trapezoid | Yoon vs. Trapezoid |
---|---|---|---|
DAX | 0.625 | 0.816 | 0.808 |
NIKKEI | 0.789 | 0.851 | 0.931 |
IPC | 0.550 | 0.722 | 0.908 |
HS | 0.050 | 0.307 | 0.881 |
EU50 | 0.619 | 0.621 | 0.512 |
TSE | 0.511 | 0.698 | 0.864 |

In order to find the relative importance of each feature we must apply a measure based on the weights from the ANN analysis. The magnitude of each weight in the network tells us about the relative importance of each feature. This section provides some measures of the relative contribution of each feature to the final output of a Neural Network. We estimated each contribution measure on the best single-hidden-layer ANN. For example, if the input layer has $i = 1, 2, \ldots, I$ nodes and the hidden layer has $j = 1, 2, \ldots, J$ neurons, the final output weights come from the same number of nodes $k = 1, 2, \ldots, J$. First, we introduce two different measures, the Garson and Yoon measures, similar to Huang et al. [8]. The first contribution measure is the Garson measure:
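In its standard form from the literature, with $w_{ji}$ denoting the input-to-hidden weights and $v_{jk}$ the hidden-to-output weights, the Garson measure for feature $i$ reads:

$$G_i \;=\; \frac{\sum_{j=1}^{J}\left(\dfrac{|w_{ji}|}{\sum_{i'=1}^{I}|w_{ji'}|}\,|v_{jk}|\right)}{\sum_{i'=1}^{I}\sum_{j=1}^{J}\left(\dfrac{|w_{ji'}|}{\sum_{i''=1}^{I}|w_{ji''}|}\,|v_{jk}|\right)}$$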

And the second is the Yoon measure:
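In its usual signed form from the literature, with the same weight notation, the Yoon measure for feature $i$ reads:

$$Y_i \;=\; \frac{\sum_{j=1}^{J} w_{ji}\,v_{jk}}{\sum_{i'=1}^{I}\left|\sum_{j=1}^{J} w_{ji'}\,v_{jk}\right|}$$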

The Garson measure can be interpreted as a percentage contribution to the final output. The Yoon contribution index is harder to interpret, though a high absolute value may be read as high relevance. Both measures are designed for a single-layer Neural Network, so we use the best single-layer results for each model. The results of the estimation are shown in table 4 for each market (the number in parentheses is the number of neurons in the hidden layer). For four markets the ROC appears to be the feature with the highest contribution to the financial forecast; the exceptions are the Hang Seng index and the Euro50 index, the only two indexes where the best single-layer hit ratio was obtained using backpropagation.
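As a minimal sketch of how these two measures can be computed from a fitted single-hidden-layer network, assuming a single output node with a weight matrix `W` of shape $J \times I$ (hidden by input) and an output weight vector `v` of length $J$ (the array shapes are our assumption):

```python
import numpy as np

def garson(W, v):
    """Garson relative importance: non-negative shares that sum to 1."""
    # Share of each input within each hidden neuron, scaled by |output weight|.
    Q = (np.abs(W) / np.abs(W).sum(axis=1, keepdims=True)) * np.abs(v)[:, None]
    contrib = Q.sum(axis=0)           # aggregate over hidden neurons
    return contrib / contrib.sum()    # normalise to percentages

def yoon(W, v):
    """Yoon signed contribution: can be negative; absolute values sum to 1."""
    s = (W * v[:, None]).sum(axis=0)  # signed sum over hidden neurons
    return s / np.abs(s).sum()
```

With two inputs feeding two hidden neurons through identity-like weights, `garson` returns equal shares of 0.5 for each feature, while `yoon` keeps the sign of each input's net effect.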

Table 4.

Contribution Measures for each Feature by Stock Market

Table 5.

Standard Statistics by Stock Market

Table 6.

Summary Statistics Features

Table 7.

Summary Statistics Features (continue)

One problem with the above measures is consistency. Both measures are positively correlated, but in some cases only barely. For example, table 3 shows the correlation coefficients between the Garson and Yoon measures: they are highly correlated for the Nikkei index, but completely different for the Hang Seng index, with a correlation of just 0.05. Another drawback of the Garson and Yoon measures is that they become difficult to calculate for more complex network architectures. Under such considerations, a different measure is needed to evaluate the contribution of each feature in an ANN model.

We decided to give a geometric interpretation to the weights in order to establish their relevance. For example, in a one-hidden-layer neural network, we interpret the weights ${w}_{ji}$ and ${v}_{jk}$ as the lengths of the opposite sides of a triangle. Multiplying the network weights in this form, we can interpret the entire measure as the area of several triangles that make up an irregular trapezoid:
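One way to formalise this description, offered here as a sketch of the idea rather than a definitive formula: reading $|w_{ji}|$ and $|v_{jk}|$ as the two sides of a triangle for each hidden neuron $j$, the share of the total area attributed to feature $i$ is

$$T_i \;=\; \frac{\sum_{j=1}^{J} \tfrac{1}{2}\,|w_{ji}|\,|v_{jk}|}{\sum_{i'=1}^{I}\sum_{j=1}^{J} \tfrac{1}{2}\,|w_{ji'}|\,|v_{jk}|}$$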

An appealing feature of this Trapezoid Contribution measure is that it can be applied to any number of hidden layers and neurons in the network, and it is easy to calculate and interpret once each part is expressed as a percentage of the whole area. Table 4 shows the relative importance of each feature from the ANN analysis using the Garson, Yoon and Trapezoid measures. For the Japanese, Canadian, Mexican and German indexes the ROC is the most influential feature for predicting the stock market index, while the fastK is the most important for the Hong Kong and Euro50 indexes.
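A hedged computational sketch of the single-hidden-layer case, under the same assumed shapes as before (`W` of shape $J \times I$, `v` of length $J$) and reading each weight pair as the two sides of a triangle whose area is proportional to their product:

```python
import numpy as np

def trapezoid(W, v):
    """Trapezoid-style contribution: each input's share of the total area.

    Sketch only: treats |w_ji| and |v_j| as two sides of a triangle per
    hidden neuron, so each triangle area is half their product; the J
    triangle areas for input i are summed and expressed as a percentage
    of the whole (trapezoid) area.
    """
    areas = 0.5 * np.abs(W) * np.abs(v)[:, None]   # (J, I) triangle areas
    return areas.sum(axis=0) / areas.sum()         # shares summing to 1
```

Because only absolute values and products are involved, the same computation extends neuron by neuron to deeper architectures, which is the practical advantage claimed for the measure.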

We may notice that the new Trapezoid contribution measure is highly correlated with the Yoon measure but also moderately correlated with the Garson measure. Most importantly, it is easy to calculate and can be applied to more complex network architectures.

### 5 Final Conclusions

This work presents financial forecasting using both traditional Backpropagation and Resilient Backpropagation Neural Networks, together with an analysis of the relative importance of the features used for forecasting. We use standard single-layer and multi-layer feed-forward architectures to evaluate the performance of both algorithms, along with a sigmoid activation function and the error-correction learning rule, which are common for time series forecasting. The RBP algorithm provides a practical solution to the determination of the learning rate and is especially helpful for noisy data sets such as financial stock indexes. Resilient backpropagation with weight backtracking is a very flexible algorithm that can adjust to changes in model complexity; sometimes it can find a better solution when the model specification changes.

This work provides a simple contribution measure to evaluate the importance of features in financial time series forecasting. The main motivation is the lack of consistency between the two available indexes, the Garson and Yoon contribution measures. A simple measure based on the area of a trapezoid captures the idea of contribution to the prediction using the ANN weights. This Trapezoid contribution measure uses the ANN weights from the best model (highest hit ratio from a single-layer ANN) to calculate the area of an irregular trapezoid for every feature variable. Although the concept is simple, it reflects the magnitude and influence of each weight in the network and can be interpreted as a contribution to the forecast.

We used the trapezoid contribution measure along with the Garson and Yoon measures to analyse the relevance of each feature in the best ANN model for each of the six stock exchange indexes. We conclude that the ROC is a very relevant feature for at least four of the stock exchange indexes: IPC, TSE, DAX and Nikkei. The Euro50 and Hang Seng indexes seem to respond more to the FastK indicator, even though the Garson and Yoon contribution measures do not consistently show this. In this respect, the trapezoid contribution measure offers additional relevant information that can be used to evaluate the contribution of each feature in the network.