Evaluating Equity Index Risk with Entropy of Component-Returns

Preface:

This was the topic I wrote my Master's Thesis on and, most likely, it is the truest representation to date of me seeking (and researching) symmetry in securities. I will not dump the 80-page work here, for several reasons, but mostly due to the irrelevance of my research to actual traded markets. I proof-of-concept tested a physics-inspired model heavily reliant on having good measurements or proxies for liquidity, as well as on having large data sets with relatively continuous distributions. I thus opted for monthly rather than daily or intra-day data, covering all stocks with continuous monthly data for 16 years (1998-2013) on both the Nikkei 225 Stock Average and the Standard & Poor's 500 Index.

Getting bid/ask spread data, order depth and order size isn't very easy or cheap, and simply wouldn't give a testable time period covering enough stretches of both high and low volatility. The solution was to use stock market turnover ratios as a liquidity proxy instead, which meant working with monthly consolidated data. This actually helps: stocks have time to disperse significantly further than the minimum quote increment, which gives much smoother return distributions and, in particular, avoids the annoying discontinuous "zero-peaks" in the probability density functions. Choosing monthly over weekly data was also rather important, as trading weeks vary considerably in length because of holidays; a week can be anything from one to five trading days. In addition, monthly data allowed a consistent comparison against the volatility indices on both of these stock indices, since the options going into the volatility index calculations have monthly roll dates.

I evaluated the entropy gain of the capitalization-weighted individual component return distribution across consecutive months, and adjusted the measurements to separate the total entropy gain, the entropy gain from the shift of the mean of the distribution, and the entropy gain from the shift of the shape of the distribution. A thought-primer on physical entropy and how I have used it is included in the main post below. As the entropy model I used (Kullback-Leibler divergence) doesn't "automatically" capture center-shifts, these were approximated by dividing the total entropy gain by the shape-change-only component, and this ratio proved to be the most statistically significant for improving a univariate dependent-variable exponential Generalized AutoRegressive Conditional Heteroskedasticity (eGARCH) model. In addition, the entropy models all manage to model liquidity relatively well on the monthly time period, but sadly none of these results are visible in the more easily visualized time-series regression models due to direct correlation effects.

The main computing environment was the R statistics package, where I generated capitalization-weighted smoothed return probability distributions, ran Kullback-Leibler divergence models and checked many different external regressors in eGARCH models. The major add-on packages used were Quantmod, rugarch and seewave. Some work was done in VBA (mostly simple pre-processing and condensation of data into monthly returns, as well as quick visual checks of intermediate results during data processing), with some work done in C++ (generation of repetitive R code, list calls, and R data extraction – I really needed data that was nested deep in R output objects and wasn't callable in an efficient manner without horribly repetitive console printouts… yes, I suck at R).

Keep in mind when reading, this is a qualitative summary and explanation of what I have done for my thesis research. It does not quote a single thing from my thesis. I cannot graph the results well since I’ve been using 198 data points per time series – I would need to go up one order of magnitude in data points for good visualization of GARCH model dynamic adaptation – and I really cannot forecast anything since the GARCH training models preferably should have 1000 data points for initialization, and even if I use the technical minimum of 100 then the testing set is 98 points and not enough for statistically significant results.

Of course, I have specific models in mind to use for at least daily-updated risk analysis, but these didn’t really fit my academic requirements. I will summarize some basic points of these and how my research can be used to improve current risk evaluation by plugging in to available approaches, but since I would prefer to be able to have back-tested versions of these as a benefit to a potential employer to use for proprietary trading, no greater detail will be provided. To get that, contact me with a job offer!

Contents:

  1. Introduction and Background
    1. Primer on entropy, qualitatively.
    2. Primer on entropy, quantitatively.
    3. My final entropy framework.
  2. Methodology
    1. How things are related in my model
    2. How I defined volatility and liquidity
    3. Generating distributions and why I hate the standard normal (but still had to use it)
    4. GARCH, eGARCH, time series regressions, how I used them and [technical babble]
  3. Brief Results and My Interpretation
    1. TSR results and discussion
    2. eGARCH results and discussion
    3. All the Goodies and How to Make Money With Entropy + eGARCH
  4. How To Make Entropy Commercially Useful [S3 exclusive, in individual posts to be linked later]
    1. Daily update frequency
    2. Plug in with different models:
      1. Arbitrage Pricing Theory
      2. Markov Chains
      3. Multiple Constraint MEM (MCMEM) forecasting [Cannot be done without a good entropy model]
      4. Multivariate MEM (MVMEM) forecasting [Cannot be done without a good entropy model]

For those of you reading this at the moment, please consider getting ahold of the formula sheet I prepared, which covers the most important points of mathematics between the start and 4.1 (Daily update frequency):

Entropy formula sheet


1. Introduction and Background:

The main purpose of this research was to develop and test the theoretical validity of entropy-derived portfolio risk measures in contrast to the current variance-of-returns approaches to risk (Black-Scholes, most VaR, don't even get me started on technical analysis). Current risk management approaches for equity products – which largely apply thermodynamics- and statistical-mechanics-related mathematics in their derivation – thus resemble measuring the temperature of a physical system rather than how "specified", or loosely speaking how "uncertain", the system is. In the few cases where entropy-related mathematics has been applied academically to risk evaluation, it has always had at least one of a number of irritating conceptual problems:

  1. Evaluating entropy of one variable over long-term historical data. Entropy is a snapshot of the uncertainty of a "system state", or of how likely it is to transition to another state. This physical interpretation is lost if entropy is evaluated as a single number quantifying the variations of a single variable over a long-term time series. If you ever read academic papers trying to use Bayesian/Akaike information criteria (BIC/AIC) on financial time series to do anything valuation-related, you know that entropy is being abused.
  2. The maximum entropy method (MEM) of probability density distribution generation is used with poor constraint definitions. This is a technical topic I could probably talk about for days, but it hinges on how entropy changes as more information about a system becomes available. In unknown systems you can put down "constraints" for the information becoming available, generate a system with added information having the maximum entropy that satisfies the constraints, and say that this is the highest uncertainty possible for future developments of the system. In much financial research, options prices (nominal over a future payoff curve, or as moments around the payoff distribution mean) have been used for these constraints, but since option prices are themselves dependent on the underlying price data, approaches like these become incredibly circular, intellectually limited, and lock in the poor risk pricing features of options.
  3. Lagrange multipliers derived from cross-sectional options data distributions being used as MEM constraints. Yes, I know this is a lot like 2. above, but I think it is slightly different. I liken this to painting a floor with the most expensive coating of paint you can find, only to end up having painted yourself into a corner: it is very (mathematically) appealing, but the method kind of starts at the wrong end of the problem and is so demanding / costly that getting out in the end isn't really an option. Empirical options prices that satisfy what this model demands are simply not available, and the computational demands of the model are enormous. Also: given the mathematical similarities between MEM on exponentially modeled systems (like stock markets, or anything measured in growth rates) and Lagrange multipliers, this is essentially doing the same thing twice.
  4. Entropy being applied in the information-science sense rather than the physics sense. I know this is a physicist's pet peeve, but it means not being able to apply higher-order physics concepts in a specifically finance-related manner. This would include approaches such as thermodynamic partition functions, which could be used to measure risk components, or methods to evaluate exactly what type of uncertainty function shapes risk by testing, for example, canonical partition function or grand partition function models and how they interrelate with other market variables.

Thus I set out to write my master's thesis on evaluating an entropy model that could be made to fit the requirements I would put on it for any daily-and-up frequency trading. First, it helps to get a slightly better idea of what entropy actually is, so I've decided to do these "primers": first the interpretation of what entropy is, and then, for those interested, a few useful formulas.

1.1 Primer on Entropy, Qualitatively:

Entropy can be interpreted most commonly as the degree to which we lack information on a system that can be in several different states. This is slightly different from the common idea that it is "disorder" or even "lack of order". The exact coordinates of the papers on your desk always contain the exact same amount of information, whether they're stacked neatly in boxes or exist in a heap that threatens to overtake all of your workspace. However, the statement "orderly stack in box A" allows me to conclude that the center of any paper reachable from your desk is in fact within the outer coordinates of box A, reduced by the width and length of said paper, whereas "heap all over desk" leaves a bigger set of available coordinates, orientations and interrelations between papers. The system specifications are simply different, and there are fewer possible combinations of paper position data in the "ordered" state than there are in the "disordered" state, although if you tracked them all individually their coordinates would take up the exact same amount of storage space on a computer. The "ordered" state here has lower entropy, the "disordered" state has higher entropy, not as a function of the amount of data, but because of the possible spread of the data. This desk example is very close to the practical entropy of gas particles in a box, but very far from any system that is easily modeled mathematically, given that positions are on a continuous spectrum.

Mathematicians and physicists prefer to use a combinatorial approach when starting to study entropy, and then generalize those approaches until their models are wide enough to accommodate mathematically "large" numbers, generally far in excess of 10^10, and then mathematically "very large" numbers, generally in excess of 10^10^10. In these combinatorial approaches it often helps to think of dice, coin flips or any other common mental image of something that can generate a random number. Let's go with dice this time.

1.1.1 Dice Illustration of Entropy

How many possible combinations of numbers and colours are there if you throw one black and one white die, both six-sided? 36. The expected sum of the dice? Seven, which has six possible combinations that make it, from [1,6] to [6,1]. The sum two has just one possible combination, [1,1], and likewise twelve, [6,6]. The entropy of a sum of two or twelve is thus much lower than that of a sum of seven. This blows up incredibly fast when you introduce more dice to the system: there are always 6^N combinations in total, but only one combination each for the minimum and maximum sums, and more and more as you move inwards. Just think of a system with, say, 500 dice, and evaluating all the possible ways you could get a sum of 1750 (roughly the type of entropy we'll have to deal with here), or even worse, expanding it to everyday thermodynamics where the number of "dice" is somewhere around 10^22 at least!
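As a concrete illustration (a quick base-R sketch, not taken from the thesis), the distribution of the sum of 500 dice can be built by convolving in one die at a time, and the 1750 case read off from it:

```r
# Build the distribution of the sum of 500 fair six-sided dice by repeatedly
# convolving in one die at a time, then read off how likely a sum of 1750 is.
die <- rep(1/6, 6)            # P(face = 1), ..., P(face = 6)
p   <- 1                      # distribution of the sum of zero dice
for (i in 1:500) {
  new <- numeric(length(p) + 5)
  for (f in 1:6) {
    idx <- seq_along(p) + (f - 1)
    new[idx] <- new[idx] + p * die[f]
  }
  p <- new                    # after i dice, p[k] = P(sum = k + i - 1)
}
p_1750     <- p[1750 - 500 + 1]                  # P(sum = 1750) with 500 dice
log10_ways <- log10(p_1750) + 500 * log10(6)     # log10 of the number of combinations
# p_1750 is about 0.01, while log10_ways is roughly 387 -- the combinations giving
# 1750 are a tiny fraction of the 6^500 total, yet still astronomically many.
```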

1.1.2 Rubik's Cube Illustration of Entropy

When you move that far up, this viewpoint of entropy gets less and less useful, and you try to relate it to more manageable numbers and things you can actually compare and care about. Therefore, physicists originally looked at why energy is lost to the external environment in any process, and how much energy is required to reverse the process completely. This view of entropy is a little bit more like the following: you have a formerly ordered Rubik's Cube that some kid has been left to play with since you last solved it. You now give this cube to someone who is blindfolded, and you have to tell them exactly what to do to get the cube back to one colour per side. The number of instructions you have to give is pretty much the entropy of the cube at the moment you hand it over.

1.2 Primer on Entropy, Quantitatively:

The quantitative view is a little bit easier to look into since most of the formulas are relatively simple. Sadly, I’m not too good with getting formulae looking good on WP, so a verbal walkthrough has to be used instead. A document with the relevant formulae will be attached at the end.

1.2.1 Boltzmann Entropy:

This is the most common entropy, and the one that has found applications across science because of the vast mathematical versatility it brings with it. It should be used for discrete systems, but because of how it works you can essentially discretize a system into small enough parts that for all intents and purposes it becomes continuous, much like binomial options pricing. Thus, one can accurately say that if a system has different probable outcomes of future events, it has an entropy. Boltzmann entropy defines an absolute entropy S, given as the probability of each possible state of the system multiplied by the logarithm of that same probability, summed over all possible states (the sum of -pi*log[pi] over all states i, multiplied by a constant that isn't important for this discussion). Illustrating this is probably best done with a 20-sided die as our main variable generator, and a system as a collection of states of… say, 1000 dice. 20 happens to be very close to the natural number e, cubed. [e^3 ~ 20.086, close enough.] The number of sequences of dice numbers in this system is 20^1000, which when using natural logarithms means that each number sequence has a probability of 20^-1000, or a natural log probability of -1000*ln(20), which is very close to -3000. Since all sequences of dice numbers are equally probable, we're taking a probability of 20^-1000 and summing it 20^1000 times, giving 1 and recovering a total system entropy of about 3000! Hooray! Using a 55-sided die will result in roughly 4000 in total system entropy, as will using about 1300 twenty-sided dice. Increasing the size of the system thus increases its entropy linearly, while the number of states of the individual components had to nearly triple for the same change (in base 10 you would have to add one zero to the number of faces to do the same thing, in base 2 just double it; it's simply a game of the logarithmic base here).
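For readers who prefer symbols, the same statement in standard notation (with the constant set to one) and the 1000-dice numbers worked out:

```latex
% Boltzmann/Gibbs entropy over all states i, with p_i = 20^{-1000} for every state:
S \;=\; -\sum_i p_i \ln p_i
  \;=\; -\,20^{1000}\cdot\left(20^{-1000}\ln 20^{-1000}\right)
  \;=\; 1000\ln 20 \;\approx\; 2995.7
```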

1.2.2 Shannon Entropy:

For all intents and purposes identical to Boltzmann entropy, but by default in base 2, and without the constant. This entropy is called H(x) and relies on a variable that generates probabilities in discrete systems. Mainly, this was used to blaze the trail for information theory and started the thinking about compressing data, since if you have many repeated sequences or other data constraints you don’t need the system to accept or anticipate inputs from the forbidden regions. Let’s say in our dice example above, we limit the face-up sides to even numbers, essentially now getting 10-sided dice. Since ln(10) ~ 2.303, the entropy of this system immediately becomes 2300 instead of 3000! Knowing in advance the structure of the system and the nature of any sequence in any data allows you to simplify that data into code that has much less information content, but can be recovered without loss if a key is also sent. Say you have sequences of dice from left-to-right of these 1000 dice, either being a sequence of [1, 2, 3 … 19, 20] or [20, 19, 18 … 2, 1]. This system can now be reduced to 1000/20 = 50 bits of information, or less information than is contained in the word “entropy” when saved on a hard drive.

Applying these concepts to continuous systems creates a few problems. Unless future outcomes and their probabilities overlap exactly, entropy will essentially sit at or very near the upper system bound for a continuous distribution. Using the six-sided-die analogy, [1, 6] and [3, 4] both sum to 7 exactly; neither is 7.01 or 6.98, and throwing two dice it is in fact impossible to get those values as a sum. For a continuous variable, though, it is definitely possible to reach those values and nearly impossible to hit integer values at all; there will always be at least the possibility of rounding errors. In physics, rather than throwing the idea of smoothed entropy out, we simply go full-calculus on the system, ask what happens if we make the number of dice, coin flips or other generators infinite, and then work backwards from continuous distributions. This approach should work for finance too!

1.2.3 Kullback-Leibler Divergence

Applying this to markets still presents a problem: the distributions exist, the value-generating functions still exist, and we probably can't make very precise statements about future results along the lines of "all dice will show an odd face-up value" or "there will be sequences of data that can be reduced easily". The stock market equivalent of either of these would be to say "there are this many stocks in the market, we don't know the future, and the market's future entropy is thus completely given by the number of stocks in the market", or "the natural-logarithm entropy of rolling 2000 twenty-sided dice is 6000".

Here we use a little cheat: we assume any distribution of outcomes is smooth, and we don't even bother looking for a stable "expected" distribution of outcomes – we look at the most likely distribution at the moment instead. This still has the distribution problem above (we just shoved the probability of the outcomes around slightly), but we can compare this distribution of outcomes to other distributions, and evaluate how likely it is to go from one to the other through a little trick called the Kullback-Leibler divergence. It takes a distribution we have (let's call it Q(x)) and compares it to a "true" distribution (let's call it P(x)). It evaluates the integral of P(x)*log[P(x)/Q(x)] dx over all x, thus essentially asking "If I know Q, how likely am I to guess P? How much more do I need to know to get P for certain?" Think loosely of this as seeing our 1000 20-sided dice, all with face-ups that are even, and then being asked how likely it is that the values are also all divisible by four, or how many times you would expect to push a die to flip up an adjacent face to make all the numbers on the 1000 dice faces divisible by four. The Rubik's Cube example is a little bit like saying "Here we have a cube with all faces having at least one of every color on every side. How many moves would you expect to need to make a cube that has at most four colors per side?" These analogies also capture the fact that the KL divergence is not symmetrical; it will not require any moves to go to a set of dice faces that are all even if we know that at the moment all the face-up numbers are divisible by four!
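For the numerically inclined, here is a tiny base-R sketch of the divergence evaluated on a grid, including its asymmetry (illustration only, not thesis code):

```r
# Kullback-Leibler divergence D(P||Q) evaluated numerically on a grid.
x  <- seq(-12, 12, by = 0.01)
dx <- 0.01
P  <- dnorm(x, mean = 0, sd = 1)       # the "true" distribution
Q  <- dnorm(x, mean = 1, sd = 2)       # the distribution we currently have
kl <- function(p, q) sum(p * log(p / q)) * dx
kl(P, Q)   # D(P||Q)
kl(Q, P)   # D(Q||P) -- a different number: the divergence is not symmetric
```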

1.3 My Final Entropy Framework:

With this background and outlook, I set out to interpret the probability of a given period nominal return, given one currency unit invested at random in the stock market. Now, readers with their math hats on will ask “random how?” since most “randoms” follow certain constraints. Yes, mine does too, and with the need to interpret a likely investor return, all the stocks were weighted according to their market capitalization. If I throw $1 000 000 000 at an index fund (I wish!) I will get the market allocation in theory. If I estimate the allocations of $1 000 000 000 in individual invested dollars at random, I will get the market portfolio down to 0.0032% of the individual weights.

One problem, however, is that most stock indices that are often quoted in the financial media, seen as close barometers of economic activity, and covered by good, liquid volatility indices to benchmark my entropy models against simply don't have thousands upon thousands of components to create smooth distributions. So instead of getting relatively smooth distributions automatically, the avenue to go down on an effectively discrete system is to apply a smoothing function to a discrete distribution. I will get into a few details of what should be selected as the smoothing function later, but the approach overall essentially means that we take each stock ticker and treat its return as a distribution of returns that could have happened during the month, with the width of that distribution determined by the standard deviation of individual returns across the whole index that month. (A better model dynamically weights each stock according to individual characteristics, but getting these models to close in testing was found to add too much unidentifiable error to the overall index autoregressions, and they were scrapped for the benefit of proof-of-concept testing. Some ways of adding dynamic parameter lists to be read by the R script for smoothing really tripped the models up.) The workaround thus looks like this (a minimal R sketch follows the list):

  • Read the return of the stock ticker. This is the central point for any process using it.
  • Smooth it using an appropriate smoothing function, saying that rather than the return existing to 100% at its recorded value, there is uncertainty about what could have happened that the discretized month-end reading does not capture.
  • Set the width of this smoothing distribution to be related to the standard deviation of all stock ticker returns on the index that month (this contrasts with an individualized dynamic stock ticker uncertainty model, which should be better but was abandoned for practical reasons).
  • Multiply the height of that stock ticker distribution by a weight linearly related to the stock’s market capitalization.
  • Repeat for all stocks in the index.
  • Sum the distributions.
  • Read the result as that month’s distribution and compare it to the result of the same calculations for the prior month using the Kullback-Leibler divergence.
  • Repeat the last step assuming that both months had the same average return by reducing the return for the individual stocks by the average index return that month. This is done to find if the shape of the distributions is more or less important than the overall distribution taken in its entirety.
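A minimal R sketch of the list above – variable names (ret_t, cap_t and so on) are hypothetical, not the thesis code, and the exact averaging conventions in the thesis may differ:

```r
# ret_t / ret_prev : vectors of component returns for the current and prior month
# cap_t / cap_prev : the matching market capitalizations

build_month_dist <- function(ret, cap, grid) {
  w     <- cap / sum(cap)                    # capitalization weights
  sigma <- sd(ret)                           # smoothing width from the cross-section
  dens  <- numeric(length(grid))
  for (i in seq_along(ret)) {                # one smoothed kernel per stock ticker
    dens <- dens + w[i] * dnorm(grid, mean = ret[i], sd = sigma)
  }
  dens / (sum(dens) * (grid[2] - grid[1]))   # normalize so it integrates to one
}

kl_div <- function(p, q, dx) sum(p * log(p / q)) * dx

grid <- seq(-0.6, 0.6, by = 0.001)           # monthly-return grid, +/- 60%
dx   <- 0.001

P <- build_month_dist(ret_t,    cap_t,    grid)   # this month
Q <- build_month_dist(ret_prev, cap_prev, grid)   # prior month
total_gain <- kl_div(P, Q, dx)                    # all entropy gain

# Shape-only version: strip each month's average index return first
P_shape <- build_month_dist(ret_t    - mean(ret_t),    cap_t,    grid)
Q_shape <- build_month_dist(ret_prev - mean(ret_prev), cap_prev, grid)
shape_gain <- kl_div(P_shape, Q_shape, dx)

excess_entropy <- total_gain / shape_gain    # the ratio discussed in the preface
```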

Overall, then, my model looks at the relationship between the returns of contracts on a market, and by using the Kullback-Leibler divergence it is particularly sensitive to movements towards the tail ends, especially when these moves are not symmetrical across the tails. It should therefore pick up stock market moves earlier than aggregate volatility models, and estimate whether or not directional market moves are more important than those where the distributions simply change shape.

2. Methodology:

This is probably both the only really interesting part of my research for a casual reader and the most boring, depending on which section you end up reading or are inclined to like.

2.1 How Things are Related in my Model:

The model I set out to research for academic validity essentially can be explained as follows:

  • The formula for the mean free path of a microscopic particle has some interesting features – let's look into them!
    • As temperature goes up, so does the expected mean free path of the particle.
    • The larger the particle, the less it is expected to move.
  • The finance analogies of the above quantities are liquidity for temperature, volatility for mean free path, and market capitalization for particle size. (My literature review covers a bunch of books and research papers on why this analogy actually holds up.)
  • Looking at entropy, though, the entropy gained per constant unit of added energy is expected to decrease as temperature rises (the Clausius equality that is central to introductory thermodynamics; both relations are written out after this list).
    • The equity market analogy is a market that treats all stocks more equally when liquidity is high, but sorts them into a low system entropy state when the liquidity is low.
    • This behaviour is perceived in markets due to the liquidity risk premium! The LRP forces the returns of some companies to fall considerably more when liquidity is tight, because exiting the position might be difficult and the party on the buy-side therefore has to get a higher expected return as compensation for this risk.
  • An entropy-related measure should thus be statistically significantly different from volatility, in theory.
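For reference, the textbook forms of the two physical relations referenced above (standard kinetic theory and thermodynamics, not the thesis's exact parametrization):

```latex
% Mean free path of a particle of diameter d in a gas at temperature T and pressure P:
\lambda \;=\; \frac{k_B T}{\sqrt{2}\,\pi d^{2} P}

% Clausius: the entropy gained per unit of reversibly added heat falls as T rises
\mathrm{d}S \;=\; \frac{\delta Q_{\mathrm{rev}}}{T}
```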

My approach therefore creates three key testable groups of variables (factors in my thesis, but I'll do away with that terminology here): liquidity, entropy models, and volatility models. Liquidity is tested in simple linear time-series regressions against both of the other variable classes, to find whether there is any statistical significance to how these variables change together over time. Both of the latter are then tested against realized future volatility on the stock index under consideration for the month ahead in the data, through both time-series regressions and exponential GARCH methodologies. Liquidity thus "seeds" my model, and allows for model-difference testing in time-series regressions and GARCH runs.

2.2 How I Defined Volatility and Liquidity:

One key reminder is that all my data – liquidity, volatility, entropy, returns – is modeled on month-end-to-month-end data considering only one month, irrespective of how many trading days there are in it. In hindsight, adjusting liquidity for the month's day count might have helped, since it would give a better turnover rate in annualized terms, but it was not a concern at the time. Volatility? The standard deviation of index returns (plus some other components in some models) during the month; entropy is given by month-end-to-month-end stock returns; and implied volatility is the mean volatility index value throughout the calendar month (to somewhat target longer volatility periods rather than single-announcement market adjustments that happen to fall on the month-end).

2.2.1 Volatility

In more detail, volatility starts with the monthly standard deviation of daily index returns on a close-to-close basis. Then a model which adds the daily range (daily high less daily low) to the vanilla volatility is considered, and finally a model that also includes the close-to-open overnight gap. A model further adding the open-to-close return was also tested but discarded in the final stages. The rationale for these specifications is that volatility is very rarely priced on just the day-close data (if a stock market falls by 5% on one day you don't wait for the closing bell to try to adjust a leveraged long portfolio to save what you can), and the more data you gather, the faster overall measurement uncertainty decreases and a repeatable model is reached. At the same time, generating values for comparison that repeat the same data several times risks skewing the model towards specific measurements, and unless you have separate generating variables, most of the uncertainty will eventually come from how you specify your model.

Think of this like grading kids. If you only do 4-hour marathon tests at the end of term, they could be ill that day, or similarly if you just grade on unprepared quizzes, resulting in grades that are far from the children’s actual knowledge. On the other hand, if you have 15 testing variables with some being tests, others being presentations, still more homework and on top of that some projects in the end, your weighting of the individual components probably plays more of a role in the end (or might risk ruining the whole classroom experience if the teacher is too busy handing out assignments to actually teach the subject matter!)
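To make the three specifications concrete, here is a hypothetical R sketch (the exact aggregation and weighting used in the thesis are not reproduced here; the combinations shown are illustrative only):

```r
# ohlc: data.frame of one month's daily bars with columns open, high, low, close
monthly_vol_measures <- function(ohlc) {
  cc  <- diff(log(ohlc$close))                            # close-to-close daily returns
  rng <- log(ohlc$high / ohlc$low)                        # daily high-low range
  gap <- log(ohlc$open[-1] / ohlc$close[-nrow(ohlc)])     # overnight close-to-open gap
  c(vanilla        = sd(cc),
    plus_range     = sd(cc) + mean(rng),                  # illustrative combination only
    plus_range_gap = sd(cc) + mean(rng) + sd(gap))
}
```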

2.2.2 Liquidity

Liquidity, similarly, requires a little bit of tweaking. Since no clear definition exists of how liquidity relates to market observables, we simply assess something that "feels like what liquidity should be and gives good enough results in regressions to go with the theories" and keep working from there. Ideally, I would have loved to see total open interest across the span of a day, but getting that data is a bit out of my reach. Similarly, checking the spreads on contracts would have been interesting, but drawing out tick-by-tick spreads, summarizing them across a month for all stocks on an index and then distilling a representative number felt like a handful of variables and possible approaches too many.

What I settled on instead was the overall stock market turnover ratio: how large is the monthly turnover as a fraction of total market cap at month end? It is a simple measure, possibly one that would benefit a lot from annualizing turnover by monthly day counts, but it captures simply how much money moved around in the market. It also helps that, on a "closed system" stock market, this idea lends itself very well to the thermodynamic picture of mean free path applied to stocks. If someone sells, say, $1 million of a stock like Apple, the price won't really change much, given the open interest, all the people actively trading it, and all the people with hidden positions; putting that money into a very small-capitalization company will invariably force that stock to move a lot more. A large particle, all else equal, has a much higher momentum at any given speed than a small particle, but since the transaction can be viewed as a collision, the change in speed and kinetic energy of the small particle from the absorbed momentum is much larger than that of the large particle. The sum of all these interactions is what is perceived as temperature in thermodynamics, and as liquidity in the stock markets.
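In code, the proxy is just a ratio (hypothetical variable names):

```r
# Monthly stock market turnover ratio as a liquidity proxy
turnover_value <- sum(daily_volume * daily_price)       # money traded during the month
turnover_ratio <- turnover_value / market_cap_month_end # fraction of month-end market cap
```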

2.3 Generating Distributions and Why I Hate the Standard Normal:

Generating the distribution has already been covered to an extent in section 1.3, but this section deals much more closely with some of the theoretical aspects of selecting a good smoothing function. Ideally, the smoothing function should have the same features as the "true" distribution over the longer term, but since we don't know this distribution, we have to select a few features we really, really want. In my case, these were:

  • Smoothness: I’m running calculus here on a system using mathematics that are supposed to be so large that the system is effectively continuous.
  • Defined for all feasible values of returns: the Kullback-Leibler formula really goes haywire if data points start showing up where either of the functions P(x) or Q(x) is 0.
  • Be leptokurtic: high head and fat tails! Kurtosis is the fourth moment around the mean and measures how much of the distribution sits in the peak and the tails rather than in the shoulders around one standard deviation. Stock market returns are generally found to have excess kurtosis (to be leptokurtic), meaning that both tail events and close-to-center events occur more often than the standard normal alone would suggest. Conversely, events around the standard deviation, the "changes we would expect", happen less often than we expect.

Going through the graphing packages and statistics, I don't really find anything that fits all three criteria; the standard normal fits the first two. Triangular or parabolic smoothing simply doesn't fit either, and neither do some other defined-range smoothing kernels (kernel: smoothing function). The ideal candidate would be the logistic distribution, since it has fat tails and a high peak, is smooth, and can model values from negative to positive infinity. Alas… is it in any of the graphing packages? Nope!

Thus the standard normal is the "least bad" alternative, but successfully implementing the logistic distribution would generate "tighter" peaks, which would let the Kullback-Leibler divergence pick out better resolution around the center of the monthly distribution without over-analyzing events far out on the tails. Both of these things would really improve the predictive ability of the model, while reducing the weight of stocks that simply end up in feedback loops from the buying or selling pressure of market-capitalization-adjusting portfolios, and allowing fringe events around 1-2 standard deviations of returns to be properly resolved. But the standard normal we had to live with, so standard normal it is…
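To illustrate the "high head and fat tails" point, here is a quick base-R comparison of a normal and a logistic density matched to the same variance (illustration only, not part of the smoothing pipeline):

```r
# A logistic density with scale s has variance s^2 * pi^2 / 3, so s = sqrt(3)/pi
# gives unit variance, making it directly comparable to the standard normal.
s <- sqrt(3) / pi
dnorm(0); dlogis(0, scale = s)    # ~0.399 vs ~0.453: higher peak
dnorm(3); dlogis(3, scale = s)    # ~0.0044 vs ~0.0078: fatter tail
```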

2.4 GARCH, eGARCH, Time-Series Regressions, How I Used Them and [Technical Babble]

After having singled out a good index realized volatility measure (adjusted close-to-close plus daily range methodology) and seen that there is some statistical merit to all the different models of entropy, there is a need to actually use these in rather advanced calculations. I will go over my quantitative approaches here:

2.4.1 GARCH:

Generalized Autoregressive Conditional Heteroskedasticity (GARCH) methodology is a very advanced theoretical approach for estimating variances that can broadly be separated into a two-step process:

  1. Evaluate the underlying data movements by generating an AR(I)MA model – Autoregressive (Integrated) Moving Average. GARCH models often assume non-integrated series and thus set the integration order to zero, making it an ARMA model instead. AR(p) models are simply regressions of function values against function values lagged from 1 to p time periods back. Integrated models apply differencing to the time series to remove order-of-integration effects and recover stationarity. ("The index closed today at 10 000 points" is the integrated view, whereas saying "the index closed down 1.2%" requires differencing of the time series to get the rate of change – most economics and finance time series fall in this area, but are so commonly adjusted to differenced data that they simply enter the model already differenced, as does my data.) MA models apply a moving-average operator, in most GARCH applications resulting in simply weighting two moving averages of different lengths dynamically. The time series I evaluated used already-differenced data with the stationarity conditions fulfilled, so the ARMA model was applied and parametrized as ARMA(1,1), as is common for financial time series without large autocorrelation factors beyond the immediately preceding period.
  2. Apply the Generalized Autoregressive Conditional Heteroskedasticity model to the error terms of the ARMA model. The GARCH model looks at both the prior variance from 1 to p periods back and the prior unexplained error from 1 to q periods back when evaluating the current variance (the standard one-lag form is written out just after this list). Assuming market efficiency means it's normal not to model more than one parameter for either p or q, as older data should already be priced in, and lag terms are still carried implicitly through the recursive nature of the variance model. Using monthly data, this was seen as a very reasonable assumption that was later confirmed by the GARCH test diagnostics. My model applies the computed Kullback-Leibler divergences as external regressors for the error terms not covered by the variance input regression, again only using one lag period under the expectation of market efficiency and the definition on consecutive distributions.
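In textbook notation (mine, not quoted from the thesis), the ARMA(1,1) mean equation and a GARCH(1,1) variance equation with one external regressor x look roughly like:

```latex
r_t = \mu + \phi\, r_{t-1} + \theta\, \varepsilon_{t-1} + \varepsilon_t
\qquad
\sigma_t^{2} = \omega + \alpha\, \varepsilon_{t-1}^{2} + \beta\, \sigma_{t-1}^{2} + \xi\, x_{t-1}
```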

2.4.2 Exponential GARCH:

Vanilla GARCH has a number of drawbacks:

  1. Practically, there is a requirement to constrain regressors to a range from -1 to 1, which creates dimensionality problems on unbounded probability spaces without likelihood constraints for extreme values. In English: converting variable values to a [-1, 1] range can be a PITA both theoretically and practically, giving you both headaches and model errors.
  2. It can’t really tell the difference between a positive error and a negative error. When was the last time you saw the stock markets hit an ATH with a high volatility index reading?
  3. It equal-weights all observations in the ARMA model generation, where a human might trade more off the latest moves and market levels. This is, however, partially covered by appropriate MA model selection between the two component moving averages. The iterative nature of the GARCH model itself, particularly when using higher p and q numbers, essentially leaves a cascade of regression weights which is much heavier for recent values than for older ones. In my research this isn't a major concern either, since I'm using fewer than 200 data points and there is simply not enough time for data periodicity errors to manifest themselves in a consistent manner.

Using an exponential GARCH variance estimation (natural logarithms on everything in the GARCH model except the induced errors, and changing the unforeseen error evaluation model to one that splits errors into magnitude and direction) adjusts for issues 1 and 2. Models can be tested really quickly without detailed error series evaluation prior to testing and the data holds up. The detailed mathematics are a little bit obscure and blends discretized math with continuous functions and lots of wider function references so I’ll spare you that pain.
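For concreteness, a minimal rugarch sketch of this kind of setup – variable names are hypothetical and this is not the thesis script:

```r
library(rugarch)

# eGARCH(1,1) variance with an ARMA(1,1) mean and one external variance regressor
spec <- ugarchspec(
  variance.model = list(model = "eGARCH", garchOrder = c(1, 1),
                        external.regressors = matrix(entropy_series, ncol = 1)),
  mean.model     = list(armaOrder = c(1, 1), include.mean = TRUE),
  distribution.model = "norm"
)
fit <- ugarchfit(spec, data = monthly_index_returns)
show(fit)   # the external regressor appears among the variance-equation coefficients
```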

2.4.3 Time-Series Regression (TSR) and Johansen Co-integration:

When doing TSR on data where each value is the previous value plus a small change driven by an external variable (for example, GDP by the number of employed persons, or stock valuations by interest rate changes), simply running ordinary regressions on the levels themselves will yield considerably smaller relative error terms, since you're diluting the error by the base level of the signal, making correlations look huge. But if you isolate the changes alone, you remove the base level and look at how the new information in any independent-variable change affects the level of the dependent variable (reducing spurious correlation). This can be checked with a family of statistical methods called unit root tests, which tell you whether the time series you are working with are stationary and whether differencing needs to be done because the correlation with the previous data point in the series makes new data statistically insignificant. This process can be iterated until you find a "core" stationary series, which can then be summed over (and over and over) until you recover the original series, but the statistical significance only applies to the unintegrated series.

Once that has been done, you can evaluate series that are so-called co-integrated, where two or more series themselves are integrated, but when you add them together with optimal coefficients you get a stationary series. This generally suggests a really deep underlying connection between the individual series, and can be used to indicate whether there may be external factors influencing all of them. One of the more popular tests for this is Johansen co-integration, which was used here as well.
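A sketch of how these checks are commonly run in R, using the widely used tseries and urca packages (not necessarily the exact tooling behind the thesis; series names are hypothetical):

```r
library(tseries)
library(urca)

adf.test(realized_vol)    # augmented Dickey-Fuller unit root test on one series

# Johansen co-integration test across the volatility-related series
jo <- ca.jo(cbind(historical_vol, implied_vol, realized_vol),
            type = "trace", ecdet = "const", K = 2)
summary(jo)               # trace statistics against critical values
```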

All of my series were either stationary or integrated of the first order, thus requiring at most one differencing before further use in the TSR and eGARCH methodologies. The only place I could really find significant co-integration effects was between the historical variance-model, implied, and realized variance-model volatilities. This implies that statistical variance impacts implied volatility and is a main predictor of future volatility when measured in this way. The other time series did not have strong direct connections in my model where all adjacent series were integrated of order one, thus rendering a Johansen co-integration test inapplicable.

2.4.4 Overall Model Considerations:

When evaluating a set of time series with strong correlation, co-integration mechanics and first-order integration, it will be very difficult for any other model to "crack" the unexplained variance, since the inherent errors in the current and preceding terms will likely expand beyond the explained variance of external variables. (e)GARCH gets around this, in a way, by checking whether there is an unexplained consistency in the errors and then mapping that to q lagged values. This is where it is easier to drop in an external regressor, because that consistency can be tested against: does the unexplained error term fall while the external regressor gains statistical significance? GARCH tests on well-defined variable models are thus one of the best ways of checking whether a model is genuinely improved by adding extra variables, or whether the TSR was right that the error lay outside of the model.

Similarly, one can sort of look at the GARCH methodology as internally running a differencing, since we are evaluating model variance by prior variance values (off an ARMA model that more or less does the same thing!) and we are thus able to evaluate variable contribution to errors separately from the high base values of highly correlated prior-period variances.

3. Brief Results and My Interpretations:

3.1 TSR Results and Discussion:

As has been shown several times both in this research and in the available finance literature, volatility is autocorrelated over the long term – prior changes in volatility have a better-than-random chance of predicting future volatility changes. Similarly, implied volatility on market contracts follows this by pricing in prior changes in volatility, to the extent that TSR models cannot use any of the other variables tested here to consistently and significantly improve said models in the long run. Point 1 for the efficient market hypothesis. This is somewhat to be expected: after all, this is what hundreds of billions of dollars on each considered market is quoted for daily, and those investors probably want to at least not be beaten to their trades by someone doing middle-school statistics homework. Expecting that to happen is a little bit like expecting to be better than an architect at re-drawing the house he already drew!

3.2 eGARCH Results and Discussion:

The really interesting thing showing up in the results is that, when running GARCH models, volatilities realized or implied are either marginally useful or completely useless, and do not significantly improve the model overall. Looking at the GARCH formula, the prior variance (versus the ARMA model) is already considered, and if it is strongly correlated with the next value, considering it twice doesn't mean that we explain any more of the ARMA model errors. Realized volatility not adding explanatory power indicates that the ARMA model is at least as good as the TSR model at evaluating the underlying levels, and comes out on top when realized volatility just "tries to explain the same errors again". Implied volatility isn't very useful either! Here, a case could be made that the market prices in both the prior variance and any external factors consistent enough to be revealed by the GARCH model, but this does not seem statistically supported!

3.3 All the Goodies and How to Make Money With Entropy + eGARCH:

Running eGARCH with my entropy models, the relative signals of those entropy models, and liquidity as external regressors felt like one big victory lap. On the entropy side, both indices gave a massive eGARCH response to overall entropy, whereas the shape-only (distribution) entropy was not consistently statistically significant and even had the reverse sign. (Increasing overall entropy led to increasing variance, while increasing distribution-only entropy led to decreasing variance.) The most important entropy-based statistical estimator, however, used a formula where both were present, capturing the excess entropy of the distribution by dividing overall entropy by distribution entropy. This was again really statistically significant and had a positive sign relative to next-month variances, indicating that the distributions of stocks are significant in forecasting longer-term volatility while not being priced into implied volatility, as no statistically significant result of even close to that size could be observed when using implied volatility as an external eGARCH regressor!

This isn’t even the best part! What is the best part? The excess entropy estimator very closely matched liquidity in how it acted as an eGARCH external regressor, with liquidity being one or two orders of magnitude more statistically significant. It means that from 90% to 99.5% of the time, my excess entropy model models the effect of liquidity in excess of other variables, by simply looking at how stocks move individually. This should be independent on whether you can actually get quotes on liquidity or not! The model can thus cover markets where there are no summaries for volume if you can model the market or contracts according to the entropy framework demands (more articles later).

If you’re sitting on a bank trading desk, got through the whole post down to this part, and understood it, surely you understand just how much money can be made exploiting these features in the market?
