How we approached IT incident predictions through chaos theory

In this blog, we have introduced the basics of chaos theory and complex systems, including how system incidents and failure prediction have been tackled in the past through deep learning. We have also offered ideas on how chaotic time series analysis can be leveraged to approach this problem.

Predicting IT system incidents

IT teams in large technology companies spend years building networked systems and connected applications. Now, especially in the middle of a worldwide pandemic, software reliability is paramount. It ensures that IT applications and services continue working to keep businesses rolling, employees productive, and customers satisfied.

A single IT incident can put the entirety of the business at risk, even more so in 2020 when we are relying on technology more than ever. In such a scenario, predicting system failures and incidents can provide a major edge over the competition, and can prevent service disruptions and suspensions.

Using machine learning to try and predict IT incidents seems to be a natural next step, and this has been tried before with varying degrees of success. Certain signals that appear with high frequency when an incident is about to occur have been identified and listed here.

They include: 

  • Problem backlog or average problem age
  • Day of week or day of month
  • Planned change activity
  • Technology health
  • Days between major incidents
  • Days since a major incident
  • The minor incident growth rate

There have also been attempts to predict system failure by examining console logs and developing deep-learning based approaches. Two such approaches are listed below:

Real-time incident prediction for online service systems 

Automated IT system failure prediction: A deep learning approach

Where does chaos come in?

After going through resources on predicting system incidents using ML-based approaches, you might wonder what role chaos plays in all this. In fact, it has been discussed and observed that software development and processes are complex systems, and that software failure has chaotic characteristics, according to these articles on ResearchGate and in the Proceedings of the World Congress on Engineering 2012.

We build on this idea and propose how chaos theory can be used to go beyond software failure prediction to the prediction of IT system incidents as a whole, which covers downtime, bugs, and so on.

What is chaos?

Every day, we come across many systems that evolve and change with time. Given their nature, it’s very difficult to predict their future state in the long term, even with robust statistical models. Examples include weather, turbulent fluids, epidemics, and stock market indices. These systems are generally referred to as dynamical systems.

Such systems are said to exhibit ‘chaos’. In simple terms, chaos refers to any state of confusion or disorder that shows the absence of some kind of particular order, according to this article in the Hindawi journal.

Chaos, however, is not a new concept. Back in the 1800s, the King of Sweden had posed a problem dealing with chaos. The goal was to predict the movement of three bodies under gravity using Newton’s Laws.

French mathematician Henri Poincaré wrote about the problem, describing all the reasons why it couldn’t be solved. One of the most important reasons he highlighted was how small differences in the initial stages of the system can lead to bigger differences at the end.

In the mid-20th century, while mathematician Edward Lorenz was studying a simple model of the Earth’s weather on an early computer, he observed that restarting a simulation from slightly rounded values generated wildly different results. What Lorenz had stumbled upon was chaos.

As we discussed before, chaotic systems are everywhere. As a matter of fact, they dominate the universe. The motion of a simple pendulum is predictable, but stick another pendulum at the end of it and we have a simple but very chaotic system. The three-body problem and the population of a species over time are both chaotic systems. Chaos is truly everywhere.

The sensitivity of the evolution of chaotic systems to the initial conditions is the main reason why it’s so difficult to make reliable predictions: we never know exactly, precisely, to the infinite decimal point the state of the system. And, according to an article in Space.com, if we’re off even by the tiniest fraction, after enough time we’ll have no idea what the system is doing. 

Random trajectories of the three-body problem based on different initial conditions. Source: The three-body problem

Characteristics of chaotic systems 

According to the article, Application of Chaos Theory in the Prediction of Motorised Traffic Flows on Urban Networks, chaotic systems display the following characteristics:

  • Sensitivity to initial conditions: As we mentioned before, chaotic systems are highly dependent on the initial conditions of the system. Trajectories of two different but close initial conditions diverge exponentially from each other as the system evolves in ‘phase space’. A phase space is the space in which all possible states or configurations of a dynamic system are represented, each denoted by a unique point in the space.
  • Determinism: Chaotic systems are strictly deterministic. A deterministic system is one in which for a given time interval there is only one future state, which follows from the current state. These systems can be described by Ordinary Differential Equations (ODEs).
  • Nonlinearity: A nonlinear system is a system whose inputs and outputs aren’t proportional to each other. The relationship between variables is dynamic in this case.
  • Instability: Chaotic systems have sustained irregular behavior caused by sensitive dependence on initial conditions. This means that predictions for a given system can only be made to high precision for short intervals of time.
  • Attractors: These are sets of states (d-dimensional), i.e., points in the phase space, that are invariant under the system’s dynamics and toward which nearby states converge asymptotically. The system can be started anywhere and it’ll still end up evolving into one of these states. For example, if we drop a ball in a valley, no matter where we drop it from, it’ll always end up at the bottom of the valley. The bottom here is the attractor of this system.
    Chaotic systems have ‘strange attractors’. If a system takes an aperiodic and irregular shape, and never repeats itself in time, the system is said to have a strange attractor.
  • Fractal dimensionality: We already know that the geometrical dimension of a line, plane, and box is 1, 2, and 3, respectively. However, many examples and objects in our everyday life are not geometrically smooth like the ones mentioned above. Complex, non-integer dimensions are called fractal dimensions. Fractal dimensions provide a measure of the complex nature of chaotic systems.
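
To make the first characteristic concrete, here is a minimal sketch of sensitivity to initial conditions. It uses the logistic map, a standard textbook example of a chaotic system that is not taken from any data discussed in this post. The parameter r = 4, the starting value 0.2, and the offset of 1e-9 are illustrative assumptions.

```python
def logistic_map(x, r=4.0):
    """One step of the logistic map x_{n+1} = r * x_n * (1 - x_n)."""
    return r * x * (1.0 - x)


def trajectory(x0, steps, r=4.0):
    """Iterate the map `steps` times from x0 and return all visited states."""
    states = [x0]
    for _ in range(steps):
        states.append(logistic_map(states[-1], r))
    return states


# Two initial conditions that differ by one part in a billion (illustrative values).
a = trajectory(0.2, 60)
b = trajectory(0.2 + 1e-9, 60)

# Distance between the two trajectories at every step.
gaps = [abs(x - y) for x, y in zip(a, b)]
for n in (0, 10, 20, 30, 40, 50):
    print(f"step {n:2d}: gap = {gaps[n]:.3e}")
```

The gap starts out far below anything a monitoring system would measure and grows until the two trajectories are effectively unrelated, which is why long-term prediction breaks down even though every step is deterministic.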

  

The Mandelbrot set, which has its origin in complex dynamics. The boundary of this set is what is called a fractal, which serves as the visual representation of chaos. Source: Mandelbrot set

Making predictions in chaotic systems

To understand how we can predict the evolution of a chaotic system, we need to understand how chaos is generated. Mathematically, chaos can be produced by both ‘discrete’ and ‘continuous’ equations.

Discrete systems can be expressed as: 

$x_{n+1} = f(x_n)$

Continuous systems can be expressed as:

$\frac{dx(t)}{dt} = F(x(t))$

A necessary condition for the above systems to show chaos is that f and F need to be non-linear.
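
As a concrete instance of each form, the sketch below pairs a discrete system (the logistic map as f) with a continuous one (the Lorenz equations as F). Both are standard textbook examples chosen for illustration, not systems analyzed in this post. The parameter values, the step size `dt`, and the fourth-order Runge-Kutta integrator are assumptions.

```python
def f(x, r=4.0):
    """Discrete system: logistic map, nonlinear because of the x * x term."""
    return r * x * (1.0 - x)


def F(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Continuous system: right-hand side of the Lorenz equations.

    The products x * z and x * y are what make F nonlinear.
    """
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance dx/dt = F(x) by one fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = F(state)
    k2 = F(shifted(state, k1, dt / 2))
    k3 = F(shifted(state, k2, dt / 2))
    k4 = F(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate(state, dt=0.01, steps=3000):
    """Return the list of states visited, starting from `state`."""
    path = [state]
    for _ in range(steps):
        path.append(rk4_step(path[-1], dt))
    return path


# Discrete: apply f repeatedly. Continuous: integrate F over time.
discrete_path = [0.3]
for _ in range(5):
    discrete_path.append(f(discrete_path[-1]))

continuous_path = integrate((1.0, 1.0, 1.0))
print(discrete_path)
print(continuous_path[-1])
```

Nonlinearity is necessary but not sufficient: the same logistic map settles to a fixed point for smaller values of r.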

In practice, we usually do not have the dynamical equations shown above, like in the case of earthquakes, system incidents, and the stock market. Therefore, the details of the system equations in the phase space and the attractors are unknown. What we have or can obtain is some time series from one or a few of the dynamical variables of the system. Due to this incomplete knowledge, characterizing chaotic systems poses a major challenge.

What we need to do for prediction is to reconstruct the phase space and the attractor.

An interesting question to ponder over is how one goes from one or a few time-series to the multivariate state or phase space that is required for chaotic motions to occur.

Fortunately, this has been extensively studied. The basic assumption we need to make is that the measured time series comes from the attractor of the unknown system with ‘ergodicity’. (Ergodicity expresses the idea that a point of a moving system, either a dynamical system or a stochastic process, will eventually visit all parts of the space that the system moves in, uniformly and randomly.) Then one can use the measured time series to figure out the properties of the attractor, such as its dimension, its dynamical skeleton, and its degree of sensitivity to initial conditions.

Let’s see how exactly we can figure out the above parameters for the system.

Takens’ Fundamental Embedding Theorem is considered the foundation of all chaos-based predictions. According to Wikipedia: “It gives the conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system. The reconstruction preserves the properties of the dynamical system that do not change under smooth coordinate changes (i.e., diffeomorphisms), but it does not preserve the geometric shape of structures in phase space.”

A ‘diffeomorphism’ is a smooth, invertible map, with a smooth inverse, from the points of one ‘manifold’ to the points of another.

A ‘manifold’ is a space that locally resembles ordinary Euclidean space, such as a curve or a surface.

Using results from the above theorem, we can reconstruct a vector to represent the trajectory of the attractor of the system from the given time series. We can also use Takens’ theorem to figure out the dimension of the embedding. Once these two have been worked out, we can start making predictions.
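
The reconstruction step is usually done with time-delay embedding, where each reconstructed state is a vector of lagged samples from the series. The sketch below is a minimal version of that idea. The function name `delay_embed`, the toy readings, and the values of `dim` and `delay` are hypothetical. In practice, the embedding dimension and delay have to be estimated from the data rather than picked by hand.

```python
def delay_embed(series, dim, delay):
    """Build delay vectors (s[t], s[t + delay], ..., s[t + (dim - 1) * delay])."""
    if dim < 1 or delay < 1:
        raise ValueError("dim and delay must be positive integers")
    span = (dim - 1) * delay  # distance between first and last sample of a vector
    if len(series) <= span:
        raise ValueError("series is too short for this dim and delay")
    return [
        tuple(series[t + k * delay] for k in range(dim))
        for t in range(len(series) - span)
    ]


# Toy metric readings (made-up numbers, purely illustrative).
readings = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
vectors = delay_embed(readings, dim=3, delay=2)
print(vectors)
# [(0.0, 2.0, 4.0), (1.0, 3.0, 5.0), (2.0, 4.0, 6.0)]
```

Each tuple is one point in the reconstructed phase space, so a one-dimensional series of N samples becomes N - (dim - 1) * delay points in `dim` dimensions.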

The math for the predictions is a bit involved, so we will cover it in a separate blog later on.

Prediction relies on computing the ‘Largest Lyapunov Exponent’, which most researchers in the literature report as the best method so far for analyzing and predicting the chaotic behavior of a given complex system.
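
The largest Lyapunov exponent measures the average exponential rate at which nearby trajectories separate, and a positive value indicates chaos. The sketch below shows the computation for the simplest possible case, a one-dimensional map whose equation is known (the logistic map, used here only as an illustration). This is not the estimator needed for measured data, where the equations are unknown and the exponent must be estimated from the reconstructed phase space. The function name and default arguments are assumptions.

```python
import math


def largest_lyapunov_logistic(r, x0=0.2, transient=1000, steps=100_000):
    """Estimate the Lyapunov exponent of x_{n+1} = r * x_n * (1 - x_n).

    For a one-dimensional map the exponent is the long-run average of
    ln|f'(x_n)|, where f'(x) = r * (1 - 2x).
    """
    x = x0
    for _ in range(transient):  # discard the start so x settles onto the attractor
        x = r * x * (1.0 - x)

    total = 0.0
    for _ in range(steps):
        slope = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(slope, 1e-300))  # guard against log(0) at x = 0.5
        x = r * x * (1.0 - x)
    return total / steps


print(largest_lyapunov_logistic(4.0))  # chaotic regime: positive
print(largest_lyapunov_logistic(2.5))  # stable fixed point: negative
```

For r = 4 the exact value is known to be ln 2, so the estimate can be checked against it. For r = 2.5 the map converges to a fixed point and the exponent is negative.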

Conclusion – applying chaos theory to predict IT system failure 

We started this post by talking about IT system reliability. We are applying the ideas of chaos theory to improve system reliability for our customers. Freshservice customers send their system data in the form of metrics, logs, and traces that are generated by the various IT systems in their organization. Some of this data is continuous (metrics), while some is discrete (logs).

The machine learning team at Freshservice is developing tools that can compute the properties of the dynamical system, including its Lyapunov exponents. The Lyapunov exponents determine if the system is chaotic or not. If the system is chaotic, we change the representation of the system from one-dimensional time domain data to an m-dimensional reconstructed phase space. 

Once we have the new representation of our dynamical system, we can apply different methods depending on the problem we are trying to solve. For example, if we have to forecast the future state of the system, we look back at how it evolved from a similar situation in the past. For predictions that are more than one step ahead, the procedure is iterated by successively merging the predicted values with the observed data, as explained in this article on InTechOpen.
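
The forecasting idea described above can be sketched as a nearest-neighbor search in the reconstructed phase space: find the past state most similar to the current one and reuse what happened next. This is a minimal illustration under assumed settings, not the production method. The function names, the squared Euclidean distance, the single-neighbor choice, and the values of `dim` and `delay` are all hypothetical.

```python
def delay_vector(series, t, dim, delay):
    """Delay vector ending at index t: (s[t - (dim - 1) * delay], ..., s[t])."""
    return tuple(series[t - k * delay] for k in range(dim - 1, -1, -1))


def predict_next(series, dim=3, delay=1):
    """One-step forecast: find the most similar past state and reuse its successor."""
    span = (dim - 1) * delay
    last = len(series) - 1
    if last - 1 < span:
        raise ValueError("series is too short for this dim and delay")
    current = delay_vector(series, last, dim, delay)

    best_t, best_dist = None, float("inf")
    for t in range(span, last):  # every past state that has a known successor
        candidate = delay_vector(series, t, dim, delay)
        dist = sum((a - b) ** 2 for a, b in zip(current, candidate))
        if dist < best_dist:
            best_t, best_dist = t, dist
    return series[best_t + 1]


def forecast(series, horizon, dim=3, delay=1):
    """Multi-step forecast: append each prediction to the data and predict again."""
    extended = list(series)
    for _ in range(horizon):
        extended.append(predict_next(extended, dim, delay))
    return extended[len(series):]


# Toy periodic history (made-up values) where the right answer is obvious.
history = [1.0, 2.0, 3.0, 4.0] * 5
print(forecast(history, horizon=4))
# [1.0, 2.0, 3.0, 4.0]
```

On a periodic toy series the forecast is exact. On chaotic data, errors grow with the horizon because each predicted value feeds the next prediction, so only short horizons can be expected to be reliable.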

For anomaly detection, we use techniques such as change point detection on the Lyapunov exponents.
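
Since the post does not say which change point method is used, the sketch below shows one simple variant as an assumption: it looks for the single split that best explains a shift in the mean of a sequence of exponent estimates. The function names, the `min_size` parameter, and the exponent values are hypothetical.

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def find_change_point(values, min_size=3):
    """Return the index where splitting `values` best explains a shift in mean.

    Returns None if there are too few points to form two segments.
    """
    best_index, best_cost = None, float("inf")
    for i in range(min_size, len(values) - min_size + 1):
        # Cost of modelling each side of the split by its own mean.
        cost = sum_sq_dev(values[:i]) + sum_sq_dev(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Hypothetical Lyapunov exponent estimates from consecutive sliding windows.
exponents = [0.10, 0.12, 0.09, 0.11, 0.10, 0.45, 0.48, 0.44, 0.47, 0.46]
print(find_change_point(exponents))
# 5
```

This version always returns its best split, even when nothing has changed, so a real detector would also need a threshold on how much the split reduces the cost before raising an anomaly.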

Reliability of networks is dependent on detecting anomalies ahead of failure. Analysis of our customers’ data shows that forecasting and anomaly detection techniques in chaotic systems can help in improving the overall reliability of our customers’ systems.

Cover and inline image: Vignesh Rajan