Interval scorers#

Skchange provides modular and flexible algorithms for change and anomaly detection that also runs fast. Interval scorers, introduced here, are the main components that make this possible.

All detectors are composed of a user-specified interval scorer. It is responsible for evaluating a score over a large amount of data intervals. The choice of interval scorer represents the choice of distributional feature(s) to detect changes in, for example the mean, the variance, the full distribution, or something else. Apart from being constructed, interval scorers are not meant to be used directly. They are used internally by the detectors to evaluate the score. To make full use of the library, however, it is important to understand how they work and how they can be used to build more complex detectors.

There are four types of interval scorers in Skchange:

  1. cost

  2. change_score

  3. saving

  4. local_anomaly_score

Different detectors use different types. We will now go through each of them briefly and show how they relate.

To demonstrate, we use the dataset below.

[1]:
from skchange.datasets import generate_piecewise_normal_data

x = generate_piecewise_normal_data(
    means=[0, 5],
    variances=[16, 1],
    lengths=[120, 80],
    seed=101,
)
x
[1]:
0
0 2.413207
1 2.977178
2 -1.238747
3 1.469285
4 6.841577
... ...
195 6.290761
196 3.462099
197 5.835183
198 5.477154
199 4.570239

200 rows × 1 columns

[2]:
import plotly.express as px
import plotly.io as pio

pio.renderers.default = "notebook"

px.line(x)

Cost#

A cost measures the cost/loss/error of a model fit to a data interval X[s:e]. Here is an example:

[3]:
from skchange.costs import GaussianCost

cost = GaussianCost()
cost.fit(x)
cost.evaluate([[0, 10], [110, 130], [140, 160]])
[3]:
array([[ 48.40630643],
       [116.35768377],
       [ 59.60140582]])

The GaussianCost computes twice the negative log-likelihood of a univariate Gaussian model with a constant mean and variance over the specified intervals x[0:10], x[110:130], and x[140:160]. As you can see, the cost is higher for the interval that contains the change point at index 120. This is an interval where the constant Gaussian model fits poorly.

The computational bottleneck of most change detection algorithms is to evaluate an interval score over a large number of intervals. In Skchange, this is solved as follows:

  • fit receives the data x and precomputed quantities that can help speed up the interval evaluations.

  • evaluate receives an integer array of intervals and evaluates many scores in one call. This enables custom optimisation strategies for each scorer. For example, numba is often used to efficiently loop over all the input intervals and compute the scores from the precomputed quantities.

For the GaussianCost, the cumulative sums and cumulative sums of squares of x are precomputed in fit.

Change score#

Another type of interval score are change scores. A change score takes in a start-split-end configuration (s, k, e) and measures the degree of change between two adjacent intervals x[s:k] and x[k:e]. They can be statistical tests, time series distances, or any other measure of difference. A famous example is the CUSUM score for a change in mean, shown below.

[4]:
from skchange.change_scores import CUSUM

score = CUSUM()
score.fit(x)
score.evaluate([[0, 5, 10], [110, 120, 130], [140, 150, 160]])
[4]:
array([[ 0.70707215],
       [12.51782682],
       [ 0.80226111]])

Again, observe that the change score is highest for the interval that contains the change point at index 120.

Change score from cost#

Now comes one of the strengths of interval scorers: A cost can always be used to make a change score. This is done by using the ChangeScore class as follows:

[5]:
from skchange.change_scores import ChangeScore
from skchange.costs import L1Cost

score = ChangeScore(L1Cost())
score.fit(x)
score.evaluate([[0, 5, 10], [110, 120, 130], [140, 150, 160]])
[5]:
array([[ 0.4173491 ],
       [17.65715966],
       [ 0.        ]])

Internally, the ChangeScore class uses the following expression to compute the change score:

score.evaluate([start, split, end]) = (
    cost.evaluate([start, end]) - (cost.evaluate([start, split]) + cost.evaluate([split, end]))
)

You can read this expression as “score = the cost of the interval without a change point - cost of the interval with a single change point”.

In Skchange, you can always pass a cost to a change detector, even if the detector expects a change score. The ChangeScore class will convert the cost to a change score internally. This is useful because it allows you to use the same cost for different detectors, without having to worry about whether they expect a cost or a change score.

At the same time, we also support change scores that cannot be reduced to costs. This is different from e.g. the ruptures library. There are quite a few important scores that can not be reduced to costs, such as the Mann-Whitney U test, the Kolmogorov-Smirnov test, as well as scores for sparse change detection in high-dimensional data. Moreover, change scores that are implemented directly can often be more computationally efficient.

Anomaly scores#

There are two types of anomaly scores in Skchange: saving and local_anomaly_score. Both measure to which degree a segment of the data is anomalous, but differ in how they define the normal or baseline data behaviour. Like change scores, a cost can always be used to make an anomaly score.

Local anomaly score#

A local anomaly score receives a start-split1-split2-end configuration (s, k1, k2, e) and measures the degree of anomalousness of the interval x[k1:k2] compared to the local context to the left and right, x[s:k1] and x[k2:e]. In the literature, such segment anomalies are sometimes called epidemic changes.

As for change scores, a cost can always be used to create a local anomaly score. Here is an example:

[6]:
from skchange.anomaly_scores import LocalAnomalyScore

score = LocalAnomalyScore(GaussianCost())
score.fit(x)
score.evaluate([[0, 5, 10, 15], [70, 120, 150, 160]])
[6]:
array([[ 2.24663429],
       [92.8856591 ]])

A down-side of the local anomaly score is that they often can be heavy to compute, as there are so many possible combinations of start-split1-split2-end configurations. The saving score is often a more efficient alternative.

Saving#

A saving takes in a start-end configuration (s, e) and measures the difference in cost between a locally and globally estimated model parameter over the interval x[s:e]. The globally estimated model parameter represents the normal data behaviour and is user-specified. In practice, it should be estimated robustly from the data. An example is given below.

[7]:
from skchange.anomaly_scores import Saving
from skchange.costs import L2Cost

baseline_mean = x.loc[: x.shape[0] / 2].median()
print(f"Baseline mean: {baseline_mean.values[0]}")

baseline_cost = L2Cost(param=baseline_mean)
score = Saving(baseline_cost)
score.fit(x)
score.evaluate([[0, 10], [110, 130], [140, 160]])
Baseline mean: 0.2605652216494266
[7]:
array([[ 40.33425723],
       [ 66.00722468],
       [426.62064121]])

Let us go through the example step by step:

  1. We start by estimating the baseline mean of the data by the median of the first half of the dataset. It is estimated to be 0.18.

  2. A baseline cost is then constructed as the L2 cost of the data against the fixed and constant baseline mean. Note the special param parameter for the cost. By default it is None, meaning that the model parameters (the mean in this case) will be optimised for each interval input to evaluate. If it is not None, the model parameters will be fixed to the specified values (0.18 in this case).

  3. Finally, the fixed cost is used to create a saving. Internally, the Saving uses the following expression to evaluate the score:

    optim_cost = baseline_cost.clone().set_params(param=None)
    score.evaluate([start, end]) = (
        baseline_cost.evaluate([start, end]) - optim_cost.evaluate([start, end])
    )
    

In this example, you can see that the saving is highest for the last interval. This is because all the data in the last interval has a mean significantly different from the baseline mean of 0.18.

General interval scorer API#

  • Initialise: Set the parameters of the interval scorer.

  • fit(self, X, y=None) -> self: Fit the interval scorer to the data X. Prepares the interval scorer for efficient evaluation of many intervals.

  • evaluate(self, cuts) -> np.ndarray: Evaluate the score over the intervals specified by cuts. A “cut” is the generic term for a start-end, start-split-end or start-split1-split2-end configuration that the interval scorer can evaluate. The cuts parameter is an array of integers, where each row represents a cut. Other types of interval scorers and cuts may be supported in the future. The output is always a 2D numpy array with one row per cut. The number of columns is either 1 or the number of columns in the input data, depending on the concrete interval scorer.

Custom interval scorers#

It is easy to create custom interval scorers in Skchange. Fill-in templates for the different types of interval scorers can be found in the extension_templates folder on Github.