generate_piecewise_normal_data

Generate piecewise multivariate normal data.

Generates piecewise multivariate normal data in which the distributional change from one segment to the next may be sparse. For example, the difference between two adjacent mean vectors may have only a few non-zero elements.

Parameters:

means: float, list of float, or list of np.ndarray, optional (default=None)

Means for each segment. They are recycled to match the number of segments specified by lengths or n_segments. If floats, they are used for all affected variables (see proportion_affected). If None, random means are generated from a normal distribution with mean 0 and standard deviation 2.
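
One way to picture the recycling rule is as cyclic repetition of the supplied values until there is one per segment. The sketch below rests on that assumption and is not the library's own code; `recycle` is a hypothetical helper introduced only for illustration.

```python
from itertools import cycle, islice


def recycle(values, n_segments):
    """Hypothetical helper: repeat `values` cyclically up to n_segments items."""
    return list(islice(cycle(values), n_segments))


# Two means spread over five segments (assumed cyclic recycling).
segment_means = recycle([0.0, 5.0], 5)
print(segment_means)  # [0.0, 5.0, 0.0, 5.0, 0.0]
```

Under this reading, passing two means with a larger n_segments produces data that alternates between the two levels.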

variances: float, list of float, or list of np.ndarray, optional (default=1.0)

Variances or covariance matrices for each segment. Vectors are treated as diagonal covariance matrices. They are recycled to match the number of segments specified by lengths or n_segments. If floats, they are used for all affected variables (see proportion_affected). If None, random variances are generated from a chi-squared distribution with 2 degrees of freedom.

lengths: int, list of int, or np.ndarray, optional (default=None)

The segment lengths. There are three possible cases:

list or numpy array: Custom set of segment lengths.
int: Common length of each of the n_segments equal-length segments.
None: Generate n_segments random segment lengths with a total sample size of n_samples.
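
The first two cases can be sketched in plain Python as follows. `resolve_lengths` is a hypothetical helper, not part of the library, and the None case is left out because the text does not say how the random lengths are drawn.

```python
def resolve_lengths(lengths, n_segments):
    """Hypothetical helper: expand `lengths` into one length per segment."""
    if isinstance(lengths, int):
        # int: every one of the n_segments segments has this length
        return [lengths] * n_segments
    # list or array: use the custom lengths as given
    return [int(length) for length in lengths]


equal = resolve_lengths(5, n_segments=2)
custom = resolve_lengths([3, 4, 3], n_segments=3)
print(equal, sum(equal))    # [5, 5] 10
print(custom, sum(custom))  # [3, 4, 3] 10
```

In both cases the total number of rows is the sum of the segment lengths, which matches the 10 rows produced by lengths=5, n_segments=2 in Example 1.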

n_segments: int (default=3)

Number of segments to generate if lengths is an integer or None.

n_samples: int (default=100)

Total number of samples to generate if lengths is not specified.

n_variables: int, optional (default=1)

Number of variables (columns) in the generated data.

proportion_affected: float, list of float, or np.ndarray, optional (default=None)

Proportion of variables affected by each change. I.e., the proportion of non-zero elements in the differences between adjacent means or variances. Only applies when means and variances are None or floats and n_variables > 1. All proportions must be in (0, 1]. The number of affected variables is determined as int(np.ceil(n_variables * proportion_affected)). The proportions are recycled to match the number of segments specified by lengths or n_segments. If None, a random proportion of variables is affected.
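
The formula for the number of affected variables can be checked by hand. The snippet below is a minimal sketch that mirrors the stated expression with the standard library's `math.ceil` in place of `np.ceil`; `n_affected` is a hypothetical name, not a library function.

```python
import math


def n_affected(n_variables, proportion_affected):
    """Hypothetical helper: affected-variable count per the stated formula."""
    if not 0 < proportion_affected <= 1:
        raise ValueError("proportion_affected must be in (0, 1]")
    return int(math.ceil(n_variables * proportion_affected))


# With 10 variables: rounding up means at least one variable is always affected.
counts = [n_affected(10, p) for p in (0.05, 0.25, 1.0)]
print(counts)  # [1, 3, 10]
```

Because the product is rounded up, any valid proportion affects at least one variable, and a proportion of 1 affects all of them.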

randomise_affected_variables: bool, optional (default=False)

If True, the affected variables are randomly selected for each change point. If False, the first variables are affected.

seed: np.random.Generator | int | None, optional

Seed for the random number generator or a numpy random generator instance. If specified, this ensures reproducible output across multiple calls.

return_params: bool, optional (default=False)

If True, the function returns a tuple of the generated DataFrame and a dictionary with the parameters used to generate the data.

Returns:

pd.DataFrame: DataFrame with generated data.
dict: Dictionary containing the parameters used to generate the data. Keys: “n_segments”, “n_samples”, “means”, “variances”, “lengths”, “change_points” (the start indices of each segment), and “affected_variables” (which variables among 0:n_variables are affected by each change). Returned only if return_params is True.
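
The relationship between "lengths" and "change_points" can be sketched as a cumulative sum. This assumes, based on the output of Example 2 (lengths [5, 5] giving change_points [5]), that the start index 0 of the first segment is not listed; `change_points_from_lengths` is a hypothetical helper.

```python
from itertools import accumulate


def change_points_from_lengths(lengths):
    """Hypothetical helper: start indices of every segment after the first."""
    # Cumulative sums give the segment end positions; the last one equals
    # the total sample size and is not a change point, so it is dropped.
    return list(accumulate(lengths))[:-1]


two_segments = change_points_from_lengths([5, 5])
three_segments = change_points_from_lengths([3, 4, 3])
print(two_segments)    # [5]
print(three_segments)  # [3, 7]
```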

Examples

>>> # Example 1: Two segments with specified means
>>> from skchange.datasets import generate_piecewise_normal_data
>>> df = generate_piecewise_normal_data(
...     means=[0, 5], lengths=5, n_segments=2, n_variables=1, seed=0
... )
>>> df
          0
0  0.640423
1  0.104900
2 -0.535669
3  0.361595
4  1.304000
5  5.947081
6  4.296265
7  3.734579
8  4.376726
9  5.041326

>>> # Example 2: Unspecified means, variances and lengths
>>> df, params = generate_piecewise_normal_data(
...     n_samples=10, n_segments=2, n_variables=2, seed=1, return_params=True
... )
>>> df
          0         1
0 -3.143268  2.391830
1 -2.241742  2.104844
2 -2.577892  2.357425
3 -3.342769  1.647802
4 -3.088434  2.409558
5  0.932471  1.518255
6  0.110841  1.553519
7  0.900891  1.535109
8  2.186813  2.817436
9 -1.818413 -0.078302
>>> params
{'n_segments': 2,
'n_samples': np.int64(10),
'means': [array([-2.60631446,  1.81071173]), array([0.89274914, 1.81071173])],
'variances': [array([[1., 0.],
        [0., 1.]]),
array([[1., 0.],
        [0., 1.]])],
'lengths': array([5, 5]),
'change_points': array([5]),
'affected_variables': [array([0, 1]), array([0])]}