generate_piecewise_data

generate_piecewise_data(distributions: rv_continuous | rv_discrete | list[rv_continuous] | list[rv_discrete] | None = None, lengths: int | list[int] | ndarray | None = None, *, n_segments: int = 3, n_samples: int = 100, seed: int | Generator | None = None, return_params: bool = False) -> DataFrame | tuple[DataFrame, dict]

Generate data with a piecewise constant distribution.

Generate piecewise segments of data from scipy.stats distributions, where unspecified parameters are randomly generated.

Parameters:
distributions : list of scipy.stats.rv_continuous or scipy.stats.rv_discrete, optional (default=None)

The distributions for generating piecewise data. They are recycled to match the number of segments specified by lengths or n_segments. If None, alternating segments of scipy.stats.norm() and scipy.stats.norm(5) are used. Each distribution is expected to be a scipy distribution instance (e.g., scipy.stats.norm, scipy.stats.uniform). See scipy.stats for a list of all available distributions.
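The recycling rule can be pictured as cycling through the supplied list until every segment has a distribution. The following is a minimal sketch of that behaviour, not the library's implementation: `recycle` is a hypothetical helper, and plain strings stand in for the scipy.stats distribution objects.

```python
from itertools import cycle, islice


def recycle(items, n_segments):
    """Repeat `items` cyclically until there are `n_segments` entries."""
    # Hypothetical helper for illustration only.
    return list(islice(cycle(items), n_segments))


# Two distributions spread over three segments: the first one is reused.
assigned = recycle(["norm(0)", "norm(5)"], 3)
print(assigned)  # ['norm(0)', 'norm(5)', 'norm(0)']
```

This is consistent with Example 3 in the Examples section, where two distributions and three segments give segments centered at 0, 5, and 0 again.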

lengths : int, list of int, or np.ndarray, optional (default=None)

The segment lengths. There are three possible cases:

  1. list or numpy array: Custom set of segment lengths.

  2. int: Common length of each of the n_segments segments.

  3. None: Generate n_segments random segment lengths with a total sample size of n_samples.
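The three cases can be sketched in plain Python as follows. `resolve_lengths` is a hypothetical helper written for illustration, and the way random lengths are drawn in the `None` case (distinct random cut points) is an assumption; the library may use a different sampling scheme.

```python
import random


def resolve_lengths(lengths=None, n_segments=3, n_samples=100, seed=None):
    """Return a list of segment lengths for the three documented cases."""
    if lengths is None:
        # Assumption: draw n_segments - 1 distinct cut points inside the
        # series and use the gaps between them as segment lengths.
        rng = random.Random(seed)
        cuts = sorted(rng.sample(range(1, n_samples), n_segments - 1))
        bounds = [0, *cuts, n_samples]
        return [end - start for start, end in zip(bounds, bounds[1:])]
    if isinstance(lengths, int):
        # One common length, repeated for every segment.
        return [lengths] * n_segments
    # A list or array is taken as the custom segment lengths.
    return [int(length) for length in lengths]


print(resolve_lengths([7, 3]))           # [7, 3]
print(resolve_lengths(3, n_segments=3))  # [3, 3, 3]
random_lengths = resolve_lengths(None, n_segments=4, n_samples=50, seed=0)
print(sum(random_lengths))               # 50
```

Note that n_segments only matters in the last two cases; with a list or array, the number of segments is the number of entries.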

n_segments : int (default=3)

Number of segments to generate if lengths is an integer or None.

n_samples : int (default=100)

Total number of samples to generate if lengths is not specified.

seed : np.random.Generator | int | None, optional

Seed for the random number generator or a numpy random generator instance. If specified, this ensures reproducible output across multiple calls.
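To illustrate why a fixed seed gives reproducible output, here is a minimal sketch of seeded piecewise sampling using numpy and scipy directly. `sample_segments` is a hypothetical stand-in, not the library's code; it relies on `np.random.default_rng`, which accepts an int, a Generator, or None.

```python
import numpy as np
from scipy.stats import norm


def sample_segments(distributions, lengths, seed=None):
    """Draw one block of samples per segment and concatenate them."""
    # default_rng accepts an int seed, an existing Generator, or None.
    rng = np.random.default_rng(seed)
    parts = [
        dist.rvs(size=n, random_state=rng)
        for dist, n in zip(distributions, lengths)
    ]
    return np.concatenate(parts)


dists = [norm(0, 1), norm(10, 0.1)]
first = sample_segments(dists, [7, 3], seed=1)
second = sample_segments(dists, [7, 3], seed=1)
print(np.array_equal(first, second))  # True: same seed, same draws
```

Passing an existing Generator instead of an int shares its state with the caller, so repeated calls with the same Generator object continue the random stream rather than restarting it.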

return_params : bool, optional (default=False)

If True, the function returns a tuple of the generated DataFrame and a dictionary with the parameters used to generate the data.

Returns:
pd.DataFrame

Data frame with generated data.

dict, optional

A dictionary containing the parameters used to generate the data. Only returned if return_params is True. It has the following keys:

  • “n_segments” : number of segments generated.

  • “n_samples” : total number of samples generated.

  • “distributions” : list of scipy.stats.rv_continuous or scipy.stats.rv_discrete with the distributions used for each segment.

  • “lengths” : list of lengths for each segment.

  • “change_points” : list of change points, which are the starting indices of each segment in the data.
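The relationship between "lengths" and "change_points" can be sketched with a running sum. `change_points_from_lengths` is a hypothetical helper, and the sketch assumes that the start of the first segment (index 0) is not listed as a change point, which the description above does not state explicitly.

```python
from itertools import accumulate


def change_points_from_lengths(lengths):
    """Return the start index of every segment after the first."""
    # Running totals give each segment's end, which is the next one's start.
    starts = [0, *accumulate(lengths)]
    # Assumption: drop index 0 and the final total (the end of the data).
    return starts[1:-1]


print(change_points_from_lengths([7, 3]))     # [7]
print(change_points_from_lengths([3, 3, 3]))  # [3, 6]
```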

Examples

>>> # Example 1: Two normal segments
>>> from skchange.datasets import generate_piecewise_data
>>> from scipy.stats import norm
>>> generate_piecewise_data(
...     distributions=[norm(0, 1), norm(10, 0.1)],
...     lengths=[7, 3],
...     seed=1,
... )
           0
0   0.345584
1   0.821618
2   0.330437
3  -1.303157
4   0.905356
5   0.446375
6  -0.536953
7  10.058112
8  10.036457
9  10.029413
>>> # Example 2: Two Poisson segments
>>> from scipy.stats import poisson
>>> generate_piecewise_data(
...     distributions=[poisson(1), poisson(10)],
...     lengths=[5, 5],
...     seed=2,
... )
    0
0   0
1   0
2   1
3   2
4   0
5   8
6  11
7   9
8   9
9   9
>>> # Example 3: Specify int lengths and n_segments
>>> generate_piecewise_data(
...     distributions=[norm(0), norm(5)],
...     lengths=3,
...     n_segments=3,
...     seed=3,
... )
          0
0  2.040919
1 -2.555665
2  0.418099
3  4.432230
4  4.547351
5  4.784403
6 -2.019986
7 -0.231932
8 -0.865213