generate_piecewise_data

generate_piecewise_data(distributions: rv_continuous | rv_discrete | list[rv_continuous] | list[rv_discrete] | None = None, lengths: int | list[int] | ndarray | None = None, *, n_segments: int = 3, n_samples: int = 100, seed: int | Generator | None = None, return_params: bool = False) -> ndarray | tuple[ndarray, dict]

Generate data with a piecewise constant distribution.

Generate piecewise segments of data from scipy.stats distributions, where unspecified parameters are randomly generated.

Parameters:
distributions : list of scipy.stats.rv_continuous or scipy.stats.rv_discrete, optional (default=None)

The distributions for generating piecewise data. They are recycled to match the number of segments specified by lengths or n_segments. If None, alternating segments of scipy.stats.norm() and scipy.stats.norm(5) are used. Each distribution is expected to be a scipy distribution instance (e.g., scipy.stats.norm, scipy.stats.uniform). See scipy.stats for a list of all available distributions.
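The recycling rule described above can be pictured as cycling through the supplied list until every segment has a distribution. The sketch below is illustrative only: `recycle` is a hypothetical helper, not part of skchange, and plain strings stand in for scipy.stats distribution objects.

```python
from itertools import cycle, islice


def recycle(items, n_segments):
    """Repeat `items` in order until there are `n_segments` entries.

    Hypothetical helper for illustration; not skchange's implementation.
    """
    if not items:
        raise ValueError("items must be non-empty")
    return list(islice(cycle(items), n_segments))


# Strings stand in for scipy.stats distribution objects.
print(recycle(["norm(0)", "norm(5)"], 5))
```

Under this reading, two distributions and five segments would give an alternating pattern that starts and ends with the first distribution.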

lengths : int, list of int or np.ndarray, optional (default=None)

The segment lengths. There are three possible cases:

  1. list or numpy array: Custom set of segment lengths.

  2. int: Common length shared by each of the n_segments segments.

  3. None: Generate n_segments random segment lengths with a total sample size of n_samples.
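The three cases above can be sketched as a small dispatch on the type of lengths. This is a minimal sketch, not skchange's code: `resolve_lengths` is a hypothetical helper, and the way random lengths are drawn (distinct interior cut points, so every segment has at least one sample) is an assumption, since the source does not say how they are generated.

```python
import random


def resolve_lengths(lengths=None, n_segments=3, n_samples=100, seed=None):
    """Return a list of segment lengths for the three documented cases.

    Hypothetical helper; the random scheme for `lengths=None` is an assumption.
    """
    if lengths is None:
        # Case 3: random lengths that sum to n_samples.
        if not 1 <= n_segments <= n_samples:
            raise ValueError("need 1 <= n_segments <= n_samples")
        rng = random.Random(seed)
        # Pick n_segments - 1 distinct interior cut points, then take the gaps.
        cuts = sorted(rng.sample(range(1, n_samples), n_segments - 1))
        bounds = [0, *cuts, n_samples]
        return [end - start for start, end in zip(bounds, bounds[1:])]
    if isinstance(lengths, int):
        # Case 2: every one of the n_segments segments gets the same length.
        return [lengths] * n_segments
    # Case 1: a custom sequence of lengths is used as given.
    return [int(length) for length in lengths]
```

In this sketch n_segments and n_samples only matter when lengths does not already fix them, which matches how the parameters are described here.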

n_segments : int (default=3)

Number of segments to generate if lengths is an integer or None.

n_samples : int (default=100)

Total number of samples to generate if lengths is not specified.

seed : np.random.Generator | int | None, optional

Seed for the random number generator or a numpy random generator instance. If specified, this ensures reproducible output across multiple calls.

return_params : bool, optional (default=False)

If True, the function returns a tuple of the generated array and a dictionary with the parameters used to generate the data.

Returns:
np.ndarray of shape (n_samples, n_variables)

Array with generated data.

dict, optional

A dictionary containing the parameters used to generate the data. Only returned if return_params is True. It has the following keys:

  • “n_segments” : number of segments generated.

  • “n_samples” : total number of samples generated.

  • “distributions” : list of scipy.stats.rv_continuous or scipy.stats.rv_discrete with the distributions used for each segment.

  • “lengths” : list of lengths for each segment.

  • “change_points” : list of change points, which are the starting indices of each segment in the data.
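The relationship between "lengths" and segment starting indices can be sketched with a cumulative sum. `segment_starts` is a hypothetical helper, not a skchange function. The source does not state whether the first segment's start (index 0) is included in "change_points", so the sketch computes every start and leaves that choice to the reader.

```python
from itertools import accumulate


def segment_starts(lengths):
    """Return the starting index of each segment, beginning at 0.

    Hypothetical helper for illustration; not skchange's implementation.
    """
    # Running totals give the end of each segment; shifting them by one
    # position (and prepending 0) gives the start of each segment.
    return [0, *accumulate(lengths)][:-1]


lengths = [7, 3]
starts = segment_starts(lengths)
print(starts)      # [0, 7]
print(starts[1:])  # [7], the starts after the first segment
```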

Examples

>>> # Example 1: Two normal segments
>>> from skchange.new_api.datasets import generate_piecewise_data
>>> from scipy.stats import norm
>>> generate_piecewise_data(
...     distributions=[norm(0, 1), norm(10, 0.1)],
...     lengths=[7, 3],
...     seed=1,
... )
array([[ 0.345584...],
       ...
       [10.029413...]])