generate_piecewise_regression_data

generate_piecewise_regression_data(lengths: int | list[int] | ndarray | None = None, *, n_segments: int = 3, n_samples: int = 100, n_features: int = 1, n_informative: int = 1, n_targets: int = 1, bias: float = 0.0, effective_rank: int | None = None, tail_strength: float = 0.5, noise: float = 1.0, shuffle: bool = True, seed: int | Generator | None = None, return_params: bool = False) -> tuple[DataFrame, list[str], list[str]] | tuple[DataFrame, list[str], list[str], dict]

Generate piecewise linear regression data.

Each segment is generated independently with sklearn.datasets.make_regression, and the segments are combined into a single dataset.
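To make the idea of piecewise linear regression data concrete, the following is a minimal sketch in plain Python for a single feature and a single target. It is not this function's implementation: the helper name `sketch_piecewise_data`, the way slopes are drawn, and the row layout are all assumptions chosen for illustration.

```python
import random


def sketch_piecewise_data(lengths, *, bias=0.0, noise=1.0, seed=None):
    """Hypothetical one-feature, one-target analogue of the generator."""
    rng = random.Random(seed)
    rows, coefficients = [], []
    for length in lengths:
        # Each segment gets its own slope, drawn independently of the others.
        # The uniform range is an illustrative choice.
        coef = rng.uniform(0.0, 100.0)
        coefficients.append(coef)
        for _ in range(length):
            x = rng.gauss(0.0, 1.0)
            # Linear model plus Gaussian noise with standard deviation `noise`.
            y = coef * x + bias + rng.gauss(0.0, noise)
            rows.append({"feature_0": x, "target_0": y})
    return rows, coefficients


rows, coefficients = sketch_piecewise_data([20, 50, 30], seed=0)
```

The relationship between `feature_0` and `target_0` changes at the segment boundaries, which is the structure a change point detector for regression models is meant to find.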

Parameters:
lengths : int, list of int or np.ndarray, optional (default=None)

The segment lengths. There are three possible cases:

  1. list or numpy array: Custom set of segment lengths.

  2. int: The common length of each of the n_segments segments.

  3. None: Generate n_segments random segment lengths with a total sample size of n_samples.
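The three cases can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `resolve_lengths` is a hypothetical helper, and the way random lengths are drawn in the None case (distinct random cut points) is only one plausible scheme.

```python
import random


def resolve_lengths(lengths=None, *, n_segments=3, n_samples=100, seed=None):
    """Hypothetical helper: turn the `lengths` argument into a list of segment lengths."""
    if lengths is None:
        # Case 3: draw n_segments - 1 distinct cut points in 1..n_samples - 1,
        # then take the gaps between consecutive cuts as the segment lengths.
        rng = random.Random(seed)
        cuts = sorted(rng.sample(range(1, n_samples), n_segments - 1))
        bounds = [0, *cuts, n_samples]
        return [stop - start for start, stop in zip(bounds, bounds[1:])]
    if isinstance(lengths, int):
        # Case 2: n_segments segments, each of the given length.
        return [lengths] * n_segments
    # Case 1: a custom sequence of segment lengths, used as given.
    return [int(length) for length in lengths]


custom = resolve_lengths([20, 50, 30])      # three segments, 100 samples in total
equal = resolve_lengths(25, n_segments=4)   # four segments of 25 samples each
random_lengths = resolve_lengths(None, n_segments=3, n_samples=100, seed=0)
```

Note that n_samples only matters in the None case, and n_segments is ignored when an explicit list or array is passed.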

n_segments : int (default=3)

Number of segments to generate if lengths is an integer or None.

n_samples : int (default=100)

Total number of samples to generate if lengths is not specified.

n_features : int (default=1)

The total number of features.

n_informative : int (default=1)

The number of informative features.

n_targets : int (default=1)

The number of target variables.

bias : float (default=0.0)

The bias term in the linear model. Used across all segments.

effective_rank : int | None (default=None)

The effective rank of the feature matrix. Used across all segments.

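tail_strength : float (default=0.5)

The relative importance of the fat noisy tail of the singular values profile of the feature matrix, as defined by sklearn.datasets.make_regression. Only relevant if effective_rank is not None. Used across all segments.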

noise : float (default=1.0)

The standard deviation of the Gaussian noise applied to the output. Used across all segments.

shuffle : bool (default=True)

Whether to shuffle the samples and features per segment.

seed : np.random.Generator | int | None, optional (default=None)

Seed for the random number generator or a numpy random generator instance. If specified, this ensures reproducible output across multiple calls.

return_params : bool (default=False)

Whether to return the parameters used for data generation.

Returns:
pd.DataFrame

The generated data as a DataFrame with columns named “feature_0”, “feature_1”, …, “target_0”, “target_1”, …

list[str]

A list of feature column names.

list[str]

A list of target column names.

dict, optional

If return_params is True, a dictionary containing the parameters used to generate the data, including segment lengths, coefficients, change points, total number of samples and total number of segments.
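The quantities listed for the parameter dictionary are closely related. As a sketch, the change points, the total number of samples and the number of segments can all be derived from the segment lengths. The values below are illustrative, and the convention used here (a change point is the index of the first sample of a new segment) is an assumption, since the exact dictionary keys and conventions are not stated in this reference.

```python
from itertools import accumulate

segment_lengths = [20, 50, 30]  # illustrative values, not library output

# Cumulative sums give the end of each segment; dropping the last one
# (the total sample count) leaves the indices where a new segment starts.
boundaries = list(accumulate(segment_lengths))
change_points = boundaries[:-1]
n_samples = boundaries[-1]
n_segments = len(segment_lengths)
```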