Utilities

Synthetic data generator

generate_synthetic_cdans

generate_synthetic_cdans(
    n_vars: int = 4,
    n_samples: int = 500,
    tau_max: int = 2,
    n_changing: int = 2,
    autocorr: float = 0.4,
    contemp_strength: float = 0.5,
    lagged_strength: float = 0.4,
    noise_std: float = 0.5,
    nonstationary_amplitude: float = 0.6,
    seed: int | None = 42,
) -> SyntheticDataset

Generate a synthetic nonstationary, autocorrelated time series.

The data-generating process is, for each variable i and time t:

X_i[t] = a_ii(t) * X_i[t-1]                       # autoregressive term
        + sum_{(j,lag) in lagged_pa(i)}            # lagged parents
              b_{ij,lag}(t) * X_j[t-lag]
        + sum_{j in contemp_pa(i)}                 # contemporaneous parents
              c_{ij}(t) * X_j[t]
        + eps_i[t]

For variables in changing_modules the coefficients a, b, c are smoothly varying functions of t (sinusoidal). For other variables the coefficients are constants.
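
To make the process above concrete, here is a minimal sketch of a two-variable instance in which X_0 is stationary and X_1 is a changing module with one lagged parent. It is not the library's implementation: the helper name `simulate_two_variable_example`, the fixed graph, and the choice of drift period T equal to the series length are all assumptions made for illustration.

```python
import numpy as np


def simulate_two_variable_example(n_samples=500, seed=42):
    """Hypothetical helper: X_0[t-1] -> X_1[t], with X_1 a changing module."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples)
    # Assumption: the drift period T equals the series length.
    drift = 0.6 * np.sin(2 * np.pi * t / n_samples)
    a_00 = 0.4              # constant autoregressive coefficient of X_0
    a_11 = 0.4 + drift      # time-varying autoregressive coefficient of X_1
    b_10 = 0.4 + drift      # time-varying lagged coefficient for X_0 -> X_1
    eps = rng.normal(scale=0.5, size=(n_samples, 2))
    x = np.zeros((n_samples, 2))
    x[0] = eps[0]           # start from pure noise
    for s in range(1, n_samples):
        x[s, 0] = a_00 * x[s - 1, 0] + eps[s, 0]
        x[s, 1] = a_11[s] * x[s - 1, 1] + b_10[s] * x[s - 1, 0] + eps[s, 1]
    return x, b_10


x, b_10 = simulate_two_variable_example(n_samples=200, seed=0)
print(x.shape)
```

The sketch omits contemporaneous parents for brevity; adding them would require generating variables in a topological order of the contemporaneous DAG.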

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_vars | int | Number of observed variables. | 4 |
| n_samples | int | Length of the generated time series. | 500 |
| tau_max | int | Maximum lag for the random lagged-parent structure. | 2 |
| n_changing | int | Number of variables whose mechanism is nonstationary. | 2 |
| autocorr | float | Magnitude of the autoregressive coefficient a_ii. | 0.4 |
| contemp_strength | float | Magnitude of contemporaneous coefficients. | 0.5 |
| lagged_strength | float | Magnitude of lagged coefficients. | 0.4 |
| noise_std | float | Standard deviation of the additive noise term. | 0.5 |
| nonstationary_amplitude | float | Amplitude of coefficient drift for changing modules. The effective coefficient at time t is base + amplitude * sin(2 * pi * t / T). | 0.6 |
| seed | int \| None | RNG seed for reproducibility. Pass None for a nondeterministic run. | 42 |

Returns:

| Type | Description |
| --- | --- |
| SyntheticDataset | The data, ground-truth graph, and changing-module indices. |

SyntheticDataset dataclass

SyntheticDataset(
    data: ndarray,
    lagged_edges: set[tuple[int, int, int]],
    contemporaneous_edges: set[tuple[int, int]],
    changing_modules: set[int],
    metadata: dict = dict(),
)

Container for a synthetic dataset and its ground-truth structure.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| data | ndarray | Observations, shape (n_samples, n_vars). |
| lagged_edges | set[tuple[int, int, int]] | Set of ground-truth lagged edges. Each tuple (i, j, lag) means X_i[t - lag] -> X_j[t] with lag >= 1. |
| contemporaneous_edges | set[tuple[int, int]] | Ground-truth contemporaneous DAG. Each tuple (i, j) means X_i[t] -> X_j[t]. |
| changing_modules | set[int] | Indices of variables whose generating mechanism is nonstationary (time-varying coefficients). |
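
The edge sets are convenient for membership tests, but comparing against a discovered graph is often easier with an adjacency array. The following is a minimal sketch of one such conversion, using the tuple conventions described above. The helper `edges_to_adjacency` is hypothetical and not part of the documented API, and the edge sets in the usage lines are made up for illustration.

```python
import numpy as np


def edges_to_adjacency(lagged_edges, contemporaneous_edges, n_vars, tau_max):
    """Hypothetical helper: adj[lag, i, j] == 1 iff X_i[t - lag] -> X_j[t]."""
    adj = np.zeros((tau_max + 1, n_vars, n_vars), dtype=int)
    for i, j, lag in lagged_edges:
        adj[lag, i, j] = 1          # lagged edges occupy slices 1..tau_max
    for i, j in contemporaneous_edges:
        adj[0, i, j] = 1            # slice 0 holds the contemporaneous DAG
    return adj


# Illustrative edge sets in the documented (i, j, lag) and (i, j) formats.
adj = edges_to_adjacency({(0, 1, 1), (2, 2, 2)}, {(0, 2)}, n_vars=3, tau_max=2)
print(adj.shape)
```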

Lagged design matrices

lagged_design_matrix

lagged_design_matrix(
    data: ndarray, tau_max: int
) -> tuple[np.ndarray, np.ndarray, list[tuple[int, int]]]

Build a lagged design matrix from a time series.

For data of shape (T, n) and tau_max = k, returns:

  • Y of shape (T - k, n): the "current" values X[t] for t = k, k+1, ..., T-1.
  • X_lagged of shape (T - k, n * k): the lagged values, columns ordered as X_0[t-1], X_1[t-1], ..., X_{n-1}[t-1], X_0[t-2], ....
  • column_index: list of (variable, lag) tuples describing each column of X_lagged.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | ndarray | Time series, shape (T, n) with T > tau_max. | required |
| tau_max | int | Maximum lag to include. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple | (Y, X_lagged, column_index) |
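
The shapes and column ordering described above can be reproduced with a few NumPy slices. The function below is a sketch written from that description, not the library's source; the name `lagged_design_matrix_sketch` and the requirement that tau_max be at least 1 are assumptions.

```python
import numpy as np


def lagged_design_matrix_sketch(data, tau_max):
    """Sketch of the documented behaviour; assumes tau_max >= 1."""
    T, n = data.shape
    if tau_max < 1 or T <= tau_max:
        raise ValueError("need 1 <= tau_max < T")
    # Current values X[t] for t = tau_max, ..., T-1.
    y = data[tau_max:]
    # One block per lag: row r of the block holds X[tau_max + r - lag].
    blocks = [data[tau_max - lag : T - lag] for lag in range(1, tau_max + 1)]
    x_lagged = np.hstack(blocks)
    # Columns run over all variables at lag 1, then all variables at lag 2, ...
    column_index = [(var, lag) for lag in range(1, tau_max + 1) for var in range(n)]
    return y, x_lagged, column_index


data = np.arange(10).reshape(5, 2)   # T = 5, n = 2
y, x_lagged, column_index = lagged_design_matrix_sketch(data, tau_max=2)
print(y.shape, x_lagged.shape)       # (3, 2) (3, 4)
print(column_index)                  # [(0, 1), (1, 1), (0, 2), (1, 2)]
```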

column_for

column_for(var: int, lag: int, n_vars: int) -> int

Return the column index in a lagged design matrix for (var, lag).
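
If the lookup follows the column ordering documented for lagged_design_matrix (all variables at lag 1, then all variables at lag 2, and so on), the index reduces to simple arithmetic. The sketch below assumes that ordering; the name `column_for_sketch` and the input validation are illustrative additions, not documented behaviour.

```python
def column_for_sketch(var, lag, n_vars):
    """Sketch: column of (var, lag) when columns are grouped by lag, then variable."""
    if lag < 1 or not 0 <= var < n_vars:
        raise ValueError("need lag >= 1 and 0 <= var < n_vars")
    # Each earlier lag contributes a full block of n_vars columns.
    return (lag - 1) * n_vars + var


print(column_for_sketch(0, 1, 4))   # 0
print(column_for_sketch(0, 2, 4))   # 4
print(column_for_sketch(1, 3, 4))   # 9
```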