Using CDANs on your own CSV data

The synthetic-data examples elsewhere in these docs are convenient for testing, but in practice you have a CSV file with your own variables. This page walks through that workflow.

Required CSV format

Before anything else, your CSV needs to be in the right shape:

  • Rows are time-ordered samples. Oldest measurement at the top, newest at the bottom. CDANs uses row order as the time axis.
  • Columns are variables. Each column is one variable's univariate time series. The column names become the variable names CDANs reports.
  • No missing values. Drop or impute NaNs before fitting. CDANs raises an error if it sees any.
  • All columns numeric. Encode or drop text columns.
  • No timestamp column. If your CSV has one, drop it before fitting — CDANs treats row order as time. If you need a non-time surrogate (e.g. patient phase, experimental condition), pass it separately via the surrogate= argument to CDANs(...).
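The checklist above can be sketched as a small preprocessing pass. The column names, the inline CSV, and the forward-fill strategy here are placeholders — substitute whatever fits your own file:

```python
import io
import pandas as pd

# A tiny demo CSV with a timestamp column and a missing value,
# standing in for your own file (hypothetical data).
csv_text = """timestamp,systolic_bp,heart_rate,spo2
2024-01-01 00:00,120,72,98
2024-01-01 00:05,122,,97
2024-01-01 00:10,119,70,98
"""
df = pd.read_csv(io.StringIO(csv_text))

# 1. Drop the timestamp column -- row order alone carries time.
df = df.drop(columns=["timestamp"])

# 2. Handle missing values (here: forward-fill, then drop any leftovers).
df = df.ffill().dropna()

# 3. Keep only numeric columns.
df = df.select_dtypes(include="number")

assert not df.isna().any().any()  # CDANs would raise on NaNs
```

After this pass, df is ready to hand to fit() as in the minimal example below.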

Minimal example

import pandas as pd
from cdans import CDANs

# Load the CSV. Drop any non-numeric or timestamp columns first.
df = pd.read_csv("your_data.csv")

# Fit CDANs, preserving column names.
model = CDANs(tau_max=1, ci_test="kci")
result = model.fit(df.values, var_names=df.columns.tolist())

# The recovered graph is summarized using your column names.
print(result.graph.summary())

That's the whole core idea: df.values for the data, df.columns.tolist() for the names. Pass both to fit().

Reading the output

result.graph.summary() prints something like:

TimeSeriesGraph(n_vars=4, tau_max=1)
  Lagged edges: 9
    systolic_bp(t-1) -> systolic_bp(t)
    systolic_bp(t-1) -> heart_rate(t)
    heart_rate(t-1) -> spo2(t)
    ...
  Contemporaneous edges: 2 directed, 0 undirected
    systolic_bp -> temperature
    heart_rate -> temperature
  Changing modules (1):
    heart_rate

The variable names match your CSV column headers. To extract edges programmatically:

g = result.graph

# Lagged edges are (src_idx, dst_idx, lag) integer triples; map indices to column names
for src_idx, dst_idx, lag in g.lagged_edges:
    src = df.columns[src_idx]
    dst = df.columns[dst_idx]
    print(f"{src}[t-{lag}] -> {dst}[t]")

# Contemporaneous directed
for src_idx, dst_idx in g.directed_contemp_edges():
    print(f"{df.columns[src_idx]} -> {df.columns[dst_idx]}")

# Variables CDANs flagged as non-stationary
for idx in g.changing_modules:
    print(df.columns[idx])

Practical notes

Sample size. KCI's power scales with n. For 4-6 variables and tau_max <= 2, aim for at least 500-800 samples. Below 300, expect more spurious edges and missed changing modules.

tau_max. Set this to the longest lag you reasonably expect to matter. Setting it too high wastes runtime and can hurt CI-test power (more conditioning variables); too low means missing real long-lag effects. When in doubt, start with tau_max=2 and check whether the recovered structure stabilizes if you increase it.
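One concrete way to check for stabilization is to refit at increasing tau_max values and compare the recovered lagged edge sets. The helper below is hypothetical (not part of CDANs); it just measures the overlap between two lists of (src, dst, lag) triples like those in graph.lagged_edges:

```python
def edge_overlap(edges_a, edges_b):
    """Jaccard similarity between two sets of (src, dst, lag) triples."""
    a, b = set(edges_a), set(edges_b)
    if not (a | b):
        return 1.0  # both empty: identical by convention
    return len(a & b) / len(a | b)

# Compare, say, fits at tau_max=2 and tau_max=3 (edge lists here are made up):
edges_tau2 = [(0, 0, 1), (0, 1, 1), (1, 2, 1)]
edges_tau3 = [(0, 0, 1), (0, 1, 1), (1, 2, 1), (2, 3, 3)]
print(edge_overlap(edges_tau2, edges_tau3))  # 0.75
```

An overlap close to 1.0 between successive tau_max settings suggests the extra lags are not changing the recovered structure.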

ci_test. Use "kci" for nonlinear data (slower, ~O(n³) per test) and "fisherz" for linear-Gaussian data (much faster but linear-only). "fisherz" cannot detect non-stationarity by itself, so the changing-modules step in CDANs degrades to a chance result with "fisherz" — use "kci" if you care about which variables have time-varying mechanisms.

Standardization. Not required by the algorithm, but column-standardizing your data (subtract mean, divide by std) often improves KCI's stability when columns have very different scales:

df_std = (df - df.mean()) / df.std()
result = CDANs(tau_max=1, ci_test="kci").fit(df_std.values, var_names=df_std.columns.tolist())

Multivariate time series with subjects/patients. If your CSV is a long-format table with multiple subjects (e.g. one row per patient-timestep), don't fit CDANs on the concatenated rows directly — the algorithm assumes a single time-ordered series. Either fit per-subject and aggregate the results, or restructure into a single representative series before fitting.
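A sketch of the per-subject route, assuming a long-format table with a subject_id column (the column name, the demo data, and the aggregation rule are all placeholders):

```python
import pandas as pd

# Hypothetical long-format table: one row per patient-timestep.
long_df = pd.DataFrame({
    "subject_id": ["a", "a", "a", "b", "b", "b"],
    "heart_rate": [70, 72, 71, 65, 66, 64],
    "spo2":       [98, 97, 98, 99, 99, 98],
})

per_subject = {}
for sid, sub in long_df.groupby("subject_id", sort=False):
    # Each subject becomes its own single time-ordered series.
    sub = sub.drop(columns=["subject_id"]).reset_index(drop=True)
    # Fit one model per subject, e.g.:
    #   result = CDANs(tau_max=1, ci_test="kci").fit(
    #       sub.values, var_names=sub.columns.tolist())
    per_subject[sid] = sub

# Aggregate afterwards, e.g. keep only edges found in every subject's graph.
```

How to aggregate is up to you; intersecting edge sets across subjects is a conservative choice, majority voting a more permissive one.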

Runnable example

A complete script that creates a demo CSV and runs the full workflow above is in examples/from_csv.py. Adapt it by replacing the synthetic-CSV-generation section with a path to your own file.