CSTS: Correlation Structures in Time Series

CSTS is a synthetic benchmark dataset for evaluating correlation structure discovery in time series data. It systematically models known correlation structures between time series variables, which enables rigorous assessment of clustering algorithms and validation methods.

Dataset Characteristics:

23 distinct correlation structures between three time series variables (iob, cob, ig)
Each pairwise correlation falls into one of three categories: strong positive ([0.7, 1]), negligible ([-0.2, 0.2]), or strong negative ([-1, -0.7])
Stationary segments with regime-switching correlation structures and no temporal dependencies (no autocorrelation, trends, or seasonality)
Segment lengths range from 15 minutes to 10 hours (900 to 36,000 observations)
Data variants cover normal distributions, non-normal distributions (similar to insulin, carbohydrate, and glucose data), downsampled versions, and three sparsity levels: complete, partial (30% of observations missing), and sparse (90% of observations missing)
60 subjects in total, 30 per data variant, two splits (exploratory/confirmatory)
Ground-truth labels for segmentation and clustering, plus labels for controlled, degraded versions of the segmentation and clustering
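
The count of 23 structures is consistent with a simple feasibility argument. Three variables give three pairwise correlations, each in one of three categories, so 27 patterns are possible, and a pattern is only usable if some correlation matrix with entries in the stated ranges is positive semi-definite. The Python sketch below checks this numerically with NumPy. The grid search, the tolerance, and the function names are assumptions made for illustration and are not taken from the dataset's own tooling.

```python
from itertools import product

import numpy as np

# Category ranges as listed in the dataset characteristics.
RANGES = {
    "strong_pos": (0.7, 1.0),
    "negligible": (-0.2, 0.2),
    "strong_neg": (-1.0, -0.7),
}


def is_psd(r12, r13, r23, tol=1e-9):
    """Return True if the 3x3 correlation matrix is positive semi-definite."""
    m = np.array([
        [1.0, r12, r13],
        [r12, 1.0, r23],
        [r13, r23, 1.0],
    ])
    return np.linalg.eigvalsh(m).min() >= -tol


def is_feasible(pattern, points=5):
    """Search a small grid inside each category range for a valid matrix."""
    grids = [np.linspace(*RANGES[category], points) for category in pattern]
    return any(is_psd(*values) for values in product(*grids))


# All 3^3 combinations of categories for the three variable pairs.
patterns = list(product(RANGES, repeat=3))
valid = [p for p in patterns if is_feasible(p)]
print(len(patterns), len(valid))
```

Under these assumptions, 23 of the 27 patterns pass. The four that fail are the ones in which all three correlations are strong and the product of their signs is negative, for example two strong positive correlations combined with a strong negative one.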

Motivation:

CSTS addresses a critical gap in time series clustering evaluation by providing a structure-first benchmark with well-defined correlation structures rather than arbitrary classification boundaries. This enables researchers to systematically distinguish between correlation structure deterioration and algorithmic limitations, moving clustering analysis from "art" toward science.

Use Cases:

Evaluating time series clustering algorithms' ability to detect correlation structures under varying data conditions
Assessing clustering validation methods under varying data conditions
Investigating how preprocessing techniques affect correlation structure discovery
Establishing performance thresholds for high-quality clustering results
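
As a rough illustration of the first use case, the Python sketch below generates one synthetic segment with a chosen correlation structure, randomly drops 90% of its observations to mimic a sparse variant, and checks whether the empirical correlations still fall into the intended categories. It uses NumPy only. The target matrix, the column order (iob, cob, ig), the random seed, the random-drop scheme, and the helper names are all hypothetical, and the Cholesky-based generator is a stand-in rather than the procedure used to build CSTS.

```python
import numpy as np


def categorize(r):
    """Map a correlation coefficient to one of the three categories."""
    if r >= 0.7:
        return "strong_pos"
    if r <= -0.7:
        return "strong_neg"
    if -0.2 <= r <= 0.2:
        return "negligible"
    return "undefined"  # falls in a gap between the defined ranges


def pattern_of(data):
    """Return the categories for the pairs (0,1), (0,2), (1,2) of the columns."""
    corr = np.corrcoef(data, rowvar=False)
    return tuple(categorize(corr[i, j]) for i, j in [(0, 1), (0, 2), (1, 2)])


rng = np.random.default_rng(0)

# Hypothetical target: first pair strongly positive, other pairs negligible.
target = np.array([
    [1.0, 0.85, 0.0],
    [0.85, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

n = 36000  # longest segment length listed in the dataset characteristics
# Correlate independent normal draws with the Cholesky factor of the target.
segment = rng.standard_normal((n, 3)) @ np.linalg.cholesky(target).T

# Keep roughly 10% of the observations, i.e. about 90% missing.
keep = rng.random(n) >= 0.9
sparse_segment = segment[keep]

print(pattern_of(segment))
print(pattern_of(sparse_segment))
```

For a long segment like this one, the pattern is expected to survive the loss of 90% of the observations. Shorter segments leave far fewer observations after dropping, so their empirical correlations are noisier and may drift out of the intended ranges.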

Source: CSTS: A Benchmark for the Discovery of Correlation Structures in Time Series Clustering