alp_data.chain Module
What is dataset chaining?
The chain module provides a lightweight way to iterate over multiple ESP datasets as if they were a single dataset. Unlike ConcatenatedDataset, chaining does not merge DataFrames or support transformations on the combined data; it simply yields items from each source dataset in sequence.
Use ChainedDataset when you:
- Need to iterate over multiple datasets without applying joint transformations
- Want to support streaming mode across multiple datasets
- Prefer a lightweight approach that doesn't create a merged DataFrame
For combining datasets with transformation support, see concatenate.md.
How can I chain datasets?
Basic Usage
from alp_data.datasets import InsectSet459, BirdSet
from alp_data.chain import ChainedDataset
# Load individual datasets
dataset1 = InsectSet459(split="validation")
dataset2 = BirdSet(split="HSN-test")
print(f"Dataset 1 length: {len(dataset1)}")
print(f"Dataset 2 length: {len(dataset2)}")
# Chain datasets for iteration
chained = ChainedDataset([dataset1, dataset2])
# Length is the sum of all source datasets
print(f"Chained dataset length: {len(chained)}")
# Iterate over all items
for item in chained:
    print(item.keys())
    break  # Just show first item
Indexing
ChainedDataset supports indexing by mapping the global index to the appropriate source dataset:
# Access items by global index
first_item = chained[0] # From dataset1
# last_item = chained[-1]  # Not supported: raises IndexError
# Index maps across datasets
# If dataset1 has 100 items and dataset2 has 200 items:
# - chained[0] returns dataset1[0]
# - chained[99] returns dataset1[99]
# - chained[100] returns dataset2[0]
# - chained[299] returns dataset2[199]
Warning
Negative indexing is not supported in ChainedDataset. Attempting to use negative indices will raise an IndexError.
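The index mapping described above can be sketched with cumulative lengths and a binary search. This is a minimal illustration, not the library's implementation: `locate` is a hypothetical helper, and how ChainedDataset resolves indices internally is not shown in this page.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_idx: int, lengths: list[int]) -> tuple[int, int]:
    """Map a global index to (source position, local index).

    Hypothetical helper for illustration; not part of alp_data.
    """
    if global_idx < 0:
        raise IndexError("negative indices are not supported")
    # Cumulative end offsets, e.g. lengths [100, 200] -> [100, 300]
    offsets = list(accumulate(lengths))
    if not offsets or global_idx >= offsets[-1]:
        raise IndexError(f"index {global_idx} is out of range")
    # Position of the first source whose end offset exceeds global_idx
    source_pos = bisect_right(offsets, global_idx)
    start = offsets[source_pos - 1] if source_pos > 0 else 0
    return source_pos, global_idx - start


print(locate(0, [100, 200]))    # (0, 0)
print(locate(99, [100, 200]))   # (0, 99)
print(locate(100, [100, 200]))  # (1, 0)
print(locate(299, [100, 200]))  # (1, 199)
```

The printed pairs match the example in the comments above: index 100 is the first item of the second dataset, and index 299 is its last.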
Streaming Mode
Unlike ConcatenatedDataset, ChainedDataset supports streaming mode. All source datasets must have the same streaming mode:
# Streaming mode - all datasets must be streaming
streaming_ds1 = SomeDataset(split="train", streaming=True)
streaming_ds2 = AnotherDataset(split="train", streaming=True)
chained_streaming = ChainedDataset([streaming_ds1, streaming_ds2])
# In streaming mode, len() raises RuntimeError
# Iterate instead:
for item in chained_streaming:
    process(item)
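To make the streaming rules concrete, here is a small self-contained sketch of a chain wrapper that checks mode consistency and refuses `len()` when streaming. `FakeSource` and `ChainSketch` are hypothetical stand-ins, and the `ValueError` raised for mixed modes is an assumption; this page does not say which exception the library raises in that case.

```python
from collections.abc import Iterator
from itertools import chain


class FakeSource:
    """Hypothetical stand-in for a dataset; not part of alp_data."""

    def __init__(self, records: list[dict], streaming: bool):
        self.records = records
        self.streaming = streaming

    def __iter__(self) -> Iterator[dict]:
        return iter(self.records)


class ChainSketch:
    """Illustrative chain wrapper; the real ChainedDataset may differ."""

    def __init__(self, sources: list[FakeSource]):
        modes = {source.streaming for source in sources}
        if len(modes) > 1:
            # Assumption: the exception type for mixed modes is a guess.
            raise ValueError("all sources must share the same streaming mode")
        self.sources = sources
        self.streaming = bool(modes) and modes.pop()

    def __len__(self) -> int:
        if self.streaming:
            raise RuntimeError("len() is unavailable in streaming mode")
        return sum(len(source.records) for source in self.sources)

    def __iter__(self) -> Iterator[dict]:
        # Yield every item of the first source, then the next, and so on.
        return chain.from_iterable(self.sources)


stream = ChainSketch([
    FakeSource([{"id": 1}, {"id": 2}], streaming=True),
    FakeSource([{"id": 3}], streaming=True),
])
for item in stream:
    print(item["id"])
```

Because the length is unknown while streaming, code that needs a count has to tally items during iteration.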
Creating from Configuration
You can create a ChainedDataset from a YAML configuration file using the chain keyword:
chain:
  datasets:
    - dataset_name: insectset459
      split: validation
    - dataset_name: birdset
      split: HSN-test
Load the configuration in Python:
from alp_data import dataset_from_config
chained_dataset, metadata = dataset_from_config("path/to/chain_config.yaml")
Key Differences from ConcatenatedDataset
| Aspect | ChainedDataset | ConcatenatedDataset |
|---|---|---|
| DataFrame handling | Delegates to source datasets | Merges into single DataFrame |
| Transformations | Not supported | Supported via apply_transformations |
| Column handling | Union of all columns reported | Merge strategies (hard/overlap/soft) |
| Streaming | Supported | Not supported |
| Memory footprint | Lightweight | Holds merged DataFrame |
| Metadata merging | Basic | Full merge (names, owners, versions, etc.) |
Understanding the ChainedDataset Class
Key Properties
# Available columns (union of all source dataset columns)
print(chained.columns)
# Available splits (always ["chained"])
print(chained.available_splits)
# Length (sum of source dataset lengths)
print(len(chained)) # Raises RuntimeError in streaming mode
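The "union of all source dataset columns" can be pictured as follows. This is a rough sketch with made-up column names; `union_columns` is a hypothetical helper, and preserving first-seen order is an assumption, since the page only states that the result is a union.

```python
def union_columns(column_lists: list[list[str]]) -> list[str]:
    """Return each column name once, in first-seen order.

    Hypothetical helper; the ordering is an assumption.
    """
    seen: dict[str, None] = {}
    for columns in column_lists:
        for name in columns:
            seen.setdefault(name, None)
    return list(seen)


# Made-up column names, for illustration only
insect_columns = ["audio", "label", "family"]
bird_columns = ["audio", "label", "ebird_code"]

print(union_columns([insect_columns, bird_columns]))
# ['audio', 'label', 'family', 'ebird_code']
```

A column appearing in the union does not guarantee that every item has it, so reading optional fields with `item.get("family")` is safer than `item["family"]`.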
Iteration Behavior
When iterating, items are yielded from each source dataset in order:
# Items come from datasets in order
chained = ChainedDataset([dataset1, dataset2, dataset3])
# Iteration yields:
# - All items from dataset1
# - Then all items from dataset2
# - Then all items from dataset3
for item in chained:
    # item comes from whichever dataset it belongs to
    pass
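If you need to know which source an item came from, one option is to iterate over the source datasets yourself in the same order. The sketch below uses plain lists as stand-ins for datasets; `iter_with_source` is a hypothetical helper, not something this module is documented to provide.

```python
from collections.abc import Iterable, Iterator


def iter_with_source(sources: list[Iterable[dict]]) -> Iterator[tuple[int, dict]]:
    """Yield (source position, item) pairs in chaining order."""
    for position, source in enumerate(sources):
        for item in source:
            yield position, item


# Plain lists stand in for datasets here
first = [{"name": "a1"}, {"name": "a2"}]
second = [{"name": "b1"}]

for position, item in iter_with_source([first, second]):
    print(position, item["name"])
```

This visits items in the same sequence as the chained iteration above: everything from the first source, then everything from the second.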
Limitations
- No transformations: Cannot apply transforms to the chained dataset as a whole
- No negative indexing: Only non-negative integer indices are supported
- Streaming mode consistency: All source datasets must have the same streaming mode
- No column merging: Columns are not aligned or merged; each item has whatever columns its source dataset provides
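Some of these limitations can be worked around in user code. The following sketch shows two hypothetical helpers: `map_items` applies a function to each item while iterating, and `normalize_index` converts a negative index to a non-negative one given a known length (so it only applies outside streaming mode). Neither is part of alp_data, and a plain list stands in for the chained dataset.

```python
from collections.abc import Callable, Iterable, Iterator


def map_items(dataset: Iterable[dict], transform: Callable[[dict], dict]) -> Iterator[dict]:
    """Lazily apply transform to each item as it is yielded."""
    for item in dataset:
        yield transform(item)


def normalize_index(idx: int, length: int) -> int:
    """Turn a possibly negative index into a non-negative one."""
    if not -length <= idx < length:
        raise IndexError(f"index {idx} is out of range for length {length}")
    return idx + length if idx < 0 else idx


def fill_missing(item: dict) -> dict:
    """Give every item the same keys; the default value is arbitrary."""
    return {"family": None, **item}


# A plain list stands in for a non-streaming chained dataset
items = [{"label": "cricket", "family": "Gryllidae"}, {"label": "wren"}]

last = items[normalize_index(-1, len(items))]
print(last["label"])  # wren

for item in map_items(items, fill_missing):
    print(item["label"], item["family"])
```

Because `map_items` is a generator, it does not materialize the data and therefore also works with sources that can only be iterated.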
Function Reference
Helper class to chain multiple datasets for iteration and indexing.
This class allows iterating over multiple datasets as if they were a single dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| datasets | list[Dataset] | List of datasets to concatenate for iteration | required |
Examples:
>>> from alp_data.datasets import InsectSet459, BirdSet
>>> from alp_data.chain import ChainedDataset
>>> dataset1 = InsectSet459(split="validation")
>>> dataset2 = BirdSet(split="HSN-test")
>>> concat_iter = ChainedDataset([dataset1, dataset2])
>>> total_length = len(dataset1) + len(dataset2)
>>> item = next(iter(concat_iter))
>>> assert len(concat_iter) == total_length, "Concatenated iterator length should match sum of source datasets lengths"
Source code in alp_data/chain.py
__getitem__(idx)
Get item by global index across chained datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Global index across all chained datasets. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The item at the specified global index. |
Raises:
| Type | Description |
|---|---|
| IndexError | If the index is out of bounds. |
| RuntimeError | If indexing is attempted in streaming mode. |
from_config(chain_config)
classmethod
Create a ChainedDataset from a ChainedDatasetConfig object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chain_config | ChainedDatasetConfig | Configuration object specifying the datasets to chain together. | required |
Returns:
| Type | Description |
|---|---|
| tuple[ChainedDataset, dict] | A tuple containing the |