alp_data.concat Module
Combining Datasets: Concatenation vs Chaining
ESP Data provides two ways to combine multiple datasets: ConcatenatedDataset and ChainedDataset. Choose based on whether you need to transform the combined data:
| Feature | ConcatenatedDataset | ChainedDataset |
|---|---|---|
| Use case | Transform combined data | Simple iteration |
| DataFrames | Merged into one | Not merged |
| Transformations | Supported on joined data | Not supported |
| Merge strategies | Hard, overlap, soft | N/A |
| Streaming mode | Not supported | Supported |
| Memory | Holds merged DataFrame | Lightweight |
Use ConcatenatedDataset when you need to apply transforms (filter, deduplicate, etc.) to the combined dataset. Use ChainedDataset when you simply want to iterate over multiple datasets sequentially without any joint transformations.
For ChainedDataset documentation, see chain.md.
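The difference between the two approaches can be sketched without the library. The snippet below is only a conceptual illustration using plain lists of row dictionaries (the toy data and variable names are made up, not part of alp_data): chaining visits each source in turn, while concatenation builds one merged table that a joint transformation such as deduplication can operate on.

```python
from itertools import chain

# Two toy "datasets": each is a list of row dictionaries (illustrative only).
birds = [{"label": "robin"}, {"label": "robin"}, {"label": "wren"}]
insects = [{"label": "cicada"}, {"label": "wren"}]

# Chaining: rows are visited one source after another; nothing is merged.
chained_labels = [row["label"] for row in chain(birds, insects)]

# Concatenation: rows are materialised into one table first, so a joint
# transformation (here: deduplicate on "label") can see every row at once.
merged = birds + insects
seen = set()
deduplicated = []
for row in merged:
    if row["label"] not in seen:
        seen.add(row["label"])
        deduplicated.append(row)

deduplicated_labels = [row["label"] for row in deduplicated]
print(chained_labels)
print(deduplicated_labels)
```

Note that the duplicate "wren" row spans both sources, so it can only be removed once the rows live in a single merged table.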
What is Dataset concatenation?
The concat module provides utilities for combining multiple ESP datasets into a single unified dataset. This is particularly useful when you want to train models on data from multiple sources or combine different splits of related datasets while maintaining proper data handling and metadata.
More technically, dataset concatenation:
- Combines multiple Dataset objects into a single ConcatenatedDataset
- Preserves original dataset functionality through source dataset references
- Handles column mismatches through configurable merge strategies
- Maintains proper metadata and configuration merging
- Tracks data provenance for debugging and analysis through a merged DatasetInfo
A note on dataset merging
There are three merge strategies. The merge logic is implemented through the backend abstraction (pandas or polars), so the strategies behave consistently regardless of which backend the source datasets use:
- Soft Merge: Keeps all columns from all datasets, filling missing values with NaN.
- Overlap Merge: Keeps only columns that exist in all datasets.
- Hard Merge: Requires all datasets to have identical columns, raising an error if they differ.
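To make the three strategies concrete, here is a minimal sketch on plain lists of row dictionaries. It is not the library's implementation: merge_rows is a hypothetical helper, ValueError stands in for whatever error the library raises, and missing values are filled with None where a DataFrame backend would use NaN or null.

```python
def merge_rows(datasets, merge_level="soft"):
    """Combine lists of row dicts according to a merge level (illustrative only)."""
    # Columns present in each dataset.
    column_sets = [set().union(*(row.keys() for row in rows)) for rows in datasets]
    all_columns = set().union(*column_sets)
    common_columns = set.intersection(*column_sets)

    if merge_level == "hard":
        # Every dataset must expose exactly the same columns.
        if any(cols != column_sets[0] for cols in column_sets):
            raise ValueError("hard merge requires identical columns")
        keep = all_columns
    elif merge_level == "overlap":
        keep = common_columns
    elif merge_level == "soft":
        keep = all_columns
    else:
        raise ValueError(f"unknown merge_level: {merge_level!r}")

    # Missing values become None here; a DataFrame backend would use NaN/null.
    return [
        {col: row.get(col) for col in sorted(keep)}
        for rows in datasets
        for row in rows
    ]


a = [{"path": "a.wav", "species": "robin"}]
b = [{"path": "b.wav", "family": "Cicadidae"}]

soft = merge_rows([a, b], "soft")        # union of columns, gaps filled
overlap = merge_rows([a, b], "overlap")  # only the shared "path" column

try:
    merge_rows([a, b], "hard")
    hard_failed = False
except ValueError:
    hard_failed = True
```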
Warning
If a column with the same name appears in several of the datasets being concatenated but its data type (dtype) differs between them, the resulting column may be upcast to object (pandas), or the concatenation may fail (polars). This can lead to unexpected behavior in downstream processing.
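One way to catch such mismatches early is to compare column types before concatenating. The sketch below is an assumption-laden stand-in: it works on plain row dictionaries and compares Python value types rather than real DataFrame dtypes, and find_type_conflicts is a hypothetical helper, not an alp_data function.

```python
def find_type_conflicts(datasets):
    """Return columns whose Python value types differ across datasets."""
    types_per_column = {}
    for rows in datasets:
        for row in rows:
            for column, value in row.items():
                if value is not None:  # ignore missing values
                    types_per_column.setdefault(column, set()).add(type(value).__name__)
    return {
        column: sorted(types)
        for column, types in types_per_column.items()
        if len(types) > 1
    }


# Toy data: "recording_id" is an int in one dataset and a str in the other.
a = [{"recording_id": 101, "label": "robin"}]
b = [{"recording_id": "XC-102", "label": "cicada"}]

conflicts = find_type_conflicts([a, b])
print(conflicts)
```

With real datasets the same idea would apply to the dtypes reported by the backend; casting the offending column to a common type before concatenation avoids the upcast or failure.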
How can I concatenate datasets?
Datasets can be concatenated using the ConcatenatedDataset class with different merge strategies:
Basic Usage
from alp_data.datasets import AnimalSpeak, InsectSet459
from alp_data.concat import ConcatenatedDataset
# Load individual datasets
dataset1 = AnimalSpeak(split="validation")
dataset2 = InsectSet459(split="validation")
print(f"Dataset 1 length: {len(dataset1)}")
print(f"Dataset 2 length: {len(dataset2)}")
# Concatenate with default soft merge
combined_dataset = ConcatenatedDataset(
    datasets=[dataset1, dataset2],
    merge_level="soft",  # Options: "soft", "overlap", "hard"
)
# Access the combined data
print(f"Combined dataset length: {len(combined_dataset)}")
sample = combined_dataset[0] # Get first sample (should be AnimalSpeak)
print(f"First sample: {sample.keys()}")
Create a combined dataset from a yaml config
You can also create a concatenated dataset from a YAML configuration file. The concat keyword must appear at the top level of the config, with the required datasets list inside it:
concat:
  datasets:
    - dataset_name: beans
      split: dogs_test
      output_take_and_give: null
    - dataset_name: beans
      split: esc50_validation
      output_take_and_give: null
  merge_level: soft
  transformations:
    - type: label_from_feature
      feature: label
      output_feature: label
      override: true
    - type: deduplicate
      subset: ["file_name", "label"]
      keep_first: true
This config combines two splits of the beans dataset and applies some transformations to the combined dataset.
The concat keyword is a special keyword that tells the dataset_from_config function to create a ConcatenatedDataset instead of a regular dataset. Here is the Python code for loading this config:
from alp_data import dataset_from_config
combined_dataset = dataset_from_config("path/to/concat_config.yaml")
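The dispatch on the concat keyword can be pictured roughly as follows. This is only a sketch of the idea on an already-parsed config dictionary: plan_from_config is a hypothetical helper, the shape of a regular (non-concat) config with top-level dataset_name and split keys is an assumption, and placing merge_level inside concat is likewise assumed.

```python
def plan_from_config(config):
    """Return (kind, [(dataset_name, split), ...]) for a parsed config dict."""
    if "concat" in config:
        # The special keyword is present: a datasets list is required inside it.
        datasets = (config["concat"] or {}).get("datasets")
        if not datasets:
            raise ValueError("'concat' requires a non-empty 'datasets' list")
        return "concatenated", [(d["dataset_name"], d["split"]) for d in datasets]
    # Assumed shape of a regular single-dataset config.
    return "single", [(config["dataset_name"], config["split"])]


concat_config = {
    "concat": {
        "datasets": [
            {"dataset_name": "beans", "split": "dogs_test"},
            {"dataset_name": "beans", "split": "esc50_validation"},
        ],
        "merge_level": "soft",
    }
}

kind, parts = plan_from_config(concat_config)
print(kind, parts)
```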
Apply transformations before / after concatenation
You can apply transformations to the individual datasets before concatenation, as shown in transforms.md, which lets you treat each source as needed. You can also apply transformations after concatenation to operate on the combined dataset as a whole. Here is an example of applying a filter transformation after concatenation:
from alp_data.datasets import AnimalSpeak, InsectSet459
from alp_data.transforms import FilterConfig
from alp_data.concat import ConcatenatedDataset
# Load individual datasets
dataset1 = AnimalSpeak(split="validation")
dataset2 = InsectSet459(split="validation")
# Concatenate datasets
combined_dataset = ConcatenatedDataset([dataset1, dataset2])
# Define a filter transformation
filter_config = FilterConfig(
    type="filter",
    property="species_common",
    values=["American Robin", "Bottle-nosed Dolphin"],
    mode="include",
)
# Run the transformation on the combined dataset
transform_metadata = combined_dataset.apply_transformations([filter_config])
Warning
If merge_level was set to "soft" in ConcatenatedDataset, an include filter like this one drops every row from datasets that lack the species_common column, because that column is NaN for those rows.
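The effect is easy to reproduce on toy data. In this sketch the rows mimic a soft-merged table, None stands in for NaN, and the source_dataset column is a made-up label added purely for illustration (it is not a documented alp_data column).

```python
# Rows as they might look after a soft merge: the insect rows have no
# species_common value because their source dataset lacks that column.
rows = [
    {"source_dataset": "AnimalSpeak", "species_common": "American Robin"},
    {"source_dataset": "AnimalSpeak", "species_common": "House Wren"},
    {"source_dataset": "InsectSet459", "species_common": None},
]

allowed = {"American Robin", "Bottle-nosed Dolphin"}

# An "include" filter keeps only rows whose value is in the allowed set,
# so rows with a missing value can never match.
kept = [row for row in rows if row["species_common"] in allowed]
surviving_sources = {row["source_dataset"] for row in kept}

print(len(kept), surviving_sources)
```

If rows from the other dataset should survive, filter the individual datasets before concatenation instead.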
As mentioned, you can also apply transforms to individual datasets before concatenation:
# Create and transform individual datasets
animal_dataset = AnimalSpeak(split="train")
animal_filter = FilterConfig(
    type="filter",
    property="source",
    values=["xeno-canto"],
    mode="include",
)
animal_dataset.apply_transformations([animal_filter])
insect_dataset = InsectSet459(split="train")
insect_filter = FilterConfig(
    type="filter",
    property="family",
    values=["Cicadidae", "Gryllidae"],
    mode="include",
)
insect_dataset.apply_transformations([insect_filter])
# Concatenate the transformed datasets
combined_dataset = ConcatenatedDataset(
    [animal_dataset, insect_dataset],
    merge_level="overlap",
)
Merge Strategies
The merge_level parameter controls how datasets with different columns are handled:
1. Soft Merge (Default)
Keeps all columns from all datasets, filling missing values with NaN:
# Soft merge - most permissive
combined_dataset = ConcatenatedDataset(
    [dataset1, dataset2],
    merge_level="soft",
)
2. Overlap Merge
Keeps only columns that exist in all datasets:
# Overlap merge - keeps common columns only
combined_dataset = ConcatenatedDataset(
    [dataset1, dataset2],
    merge_level="overlap",
)
3. Hard Merge
Requires all datasets to have identical columns:
# Hard merge - strictest option
combined_dataset = ConcatenatedDataset(
    [dataset1, dataset2],
    merge_level="hard",
)
Understanding the ConcatenatedDataset Class
The ConcatenatedDataset class is the result of dataset concatenation and provides several important features:
Key Properties
# Access dataset information
print(combined_dataset.info.name) # Combined dataset name
print(combined_dataset.info.description) # Merged description
print(combined_dataset.columns) # Available columns (excludes internal tracking)
print(combined_dataset.available_splits) # Always ["concatenated"]
# Sample rate handling
print(combined_dataset.sample_rate) # Unified sample rate if compatible
Data Access
The concatenated dataset maintains full functionality of individual datasets:
# Standard dataset operations
for i, sample in enumerate(combined_dataset):
    if i >= 5:  # Just show first 5
        break
    print(f"Sample {i}: {sample.keys()}")
# Direct indexing
specific_sample = combined_dataset[42]
Source Dataset Tracking
Each sample maintains information about its original source:
# The internal tracking is handled automatically
# You get the properly loaded data from the original source dataset
sample = combined_dataset[0]
# This sample was loaded using the appropriate source dataset's __getitem__ method
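Conceptually, mapping a combined index back to its source can be done with cumulative lengths and a binary search. The following is a sketch of that general technique, not the library's actual tracking code; locate is a hypothetical helper.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index, lengths):
    """Map an index in the combined dataset to (source_position, local_index)."""
    total = sum(lengths)
    if not 0 <= global_index < total:
        raise IndexError(f"index {global_index} out of range for {total} rows")
    # Cumulative end offsets, e.g. [3, 5] for source lengths [3, 2].
    offsets = list(accumulate(lengths))
    source = bisect_right(offsets, global_index)
    start = offsets[source - 1] if source > 0 else 0
    return source, global_index - start


lengths = [3, 2]  # first source has 3 rows, second has 2
mapping = [locate(i, lengths) for i in range(sum(lengths))]
print(mapping)
```

Once the source position and local index are known, the sample can be loaded through that source dataset's own `__getitem__`, which is what keeps dataset-specific loading intact.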
Configuration and Metadata Merging
DatasetInfo Merging
When datasets are concatenated, their metadata is merged like so:
- Names: Combined with "+" separator (e.g., "animalspeak+barkleycanyon")
- Owners: Deduplicated and joined with ";" separator
- Versions: Highest version is selected using semantic versioning
- Descriptions: Numbered list of original descriptions
- Sources: Deduplicated list of all sources
- Licenses: Unique licenses joined with ";" separator
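The rules above can be sketched on plain dictionaries. This is an illustration only: merge_info is a hypothetical helper, the field names and values are placeholders rather than the real DatasetInfo schema or real dataset metadata, and versions are assumed to be plain MAJOR.MINOR.PATCH strings.

```python
def merge_info(infos):
    """Merge metadata dicts following the rules listed above (illustrative)."""

    def unique(values):
        # Deduplicate while keeping first-seen order.
        return list(dict.fromkeys(values))

    def version_key(version):
        # Compare versions numerically, so "0.10.0" > "0.9.0".
        return tuple(int(part) for part in version.split("."))

    return {
        "name": "+".join(info["name"] for info in infos),
        "owner": ";".join(unique(info["owner"] for info in infos)),
        "version": max((info["version"] for info in infos), key=version_key),
        "description": "\n".join(
            f"{i}. {info['description']}" for i, info in enumerate(infos, 1)
        ),
        "sources": unique(s for info in infos for s in info["sources"]),
        "license": ";".join(unique(info["license"] for info in infos)),
    }


# Placeholder metadata, invented for this example.
info_a = {"name": "animalspeak", "owner": "team-a", "version": "0.9.0",
          "description": "First dataset", "sources": ["src1", "src2"],
          "license": "license-x"}
info_b = {"name": "barkleycanyon", "owner": "team-a", "version": "0.10.0",
          "description": "Second dataset", "sources": ["src2", "src3"],
          "license": "license-y"}

merged_info = merge_info([info_a, info_b])
print(merged_info["name"], merged_info["version"])
```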
Sample Rate Validation
Sample rates must be compatible across datasets:
# This will work if both datasets have the same sample rate
combined_dataset = ConcatenatedDataset([dataset1, dataset2])
# This will raise MergeException if sample rates differ
# (audio_16k and audio_44k stand in for datasets with different sample rates)
try:
    incompatible_dataset = ConcatenatedDataset([audio_16k, audio_44k])
except MergeException as e:
    print(f"Sample rate mismatch: {e}")
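The check itself amounts to requiring a single distinct sample rate. A minimal sketch, assuming that datasets without a sample rate (None) are ignored and using ValueError as a stand-in for MergeException; unify_sample_rate is a hypothetical helper:

```python
def unify_sample_rate(sample_rates):
    """Return the shared sample rate, or raise if the sources disagree."""
    # Assumption: sources that report no sample rate do not block the merge.
    distinct = {rate for rate in sample_rates if rate is not None}
    if len(distinct) > 1:
        raise ValueError(f"incompatible sample rates: {sorted(distinct)}")
    return distinct.pop() if distinct else None


shared = unify_sample_rate([16000, 16000, None])

try:
    unify_sample_rate([16000, 44100])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```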
Output Column Mapping
The output_take_and_give mappings are merged and validated:
from alp_data.datasets import AnimalSpeak
from alp_data.concat import ConcatenatedDataset
# Create datasets with compatible column mappings,
# i.e., either completely different mappings, each valid for its own dataset,
# or overlapping but non-conflicting mappings
dataset1 = AnimalSpeak(
    split="validation",
    output_take_and_give={"canonical_name": "species"},
)
dataset2 = AnimalSpeak(
    split="train",
    output_take_and_give={"local_path": "path"},
)
# These will be merged successfully
combined_dataset = ConcatenatedDataset([dataset1, dataset2])
# Access the merged output mappings
print(combined_dataset.output_take_and_give)
# Output: {'canonical_name': 'species', 'local_path': 'path'}
# Conflicting mappings will raise MergeException
dataset3 = AnimalSpeak(
    split="validation",
    output_take_and_give={"canonical_name": "different_name"},  # Conflict!
)
try:
    bad_combined = ConcatenatedDataset([dataset1, dataset3])
except MergeException as e:
    print(f"Mapping conflict: {e}")
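The merge-and-validate step can be pictured as a dictionary union that rejects contradictory entries. The sketch below is not the library's code: merge_mappings is a hypothetical helper, MappingConflict is a stand-in for MergeException, and treating a missing (None) mapping as empty is an assumption.

```python
class MappingConflict(Exception):
    """Stand-in for the library's MergeException (hypothetical)."""


def merge_mappings(mappings):
    """Union several column mappings, rejecting contradictory entries."""
    merged = {}
    for mapping in mappings:
        # Assumption: a dataset without a mapping (None) contributes nothing.
        for source_column, output_name in (mapping or {}).items():
            existing = merged.get(source_column)
            if existing is not None and existing != output_name:
                raise MappingConflict(
                    f"{source_column!r} maps to both {existing!r} and {output_name!r}"
                )
            merged[source_column] = output_name
    return merged


merged_mapping = merge_mappings(
    [{"canonical_name": "species"}, {"local_path": "path"}, None]
)

try:
    merge_mappings([{"canonical_name": "species"}, {"canonical_name": "different_name"}])
    conflict_detected = False
except MappingConflict:
    conflict_detected = True
```

Overlapping mappings that agree (the same source column mapped to the same output name) pass through unchanged in this sketch.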
Best Practices
1. Choose the Right Merge Strategy
- Use soft merge when datasets have different but complementary columns
- Use overlap merge when you only need common features across datasets
- Use hard merge when datasets should have identical schemas
2. Validate Before Concatenation
It can be worth running a sanity check to confirm that multiple datasets can be concatenated without issues. Incompatible datasets (e.g., merge strategy "hard" with different columns, or different sample rates) raise a MergeException when you try to concatenate them.
For example, this can be done with a check_compatibility function:
# Check dataset compatibility
def check_compatibility(datasets):
    # Collect sample rates, ignoring datasets that do not define one
    sample_rates = [getattr(ds, 'sample_rate', None) for ds in datasets]
    if len(set(sr for sr in sample_rates if sr is not None)) > 1:
        print("Warning: Different sample rates detected")
    # Count the columns shared by every dataset
    columns = [set(ds._data.columns) for ds in datasets]
    common_cols = set.intersection(*columns)
    print(f"Common columns: {len(common_cols)}")
check_compatibility([dataset1, dataset2])
Limitations and Considerations
Current Limitations
- Memory Usage: All source datasets remain in memory (streaming=True is not supported for concatenation; use ChainedDataset for streaming).
- Single Split: Concatenated datasets only support the "concatenated" split.
- Uniform Backend: All source datasets must use the same backend type (e.g., all polars or all pandas).
Performance Considerations
- Concatenation creates a new DataFrame, which uses additional memory
- Source dataset references are maintained, so original datasets aren't garbage collected
- Index lookups require mapping back to source datasets
Function Reference
A dataset created by concatenating multiple datasets.
This dataset maintains references to the original datasets to enable proper audio loading and other dataset-specific functionality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| datasets | list[Dataset] | List of datasets to concatenate | required |
| merge_level | (hard, overlap, soft) | Strategy for handling different columns. "hard": all columns must match exactly across all datasets. "overlap": keep only common columns across all datasets. "soft": keep all columns from all datasets (fill missing with NaN) | "hard" |
Examples:
>>> from alp_data.datasets import InsectSet459, BirdSet
>>> from alp_data.concat import ConcatenatedDataset
>>> dataset1 = InsectSet459(split="validation")
>>> dataset2 = BirdSet(split="HSN-test")
>>> ds = ConcatenatedDataset([dataset1, dataset2], merge_level="soft")
>>> assert len(ds) > 0, "Concatenated dataset should not be empty"
>>> assert len(ds) == len(dataset1) + len(dataset2), "Concatenated dataset length should match sum of source datasets lengths"
Source code in alp_data/concat.py
from_config(concat_config)
classmethod
Create a ConcatenatedDataset from a ConcatConfig object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| concat_config | ConcatConfig | Configuration object specifying the datasets to concatenate and how to merge them. | required |
Returns:
| Type | Description |
|---|---|
| tuple[ConcatenatedDataset, dict] | A tuple containing the ConcatenatedDataset instance and metadata about transformations applied. |