alp_data.concat Module

Combining Datasets: Concatenation vs Chaining

ESP Data provides two ways to combine multiple datasets: ConcatenatedDataset and ChainedDataset. Choose based on whether you need to transform the combined data:

| Feature | ConcatenatedDataset | ChainedDataset |
|---|---|---|
| Use case | Transform combined data | Simple iteration |
| DataFrames | Merged into one | Not merged |
| Transformations | Supported on joined data | Not supported |
| Merge strategies | Hard, overlap, soft | N/A |
| Streaming mode | Not supported | Supported |
| Memory | Holds merged DataFrame | Lightweight |

Use ConcatenatedDataset when you need to apply transforms (filter, deduplicate, etc.) to the combined dataset. Use ChainedDataset when you simply want to iterate over multiple datasets sequentially without any joint transformations.

For ChainedDataset documentation, see chain.md.
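
The distinction can be illustrated without the library. The sketch below uses plain Python lists of dicts as stand-ins for datasets (an assumption made purely for illustration; it does not use the ESP Data API). Chaining iterates lazily over the sources, whereas concatenation materializes one combined table that can then be transformed as a whole, here with a simple deduplication.

```python
from itertools import chain

# Stand-ins for two datasets: plain lists of row dicts (illustrative only).
birds = [{"file": "a.wav", "label": "robin"}, {"file": "b.wav", "label": "wren"}]
insects = [{"file": "c.wav", "label": "cicada"}, {"file": "a.wav", "label": "robin"}]

# Chaining: lazy, sequential iteration. Nothing is merged or copied.
chained_labels = [row["label"] for row in chain(birds, insects)]

# Concatenation: build one combined table, then transform it jointly.
combined = birds + insects
seen = set()
deduplicated = []
for row in combined:
    key = (row["file"], row["label"])
    if key not in seen:  # keep the first occurrence of each (file, label) pair
        seen.add(key)
        deduplicated.append(row)

print(chained_labels)     # ['robin', 'wren', 'cicada', 'robin']
print(len(deduplicated))  # 3
```

The duplicate row only becomes visible once both sources sit in the same table, which is why joint transformations need the concatenated form.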


What is Dataset concatenation?

The concat module provides utilities for combining multiple ESP datasets into a single unified dataset. This is particularly useful when you want to train models on data from multiple sources or combine different splits of related datasets while maintaining proper data handling and metadata.

More technically, dataset concatenation:

  • Combines multiple Dataset objects into a single ConcatenatedDataset
  • Preserves original dataset functionality through source dataset references
  • Handles column mismatches through configurable merge strategies
  • Maintains proper metadata and configuration merging
  • Tracks data provenance for debugging and analysis through a merged DatasetInfo

A note on dataset merging

There are three merge strategies. The merge logic is implemented through the backend abstraction (pandas or polars), so the strategies behave consistently regardless of which backend the source datasets use:

  1. Soft Merge: Keeps all columns from all datasets, filling missing values with NaN.
  2. Overlap Merge: Keeps only columns that exist in all datasets.
  3. Hard Merge: Requires all datasets to have identical columns, raising an error if they differ.
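
To make the three strategies concrete, here is a minimal sketch that applies them to lists of row dicts. The merge_rows function is hypothetical and is not part of the library, which performs the merge on pandas or polars DataFrames. The sketch also fills missing values with None rather than NaN.

```python
def merge_rows(tables, merge_level="soft"):
    """Combine lists of row dicts using one of three column strategies.

    Hypothetical stand-in: the library merges DataFrames, not lists of dicts.
    """
    # Columns present in each table
    column_sets = [set().union(*(row.keys() for row in table)) for table in tables]
    if merge_level == "hard":
        if any(cols != column_sets[0] for cols in column_sets):
            raise ValueError("hard merge requires identical columns")
        keep = column_sets[0]
    elif merge_level == "overlap":
        keep = set.intersection(*column_sets)
    elif merge_level == "soft":
        keep = set().union(*column_sets)
    else:
        raise ValueError(f"unknown merge_level: {merge_level!r}")
    # Missing values become None here; the library fills them with NaN.
    return [
        {col: row.get(col) for col in sorted(keep)}
        for table in tables
        for row in table
    ]


a = [{"path": "a.wav", "species": "robin"}]
b = [{"path": "b.wav", "family": "Cicadidae"}]

soft = merge_rows([a, b], "soft")        # all three columns, gaps filled
overlap = merge_rows([a, b], "overlap")  # only "path" survives
```

With these inputs, a "hard" merge raises an error because the two column sets differ.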

Warning

If a column with the same name appears in several of the datasets being concatenated but with different data types (dtypes), the resulting column may be upcast to object (pandas), or the concatenation may fail (polars). Either outcome can lead to unexpected behavior in downstream processing.
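
One way to catch this early is to compare column dtypes before concatenating. The helper below is a hypothetical sketch, not a library function: it takes a mapping from dataset name to a {column: dtype name} dict (for pandas-backed data, such a mapping could be built from DataFrame.dtypes) and reports every column whose dtype differs between datasets. The schemas shown are invented for illustration.

```python
def find_dtype_conflicts(schemas):
    """Return columns whose declared dtype differs between datasets.

    `schemas` maps a dataset name to a {column: dtype_name} dict.
    Both the function and its input format are hypothetical.
    """
    seen = {}
    for dataset_name, schema in schemas.items():
        for column, dtype in schema.items():
            seen.setdefault(column, {})[dataset_name] = dtype
    # Keep only columns that were seen with more than one distinct dtype
    return {
        column: by_dataset
        for column, by_dataset in seen.items()
        if len(set(by_dataset.values())) > 1
    }


# Invented schemas: "label" is a string in one dataset and an integer in the other.
schemas = {
    "dataset_a": {"label": "str", "duration": "float64"},
    "dataset_b": {"label": "int64", "duration": "float64"},
}
conflicts = find_dtype_conflicts(schemas)
print(conflicts)  # {'label': {'dataset_a': 'str', 'dataset_b': 'int64'}}
```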

How can I concatenate datasets?

Datasets can be concatenated using the ConcatenatedDataset class with different merge strategies:

Basic Usage

from alp_data.datasets import AnimalSpeak, InsectSet459
from alp_data.concat import ConcatenatedDataset

# Load individual datasets
dataset1 = AnimalSpeak(split="validation")
dataset2 = InsectSet459(split="validation")

print(f"Dataset 1 length: {len(dataset1)}")
print(f"Dataset 2 length: {len(dataset2)}")

# Concatenate with default soft merge
combined_dataset = ConcatenatedDataset(
    datasets=[dataset1, dataset2],
    merge_level="soft"  # Options: "soft", "overlap", "hard"
)

# Access the combined data
print(f"Combined dataset length: {len(combined_dataset)}")
sample = combined_dataset[0]  # Get first sample (should be AnimalSpeak)
print(f"First sample: {sample.keys()}")

Create a combined dataset from a YAML config

You can also create a concatenated dataset from a YAML configuration file. Note the concat keyword at the top level of the config and the required datasets list inside it:

concat:
  datasets:
    - dataset_name: beans
      split: dogs_test
      output_take_and_give: null
    - dataset_name: beans
      split: esc50_validation
      output_take_and_give: null
  merge_level: soft
  transformations:
    - type: label_from_feature
      feature: label
      output_feature: label
      override: true
    - type: deduplicate
      subset: ["file_name", "label"]
      keep_first: true

Here, we concatenate two splits of the beans dataset and apply transformations to the combined dataset. The concat keyword is reserved: it tells the dataset_from_config function to create a ConcatenatedDataset instead of a regular dataset. The Python code for loading this config is:

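from alp_data import dataset_from_config

# dataset_from_config returns the dataset together with transformation metadata
combined_dataset, transform_metadata = dataset_from_config("path/to/concat_config.yaml")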
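
The structural requirements of such a config (a top-level concat key containing a non-empty datasets list) can be sketched as a small check on the parsed config. The validate_concat_config function is hypothetical; the library's own validation, presumably done through ConcatConfig, may differ. The "soft" fallback mirrors the constructor default.

```python
def validate_concat_config(config):
    """Check the minimal structure of a parsed concat config (hypothetical helper)."""
    if "concat" not in config:
        raise ValueError("missing top-level 'concat' key")
    concat = config["concat"]
    datasets = concat.get("datasets")
    if not isinstance(datasets, list) or not datasets:
        raise ValueError("'concat.datasets' must be a non-empty list")
    merge_level = concat.get("merge_level", "soft")
    if merge_level not in {"hard", "overlap", "soft"}:
        raise ValueError(f"invalid merge_level: {merge_level!r}")
    return merge_level, len(datasets)


# The YAML example above, written as the dict a YAML parser would produce
config = {
    "concat": {
        "datasets": [
            {"dataset_name": "beans", "split": "dogs_test"},
            {"dataset_name": "beans", "split": "esc50_validation"},
        ],
        "merge_level": "soft",
    }
}
result = validate_concat_config(config)
print(result)  # ('soft', 2)
```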

Apply transformations before / after concatenation

You can apply transformations to the individual datasets before concatenation, as shown in transforms.md, which lets you treat each source separately. You can also apply transformations after concatenation to operate on the combined dataset as a whole. Here is an example of applying a filter transformation after concatenation:

from alp_data.datasets import AnimalSpeak, InsectSet459
from alp_data.transforms import FilterConfig
from alp_data.concat import ConcatenatedDataset

# Load individual datasets
dataset1 = AnimalSpeak(split="validation")
dataset2 = InsectSet459(split="validation")
# Concatenate datasets
combined_dataset = ConcatenatedDataset([dataset1, dataset2])
# Define a filter transformation
filter_config = FilterConfig(
    type="filter",
    property="species_common",
    values=["American Robin", "Bottle-nosed Dolphin"],
    mode="include"
)

# Run the transformation on the combined dataset
transform_metadata = combined_dataset.apply_transformations([filter_config])

Warning

If merge_level was set to "soft" in ConcatenatedDataset, a filter transformation like this drops every row from datasets that lack the species_common column, because the soft merge fills that column with NaN for those rows and NaN never matches the included values.
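
The effect can be reproduced with plain Python. The sketch below assumes a soft-merged table in which the third row comes from a source without a species_common column. It models an include filter as a simple membership test, which is an assumption about the filter's behavior rather than the library's implementation.

```python
import math

# Rows after a soft merge: the third row's source has no species_common
# column, so the value was filled with NaN.
rows = [
    {"path": "a.wav", "species_common": "American Robin"},
    {"path": "b.wav", "species_common": "Mallard"},
    {"path": "c.wav", "species_common": math.nan},
]

allowed = {"American Robin", "Bottle-nosed Dolphin"}

# An include filter keeps only rows whose value is in the allowed set.
# NaN is never a member, so every row from the column-less source is dropped.
kept = [row for row in rows if row["species_common"] in allowed]

print([row["path"] for row in kept])  # ['a.wav']
```

If rows from the column-less source should survive, filter that source's counterpart column before concatenation instead.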

As mentioned, you can also apply transforms to individual datasets before concatenation:

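from alp_data.datasets import AnimalSpeak, InsectSet459
from alp_data.transforms import FilterConfig
from alp_data.concat import ConcatenatedDataset

# Create and transform individual datasets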
animal_dataset = AnimalSpeak(split="train")
animal_filter = FilterConfig(
    type="filter",
    property="source",
    values=["xeno-canto"],
    mode="include"
)
animal_dataset.apply_transformations([animal_filter])

insect_dataset = InsectSet459(split="train")
insect_filter = FilterConfig(
    type="filter",
    property="family",
    values=["Cicadidae", "Gryllidae"],
    mode="include"
)
insect_dataset.apply_transformations([insect_filter])

# Concatenate the transformed datasets
combined_dataset = ConcatenatedDataset(
    [animal_dataset, insect_dataset],
    merge_level="overlap"
)

Merge Strategies

The merge_level parameter controls how datasets with different columns are handled:

1. Soft Merge (Default)

Keeps all columns from all datasets, filling missing values with NaN:

# Soft merge - most permissive
combined_dataset = ConcatenatedDataset(
    [dataset1, dataset2],
    merge_level="soft"
)

2. Overlap Merge

Keeps only columns that exist in all datasets:

# Overlap merge - keeps common columns only
combined_dataset = ConcatenatedDataset(
    [dataset1, dataset2],
    merge_level="overlap"
)

3. Hard Merge

Requires all datasets to have identical columns:

# Hard merge - strictest option
combined_dataset = ConcatenatedDataset(
    [dataset1, dataset2],
    merge_level="hard"
)

Understanding the ConcatenatedDataset Class

The ConcatenatedDataset class is the result of dataset concatenation and provides several important features:

Key Properties

# Access dataset information
print(combined_dataset.info.name)  # Combined dataset name
print(combined_dataset.info.description)  # Merged description
print(combined_dataset.columns)  # Available columns (excludes internal tracking)
print(combined_dataset.available_splits)  # Always ["concatenated"]

# Sample rate handling
print(combined_dataset.sample_rate)  # Unified sample rate if compatible

Data Access

The concatenated dataset maintains full functionality of individual datasets:

# Standard dataset operations
for i, sample in enumerate(combined_dataset):
    if i >= 5:  # Just show first 5
        break
    print(f"Sample {i}: {sample.keys()}")

# Direct indexing
specific_sample = combined_dataset[42]

Source Dataset Tracking

Each sample maintains information about its original source:

# The internal tracking is handled automatically
# You get the properly loaded data from the original source dataset
sample = combined_dataset[0]
# This sample was loaded using the appropriate source dataset's __getitem__ method

Configuration and Metadata Merging

DatasetInfo Merging

When datasets are concatenated, their metadata is merged like so:

  • Names: Combined with "+" separator (e.g., "animalspeak+barkleycanyon")
  • Owners: Deduplicated and joined with ";" separator
  • Versions: Highest version is selected using semantic versioning
  • Descriptions: Numbered list of original descriptions
  • Sources: Deduplicated list of all sources
  • Licenses: Unique licenses joined with ";" separator
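
The rules above can be sketched on plain dicts. The merge_info function and the metadata values are hypothetical placeholders, the exact format of the numbered description is a guess, and the version comparison assumes plain MAJOR.MINOR.PATCH strings without pre-release tags.

```python
def merge_info(infos):
    """Merge metadata dicts following the rules listed above (hypothetical sketch)."""

    def unique(values):
        return list(dict.fromkeys(values))  # order-preserving deduplication

    def version_key(version):
        # "0.10.1" -> (0, 10, 1), so versions compare numerically, not as text
        return tuple(int(part) for part in version.split("."))

    return {
        "name": "+".join(info["name"] for info in infos),
        "owner": ";".join(unique(info["owner"] for info in infos)),
        "version": max((info["version"] for info in infos), key=version_key),
        "description": "\n".join(
            f"{i}. {info['description']}" for i, info in enumerate(infos, start=1)
        ),
        "sources": unique(src for info in infos for src in info["sources"]),
        "license": ";".join(unique(info["license"] for info in infos)),
    }


# Placeholder metadata, invented for illustration
infos = [
    {"name": "animalspeak", "owner": "team-a", "version": "0.9.0",
     "description": "First dataset", "sources": ["source-x"], "license": "CC-BY-4.0"},
    {"name": "barkleycanyon", "owner": "team-a", "version": "0.10.1",
     "description": "Second dataset", "sources": ["source-y", "source-x"],
     "license": "CC0-1.0"},
]
merged = merge_info(infos)
print(merged["name"])     # animalspeak+barkleycanyon
print(merged["version"])  # 0.10.1
```

Note that a plain string comparison would pick "0.9.0" here, which is why the sketch compares versions as integer tuples.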

Sample Rate Validation

Sample rates must be compatible across datasets:

from alp_data.concat import ConcatenatedDataset, MergeException

# This will work if both datasets have the same sample rate
combined_dataset = ConcatenatedDataset([dataset1, dataset2])

# This will raise MergeException if sample rates differ
# (audio_16k and audio_44k are placeholders for datasets with different sample rates)
try:
    incompatible_dataset = ConcatenatedDataset([audio_16k, audio_44k])
except MergeException as e:
    print(f"Sample rate mismatch: {e}")

Output Column Mapping

The output_take_and_give mappings are merged and validated:

from alp_data.datasets import AnimalSpeak
from alp_data.concat import ConcatenatedDataset, MergeException

# Create datasets with compatible column mappings
# i.e., either completely different mappings valid to each dataset individually,
# or overlapping, but not conflicting mappings
dataset1 = AnimalSpeak(
    split="validation",
    output_take_and_give={"canonical_name": "species"}
)
dataset2 = AnimalSpeak(
    split="train",
    output_take_and_give={"local_path": "path"}
)

# These will be merged successfully
combined_dataset = ConcatenatedDataset([dataset1, dataset2])
# Access the merged output mappings
print(combined_dataset.output_take_and_give)
# Output: {'canonical_name': 'species', 'local_path': 'path'}

# Conflicting mappings will raise MergeException
dataset3 = AnimalSpeak(
    split="validation",
    output_take_and_give={"canonical_name": "different_name"}  # Conflict!
)

try:
    bad_combined = ConcatenatedDataset([dataset1, dataset3])
except MergeException as e:
    print(f"Mapping conflict: {e}")
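
The merge-and-validate step can be sketched as a dictionary merge that rejects conflicting targets. The merge_mappings function is hypothetical and raises ValueError to stay self-contained, whereas the library raises MergeException. Whether the library performs further checks, such as two source columns mapped to the same output name, is not covered here.

```python
def merge_mappings(mappings):
    """Merge column-rename mappings, rejecting conflicting targets (sketch)."""
    merged = {}
    for mapping in mappings:
        # A dataset without a mapping (None) contributes nothing
        for source_col, target_col in (mapping or {}).items():
            if source_col in merged and merged[source_col] != target_col:
                raise ValueError(
                    f"conflict for {source_col!r}: "
                    f"{merged[source_col]!r} vs {target_col!r}"
                )
            merged[source_col] = target_col
    return merged


compatible = merge_mappings([{"canonical_name": "species"}, {"local_path": "path"}])
print(compatible)  # {'canonical_name': 'species', 'local_path': 'path'}

try:
    merge_mappings([{"canonical_name": "species"}, {"canonical_name": "different_name"}])
    conflict_detected = False
except ValueError:
    conflict_detected = True
```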

Best Practices

1. Choose the Right Merge Strategy

  • Use soft merge when datasets have different but complementary columns
  • Use overlap merge when you only need common features across datasets
  • Use hard merge when datasets should have identical schemas

2. Validate Before Concatenation

It can be worth running a sanity check to confirm that the datasets can be concatenated without issues. Incompatible datasets (e.g., merge strategy "hard" with differing columns, or differing sample rates) raise a MergeException when you try to concatenate them.

For example, a check_compatibility helper along these lines reports potential problems up front:

# Check dataset compatibility
def check_compatibility(datasets):
    sample_rates = [getattr(ds, 'sample_rate', None) for ds in datasets]
    if len(set(sr for sr in sample_rates if sr is not None)) > 1:
        print("Warning: Different sample rates detected")

    columns = [set(ds._data.columns) for ds in datasets]
    common_cols = set.intersection(*columns)
    print(f"Common columns: {len(common_cols)}")

check_compatibility([dataset1, dataset2])

Limitations and Considerations

Current Limitations

  1. Memory Usage: All source datasets remain in memory (streaming=True is not supported for concatenation; use ChainedDataset for streaming).
  2. Single Split: Concatenated datasets only support the "concatenated" split.
  3. Uniform Backend: All source datasets must use the same backend type (e.g., all polars or all pandas).

Performance Considerations

  • Concatenation creates a new DataFrame, which uses additional memory
  • Source dataset references are maintained, so original datasets aren't garbage collected
  • Index lookups require mapping back to source datasets

Function Reference

ConcatenatedDataset

A dataset created by concatenating multiple datasets.

This dataset maintains references to the original datasets to enable proper audio loading and other dataset-specific functionality.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| datasets | list[Dataset] | List of datasets to concatenate | required |
| merge_level | {"hard", "overlap", "soft"} | Strategy for handling different columns. "hard": all columns must match exactly across all datasets. "overlap": keep only common columns across all datasets. "soft": keep all columns from all datasets (fill missing with NaN). | "soft" |

Examples:

>>> from alp_data.datasets import InsectSet459, BirdSet
>>> from alp_data.concat import ConcatenatedDataset
>>> dataset1 = InsectSet459(split="validation")
>>> dataset2 = BirdSet(split="HSN-test")
>>> ds = ConcatenatedDataset([dataset1, dataset2], merge_level="soft")
>>> assert len(ds) > 0, "Concatenated dataset should not be empty"
>>> assert len(ds) == len(dataset1) + len(dataset2), \
...     "Concatenated dataset length should match sum of source datasets lengths"

Source code in alp_data/concat.py
@register_dataset
class ConcatenatedDataset(Dataset):
    """A dataset created by concatenating multiple datasets.

    This dataset maintains references to the original datasets to enable
    proper audio loading and other dataset-specific functionality.

    Parameters
    ----------
    datasets : list[Dataset]
        List of datasets to concatenate
    merge_level : {"hard", "overlap", "soft"}, default="soft"
        Strategy for handling different columns
        - "hard": All columns must match exactly across all datasets
        - "overlap": Keep only common columns across all datasets
        - "soft": Keep all columns from all datasets (fill missing with NaN)

    Examples
    --------
    >>> from alp_data.datasets import InsectSet459, BirdSet
    >>> from alp_data.concat import ConcatenatedDataset
    >>> dataset1 = InsectSet459(split="validation")
    >>> dataset2 = BirdSet(split="HSN-test")
    >>> ds = ConcatenatedDataset([dataset1, dataset2], merge_level="soft")
    >>> assert len(ds) > 0, "Concatenated dataset should not be empty"
    >>> assert len(ds) == len(dataset1) + len(dataset2), \
        "Concatenated dataset length should match sum of source datasets lengths"
    """

    info = DatasetInfo(
        name="concatenated_dataset",
        owner="ESP Data Team",
        split_paths={"concatenated": "virtual://concatenated_dataset"},
        version="0.1.0",
        description="A dataset created by concatenating multiple datasets.",
        sources=["Multiple datasets"],
        license="CC0-1.0",
    )

    def __init__(
        self,
        datasets: list[Dataset],
        merge_level: Literal["hard", "overlap", "soft"] = "soft",
    ) -> None:
        # Validate inputs
        if not datasets:
            raise MergeException("At least one dataset must be provided")

        if not all(isinstance(ds, Dataset) for ds in datasets):
            raise MergeException("All objects must be Dataset instances")

        backend_type = getattr(datasets[0], "_backend_class", None) if datasets else None
        # Make sure all backend types are the same
        if not backend_type or not all(
            getattr(ds, "_backend_class", None) == backend_type for ds in datasets
        ):
            raise MergeException(
                "All datasets must have the same backend type "
                "to be concatenated into a ConcatenatedDataset."
            )

        # Check that streaming is False for ConcatenatedDataset
        if any(ds.streaming for ds in datasets):
            raise MergeException(
                "Concatenation is only allowed with streaming=False "
                "because transforms need to be performed on the whole dataset"
            )

        super().__init__(
            backend=backend_type.__name__.replace("Backend", "").lower(), streaming=False
        )

        (
            self._data,
            self.info,
            self._source_datasets,
            self.sample_rate,
            output_take_and_give,
        ) = concatenate_datasets(datasets, merge_level=merge_level)

        self.output_take_and_give = output_take_and_give
        self.split = "concatenated"
        if len(self._data) == 0:
            raise ValueError("Concatenated dataset is empty. Check input datasets or merge level.")

    @property
    def columns(self) -> list[str]:
        # Filter out internal tracking columns
        all_columns = self._data.columns
        return [col for col in all_columns if not col.startswith("_source_")]

    @property
    def available_splits(self) -> list[str]:
        return ["concatenated"]

    def _load(self) -> None:
        pass  # Data is already loaded

    @classmethod
    def from_config(
        cls, concat_config: ConcatConfig
    ) -> tuple["ConcatenatedDataset", dict[str, Any]]:
        """Create a ConcatenatedDataset from a ConcatConfig object.

        Parameters
        ----------
        concat_config : ConcatConfig
            Configuration object specifying the datasets to concatenate
            and how to merge them.

        Returns
        -------
        tuple[ConcatenatedDataset, dict]
            A tuple containing the ConcatenatedDataset instance
            and metadata about transformations applied.
        """
        datasets = [dataset_from_config(cfg)[0] for cfg in concat_config.datasets]
        ds = cls(
            datasets,
            merge_level=concat_config.merge_level,
        )

        if concat_config.transformations:
            transform_metadata = ds.apply_transformations(concat_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, idx: int) -> dict[str, Any]:
        if idx >= len(self._data):
            raise IndexError(f"Index {idx} out of bounds for dataset of length {len(self._data)}")

        # Get row as dict from backend
        # This dict has transforms applied at concat level
        row = self._data[idx]

        # Determine which source dataset this row came from
        source_dataset_idx = int(row["_source_dataset"])
        source_row_idx = int(row["_source_index"])
        source_dataset = self._source_datasets[source_dataset_idx]

        # Get the original item from the source dataset
        # Temporarily disable the source's output_take_and_give to get raw data
        original_otag = source_dataset.output_take_and_give
        source_dataset.output_take_and_give = None
        try:
            source_item = source_dataset[source_row_idx]
        except Exception as e:
            raise RuntimeError(
                f"Failed to load item {source_row_idx} "
                f"from source dataset {source_dataset_idx}: {e}"
            ) from e
        finally:
            # Always restore the mapping, even if loading failed
            source_dataset.output_take_and_give = original_otag

        # Merge with row
        source_item.update(row)

        # Apply the concatenated dataset's output_take_and_give mapping
        if self.output_take_and_give:
            mapped_item = {}
            for key, value in self.output_take_and_give.items():
                if key in source_item:
                    mapped_item[value] = source_item[key]
            return mapped_item

        return source_item

    def __iter__(self) -> Iterator[dict[str, Any]]:
        for idx in range(len(self)):
            yield self[idx]

    def __str__(self) -> str:
        return (
            f"{self.info.name} (v{self.info.version})\n"
            f"Description: {self.info.description}\n"
            f"Length: {len(self)}\n"
            f"Columns: {', '.join(self.columns)}\n"
            f"Source datasets: {len(self._source_datasets)}"
        )

from_config(concat_config) classmethod

Create a ConcatenatedDataset from a ConcatConfig object.

Parameters:

Name Type Description Default
concat_config ConcatConfig

Configuration object specifying the datasets to concatenate and how to merge them.

required

Returns:

Type Description
tuple[ConcatenatedDataset, dict]

A tuple containing the ConcatenatedDataset instance and metadata about transformations applied.


MergeException

Exception raised when dataset concatenation fails.

Source code in alp_data/concat.py
class MergeException(Exception):
    """Exception raised when dataset concatenation fails."""

    pass