`alp_data.datasets` module

What are ESP Datasets?

The datasets module provides a collection of datasets validated by the engineering team for ESP projects. In short, this is the module to use to download, load and manipulate official ESP datasets. Each dataset is implemented as a class that inherits from the base Dataset class, providing a consistent interface for data loading and access.

More technically, an ESP Dataset:
- Inherits from the base Dataset class
- Has a defined DatasetInfo containing metadata
- Provides methods for loading and accessing data and splits
- Can be configured through a DatasetConfig
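As a rough illustration of that contract, the sketch below uses stand-in classes rather than the real ones. `SketchInfo`, `SketchDataset`, and `ToyCalls` are hypothetical names invented for this example and are not part of alp_data; the rows are kept in memory instead of being read from a split file.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchInfo:
    """Stand-in metadata record (hypothetical, not alp_data's DatasetInfo)."""

    name: str
    version: str
    split_paths: dict[str, str] = field(default_factory=dict)


class SketchDataset:
    """Stand-in base class exposing a small loading/access interface."""

    info: SketchInfo

    def __init__(self, split: str) -> None:
        # Reject unknown splits before loading anything.
        if split not in self.info.split_paths:
            raise LookupError(f"Invalid split: {split}")
        self.split = split
        self._rows = self._load()

    def _load(self) -> list[dict[str, Any]]:
        raise NotImplementedError

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: int) -> dict[str, Any]:
        return self._rows[idx]


class ToyCalls(SketchDataset):
    """A toy dataset: class-level metadata plus a loader."""

    info = SketchInfo(
        name="toy-calls",
        version="0.0.1",
        split_paths={"train": "train.csv", "validation": "val.csv"},
    )

    def _load(self) -> list[dict[str, Any]]:
        # In-memory rows stand in for reading the split's CSV file.
        return [
            {"species": "frog", "split": self.split},
            {"species": "owl", "split": self.split},
        ]


toy = ToyCalls(split="validation")
print(toy.info.name, len(toy), toy[0])
```

The real classes documented further down this page follow the same overall shape: a class-level info object, a split check in the loader, and `__len__`/`__getitem__` for access.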

How to Load Datasets?

Datasets can be loaded in three different ways:

Direct instantiation:

from alp_data.datasets import AnimalSpeak

# Create a dataset instance
dataset = AnimalSpeak(
    split="validation"
)

# Access data
sample = dataset[0]  # Get first sample
print(len(dataset))  # Get dataset size

Using configuration:

from alp_data import DatasetConfig
from alp_data.datasets import AnimalSpeak

# Create a configuration
config = DatasetConfig(
    dataset_name="animalspeak",
    split="validation",
)

# Create dataset from config
# This returns a tuple, the dataset and a dictionary of metadata
# The metadata is generated by any transforms in the config which
# are applied to the dataset
dataset, _ = AnimalSpeak.from_config(config)

From a config yaml file:

For a single dataset, your YAML config file should look like this (see Concatenate for multiple datasets). Note that the dataset key at the top level is required.

dataset:
  dataset_name: AnimalSpeak
  split: validation
  output_take_and_give:
    labels: label
  data_root: null
  transformations:
    - type: deduplicate
      subset: null

    - type: label_from_feature
      feature: species_common
      output_feature: label
      override: true

from alp_data import dataset_from_config

ds, transform_metadata = dataset_from_config("path/to/config.yaml")

print(len(ds))
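Once parsed, the YAML file above is just a nested mapping. The sketch below uses a plain dictionary in place of the parsed file to show why the top-level dataset key matters. The helper `get_dataset_section` is hypothetical and is not part of alp_data; how `dataset_from_config` validates its input is not shown on this page.

```python
from typing import Any


def get_dataset_section(parsed: dict[str, Any]) -> dict[str, Any]:
    """Return the mapping stored under the top-level 'dataset' key.

    Hypothetical helper for illustration only.
    """
    if "dataset" not in parsed:
        raise KeyError("config must have a top-level 'dataset' key")
    return parsed["dataset"]


# Dictionary equivalent of the YAML example above.
parsed_config = {
    "dataset": {
        "dataset_name": "AnimalSpeak",
        "split": "validation",
        "output_take_and_give": {"labels": "label"},
        "data_root": None,
        "transformations": [
            {"type": "deduplicate", "subset": None},
            {
                "type": "label_from_feature",
                "feature": "species_common",
                "output_feature": "label",
                "override": True,
            },
        ],
    }
}

section = get_dataset_section(parsed_config)
# Transformations are listed in the order they are applied.
transform_types = [t["type"] for t in section.get("transformations") or []]
print(section["split"], transform_types)
```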

Dataset Configuration

Datasets can be configured further through parameters, some of which are common to all datasets and some of which are specific to a single dataset. Common arguments are:

split: The data split to use (e.g., "train", "validation")
output_take_and_give: Column picker and name mappings. This is used to:
- Pick the columns you want in the output dictionary returned when __getitem__ is called, e.g. via sample = dataset[0].
- Rename the columns in the output dictionary. For example, if you want to rename the "audio" column to "raw_wav", you can specify {"audio": "raw_wav"}.
sample_rate: Target audio sample rate (for audio datasets, it will resample to this rate).
data_root: Custom root directory for data files. If not specified, data_root defaults to the parent directory of the split path. The idea is that the data may be copied from its original location (usually a bucket) to a local disk or a folder on the shared NFS.
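The pick-and-rename behaviour of output_take_and_give can be sketched in a few lines of plain Python. This mirrors the logic at the end of the `_process` methods in the source listings further down this page; the function name `take_and_give` and the sample row are made up for the example.

```python
from typing import Any, Optional


def take_and_give(row: dict[str, Any], mapping: Optional[dict[str, str]]) -> dict[str, Any]:
    """Keep only the mapped columns and rename them.

    With no mapping, the row is passed through unchanged.
    """
    if not mapping:
        return row
    return {new_name: row[old_name] for old_name, new_name in mapping.items()}


row = {"audio": [0.0, 0.1], "sample_rate": 16000, "species_common": "barn owl"}

# "audio" is renamed to "raw_wav", "species_common" to "label";
# "sample_rate" is dropped because it is not in the mapping.
picked = take_and_give(row, {"audio": "raw_wav", "species_common": "label"})
untouched = take_and_give(row, None)
print(picked)
```

Note that the mapping acts as a filter as well as a renamer: any column you want in the output must be listed, even if it keeps its name (for example `{"sample_rate": "sample_rate"}`).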

Using Transforms with Datasets

Datasets can be combined with Transforms to modify or enhance the data during loading. Transforms modify the data in place, so the dataset you get back is effectively a different version of the original data.

Basic Usage with Transforms

Transforms can be applied sequentially: first create the original dataset, then apply one or more transforms:

Remark

The order of the transforms matters. Multiple transforms are applied in the order in which they are defined in the configuration. For example, if you change the name of a column with LabelFromFeatureTransform, it will affect any Filter Transform applied after it.

from alp_data.datasets import AnimalSpeak
from alp_data.transforms import FilterConfig, LabelFromFeatureConfig

# Create a dataset
aspeak_output_map = {
    "audio": "raw_wav"  # maps  the "audio" column to "raw_wav" in output
}
dataset = AnimalSpeak(split="validation", output_take_and_give=aspeak_output_map)

# Create transform configurations
filter_config = FilterConfig(
    type="filter",
    property="source",
    values=["xeno-canto", "iNaturalist"],
    mode="include"
)

label_from_feature_config = LabelFromFeatureConfig(
    type="label_from_feature",
    feature="canonical_name",
    output_feature="label"
)

dataset.apply_transformations([filter_config, label_from_feature_config])
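To make the remark about ordering concrete, here is a toy sketch on a list of dictionaries. The two functions are simplified stand-ins written for this example; they are not the alp_data transforms and make no claim about how those are implemented.

```python
from typing import Any

Rows = list[dict[str, Any]]


def label_from_feature(rows: Rows, feature: str, output_feature: str) -> Rows:
    """Copy `feature` into `output_feature` on every row (toy stand-in)."""
    return [{**r, output_feature: r[feature]} for r in rows]


def filter_rows(rows: Rows, prop: str, values: list[str]) -> Rows:
    """Keep rows whose `prop` is in `values`; raises KeyError if the column is missing."""
    return [r for r in rows if r[prop] in values]


rows = [
    {"canonical_name": "Tyto alba", "source": "xeno-canto"},
    {"canonical_name": "Hyla arborea", "source": "other"},
]

# Label first, then filter on the new "label" column: this works.
labeled = label_from_feature(rows, "canonical_name", "label")
kept = filter_rows(labeled, "label", ["Tyto alba"])

# Filtering on "label" before it exists fails.
try:
    filter_rows(rows, "label", ["Tyto alba"])
    order_error = None
except KeyError as exc:
    order_error = exc

print(kept, repr(order_error))
```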

Using Transforms in Dataset Configuration

Transforms can also be specified in the dataset configuration to be automatically applied when the dataset is instantiated.

from alp_data import DatasetConfig
from alp_data.datasets import AnimalSpeak
from alp_data.transforms import FilterConfig, LabelFromFeatureConfig

# Create transform configurations
filter_config = FilterConfig(
    type="filter",
    property="source",
    values=["xeno-canto", "iNaturalist"],
    mode="include"
)
label_config = LabelFromFeatureConfig(
    type="label_from_feature",
    feature="canonical_name",
    output_feature="label"
)

# Create dataset configuration with transforms
config = DatasetConfig(
    dataset_name="animalspeak",
    split="validation",
    transformations=[filter_config, label_config]
)

# Create dataset with transforms
dataset, metadata = AnimalSpeak.from_config(config)

print(metadata.keys())
# dict_keys(['filter', 'label_from_feature'])
print(metadata["label_from_feature"].keys())
# dict_keys(['label_feature', 'label_map', 'num_classes'])

Available Datasets

The list of available datasets will grow over time. Please refer to the next section if you wish to use your own Dataset or add a new one to the list of officially supported ones.

`AnimalSoundArchive`

📊 Dataset Information

Name	`animal-sound-archive`
Version	`0.1.0`
Owner	david
License	mostly CC-BY-NC-SA (unversioned)
Sources	Tierstimmenarchiv (Museum für Naturkunde Berlin)
Available Splits	`train`, `validation`, `all`, `train_excl_beanszero`, `validation_excl_beanszero`, `all_excl_beanszero`

Description:

Animal Sound Archive (Tierstimmenarchiv) audio dataset with taxonomic metadata. ~46k recordings of birds, mammals, insects, amphibians and other taxa from Museum für Naturkunde Berlin. Available at original (variable) sample rates, 16kHz, and 32kHz (pre-resampled). Pre-resampled audio uses librosa's kaiser_best resampling method. Train/val split: val_size=3000, random seed 42.

Animal Sound Archive (Tierstimmenarchiv) audio dataset.

Description

The Tierstimmenarchiv (Animal Sound Archive) at the Museum für Naturkunde Berlin hosts ~46k downloadable recordings covering birds, mammals, insects, amphibians, and other taxa. Recordings are linked to GBIF backbone taxonomy.

Audio is available as original MP3 files (variable sample rate) and pre-resampled WAV at 16kHz and 32kHz using librosa's kaiser_best method.

Available Metadata Fields

Taxonomic Information:
- canonical_name: Canonical species name from GBIF (primary identifier)
- species_scientific: Scientific species name
- species_common: Common/vernacular species name (enriched via GBIF, ~99%)
- genus, family, order, class, phylum, kingdom: Taxonomic hierarchy
- gbifID: GBIF (Global Biodiversity Information Facility) identifier

Audio File Paths:
- originals_path: Path to original MP3 audio relative to data root
- 32khz_path: Path to pre-resampled 32kHz WAV audio relative to data root
- 16khz_path: Path to pre-resampled 16kHz WAV audio relative to data root

Recording Metadata:
- tsa_id: Tierstimmenarchiv unique identifier
- eventDate: When the recording was made (~83%)
- eventTime: Time of recording (~45%)
- soundType: Type of sound, e.g. "song", "call" (~88%)
- soundQuality: Recording quality assessment (~45%)
- duration_seconds: Recording duration in seconds (~94%)
- sex: Sex of the recorded animal(s) (~39%)
- lifeStage: Life stage, e.g. "adult" (~40%)
- backgroundSpecies: Species audible in the background (~22%)

Location:
- latitudeDecimal, longitudeDecimal: GPS coordinates (~56%)
- locality: Geographic location (~89%)
- country: Country (~93%)
- habitat: Habitat description (~23%)

Rights & Attribution:
- recordist: Person who made the recording (~100%)
- license, media_license: License information (mostly CC-BY-NC-SA, some CC BY-NC-SA)
- url: Original archive download URL

Additional Fields:
- occurrenceRemarks: Description of the recording in German (~95%)
- occurrenceRemarks_en: Description in English (~81%)
- fieldNotes: Observer's field notes (~70%)
- weather: Weather conditions during recording (~35%)
- recordingEquipment: Equipment used (~81%)

Available Splits

train: Training set (all minus 3000 held-out samples, random split)
validation: Validation set (3000 samples, random split)
all: Complete dataset (train + validation)
train_excl_beanszero: Training set excluding taxa evaluated in BEANS-Zero benchmark
validation_excl_beanszero: Validation set excluding taxa evaluated in BEANS-Zero benchmark
all_excl_beanszero: Complete dataset excluding BEANS-Zero taxa
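The dataset description mentions val_size=3000 and random seed 42 for the train/validation split. The script that produced the split files is not shown on this page, so the following is only a sketch of one common way to build such a split (a seeded shuffle followed by a hold-out), not the procedure actually used. The function name and the integer ids are placeholders.

```python
import random


def split_train_validation(
    ids: list[int], val_size: int, seed: int
) -> tuple[list[int], list[int]]:
    """Shuffle a copy of `ids` with a fixed seed and hold out `val_size` items."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[val_size:], shuffled[:val_size]


# Placeholder ids; the archive has roughly 46k recordings.
all_ids = list(range(46_000))
train_ids, val_ids = split_train_validation(all_ids, val_size=3000, seed=42)

print(len(train_ids), len(val_ids))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.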

References

Tierstimmenarchiv: https://www.tierstimmenarchiv.de/

Examples:

>>> from alp_data.datasets import AnimalSoundArchive
>>> dataset = AnimalSoundArchive(
...     split="train",
...     output_take_and_give={"canonical_name": "species"},
...     streaming=True
... )
>>> print(dataset.info.name)
animal-sound-archive
>>> print(dataset.available_sample_rates)
[32000, 16000]

>>> dataset_32k = AnimalSoundArchive(split="train", sample_rate=32000, streaming=True)
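The way sample_rate selects between pre-resampled and original audio can be sketched without any audio libraries. The sketch below follows the decision logic of `_process` in the source listing that follows, using the same column names; the function `choose_audio_column` and the example row are hypothetical and exist only for illustration.

```python
from typing import Any, Optional

# Column names taken from the class's `_sample_rate_paths` mapping.
SAMPLE_RATE_PATHS = {32000: "32khz_path", 16000: "16khz_path"}
ORIGINALS_COLUMN = "originals_path"


def choose_audio_column(row: dict[str, Any], sample_rate: Optional[int]) -> tuple[str, bool]:
    """Return (column to read, whether on-the-fly resampling may be needed)."""
    column = SAMPLE_RATE_PATHS.get(sample_rate)
    if column is not None and row.get(column) not in (None, ""):
        # A pre-resampled file exists for this rate: read it directly.
        return column, False
    # Fall back to the original file; resampling is only considered
    # when a target rate was requested.
    return ORIGINALS_COLUMN, sample_rate is not None


# Hypothetical row: a 32kHz file exists, the 16kHz path is empty.
row = {"originals_path": "orig/a.mp3", "32khz_path": "32k/a.wav", "16khz_path": ""}

for rate in (32000, 16000, 22050, None):
    print(rate, choose_audio_column(row, rate))
```

In the class itself, resampling on the fallback path happens only when the original file's sample rate differs from the requested one.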

Source code in alp_data/datasets/animal_sound_archive.py

@register_dataset
class AnimalSoundArchive(Dataset):
    """Animal Sound Archive (Tierstimmenarchiv) audio dataset.

    Description
    -----------
    The Tierstimmenarchiv (Animal Sound Archive) at the Museum für Naturkunde
    Berlin hosts ~46k downloadable recordings covering birds, mammals, insects,
    amphibians, and other taxa. Recordings are linked to GBIF backbone taxonomy.

    Audio is available as original MP3 files (variable sample rate) and
    pre-resampled WAV at 16kHz and 32kHz using librosa's kaiser_best method.

    Available Metadata Fields
    -------------------------
    **Taxonomic Information:**
        - ``canonical_name``: Canonical species name from GBIF (primary identifier)
        - ``species_scientific``: Scientific species name
        - ``species_common``: Common/vernacular species name (enriched via GBIF, ~99%)
        - ``genus``, ``family``, ``order``, ``class``, ``phylum``, ``kingdom``: Taxonomic hierarchy
        - ``gbifID``: GBIF (Global Biodiversity Information Facility) identifier

    **Audio File Paths:**
        - ``originals_path``: Path to original MP3 audio relative to data root
        - ``32khz_path``: Path to pre-resampled 32kHz WAV audio relative to data root
        - ``16khz_path``: Path to pre-resampled 16kHz WAV audio relative to data root

    **Recording Metadata:**
        - ``tsa_id``: Tierstimmenarchiv unique identifier
        - ``eventDate``: When the recording was made (~83%)
        - ``eventTime``: Time of recording (~45%)
        - ``soundType``: Type of sound, e.g. "song", "call" (~88%)
        - ``soundQuality``: Recording quality assessment (~45%)
        - ``duration_seconds``: Recording duration in seconds (~94%)
        - ``sex``: Sex of the recorded animal(s) (~39%)
        - ``lifeStage``: Life stage, e.g. "adult" (~40%)
        - ``backgroundSpecies``: Species audible in the background (~22%)

    **Location:**
        - ``latitudeDecimal``, ``longitudeDecimal``: GPS coordinates (~56%)
        - ``locality``: Geographic location (~89%)
        - ``country``: Country (~93%)
        - ``habitat``: Habitat description (~23%)

    **Rights & Attribution:**
        - ``recordist``: Person who made the recording (~100%)
        - ``license``, ``media_license``: License information (mostly CC-BY-NC-SA, some CC BY-NC-SA)
        - ``url``: Original archive download URL

    **Additional Fields:**
        - ``occurrenceRemarks``: Description of the recording in German (~95%)
        - ``occurrenceRemarks_en``: Description in English (~81%)
        - ``fieldNotes``: Observer's field notes (~70%)
        - ``weather``: Weather conditions during recording (~35%)
        - ``recordingEquipment``: Equipment used (~81%)

    Available Splits
    ----------------
    - ``train``: Training set (all minus 3000 held-out samples, random split)
    - ``validation``: Validation set (3000 samples, random split)
    - ``all``: Complete dataset (train + validation)
    - ``train_excl_beanszero``: Training set excluding taxa evaluated in BEANS-Zero benchmark
    - ``validation_excl_beanszero``: Validation set excluding taxa evaluated in BEANS-Zero benchmark
    - ``all_excl_beanszero``: Complete dataset excluding BEANS-Zero taxa

    References
    ----------
    Tierstimmenarchiv: https://www.tierstimmenarchiv.de/

    Examples
    --------
    >>> from alp_data.datasets import AnimalSoundArchive
    >>> dataset = AnimalSoundArchive(
    ...     split="train",
    ...     output_take_and_give={"canonical_name": "species"},
    ...     streaming=True
    ... )
    >>> print(dataset.info.name)
    animal-sound-archive
    >>> print(dataset.available_sample_rates)
    [32000, 16000]

    >>> dataset_32k = AnimalSoundArchive(split="train", sample_rate=32000, streaming=True)
    """

    info = DatasetInfo(
        name="animal-sound-archive",
        owner="david",
        split_paths={
            "train": f"{_RAW_ROOT}/train_v2.csv",
            "validation": f"{_RAW_ROOT}/val_v2.csv",
            "all": f"{_RAW_ROOT}/all_v2.csv",
            "train_excl_beanszero": f"{_RAW_ROOT}/train_unseen_v2.csv",
            "validation_excl_beanszero": f"{_RAW_ROOT}/val_unseen_v2.csv",
            "all_excl_beanszero": f"{_RAW_ROOT}/all_unseen_v2.csv",
        },
        version="0.1.0",
        description="Animal Sound Archive (Tierstimmenarchiv) audio dataset with "
        "taxonomic metadata. ~46k recordings of birds, mammals, insects, amphibians "
        "and other taxa from Museum für Naturkunde Berlin. "
        "Available at original (variable) sample rates, 16kHz, and 32kHz (pre-resampled). "
        "Pre-resampled audio uses librosa's kaiser_best resampling method. "
        "Train/val split: val_size=3000, random seed 42.",
        sources=["Tierstimmenarchiv (Museum für Naturkunde Berlin)"],
        license="mostly CC-BY-NC-SA (unversioned)",
    )

    _sample_rate_paths = {
        32000: "32khz_path",
        16000: "16khz_path",
    }

    _originals_path_column = "originals_path"

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the Animal Sound Archive dataset.

        Parameters
        ----------
        split : str, default="train"
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str], optional
            A dictionary mapping the original column names to the new column names.
        sample_rate : int, optional
            The sample rate to which audio files should be resampled. If the requested
            sample rate is available as pre-resampled audio (see ``available_sample_rates``),
            the pre-resampled version will be loaded directly. Otherwise, audio will be
            resampled on-the-fly from the original files (at variable sample rates) using
            librosa's kaiser_best method. If None, audio is returned at its original
            (variable) sample rate.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. All path columns in the CSV are
            relative to this root. If None, defaults to the GCS bucket path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self._load()
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(f"{_RAW_ROOT}/")
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return the available pre-resampled sample rates.

        Returns
        -------
        list[int]
            List of sample rates (in Hz) for which pre-resampled audio is available.
            Audio at these sample rates can be loaded directly without on-the-fly resampling.
            This checks which path columns actually exist in the loaded data.
        """
        available = []
        for sr, path_column in self._sample_rate_paths.items():
            if path_column in self._data.columns:
                available.append(sr)
        return available

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["AnimalSoundArchive", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters.

        Returns
        -------
        tuple[AnimalSoundArchive, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        NotImplementedError
            If the dataset is in streaming mode.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                audio_path = self.data_root / row[path_column]
                use_presampled = True

        if use_presampled:
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")
        else:
            audio_path = self.data_root / row[self._originals_path_column]
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")

            if self.sample_rate is not None and sample_rate != self.sample_rate:
                audio = librosa.resample(
                    y=audio,
                    orig_sr=sample_rate,
                    target_sr=self.sample_rate,
                    scale=True,
                    res_type="kaiser_best",
                )
                sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version})"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`AnimalSpeak`

📊 Dataset Information

Name	`animalspeak`
Version	`0.1.0`
Owner	david; marius; masato
License	CC BY
Sources	Xeno-canto, iNaturalist, Watkins
Available Splits	`train`, `validation`

Description:

AnimalSpeak dataset

AnimalSpeak dataset.

Description

A part of NatureLM training and BioLingual, AnimalSpeak has over a million audio-caption pairs holding information on species, vocalization context, and animal behavior.

References

TRANSFERABLE MODELS FOR BIOACOUSTICS WITH HUMAN LANGUAGE SUPERVISION Robinson et al 2023 https://arxiv.org/pdf/2308.04978

Examples:

>>> from alp_data.datasets import AnimalSpeak
>>> dataset = AnimalSpeak(
...     split="validation",
...     output_take_and_give={"species_common": "comm"}
... )
>>> print(dataset.info.name)
animalspeak

Source code in alp_data/datasets/animalspeak.py

@register_dataset
class AnimalSpeak(Dataset):
    """AnimalSpeak dataset.

    Description
    -----------
    A part of NatureLM training and BioLingual, AnimalSpeak,
    has over a million audio-caption pairs holding information on
    species, vocalization context, and animal behavior.

    References
    ----------
    TRANSFERABLE MODELS FOR BIOACOUSTICS WITH HUMAN LANGUAGE SUPERVISION
    Robinson et al 2023
    https://arxiv.org/pdf/2308.04978

    Examples
    --------
    >>> from alp_data.datasets import AnimalSpeak
    >>> dataset = AnimalSpeak(
    ...     split="validation",
    ...     output_take_and_give={"species_common": "comm"}
    ... )
    >>> print(dataset.info.name)
    animalspeak
    """

    info = DatasetInfo(
        name="animalspeak",
        owner="david; marius; masato",
        split_paths={
            "train": f"{DATA_HOME}/animalspeak/v0.1.0/raw/16KHz/train_v2.csv",
            "validation": f"{DATA_HOME}/animalspeak/v0.1.0/raw/16KHz/validation_v2.csv",
        },
        version="0.1.0",
        description="AnimalSpeak dataset",
        sources=["Xeno-canto", "iNaturalist", "Watkins"],
        license="CC BY",
    )

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the AnimalSpeak dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is optionally appended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = f"{DATA_HOME}/animalspeak/v0.1.0/raw/16KHz/"
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]

        # TODO: Polars needs a lot of rows to figure out types correctly
        # which is why we set infer_schema_length here to 10,000
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            infer_schema_length=10_000,
            keep_default_na=False,
            na_values=[""],
        )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["AnimalSpeak", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        NotImplementedError
            If the dataset is in streaming mode.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        relative_path = row["audio_path"]

        audio_path = anypath(self.data_root) / relative_path

        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # AnimalSpeak likes to call this 'raw_wav'
        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the audio data, text label, label, and path.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            current split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split: {self.split}"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`AnuraSetStrong`

📊 Dataset Information

Name	`anuraset_strong`
Version	`0.1.0`
Owner	benjamin
License	CC BY 1.0
Sources	Zenodo
Available Splits	`all`

Description:

AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring by Canas et al. (2023): We introduce a large-scale multi-species dataset of anuran amphibians calls recorded by PAM, that comprises 27 hours of expert annotations for 42 different species from two Brazilian biomes.

AnuraSetStrong Dataset

Description

This is the strongly labeled portion of AnuraSet, i.e. the portion with start- and stop-times annotated.

Description from "AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring" by Canas et al. (2023)

"We introduce a large-scale multi-species dataset of anuran amphibians calls recorded by PAM, that comprises 27 hours of expert annotations for 42 different species from two Brazilian biomes.

To provide precise annotations, we identified bouts of advertisement calls within each audio file and generated strong labels for them (step 1). Using Audacity 3.2 software, we conducted a detailed visual and aural inspection of the spectrogram to identify temporal limits (beginning and end) of audio segments containing species-specific calls with an inter-call interval of less than 1 second. These annotations ensured fine-scale specificity (Figure 3). For longer intervals, we split the calls into different time boxes and labeled them independently. Detailed labels assigned to time boxes were composed of (i) the species ID, tagged with a unique 6-letter code built from the scientific name of each identified species (Table 2), and (ii) the perceived quality of the recorded signal, included as a single letter indicating a Low ('L'), Medium ('M'), or High ('H') quality (Figure 4). To ensure consistency among the perceptual quality labels, we set up the following criteria: A high-quality call has a high signal-to-noise ratio, no overlap with other sounds, has a well-identifiable structure on the spectrogram, and can be easily visualized on the oscillogram. A medium-quality call can be visually identified on the spectrogram but may overlap with other sounds that can be difficult to identify in the oscillogram. A low-quality call shows a low signal-to-noise ratio, is partially masked by other sounds, appears with low intensity on the spectrogram, and cannot be easily identified on the oscillogram. This information was used to increase the usability of the data and improve the error analysis of the learning model."

Note that we omitted the quality assessments.

Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
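The selection table is stored as tab-separated text, and events that begin after the end of the audio are dropped when a sample is processed. The sketch below illustrates that filtering step with the standard library only. The `Begin Time (s)` and `Species` columns appear in the source; the `End Time (s)` column, the species codes, and the helper name `parse_selection_table` are assumptions made for illustration.

```python
import csv
from io import StringIO


def parse_selection_table(table_text: str, audio_duration_s: float) -> list[dict[str, str]]:
    """Parse a tab-separated selection table and keep events that begin before the audio ends."""
    reader = csv.DictReader(StringIO(table_text), delimiter="\t")
    return [event for event in reader if float(event["Begin Time (s)"]) < audio_duration_s]


# Placeholder table: the species codes below are illustrative, not taken from the dataset.
example_table = (
    "Begin Time (s)\tEnd Time (s)\tSpecies\n"
    "0.5\t1.2\tSPECAA\n"
    "3.0\t4.1\tSPECBB\n"
    "61.0\t62.0\tSPECAA\n"
)

events = parse_selection_table(example_table, audio_duration_s=60.0)
print([event["Species"] for event in events])
```

The last event starts at 61 s, after the assumed 60 s of audio, so only the first two events are kept.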

Pre-resampled Audio

Pre-resampled audio is available at 16 kHz and 32 kHz. When sample_rate matches one of these rates, the pre-resampled files are loaded directly (no on-the-fly resampling). For any other target rate, audio is resampled on-the-fly using librosa's kaiser_best method.
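The choice between pre-resampled and original files can be sketched as a small path-resolution function. The column names (`16khz_path`, `32khz_path`, `audio_path`) come from the class source; the function name, the example row, and the `/data/anuraset` root are hypothetical, and the real class uses its own path helper rather than `pathlib`.

```python
from pathlib import PurePosixPath
from typing import Optional

# Column names as they appear in the class source.
SAMPLE_RATE_PATHS = {16000: "16khz_path", 32000: "32khz_path"}
ORIGINALS_COLUMN = "audio_path"


def resolve_audio_path(
    row: dict[str, str], data_root: str, sample_rate: Optional[int]
) -> tuple[PurePosixPath, bool]:
    """Return (path, is_presampled) for the requested sample rate."""
    column = SAMPLE_RATE_PATHS.get(sample_rate)
    # Use the pre-resampled file only if the column exists and is non-empty.
    if column is not None and row.get(column):
        return PurePosixPath(data_root) / row[column], True
    # Otherwise fall back to the original file, to be resampled on the fly if needed.
    return PurePosixPath(data_root) / row[ORIGINALS_COLUMN], False


# Hypothetical row: the 32 kHz column is empty, so that rate falls back to the original.
example_row = {"audio_path": "orig/a.wav", "16khz_path": "16k/a.wav", "32khz_path": ""}

for rate in (16000, 32000, 22050, None):
    print(rate, resolve_audio_path(example_row, "/data/anuraset", rate))
```

Only the 16000 request resolves to a pre-resampled file here; every other request returns the original path with the flag set to `False`.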

References

https://arxiv.org/pdf/2307.06860

Source code in alp_data/datasets/anuraset.py

@register_dataset
class AnuraSetStrong(Dataset):
    """AnuraSetStrong Dataset

    Description
    -----------
    This is the strongly labeled portion of AnuraSet, i.e. the portion with
    start- and stop-times annotated.

    Description from "AnuraSet: A dataset for benchmarking Neotropical anuran
    calls identification in passive acoustic monitoring" by Canas et al. (2023)

    "We introduce a large-scale multi-species dataset of anuran amphibians
    calls recorded by PAM, that comprises 27 hours of expert annotations
    for 42 different species from two Brazilian biomes.

    To provide precise annotations, we identified bouts of advertisement
    calls within each audio file and generated strong labels for them (step 1).
    Using Audacity 3.2 software, we conducted a detailed visual and aural
    inspection of the spectrogram to identify temporal limits (beginning and end)
    of audio segments containing species-specific calls with an inter-call interval
    of less than 1 second. These annotations ensured fine-scale specificity (Figure 3).
    For longer intervals, we split the calls into different time boxes and labeled
    them independently. Detailed labels assigned to time boxes were composed of (i)
    the species ID, tagged with a unique 6-letter code built from the scientific
    name of each identified species (Table 2), and (ii) the perceived quality of the
    recorded signal, included as a single letter indicating a Low ('L'), Medium ('M'),
    or High ('H') quality (Figure 4). To ensure consistency among the perceptual quality
    labels, we set up the following criteria: A high-quality call has a high signal-to-noise
    ratio, no overlap with other sounds, has a well-identifiable structure on the spectrogram,
    and can be easily visualized on the oscillogram. A medium-quality call can be
    visually identified on the spectrogram but may overlap with other sounds that can be
    difficult to identify in the oscillogram. A low-quality call shows a low signal-to-noise
    ratio, is partially masked by other sounds, appears with low intensity on the spectrogram,
    and cannot be easily identified on the oscillogram. This information was used to increase
    the usability of the data and improve the error analysis of the learning model."

    Note that we omitted the quality assessments.

    Each entry consists of:
    - an audio recording
    - a selection table (Raven format), with Species labels

    Pre-resampled Audio
    -------------------
    Pre-resampled audio is available at 16 kHz and 32 kHz. When
    ``sample_rate`` matches one of these rates, the pre-resampled files are
    loaded directly (no on-the-fly resampling). For any other target rate,
    audio is resampled on-the-fly using librosa's ``kaiser_best`` method.

    References
    ----------
    https://arxiv.org/pdf/2307.06860

    """

    info = DatasetInfo(
        name="anuraset_strong",
        owner="benjamin",
        split_paths={
            "all": f"{DATA_HOME}/anuraset/anuraset_all_gbif_v3.csv",
        },
        version="0.1.0",
        description="AnuraSet: A dataset for benchmarking Neotropical anuran "
        "calls identification in passive acoustic monitoring by Canas et al. (2023): "
        "We introduce a large-scale multi-species dataset of anuran amphibians "
        "calls recorded by PAM, that comprises 27 hours of expert annotations "
        "for 42 different species from two Brazilian biomes.",
        sources="Zenodo",
        license="CC BY 1.0",
    )

    _sample_rate_paths: dict[int, str] = {16000: "16khz_path", 32000: "32khz_path"}
    _originals_path_column = "audio_path"

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.annotation_columns = ["Species"]

        self.sample_rate = sample_rate
        self._data = None

        self._load()

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        return self._data.columns

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return pre-resampled sample rates whose path columns exist in the data."""
        return [sr for sr, col in self._sample_rate_paths.items() if col in self._data.columns]

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                audio_path = anypath(self.data_root) / row[path_column]
                use_presampled = True

        if not use_presampled:
            audio_path = anypath(self.data_root) / row[self._originals_path_column]

        audio, sr = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        if not use_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
        audio_dur = len(audio) / float(sr)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        row["audio"] = audio
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["AnuraSetStrong", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
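        """Return all labels present in a given annotation column.

        Parameters
        ----------
        anno_column : str
            Name of the selection-table column to collect labels from.

        Returns
        -------
        list[str]
            Sorted list of the unique labels found in anno_column.
        """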
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())

        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`ArcticBirdSounds`

📊 Dataset Information

Name	`arctic_bird_sounds`
Version	`0.1.0`
Owner	benjamin
License	CC-BY-4.0
Sources	OSF
Available Splits	`all`

ArcticBirdSounds Dataset

Description

Recordings of birds in the arctic. Bird vocalizations are boxed (start and stop, high and low freq) and labeled with species. Description from the original publication:

"Tracking biodiversity shifts is central to understanding past, present, and future global changes. Recent advances in bioacoustics and the low cost of high-quality automatic recorders are revolutionizing studies in biogeography and community and behavioral ecology with a robust assessment of phenology, species occurrence, and individual activity. This large volume of acoustic recordings has recently generated a plethora of datasets that can now be handled automatically, mostly via big data methods such as deep learning. These approaches need high-quality annotations to classify and detect recorded sounds efficiently. However, very few strongly annotated datasets—that is, with detailed information on start and end time of each vocalization—are openly accessible to the public. Moreover, these datasets mostly cover temperate species and are usually limited to a single year of recordings. Here, we present ArcticBirdSounds, the first open-access, multisite, and multiyear strongly annotated dataset of arctic bird vocalizations. ArcticBirdSounds offers 20 h of annotated recordings over 2 years (2018, 2019), taken from 15 distinct plots within six locations across the Arctic, from Alaska to Greenland. Recordings cover the arctic vertebrates' breeding period and are evenly spaced during the day; they capture most species breeding there with 12,933 temporal annotations in 49 classes of sounds. While these data can be used for many pressing ecological questions, it is also a unique resource for methodological development to help meet the challenges of fast ecosystem transformations such as those happening in the Arctic. All data, including audio files, annotation files, and companion spreadsheets, are available in an Open Science Framework repository published under a CC BY 4.0 License."

Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels

Note that some species labels are unknown, and labeled as "Unknown"
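Because of the "Unknown" label, downstream code usually wants the set of known labels separately from the information that unknown events exist. The sketch below mirrors the idea behind `get_available_labels` using only the standard library. The function name `collect_labels`, the returned flag, and the species names in the example tables are illustrative assumptions, not part of the dataset class.

```python
import csv
from io import StringIO

UNKNOWN_LABEL = "Unknown"


def collect_labels(
    selection_tables: list[str],
    anno_column: str = "Species",
    unknown_label: str = UNKNOWN_LABEL,
) -> tuple[list[str], bool]:
    """Return (sorted known labels, whether any unknown-labeled event was seen)."""
    labels: set[str] = set()
    for table_text in selection_tables:
        for event in csv.DictReader(StringIO(table_text), delimiter="\t"):
            labels.add(event[anno_column])
    had_unknown = unknown_label in labels
    labels.discard(unknown_label)
    return sorted(labels), had_unknown


# Placeholder tables: the species names are illustrative only.
tables = [
    "Begin Time (s)\tSpecies\n0.5\tSnow Bunting\n2.0\tUnknown\n",
    "Begin Time (s)\tSpecies\n1.0\tLapland Longspur\n",
]

known, had_unknown = collect_labels(tables)
print(known, had_unknown)
```

Returning the flag alongside the labels lets the caller decide whether to warn, filter, or keep the unknown events.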

References

https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.4047
https://osf.io/b9trx/overview

Source code in alp_data/datasets/arctic_bird_sounds.py

@register_dataset
class ArcticBirdSounds(Dataset):
    """ArcticBirdSounds Dataset

    Description
    -----------

    Recordings of birds in the arctic. Bird vocalizations are boxed (start and
    stop, high and low freq) and labeled with species. Description from the
    original publication:

    "Tracking biodiversity shifts is central to understanding past, present,
    and future global changes. Recent advances in bioacoustics and the low cost
    of high-quality automatic recorders are revolutionizing studies in
    biogeography and community and behavioral ecology with a robust assessment
    of phenology, species occurrence, and individual activity. This large
    volume of acoustic recordings has recently generated a plethora of datasets
    that can now be handled automatically, mostly via big data methods such as
    deep learning. These approaches need high-quality annotations to classify
    and detect recorded sounds efficiently. However, very few strongly
    annotated datasets—that is, with detailed information on start and end time
    of each vocalization—are openly accessible to the public. Moreover, these
    datasets mostly cover temperate species and are usually limited to a single
    year of recordings. Here, we present ArcticBirdSounds, the first
    open-access, multisite, and multiyear strongly annotated dataset of arctic bird
    vocalizations. ArcticBirdSounds offers 20 h of annotated recordings over 2
    years (2018, 2019), taken from 15 distinct plots within six locations
    across the Arctic, from Alaska to Greenland. Recordings cover the arctic
    vertebrates' breeding period and are evenly spaced during the day; they
    capture most species breeding there with 12,933 temporal annotations in
    49 classes of sounds. While these data can be used for many pressing
    ecological questions, it is also a unique resource for methodological
    development to help meet the challenges of fast ecosystem transformations
    such as those happening in the Arctic. All data, including audio files,
    annotation files, and companion spreadsheets, are available in an Open
    Science Framework repository published under a CC BY 4.0 License."

    Each entry consists of:
    - an audio recording
    - a selection table (Raven format), with Species labels

    Note that some species labels are unknown, and labeled as "Unknown"

    References
    ----------
    https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.4047
    https://osf.io/b9trx/overview

    """

    info = DatasetInfo(
        name="arctic_bird_sounds",
        owner="benjamin",
        split_paths={
            "all": f"{DATA_HOME}/arctic_bird_sounds/all.csv",
        },
        version="0.1.0",
        description="[MISSING]",
        sources="OSF",
        license="CC-BY-4.0",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "pandas",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "pandas"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame | None = None
        self.annotation_columns = ["Species"]
        self.unknown_label = "Unknown"
        self.sample_rate = sample_rate

        # Load split CSV
        self._load()

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            # Wrap in anypath so that `self.data_root / ...` also works for str inputs
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
        )

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Resolve audio path
        audio_path = self.data_root / row["audio_path"]

        # Read audio
        audio, sample_rate = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        # Resample if necessary
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # Selection table
        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")

        # Clip events outside audio (keep only events that begin before audio end)
        audio_dur = len(audio) / float(sample_rate)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the audio, sample rate, and selection table
            (or only the keys selected by output_take_and_give).
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["ArcticBirdSounds", dict[str, Any]]:
        """Create a dataset instance from a DatasetConfig.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration object containing the dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta

        return ds, {}

    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """
        Return all possible labels for a given annotation column

        Returns
        ---------
        A list of all the available labels for anno_column
        """
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        if self.unknown_label in available_labels:
            available_labels.remove(self.unknown_label)
            # Warn only when unknown-labeled events were actually found
            warnings.warn(
                f"Events with unknown label={self.unknown_label} exist in dataset "
                f"but {self.unknown_label} suppressed from get_available_labels output",
                stacklevel=2,
            )

        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`AudioSet`

📊 Dataset Information

Name	`audioset`
Version	`0.1.0`
Owner	david; marius; masato
License	CC BY 4.0
Sources	YouTube
Available Splits

Description:

AudioSet dataset

AudioSet dataset.

Description

AudioSet is a large-scale dataset of manually annotated audio events that endeavors to bridge the gap in data availability between image and audio research. It uses a carefully structured hierarchical ontology of 632 audio classes to label 10-second segments of YouTube videos.

References

AUDIO SET: AN ONTOLOGY AND HUMAN-LABELED DATASET FOR AUDIO EVENTS Gemmeke et al 2017 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45857.pdf

The train and validation splits (balanced and unbalanced) correspond to the official ones in the paper (https://research.google.com/audioset/download.html). The train-animal, train-noise, validation-animal, and validation-noise splits are created for animal and non-animal (noise) classes in the ontology.

The "caption" column contains the caption from AudioSetCaps when available. AudioSetCaps Paper: https://arxiv.org/abs/2411.18953 AudioSetCaps Dataset: https://huggingface.co/datasets/baijs/AudioSetCaps Note these are empty with the exception of the unbalanced_train split of the V1 dataset.

Note that AudioSet contains different files depending on YouTube video availability at time of download. Version 0.1.0 contains a dump of AudioSet pulled in 2021 and resampled to 16khz. Version 0.2.0 contains a larger set of audios pulled from this HuggingFace release https://huggingface.co/datasets/agkphysics/AudioSet and maintaining the sample rates of the original files.
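Since the two versions expose different splits, a split name has to be resolved against the chosen version. The sketch below shows one way such a lookup could work, with a fallback to the default version. The split file names and the default version are taken from the class source; the root directories are hypothetical placeholders (the real `_V010_ROOT` and `_V020_ROOT` values are not shown), and `resolve_split_path` is an illustrative helper, not the class's actual method.

```python
from typing import Optional

# Hypothetical storage roots; the real values are not shown in the source.
V010_ROOT = "/example/audioset_v010"
V020_ROOT = "/example/audioset_v020"

VERSIONS = {
    "0.1.0": {
        "split_paths": {
            "train": f"{V010_ROOT}/csv-data/unbalanced_train_segments_processed.csv",
            "train-balanced": f"{V010_ROOT}/csv-data/balanced_train_segments_processed.csv",
            "validation": f"{V010_ROOT}/csv-data/eval_segments_processed.csv",
        },
    },
    "0.2.0": {
        "split_paths": {
            "train": f"{V020_ROOT}/csv-data/unbalanced_train_segments_processed.csv",
            "validation": f"{V020_ROOT}/csv-data/eval_segments_processed.csv",
            "train-environmental": f"{V020_ROOT}/csv-data/unbalanced_train_environmental_sounds.csv",
        },
    },
}
DEFAULT_VERSION = "0.1.0"


def resolve_split_path(split: str, version: Optional[str] = None) -> str:
    """Look up the CSV path for a split, falling back to the default version."""
    version = version or DEFAULT_VERSION
    if version not in VERSIONS:
        raise LookupError(f"Invalid version: {version}. Expected one of {list(VERSIONS)}")
    split_paths = VERSIONS[version]["split_paths"]
    if split not in split_paths:
        raise LookupError(f"Invalid split: {split}. Expected one of {list(split_paths)}")
    return split_paths[split]


print(resolve_split_path("validation"))
print(resolve_split_path("train-environmental", version="0.2.0"))
```

Under these assumptions, asking for `train-balanced` with version `0.2.0` raises a `LookupError`, because that split is only registered for `0.1.0`.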

Pre-resampled Audio

Version 0.2.0 includes pre-resampled 32kHz audio that can be loaded directly without on-the-fly resampling for faster data loading:

Load with pre-resampled 32kHz audio (v0.2.0, no resampling needed):

>>> dataset_32k = AudioSet(split="validation", version="0.2.0", sample_rate=32000,
... streaming=True)
>>> print(dataset_32k.available_sample_rates)
[32000]

Load with on-the-fly resampling to 16kHz:

>>> dataset_16k = AudioSet(split="validation", version="0.2.0", sample_rate=16000,
... streaming=True)

Examples:

>>> from alp_data.datasets import AudioSet
>>> dataset = AudioSet(
...     split="train",
...     output_take_and_give={"label": "audio_label"},
...     version="0.1.0",
...     streaming=True
... )
>>> print(dataset.info.name)
audioset
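The `output_take_and_give` argument used in this example both renames and filters the keys of each sample: only the keys listed in the mapping survive, under their new names. A minimal sketch of that behavior, based on the loop in the datasets' `_process` methods, is shown here. The function name `apply_take_and_give` and the example row values are hypothetical.

```python
from typing import Any, Optional


def apply_take_and_give(
    row: dict[str, Any], mapping: Optional[dict[str, str]]
) -> dict[str, Any]:
    """Keep only the keys named in `mapping`, renaming each old key to its new key."""
    if not mapping:
        # No mapping configured: the row is returned unchanged.
        return row
    return {new_key: row[old_key] for old_key, new_key in mapping.items()}


# Hypothetical row; the values are placeholders.
row = {"label": "Dog", "caption": "", "local_path": "clips/abc.wav"}

item = apply_take_and_give(row, {"label": "audio_label"})
print(item)  # {'audio_label': 'Dog'}
```

A key named in the mapping but missing from the row raises a `KeyError`, so the mapping should only reference columns that the chosen dataset provides.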

Source code in alp_data/datasets/audioset.py

@register_dataset
class AudioSet(Dataset):
    """AudioSet dataset.

    Description
    -----------
    AudioSet is a large-scale dataset of manually annotated audio events that endeavors
    to bridge the gap in data availability between image and audio research.
    It uses a carefully structured hierarchical ontology of 632 audio classes
    to label 10-second segments of YouTube videos.

    References
    ----------
    AUDIO SET: AN ONTOLOGY AND HUMAN-LABELED DATASET FOR AUDIO EVENTS
    Gemmeke et al 2017
    https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45857.pdf

    The train and validation splits (balanced and unbalanced)
    correspond to the official ones in the paper (https://research.google.com/audioset/download.html).
    The train-animal, train-noise, validation-animal, and validation-noise splits
    are created for animal and non-animal (noise) classes in the ontology.

    The "caption" column contains the caption from AudioSetCaps when available.
    AudioSetCaps Paper: https://arxiv.org/abs/2411.18953
    AudioSetCaps Dataset: https://huggingface.co/datasets/baijs/AudioSetCaps
    Note these are empty with the exception of the unbalanced_train split of the V1 dataset.

    Note that AudioSet contains different files depending on YouTube video availability at
    time of download. Version 0.1.0 contains a dump of AudioSet pulled in 2021 and resampled
    to 16khz. Version 0.2.0 contains a larger set of audios pulled from this HuggingFace
    release https://huggingface.co/datasets/agkphysics/AudioSet and maintaining the sample
    rates of the original files.

    Pre-resampled Audio
    -------------------
    Version 0.2.0 includes pre-resampled 32kHz audio that can be loaded directly
    without on-the-fly resampling for faster data loading:

    Load with pre-resampled 32kHz audio (v0.2.0, no resampling needed)
    >>> dataset_32k = AudioSet(split="validation", version="0.2.0", sample_rate=32000,
    ... streaming=True)
    >>> print(dataset_32k.available_sample_rates)
    [32000]

    Load with on-the-fly resampling to 16kHz
    >>> dataset_16k = AudioSet(split="validation", version="0.2.0", sample_rate=16000,
    ... streaming=True)

    Examples
    --------
    >>> from alp_data.datasets import AudioSet
    >>> dataset = AudioSet(
    ...     split="train",
    ...     output_take_and_give={"label": "audio_label"},
    ...     version="0.1.0",
    ...     streaming=True
    ... )
    >>> print(dataset.info.name)
    audioset
    """

    # Version registry with version-specific configurations
    VERSIONS = {
        "0.1.0": {
            "split_paths": {
                "train": f"{_V010_ROOT}/csv-data/unbalanced_train_segments_processed.csv",
                "train-balanced": f"{_V010_ROOT}/csv-data/balanced_train_segments_processed.csv",
                "validation": f"{_V010_ROOT}/csv-data/eval_segments_processed.csv",
            },
            "data_root": f"{_V010_ROOT}/",
        },
        "0.2.0": {
            "split_paths": {
                "train": f"{_V020_ROOT}/csv-data/unbalanced_train_segments_processed.csv",
                "validation": f"{_V020_ROOT}/csv-data/eval_segments_processed.csv",
                "train-environmental": f"{_V020_ROOT}/csv-data/unbalanced_train_environmental_sounds.csv",  # noqa: E501
            },
            "data_root": f"{_V020_ROOT}/",
        },
    }

    # Default version (keep as 0.1.0 if we want backward compatibility)
    DEFAULT_VERSION = "0.1.0"

    info = DatasetInfo(
        name="audioset",
        owner="david; marius; masato",
        split_paths={},  # Will be populated based on version
        version="0.1.0",  # Default version
        description="AudioSet dataset",
        sources=["YouTube"],
        license="CC BY 4.0",
    )

    # Mapping of sample rates to their corresponding path columns
    # Pre-resampled audio is available for v0.2.0 only
    _sample_rate_paths = {
        32000: "32khz_path",  # Pre-resampled to 32kHz (v0.2.0 only)
    }

    # Column name for original variable-rate audio files
    _originals_path_column = "local_path"

    def __init__(
        self,
        split: str = "train",
        version: str | None = None,
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the AudioSet dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        version : str, optional
            The version of the dataset to use. If None, uses DEFAULT_VERSION.
            Available versions: "0.1.0", "0.2.0"
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int, optional
            The sample rate to which audio files should be resampled. For v0.2.0, if
            sample_rate=32000, pre-resampled audio will be loaded directly (faster).
            Otherwise, audio will be resampled on-the-fly from the original files using
            librosa's kaiser_best method. If None, audio is returned at its original
            sample rate.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the path
            item of each sample in the dataset.
            If None, uses the default data_root for the specified version.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False

        Raises
        ------
        ValueError
            If the specified version is not available.
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        # Copy class-level DatasetInfo to avoid cross-instance mutation (versions/splits)
        self.info = self.info.model_copy(deep=True)

        # Handle version selection
        if version is None:
            version = self.DEFAULT_VERSION

        if version not in self.VERSIONS:
            raise ValueError(
                f"Version '{version}' is not available. "
                f"Available versions: {list(self.VERSIONS.keys())}"
            )

        self.version = version
        self.version_config = self.VERSIONS[version]

        # Update info with version-specific split paths
        self.info.split_paths = self.version_config["split_paths"]
        self.info.version = version

        self.split = split
        self._data = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.version_config["data_root"])
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return the available pre-resampled sample rates.

        Returns
        -------
        list[int]
            List of sample rates (in Hz) for which pre-resampled audio is available.
            Audio at these sample rates can be loaded directly without on-the-fly resampling.
            This checks which path columns actually exist in the loaded data.
            Note: Pre-resampled audio is only available for v0.2.0.
        """
        available = []
        for sr, path_column in self._sample_rate_paths.items():
            # Check if the path column exists in the loaded data
            if path_column in self._data.columns:
                available.append(sr)
        return available

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]

        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(cls, dataset_config: AudioSetConfig) -> tuple["AudioSet", dict[str, Any]]:
        """Create a Dataset instance from a configuration object.

        Parameters
        ----------
        dataset_config : AudioSetConfig
            Configuration object containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            version=cfg.get("version"),
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.
        """
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            Processed row with audio loaded and labels parsed.
        """
        # Parse JSON-encoded labels if present
        if "labels" in row:
            v = row["labels"]
            if v is None or v == "" or (isinstance(v, float) and np.isnan(v)):
                row["labels"] = []
            elif isinstance(v, str):
                # Labels are stored as JSON arrays of strings
                row["labels"] = json.loads(v)

        # Determine which path column to use based on requested sample rate
        # If a pre-resampled version is available, use it; otherwise resample on-the-fly
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if (
                path_column in row
                and row[path_column] not in (None, "")
                and not (isinstance(row[path_column], float) and np.isnan(row[path_column]))
            ):
                audio_path = anypath(self.data_root) / str(row[path_column])
                use_presampled = True

        if use_presampled:
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")
        else:
            # Resample on-the-fly from original variable-rate audio
            audio_path = anypath(self.data_root) / str(row[self._originals_path_column])
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")

            if self.sample_rate is not None and sample_rate != self.sample_rate:
                audio = librosa.resample(
                    y=audio,
                    orig_sr=sample_rate,
                    target_sr=self.sample_rate,
                    scale=True,
                    res_type="kaiser_best",
                )
                sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item: dict[str, Any] = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            The processed sample: the row's columns plus ``audio`` and
            ``sample_rate``, or only the keys selected by
            ``output_take_and_give`` when it is set.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            and basic statistics if data is loaded.
        """
        base_info = f"{self.info.name} (v{self.version})"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available versions: {', '.join(self.VERSIONS.keys())}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
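The path selection performed in `_process` above can be summarized with a minimal, standalone sketch. The helper names `choose_path_column` and `_is_missing`, as well as the sample rows and file paths, are hypothetical and exist only for illustration. The sketch decides which column to read from and leaves out audio reading and resampling.

```python
from __future__ import annotations

import math
from typing import Any

# Values copied from the class attributes in the listing above.
SAMPLE_RATE_PATHS = {32000: "32khz_path"}
ORIGINALS_PATH_COLUMN = "local_path"


def _is_missing(value: Any) -> bool:
    """Treat None, empty strings, and NaN floats as missing paths."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def choose_path_column(row: dict[str, Any], sample_rate: int | None) -> tuple[str, bool]:
    """Return (column name, whether pre-resampled audio is used).

    Hypothetical helper: it only picks the column, it does not load audio.
    """
    column = SAMPLE_RATE_PATHS.get(sample_rate) if sample_rate is not None else None
    if column is not None and column in row and not _is_missing(row[column]):
        return column, True
    # No usable pre-resampled file: read the original and resample if needed.
    return ORIGINALS_PATH_COLUMN, False


# Hypothetical rows: a v0.2.0-style row has both columns, a v0.1.0-style row does not.
v020_row = {"local_path": "audio/a.wav", "32khz_path": "audio_32k/a.flac"}
v010_row = {"local_path": "audio/a.wav"}

print(choose_path_column(v020_row, 32000))  # ('32khz_path', True)
print(choose_path_column(v020_row, 16000))  # ('local_path', False)
print(choose_path_column(v010_row, 32000))  # ('local_path', False)
```

Keeping the decision separate from the I/O makes the fallback rule easy to check: any requested rate without a usable pre-resampled path ends up on the original file.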

`AudioSetStrong`

📊 Dataset Information

Name	`audioset_strong`
Version	`0.1.0`
Owner	david; marius; masato
License	CC BY 4.0
Sources	YouTube
Available Splits	`train`, `train-environmental`

Description:

AudioSet Strong: Strongly-labeled subset with temporal annotations

AudioSet Strong Dataset

Description

AudioSet Strong is a strongly-labeled subset of AudioSet with temporal annotations (start and end times) for sound events. This dataset provides precise timing information for when each sound event occurs within the 10-second audio clips.

AudioSet is a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research, using a carefully structured hierarchical ontology of 632 audio classes in 10-second segments of YouTube videos.

This class makes the AudioSet Strong subset available in the alp-data strongly-labeled format, where each entry consists of:

- An audio recording (10 seconds, pre-resampled to 32kHz)
- A selection table with temporal annotations (begin time, end time, label)

The strong labels provide temporal boundaries for sound events, making this dataset suitable for sound event detection and temporal localization tasks.
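To make the selection-table format concrete, here is a minimal sketch that uses only the standard library. It assumes the table is stored as tab-separated text with the column names shown in the examples for this dataset. The helper `parse_selection_table` and the sample events are hypothetical; the class itself parses the table with polars and returns a pandas DataFrame.

```python
from __future__ import annotations

import csv
from io import StringIO

# Hypothetical tab-separated selection table for one 10-second clip.
blob = (
    "Selection\tBegin Time (s)\tEnd Time (s)\tLabel\n"
    "1\t0.5\t2.0\tDog\n"
    "2\t9.2\t10.0\tSpeech\n"
    "3\t10.4\t11.0\tWind\n"
)


def parse_selection_table(blob: str, audio_duration: float) -> list[dict]:
    """Parse the table and keep only events that begin before the audio ends."""
    events = []
    for record in csv.DictReader(StringIO(blob), delimiter="\t"):
        begin = float(record["Begin Time (s)"])
        if begin < audio_duration:
            events.append(
                {
                    "Selection": int(record["Selection"]),
                    "Begin Time (s)": begin,
                    "End Time (s)": float(record["End Time (s)"]),
                    "Label": record["Label"],
                }
            )
    return events


events = parse_selection_table(blob, audio_duration=10.0)
print([e["Label"] for e in events])  # ['Dog', 'Speech']
```

The third event starts after the 10-second clip ends, so it is dropped. Event end times are left untouched.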

AudioSet recordings include those available in this huggingface dataset: https://huggingface.co/datasets/agkphysics/AudioSet

Available Splits

train: AudioSet Strong training set with 32kHz pre-resampled audio (8115 rows).
train-environmental: Filtered to rows where ALL labels are environmental sounds (from AudioSet's environmental subset). 1109 rows.

References

Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. Gemmeke et al., 2017. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45857.pdf

AudioSet Homepage: https://research.google.com/audioset/

Examples:

>>> from alp_data.datasets import AudioSetStrong
>>> dataset = AudioSetStrong(split="train", sample_rate=32000)
>>> print(len(dataset))
8115
>>> item = dataset[0]
>>> keys = sorted([k for k in item.keys() if k != '32khz_path'])
>>> len(keys)
7
>>> 'sample_rate' in keys and 'audio' in keys
True
>>> print(list(item['selection_table'].columns))
['Selection', 'Begin Time (s)', 'End Time (s)', 'Label']

>>> env_dataset = AudioSetStrong(split="train-environmental", sample_rate=32000)
>>> print(len(env_dataset))
1109

Source code in alp_data/datasets/audioset_strong.py

@register_dataset
class AudioSetStrong(Dataset):
    """AudioSet Strong Dataset

    Description
    -----------
    AudioSet Strong is a strongly-labeled subset of AudioSet with temporal annotations
    (start and end times) for sound events. This dataset provides precise timing
    information for when each sound event occurs within the 10-second audio clips.

    AudioSet is a large-scale dataset of manually-annotated audio events that endeavors
    to bridge the gap in data availability between image and audio research, using a
    carefully structured hierarchical ontology of 632 audio classes in 10-second
    segments of YouTube videos.

    This class makes the AudioSet Strong subset available in the alp-data strongly-labeled
    format, where each entry consists of:
    - An audio recording (10 seconds, pre-resampled to 32kHz)
    - A selection table with temporal annotations (begin time, end time, label)

    The strong labels provide temporal boundaries for sound events, making this dataset
    suitable for sound event detection and temporal localization tasks.

    AudioSet recordings include those available in this huggingface dataset:
    https://huggingface.co/datasets/agkphysics/AudioSet

    Available Splits
    ----------------
    - ``train``: AudioSet Strong training set with 32kHz pre-resampled audio (8115 rows).
    - ``train-environmental``: Filtered to rows where ALL labels are environmental sounds
      (from AudioSet's environmental subset). 1109 rows.

    References
    ----------
    AUDIO SET: AN ONTOLOGY AND HUMAN-LABELED DATASET FOR AUDIO EVENTS
    Gemmeke et al. 2017
    https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45857.pdf

    AudioSet Homepage:
    https://research.google.com/audioset/

    Examples
    --------
    >>> from alp_data.datasets import AudioSetStrong
    >>> dataset = AudioSetStrong(split="train", sample_rate=32000)
    >>> print(len(dataset))
    8115
    >>> item = dataset[0]
    >>> keys = sorted([k for k in item.keys() if k != '32khz_path'])
    >>> len(keys)
    7
    >>> 'sample_rate' in keys and 'audio' in keys
    True
    >>> print(list(item['selection_table'].columns))
    ['Selection', 'Begin Time (s)', 'End Time (s)', 'Label']

    >>> env_dataset = AudioSetStrong(split="train-environmental", sample_rate=32000)
    >>> print(len(env_dataset))
    1109
    """

    info = DatasetInfo(
        name="audioset_strong",
        owner="david; marius; masato",
        split_paths={
            "train": f"{_CSV_ROOT}/audioset_train_strong_32khz_only.csv",
            "train-environmental": f"{_CSV_ROOT}/audioset_train_strong_32khz_environmental.csv",
        },
        version="0.1.0",
        description="AudioSet Strong: Strongly-labeled subset with temporal annotations",
        sources=["YouTube"],
        license="CC BY 4.0",
    )

    _sample_rate_paths = {
        32000: "32khz_path",
    }
    _originals_path_column = "audio_path"

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load. Available splits:
            - "train": Full set with 32kHz pre-resampled audio (8115 rows)
            - "train-environmental": Environmental sounds only (1109 rows)
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            Target sample rate for audio, by default 16000. If sample_rate=32000,
            pre-resampled audio is loaded directly. Other sample rates resample
            on-the-fly. If None, audio is returned at its original sample rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Label"]

        self.sample_rate = sample_rate
        self.data_root = anypath(data_root) if data_root is not None else None

        # Load split CSV
        self._load()

        # If no explicit data_root, set to the raw directory (go up two levels from csv file)
        if self.data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent.parent

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return the available pre-resampled sample rates.

        Returns
        -------
        list[int]
            List of sample rates (in Hz) for which pre-resampled audio is available.
            Audio at these sample rates can be loaded directly without on-the-fly
            resampling. This checks which path columns actually exist in the loaded data.
        """
        available = []
        for sr, path_column in self._sample_rate_paths.items():
            if path_column in self._data.columns:
                available.append(sr)
        return available

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        return len(self._data)

    @staticmethod
    def _empty_selection_table() -> pl.DataFrame:
        # Default Raven-style selection table columns we expect for strong labels.
        return pl.DataFrame(
            schema={
                "Selection": pl.Int64,
                "Begin Time (s)": pl.Float64,
                "End Time (s)": pl.Float64,
                "Label": pl.Utf8,
            }
        )

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        audio = None
        sr = None

        # Use pre-resampled audio if available for the requested sample rate
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if (
                path_column in row
                and row[path_column] not in (None, "")
                and not (isinstance(row[path_column], float) and np.isnan(row[path_column]))
            ):
                presampled_path = self.data_root / str(row[path_column])
                try:
                    audio, sr = read_audio(presampled_path)
                    sample_rate = sr
                    audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)
                    # Validate audio length (corrupt files may be very short)
                    if len(audio) < self.sample_rate:
                        audio = None
                except Exception:
                    audio = None

        # Fall back to original audio with on-the-fly resampling if needed
        if audio is None:
            audio_path = (
                (self.data_root / row[self._originals_path_column])
                if self.data_root
                else anypath(row[self._originals_path_column])
            )
            audio, sample_rate = read_audio(audio_path)
            audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

            # Resample if necessary
            if self.sample_rate is not None and sample_rate != self.sample_rate:
                audio = librosa.resample(
                    y=audio,
                    orig_sr=sample_rate,
                    target_sr=self.sample_rate,
                    scale=True,
                    res_type="kaiser_best",
                )
                sample_rate = self.sample_rate

        # Selection table (using polars for ~5x faster parsing)
        selection_table_blob = row.get("selection_table", "")
        if selection_table_blob is None or selection_table_blob == "":
            st = self._empty_selection_table()
        else:
            st = pl.read_csv(StringIO(selection_table_blob), separator="\t")

        # Drop events that begin at or after the end of the audio
        audio_dur = len(audio) / float(sample_rate)
        if "Begin Time (s)" in st.columns:
            st = st.filter(pl.col("Begin Time (s)") < audio_dur)

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["selection_table"] = (
            st.to_pandas()
        )  # to adhere to the rest of the selection_table datasets

        if self.output_take_and_give:
            item: dict[str, Any] = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        if idx < 0 or idx >= len(self._data):
            raise IndexError(f"Index {idx} out of bounds for dataset length {len(self._data)}")

        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["AudioSetStrong", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def get_available_labels(self) -> List[str]:
        """
        Return all possible labels found in the dataset.

        Returns
        -------
        List[str]
            A sorted list of all unique labels in the dataset.
        """
        labels: set[str] = set()
        for row in self._data:
            selection_table_blob = row.get("selection_table", "")
            if selection_table_blob is None or selection_table_blob == "":
                continue
            st = pl.read_csv(StringIO(selection_table_blob), separator="\t")
            if "Label" in st.columns:
                labels.update(st["Label"].cast(pl.Utf8).to_list())

        return sorted(labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
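The fallback in `_process` above (try the pre-resampled file, then fall back to the original when reading fails or the audio is too short) can be isolated in a small sketch. The function `load_with_fallback` and the three stand-in loaders are hypothetical; real loaders would read audio from disk.

```python
from __future__ import annotations

from typing import Callable, Sequence


def load_with_fallback(
    load_presampled: Callable[[], Sequence[float]],
    load_original: Callable[[], Sequence[float]],
    min_samples: int,
) -> tuple[Sequence[float], str]:
    """Try the pre-resampled loader first; fall back on errors or short audio."""
    try:
        audio = load_presampled()
        if len(audio) >= min_samples:
            return audio, "presampled"
    except Exception:
        # A broad catch mirrors the listing above; narrower exception
        # types may be preferable in practice.
        pass
    return load_original(), "original"


# Stand-in loaders (hypothetical) that simulate the three cases.
def broken() -> list[float]:
    raise OSError("corrupt file")


def short() -> list[float]:
    return [0.0] * 10


def good() -> list[float]:
    return [0.0] * 100


print(load_with_fallback(good, good, min_samples=50)[1])    # presampled
print(load_with_fallback(short, good, min_samples=50)[1])   # original
print(load_with_fallback(broken, good, min_samples=50)[1])  # original
```

In the listing, the minimum length is one second of audio at the requested sample rate, which corresponds to `min_samples=self.sample_rate` here.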

`Beans`

📊 Dataset Information

Name	`beans`
Version	`0.1.0`
Owner	gagan
License	CC-BY-4.0, CC0
Sources	cbi, watkins, dogs, egyptian_fruit_bats, hiceas, dcase, enabirds, esc50, speech_commands, humbugdb, rfcx, hainan_gibbons
Available Splits	`train`, `validation`, `test`, `cbi_test`, `cbi_validation`, `cbi_train`, `watkins_test`, `watkins_validation`, `watkins_train`, `dogs_test`, ... (39 total)

Description:

BEANS benchmark dataset

BEANS dataset

Description

BEANS (the BEnchmark of ANimal Sounds) is a collection of bioacoustics tasks and public datasets, specifically designed to measure the performance of machine learning algorithms in the field of bioacoustics. The benchmark proposed here consists of two common tasks in bioacoustics: classification and detection. It includes 12 datasets covering various species, including birds, land and marine mammals, anurans, and insects.

References

BEANS: The Benchmark of Animal Sounds. Masato Hagiwara et al., 2022. https://arxiv.org/abs/2210.12300 (code: https://github.com/earthspecies/beans)

Examples:

>>> from alp_data.datasets import Beans
>>> dataset = Beans(
...     split="validation",
...     output_take_and_give={"species_scientific": "species"},
...     sample_rate=16000,
...     streaming=True,
... )
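The `output_take_and_give` argument used in the example both renames and filters keys. The sketch below shows that behavior on a plain dictionary, assuming Beans applies the mapping the same way as the AudioSet and AudioSetStrong listings on this page. The helper `apply_take_and_give` and the sample row are hypothetical.

```python
from __future__ import annotations

from typing import Any


def apply_take_and_give(row: dict[str, Any], mapping: dict[str, str] | None) -> dict[str, Any]:
    """Keep only the mapped keys and rename them; pass the row through if no mapping."""
    if not mapping:
        return row
    return {new_key: row[old_key] for old_key, new_key in mapping.items()}


# Hypothetical processed row.
row = {
    "species_scientific": "Canis lupus",
    "local_path": "dogs/001.wav",
    "audio": [0.0, 0.1],
}

print(apply_take_and_give(row, {"species_scientific": "species"}))
# {'species': 'Canis lupus'}
print(sorted(apply_take_and_give(row, None)))
# ['audio', 'local_path', 'species_scientific']
```

Because unmapped keys are dropped, `audio` has to be listed in the mapping explicitly if it should appear in the output.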

Source code in alp_data/datasets/beans.py

@register_dataset
class Beans(Dataset):
    """BEANS dataset

    Description
    -----------
    BEANS (the BEnchmark of ANimal Sounds) is a collection of bioacoustics tasks
    and public datasets, specifically designed to measure the performance of machine
    learning algorithms in the field of bioacoustics. The benchmark proposed here
    consists of two common tasks in bioacoustics: classification and detection.
    It includes 12 datasets covering various species, including birds, land and
    marine mammals, anurans, and insects.

    References
    ----------
    BEANS: The Benchmark of Animal Sounds
    Masato Hagiwara et al 2022
    https://arxiv.org/abs/2210.12300
    https://github.com/earthspecies/beans

    Examples
    --------
    >>> from alp_data.datasets import Beans
    >>> dataset = Beans(
    ...     split="validation",
    ...     output_take_and_give={"species_scientific": "species"},
    ...     sample_rate=16000,
    ...     streaming=True,
    ... )
    """

    info = DatasetInfo(
        name="beans",
        owner="gagan",
        split_paths={
            "train": f"{_RAW_ROOT}/beans_train_v3.csv",
            "validation": f"{_RAW_ROOT}/beans_val_v3.csv",
            "test": f"{_RAW_ROOT}/beans_test_v3.csv",
            "cbi_test": f"{_RAW_ROOT}/cbi_test.jsonl",
            "cbi_validation": f"{_RAW_ROOT}/cbi_val.jsonl",
            "cbi_train": f"{_RAW_ROOT}/cbi_train.jsonl",
            "watkins_test": f"{_RAW_ROOT}/watkins_test.jsonl",
            "watkins_validation": f"{_RAW_ROOT}/watkins_val.jsonl",
            "watkins_train": f"{_RAW_ROOT}/watkins_train.jsonl",
            "dogs_test": f"{_RAW_ROOT}/dogs_test.jsonl",
            "dogs_validation": f"{_RAW_ROOT}/dogs_val.jsonl",
            "dogs_train": f"{_RAW_ROOT}/dogs_train.jsonl",
            "egyptian_fruit_bats_test": f"{_RAW_ROOT}/egyptian_fruit_bats_test.jsonl",
            "egyptian_fruit_bats_validation": f"{_RAW_ROOT}/egyptian_fruit_bats_val.jsonl",
            "egyptian_fruit_bats_train": f"{_RAW_ROOT}/egyptian_fruit_bats_train.jsonl",
            "hiceas_test": f"{_RAW_ROOT}/hiceas_test.jsonl",
            "hiceas_validation": f"{_RAW_ROOT}/hiceas_val.jsonl",
            "hiceas_train": f"{_RAW_ROOT}/hiceas_train.jsonl",
            "dcase_test": f"{_RAW_ROOT}/dcase_test.jsonl",
            "dcase_validation": f"{_RAW_ROOT}/dcase_val.jsonl",
            "dcase_train": f"{_RAW_ROOT}/dcase_train.jsonl",
            "enabirds_test": f"{_RAW_ROOT}/enabirds_test.jsonl",
            "enabirds_validation": f"{_RAW_ROOT}/enabirds_val.jsonl",
            "enabirds_train": f"{_RAW_ROOT}/enabirds_train.jsonl",
            "esc50_test": f"{_RAW_ROOT}/esc50_test.jsonl",
            "esc50_validation": f"{_RAW_ROOT}/esc50_val.jsonl",
            "esc50_train": f"{_RAW_ROOT}/esc50_train.jsonl",
            "speech_commands_test": f"{_RAW_ROOT}/speech_commands_test_v2.jsonl",
            "speech_commands_validation": f"{_RAW_ROOT}/speech_commands_val_v2.jsonl",
            "speech_commands_train": f"{_RAW_ROOT}/speech_commands_train_v2.jsonl",
            "humbugdb_test": f"{_RAW_ROOT}/humbugdb_test.jsonl",
            "humbugdb_validation": f"{_RAW_ROOT}/humbugdb_val.jsonl",
            "humbugdb_train": f"{_RAW_ROOT}/humbugdb_train.jsonl",
            "rfcx_test": f"{_RAW_ROOT}/rfcx_test.jsonl",
            "rfcx_validation": f"{_RAW_ROOT}/rfcx_val.jsonl",
            "rfcx_train": f"{_RAW_ROOT}/rfcx_train.jsonl",
            "hainan_gibbons_test": f"{_RAW_ROOT}/hainan_gibbons_test.jsonl",
            "hainan_gibbons_validation": f"{_RAW_ROOT}/hainan_gibbons_val.jsonl",
            "hainan_gibbons_train": f"{_RAW_ROOT}/hainan_gibbons_train.jsonl",
        },
        version="0.1.0",
        description="BEANS benchmark dataset",
        sources=[
            "cbi",
            "watkins",
            "dogs",
            "egyptian_fruit_bats",
            "hiceas",
            "dcase",
            "enabirds",
            "esc50",
            "speech_commands",
            "humbugdb",
            "rfcx",
            "hainan_gibbons",
        ],
        license="CC-BY-4.0, CC0",
    )

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the BEANS dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int, optional
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the path
            item of each sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate
        # Default to the directory that contains the split file
        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        if anypath(location).suffix == ".jsonl":
            # For JSONL files, read them directly into a DataFrame
            self._data = self._backend_class.from_json(location, lines=True, orient="records")
        else:
            # Read CSV content
            self._data = self._backend_class.from_csv(
                location, keep_default_na=False, na_values=[""], null_values=[]
            )  # This setting avoids setting 'None' to a pd.NA type

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["Beans", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._data)

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            and basic statistics if data is loaded.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split: {self.split}"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
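The `output_take_and_give` handling in `_process` above both filters and renames columns. The following is a minimal standalone sketch of that behaviour, not the library code itself; the function name `apply_output_mapping` and the sample row are hypothetical and only illustrate the logic.

```python
from __future__ import annotations

from typing import Any


def apply_output_mapping(
    row: dict[str, Any], mapping: dict[str, str] | None
) -> dict[str, Any]:
    """Keep only the mapped columns, renamed to the mapping's values.

    An empty or missing mapping returns the row unchanged.
    """
    if not mapping:
        return row
    return {new_name: row[old_name] for old_name, new_name in mapping.items()}


# Hypothetical row, shaped like the dict that _process builds.
row = {
    "local_path": "clips/a.wav",
    "species_common": "zebra finch",
    "audio": [0.0, 0.1],
    "sample_rate": 16000,
}

mapped = apply_output_mapping(row, {"species_common": "label", "audio": "audio"})
print(mapped)
```

Because the mapping acts as a filter, a mapping that omits `"audio"` drops the waveform from the returned item, so it needs to be listed explicitly when the audio is required downstream.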

`BeansZero`

📊 Dataset Information

Name	`beans_zero`
Version	`0.1.0`
Owner	gagan, masato, david, marius
License	CC-BY-4.0, CC0
Sources	Xeno-canto, iNaturalist, Animal Sound Archive, Elie and Theunissen 2016, Beans, esc50, rfcx, CBI, HumBugDB, Enabirds, HICEAS, Watkins, Gibbons, DCASE-2021-Task-5
Available Splits	`test`, `cbi`, `watkins`, `hiceas`, `dcase`, `enabirds`, `esc50`, `humbugdb`, `rfcx`, `gibbons`, ... (23 total)

Description:

BEANS-Zero benchmark dataset

BEANS-Zero dataset

Description

BEANS-Zero is a bioacoustics benchmark designed to evaluate multimodal audio-language models in zero-shot settings. Introduced in the NatureLM-audio paper (Robinson et al., 2025), it brings together tasks from both existing datasets and newly curated resources. The benchmark focuses on models that take a bioacoustic audio input (e.g., bird or mammal vocalizations) and a text instruction (e.g., "What species is in this audio?"), and return a textual output (e.g., "Taeniopygia guttata"). As a zero-shot benchmark, BEANS-Zero contains only a test split; no training or in-context examples are provided.

References

NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics. David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin. https://openreview.net/forum?id=hJVdwBpWjt

Huggingface Dataset: https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero

Examples:

>>> from alp_data.datasets import BeansZero
>>> dataset = BeansZero(
...     split="test",
...     output_take_and_give={"output": "species"},
...     sample_rate=16000,
...     streaming=True,
... )
>>> sample = next(iter(dataset))
>>> print(sample["species"])
None
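The example uses `next(iter(dataset))` because, in the `BeansZero` source, `__len__` raises `NotImplementedError` when `streaming=True`. The sketch below reproduces only that guard with a hypothetical stand-in class (`ToyDataset` is not part of `alp_data`), to show how calling code might probe for length support.

```python
from typing import Any, Iterator


class ToyDataset:
    """Hypothetical stand-in mirroring only the length and iteration guards."""

    def __init__(self, rows: list[dict[str, Any]], streaming: bool = False) -> None:
        self._data = rows
        self._streaming = streaming

    def __len__(self) -> int:
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._data)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        yield from self._data


ds = ToyDataset([{"output": "a"}, {"output": "b"}], streaming=True)

# Iteration works in streaming mode.
first = next(iter(ds))

# Length does not, so guard the call when the mode is not known in advance.
try:
    n_samples = len(ds)
except NotImplementedError:
    n_samples = None

print(first, n_samples)
```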

Source code in alp_data/datasets/beans_zero.py

@register_dataset
class BeansZero(Dataset):
    """BEANS-Zero dataset

    Description
    -----------
    BEANS-Zero is a bioacoustics benchmark designed to evaluate multimodal
    audio-language models in zero-shot settings. Introduced in the
    NatureLM-audio paper (Robinson et al., 2025), it brings together tasks
    from both existing datasets and newly curated resources.
    The benchmark focuses on models that take a bioacoustic audio input
    (e.g., bird or mammal vocalizations) and a text instruction
    (e.g., "What species is in this audio?"),
    and return a textual output (e.g., "Taeniopygia guttata").
    As a zero-shot benchmark, BEANS-Zero contains only a test
    split—no training or in-context examples are provided.

    References
    ----------
    NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics
    David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin
    https://openreview.net/forum?id=hJVdwBpWjt

    Huggingface Dataset:
    https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero


    Examples
    --------
    >>> from alp_data.datasets import BeansZero
    >>> dataset = BeansZero(
    ...     split="test",
    ...     output_take_and_give={"output": "species"},
    ...     sample_rate=16000,
    ...     streaming=True,
    ... )
    >>> sample = next(iter(dataset))
    >>> print(sample["species"])
    None
    """

    info = DatasetInfo(
        name="beans_zero",
        owner="gagan, masato, david, marius",
        split_paths={
            # 'test' is the full test set combining all tasks
            "test": f"{_RAW_ROOT}/test.jsonl",
            "cbi": f"{_RAW_ROOT}/cbi_test.jsonl",
            "watkins": f"{_RAW_ROOT}/watkins_test.jsonl",
            "hiceas": f"{_RAW_ROOT}/hiceas_test.jsonl",
            "dcase": f"{_RAW_ROOT}/dcase_test.jsonl",
            "enabirds": f"{_RAW_ROOT}/enabirds_test.jsonl",
            "esc50": f"{_RAW_ROOT}/esc50_test.jsonl",
            "humbugdb": f"{_RAW_ROOT}/humbugdb_test.jsonl",
            "rfcx": f"{_RAW_ROOT}/rfcx_test.jsonl",
            "gibbons": f"{_RAW_ROOT}/gibbons_test.jsonl",
            "lifestage": f"{_RAW_ROOT}/lifestage_test.jsonl",
            "call-type": f"{_RAW_ROOT}/call-type_test.jsonl",
            "captioning": f"{_RAW_ROOT}/captioning_test.jsonl",
            "zf-indiv": f"{_RAW_ROOT}/zf-indiv_test.jsonl",
            "unseen-family-cmn": f"{_RAW_ROOT}/unseen-family-cmn_test.jsonl",
            "unseen-family-sci": f"{_RAW_ROOT}/unseen-family-sci_test.jsonl",
            "unseen-family-tax": f"{_RAW_ROOT}/unseen-family-tax_test.jsonl",
            "unseen-genus-cmn": f"{_RAW_ROOT}/unseen-genus-cmn_test.jsonl",
            "unseen-genus-sci": f"{_RAW_ROOT}/unseen-genus-sci_test.jsonl",
            "unseen-genus-tax": f"{_RAW_ROOT}/unseen-genus-tax_test.jsonl",
            "unseen-species-cmn": f"{_RAW_ROOT}/unseen-species-cmn_test.jsonl",
            "unseen-species-sci": f"{_RAW_ROOT}/unseen-species-sci_test.jsonl",
            "unseen-species-tax": f"{_RAW_ROOT}/unseen-species-tax_test.jsonl",
        },
        version="0.1.0",
        description="BEANS-Zero benchmark dataset",
        sources=[
            "Xeno-canto",
            "iNaturalist",
            "Animal Sound Archive",
            "Elie and Theunissen 2016",
            "Beans",
            "esc50",
            "rfcx",
            "CBI",
            "HumBugDB",
            "Enabirds",
            "HICEAS",
            "Watkins",
            "Gibbons",
            "DCASE-2021-Task-5",
        ],
        license="CC-BY-4.0, CC0",
    )

    # Mapping of sample rates to their corresponding path columns
    _sample_rate_paths = {
        32000: "audio_path_32KHz",  # Pre-resampled to 32kHz with librosa.resample
        16000: "audio_path_16KHz",  # Pre-resampled to 16kHz with librosa.resample
    }

    # Column name for original variable-rate audio files
    _originals_path_column = "audio_path_original_sample_rate"

    def __init__(
        self,
        split: str = "test",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the BEANS-Zero dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the relative
            audio path of each sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate
        self.data_root = data_root

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return the available pre-resampled sample rates.

        Returns
        -------
        list[int]
            List of sample rates (in Hz) for which pre-resampled audio is available.
            Audio at these sample rates can be loaded directly without on-the-fly resampling.
            This checks which path columns actually exist in the loaded data.
        """
        available = []
        for sr, path_column in self._sample_rate_paths.items():
            # Check if the path column exists in the loaded data
            if path_column in self._data.columns:
                available.append(sr)
        return available

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        if anypath(location).suffix == ".jsonl":
            # For JSONL files, read them directly into a DataFrame
            self._data = self._backend_class.from_json(location, lines=True, orient="records")
        else:
            # Read CSV content
            self._data = self._backend_class.from_csv(
                location, keep_default_na=False, na_values=[""], null_values=[]
            )  # This setting avoids setting 'None' to a pd.NA type

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["BeansZero", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Determine which path column to use based on requested sample rate
        # If a pre-resampled version is available, use it; otherwise resample on-the-fly
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            # Check if the pre-resampled path column exists in the data
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                # Use pre-resampled audio
                audio_path = anypath(self.data_root) / row[path_column]
                use_presampled = True

        if use_presampled:
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")
            # Audio is already at the correct sample rate, no resampling needed
        else:
            # Use original variable-rate files and resample on-the-fly if needed
            audio_path = anypath(self.data_root) / row[self._originals_path_column]
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")

            if self.sample_rate is not None and sample_rate != self.sample_rate:
                audio = librosa.resample(
                    y=audio,
                    orig_sr=sample_rate,
                    target_sr=self.sample_rate,
                    scale=True,
                    res_type="kaiser_best",
                )
                sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._data)

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            and basic statistics if data is loaded.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split: {self.split}"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
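`BeansZero._process` above prefers a pre-resampled file when the requested sample rate has a matching, non-empty path column, and otherwise falls back to the original file (which is then resampled on the fly if needed). The following sketch isolates that column choice; the column names are taken from the source, while the function name `select_path_column` and the sample row are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Column names as they appear in the BeansZero source.
SAMPLE_RATE_PATHS = {32000: "audio_path_32KHz", 16000: "audio_path_16KHz"}
ORIGINALS_PATH_COLUMN = "audio_path_original_sample_rate"


def select_path_column(row: dict[str, Any], requested_sr: int | None) -> tuple[str, bool]:
    """Return (path column, is_presampled) for one metadata row."""
    column = SAMPLE_RATE_PATHS.get(requested_sr) if requested_sr is not None else None
    # A pre-resampled column is used only if it exists and is non-empty.
    if column is not None and row.get(column) not in (None, ""):
        return column, True
    return ORIGINALS_PATH_COLUMN, False


# Hypothetical row: 16 kHz copy present, 32 kHz copy missing.
row = {
    "audio_path_16KHz": "16k/a.wav",
    "audio_path_32KHz": "",
    "audio_path_original_sample_rate": "orig/a.ogg",
}

for sr in (16000, 32000, 44100, None):
    print(sr, select_path_column(row, sr))
```

When the second element is `False` and a sample rate was requested, the caller still has to resample the original audio, as the source does with `librosa.resample`.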

`BengaleseFinchCalls`

📊 Dataset Information

Name	`Bengalese Finch Calls`
Version	`0.1.0`
Owner	david
License	CC-BY-4.0, CC0
Sources	BanglaJp_whistle
Available Splits	`Bird0`, `Bird1`, `Bird2`, `Bird3`, `Bird4`, `Bird5`, `Bird6`, `Bird7`, `Bird8`, `Bird9`, ... (55 total)

Description:

Bengalese Finch calls annotated with call-type and individual IDs, organized by individual birds.

Bengalese Finch call-type dataset with individual bird splits.
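The 55 splits follow a regular naming pattern: one complete-repertoire split per bird (`Bird0` to `Bird10`) plus four ML splits per bird. As a rough sketch, the names can be enumerated as follows; the variable names are illustrative and the authoritative list remains `info.split_paths` in the source.

```python
# Eleven individual birds, Bird0 through Bird10.
birds = [f"Bird{i}" for i in range(11)]

# Four ML splits exist for each bird.
suffixes = ["train", "train_small", "valid", "test"]

split_names = birds + [f"{bird}_{suffix}" for bird in birds for suffix in suffixes]

print(len(split_names))
print(split_names[:3], split_names[11:15])
```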

Source code in alp_data/datasets/bengalese_finch_calls.py

@register_dataset
class BengaleseFinchCalls(Dataset):
    """Bengalese Finch call-type dataset with individual bird splits."""

    info = DatasetInfo(
        name="Bengalese Finch Calls",
        owner="david",
        split_paths={
            # Original bird datasets (complete individual repertoires)
            "Bird0": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird0.csv",
            "Bird1": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird1.csv",
            "Bird2": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird2.csv",
            "Bird3": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird3.csv",
            "Bird4": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird4.csv",
            "Bird5": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird5.csv",
            "Bird6": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird6.csv",
            "Bird7": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird7.csv",
            "Bird8": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird8.csv",
            "Bird9": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird9.csv",
            "Bird10": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird10.csv",
            # Bird0 splits (9 call types, 7,652 samples)
            "Bird0_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird0_train.csv",
            "Bird0_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird0_train_small.csv",
            "Bird0_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird0_valid.csv",
            "Bird0_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird0_test.csv",
            # Bird1 splits (12 call types, 35,728 samples)
            "Bird1_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird1_train.csv",
            "Bird1_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird1_train_small.csv",
            "Bird1_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird1_valid.csv",
            "Bird1_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird1_test.csv",
            # Bird2 splits (17 call types, 26,127 samples) - highest diversity
            "Bird2_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird2_train.csv",
            "Bird2_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird2_train_small.csv",
            "Bird2_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird2_valid.csv",
            "Bird2_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird2_test.csv",
            # Bird3 splits (9 call types, 29,470 samples)
            "Bird3_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird3_train.csv",
            "Bird3_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird3_train_small.csv",
            "Bird3_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird3_valid.csv",
            "Bird3_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird3_test.csv",
            # Bird4 splits (5 call types, 26,891 samples)
            "Bird4_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird4_train.csv",
            "Bird4_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird4_train_small.csv",
            "Bird4_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird4_valid.csv",
            "Bird4_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird4_test.csv",
            # Bird5 splits (7 call types, 20,525 samples)
            "Bird5_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird5_train.csv",
            "Bird5_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird5_train_small.csv",
            "Bird5_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird5_valid.csv",
            "Bird5_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird5_test.csv",
            # Bird6 splits (5 call types, 17,653 samples)
            "Bird6_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird6_train.csv",
            "Bird6_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird6_train_small.csv",
            "Bird6_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird6_valid.csv",
            "Bird6_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird6_test.csv",
            # Bird7 splits (7 call types, 20,722 samples)
            "Bird7_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird7_train.csv",
            "Bird7_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird7_train_small.csv",
            "Bird7_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird7_valid.csv",
            "Bird7_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird7_test.csv",
            # Bird8 splits (4 call types, 4,985 samples)
            "Bird8_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird8_train.csv",
            "Bird8_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird8_train_small.csv",
            "Bird8_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird8_valid.csv",
            "Bird8_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird8_test.csv",
            # Bird9 splits (6 call types, 19,541 samples)
            "Bird9_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird9_train.csv",
            "Bird9_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird9_train_small.csv",
            "Bird9_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird9_valid.csv",
            "Bird9_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird9_test.csv",
            # Bird10 splits (12 call types, 5,743 samples)
            "Bird10_train": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird10_train.csv",
            "Bird10_train_small": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird10_train_small.csv",
            "Bird10_valid": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird10_valid.csv",
            "Bird10_test": f"{DATA_HOME}/bengalese_finch/v0.1.0/raw/Bird10_test.csv",
        },
        version="0.1.0",
        description=(
            "Bengalese Finch calls annotated with call-type and individual IDs, "
            "organized by individual birds."
        ),
        sources=["BanglaJp_whistle"],
        license="CC-BY-4.0, CC0",
    )

    def __init__(
        self,
        split: str = "Bird2_train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Create a :class:`BengaleseFinchCalls` instance.

        Parameters
        ----------
        split: str
            Which bird/split to load. Options include:
            - Individual birds: "Bird0", "Bird1", ..., "Bird10" (complete repertoires)
            - ML splits: "{BirdX}_train", "{BirdX}_train_small", "{BirdX}_valid", "{BirdX}_test"
        output_take_and_give: dict[str, str], optional
            Mapping from original column names to desired output names.  When
            provided, the dataset __getitem__ will return only the mapped
            columns and use the *values* of this dict as keys.
        sample_rate: int, optional
            Target sample-rate.  If provided and differs from the original, the
            audio is resampled with ``librosa.resample``.
        data_root: str | AnyPathT, optional
            Custom root directory for audio files.  When *None* (default), we
            automatically use the parent directory of the metadata CSV.
        backend: BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming: bool, optional
            Whether to use streaming mode, by default False

        Raises
        ------
        LookupError
            If the specified split is not available in the dataset.
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate

        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split '{self.split}'. Available: {list(self.info.split_paths)}"
            )

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data = None
        self._load()

    @property
    def columns(self) -> list[str]:
        """Return the DataFrame column names."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the names of available splits."""
        return list(self.info.split_paths)

    def _load(self) -> None:
        """Load the CSV for the chosen split into :py:attr:`_data`."""
        csv_path = self.info.split_paths[self.split]

        # Read as DataFrame (avoid NA coercion so that strings stay strings)
        self._data = self._backend_class.from_csv(
            csv_path,
            streaming=self._streaming,  # keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("Dataset not loaded - call _load() first.")
        if self._streaming:
            raise NotImplementedError("Length not available in streaming mode.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Construct full audio path ("local_path" is relative)
        audio_path = anypath(self.data_root) / row["local_path"]

        # Load the audio file
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        audio = audio_stereo_to_mono(audio, mono_method="average")

        # Resample if the user requested a specific sample-rate
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        # Apply output mapping if requested
        if self.output_take_and_give:
            mapped: dict[str, Any] = {}
            for src, dst in self.output_take_and_give.items():
                mapped[dst] = row[src]

            # Always include audio unless explicitly mapped
            if "audio" not in self.output_take_and_give:
                mapped["audio"] = row["audio"]
            return mapped

        return row

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["BengaleseFinchCalls", dict[str, Any]]:
        """Instantiate from a :class:`DatasetConfig`.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters.

        Returns
        -------
        tuple[BengaleseFinchCalls, dict[str, Any]]
            A tuple containing the dataset instance and metadata from transformations.
            If no transformations are applied, metadata will be an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata
        return ds, {}

    def __str__(self) -> str:  # noqa: D401 – keep style consistent
        base = f"{self.info.name} (v{self.info.version}), split='{self.split}'"
        return (
            f"{base}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths)}"
        )

`BirdSet`

📊 Dataset Information

Name	`birdset`
Version	`0.1.0`
Owner	marius; gagan; david
License	CC-BY-4.0, CC0
Sources	HSN, NBP, NES, PER, POW, SSW, SNE, UHH
Available Splits	`HSN-test`, `HSN-test_5s`, `NBP-test`, `NBP-test_5s`, `NES-test`, `NES-test_5s`, `PER-test`, `PER-test_5s`, `POW-test`, `POW-test_5s`, ... (17 total)

Description:

BirdSet avian bioacoustics benchmark with GBIF-linked taxonomy. Pre-resampled audio available at 16 kHz and 32 kHz (WAV). Original audio is 32 kHz OGG from the BirdSet HuggingFace repository.

BirdSet avian bioacoustics benchmark dataset.

Description

BirdSet is a large-scale benchmark dataset for audio classification focusing on avian bioacoustics. It includes over 6,800 recording hours from nearly 10,000 species for training and more than 400 hours across eight strongly labeled evaluation datasets. This version (v0.1.0) contains the eight evaluation subsets with test and test_5s splits, GBIF-linked taxonomy, and pre-resampled 16 kHz / 32 kHz WAV audio. The training data is not included in this dataset, but is a subset of the Xeno-canto dataset.

Available Metadata Fields

Taxonomic Information:
- species: Scientific species name (resolved from eBird code)
- species_common: Common English name
- ebird_code: eBird species code (primary label)
- ebird_code_multilabel: JSON list of all species codes in recording
- species_multispecies: JSON list of scientific names for all species
- canonical_name_multispecies: JSON list of canonical names for all species
- gbifID_multispecies: JSON list of GBIF backbone IDs for all species
- genus, order: Taxonomic hierarchy
- gbifID: GBIF backbone identifier

Audio File Paths:
- audio_path: Relative path to original 32 kHz OGG audio
- 16khz_path: Relative path to pre-resampled 16 kHz WAV
- 32khz_path: Relative path to pre-resampled 32 kHz WAV

Recording Metadata:
- duration: Recording length in seconds
- lat, long: GPS coordinates
- source: Data provenance (e.g. xeno-canto recording ID)
- microphone: Recording equipment
- license: License string

Annotation Boundaries (test splits only):
- start_time, end_time: Temporal boundaries (seconds)
- low_freq, high_freq: Frequency range (Hz); present for test soundscape splits, empty for test_5s clips
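The multilabel fields listed above are described as JSON lists, so they presumably arrive as JSON-encoded strings that need decoding before use. The following is a minimal sketch under that assumption. The row is hypothetical: the field names come from the list above, the values are illustrative only, and `decode_json_list` is a helper introduced here, not part of `alp_data`.

```python
import json

# Hypothetical row: field names follow the metadata list, values are made up.
row = {
    "ebird_code": "amerob",
    "ebird_code_multilabel": '["amerob", "norcar"]',
    "species_multispecies": "",  # blank field, e.g. nothing annotated
}


def decode_json_list(value):
    """Decode a JSON-list field, returning [] for missing or blank values."""
    if value is None or not str(value).strip():
        return []
    return json.loads(value)


codes = decode_json_list(row["ebird_code_multilabel"])
species = decode_json_list(row["species_multispecies"])
print(codes, species)
```

Returning an empty list for blank values keeps downstream multi-label encoding code free of special cases.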

Available Splits

Each of the eight evaluation subsets has two splits:

{SUBSET}-test: Full-length soundscape recordings (variable duration)
{SUBSET}-test_5s: 5-second clips extracted from test recordings

Subsets: HSN, NBP, NES, PER, POW, SSW, SNE, UHH.

all: Combined dataset across all subsets and splits.
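The naming convention above can be checked with a short sketch that rebuilds the split names from the subset codes. This only reconstructs the names from the documented pattern; on a real instance the `available_splits` property returns the registered keys.

```python
# Subset codes and split suffixes as documented above.
SUBSETS = ["HSN", "NBP", "NES", "PER", "POW", "SSW", "SNE", "UHH"]
SUFFIXES = ["test", "test_5s"]

# Two splits per subset, plus the combined "all" split.
split_names = [f"{subset}-{suffix}" for subset in SUBSETS for suffix in SUFFIXES]
split_names.append("all")

print(len(split_names))  # 17
```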

References

Rauch, Lukas, et al. "BirdSet: A multi-task benchmark for classification in avian bioacoustics." https://arxiv.org/abs/2403.10380

https://github.com/DBD-research-group/BirdSet

Examples:

>>> from alp_data.datasets import BirdSet
>>> dataset = BirdSet(split="HSN-test_5s", sample_rate=16000)
>>> print(dataset.available_sample_rates)
[16000, 32000]

Load with pre-resampled 16 kHz audio (no on-the-fly resampling):

>>> dataset_16k = BirdSet(split="POW-test_5s", sample_rate=16000)

Load original 32 kHz OGG (returned at native sample rate):

>>> dataset_raw = BirdSet(split="POW-test_5s")

Source code in alp_data/datasets/birdset.py

@register_dataset
class BirdSet(Dataset):
    """BirdSet avian bioacoustics benchmark dataset.

    Description
    -----------
    BirdSet is a large-scale benchmark dataset for audio classification focusing
    on avian bioacoustics.  It includes over 6,800 recording hours from nearly
    10,000 species for training and more than 400 hours across eight strongly
    labeled evaluation datasets.  This version (v0.1.0) contains the eight
    evaluation subsets with test and test_5s splits, GBIF-linked taxonomy, and
    pre-resampled 16 kHz / 32 kHz WAV audio. The training data is not included in this dataset,
    but is a subset of the Xeno-canto dataset.

    Available Metadata Fields
    -------------------------
    **Taxonomic Information:**
        - ``species``: Scientific species name (resolved from eBird code)
        - ``species_common``: Common English name
        - ``ebird_code``: eBird species code (primary label)
        - ``ebird_code_multilabel``: JSON list of all species codes in recording
        - ``species_multispecies``: JSON list of scientific names for all species
        - ``canonical_name_multispecies``: JSON list of canonical names for all species
        - ``gbifID_multispecies``: JSON list of GBIF backbone IDs for all species
        - ``genus``, ``order``: Taxonomic hierarchy
        - ``gbifID``: GBIF backbone identifier

    **Audio File Paths:**
        - ``audio_path``: Relative path to original 32 kHz OGG audio
        - ``16khz_path``: Relative path to pre-resampled 16 kHz WAV
        - ``32khz_path``: Relative path to pre-resampled 32 kHz WAV

    **Recording Metadata:**
        - ``duration``: Recording length in seconds
        - ``lat``, ``long``: GPS coordinates
        - ``source``: Data provenance (e.g. xeno-canto recording ID)
        - ``microphone``: Recording equipment
        - ``license``: License string

    **Annotation Boundaries (test splits only):**
        - ``start_time``, ``end_time``: Temporal boundaries (seconds)
        - ``low_freq``, ``high_freq``: Frequency range (Hz); present for
          ``test`` soundscape splits, empty for ``test_5s`` clips

    Available Splits
    ----------------
    Each of the eight evaluation subsets has two splits:

    - ``{SUBSET}-test``: Full-length soundscape recordings (variable duration)
    - ``{SUBSET}-test_5s``: 5-second clips extracted from test recordings

    Subsets: HSN, NBP, NES, PER, POW, SSW, SNE, UHH.

    - ``all``: Combined dataset across all subsets and splits.

    References
    ----------
    Rauch, Lukas, et al. "BirdSet: A multi-task benchmark for classification
    in avian bioacoustics." https://arxiv.org/abs/2403.10380

    https://github.com/DBD-research-group/BirdSet

    Examples
    --------
    >>> from alp_data.datasets import BirdSet
    >>> dataset = BirdSet(split="HSN-test_5s", sample_rate=16000)
    >>> print(dataset.available_sample_rates)
    [16000, 32000]

    Load with pre-resampled 16 kHz audio (no on-the-fly resampling):

    >>> dataset_16k = BirdSet(split="POW-test_5s", sample_rate=16000)

    Load original 32 kHz OGG (returned at native sample rate):

    >>> dataset_raw = BirdSet(split="POW-test_5s")
    """

    info = DatasetInfo(
        name="birdset",
        owner="marius; gagan; david",
        split_paths={
            "HSN-test": f"{_GCS_ROOT}/HSN_test_v2.csv",
            "HSN-test_5s": f"{_GCS_ROOT}/HSN_test_5s_v2.csv",
            "NBP-test": f"{_GCS_ROOT}/NBP_test_v2.csv",
            "NBP-test_5s": f"{_GCS_ROOT}/NBP_test_5s_v2.csv",
            "NES-test": f"{_GCS_ROOT}/NES_test_v2.csv",
            "NES-test_5s": f"{_GCS_ROOT}/NES_test_5s_v2.csv",
            "PER-test": f"{_GCS_ROOT}/PER_test_v2.csv",
            "PER-test_5s": f"{_GCS_ROOT}/PER_test_5s_v2.csv",
            "POW-test": f"{_GCS_ROOT}/POW_test_v2.csv",
            "POW-test_5s": f"{_GCS_ROOT}/POW_test_5s_v2.csv",
            "SSW-test": f"{_GCS_ROOT}/SSW_test_v2.csv",
            "SSW-test_5s": f"{_GCS_ROOT}/SSW_test_5s_v2.csv",
            "SNE-test": f"{_GCS_ROOT}/SNE_test_v2.csv",
            "SNE-test_5s": f"{_GCS_ROOT}/SNE_test_5s_v2.csv",
            "UHH-test": f"{_GCS_ROOT}/UHH_test_v2.csv",
            "UHH-test_5s": f"{_GCS_ROOT}/UHH_test_5s_v2.csv",
            "all": f"{_GCS_ROOT}/birdset_all_v2.csv",
        },
        version="0.1.0",
        description=(
            "BirdSet avian bioacoustics benchmark with GBIF-linked taxonomy. "
            "Pre-resampled audio available at 16 kHz and 32 kHz (WAV). "
            "Original audio is 32 kHz OGG from the BirdSet HuggingFace repository."
        ),
        sources=["HSN", "NBP", "NES", "PER", "POW", "SSW", "SNE", "UHH"],
        license="CC-BY-4.0, CC0",
    )

    _sample_rate_paths: dict[int, str] = {
        16000: "16khz_path",
        32000: "32khz_path",
    }

    _originals_path_column = "audio_path"

    def __init__(
        self,
        split: str = "HSN-test_5s",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the BirdSet dataset.

        Parameters
        ----------
        split : str, default="HSN-test_5s"
            The split to load.  One of ``info.split_paths`` keys, e.g.
            ``"HSN-test"``, ``"SSW-test_5s"``, or ``"all"``.
        output_take_and_give : dict[str, str], optional
            Column rename / filter mapping.
        sample_rate : int, optional
            Target sample rate.  If a pre-resampled version exists (16 kHz or
            32 kHz), the corresponding WAV is loaded directly.  Otherwise the
            original 32 kHz OGG is loaded and resampled on-the-fly.
        data_root : str | AnyPathT, optional
            Root directory prepended to relative audio paths.  Defaults to the
            GCS path for this dataset version.
        backend : BackendType, optional
            Backend engine ("pandas" or "polars"), by default "polars".
        streaming : bool, optional
            Whether to use streaming mode, by default False.
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self._load()
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(f"{_GCS_ROOT}/")
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return sample rates supported by this dataset.

        Pre-resampled audio is loaded directly when a matching column exists
        in the data; otherwise the original audio is resampled on-the-fly.

        Returns
        -------
        list[int]
            Sorted sample rates (Hz) declared in ``_sample_rate_paths``.
        """
        return sorted(self._sample_rate_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["BirdSet", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Prefer the pre-resampled file when one exists for the requested rate.
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            col = self._sample_rate_paths[self.sample_rate]
            if col in row and row[col] is not None and str(row[col]).strip():
                audio_path = self.data_root / row[col]
                use_presampled = True

        # Otherwise fall back to the original recording.
        if not use_presampled:
            audio_path = self.data_root / row[self._originals_path_column]

        audio, sr = read_audio(audio_path)
        audio = audio.astype(np.float32)
        audio = audio_stereo_to_mono(audio, mono_method="average")

        # Resample on the fly only when the original recording was loaded.
        if not use_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sr

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version}), split={self.split}"
        return (
            f"{base}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
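
The last step of `_process` in the source above uses `output_take_and_give` both to rename keys and to drop every key that is not listed. The sketch below isolates that behaviour on a plain dictionary. `apply_take_and_give` is a hypothetical stand-alone helper written for illustration, and the sample values are made up.

```python
def apply_take_and_give(row, mapping):
    """Rename keys per `mapping` (old -> new) and drop unlisted keys.

    With an empty or missing mapping the row is returned unchanged,
    mirroring the branch in `_process`.
    """
    if not mapping:
        return row
    return {new_key: row[old_key] for old_key, new_key in mapping.items()}


# Illustrative sample; real samples hold a NumPy array under "audio".
sample = {"audio": [0.0, 0.1], "sample_rate": 16000, "ebird_code": "amerob"}

renamed = apply_take_and_give(sample, {"ebird_code": "label", "audio": "audio"})
print(renamed)
```

A key has to appear in the mapping, even mapped to itself, to survive the filter. A mapping key that is absent from the row raises `KeyError`.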

`Birdeep`

📊 Dataset Information

Name	`birdeep`
Version	`0.1.0`
Owner	benjamin
License	MIT
Sources	HuggingFace
Available Splits	`train`, `val`, `test`, `all`

Description:

Dataset of bird vocalizations with bounding boxes, originally released in: A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana by Alba Márquez-Rodríguez et al. (2025)

Birdeep Dataset

Description

Dataset of bird vocalizations with bounding boxes, originally released in: "A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana" by Alba Márquez-Rodríguez et al. (2025)

Description from the GitHub repository:

"Data was collected using automatic audio recording devices (AudioMoths) in three different habitats in Doñana National Park. Approximately 500 minutes of audio data were recorded. There are 9 recorders in 3 different habitats (marshland, scrubland, and ecotone), which are constantly running, recording 1 minute and leaving 9 minutes between recordings. That is, 1 minute is recorded for every 10 minutes, with a sampling rate of 32 kHz. The recordings were made prioritising those times when the birds are most active in order to try to have as many audio recordings of songs as possible, specifically a few hours before dawn until midday.

Expert annotators labeled 461 minutes of audio data, identifying bird vocalizations and other relevant sounds. Annotations are provided in a standard format with start time, end time, and frequency range for each bird vocalization."

Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels

Note that some birds were not identifiable to species, and are annotated as "Unknown".
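
As a rough illustration of working with these selection tables, the sketch below parses a tab-separated table, keeps events that begin before the end of the audio, and collects species labels while leaving out "Unknown". The table contents are invented for the example. Only the `Begin Time (s)` and `Species` column names appear in the source code; the other columns are assumed from the usual Raven layout. The standard-library `csv` module stands in for the `pandas` call the class itself uses.

```python
import csv
from io import StringIO

# Hypothetical Raven-style selection table (tab-separated); values are made up.
selection_table = (
    "Begin Time (s)\tEnd Time (s)\tLow Freq (Hz)\tHigh Freq (Hz)\tSpecies\n"
    "0.5\t1.2\t2000\t6000\tAlaudala rufescens\n"
    "3.0\t3.8\t1500\t5000\tUnknown\n"
    "61.0\t61.5\t1800\t5200\tGalerida cristata\n"
)


def parse_events(table_text, audio_duration_s, unknown_label="Unknown"):
    """Return (events within the audio, sorted known labels)."""
    rows = csv.DictReader(StringIO(table_text), delimiter="\t")
    # Keep only events that start before the audio ends.
    events = [r for r in rows if float(r["Begin Time (s)"]) < audio_duration_s]
    labels = sorted({r["Species"] for r in events} - {unknown_label})
    return events, labels


events, labels = parse_events(selection_table, audio_duration_s=60.0)
print(len(events), labels)
```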

The dataset splits are the same as in the original publication.

Pre-resampled Audio

Pre-resampled audio is available at 16 kHz. When sample_rate=16000 is passed, the pre-resampled files are loaded directly (no on-the-fly resampling). For any other target rate, audio is resampled on-the-fly from the native 32 kHz files using librosa's kaiser_best method.
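
The selection rule described above can be sketched as a small pure function. `choose_audio_column` is a hypothetical helper that mirrors the documented behaviour without reading any audio, and the row and its paths are illustrative.

```python
def choose_audio_column(row, sample_rate, sample_rate_paths, originals_column="audio_path"):
    """Return (column_to_read, may_need_resampling).

    A pre-resampled column is used when one is declared for the requested
    rate and the row has a non-empty path in it. Otherwise the original
    file is used and may need resampling if its native rate differs.
    """
    column = sample_rate_paths.get(sample_rate) if sample_rate is not None else None
    if column and row.get(column) not in (None, ""):
        return column, False
    return originals_column, sample_rate is not None


# Birdeep declares a single pre-resampled rate (16 kHz).
paths = {16000: "16khz_path"}
row = {"audio_path": "audio/rec.wav", "16khz_path": "audio_16k/rec.wav"}

print(choose_audio_column(row, 16000, paths))  # pre-resampled file
print(choose_audio_column(row, 22050, paths))  # original, resampled on the fly
print(choose_audio_column(row, None, paths))   # original at native rate
```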

References

https://huggingface.co/datasets/GrunCrow/BIRDeep_AudioAnnotations

https://www.sciencedirect.com/science/article/pii/S1574954125002638?via%3Dihub

Source code in alp_data/datasets/birdeep.py

@register_dataset
class Birdeep(Dataset):
    """Birdeep Dataset

    Description
    -----------
    Dataset of bird vocalizations with bounding boxes, originally released in:
    "A Bird Song Detector for improving bird identification through Deep Learning:
    a case study from Doñana" by Alba Márquez-Rodríguez et al. (2025)

    Description from the GitHub repository:

    "Data was collected using automatic audio recording devices (AudioMoths) in
    three different habitats in Doñana National Park. Approximately 500 minutes
    of audio data were recorded. There are 9 recorders in 3 different habitats
    (marshland, scrubland, and ecotone), which are constantly running, recording
    1 minute and leaving 9 minutes between recordings. That is, 1 minute is
    recorded for every 10 minutes, with a sampling rate of 32 kHz. The
    recordings were made prioritising those times when the birds are most active
    in order to try to have as many audio recordings of songs as possible,
    specifically a few hours before dawn until midday.

    Expert annotators labeled 461 minutes of audio data, identifying bird
    vocalizations and other relevant sounds. Annotations are provided in a
    standard format with start time, end time, and frequency range for each
    bird vocalization."

    Each entry consists of:
    - an audio recording
    - a selection table (Raven format), with Species labels

    Note that some birds were not identifiable to species, and are annotated as "Unknown".

    The dataset splits are the same as in the original publication.

    Pre-resampled Audio
    -------------------
    Pre-resampled audio is available at 16 kHz. When ``sample_rate=16000`` is
    passed, the pre-resampled files are loaded directly (no on-the-fly
    resampling). For any other target rate, audio is resampled on-the-fly from
    the native 32 kHz files using librosa's ``kaiser_best`` method.

    References
    ----------
    https://huggingface.co/datasets/GrunCrow/BIRDeep_AudioAnnotations
    https://www.sciencedirect.com/science/article/pii/S1574954125002638?via%3Dihub

    """

    info = DatasetInfo(
        name="birdeep",
        owner="benjamin",
        split_paths={
            "train": f"{DATA_HOME}/birdeep/train_formatted_v3.csv",
            "val": f"{DATA_HOME}/birdeep/val_formatted_v3.csv",
            "test": f"{DATA_HOME}/birdeep/test_formatted_v3.csv",
            "all": f"{DATA_HOME}/birdeep/all_formatted_v3.csv",
        },
        version="0.1.0",
        description="Dataset of bird vocalizations with bounding boxes, originally released in: "
        "A Bird Song Detector for improving bird identification "
        "through Deep Learning: a case study from Doñana by Alba Márquez-Rodríguez et al. (2025)",
        sources="HuggingFace",
        license="MIT",
    )

    _sample_rate_paths: dict[int, str] = {16000: "16khz_path"}
    _originals_path_column = "audio_path"

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species"]
        self.unknown_label = "Unknown"

        self.sample_rate = sample_rate

        self._load()

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return pre-resampled sample rates whose path columns exist in the data."""
        return [sr for sr, col in self._sample_rate_paths.items() if col in self._data.columns]

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                audio_path = anypath(self.data_root) / row[path_column]
                use_presampled = True

        if not use_presampled:
            audio_path = anypath(self.data_root) / row[self._originals_path_column]

        audio, sr = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        if not use_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
        audio_dur = len(audio) / float(sr)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        row["audio"] = audio
        row["sample_rate"] = sr
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["Birdeep", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """Return all known labels for an annotation column.

        Parameters
        ----------
        anno_column : str, default="Species"
            Selection-table column to read labels from. Kept as an optional
            argument for consistency with other detection datasets.

        Returns
        -------
        List[str]
            Sorted labels found in ``anno_column``, excluding the unknown label.
        """
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        if self.unknown_label in available_labels:
            available_labels.remove(self.unknown_label)
            # Warn only when unknown-labelled events were actually found.
            warnings.warn(
                f"Events with unknown label={self.unknown_label} exist in dataset "
                f"but {self.unknown_label} suppressed from get_available_labels output",
                stacklevel=2,
            )

        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`ChiffchaffId`

📊 Dataset Information

Name	`chiffchaff_id`
Version	`0.1.0`
Owner	david
License	CC-BY-4.0
Sources	https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
Available Splits	`train_within_year`, `test_within_year`, `train_across_year`, `test_across_year`

Description:

Individual identity of common chiffchaffs

Chiffchaff ID dataset

Description

Vocalisations released by Stowell et al. for individual Chiffchaff males (Phylloscopus collybita). Provides both within-year and across-year evaluation schemes. https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940

The dataset provides train and test splits for two evaluation schemes: within year (train_within_year, test_within_year) and across year (train_across_year, test_across_year). The within-year test split uses recordings from the same year as the training data, though from different days. The across-year test split uses recordings from different years, which is a harder test condition because the acoustic environment or vocalisation characteristics may differ.

References

https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940

Zenodo: https://zenodo.org/records/1413495

Examples:

>>> from alp_data.datasets import ChiffchaffId
>>> dataset = ChiffchaffId(
...     split="test_within_year",
...     sample_rate=16000,
... )

Source code in alp_data/datasets/chiffchaff_id.py

@register_dataset
class ChiffchaffId(Dataset):
    """Chiffchaff ID dataset

    Description
    -----------
    Vocalisations released by Stowell et al. for individual Chiffchaff males
    (Phylloscopus collybita). Provides both *within-year* and *across-year* evaluation schemes.
    https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940

    The dataset provides train and test splits for two evaluation schemes:
    within year (train_within_year, test_within_year) and across year
    (train_across_year, test_across_year). The within-year test split uses
    recordings from the same year as the training data, though from different
    days. The across-year test split uses recordings from different years,
    which is a harder test condition because the acoustic environment or
    vocalisation characteristics may differ.

    References
    ----------
    https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
    Zenodo: https://zenodo.org/records/1413495

    Examples
    --------
    >>> from alp_data.datasets import ChiffchaffId
    >>> dataset = ChiffchaffId(
    ...     split="test_within_year",
    ...     sample_rate=16000,
    ... )
    """

    info = DatasetInfo(
        name="chiffchaff_id",
        owner="david",
        split_paths={
            "train_within_year": f"{DATA_HOME}/chiffchaff_id/v0.1.0/raw/withinyear_fg_train.csv",
            "test_within_year": f"{DATA_HOME}/chiffchaff_id/v0.1.0/raw/withinyear_fg_test.csv",
            "train_across_year": f"{DATA_HOME}/chiffchaff_id/v0.1.0/raw/acrossyear_fg_train.csv",
            "test_across_year": f"{DATA_HOME}/chiffchaff_id/v0.1.0/raw/acrossyear_fg_test.csv",
        },
        version="0.1.0",
        description="Individual identity of common chiffchaffs",
        sources=[
            "https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940",
        ],
        license="CC-BY-4.0",
    )

    def __init__(
        self,
        split: str = "train_within_year",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the ChiffchaffId dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is optionally prepended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        if anypath(location).suffix == ".jsonl":
            # For JSONL files, read them directly into a DataFrame
            self._data = self._backend_class.from_json(
                location, lines=True, streaming=self._streaming
            )
        else:
            # Read CSV content
            self._data = self._backend_class.from_csv(
                location,
                streaming=self._streaming,
            )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["ChiffchaffId", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration object containing the dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Ensure audio path is valid
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            and basic statistics if data is loaded.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
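
The `output_take_and_give` mapping used in `_process` above both renames and filters keys: only the keys listed in the mapping reach the output, under their new names. A minimal standalone sketch of that behaviour follows; the helper name `take_and_give` and the sample row are hypothetical and not part of `alp_data`.

```python
from __future__ import annotations

from typing import Any


def take_and_give(row: dict[str, Any], mapping: dict[str, str] | None) -> dict[str, Any]:
    """Rename and filter the keys of ``row`` (hypothetical helper, not library API)."""
    if not mapping:
        # No mapping configured: the full row is returned unchanged.
        return row
    # Only keys named in the mapping survive, stored under their new names.
    return {new_key: row[old_key] for old_key, new_key in mapping.items()}


# Hypothetical row; a processed row would also carry the decoded "audio" array.
row = {"local_path": "a.wav", "species_common": "chiffchaff", "sample_rate": 16000}
item = take_and_give(row, {"species_common": "label", "sample_rate": "sr"})
print(item)
```

Note that `local_path` is dropped from `item` because it does not appear in the mapping, while a missing source key would raise a `KeyError`, as it would in `_process`.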

`CorvidWascher`

📊 Dataset Information

Name	`corvid_wascher`
Version	`0.1.0`
Owner	benjamin
License	private
Sources	XenoCanto, Claudia Wascher
Available Splits	`all`

Corvid Dataset from Claudia Wascher

Description

This dataset consists of recordings of corvids, taken from Xeno-canto. Claudia Wascher provided annotations of vocalization boundaries. Annotations should not be considered exhaustive within a file, i.e. there may exist non-boxed vocalizations.

This data was originally provided, with an MOU, for work on comparison between vocal repertoires of different corvid species.

Each entry consists of:
- an audio recording
- a selection table with start- and stop-times of vocalizations
- file-level metadata columns
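
The selection table of an entry is stored as a TSV-serialised string, and events that begin after the end of the audio are discarded when a sample is processed. The sketch below illustrates that step with the standard library only (the dataset itself uses pandas). The function name `parse_and_clip`, the `End Time (s)` column and the sample values are assumptions for illustration; only `Begin Time (s)` and `Species` appear in the source code.

```python
from __future__ import annotations

import csv
from io import StringIO


def parse_and_clip(selection_table: str, audio_duration_s: float) -> list[dict[str, str]]:
    """Parse a TSV selection table and keep events that begin before the audio ends."""
    reader = csv.DictReader(StringIO(selection_table), delimiter="\t")
    return [event for event in reader if float(event["Begin Time (s)"]) < audio_duration_s]


# Hypothetical selection table with three annotated vocalizations.
tsv = (
    "Begin Time (s)\tEnd Time (s)\tSpecies\n"
    "0.5\t1.2\tCorvus corone\n"
    "3.0\t3.4\tCorvus corax\n"
    "12.0\t12.5\tCorvus corone\n"
)

# With 10 seconds of audio, the event starting at 12.0 s is dropped.
events = parse_and_clip(tsv, audio_duration_s=10.0)
print(len(events))
```

Only the begin time is tested, so an event that starts inside the audio but ends after it is kept as-is.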

References

https://www.biorxiv.org/content/biorxiv/early/2024/07/17/2024.07.15.603339.full.pdf

Source code in alp_data/datasets/corvid_wascher.py

@register_dataset
class CorvidWascher(Dataset):
    """Corvid Dataset from Claudia Wascher

    Description
    -----------
    This dataset consists of recordings of corvids, taken from Xeno-canto.
    Claudia Wascher provided annotations of vocalization boundaries. Annotations
    should not be considered exhaustive within a file, i.e. there may exist
    non-boxed vocalizations.

    This data was originally provided, with an MOU, for work on comparison
    between vocal repertoires of different corvid species.

    Each entry consists of:
    - an audio recording
    - a selection table with start- and stop-times of vocalizations
    - file-level metadata columns

    References
    ----------
    https://www.biorxiv.org/content/biorxiv/early/2024/07/17/2024.07.15.603339.full.pdf


    """

    info = DatasetInfo(
        name="corvid_wascher",
        owner="benjamin",
        split_paths={
            "all": "gs://esp-ml-datasets/wascher_corvid_comparison/all.csv",
        },
        version="0.1.0",
        description="[MISSING]",
        sources="XenoCanto, Claudia Wascher",
        license="private",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species"]

        self.sample_rate = sample_rate
        self.data_root = anypath(data_root) if data_root is not None else None

        # Load split CSV
        self._load()

        # If no explicit data_root, assume parent dir of the split path
        if self.data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """

        # Resolve audio path
        audio_path = (
            (self.data_root / row["audio_path"]) if self.data_root else anypath(row["audio_path"])
        )

        # Read audio
        audio, sample_rate = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        # Resample if necessary
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # Selection table
        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")

        # Clip events outside audio (keep only events that begin before audio end)
        audio_dur = len(audio) / float(sample_rate)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["CorvidWascher", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

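    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """Return all possible labels for a given annotation column.

        Parameters
        ----------
        anno_column : str
            Column of the selection table to collect labels from.

        Returns
        -------
        List[str]
            Sorted list of the unique labels found in ``anno_column``.
        """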
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`DCLDE2026`

📊 Dataset Information

Name	`dclde2026`
Version	`0.1.0`
Owner	david
License	CC-BY-4.0
Sources	Palmer et al. (2025) doi:10.1038/s41597-025-05281-5
Available Splits	`all`

Description:

DCLDE 2026 killer whale dataset with species, ecotype, call type, pod, clan, and acoustic behavior annotations across 9 providers

DCLDE 2026 Killer Whale Dataset.

Description

Multi-provider annotated acoustic recordings of killer whales, humpback whales, and bowhead whales from Alaska, British Columbia, and Washington (2011–2024). Each entry is an audio file plus an enriched selection table containing detection/call-level annotations with species, ecotype, call type, pod, clan, and acoustic behavior labels — all human-annotated.

Columns

audio_path : str
    Relative path to source audio.
selection_table : str
    TSV-serialised selection table with columns: Begin Time (s), End Time (s), Low Freq (Hz), High Freq (Hz), species, canonical_name, sound_detail, ecotype, call_type, acoustic_behavior, pod, clan, annotation_level, confidence, coarse_call_type.
provider : str
    Data provider name (see PROVIDERS).
16khz_path, 32khz_path : str | None
    Paths to pre-resampled audio (when available).

Splits

"all": All data (default)

Available tasks

Species classification: Killer whale / Humpback whale / Bowhead whale / Unknown biological
KW detection (binary): presence / absence of killer whale
Ecotype classification: SRKW / TKW / NRKW / SAR / OKW
Call type classification (fine-grained): S04, N24ii, T01, whistle, BP, EL, etc.
Call type classification (coarse): call / whistle / click / burst_pulse (see COARSE_CALL_TYPE_LABELS below; aligned with Watkins taxonomy)
Pod identification: J / K / L pods (Southern Resident)
Clan identification: A / G clans (Northern Resident)

Provider Notes

Data providers: DFO_CRP, JASCO_VFPA, DFO_WDLP, SIMRES, SIO, ONC, OrcaSound, JASCO_VFPA_ONC, SMRUConsulting.

UAF_NGOS (University of Alaska Fairbanks / North Gulf Oceanic Society) is excluded from the dataset.

Providers differ in annotation precision and coverage. Combining data from multiple providers should be done carefully — consider filtering by provider using Transforms (e.g. filter_isin on the provider column) when training or evaluating.
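
To make the effect of such a provider filter concrete, here is a plain-Python sketch of "keep only rows whose `provider` is in an allowed set". It does not use the `filter_isin` transform itself, whose exact configuration is not shown here; the helper `keep_rows_where_isin` and the sample rows are hypothetical.

```python
from __future__ import annotations

from typing import Any, Iterable


def keep_rows_where_isin(
    rows: Iterable[dict[str, Any]], column: str, values: Iterable[str]
) -> list[dict[str, Any]]:
    """Keep rows whose ``column`` value is one of ``values`` (hypothetical helper)."""
    allowed = set(values)
    return [row for row in rows if row.get(column) in allowed]


# Hypothetical metadata rows; provider names are taken from the list above.
rows = [
    {"audio_path": "a.wav", "provider": "SIMRES"},
    {"audio_path": "b.wav", "provider": "SMRUConsulting"},
    {"audio_path": "c.wav", "provider": "JASCO_VFPA"},
]

# Train or evaluate on two providers only.
kept = keep_rows_where_isin(rows, "provider", ["SIMRES", "JASCO_VFPA"])
print([row["audio_path"] for row in kept])
```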

Per-provider observations (detection focus):

SMRU: Not very temporally precise; selections sometimes cover large segments around the call. Multiple faint calls may be grouped.
SIMRES: Consistent annotations.
VFPA: Generally good; a few missed calls and slightly less consistency with faint calls.

Coarse call types (aligned with Watkins taxonomy):

Mapping rules:
call: Discrete pulsed calls (S-series, N-series, T-series, OFF-series, NS) and variable vocalizations (tone, moan, upsweep, chirp, groan, knock, shriek, whup, creak, grunt, scream, rasp, growl).
whistle: Whistle-labeled signals (whistle, whistle/tone, W).
click: Echolocation clicks (EL) and rapid click trains (buzz, BZ).
burst_pulse: Burst-pulse signals (BP).

Unknown / ambiguous labels (Unk, Multiple overlapping, etc.) map to empty string and are dropped when drop_empty_windows=True in windowing.
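
The mapping rules above can be read as a small lookup function. The sketch below is one possible reading, not the dataset's actual implementation: the function name `coarse_call_type`, the regular expression used to recognise the S/N/T/OFF series (a prefix followed by a digit) and the case-sensitive matching are all assumptions.

```python
import re

# Label sets transcribed from the mapping rules above.
VARIABLE_VOCALIZATIONS = {
    "tone", "moan", "upsweep", "chirp", "groan", "knock", "shriek",
    "whup", "creak", "grunt", "scream", "rasp", "growl",
}
WHISTLE_LABELS = {"whistle", "whistle/tone", "W"}
CLICK_LABELS = {"EL", "buzz", "BZ"}
BURST_PULSE_LABELS = {"BP"}

# Assumption: discrete calls look like a series prefix followed by a digit,
# e.g. "S04", "N24ii", "T01", "OFF1".
DISCRETE_CALL = re.compile(r"^(?:S|N|T|OFF)\d")


def coarse_call_type(label: str) -> str:
    """Map a fine-grained call type to a coarse one; unknown labels map to ''."""
    label = label.strip()
    if label in WHISTLE_LABELS:
        return "whistle"
    if label in CLICK_LABELS:
        return "click"
    if label in BURST_PULSE_LABELS:
        return "burst_pulse"
    if label == "NS" or label in VARIABLE_VOCALIZATIONS or DISCRETE_CALL.match(label):
        return "call"
    return ""


for fine in ["S04", "N24ii", "whistle/tone", "EL", "BP", "Unk"]:
    print(fine, "->", repr(coarse_call_type(fine)))
```

Checking the whistle set before the variable vocalizations keeps `whistle/tone` in the whistle class while plain `tone` stays a call, which matches the rules as written.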

Examples:

>>> from alp_data.datasets import DCLDE2026
>>> dataset = DCLDE2026(split="all")
>>> print(dataset.info.name)
dclde2026

References

Palmer et al. (2025) doi:10.1038/s41597-025-05281-5
License: CC-BY-4.0

Source code in alp_data/datasets/dclde2026.py

@register_dataset
class DCLDE2026(Dataset):
    """DCLDE 2026 Killer Whale Dataset.

    Description
    -----------
    Multi-provider annotated acoustic recordings of killer whales, humpback
    whales, and bowhead whales from Alaska, British Columbia, and Washington
    (2011–2024). Each entry is an audio file plus an enriched selection table
    containing detection/call-level annotations with species, ecotype, call
    type, pod, clan, and acoustic behavior labels — all human-annotated.

    Columns
    -------
    audio_path : str
        Relative path to source audio.
    selection_table : str
        TSV-serialised selection table with columns:
        ``Begin Time (s)``, ``End Time (s)``, ``Low Freq (Hz)``,
        ``High Freq (Hz)``, ``species``, ``canonical_name``,
        ``sound_detail``, ``ecotype``, ``call_type``,
        ``acoustic_behavior``, ``pod``, ``clan``,
        ``annotation_level``, ``confidence``, ``coarse_call_type``.
    provider : str
        Data provider name (see :data:`PROVIDERS`).
    16khz_path, 32khz_path : str | None
        Paths to pre-resampled audio (when available).

    Splits
    ------
    - "all": All data (default)

    Available tasks
    ---------------
    - Species classification: Killer whale / Humpback whale / Bowhead whale / Unknown biological
    - KW detection (binary): presence / absence of killer whale
    - Ecotype classification: SRKW / TKW / NRKW / SAR / OKW
    - Call type classification (fine-grained): S04, N24ii, T01, whistle, BP, EL, etc.
    - Call type classification (coarse): call / whistle / click / burst_pulse
      (see :data:`COARSE_CALL_TYPE_LABELS` below; aligned with Watkins taxonomy)
    - Pod identification: J / K / L pods (Southern Resident)
    - Clan identification: A / G clans (Northern Resident)

    Provider Notes
    --------------
    Data providers: DFO_CRP, JASCO_VFPA, DFO_WDLP, SIMRES, SIO,
    ONC, OrcaSound, JASCO_VFPA_ONC, SMRUConsulting.

    **UAF_NGOS** (University of Alaska Fairbanks / North Gulf Oceanic Society)
    is excluded from the dataset.

    Providers differ in annotation precision and coverage. Combining data
    from multiple providers should be done carefully — consider filtering
    by provider using Transforms (e.g. ``filter_isin`` on the ``provider``
    column) when training or evaluating.

    Per-provider observations (detection focus):

    - SMRU: Not very temporally precise; selections sometimes cover large
      segments around the call. Multiple faint calls may be grouped.
    - SIMRES: Consistent annotations.
    - VFPA: Generally good; a few missed calls and slightly less
      consistency with faint calls.

    Coarse call types (aligned with Watkins taxonomy):

    Mapping rules:
    call         — Discrete pulsed calls (S-series, N-series, T-series,
                     OFF-series, NS) and variable vocalizations (tone, moan,
                     upsweep, chirp, groan, knock, shriek, whup, creak,
                     grunt, scream, rasp, growl).
    whistle      — Whistle-labeled signals (whistle, whistle/tone, W).
    click        — Echolocation clicks (EL) and rapid click trains (buzz, BZ).
    burst_pulse  — Burst-pulse signals (BP).

    Unknown / ambiguous labels (Unk, Multiple overlapping, etc.) map to empty
    string and are dropped when ``drop_empty_windows=True`` in windowing.


    Examples
    --------
    >>> from alp_data.datasets import DCLDE2026
    >>> dataset = DCLDE2026(split="all")
    >>> print(dataset.info.name)
    dclde2026

    References
    ----------
    Palmer et al. (2025) doi:10.1038/s41597-025-05281-5
    License: CC-BY-4.0
    """

    info = DatasetInfo(
        name="dclde2026",
        owner="david",
        split_paths={
            "all": f"{DATA_HOME}/dclde2026/v0.1.0/raw/2026/dclde_2026_killer_whales/processed_enriched_v2.csv",  # noqa: E501
        },
        version="0.1.0",
        description="DCLDE 2026 killer whale dataset with species, ecotype, call type, "
        "pod, clan, and acoustic behavior annotations across 9 providers",
        sources="Palmer et al. (2025) doi:10.1038/s41597-025-05281-5",
        license="CC-BY-4.0",
    )

    _sample_rate_paths: dict[int, str] = {
        16000: "16khz_path",
        32000: "32khz_path",
    }

    # Subdirectories under data_root where pre-resampled audio lives.
    _sample_rate_subdirs: dict[int, str] = {
        16000: "audio_16k",
        32000: "audio_32k",
    }

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.  Pre-resampled paths
            (``16khz_path``, ``32khz_path``) are preferred when present in the
            CSV; otherwise audio is resampled on-the-fly.
        data_root : str | AnyPathT | None
            Root directory containing provider audio subdirectories.
            If None, defaults to the parent directory of the split CSV path.
        backend : BackendType
            The backend to use ("pandas" or "polars"), by default "polars".
        streaming : bool
            Whether to use streaming mode, by default False.
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.sample_rate = sample_rate
        self.annotation_columns = [
            "species",
            "ecotype",
            "call_type",
            "acoustic_behavior",
            "pod",
            "clan",
        ]
        self.data_root = anypath(data_root) if data_root is not None else None

        self._load()

        if self.data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the split CSV into the configured backend.

        Raises
        ------
        LookupError
            If the requested split is not in ``info.split_paths``.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
        )

    def __len__(self) -> int:
        """Return the number of audio files in the dataset.

        Returns
        -------
        int
            Number of audio files in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        NotImplementedError
            If the dataset is in streaming mode.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    # ------------------------------------------------------------------
    # Audio path resolution
    # ------------------------------------------------------------------

    def _resolve_audio_path(self, row: dict[str, Any]) -> tuple[AnyPathT, bool]:
        """Return ``(full_audio_path, is_presampled)``.

        If the CSV contains a pre-resampled path column for the requested
        sample rate (e.g. ``16khz_path``) and the value is non-empty, that
        path is used and ``is_presampled=True``.  The resampled file is
        located under ``data_root / <sr_subdir> / <16khz_path>``, where
        ``<sr_subdir>`` comes from :attr:`_sample_rate_subdirs` (e.g.
        ``audio_16k``).  Otherwise falls back to the original ``audio_path``.

        Returns
        -------
        tuple[AnyPathT, bool]
            ``(full_audio_path, is_presampled)`` — the resolved path and
            whether it points to a pre-resampled file.
        """
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            col = self._sample_rate_paths[self.sample_rate]
            if col in row and row[col] is not None and str(row[col]).strip():
                subdir = self._sample_rate_subdirs.get(self.sample_rate, "")
                if subdir:
                    return self.data_root / subdir / row[col], True
                return self.data_root / row[col], True
        return self.data_root / row["audio_path"], False

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        audio_fp, is_presampled = self._resolve_audio_path(row)

        window_start = row.get("window_start_sec")
        window_end = row.get("window_end_sec")

        if window_start is not None and window_end is not None:
            audio, sr = read_audio(
                audio_fp, start_time=float(window_start), end_time=float(window_end)
            )
        else:
            audio, sr = read_audio(audio_fp)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        # Resample on-the-fly only when no pre-resampled file was used
        if not is_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        # Parse selection table from serialized TSV
        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t", keep_default_na=False)

        # Clip events outside audio (keep only events that begin before audio end)
        audio_dur = len(audio) / float(sr)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sr
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["DCLDE2026", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters.

        Returns
        -------
        tuple[DCLDE2026, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            sample_rate=cfg["sample_rate"],
            data_root=cfg["data_root"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta

        return ds, {}

    def get_available_labels(self, annotation_column: str = "species") -> list[str]:
        """Return all possible labels for a given annotation column.

        Parameters
        ----------
        annotation_column : str
            Which annotation column to get labels for.
            Predefined label sets exist for: ``species``, ``ecotype``.

        Returns
        -------
        list[str]
            All possible label values for the given column.

        Raises
        ------
        ValueError
            If ``annotation_column`` does not have a predefined label set.
        """
        if annotation_column == "species":
            return SPECIES_LABELS
        elif annotation_column == "ecotype":
            return ECOTYPE_LABELS
        else:
            raise ValueError(
                f"No predefined label set for '{annotation_column}'. "
                f"Columns with predefined labels: species, ecotype"
            )

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        n = len(self) if self._data is not None and not self._streaming else "?"
        return (
            f"{base}\n"
            f"Audio files: {n}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
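
As a worked example of the audio path resolution in `_resolve_audio_path` above, the sketch below reproduces the same decision with `pathlib` only: prefer the pre-resampled column for the requested sample rate when it is non-empty, otherwise fall back to `audio_path`. The standalone function, the `/data` root and the file names are hypothetical; the column and subdirectory names are copied from the class attributes.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from typing import Any

# Copied from DCLDE2026._sample_rate_paths and DCLDE2026._sample_rate_subdirs.
SAMPLE_RATE_COLUMNS = {16000: "16khz_path", 32000: "32khz_path"}
SAMPLE_RATE_SUBDIRS = {16000: "audio_16k", 32000: "audio_32k"}


def resolve_audio_path(
    row: dict[str, Any], data_root: PurePosixPath, sample_rate: int | None
) -> tuple[PurePosixPath, bool]:
    """Return (path, is_presampled) for one row (standalone sketch)."""
    column = SAMPLE_RATE_COLUMNS.get(sample_rate)
    if column is not None:
        value = row.get(column)
        # Empty or whitespace-only values count as "no pre-resampled file".
        if value is not None and str(value).strip():
            return data_root / SAMPLE_RATE_SUBDIRS[sample_rate] / value, True
    return data_root / row["audio_path"], False


root = PurePosixPath("/data")  # hypothetical data_root
row = {"audio_path": "SIMRES/a.wav", "16khz_path": "SIMRES/a_16k.wav", "32khz_path": ""}

print(resolve_audio_path(row, root, 16000))  # pre-resampled file is used
print(resolve_audio_path(row, root, 32000))  # empty column: falls back to audio_path
print(resolve_audio_path(row, root, 22050))  # no column for this rate: falls back
```

When the second element is `False`, the caller still has to resample on the fly if the file's native rate differs from the requested one, which is what `_process` does.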

`DinardoDolphinWhistles`

📊 Dataset Information

Name	`dinardo_dolphin_whistles`
Version	`0.1.0`
Owner	gagan
License	CC-BY-4.0
Sources	Nature Scientific Data
Available Splits	`all`

Description:

Dolphin whistles dataset, Di Nardo et al 2023

Description

Authors: Francesco Di Nardo, Rocco De Marco, Alessandro Lucchetti & David Scaradozzi

Globally, interactions between fishing activities and dolphins are cause for concern due to their negative effects on both mammals and fishermen. The recording of acoustic emissions could aid in detecting the presence of dolphins in close proximity to fishing gear, elucidating their behavior, and guiding potential management measures designed to limit this harmful phenomenon. This data descriptor presents a dataset of acoustic recordings (WAV files) collected during interactions between common bottlenose dolphins (Tursiops truncatus) and fishing activities in the Adriatic Sea. This dataset is distinguished by the high complexity of its repertoire, which includes various different typologies of dolphin emission. Specifically, a group of free-ranging dolphins was found to emit frequency-modulated whistles, echolocation clicks, and burst pulse signals, including feeding buzzes. An analysis of signal quality based on the signal-to-noise ratio was conducted to validate the dataset. The signal digital files and corresponding features make this dataset suitable for studying dolphin behavior in order to gain a deeper understanding of their communication and interaction with fishing gear (trawl).

References

A WAV file dataset of bottlenose dolphin whistles, clicks, and pulse sounds during trawling interactions https://doi.org/10.1038/s41597-023-02547-8

Examples:

>>> from alp_data.datasets import DinardoDolphinWhistles
>>> dataset = DinardoDolphinWhistles(
...     split="all",
...     sample_rate=16000,
...     streaming=True)
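
With `streaming=True`, as in the example above, `len(dataset)` raises `NotImplementedError` (see `__len__` in the source listing), so the number of samples has to be obtained by iterating. The following is a minimal, self-contained sketch of that pattern. `ToyStreamingDataset` and `count_samples` are hypothetical names used for illustration only; they are not part of `alp_data` and only mimic the length and iteration behavior.

```python
from typing import Any, Iterator


class ToyStreamingDataset:
    """Hypothetical stand-in that mimics the streaming behavior of the dataset classes."""

    def __init__(self, rows: list[dict[str, Any]], streaming: bool = False) -> None:
        self._rows = rows
        self._streaming = streaming

    def __len__(self) -> int:
        if self._streaming:
            # Mirrors the documented behavior: no length in streaming mode.
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._rows)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        yield from self._rows


def count_samples(dataset: Any) -> int:
    """Return the sample count, falling back to iteration when len() is unavailable."""
    try:
        return len(dataset)
    except NotImplementedError:
        # Consumes one full pass over the dataset.
        return sum(1 for _ in dataset)
```

Counting by iteration reads every row once, so on a large streaming dataset it may be preferable to keep a running count inside the main processing loop instead.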

Source code in alp_data/datasets/dinardo_dolphin_whistles.py

@register_dataset
class DinardoDolphinWhistles(Dataset):
    """Dolphin whistles dataset, Di Nardo et al 2023

    Description
    -----------
    Authors: Francesco Di Nardo, Rocco De Marco, Alessandro Lucchetti & David Scaradozzi
    Globally, interactions between fishing activities and dolphins are cause for concern
    due to their negative effects on both mammals and fishermen.
    The recording of acoustic emissions could aid in detecting the
    presence of dolphins in close proximity to fishing gear,
    elucidating their behavior, and guiding potential
    management measures designed to limit this harmful phenomenon.
    This data descriptor presents a dataset of acoustic recordings (WAV files) collected
    during interactions between common bottlenose dolphins (Tursiops truncatus) and
    fishing activities in the Adriatic Sea. This dataset is distinguished by the high
    complexity of its repertoire, which includes various different typologies of dolphin emission.
    Specifically, a group of free-ranging dolphins was found to emit frequency-modulated whistles,
    echolocation clicks, and burst pulse signals, including feeding buzzes.
    An analysis of signal quality based on the signal-to-noise ratio was
    conducted to validate the dataset. The signal digital files and corresponding features
    make this dataset suitable for studying
    dolphin behavior in order to gain a deeper understanding of their communication and
    interaction with fishing gear (trawl).

    References
    ----------
    A WAV file dataset of bottlenose dolphin whistles, clicks, and pulse
    sounds during trawling interactions
    https://doi.org/10.1038/s41597-023-02547-8


    Examples
    --------
    >>> from alp_data.datasets import DinardoDolphinWhistles
    >>> dataset = DinardoDolphinWhistles(
    ...     split="all",
    ...     sample_rate=16000,
    ...     streaming=True)
    """

    info = DatasetInfo(
        name="dinardo_dolphin_whistles",
        owner="gagan",
        split_paths={
            "all": f"{DATA_HOME}/dinardo2023_dolphin_whistles/v0.1.0/raw/dinardo2023_annotations.csv",  # noqa: E501
        },
        version="0.1.0",
        description="Dolphin whistles dataset, Di Nardo et al 2023",
        sources=["Nature Scientific Data"],
        license="CC-BY-4.0",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the DinardoDolphinWhistles dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is optionally appended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
        )

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["DinardoDolphinWhistles", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Ensure audio path is valid
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
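
The split check performed by `_load` in the listing above can be sketched in isolation as follows. This is an illustrative, standalone version: `SPLIT_PATHS` and `resolve_split` are hypothetical names, and the path is a placeholder rather than the real location under `DATA_HOME`.

```python
# Placeholder mapping; the real paths live in DatasetInfo.split_paths.
SPLIT_PATHS = {"all": "/data/example/annotations.csv"}


def resolve_split(split: str, split_paths: dict[str, str]) -> str:
    """Return the annotation file for a split, or raise LookupError if it is unknown."""
    if split not in split_paths:
        raise LookupError(
            f"Invalid split: {split}. Expected one of {list(split_paths.keys())}"
        )
    return split_paths[split]
```

Because the check raises `LookupError`, callers that want to fall back to a default split can catch that exception specifically.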

`ESPRaincoast`

📊 Dataset Information

Name	`esp_raincoast`
Version	`0.1.0`
Owner	emmanuel; gagan; dylansmyth; maddie
License	private
Sources	esp-raincoast
Available Splits	`full`

Description:

Orca vocal repertoire dataset

ESP Raincoast.org dataset. Recorded by Dylan Smyth, Valeria Vergara lab.
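
The constructor's `mono_method` parameter accepts `"keep_first"` or `"average"` (or `None` for no conversion). As a rough illustration of what these two strategies usually mean, here is a pure-Python sketch. `to_mono` is a hypothetical helper, not the library's `audio_stereo_to_mono`, and the channels-first layout (one list of samples per channel) is an assumption made for the example.

```python
from typing import Literal


def to_mono(
    channels: list[list[float]], method: Literal["keep_first", "average"]
) -> list[float]:
    """Collapse multi-channel audio to one channel.

    `channels` holds one list of samples per channel (assumed layout).
    """
    if method not in ("keep_first", "average"):
        raise ValueError(f"Unknown mono method: {method!r}")
    if len(channels) == 1 or method == "keep_first":
        # Already mono, or keep only the first channel.
        return list(channels[0])
    n_channels = len(channels)
    # Average the samples of all channels frame by frame.
    return [sum(frame) / n_channels for frame in zip(*channels)]
```

Keeping the first channel preserves the original signal of one recorder channel, while averaging mixes all channels and can attenuate sounds that are out of phase between them.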

Source code in alp_data/datasets/esp_raincoast.py

@register_dataset
class ESPRaincoast(Dataset):
    """ESP Raincoast.org dataset
    Recorded by Dylan Smyth, Valeria Vergara lab.
    """

    info = DatasetInfo(
        name="esp_raincoast",
        owner="emmanuel; gagan; dylansmyth; maddie",
        split_paths={
            "full": "gs://esp-raincoast/2023-2024/full_selection_table.csv",
        },
        version="0.1.0",
        description="Orca vocal repertoire dataset",
        sources=["esp-raincoast"],
        license="private",
    )

    def __init__(
        self,
        split: str = "full",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        load_audio_segments: bool = True,
        mono_method: Literal["keep_first", "average"] | None = None,
        data_root: str | AnyPathT | None = None,
        backend: str = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the ESPRaincoast dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        load_audio_segments : bool
            If True, the audio files will be spliced between the 'Begin Time (s)'
            and 'End Time (s)' columns in the dataset.
            If False, the entire audio file will be loaded.
        mono_method : str | None
            Method to convert stereo audio to mono. If None, no conversion is done.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is optionally appended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : str, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend, streaming)
        self.split = split
        self.sample_rate = sample_rate
        self.load_audio_segments = load_audio_segments
        self.mono_method = mono_method

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data: pd.DataFrame = None
        self._load()

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        if anypath(location).suffix == ".jsonl":
            # For JSONL files, read them directly into a DataFrame
            # self._data = pd.read_json(location, lines=True, orient="records")
            self._data = self._backend_class.from_json(
                location, lines=True, streaming=self._streaming
            )
        else:
            # TODO: Polars picked up some inconsistencies in the data!
            # Column "Call Quality" has a mix of f64 and string types
            self._data = self._backend_class.from_csv(
                location, streaming=self._streaming, infer_schema_length=10000
            )

    @classmethod
    def from_config(
        cls, dataset_config: ESPRaincoastConfig
    ) -> tuple["ESPRaincoast", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : ESPRaincoastConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            load_audio_segments=cfg["load_audio_segments"],
            mono_method=cfg["mono_method"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Ensure audio path is valid
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        if self.load_audio_segments:
            start_time = row.get("Begin Time (s)", 0.0)
            end_time = row.get("End Time (s)", None)
            audio, sample_rate = read_audio(audio_path, start_time=start_time, end_time=end_time)
        else:
            audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)

        # Stereo to mono, using the configured method (skipped when mono_method is None).
        if self.mono_method is not None:
            audio = audio_stereo_to_mono(audio, mono_method=self.mono_method)

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
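
When `load_audio_segments` is True, `_process` in the listing above reads only the portion of the file between the `Begin Time (s)` and `End Time (s)` columns. How `read_audio` does this internally is not shown here; the sketch below only illustrates the usual time-to-sample conversion on an in-memory list of samples. `segment_bounds` and `slice_segment` are hypothetical helpers, and the rounding and clamping choices are assumptions.

```python
from typing import Any, Optional


def segment_bounds(
    start_time: float, end_time: Optional[float], sample_rate: int, n_samples: int
) -> tuple[int, int]:
    """Convert a time window in seconds to clamped sample indices [start, end)."""
    start = max(0, int(round(start_time * sample_rate)))
    if end_time is None:
        end = n_samples  # no end time: read to the end of the file
    else:
        end = min(n_samples, int(round(end_time * sample_rate)))
    if end < start:
        raise ValueError("End time must not be earlier than begin time.")
    return start, end


def slice_segment(samples: list[float], row: dict[str, Any], sample_rate: int) -> list[float]:
    """Cut the annotated segment out of a full recording."""
    # Same defaults as in _process: start at 0.0, no end limit.
    start_time = row.get("Begin Time (s)", 0.0)
    end_time = row.get("End Time (s)", None)
    start, end = segment_bounds(start_time, end_time, sample_rate, len(samples))
    return samples[start:end]
```

A row without the two time columns yields the whole recording, which matches the defaults used by `row.get` in `_process`.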

`Geladas`

📊 Dataset Information

Name	`geladas`
Version	`0.1.0`
Owner	gagan
License	CC-BY-4.0
Sources	PNAS
Available Splits	`all`

Description:

Gelada vocal sequences dataset, Gustison et al 2016

Gelada vocal sequences follow Menzerath's linguistic law, Gustison et al 2016

Description

Identifying universal principles underpinning diverse natural systems is a key goal of the life sciences. A powerful approach in addressing this goal has been to test whether patterns consistent with linguistic laws are found in nonhuman animals. Menzerath's law is a linguistic law that states that, the larger the construct, the smaller the size of its constituents. Here, to our knowledge, we present the first evidence that Menzerath's law holds in the vocal communication of a nonhuman species. We show that, in vocal sequences of wild male geladas (Theropithecus gelada), construct size (sequence size in number of calls) is negatively correlated with constituent size (duration of calls). Call duration does not vary significantly with position in the sequence, but call sequence composition does change with sequence size and most call types are abbreviated in larger sequences. We also find that intercall intervals follow the same relationship with sequence size as do calls. Finally, we provide formal mathematical support for the idea that Menzerath's law reflects compression—the principle of minimizing the expected length of a code. Our findings suggest that a common principle underpins human and gelada vocal communication, highlighting the value of exploring the applicability of linguistic laws in vocal systems outside the realm of language.

References

Gelada vocal sequences follow Menzerath's linguistic law (PNAS, 2016)
https://doi.org/10.1073/pnas.1522072113

Also: Morgan L. Gustison, Thore J. Bergman, Divergent acoustic properties of gelada and baboon vocalizations and their implications for the evolution of human speech. Journal of Language Evolution.
https://doi.org/10.1093/jole/lzx015

Examples:

>>> from alp_data.datasets import Geladas
>>> dataset = Geladas(
...     split="all",
...     sample_rate=16000,
...     streaming=True)
>>> print(dataset.info.name)
geladas
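
When `data_root` is not given, the constructor falls back to the parent directory of the split's annotation file, and each sample's `local_path` is joined onto that root. A minimal sketch of this resolution, using `pathlib.PurePosixPath` as a stand-in for the library's `anypath` (which is not reproduced here); `resolve_audio_path` is a hypothetical helper and the paths are placeholders.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_audio_path(
    local_path: str, split_path: str, data_root: Optional[str] = None
) -> PurePosixPath:
    """Build the full audio path for one sample."""
    if data_root is None:
        # Default: the directory that contains the annotation file.
        root = PurePosixPath(split_path).parent
    else:
        root = PurePosixPath(data_root)
    return root / local_path
```

Passing an explicit `data_root` is useful when the audio files have been copied somewhere other than next to the annotation CSV, for example to a local cache.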

Source code in alp_data/datasets/geladas.py

@register_dataset
class Geladas(Dataset):
    """Gelada vocal sequences follow Menzerath's linguistic law,
    Gustison et al 2016

    Description
    -----------
    Identifying universal principles underpinning diverse natural systems is a
    key goal of the life sciences. A powerful approach in
    addressing this goal has been to test whether patterns consistent
    with linguistic laws are found in nonhuman animals. Menzerath's
    law is a linguistic law that states that, the larger the construct, the
    smaller the size of its constituents. Here, to our knowledge, we
    present the first evidence that Menzerath's law holds in the vocal
    communication of a nonhuman species. We show that, in vocal
    sequences of wild male geladas (Theropithecus gelada), construct
    size (sequence size in number of calls) is negatively correlated with
    constituent size (duration of calls). Call duration does not vary
    significantly with position in the sequence, but call sequence composition does
    change with sequence size and most call types are
    abbreviated in larger sequences. We also find that intercall intervals follow the
    same relationship with sequence size as do calls.
    Finally, we provide formal mathematical support for the idea that
    Menzerath's law reflects compression—the principle of minimizing
    the expected length of a code. Our findings suggest that a common principle
    underpins human and gelada vocal communication,
    highlighting the value of exploring the applicability of linguistic
    laws in vocal systems outside the realm of language.

    References
    ----------
    Gelada vocal sequences follow Menzerath's linguistic law (PNAS, 2016)
    https://doi.org/10.1073/pnas.1522072113
    Also:
    Morgan L. Gustison, Thore J. Bergman, Divergent acoustic properties of gelada
    and baboon vocalizations and their implications for the evolution of human speech.
    Journal of Language Evolution.
    https://doi.org/10.1093/jole/lzx015

    Examples
    --------
    >>> from alp_data.datasets import Geladas
    >>> dataset = Geladas(
    ...     split="all",
    ...     sample_rate=16000,
    ...     streaming=True)
    >>> print(dataset.info.name)
    geladas
    """

    info = DatasetInfo(
        name="geladas",
        owner="gagan",
        split_paths={
            "all": f"{DATA_HOME}/geladas/v0.1.0/raw/geladas_annotations.csv",
        },
        version="0.1.0",
        description="Gelada vocal sequences dataset, Gustison et al 2016",
        sources=["PNAS"],
        license="CC-BY-4.0",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the Geladas dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is optionally appended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
        )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["Geladas", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Ensure audio path is valid
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`GiantOtters`

📊 Dataset Information

Name	`giant_otters`
Version	`0.1.0`
Owner	david
License	CC-BY-4.0, CC0
Sources	PLOS ONE
Available Splits	`test`

Description:

Giant Otters vocal repertoire dataset

Giant Otters dataset

Description

Vocal repertoire of giant otters. 22 vocalization types from adults, 17 from neonates, annotated based on behavioral function and sound.

References

The Vocal Repertoire of Adult and Neonate Giant Otters (Pteronura brasiliensis) https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112562#s5

Examples:

>>> from alp_data.datasets import GiantOtters
>>> dataset = GiantOtters(
...     split="test",
...     output_take_and_give={"label": "label"},
...     sample_rate=16000,
...     streaming=True
... )
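
The `output_take_and_give` argument both renames and filters: only the keys listed in the mapping are returned, under their new names. The sketch below reproduces that logic on a plain dictionary, following the `_process` pattern of the dataset classes shown earlier in this section (it is an assumption that `GiantOtters` uses the same pattern). `apply_output_mapping` is a hypothetical helper and the sample values are made up.

```python
from typing import Any, Optional


def apply_output_mapping(
    row: dict[str, Any], mapping: Optional[dict[str, str]]
) -> dict[str, Any]:
    """Rename and filter a sample according to an {original_name: new_name} mapping."""
    if not mapping:
        # No mapping: the full row is returned unchanged.
        return row
    # Keys absent from the mapping are dropped; a missing source key raises KeyError.
    return {new_name: row[old_name] for old_name, new_name in mapping.items()}
```

Under this logic, a mapping such as `{"label": "label"}` would return samples containing only the `label` key. To keep the waveform as well, the mapping needs an entry for `audio`, for example `{"audio": "audio", "label": "label"}`.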

Source code in alp_data/datasets/giant_otters.py

@register_dataset
class GiantOtters(Dataset):
    """Giant Otters dataset

    Description
    -----------
    Vocal repertoire of giant otters.
    22 vocalization types from adults, 17 from neonates,
    annotated based on behavioral function and sound.

    References
    ----------
    The Vocal Repertoire of Adult and Neonate Giant Otters (Pteronura brasiliensis)
    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112562#s5

    Examples
    --------
    >>> from alp_data.datasets import GiantOtters
    >>> dataset = GiantOtters(
    ...     split="test",
    ...     output_take_and_give={"label": "label"},
    ...     sample_rate=16000,
    ...     streaming=True
    ... )
    """

    info = DatasetInfo(
        name="giant_otters",
        owner="david",
        split_paths={
            "test": f"{DATA_HOME}/giant_otters/v0.1.0/raw/giant_otters_annotations_test.csv",
        },
        version="0.1.0",
        description="Giant Otters vocal repertoire dataset",
        sources=["PLOS ONE"],
        license="CC-BY-4.0, CC0",
    )

    def __init__(
        self,
        split: str = "test",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the GiantOtters dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        if anypath(location).suffix == ".jsonl":
            # For JSONL files, read them directly into a DataFrame
            self._data = self._backend_class.from_json(
                location, lines=True, streaming=self._streaming, orient="records"
            )
        else:
            self._data = self._backend_class.from_csv(
                location,
                streaming=self._streaming,
                keep_default_na=False,
                na_values=[""],
            )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["GiantOtters", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Resolve the audio path against data_root when one is set
        if self.data_root:
            audio_path = anypath(self.data_root) / row["path"]
        else:
            audio_path = anypath(row["path"])

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
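As the `__len__` implementation above shows, the length is not available in streaming mode, so a caller has to count samples by iterating. A minimal sketch with a toy stand-in class; `ToyDataset` and `count_samples` are hypothetical, imitate only the streaming check, and do not load any audio:

```python
from typing import Any, Dict, Iterator, List


class ToyDataset:
    """Hypothetical stand-in that imitates the streaming check in `__len__`."""

    def __init__(self, rows: List[Dict[str, Any]], streaming: bool = False) -> None:
        self._rows = rows
        self._streaming = streaming

    def __len__(self) -> int:
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._rows)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        yield from self._rows


def count_samples(dataset: ToyDataset) -> int:
    """Use len() when possible, otherwise fall back to one full pass."""
    try:
        return len(dataset)
    except NotImplementedError:
        return sum(1 for _ in dataset)


rows = [{"path": "a.wav"}, {"path": "b.wav"}, {"path": "c.wav"}]
eager_count = count_samples(ToyDataset(rows))
streamed_count = count_samples(ToyDataset(rows, streaming=True))
print(eager_count, streamed_count)
```

On a real dataset the fallback pass would read and decode every audio file, so it is likely to be slow for large splits.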

`GibbonSolos`

📊 Dataset Information

Name	`gibbon_solos`
Version	`0.1.0`
Owner	gagan
License	CC0
Sources	Royalsocietypublishing.org, Dryad
Available Splits	`all`

Description:

Gibbon solos Clink 2020

Description

Title: Brevity is not a universal in animal communication: evidence for compression depends on the unit of analysis in small ape vocalizations

Evidence for compression, or minimization of code length, has been found across biological systems from genomes to human language and music. Two linguistic laws—Menzerath's Law (which states that longer sequences consist of shorter constituents) and Zipf's Law of abbreviation (a negative relationship between signal length and frequency of use)—are predictions of compression. It has been proposed that compression is a universal in animal communication, but there have been mixed results, particularly in reference to Zipf's Law of abbreviation. Like songbirds, male gibbons (Hylobates muelleri) engage in long solo bouts with unique combinations of notes which combine into phrases. We found strong support for Menzerath's Law as the longer a phrase, the shorter the notes. To identify phrase types, we used state-of-the-art affinity propagation clustering, and were able to predict phrase types using support vector machines with a mean accuracy of 74%. Based on unsupervised phrase type classification, we did not find support for Zipf's Law of abbreviation. Our results indicate that adherence to linguistic laws in male gibbon solos depends on the unit of analysis. We conclude that principles of compression are applicable outside of human language, but may act differently across levels of organization in biological systems.

References

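Brevity is not a universal in animal communication: evidence for compression depends on the unit of analysis in small ape vocalizations (Royal Society Open Science, 2020) https://doi.org/10.1098/rsos.200151 (paper) https://doi.org/10.5061/dryad.wstqjq2h8 (dataset)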

Examples:

>>> from alp_data.datasets import GibbonSolos
>>> dataset = GibbonSolos(
...     split="all",
...     sample_rate=16000,
...     streaming=True)
>>> print(dataset.info.name)
gibbon_solos

Source code in alp_data/datasets/gibbon_solos.py

@register_dataset
class GibbonSolos(Dataset):
    """Gibbon solos Clink 2020

    Description
    -----------
    Title: Brevity is not a universal in animal communication: evidence for
    compression depends on the unit of analysis in small ape vocalizations

    Evidence for compression, or minimization of code length, has been found
    across biological systems from genomes to human language and music.
    Two linguistic laws—Menzerath's Law (which states that longer sequences
    consist of shorter constituents) and Zipf's Law of abbreviation (a negative
    relationship between signal length and frequency of use)—are predictions of compression.
    It has been proposed that compression is a universal in animal communication,
    but there have been mixed results, particularly in reference to Zipf's Law of
    abbreviation. Like songbirds, male gibbons (Hylobates muelleri) engage in long
    solo bouts with unique combinations of notes which combine into phrases.
    We found strong support for Menzerath's Law as the longer a phrase,
    the shorter the notes. To identify phrase types, we used state-of-the-art
    affinity propagation clustering, and were able to predict phrase types using
    support vector machines with a mean accuracy of 74%.
    Based on unsupervised phrase type classification, we did not find support
    for Zipf's Law of abbreviation. Our results indicate that adherence to linguistic
    laws in male gibbon solos depends on the unit of analysis. We conclude that principles
    of compression are applicable outside of human language, but may act differently across
    levels of organization in biological systems.

    References
    ----------
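    Brevity is not a universal in animal communication: evidence for compression
    depends on the unit of analysis in small ape vocalizations
    (Royal Society Open Science, 2020)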
    https://doi.org/10.1098/rsos.200151 (paper)
    https://doi.org/10.5061/dryad.wstqjq2h8 (dataset)

    Examples
    --------
    >>> from alp_data.datasets import GibbonSolos
    >>> dataset = GibbonSolos(
    ...     split="all",
    ...     sample_rate=16000,
    ...     streaming=True)
    >>> print(dataset.info.name)
    gibbon_solos
    """

    info = DatasetInfo(
        name="gibbon_solos",
        owner="gagan",
        split_paths={
            "all": f"{DATA_HOME}/clink2020_gibbon_solos/v0.1.0/raw/clink2020_gibbons_annotations.csv",  # noqa: E501
        },
        version="0.1.0",
        description="Gibbon solos Clink 2020",
        sources=["Royalsocietypublishing.org", "Dryad"],
        license="CC0",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the GibbonSolos dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
        )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["GibbonSolos", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Resolve the audio path relative to data_root
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # Selection table
        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")

        # Clip events outside audio (keep only events that begin before audio end)
        audio_dur = len(audio) / float(sample_rate)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
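The selection-table step in `_process` above parses a tab-separated table stored as a string and keeps only the events that begin before the end of the audio. A minimal sketch of the same filter, using the standard-library `csv` module in place of pandas; the `clip_events` name and the table contents are made up for illustration:

```python
import csv
from io import StringIO
from typing import Dict, List


def clip_events(selection_table: str, num_samples: int, sample_rate: int) -> List[Dict[str, str]]:
    """Keep events whose begin time falls before the end of the audio."""
    audio_dur = num_samples / float(sample_rate)  # duration in seconds
    reader = csv.DictReader(StringIO(selection_table), delimiter="\t")
    return [event for event in reader if float(event["Begin Time (s)"]) < audio_dur]


# Illustrative table: the third event starts after a 2-second clip ends.
table = (
    "Begin Time (s)\tEnd Time (s)\tAnnotation\n"
    "0.10\t0.45\tnote\n"
    "1.20\t2.60\tnote\n"
    "2.50\t2.90\tnote\n"
)
kept = clip_events(table, num_samples=32000, sample_rate=16000)
print([event["Begin Time (s)"] for event in kept])
```

As in the source listing, an event that begins before the end of the audio but ends after it is kept with its original end time; only the begin time is tested.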

`HawaiianBirds`

📊 Dataset Information

Name	`hawaiian_birds`
Version	`0.1.0`
Owner	benjamin
License	CC-BY-4.0
Sources	Zenodo
Available Splits	`all`

HawaiianBirds Dataset

Description

Annotated soundscapes from Hawaii, provided by Cornell Lab of Ornithology

Description from the Zenodo record:

"This collection contains 635 soundscape recordings with a total duration of almost 51 hours, which have been annotated by expert ornithologists who provided 59,583 bounding box labels for 27 different bird species from the Hawaiian Islands, including 6 threatened or endangered native birds. The data were recorded between 2016 and 2022 at four sites across Hawaii Island. This collection has partially been featured as test data in the 2022 BirdCLEF competition and can primarily be used for training and evaluation of machine learning algorithms.

Data collection

Soundscapes for this collection were recorded for various research projects by the Listening Observatory for Hawaiian Ecosystems (LOHE) at the University of Hawaii at Hilo. The recordings were collected using Wildlife Acoustics Inc. Song Meters (models 2, 4, or Mini), as 16-bit wav files at a sampling rate of 44.1 kHz, using the default gain settings of each model. Further specifics for each recording, such as recording location and habitat type, can be found in the metadata provided. Soundscapes in this collection vary in length, ranging from just under a minute to 9 minutes in duration. All audio was unified, converted to FLAC, and resampled to 32 kHz for this collection. Parts of this dataset have previously been used in the 2022 BirdCLEF competition.

Sampling and annotation protocol

This collection is a subset of the files recorded over the course of the LOHE lab’s respective studies. The data were subsampled for annotation by aurally scanning the recordings and visually scanning spectrograms generated using Raven Pro software for target species of interest to the individual research project for which each recording was collected. Recordings that did not contain vocalizations of the species of interest were excluded from full annotation and thus this collection.

Using Raven Pro, annotators were asked to create a selection box around every bird call they could recognize, ignoring those that were too faint or unidentifiable at a spectrogram window size of 700 points. Provided labels contain full bird calls that are boxed in time and frequency. Annotators were allowed to combine multiple consecutive calls of the same species into one bounding box label if pauses between calls were shorter than 0.5 seconds. We converted labels to eBird species codes, following the 2021 eBird taxonomy (Clements list)."

Each entry consists of:

- an audio recording
- a selection table (Raven format), with Species labels

References

https://zenodo.org/records/7078499

Source code in alp_data/datasets/hawaiian_birds.py

@register_dataset
class HawaiianBirds(Dataset):
    """HawaiianBirds Dataset

    Description
    -----------
    Annotated soundscapes from Hawaii, provided by Cornell Lab of Ornithology

    Description from the Zenodo record:

    "This collection contains 635 soundscape recordings with a total duration
    of almost 51 hours, which have been annotated by expert ornithologists
    who provided 59,583 bounding box labels for 27 different bird species
    from the Hawaiian Islands, including 6 threatened or endangered native
    birds. The data were recorded between 2016 and 2022 at four sites across
    Hawaii Island. This collection has partially been featured as test data
    in the 2022 BirdCLEF competition and can primarily be used for training
    and evaluation of machine learning algorithms.

    Data collection

    Soundscapes for this collection were recorded for various research projects
    by the Listening Observatory for Hawaiian Ecosystems (LOHE) at the
    University of Hawaii at Hilo. The recordings were collected using Wildlife
    Acoustics Inc. Song Meters (models 2, 4, or Mini), as 16-bit wav files at a
    sampling rate of 44.1 kHz, using the default gain settings of each model.
    Further specifics for each recording, such as recording location and habitat
    type, can be found in the metadata provided. Soundscapes in this collection
    vary in length, ranging from just under a minute to 9 minutes in duration.
    All audio was unified, converted to FLAC, and resampled to 32 kHz for this
    collection. Parts of this dataset have previously been used in the 2022
    BirdCLEF competition.

    Sampling and annotation protocol

    This collection is a subset of the files recorded over the course of the LOHE
    lab’s respective studies. The data were subsampled for annotation by aurally
    scanning the recordings and visually scanning spectrograms generated using
    Raven Pro software for target species of interest to the individual research
    project for which each recording was collected. Recordings that did not
    contain vocalizations of the species of interest were excluded from full
    annotation and thus this collection.

    Using Raven Pro, annotators were asked to create a selection box around every
    bird call they could recognize, ignoring those that were too faint or
    unidentifiable at a spectrogram window size of 700 points. Provided labels
    contain full bird calls that are boxed in time and frequency. Annotators were
    allowed to combine multiple consecutive calls of the same species into one
    bounding box label if pauses between calls were shorter than 0.5 seconds. We
    converted labels to eBird species codes, following the 2021 eBird taxonomy
    (Clements list)."


    Each entry consists of:
    - an audio recording
    - a selection table (Raven format), with Species labels

    References
    ----------
    https://zenodo.org/records/7078499

    """

    info = DatasetInfo(
        name="hawaiian_birds",
        owner="benjamin",
        split_paths={
            "all": f"{DATA_HOME}/hawaiian_birds/all.csv",
        },
        version="0.1.0",
        description="[MISSING]",
        sources="Zenodo",
        license="CC-BY-4.0",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the HawaiianBirds dataset.

        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
            If None, defaults to the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species"]

        self.sample_rate = sample_rate
        self.data_root = anypath(data_root) if data_root is not None else None

        # Load split CSV
        self._load()

        # If no explicit data_root, assume parent dir of the split path
        if self.data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """

        # Resolve audio path
        audio_path = (
            (self.data_root / row["audio_path"]) if self.data_root else anypath(row["audio_path"])
        )

        # Read audio
        audio, sample_rate = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        # Resample if necessary
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # Selection table
        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")

        # Clip events outside audio (keep only events that begin before audio end)
        audio_dur = len(audio) / float(sample_rate)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["HawaiianBirds", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

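    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """Return all possible labels for a given annotation column.

        Parameters
        ----------
        anno_column : str
            Name of the selection-table column to read, by default "Species".

        Returns
        -------
        List[str]
            Sorted list of the unique labels found in anno_column.
        """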
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
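`get_available_labels` above scans the selection table of every row and returns the sorted set of values found in one annotation column. A minimal sketch of that aggregation, again with the standard-library `csv` module instead of pandas; the `available_labels` name, the rows, and the species codes are illustrative placeholders rather than data taken from the dataset:

```python
import csv
from io import StringIO
from typing import Dict, Iterable, List


def available_labels(rows: Iterable[Dict[str, str]], anno_column: str = "Species") -> List[str]:
    """Collect the sorted, de-duplicated labels found across all selection tables."""
    labels = set()
    for row in rows:
        reader = csv.DictReader(StringIO(row["selection_table"]), delimiter="\t")
        labels.update(event[anno_column] for event in reader)
    return sorted(labels)


# Two illustrative rows, each holding its selection table as a TSV string.
header = "Begin Time (s)\tEnd Time (s)\tSpecies\n"
rows = [
    {"selection_table": header + "0.5\t1.0\tapapan\n2.0\t2.4\tiiwi\n"},
    {"selection_table": header + "0.1\t0.9\tiiwi\n"},
]
labels = available_labels(rows)
print(labels)
```

Every call re-parses all selection tables, so caching the result may be worthwhile when the label list is needed repeatedly.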

`INaturalist`

📊 Dataset Information

Name	`inaturalist`
Version	`0.1.0`
Owner	gagan; david
License	CC BY-NC 4.0, CC BY 4.0, CC0 1.0
Sources	iNaturalist
Available Splits	`train`, `train_unseen`, `val`, `val_unseen`, `all`, `all_unseen`

Description:

iNaturalist audio dataset with taxonomic metadata. Available at original (variable) sample rates and 32kHz (pre-resampled). Pre-resampled audio uses librosa's kaiser_best resampling method.

iNaturalist audio dataset.

Description

iNaturalist is a citizen science platform and biodiversity database containing observations of organisms. This dataset includes audio recordings from iNaturalist with associated metadata about species, locations, and other observation details. Recordings are linked to taxonomic information following ESP's taxonomy app (GBIF backbone), including species scientific and common names, family, genus, order. There is additional metadata including location, date, and recordist information. The current version 0.1.0 includes iNaturalist data up to July 2025.

Available Metadata Fields

Taxonomic Information:

- canonical_name: Canonical species name (primary identifier)
- species_scientific: Scientific species name
- species_common: Common name for the species
- genus, family, order, class, phylum: Taxonomic hierarchy
- gbifID: GBIF (Global Biodiversity Information Facility) identifier

Audio File Paths:

- originals_path: Path to original audio (variable sample rate)
- 32khz_path: Path to pre-resampled 32kHz audio
- 16khz_path: Path to pre-resampled 16kHz audio

Recording Metadata:

- eventDate, eventTime: When the recording was made
- lifeStage, sex, behavior: Biological context

Location:

- latitudeDecimal, longitudeDecimal: GPS coordinates
- country, locality: Geographic location names
- verbatimElevation: Elevation information

Rights & Attribution:

- recordist: Person who made the recording
- rightsHolder: Copyright holder
- license, license_url: Observation license (CC BY-NC 4.0, CC BY 4.0, or CC0 1.0)
- media_license, media_license_url: Media-specific license (CC BY-NC 4.0, CC BY 4.0, or CC0 1.0)
- url: Original iNaturalist sound URL

Captions (from AnimalSpeak):

- caption, caption2, caption3: Descriptive text captions for the audio: only for the subset drawn from AnimalSpeak.

Additional Fields:

- fieldNotes: Observer's notes about the recording
- source, data_source: Origin of the data
- identifier: iNaturalist observation identifier

Available Splits

train: Training set (random split)
val: Validation set (random split)
all: Complete dataset (train + val)
train_unseen: Training set excluding unseen taxa evaluated in BEANS-Zero benchmark
val_unseen: Validation set excluding unseen taxa evaluated in BEANS-Zero benchmark
all_unseen: Complete dataset excluding BEANS-Zero unseen taxa

The _unseen splits are designed for training models that will be evaluated on BEANS-Zero's unseen taxa benchmark, ensuring no test taxa leak into the training data.
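The idea behind the _unseen splits can be illustrated with a toy filter. This is only a sketch: the real splits are precomputed CSV files, and the helper `exclude_unseen`, the example rows, and the held-out taxon below are hypothetical names chosen for illustration.

```python
def exclude_unseen(rows, unseen_taxa):
    """Drop rows whose canonical_name belongs to the held-out taxa."""
    return [r for r in rows if r["canonical_name"] not in unseen_taxa]


# Toy metadata rows (illustrative only, not taken from the dataset)
rows = [
    {"canonical_name": "Turdus merula", "split": "train"},
    {"canonical_name": "Gryllus campestris", "split": "train"},
    {"canonical_name": "Hyla arborea", "split": "val"},
]
unseen_taxa = {"Gryllus campestris"}  # hypothetical held-out taxon

train_rows = [r for r in rows if r["split"] == "train"]
train_unseen = exclude_unseen(train_rows, unseen_taxa)
print([r["canonical_name"] for r in train_unseen])
```

A model trained on the filtered rows never sees the held-out taxon, which is the property the _unseen splits are meant to guarantee.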

Note that all splits exclude examples overlapping with the following benchmark datasets:
- BEANS-Zero captioning test set (see the beans_zero dataset)

Remarks

⚠️ Some original audio files in m4a format were converted to WAV. This does not resolve the issues with m4a as a bioacoustic recording format, and the conversion to WAV via soundfile.write (see scripts/data_preprocessing_scripts/inat_m4a_to_wav.py) may introduce decoder-specific metadata.

⚠️ MP3 audio files that were unreadable by soundfile were also converted to WAV using librosa and ffmpeg (see scripts/data_preprocessing_scripts/inat_mp3_to_wav.py). This may introduce decoder-specific metadata and potential quality issues.

References

iNaturalist: https://www.inaturalist.org/

Examples:

>>> from alp_data.datasets import INaturalist
>>> dataset = INaturalist(
...     split="train",
...     output_take_and_give={"canonical_name": "species"}
... )
>>> print(dataset.info.name)
inaturalist
>>> print(dataset.available_sample_rates)
[32000, 16000]

Load with pre-resampled 32kHz audio (no on-the-fly resampling needed)

>>> dataset_32k = INaturalist(split="train", sample_rate=32000, streaming=True)

Load with pre-resampled 16kHz audio (no on-the-fly resampling needed)

>>> dataset_16k = INaturalist(split="train", sample_rate=16000, streaming=True)
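How the requested sample rate maps to a path column can be sketched as a small pure function. This mirrors the selection logic in `_process` in the source shown for this class, but it is a simplified illustration: `choose_path_column` is a hypothetical helper, not part of the library, and the example row is made up.

```python
# Same mapping as INaturalist._sample_rate_paths in the source
SAMPLE_RATE_PATHS = {32000: "32khz_path", 16000: "16khz_path"}
ORIGINALS_COLUMN = "originals_path"


def choose_path_column(row, sample_rate):
    """Return (path_column, may_resample) for a metadata row.

    may_resample is True when the original file is used and a target
    rate was requested, so on-the-fly resampling may be needed.
    """
    column = SAMPLE_RATE_PATHS.get(sample_rate)
    if column is not None and row.get(column) not in (None, ""):
        return column, False  # pre-resampled file exists for this row
    return ORIGINALS_COLUMN, sample_rate is not None


# Hypothetical row: 32kHz copy present, 16kHz copy missing
row = {"originals_path": "a.wav", "32khz_path": "b.wav", "16khz_path": ""}
print(choose_path_column(row, 32000))
print(choose_path_column(row, 16000))
print(choose_path_column(row, None))
```

Rows without a pre-resampled file fall back to the original audio, so a dataset requested at 16kHz can mix pre-resampled and on-the-fly resampled samples.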

Source code in alp_data/datasets/inaturalist.py

@register_dataset
class INaturalist(Dataset):
    """iNaturalist audio dataset.

    Description
    -----------
    iNaturalist is a citizen science platform and biodiversity database
    containing observations of organisms. This dataset includes audio
    recordings from iNaturalist with associated metadata about species,
    locations, and other observation details. Recordings are linked to taxonomic information
    following ESP's taxonomy app (GBIF backbone),
    including species scientific and common names, family, genus, order.
    There is additional metadata including location, date, and recordist information.
    The current version 0.1.0 includes iNaturalist data up to July 2025.

    Available Metadata Fields
    -------------------------
    **Taxonomic Information:**
        - ``canonical_name``: Canonical species name (primary identifier)
        - ``species_scientific``: Scientific species name
        - ``species_common``: Common name for the species
        - ``genus``, ``family``, ``order``, ``class``, ``phylum``: Taxonomic hierarchy
        - ``gbifID``: GBIF (Global Biodiversity Information Facility) identifier

    **Audio File Paths:**
        - ``originals_path``: Path to original audio (variable sample rate)
        - ``32khz_path``: Path to pre-resampled 32kHz audio
        - ``16khz_path``: Path to pre-resampled 16kHz audio

    **Recording Metadata:**
        - ``eventDate``, ``eventTime``: When the recording was made
        - ``lifeStage``, ``sex``, ``behavior``: Biological context

    **Location:**
        - ``latitudeDecimal``, ``longitudeDecimal``: GPS coordinates
        - ``country``, ``locality``: Geographic location names
        - ``verbatimElevation``: Elevation information

    **Rights & Attribution:**
        - ``recordist``: Person who made the recording
        - ``rightsHolder``: Copyright holder
        - ``license``, ``license_url``: Observation license (CC BY-NC 4.0, CC BY 4.0, or CC0 1.0)
        - ``media_license``, ``media_license_url``: Media-specific license
            (CC BY-NC 4.0, CC BY 4.0, or CC0 1.0)
        - ``url``: Original iNaturalist sound URL

    **Captions (from AnimalSpeak):**
        - ``caption``, ``caption2``, ``caption3``: Descriptive text captions for the audio:
            only for the subset drawn from AnimalSpeak.

    **Additional Fields:**
        - ``fieldNotes``: Observer's notes about the recording
        - ``source``, ``data_source``: Origin of the data
        - ``identifier``: iNaturalist observation identifier

    Available Splits
    ----------------
    - ``train``: Training set (random split)
    - ``val``: Validation set (random split)
    - ``all``: Complete dataset (train + val)
    - ``train_unseen``: Training set excluding unseen taxa evaluated in BEANS-Zero benchmark
    - ``val_unseen``: Validation set excluding unseen taxa evaluated in BEANS-Zero benchmark
    - ``all_unseen``: Complete dataset excluding BEANS-Zero unseen taxa

    The ``_unseen`` splits are designed for training models that will be evaluated
    on BEANS-Zero's unseen taxa benchmark, ensuring no test taxa leak into the training data.

    Note that all splits exclude examples overlapping with the following benchmark datasets:
    - BEANS-Zero captioning test set (See the beans_zero dataset)

    Remarks
    -------
    ⚠️ Some original audio files in m4a format were converted to WAV. This does not
    resolve the issues with m4a as a bioacoustic recording format,
    and the conversion to WAV via soundfile.write
    (see scripts/data_preprocessing_scripts/inat_m4a_to_wav.py) may introduce decoder specific
    metadata.
    ⚠️ MP3 audio files that were unreadable by soundfile were also converted to WAV using librosa
    and ffmpeg. This may introduce decoder specific metadata and potential quality issues.
    (see scripts/data_preprocessing_scripts/inat_mp3_to_wav.py)

    References
    ----------
    iNaturalist: https://www.inaturalist.org/

    Examples
    --------
    >>> from alp_data.datasets import INaturalist
    >>> dataset = INaturalist(
    ...     split="train",
    ...     output_take_and_give={"canonical_name": "species"}
    ... )
    >>> print(dataset.info.name)
    inaturalist
    >>> print(dataset.available_sample_rates)
    [32000, 16000]

    Load with pre-resampled 32kHz audio (no on-the-fly resampling needed)
    >>> dataset_32k = INaturalist(split="train", sample_rate=32000, streaming=True)

    Load with pre-resampled 16kHz audio (no on-the-fly resampling needed)
    >>> dataset_16k = INaturalist(split="train", sample_rate=16000, streaming=True)
    """

    info = DatasetInfo(
        name="inaturalist",
        owner="gagan; david",
        split_paths={
            "train": f"{DATA_HOME}/inaturalist/v0.1.0/raw/train_20260201_v3.csv",
            "train_unseen": f"{DATA_HOME}/inaturalist/v0.1.0/raw/train_unseen_20260201_v3.csv",
            "val": f"{DATA_HOME}/inaturalist/v0.1.0/raw/val_20260201_v3.csv",
            "val_unseen": f"{DATA_HOME}/inaturalist/v0.1.0/raw/val_unseen_20260201_v3.csv",
            "all": f"{DATA_HOME}/inaturalist/v0.1.0/raw/all_20260201_v3.csv",
            "all_unseen": f"{DATA_HOME}/inaturalist/v0.1.0/raw/all_unseen_20260201_v3.csv",
        },
        version="0.1.0",
        description="iNaturalist audio dataset with taxonomic metadata. "
        "Available at original (variable) sample rates and pre-resampled to 32kHz and 16kHz. "
        "Pre-resampled audio uses librosa's kaiser_best resampling method.",
        sources=["iNaturalist"],
        license="CC BY-NC 4.0, CC BY 4.0, CC0 1.0",
    )

    # Mapping of sample rates to their corresponding path columns
    _sample_rate_paths = {
        32000: "32khz_path",  # Pre-resampled to 32kHz
        16000: "16khz_path",  # Pre-resampled to 16kHz
    }

    # Column name for original variable-rate audio files
    _originals_path_column = "originals_path"

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the iNaturalist dataset.

        Parameters
        ----------
        split : str, default="train"
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str], optional
            A dictionary mapping the original column names to the new column names.
        sample_rate : int, optional
            The sample rate to which audio files should be resampled. If the requested
            sample rate is available as pre-resampled audio (see `available_sample_rates`),
            the pre-resampled version will be loaded directly. Otherwise, audio will be
            resampled on-the-fly from the original files (at variable sample rates) using
            librosa's kaiser_best method. If None, audio is returned at its original
            (variable) sample rate.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is prepended to the selected
            audio path column value (``originals_path`` or a pre-resampled path column)
            to construct the full path to audio files. If None, defaults
            to the GCS bucket path for this dataset.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self._load()
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(f"{DATA_HOME}/inaturalist/v0.1.0/raw/")
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return the available pre-resampled sample rates.

        Returns
        -------
        list[int]
            List of sample rates (in Hz) for which pre-resampled audio is available.
            Audio at these sample rates can be loaded directly without on-the-fly resampling.
            This checks which path columns actually exist in the loaded data.
        """
        available = []
        for sr, path_column in self._sample_rate_paths.items():
            # Check if the path column exists in the loaded data
            if path_column in self._data.columns:
                available.append(sr)
        return available

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        # Read CSV directly from GCS path to avoid memory issues
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["INaturalist", dict[str, Any]]:
        """Create a Dataset instance from a DatasetConfig.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration object containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        NotImplementedError
            If streaming mode is enabled.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """

        # Determine which path column to use based on requested sample rate
        # If a pre-resampled version is available, use it; otherwise resample on-the-fly
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            # Check if the pre-resampled path column exists in the data
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                # Use pre-resampled audio
                audio_path = anypath(self.data_root) / row[path_column]
                use_presampled = True

        if use_presampled:
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")
            # Audio is already at the correct sample rate, no resampling needed
        else:
            # Use original variable-rate files and resample on-the-fly if needed
            audio_path = anypath(self.data_root) / row[self._originals_path_column]
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")

            if self.sample_rate is not None and sample_rate != self.sample_rate:
                audio = librosa.resample(
                    y=audio,
                    orig_sr=sample_rate,
                    target_sr=self.sample_rate,
                    scale=True,
                    res_type="kaiser_best",
                )
                sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version})"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`InfantMarmosetsVox`

📊 Dataset Information

Name	`InfantMarmosetsVox`
Version	`0.1.0`
Owner	eklavya
License	CC-BY-4.0
Sources	Zenodo
Available Splits	`all`

Description:

InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. Available at original 44.1 kHz and pre-resampled 16kHz. Pre-resampled audio uses librosa's kaiser_best resampling method. Contains approx. 73k vocalization segments of infant marmoset vocalizations. Each vocalization segment sample has a calltypeID and callerID label. There are 11 calltype classes (0-10) and 10 caller identity classes (0-9). A calltypeID index can be associated with its calltype through `CALLTYPE_NAMES`. An unfiltered version (labels_raw.csv) with silence/noise is also available.

InfantMarmosetsVox dataset

Description

InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of different individual marmosets and their call-types. The dataset contains a total of 350 files of precisely labelled 10-minute audio recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labeled by an experienced researcher.

Each entry in the dataset corresponds to a single vocalization segment, with the audio loaded from the corresponding time range in the source file.
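The mapping from a labelled time range to a waveform segment can be sketched as follows. This is an illustration only: the library reads the range through `read_audio(..., start_time=..., end_time=...)`, whose implementation is not shown here, so `slice_segment` and its rounding behaviour are assumptions, and a plain Python list stands in for the numpy waveform.

```python
def slice_segment(audio, sample_rate, start, end):
    """Cut the samples between start and end (in seconds) from a waveform."""
    first = int(round(start * sample_rate))
    last = int(round(end * sample_rate))
    return audio[first:last]


sample_rate = 44100  # original recording rate stated above
audio = [0.0] * (sample_rate * 2)  # 2 s of silence as a stand-in waveform

segment = slice_segment(audio, sample_rate, start=0.5, end=1.25)
print(len(segment) / sample_rate)  # segment duration in seconds
```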

Labels

calltypeID: Call type (0-10, 11 classes)
callerID: Caller identity (0-9, 10 individuals from 5 twin pairs)

Additional Fields

audio: Vocalization segment waveform (numpy array).
path: Relative path to audio file.
start: Start time in seconds.
end: End time in seconds.
duration: Vocalization segment duration in seconds.
twinID: Marmoset twin pair (1-5).
vocID: Unique vocalization ID (row index).

References

Sarkar, E., Magimai.-Doss, M. (2023) "Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?" Proc. Interspeech 2023, 1189-1193. doi: 10.21437/Interspeech.2023-1968 https://www.isca-speech.org/archive/interspeech_2023/sarkar23_interspeech.html

Examples:

>>> from alp_data.datasets import InfantMarmosetsVox
>>> dataset = InfantMarmosetsVox(
...     split="all",
...     output_take_and_give={"calltypeID": "label", "audio": "audio"},
...     sample_rate=16000,
... )
>>> sample = dataset[0]
>>> print(dataset.info.name)
InfantMarmosetsVox
>>> print(dataset.available_sample_rates)
[44100, 16000]
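The effect of output_take_and_give in the example above can be sketched with a plain dictionary. The mapping both renames and filters columns, as in `_process` in the source shown for this class. `take_and_give` is a hypothetical stand-alone helper and the row values are made up.

```python
def take_and_give(row, mapping):
    """Rename the mapped keys and drop every other column."""
    if not mapping:
        return row  # no mapping: the full row is returned unchanged
    return {new: row[old] for old, new in mapping.items()}


# Hypothetical processed row
row = {"calltypeID": 3, "callerID": 7, "audio": [0.0, 0.1], "twinID": 4}

sample = take_and_give(row, {"calltypeID": "label", "audio": "audio"})
print(sample)
```

Columns that should survive under their original name, such as audio here, must be mapped to themselves.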

Source code in alp_data/datasets/infant_marmosets_vox.py

@register_dataset
class InfantMarmosetsVox(Dataset):
    """InfantMarmosetsVox dataset

    Description
    -----------
    InfantMarmosetsVox is a dataset for multi-class call-type and caller
    identification. It contains audio recordings of different individual
    marmosets and their call-types. The dataset contains a total of 350 files
    of precisely labelled 10-minute audio recordings across all caller classes.
    The audio was recorded from five pairs of infant marmoset twins, each
    recorded individually in two separate sound-proofed recording rooms at a
    sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset
    identity of each vocalization are provided, labeled by an experienced
    researcher.

    Each entry in the dataset corresponds to a single vocalization segment,
    with the audio loaded from the corresponding time range in the source file.

    Labels
    ------
    - ``calltypeID``: Call type (0-10, 11 classes)
    - ``callerID``: Caller identity (0-9, 10 individuals from 5 twin pairs)

    Additional Fields
    -----------------
    - ``audio``: Vocalization segment waveform (numpy array).
    - ``path``: Relative path to audio file.
    - ``start``: Start time in seconds.
    - ``end``: End time in seconds.
    - ``duration``: Vocalization segment duration in seconds.
    - ``twinID``: Marmoset twin pair (1-5).
    - ``vocID``: Unique vocalization ID (row index).

    References
    ----------
    Sarkar, E., Magimai.-Doss, M. (2023)
    "Can Self-Supervised Neural Representations Pre-Trained on Human Speech
    distinguish Animal Callers?" Proc. Interspeech 2023, 1189-1193.
    doi: 10.21437/Interspeech.2023-1968
    https://www.isca-speech.org/archive/interspeech_2023/sarkar23_interspeech.html

    Examples
    --------
    >>> from alp_data.datasets import InfantMarmosetsVox
    >>> dataset = InfantMarmosetsVox(
    ...     split="all",
    ...     output_take_and_give={"calltypeID": "label", "audio": "audio"},
    ...     sample_rate=16000,
    ... )
    >>> sample = dataset[0]
    >>> print(dataset.info.name)
    InfantMarmosetsVox
    >>> print(dataset.available_sample_rates)
    [44100, 16000]
    """

    info = DatasetInfo(
        name="InfantMarmosetsVox",
        owner="eklavya",
        split_paths={
            "all": f"{DATA_HOME}/infant_marmosets_vox/labels.csv",
        },
        version="0.1.0",
        description=(
            "InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. "
            "Available at original 44.1 kHz and pre-resampled 16kHz. "
            "Pre-resampled audio uses librosa's kaiser_best resampling method. "
            "Contains approx. 73k vocalization segments of infant marmoset vocalizations. "
            "Each vocalization segment sample has a calltypeID and callerID label. "
            "There are 11 calltype classes (0-10) and 10 caller identity classes (0-9). "
            "A calltypeID index can be associated with its calltype through `CALLTYPE_NAMES`. "
            "An unfiltered version (labels_raw.csv) with silence/noise is also available."
        ),
        sources=["Zenodo"],
        license="CC-BY-4.0",
    )

    # Mapping of sample rates to audio subdirectory names
    _sample_rate_paths = {
        44100: "audio_44k",  # Original 44.1kHz
        16000: "audio_16k",  # Pre-resampled to 16kHz
    }

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the InfantMarmosetsVox dataset.

        Parameters
        ----------
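        split : str, default="all"
            The split to load. Currently only "all" is available.
        output_take_and_give : dict[str, str], optional
            A dictionary mapping the original column names to the new column names.
            It also acts as a filter: only the mapped columns are returned.
        sample_rate : int, optional
            The sample rate to which audio files should be resampled. If the requested
            rate has a pre-resampled version (see `available_sample_rates`), that version
            is loaded directly; otherwise audio is resampled on-the-fly using librosa's
            kaiser_best method. If None, audio is returned at the sample rate of the
            file referenced by the ``path`` column.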
        data_root : str | AnyPathT, optional
            The root directory for the dataset audio files.
            If None, defaults to parent directory of the split CSV.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data = None
        self._load()

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return the available pre-resampled sample rates.

        Returns
        -------
        list[int]
            List of sample rates (in Hz) for which pre-resampled audio is available.
            Audio at these sample rates can be loaded directly without on-the-fly
            resampling. Original audio is at 44100 Hz.
        """
        return list(self._sample_rate_paths.keys())

    def _load(self) -> None:
        """Load the dataset from preprocessed CSV.

        Raises
        ------
        LookupError
            If the requested split is not available.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
        )

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["InfantMarmosetsVox", dict[str, Any]]:
        """Create a Dataset instance from a DatasetConfig.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration object containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata
        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            The number of samples.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        NotImplementedError
            If streaming mode is enabled.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row to load audio segment.

        Parameters
        ----------
        row : dict
            A row from the dataset containing path, start, end times.

        Returns
        -------
        dict
            The processed sample with audio loaded.
        """
        # Determine which audio directory to use based on requested sample rate
        audio_rel_path = row["path"]

        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            # Use pre-resampled audio - replace "audio_44k" with appropriate subdirectory
            audio_subdir = self._sample_rate_paths[self.sample_rate]
            audio_rel_path = audio_rel_path.replace("audio_44k", audio_subdir, 1)

        audio_path = anypath(self.data_root) / audio_rel_path
        start = float(row["start"])
        end = float(row["end"])

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path, start_time=start, end_time=end)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        # Resample on-the-fly if requested sample rate doesn't have pre-resampled version
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["path"] = audio_rel_path  # Update path to reflect actual file loaded

        if self.output_take_and_give:
            return {value: row[key] for key, value in self.output_take_and_give.items()}
        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing audio and metadata.
        """
        return self._process(self._data[idx])

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A formatted string with dataset information.
        """
        return (
            f"{self.info.name} (v{self.info.version}), split='{self.split}'\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

    @property
    def calltype_names(self) -> dict[int, str]:
        """Return mapping from call type ID to name."""
        return CALLTYPE_NAMES

    @property
    def num_callers(self) -> int:
        """Return number of unique callers."""
        return len(self._data.get_unique("callerID"))

`InsectSet459`

📊 Dataset Information

Name	`insectset_459`
Version	`0.1.0`
Owner	gagan
License	CC-BY-4.0, CC0
Sources	Xeno-canto, iNaturalist, Bioacoustica
Available Splits	`train`, `validation`

Description:

InsectSet459 dataset

InsectSet459 dataset.

Description

Excerpt from the original publication Abstract: "...Automatic recognition of insect sound could help us understand changing biodiversity trends around the world—but insect sounds are challenging to recognize even for deep learning. We present a new dataset comprised of 26399 audio files, from 459 species of Orthoptera and Cicadidae..."

References

Faiss, Ghani, Stowell 2025. https://arxiv.org/abs/2503.15074 Dataset DOI: https://zenodo.org/records/8252141

Examples:

>>> from alp_data.datasets import InsectSet459
>>> dataset = InsectSet459(
...     split="validation",
...     output_take_and_give={"species_scientific": "species"},
...     sample_rate=16000,
... )
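
The `output_take_and_give` argument both renames and filters: only the keys named in the mapping are returned, under their new names. The following is a minimal sketch of that behaviour on a plain dictionary, mirroring the loop in `_process` in the source listing. The helper name `take_and_give` and the sample row are made up for illustration and are not part of `alp_data`.

```python
from __future__ import annotations

from typing import Any


def take_and_give(row: dict[str, Any], mapping: dict[str, str] | None) -> dict[str, Any]:
    """Rename and filter a row the way ``_process`` does."""
    if not mapping:
        # No mapping given: the full row is returned unchanged.
        return row
    # Keep only the mapped columns, stored under their new names.
    return {new: row[old] for old, new in mapping.items()}


# Hypothetical row; real rows also carry an "audio" array.
row = {
    "species_scientific": "Gryllus campestris",
    "local_path": "a.wav",
    "sample_rate": 16000,
}
item = take_and_give(row, {"species_scientific": "species"})
print(item)
```

With the mapping from the example above, the returned item contains only `species`. To keep the waveform as well, the mapping would also need an entry such as `"audio": "audio"`.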

Source code in alp_data/datasets/insectset_459.py

@register_dataset
class InsectSet459(Dataset):
    """InsectSet459 dataset.

    Description
    -----------
    Excerpt from the original publication Abstract:
    "...Automatic recognition of insect sound could help us understand
    changing biodiversity trends around the world—but insect sounds
    are challenging to recognize even for deep learning.
    We present a new dataset comprised of 26399 audio files,
    from 459 species of Orthoptera and Cicadidae..."

    References
    ----------
    Faiss, Ghani, Stowell 2025.
    https://arxiv.org/abs/2503.15074
    Dataset DOI:
    https://zenodo.org/records/8252141

    Examples
    --------
    >>> from alp_data.datasets import InsectSet459
    >>> dataset = InsectSet459(
    ...     split="validation",
    ...     output_take_and_give={"species_scientific": "species"},
    ...     sample_rate=16000,
    ... )
    """

    info = DatasetInfo(
        name="insectset_459",
        owner="gagan",
        split_paths={
            "train": f"{DATA_HOME}/insectset_459/v0.1.0/raw/insectset459_annotations_train.csv",
            "validation": f"{DATA_HOME}/insectset_459/v0.1.0/raw/insectset459_annotations_val.csv",
        },
        version="0.1.0",
        description="InsectSet459 dataset",
        sources=["Xeno-canto", "iNaturalist", "Bioacoustica"],
        license="CC-BY-4.0, CC0",
    )

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the InsectSet459 dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int | None
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the
            ``local_path`` of each sample when resolving audio files.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["InsectSet459", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode for InsectSet459.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string with the dataset name, version, and split, followed by
            the description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
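
As the listing above shows, `__len__` raises in streaming mode. A caller that needs a count can fall back to iterating. The helper `safe_len` and the stand-in class below are hypothetical and not part of `alp_data`; on a real dataset each iteration step also reads and decodes the audio file, so this fallback can be slow.

```python
from typing import Any, Iterable, Iterator


def safe_len(dataset: Iterable[Any]) -> int:
    """Return len(dataset), counting by iteration when len() is unavailable."""
    try:
        return len(dataset)  # type: ignore[arg-type]
    except (RuntimeError, TypeError):
        # NotImplementedError is a subclass of RuntimeError, so both
        # exception types used by the dataset classes are caught here.
        return sum(1 for _ in dataset)


class FakeStreamingDataset:
    """Hypothetical stand-in that behaves like a dataset in streaming mode."""

    def __len__(self) -> int:
        raise NotImplementedError("Length is not available in streaming mode.")

    def __iter__(self) -> Iterator[dict[str, int]]:
        yield from ({"idx": i} for i in range(3))


count = safe_len(FakeStreamingDataset())
print(count)
```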

`LittleOwlId`

📊 Dataset Information

Name	`littleowl_id`
Version	`0.1.0`
Owner	david
License	CC-BY-4.0
Sources	https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
Available Splits	`train_across_year`, `test_across_year`

Description:

Individual identification of little owls (Athene noctua)

Little Owl individual ID dataset.

Description

Vocalisations released by Stowell et al. for individual Little Owls (Athene noctua). Provides both within-year and across-year evaluation schemes. https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940

For this dataset, the train and test splits (train_across_year, test_across_year) are drawn from different years, giving harder test conditions, with potential differences in acoustic environment or vocalisation characteristics.

References

https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940 Zenodo: https://zenodo.org/records/1413495

Examples:

>>> from alp_data.datasets import LittleOwlId
>>> dataset = LittleOwlId(
...     split="test_across_year",
...     sample_rate=16000,
... )
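
For individual identification, the `individual_id` column usually has to be turned into integer class labels. The sketch below is one way to do that, assuming the IDs are strings and that the label map is built from the training split only. The function name and the example IDs are made up for illustration.

```python
def build_label_map(individual_ids: list[str]) -> dict[str, int]:
    """Map each distinct individual ID to a stable integer class index."""
    # Sorting makes the mapping independent of row order.
    return {name: idx for idx, name in enumerate(sorted(set(individual_ids)))}


# Made-up IDs standing in for the `individual_id` column of each split.
train_ids = ["owl_b", "owl_a", "owl_b", "owl_c"]
test_ids = ["owl_a", "owl_d"]

label_map = build_label_map(train_ids)
# Individuals unseen during training get the sentinel -1.
encoded_test = [label_map.get(name, -1) for name in test_ids]
print(label_map, encoded_test)
```

Building the map from the training split keeps class indices fixed across the two across-year splits; how unseen individuals should be handled is a modelling decision left open here.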

Source code in alp_data/datasets/littleowl_id.py

@register_dataset
class LittleOwlId(Dataset):
    """Little Owl individual ID dataset.

    Description
    -----------
    Vocalisations released by Stowell et al. for individual Little Owls
    (Athene noctua). Provides both *within-year* and *across-year* evaluation schemes.
    https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940

    For this dataset, the train and test splits (train_across_year, test_across_year)
    are drawn from different years, giving harder test conditions,
    with potential differences in acoustic environment
    or vocalisation characteristics.

    References
    ----------
    https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
    Zenodo: https://zenodo.org/records/1413495

    Examples
    --------
    >>> from alp_data.datasets import LittleOwlId
    >>> dataset = LittleOwlId(
    ...     split="test_across_year",
    ...     sample_rate=16000,
    ... )
    """

    info = DatasetInfo(
        name="littleowl_id",
        owner="david",
        split_paths={
            "train_across_year": f"{DATA_HOME}/littleowl_id/v0.1.0/raw/acrossyear_fg_train.csv",
            "test_across_year": f"{DATA_HOME}/littleowl_id/v0.1.0/raw/acrossyear_fg_test.csv",
        },
        version="0.1.0",
        description="Individual identification of little owls (Athene noctua)",
        sources=["https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940"],
        license="CC-BY-4.0",
    )

    def __init__(
        self,
        split: str = "train_across_year",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the LittleOwlId dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int | None
            The sample rate to which audio files should be resampled.
            If None, audio keeps its native sample rate.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the
            ``local_path`` of each sample when resolving audio files.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame = None
        self._load()
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]

        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
            infer_schema_length=10000,
            columns=["local_path", "individual_id"],  # for polars
            usecols=["local_path", "individual_id"],  # for pandas
        )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["LittleOwlId", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """

        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        # Pass every constructor argument from the config, including split
        kwargs = {
            "output_take_and_give": cfg["output_take_and_give"],
            "data_root": cfg["data_root"],
            "sample_rate": cfg["sample_rate"],
            "backend": cfg["backend"],
            "streaming": cfg["streaming"],
            "split": cfg["split"],
        }

        ds = cls(**kwargs)

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("Dataset not loaded.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode for this dataset.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        audio_path = anypath(self.data_root) / row["local_path"]

        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        audio = audio_stereo_to_mono(audio, mono_method="average")
        if self.sample_rate and sample_rate != self.sample_rate:
            audio = librosa.resample(
                audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        if self.output_take_and_give:
            return {new: row[old] for old, new in self.output_take_and_give.items()}
        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string with the dataset name, version, and split, followed by
            the description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`MacaquesCooCalls`

📊 Dataset Information

Name	`macaques_coo_calls`
Version	`0.1.0`
Owner	marius
License	CC0 1.0 Universal
Sources	archive.org
Available Splits	`test`, `train`, `val`

Description:

Coo calls from male and female macaques (Macaca mulatta) including id, sex, weight_kg

Macaques Coo Calls dataset

Description

Coo calls from male and female macaques (Macaca mulatta) including macaque ID, sex, and weight.

References

https://archive.org/details/macaque_coo_calls

Examples:

>>> from alp_data.datasets import MacaquesCooCalls
>>> dataset = MacaquesCooCalls(
...     split="test",
...     output_take_and_give={"id": "label"},
...     sample_rate=16000,
... )
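
Note that the validation split is registered under the key `val`, not `validation`. The sketch below reproduces, on plain `pathlib` paths, how the split is validated and how an audio path is resolved: `data_root` defaults to the parent directory of the split CSV and is joined with the row's `local_path`. The `/data` root and the helper name are placeholders; the real code uses `DATA_HOME` and `anypath`.

```python
from __future__ import annotations

from pathlib import PurePosixPath

DATA_HOME = "/data"  # placeholder; the real value is defined by alp_data

SPLIT_PATHS = {
    "test": f"{DATA_HOME}/macaques_coo_calls/v0.1.0/raw/test.csv",
    "train": f"{DATA_HOME}/macaques_coo_calls/v0.1.0/raw/train.csv",
    "val": f"{DATA_HOME}/macaques_coo_calls/v0.1.0/raw/validation.csv",
}


def resolve_audio_path(
    split: str, local_path: str, data_root: str | None = None
) -> PurePosixPath:
    """Validate the split, then join the data root with the row's local path."""
    if split not in SPLIT_PATHS:
        raise LookupError(f"Invalid split: {split}. Expected one of {list(SPLIT_PATHS)}")
    if data_root is None:
        # Default root: the directory that holds the split CSV.
        root = PurePosixPath(SPLIT_PATHS[split]).parent
    else:
        root = PurePosixPath(data_root)
    return root / local_path


print(resolve_audio_path("val", "audio/coo_001.wav"))
try:
    resolve_audio_path("validation", "audio/coo_001.wav")
except LookupError as err:
    print(err)
```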

Source code in alp_data/datasets/macaques_coo_calls.py

@register_dataset
class MacaquesCooCalls(Dataset):
    """Macaques Coo Calls dataset

    Description
    -----------
    Coo calls from male and female macaques (Macaca mulatta)
    including macaque ID, sex, and weight.

    References
    ----------

    https://archive.org/details/macaque_coo_calls

    Examples
    --------
    >>> from alp_data.datasets import MacaquesCooCalls
    >>> dataset = MacaquesCooCalls(
    ...     split="test",
    ...     output_take_and_give={"id": "label"},
    ...     sample_rate=16000,
    ... )
    """

    info = DatasetInfo(
        name="macaques_coo_calls",
        owner="marius",
        split_paths={
            "test": f"{DATA_HOME}/macaques_coo_calls/v0.1.0/raw/test.csv",
            "train": f"{DATA_HOME}/macaques_coo_calls/v0.1.0/raw/train.csv",
            "val": f"{DATA_HOME}/macaques_coo_calls/v0.1.0/raw/validation.csv",
        },
        version="0.1.0",
        description=(
            "Coo calls from male and female macaques (Macaca mulatta) including id, sex, weight_kg"
        ),
        sources=["archive.org"],
        license="CC0 1.0 Universal",
    )

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the Macaques Coo Calls dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int | None
            The sample rate to which audio files should be resampled.
            If None, audio keeps its native sample rate.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the
            ``local_path`` of each sample when resolving audio files.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self._data = None
        self._load()  # Load the dataset (fills self._data)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["MacaquesCooCalls", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise RuntimeError("Length is not available in streaming mode for this dataset.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Ensure audio path is valid
        audio_path = anypath(self.data_root) / anypath(row["local_path"])

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string with the dataset name, version, and split, followed by
            the description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split: {self.split}"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`NocturnalBirdMigration`

📊 Dataset Information

Name	`nocturnal_bird_migration`
Version	`0.1.0`
Owner	benjamin
License	CC-BY-NC-SA 4.0
Sources	Zenodo, xeno-canto
Available Splits	`train`, `train_nonxc`, `train_xc`, `test`

Description:

Dataset of nocturnal vocalizations from migratory birds in Europe. Vocalizations are annotated with start- and end- times, as well as high- and low-frequencies.

NocturnalBirdMigration Dataset

Description

Dataset of nocturnal vocalizations from migratory birds in Europe. Vocalizations are annotated with start- and end- times, as well as high- and low-frequencies.

The dataset consists of a train split and a test split. The test split consists entirely of xeno-canto recordings which the dataset authors annotated. The train split consists of recordings submitted by French citizen-scientists, as well as xeno-canto recordings annotated by the dataset authors.

Note that the license is mixed (due to origins on xeno-canto); most restrictive is CC-BY-NC-SA 4.0. See zenodo page for full license details.

Description from the paper:

The persisting threats on migratory bird populations highlight the urgent need for effective monitoring techniques that could assist in their conservation. Among these, passive acoustic monitoring is an essential tool, particularly for nocturnal migratory species that are difficult to track otherwise. This work presents the Nocturnal Bird Migration (NBM) dataset, a collection of 13,359 annotated vocalizations from 117 species of the Western Palearctic. The dataset includes precise time and frequency annotations, gathered by dozens of bird enthusiasts across France, enabling novel downstream acoustic analysis. In particular, we prove the utility of this database by training an original two-stage deep object detection model tailored for the processing of audio data. While allowing the precise localization of bird calls in spectrograms, this model shows competitive accuracy on the 45 main species of the dataset with state-of-the-art systems trained on much larger audio collections. These results highlight the interest of fostering similar open-science initiatives to acquire costly but valuable fine-grained annotations of audio files. All data and code are made openly available.

Each entry consists of:

- an audio recording
- a selection table (Raven format), with Species labels
- xeno-canto id, if applicable (else, empty string)
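
To make the selection-table part concrete, here is a rough sketch of parsing a Raven-style table with the standard library. It assumes the usual tab-separated Raven layout with `Begin Time (s)`, `End Time (s)`, `Low Freq (Hz)` and `High Freq (Hz)` columns plus the `Species` annotation column; how the dataset actually stores and returns the table is not shown in this section, and the rows below are invented.

```python
import csv
import io

# Invented two-row table in the assumed Raven tab-separated layout.
RAVEN_TEXT = (
    "Selection\tBegin Time (s)\tEnd Time (s)\tLow Freq (Hz)\tHigh Freq (Hz)\tSpecies\n"
    "1\t0.50\t0.82\t2100\t6400\tTurdus philomelos\n"
    "2\t3.10\t3.35\t3000\t7800\tUnknown\n"
)


def parse_selection_table(text: str) -> list[dict]:
    """Turn a Raven-style selection table into a list of time-frequency boxes."""
    events = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        events.append(
            {
                "start_s": float(rec["Begin Time (s)"]),
                "end_s": float(rec["End Time (s)"]),
                "low_hz": float(rec["Low Freq (Hz)"]),
                "high_hz": float(rec["High Freq (Hz)"]),
                "species": rec["Species"],
            }
        )
    return events


events = parse_selection_table(RAVEN_TEXT)
for event in events:
    print(event["species"], event["start_s"], event["end_s"])
```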

Pre-resampled Audio

Pre-resampled audio is available at 16 kHz and 32 kHz. When sample_rate matches one of these rates, the pre-resampled files are loaded directly (no on-the-fly resampling). For any other target rate, audio is resampled on-the-fly using librosa's kaiser_best method.
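
The selection rule described above can be sketched as a small lookup. The column names come from the class attributes `_sample_rate_paths` and `_originals_path_column` in the source listing; the function itself is a hypothetical illustration, since the loading code is not shown in this section.

```python
from __future__ import annotations

# Values copied from the class attributes in the source listing.
SAMPLE_RATE_PATHS = {16000: "16khz_path", 32000: "32khz_path"}
ORIGINALS_PATH_COLUMN = "audio_path"


def choose_audio_column(sample_rate: int | None) -> tuple[str, bool]:
    """Return the path column to read and whether on-the-fly resampling may apply."""
    if sample_rate in SAMPLE_RATE_PATHS:
        # A pre-resampled copy exists at this rate: load it directly.
        return SAMPLE_RATE_PATHS[sample_rate], False
    # No pre-resampled copy: read the original file. Resampling applies
    # only when a target rate was actually requested.
    return ORIGINALS_PATH_COLUMN, sample_rate is not None


for rate in (16000, 32000, 22050, None):
    print(rate, choose_audio_column(rate))
```

Requesting 16 kHz or 32 kHz therefore avoids the cost of `kaiser_best` resampling on every access.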

References

https://zenodo.org/records/17573913 https://arxiv.org/pdf/2412.03633

Source code in alp_data/datasets/nocturnal_bird_migration.py

@register_dataset
class NocturnalBirdMigration(Dataset):
    """NocturnalBirdMigration Dataset

    Description
    -----------
    Dataset of nocturnal vocalizations from migratory birds in Europe.
    Vocalizations are annotated with start- and end- times, as well as high- and
    low-frequencies.

    The dataset consists of a train split and a test split. The test split consists
    entirely of xeno-canto recordings which the dataset authors annotated. The train
    split consists of recordings submitted by French citizen-scientists, as well as
    xeno-canto recordings annotated by the dataset authors.

    Note that the license is mixed (due to origins on xeno-canto); most restrictive
    is CC-BY-NC-SA 4.0. See zenodo page for full license details.

    Description from the paper:

    The persisting threats on migratory bird populations
    highlight the urgent need for effective monitoring
    techniques that could assist in their conservation. Among
    these, passive acoustic monitoring is an essential tool,
    particularly for nocturnal migratory species that are
    difficult to track otherwise. This work presents the
    Nocturnal Bird Migration (NBM) dataset, a collection of
    13,359 annotated vocalizations from 117 species of the
    Western Palearctic. The dataset includes precise time
    and frequency annotations, gathered by dozens of bird
    enthusiasts across France, enabling novel downstream
    acoustic analysis. In particular, we prove the utility of
    this database by training an original two-stage deep
    object detection model tailored for the processing of audio
    data. While allowing the precise localization of bird calls
    in spectrograms, this model shows competitive accuracy
    on the 45 main species of the dataset with
    state-of-the-art systems trained on much larger audio
    collections. These results highlight the interest of
    fostering similar open-science initiatives to acquire
    costly but valuable fine-grained annotations of audio
    files. All data and code are made openly available.

    Each entry consists of:
    - an audio recording
    - a selection table (Raven format), with Species labels
    - xeno-canto id, if applicable (else, empty string)

    Pre-resampled Audio
    -------------------
    Pre-resampled audio is available at 16 kHz and 32 kHz. When
    ``sample_rate`` matches one of these rates, the pre-resampled files are
    loaded directly (no on-the-fly resampling). For any other target rate,
    audio is resampled on-the-fly using librosa's ``kaiser_best`` method.

    References
    ----------
    https://zenodo.org/records/17573913
    https://arxiv.org/pdf/2412.03633

    """

    info = DatasetInfo(
        name="nocturnal_bird_migration",
        owner="benjamin",
        split_paths={
            # Full training set
            "train": f"{DATA_HOME}/nocturnal_bird_migration/train_v2_1.csv",
            # Training subset: no xeno-canto
            "train_nonxc": f"{DATA_HOME}/nocturnal_bird_migration/train_nonxc_v2_1.csv",
            # Training subset: xeno-canto recordings only
            "train_xc": f"{DATA_HOME}/nocturnal_bird_migration/train_xc_v2_1.csv",
            # Held-out test set
            "test": f"{DATA_HOME}/nocturnal_bird_migration/test_v2_1.csv",
        },
        version="0.1.0",
        description="Dataset of nocturnal vocalizations from migratory birds in Europe. "
        "Vocalizations are annotated with start- and end- times, as well as high- and "
        "low-frequencies.",
        sources="Zenodo, xeno-canto",
        license="CC-BY-NC-SA 4.0",
    )

    _sample_rate_paths: dict[int, str] = {16000: "16khz_path", 32000: "32khz_path"}
    _originals_path_column = "audio_path"

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species"]
        self.unknown_label = "Unknown"

        self.sample_rate = sample_rate

        self._load()

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return pre-resampled sample rates whose path columns exist in the data."""
        return [sr for sr, col in self._sample_rate_paths.items() if col in self._data.columns]

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                audio_path = anypath(self.data_root) / row[path_column]
                use_presampled = True

        if not use_presampled:
            audio_path = anypath(self.data_root) / row[self._originals_path_column]

        audio, sr = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        if not use_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
        audio_dur = len(audio) / float(sr)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        row["audio"] = audio
        row["sample_rate"] = sr
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["NocturnalBirdMigration", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """
        Return all possible labels for a given annotation column.

        ``anno_column`` is included as an optional argument for consistency
        with other detection datasets.

        Returns
        -------
        list[str]
            Sorted labels found in ``anno_column``, excluding the unknown label.
        """
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        if self.unknown_label in available_labels:
            available_labels.remove(self.unknown_label)
            # Warn only when the unknown label was actually present and removed
            warnings.warn(
                f"Events with unknown label={self.unknown_label} exist in dataset "
                f"but {self.unknown_label} suppressed from get_available_labels output",
                stacklevel=2,
            )
        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
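
The choice between pre-resampled and original audio in `_process` above can be sketched in isolation. This is a minimal illustration, not part of `alp_data`: the helper name `choose_audio_column` and the example row are hypothetical, and the column names are assumed to match `_sample_rate_paths` and `_originals_path_column` as shown in the source.

```python
from typing import Any, Optional

# Assumed to mirror NocturnalBirdMigration._sample_rate_paths and
# NocturnalBirdMigration._originals_path_column from the listing above.
SAMPLE_RATE_PATHS = {16000: "16khz_path", 32000: "32khz_path"}
ORIGINALS_PATH_COLUMN = "audio_path"


def choose_audio_column(row: dict[str, Any], sample_rate: Optional[int]) -> tuple[str, bool]:
    """Return (path_column, use_presampled) for one row (hypothetical helper).

    The pre-resampled column is used only when the requested rate has one
    and the row holds a non-empty path in it; otherwise the original file
    is used and would be resampled on the fly if its rate differs.
    """
    column = SAMPLE_RATE_PATHS.get(sample_rate)
    if column is not None and row.get(column) not in (None, ""):
        return column, True
    return ORIGINALS_PATH_COLUMN, False


# Made-up row: a 16 kHz copy exists, the 32 kHz copy is missing.
row = {"audio_path": "orig/a.wav", "16khz_path": "16k/a.wav", "32khz_path": ""}
print(choose_audio_column(row, 16000))  # ('16khz_path', True)
print(choose_audio_column(row, 32000))  # ('audio_path', False)
print(choose_audio_column(row, 22050))  # ('audio_path', False)
```

Falling back to the original file when a pre-resampled path is empty keeps rows usable even if only some recordings were pre-resampled.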

`PipitId`

📊 Dataset Information

Name	`pipit_id`
Version	`0.1.0`
Owner	david
License	CC-BY-4.0
Sources	https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
Available Splits	`train_within_year`, `test_within_year`, `train_across_year`, `test_across_year`

Description:

Individual identification of tree pipits (Anthus trivialis)

Tree Pipit individual ID dataset.

Description

Vocalisations released by Stowell et al. for individual male Tree Pipits (Anthus trivialis). Provides both within-year and across-year evaluation schemes.

This dataset includes train and test splits within year (train_within_year, test_within_year) and across year (train_across_year, test_across_year). Test within year tests on recordings from the same year as the training data, though different days, while test across year tests on recordings from different years, giving harder test conditions, with potential differences in acoustic environment or vocalisation characteristics.

References

https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940

Zenodo: https://zenodo.org/records/1413495

Examples:

>>> from alp_data.datasets import PipitId
>>> dataset = PipitId(
...     split="test_within_year",
...     sample_rate=16000,
...     streaming=True,
... )

Source code in alp_data/datasets/pipit_id.py

@register_dataset
class PipitId(Dataset):
    """Tree Pipit individual ID dataset.

    Description
    -----------
    Vocalisations released by Stowell et al. for individual male Tree Pipits
    (Anthus trivialis). Provides both *within-year* and *across-year*
    evaluation schemes.

    This dataset includes train and test splits within year
    (train_within_year, test_within_year)
    and across year (train_across_year, test_across_year). Test within year tests
    on recordings from the same year as the training data, though different days,
    while test across year tests on recordings from different years, giving harder
    test conditions, with potential differences in acoustic environment or
    vocalisation characteristics.

    References
    ----------
    https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
    Zenodo: https://zenodo.org/records/1413495

    Examples
    --------
    >>> from alp_data.datasets import PipitId
    >>> dataset = PipitId(
    ...     split="test_within_year",
    ...     sample_rate=16000,
    ...     streaming=True,
    ... )
    """

    info = DatasetInfo(
        name="pipit_id",
        owner="david",
        split_paths={
            "train_within_year": f"{DATA_HOME}/pipit_id/v0.1.0/raw/withinyear_fg_train.csv",
            "test_within_year": f"{DATA_HOME}/pipit_id/v0.1.0/raw/withinyear_fg_test.csv",
            "train_across_year": f"{DATA_HOME}/pipit_id/v0.1.0/raw/acrossyear_fg_train.csv",
            "test_across_year": f"{DATA_HOME}/pipit_id/v0.1.0/raw/acrossyear_fg_test.csv",
        },
        version="0.1.0",
        description="Individual identification of tree pipits (Anthus trivialis)",
        sources=["https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940"],
        license="CC-BY-4.0",
    )

    def __init__(
        self,
        split: str = "train_within_year",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the PipitId dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str] | None
            A dictionary mapping the original column names to the new column names.
            It also acts as a filter: only the mapped columns are returned.
        sample_rate : int | None
            The sample rate to which audio files should be resampled.
            If None, audio is returned at its native sample rate.
        data_root : str | AnyPathT | None
            The root directory for the dataset. It is prepended to the
            ``local_path`` item of each sample.
            If None, the default is the parent directory of the split path.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame = None
        self._load()
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
            infer_schema_length=10000,
            columns=["local_path", "individual_id"],  # for polars
            usecols=["local_path", "individual_id"],  # for pandas
        )

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["PipitId", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """

        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        # Do not include split in kwargs if not defined and let __init__ use the default
        kwargs = {
            "output_take_and_give": cfg["output_take_and_give"],
            "data_root": cfg["data_root"],
            "sample_rate": cfg["sample_rate"],
            "backend": cfg["backend"],
            "streaming": cfg["streaming"],
        }
        if cfg.get("split") is not None:
            kwargs["split"] = cfg["split"]

        ds = cls(**kwargs)

        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("Dataset not loaded.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode for this dataset.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        audio_path = anypath(self.data_root) / row["local_path"]

        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate and sample_rate != self.sample_rate:
            audio = librosa.resample(
                audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate
        if self.output_take_and_give:
            return {new: row[old] for old, new in self.output_take_and_give.items()}
        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
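
The `output_take_and_give` argument both renames and filters the keys of each processed sample, as the last lines of `_process` show. A minimal standalone sketch of that behaviour follows; the function name `take_and_give` and the example row are hypothetical and only illustrate the dictionary logic, not the `alp_data` API.

```python
from typing import Any, Optional


def take_and_give(row: dict[str, Any], mapping: Optional[dict[str, str]]) -> dict[str, Any]:
    """Rename and filter the keys of a row (hypothetical helper).

    With no mapping, the row is returned unchanged. With a mapping, only the
    mapped keys survive, each under its new name. A mapped key that is absent
    from the row raises KeyError.
    """
    if not mapping:
        return row
    return {new_key: row[old_key] for old_key, new_key in mapping.items()}


# Made-up row for illustration.
row = {"local_path": "a.wav", "individual_id": "bird_07", "sample_rate": 16000}
print(take_and_give(row, {"individual_id": "label"}))  # {'label': 'bird_07'}
print(take_and_give(row, None) is row)  # True
```

Because unmapped keys are dropped, a mapping that should keep `audio` must list it explicitly, for example `{"audio": "audio", "individual_id": "label"}`.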

`Powdermill`

📊 Dataset Information

Name	`powdermill`
Version	`0.1.0`
Owner	benjamin
License	Public Domain
Sources	Dryad
Available Splits	`all`

Powdermill Dataset

Description

Dataset of bird vocalizations with bounding boxes, originally released in: "An annotated set of audio recordings of Eastern North American birds containing frequency, time, and species information" by Lauren Chronister et al. (2021).

Description from the original:

"Acoustic recordings of soundscapes are an important category of audio data that can be useful for answering a variety of questions, and an entire discipline within ecology, dubbed “soundscape ecology,” has risen to study them. Bird sound is often the focus of studies of soundscapes due to the ubiquitousness of birds in most terrestrial environments and their high vocal activity. Autonomous acoustic recorders have increased the quantity and availability of recordings of natural soundscapes while mitigating the impact of human observers on community behavior. However, such recordings are of little use without analysis of the sounds they contain. Manual analysis currently stands as the best means of processing this form of data for use in certain applications within soundscape ecology, but it is a laborious task, sometimes requiring many hours of human review to process comparatively few hours of recording. For this reason, few annotated data sets of soundscape recordings are publicly available. Further still, there are no publicly available strongly labeled soundscape recordings of bird sounds that contain information on timing, frequency, and species. Therefore, we present the first data set of strongly labeled bird sound soundscape recordings under free use license. These data were collected in the Northeastern United States at Powdermill Nature Reserve, Rector, Pennsylvania, USA. Recordings encompass 385 minutes of dawn chorus recordings collected by autonomous acoustic recorders between the months of April through July 2018. Recordings were collected in continuous bouts on four days during the study period and contain 48 species and 16,052 annotations. Applications of this data set may be numerous and include the training, validation, and testing of certain advanced machine-learning models that detect or classify bird sounds. There are no copyright or propriety restrictions; please cite this paper when using materials within."

Note that this data was included in the BEANS "detection", i.e. multi-label classification, benchmark, under the name ENABirds.

Each entry consists of:

- an audio recording
- a selection table (Raven format), with Species labels
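
To make the selection-table handling concrete, here is a small sketch of how a tab-separated table can be parsed and restricted to events that begin before the audio ends. It uses the standard-library `csv` module rather than the `pandas` call in the source, the helper name `events_within_audio` is hypothetical, and the table contents are made up; only the `Begin Time (s)` and `Species` column names are taken from the source.

```python
import csv
from io import StringIO


def events_within_audio(
    selection_table: str, num_samples: int, sample_rate: int
) -> list[dict[str, str]]:
    """Keep events whose begin time falls before the end of the audio."""
    audio_dur = num_samples / float(sample_rate)
    reader = csv.DictReader(StringIO(selection_table), delimiter="\t")
    return [event for event in reader if float(event["Begin Time (s)"]) < audio_dur]


# Illustrative table with placeholder species labels.
table = (
    "Begin Time (s)\tEnd Time (s)\tSpecies\n"
    "0.5\t1.2\tsp_a\n"
    "9.8\t10.4\tsp_b\n"
    "12.0\t12.5\tsp_a\n"
)

# 160000 samples at 16 kHz is 10 seconds of audio.
kept = events_within_audio(table, num_samples=160000, sample_rate=16000)
print([event["Species"] for event in kept])  # ['sp_a', 'sp_b']
```

Note that the second event is kept even though it ends after the audio does: the filter looks only at the begin time and does not trim end times.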

References

https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.3329

Source code in alp_data/datasets/powdermill.py

@register_dataset
class Powdermill(Dataset):
    """Powdermill Dataset

    Description
    -----------
    Dataset of bird vocalizations with bounding boxes, originally released in:
    "An annotated set of audio recordings of Eastern North American birds containing
    frequency, time, and species information" by Lauren Chronister et al. (2021).

    Description from the original:

    "Acoustic recordings of soundscapes are an important category of audio data that
    can be useful for answering a variety of questions, and an entire discipline
    within ecology, dubbed “soundscape ecology,” has risen to study them. Bird sound
    is often the focus of studies of soundscapes due to the ubiquitousness of birds
    in most terrestrial environments and their high vocal activity. Autonomous
    acoustic recorders have increased the quantity and availability of recordings
    of natural soundscapes while mitigating the impact of human observers on
    community behavior. However, such recordings are of little use without analysis
    of the sounds they contain. Manual analysis currently stands as the best means
    of processing this form of data for use in certain applications within
    soundscape ecology, but it is a laborious task, sometimes requiring many hours
    of human review to process comparatively few hours of recording. For this reason,
    few annotated data sets of soundscape recordings are publicly available. Further
    still, there are no publicly available strongly labeled soundscape recordings of
    bird sounds that contain information on timing, frequency, and species. Therefore,
    we present the first data set of strongly labeled bird sound soundscape recordings
    under free use license. These data were collected in the Northeastern United States
    at Powdermill Nature Reserve, Rector, Pennsylvania, USA. Recordings encompass 385
    minutes of dawn chorus recordings collected by autonomous acoustic recorders between
    the months of April through July 2018. Recordings were collected in continuous bouts
    on four days during the study period and contain 48 species and 16,052 annotations.
    Applications of this data set may be numerous and include the training, validation,
    and testing of certain advanced machine-learning models that detect or classify bird
    sounds. There are no copyright or propriety restrictions; please cite this paper when
    using materials within."

    Note that this data was included in the BEANS "detection", i.e. multi-label
    classification, benchmark, under the name ENABirds.

    Each entry consists of:
    - an audio recording
    - a selection table (Raven format), with Species labels

    References
    ----------
    https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.3329

    """

    info = DatasetInfo(
        name="powdermill",
        owner="benjamin",
        split_paths={
            "all": f"{DATA_HOME}/powdermill/all_gbif.csv",
        },
        version="0.1.0",
        description="[MISSING]",
        sources="Dryad",
        license="Public Domain",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species"]

        self.sample_rate = sample_rate
        self.data_root = anypath(data_root) if data_root is not None else None

        # Load split CSV
        self._load()

        # If no explicit data_root, assume parent dir of the split path
        if self.data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """

        # Resolve audio path
        audio_path = (
            (self.data_root / row["audio_path"]) if self.data_root else anypath(row["audio_path"])
        )

        # Read audio
        audio, sample_rate = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        # Resample if necessary
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # Selection table
        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")

        # Clip events outside audio (keep only events that begin before audio end)
        audio_dur = len(audio) / float(sample_rate)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["Powdermill", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """
        Return all possible labels for a given annotation column.

        ``anno_column`` is included as an optional argument for consistency
        with other detection datasets.

        Returns
        -------
        list[str]
            Sorted labels found in ``anno_column``.
        """
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
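
The label collection performed by `get_available_labels` can be illustrated without the dataset itself. The sketch below is an assumption-laden stand-in: it parses tables with the standard-library `csv` module instead of `pandas`, the function name `collect_labels` is hypothetical, and the optional `unknown_label` argument is modelled on the `NocturnalBirdMigration` variant rather than on `Powdermill`.

```python
import csv
import warnings
from io import StringIO
from typing import Iterable, Optional


def collect_labels(
    selection_tables: Iterable[str],
    anno_column: str = "Species",
    unknown_label: Optional[str] = None,
) -> list[str]:
    """Return the sorted set of labels found across tab-separated tables."""
    labels: set[str] = set()
    for table in selection_tables:
        reader = csv.DictReader(StringIO(table), delimiter="\t")
        labels.update(row[anno_column] for row in reader)
    if unknown_label is not None and unknown_label in labels:
        # Drop the placeholder label and tell the caller it was present.
        labels.remove(unknown_label)
        warnings.warn(
            f"Label {unknown_label!r} is present but omitted from the result",
            stacklevel=2,
        )
    return sorted(labels)


# Made-up tables with placeholder labels.
tables = [
    "Begin Time (s)\tSpecies\n0.5\tsp_b\n1.0\tsp_a\n",
    "Begin Time (s)\tSpecies\n2.0\tsp_a\n3.5\tUnknown\n",
]
print(collect_labels(tables))  # ['Unknown', 'sp_a', 'sp_b']
```

Every selection table is parsed in full, so on large splits the result is worth caching rather than recomputing.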

`Subsegmentation`

📊 Dataset Information

Name	`subsegmentation`
Version	`0.1.0`
Owner	benjamin
License	private
Sources	Logan James
Available Splits	`all`, `train`, `val`, `test`, `single_song_all`, `single_song_train`, `single_song_val`, `single_song_test`

Bird Song Subsegmentation Dataset

Description

Bird Song subsegmentation dataset from Logan James' paper "Pervasive patterns in the songs of passerine birds resemble human music universals and are linked with production and cognitive mechanisms"

Currently, this dataset is for internal use but we hope to release it publicly. The recordings come from xeno-canto and the annotations come from the paper.

Each entry consists of:

- an audio recording
- a selection table with start- and stop-times of song syllables
- a boolean indicating if it passed quality control (i.e. if it was sub-segmentable)
- annotations of Species, Genus, Order, and Family.

Splits

Original (multi-song recordings): "all", "train", "val", "test"

Single-song (one song per item, times re-zeroed): "single_song_all", "single_song_train", "single_song_val", "single_song_test"

Each selection table has, for each syllable, a column for each of Species, Genus, Order, and Family, as well as an Annotation:

'a' indicates a syllable that is the beginning of a song (defined as at least 500 ms of silence before)

'z' indicates a syllable that is the end of a song (defined as at least 500 ms of silence after)

's' indicates all other syllables
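
As an illustration of how the `'a'`, `'s'` and `'z'` markers delimit songs, the following sketch groups syllable indices into songs. It is not part of `alp_data`: the function name `group_songs` is hypothetical, and it assumes the syllables are given in time order. How a one-syllable song is annotated is not stated here, so the sketch accepts either a lone `'a'` or a lone `'z'`.

```python
def group_songs(annotations: list[str]) -> list[list[int]]:
    """Group syllable indices into songs using the 'a' / 's' / 'z' markers."""
    songs: list[list[int]] = []
    current: list[int] = []
    for index, marker in enumerate(annotations):
        # An 'a' opens a new song; close any song left open before it.
        if marker == "a" and current:
            songs.append(current)
            current = []
        current.append(index)
        # A 'z' closes the song it belongs to.
        if marker == "z":
            songs.append(current)
            current = []
    # Keep a trailing song that was never closed by a 'z'.
    if current:
        songs.append(current)
    return songs


print(group_songs(["a", "s", "s", "z", "a", "z"]))  # [[0, 1, 2, 3], [4, 5]]
```

The single-song splits already provide one song per item, so a grouping like this would only be relevant for the original multi-song splits.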

References

https://www.biorxiv.org/content/biorxiv/early/2024/07/17/2024.07.15.603339.full.pdf

Source code in alp_data/datasets/subsegmentation.py

@register_dataset
class Subsegmentation(Dataset):
    """Bird Song Subsegmentation Dataset

    Description
    -----------
    Bird Song subsegmentation dataset from Logan James' paper "Pervasive patterns in
    the songs of passerine birds resemble human music universals and are linked with
    production and cognitive mechanisms"

    Currently, this dataset is for internal use but we hope to release it publicly.
    The recordings come from xeno-canto and the annotations come from the paper.

    Each entry consists of:
    - an audio recording
    - a selection table with start- and stop-times of song syllables
    - a boolean indicating if it passed quality control (i.e. if it was sub-segmentable)
    - annotations of Species, Genus, Order, and Family.

    Splits
    ------
    Original (multi-song recordings): "all", "train", "val", "test"
    Single-song (one song per item, times re-zeroed): "single_song_all",
        "single_song_train", "single_song_val", "single_song_test"

    Each selection table has, for each syllable, a column for each of Species, Genus,
    Order, and Family, as well as an Annotation:

    'a' indicates a syllable that is the beginning of a song (we define as at least 500 ms
        silence before)
    'z' indicates a syllable that is the end of a song (we define as at least 500 ms silence
        after)
    's' indicates all other syllables


    References
    ----------
    https://www.biorxiv.org/content/biorxiv/early/2024/07/17/2024.07.15.603339.full.pdf


    """

    info = DatasetInfo(
        name="subsegmentation",
        owner="benjamin",
        split_paths={
            "all": "gs://subsegmentation/xeno_canto_annotations/all.csv",
            "train": "gs://subsegmentation/xeno_canto_annotations/train.csv",
            "val": "gs://subsegmentation/xeno_canto_annotations/val.csv",
            "test": "gs://subsegmentation/xeno_canto_annotations/test.csv",
            "single_song_all": "gs://subsegmentation/single_song/all.csv",
            "single_song_train": "gs://subsegmentation/single_song/train.csv",
            "single_song_val": "gs://subsegmentation/single_song/val.csv",
            "single_song_test": "gs://subsegmentation/single_song/test.csv",
        },
        version="0.1.0",
        description="[MISSING]",
        sources="Logan James",
        license="private",
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species", "Genus", "Family", "Order", "Annotation"]

        self.sample_rate = sample_rate
        self.data_root = anypath(data_root) if data_root is not None else None

        # Load split CSV
        self._load()

        # If no explicit data_root, assume parent dir of the split path
        if self.data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """

        # Resolve audio path
        audio_path = (
            (self.data_root / row["audio_path"]) if self.data_root else anypath(row["audio_path"])
        )

        # Read audio
        audio, sample_rate = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        # Resample if necessary
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # Selection table
        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")

        # Drop events that begin at or after the end of the audio
        audio_dur = len(audio) / float(sample_rate)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        # Build output
        row["audio"] = audio
        row["sample_rate"] = sample_rate
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["Subsegmentation", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """
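        Return all possible labels for a given annotation column.

        Parameters
        ----------
        anno_column : str
            Annotation column of the selection tables, by default "Species".

        Returns
        -------
        list[str]
            A sorted list of the unique labels found in anno_column.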
        """
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
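The selection-table handling in `_process` above (parse the tab-separated table stored in each row, then keep only events that begin before the audio ends) can be sketched without the library. This is a dependency-free illustration that uses the standard `csv` module instead of pandas; `parse_selection_table` and the sample rows are hypothetical, and the `End Time (s)` column is assumed from the usual Raven layout rather than confirmed by the source.

```python
import csv
from io import StringIO


def parse_selection_table(table_str, audio_duration):
    """Parse a tab-separated selection table and drop events starting after the audio ends."""
    reader = csv.DictReader(StringIO(table_str), delimiter="\t")
    return [row for row in reader if float(row["Begin Time (s)"]) < audio_duration]


# Made-up table: the third event starts after a 3-second recording has ended.
table_str = (
    "Begin Time (s)\tEnd Time (s)\tAnnotation\n"
    "0.10\t0.25\ta\n"
    "0.40\t0.55\tz\n"
    "3.20\t3.35\ta\n"
)
events = parse_selection_table(table_str, audio_duration=3.0)
print([event["Annotation"] for event in events])
```

Note that an event that begins before the end of the audio but finishes after it is kept unchanged, which matches the filter on the begin time only.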

`SuperbStarling`

📊 Dataset Information

Name	`superb_starling`
Version	`0.1.0`
Owner	Sara
License	CC0 1.0
Sources	Kenya field recordings
Available Splits	`all`

Description:

superb starling flight calls with individual ID and group ID annotations

Superb Starling Dataset

Description

Dataset of superb starling (Lamprotornis superbus) flight calls with precise time bounds, individual ID, and social group ID

Each entry includes:
- An audio clip containing one flight call
- Annotations for exact start/stop of the call in audio clip
- Metadata (bird ID, group, sex, ring, timestamp)

The metadata file is a tab-separated text file formatted as a Raven selection table. This lets you open all sound files in Raven and see the annotations aligned for every selection, each of which corresponds to a single flight call.
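Because the metadata is a plain tab-separated table, it can also be inspected outside Raven. The sketch below is an assumption-laden illustration: the column names `Begin Path`, `bird`, and `group` appear in the class source, but the other columns, the file names, and the IDs are invented, and the real file may contain more fields.

```python
import csv
from collections import Counter
from io import StringIO

# Invented rows in a Raven-style layout; one selection per flight call.
raven_text = (
    "Selection\tBegin Time (s)\tEnd Time (s)\tBegin Path\tbird\tgroup\n"
    "1\t0.05\t0.21\tcall_001.wav\tB1\tG1\n"
    "2\t0.03\t0.19\tcall_002.wav\tB1\tG1\n"
    "3\t0.07\t0.26\tcall_003.wav\tB2\tG2\n"
)

rows = list(csv.DictReader(StringIO(raven_text), delimiter="\t"))

# Count flight calls per individual and per social group.
calls_per_bird = Counter(row["bird"] for row in rows)
calls_per_group = Counter(row["group"] for row in rows)
print(calls_per_bird, calls_per_group)
```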

References

Keen, S. C., Meliza, C. D., & Rubenstein, D. R. (2013). Flight calls signal group and individual identity but not kinship in a cooperatively breeding bird. Behavioral Ecology, 24(6), 1279-1285. https://doi.org/10.5061/dryad.p1n88

Source code in alp_data/datasets/superb_starling.py

@register_dataset
class SuperbStarling(Dataset):
    """Superb Starling Dataset

    Description
    -----------
    Dataset of superb starling (Lamprotornis superbus) flight calls with precise time bounds,
    individual ID, and social group ID

    Each entry includes:
    - An audio clip containing one flight call
    - Annotations for exact start/stop of the call in audio clip
    - Metadata (bird ID, group, sex, ring, timestamp)

    The metadata file is a tab-separated text file formatted as a Raven selection table.
    This lets you open all sound files in Raven and see the annotations aligned for every
    selection, each of which corresponds to a single flight call.

    References
    ----------
    Keen, S. C., Meliza, C. D., & Rubenstein, D. R. (2013). Flight calls signal group and
    individual identity but not kinship in a cooperatively breeding bird.
    Behavioral Ecology, 24(6), 1279-1285.
    https://doi.org/10.5061/dryad.p1n88

    """

    info = DatasetInfo(
        name="superb_starling",
        owner="Sara",
        split_paths={
            "all": f"{DATA_HOME}/superb-starlings-keen/v0.1.0/organized_data/superb_starlings_flightcalls.txt",  # noqa: E501
        },
        version="0.1.0",
        description="superb starling flight calls with individual ID and group ID annotations",
        sources="Kenya field recordings",
        license="CC0 1.0",  # https://datadryad.org/dataset/doi:10.5061/dryad.p1n88
    )

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row's audio path.
            If None, will use the parent directory of the split file.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame | None = None
        self.annotation_columns = ["Species", "bird", "group", "sex", "ring"]

        self.sample_rate = sample_rate
        self.data_root = anypath(data_root) if data_root is not None else None

        # Load split file
        self._load()

        # If no explicit data_root, assume parent dir of the split path
        if self.data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        # Read tab-separated file
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            sep="\t",
            separator="\t",
            keep_default_na=False,
            na_values=[""],
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split loaded.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Get full audio path
        audio_filename = row["Begin Path"]
        audio_path = self.data_root / audio_filename

        # Read audio
        audio, sample_rate = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        # Resample if necessary
        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        # Add audio and sample rate to output
        row["audio"] = audio
        row["sample_rate"] = sample_rate

        # Calculate duration from audio
        row["duration_secs"] = len(audio) / float(sample_rate)

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        for i in range(len(self)):
            yield self[i]

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["SuperbStarling", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """
        Return all possible labels for a given annotation column.

        Parameters
        ----------
        anno_column : str
            The annotation column to get labels from. Options include:
            'Species', 'bird', 'group', 'sex', 'ring'

        Returns
        -------
        list[str]
            A sorted list of all unique values in the specified column.

        Raises
        ------
        ValueError
            If the specified column is not found in the dataset.
        """
        if self._data is None:
            return []
        if anno_column not in self._data.columns:
            raise ValueError(f"Column '{anno_column}' not found. Available columns: {self.columns}")
        return sorted(np.array(self._data.get_unique(anno_column)).astype(str).tolist())

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}\n"
            f"Total vocalizations: {len(self) if self._data is not None else 'N/A'}"
        )

`Voxaboxen`

📊 Dataset Information

Name	`voxaboxen`
Version	`0.1.0`
Owner	benjamin; gagan
License	CC BY
Sources	Anuraset, BV, MT, OZF, Hawaii, Humpback, Katydids, Powdermill
Available Splits	`Anuraset_train`, `Anuraset_val`, `Anuraset_test`, `BV_train`, `BV_val`, `BV_test`, `MT_train`, `MT_val`, `MT_test`, `OZF_train`, ... (39 total)

Description:

Voxaboxen dataset for acoustic sound event detection

Voxaboxen dataset.

Description

Voxaboxen is the dataset used in the Voxaboxen project. It consists of several datasets with annotated call start and end times via selection tables. Excerpt from paper: "...a method for accurately detecting bioacoustic sound events that is robust to overlapping events... We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally-strong labels and showing frequent overlaps. We test Voxaboxen on seven existing data sets and on our new data set."

References

Robust detection of overlapping bioacoustic sound events. Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano, Masato Hagiwara, Sarah C Woolley, Olivier Pietquin. https://arxiv.org/abs/2503.02389

Examples:

>>> from alp_data.datasets import Voxaboxen
>>> dataset = Voxaboxen(
...     split="BV_val",
...     output_take_and_give={"selection_table": "st"}
... )
>>> print(dataset.info.name)
voxaboxen
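The `output_take_and_give` argument in the example both renames and filters the keys of each sample. The standalone sketch below mirrors the mapping loop in `_process`; the helper name `take_and_give` and the sample values are hypothetical and are not part of the library API.

```python
def take_and_give(row, mapping=None):
    """Keep only the keys named in mapping and rename them; return row unchanged if no mapping."""
    if not mapping:
        return row
    return {new_key: row[old_key] for old_key, new_key in mapping.items()}


# A made-up processed sample with three keys.
row = {"audio": [0.0, 0.1, -0.1], "sample_rate": 16000, "selection_table": "table"}

item = take_and_give(row, {"selection_table": "st"})
print(item)

unfiltered = take_and_give(row)
print(sorted(unfiltered))
```

With the mapping from the example, only the selection table survives, under the new key `st`; a key listed in the mapping but missing from the row raises `KeyError`, as it would in the loop it imitates.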

Source code in alp_data/datasets/voxaboxen.py

@register_dataset
class Voxaboxen(Dataset):
    """Voxaboxen dataset.

    Description
    -----------
    Voxaboxen is the dataset used in the Voxaboxen project. It consists of
    several datasets with annotated call start and end times via selection tables.
    Excerpt from paper:
    "...a method for accurately detecting bioacoustic sound events that is robust
    to overlapping events... We also release a new dataset designed to measure
    performance on detecting overlapping vocalizations. This consists of recordings of
    zebra finches annotated with temporally-strong labels and showing frequent overlaps.
    We test Voxaboxen on seven existing data sets and on our new data set."

    References
    ----------
    Robust detection of overlapping bioacoustic sound events
    Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano,
    Masato Hagiwara, Sarah C Woolley, Olivier Pietquin
    https://arxiv.org/abs/2503.02389

    Examples
    --------
    >>> from alp_data.datasets import Voxaboxen
    >>> dataset = Voxaboxen(
    ...     split="BV_val",
    ...     output_take_and_give={"selection_table": "st"}
    ... )
    >>> print(dataset.info.name)
    voxaboxen
    """

    info = DatasetInfo(
        name="voxaboxen",
        owner="benjamin; gagan",
        split_paths={
            "Anuraset_train": f"{_FILES_ROOT}/Anuraset/formatted/train_info.csv",
            "Anuraset_val": f"{_FILES_ROOT}/Anuraset/formatted/val_info.csv",
            "Anuraset_test": f"{_FILES_ROOT}/Anuraset/formatted/test_info.csv",
            "BV_train": f"{_FILES_ROOT}/BV/formatted/train_info.csv",
            "BV_val": f"{_FILES_ROOT}/BV/formatted/val_info.csv",
            "BV_test": f"{_FILES_ROOT}/BV/formatted/test_info.csv",
            "MT_train": f"{_FILES_ROOT}/MT/formatted/train_info.csv",
            "MT_val": f"{_FILES_ROOT}/MT/formatted/val_info.csv",
            "MT_test": f"{_FILES_ROOT}/MT/formatted/test_info.csv",
            "OZF_train": f"{_FILES_ROOT}/OZF/formatted/train_info.csv",
            "OZF_val": f"{_FILES_ROOT}/OZF/formatted/val_info.csv",
            "OZF_test": f"{_FILES_ROOT}/OZF/formatted/test_info.csv",
            "hawaii_train": f"{_FILES_ROOT}/hawaii/formatted/train_info.csv",
            "hawaii_val": f"{_FILES_ROOT}/hawaii/formatted/val_info.csv",
            "hawaii_test": f"{_FILES_ROOT}/hawaii/formatted/test_info.csv",
            "humpback_train": f"{_FILES_ROOT}/humpback/formatted/train_info.csv",
            "humpback_val": f"{_FILES_ROOT}/humpback/formatted/val_info.csv",
            "humpback_test": f"{_FILES_ROOT}/humpback/formatted/test_info.csv",
            "katydids_train": f"{_FILES_ROOT}/katydids/formatted/train_info.csv",
            "katydids_val": f"{_FILES_ROOT}/katydids/formatted/val_info.csv",
            "katydids_test": f"{_FILES_ROOT}/katydids/formatted/test_info.csv",
            "powdermill_train": f"{_FILES_ROOT}/powdermill/formatted/train_info.csv",
            "powdermill_val": f"{_FILES_ROOT}/powdermill/formatted/val_info.csv",
            "powdermill_test": f"{_FILES_ROOT}/powdermill/formatted/test_info.csv",
            "OZF_synthetic_overlap_0_train": f"{_OZF_SYN_ROOT}/overlap_0/train_info.csv",
            "OZF_synthetic_overlap_0_val": f"{_OZF_SYN_ROOT}/overlap_0/val_info.csv",
            "OZF_synthetic_overlap_0_test": f"{_OZF_SYN_ROOT}/overlap_0/test_info.csv",
            "OZF_synthetic_overlap_0.2_train": f"{_OZF_SYN_ROOT}/overlap_0.2/train_info.csv",
            "OZF_synthetic_overlap_0.2_val": f"{_OZF_SYN_ROOT}/overlap_0.2/val_info.csv",
            "OZF_synthetic_overlap_0.2_test": f"{_OZF_SYN_ROOT}/overlap_0.2/test_info.csv",
            "OZF_synthetic_overlap_0.4_train": f"{_OZF_SYN_ROOT}/overlap_0.4/train_info.csv",
            "OZF_synthetic_overlap_0.4_val": f"{_OZF_SYN_ROOT}/overlap_0.4/val_info.csv",
            "OZF_synthetic_overlap_0.4_test": f"{_OZF_SYN_ROOT}/overlap_0.4/test_info.csv",
            "OZF_synthetic_overlap_0.6_train": f"{_OZF_SYN_ROOT}/overlap_0.6/train_info.csv",
            "OZF_synthetic_overlap_0.6_val": f"{_OZF_SYN_ROOT}/overlap_0.6/val_info.csv",
            "OZF_synthetic_overlap_0.6_test": f"{_OZF_SYN_ROOT}/overlap_0.6/test_info.csv",
            "OZF_synthetic_overlap_1_train": f"{_OZF_SYN_ROOT}/overlap_1/train_info.csv",
            "OZF_synthetic_overlap_1_val": f"{_OZF_SYN_ROOT}/overlap_1/val_info.csv",
            "OZF_synthetic_overlap_1_test": f"{_OZF_SYN_ROOT}/overlap_1/test_info.csv",
        },
        version="0.1.0",
        description="Voxaboxen dataset for acoustic sound event detection",
        sources=[
            "Anuraset",
            "BV",
            "MT",
            "OZF",
            "Hawaii",
            "Humpback",
            "Katydids",
            "Powdermill",
        ],
        license="CC BY",
    )

    def __init__(
        self,
        split: str = "Anuraset_train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        mono_method: str | None = "average",
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the Voxaboxen dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str] | None
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int | None
            The sample rate to which audio files should be resampled.
            If None, audio keeps its original sample rate.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. It is prepended to the
            audio_fp item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
        mono_method : str, optional
            Method to convert stereo audio to mono. Defaults to "average".
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data: pd.DataFrame | None = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate
        self.mono_method = mono_method

        if data_root is None:
            # we assume that parent dir of the split path is the data root
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(cls, dataset_config: VoxaboxenConfig) -> tuple["Voxaboxen", dict[str, Any]]:
        """Create a Voxaboxen instance from a configuration object.

        Parameters
        ----------
        dataset_config : VoxaboxenConfig
            Configuration object containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            mono_method=cfg["mono_method"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise RuntimeError("Length is not available in streaming mode.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Build the full audio path from the data root
        audio_path = anypath(self.data_root) / row["audio_fp"]
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)

        if self.mono_method:
            audio = audio_stereo_to_mono(audio, mono_method=self.mono_method)

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        # read selection table
        row["selection_table"] = pd.read_csv(StringIO(row["selection_table_str"]), sep="\t")

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the row's columns plus the audio, sample rate,
            and parsed selection table (or the output_take_and_give subset of these).
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            and basic statistics if data is loaded.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`VoxaboxenEvents`

📊 Dataset Information

Name	`voxaboxen_events`
Version	`0.1.0`
Owner	benjamin; gagan
License	CC BY
Sources	Anuraset, BV, MT, OZF, Hawaii, Humpback, Katydids, Powdermill
Available Splits	`Anuraset_train`, `Anuraset_val`, `Anuraset_test`, `BV_train`, `BV_val`, `BV_test`, `MT_train`, `MT_val`, `MT_test`, `OZF_train`, ... (39 total)

Description:

Voxaboxen events dataset for acoustic sound event detection

Voxaboxen dataset as events

Description

Same as Voxaboxen, but the audio is split according to the information in the selection table.
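One way to picture "split according to the information in the selection table" is to cut one clip per event from the full recording. The sketch below is only an assumption about the general idea, not the class's actual implementation: `split_into_events` is a hypothetical helper, events are given as plain (begin, end) pairs in seconds, and the audio is a Python list standing in for a sample array.

```python
def split_into_events(audio, sample_rate, events):
    """Cut one clip per (begin, end) event in seconds, clamped to the audio bounds."""
    clips = []
    for begin, end in events:
        start = max(0, int(round(begin * sample_rate)))
        stop = min(len(audio), int(round(end * sample_rate)))
        if stop > start:  # skip events that fall entirely outside the audio
            clips.append(audio[start:stop])
    return clips


# Toy example: 3 seconds of "audio" at 10 samples per second.
sample_rate = 10
audio = list(range(30))
events = [(0.5, 1.0), (2.5, 4.0), (5.0, 6.0)]

clips = split_into_events(audio, sample_rate, events)
print([len(clip) for clip in clips])
```

Here the second event is truncated at the end of the recording and the third is dropped because it starts after the audio ends.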

References

Robust detection of overlapping bioacoustic sound events. Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano, Masato Hagiwara, Sarah C Woolley, Olivier Pietquin. https://arxiv.org/abs/2503.02389

Examples:

>>> from alp_data.datasets import VoxaboxenEvents
>>> dataset = VoxaboxenEvents(
...     split="BV_val",
...     output_take_and_give={"selection_table": "st"}
... )
>>> print(dataset.info.name)
voxaboxen_events

Source code in alp_data/datasets/voxaboxen.py

@register_dataset
class VoxaboxenEvents(Dataset):
    """Voxaboxen dataset as events

    Description
    -----------
    Same as Voxaboxen, but the audio is split according to the information
    in the selection table.

    References
    ----------
    Robust detection of overlapping bioacoustic sound events
    Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano,
    Masato Hagiwara, Sarah C Woolley, Olivier Pietquin
    https://arxiv.org/abs/2503.02389

    Examples
    --------
    >>> from alp_data.datasets import VoxaboxenEvents
    >>> dataset = VoxaboxenEvents(
    ...     split="BV_val",
    ...     output_take_and_give={"selection_table": "st"}
    ... )
    >>> print(dataset.info.name)
    voxaboxen_events
    """

    info = DatasetInfo(
        name="voxaboxen_events",
        owner="benjamin; gagan",
        split_paths={
            "Anuraset_train": f"{_FILES_ROOT}/Anuraset/formatted/train_info.csv",
            "Anuraset_val": f"{_FILES_ROOT}/Anuraset/formatted/val_info.csv",
            "Anuraset_test": f"{_FILES_ROOT}/Anuraset/formatted/test_info.csv",
            "BV_train": f"{_FILES_ROOT}/BV/formatted/train_info.csv",
            "BV_val": f"{_FILES_ROOT}/BV/formatted/val_info.csv",
            "BV_test": f"{_FILES_ROOT}/BV/formatted/test_info.csv",
            "MT_train": f"{_FILES_ROOT}/MT/formatted/train_info.csv",
            "MT_val": f"{_FILES_ROOT}/MT/formatted/val_info.csv",
            "MT_test": f"{_FILES_ROOT}/MT/formatted/test_info.csv",
            "OZF_train": f"{_FILES_ROOT}/OZF/formatted/train_info.csv",
            "OZF_val": f"{_FILES_ROOT}/OZF/formatted/val_info.csv",
            "OZF_test": f"{_FILES_ROOT}/OZF/formatted/test_info.csv",
            "hawaii_train": f"{_FILES_ROOT}/hawaii/formatted/train_info.csv",
            "hawaii_val": f"{_FILES_ROOT}/hawaii/formatted/val_info.csv",
            "hawaii_test": f"{_FILES_ROOT}/hawaii/formatted/test_info.csv",
            "humpback_train": f"{_FILES_ROOT}/humpback/formatted/train_info.csv",
            "humpback_val": f"{_FILES_ROOT}/humpback/formatted/val_info.csv",
            "humpback_test": f"{_FILES_ROOT}/humpback/formatted/test_info.csv",
            "katydids_train": f"{_FILES_ROOT}/katydids/formatted/train_info.csv",
            "katydids_val": f"{_FILES_ROOT}/katydids/formatted/val_info.csv",
            "katydids_test": f"{_FILES_ROOT}/katydids/formatted/test_info.csv",
            "powdermill_train": f"{_FILES_ROOT}/powdermill/formatted/train_info.csv",
            "powdermill_val": f"{_FILES_ROOT}/powdermill/formatted/val_info.csv",
            "powdermill_test": f"{_FILES_ROOT}/powdermill/formatted/test_info.csv",
            "OZF_synthetic_overlap_0_train": f"{_OZF_SYN_ROOT}/overlap_0/train_info.csv",
            "OZF_synthetic_overlap_0_val": f"{_OZF_SYN_ROOT}/overlap_0/val_info.csv",
            "OZF_synthetic_overlap_0_test": f"{_OZF_SYN_ROOT}/overlap_0/test_info.csv",
            "OZF_synthetic_overlap_0.2_train": f"{_OZF_SYN_ROOT}/overlap_0.2/train_info.csv",
            "OZF_synthetic_overlap_0.2_val": f"{_OZF_SYN_ROOT}/overlap_0.2/val_info.csv",
            "OZF_synthetic_overlap_0.2_test": f"{_OZF_SYN_ROOT}/overlap_0.2/test_info.csv",
            "OZF_synthetic_overlap_0.4_train": f"{_OZF_SYN_ROOT}/overlap_0.4/train_info.csv",
            "OZF_synthetic_overlap_0.4_val": f"{_OZF_SYN_ROOT}/overlap_0.4/val_info.csv",
            "OZF_synthetic_overlap_0.4_test": f"{_OZF_SYN_ROOT}/overlap_0.4/test_info.csv",
            "OZF_synthetic_overlap_0.6_train": f"{_OZF_SYN_ROOT}/overlap_0.6/train_info.csv",
            "OZF_synthetic_overlap_0.6_val": f"{_OZF_SYN_ROOT}/overlap_0.6/val_info.csv",
            "OZF_synthetic_overlap_0.6_test": f"{_OZF_SYN_ROOT}/overlap_0.6/test_info.csv",
            "OZF_synthetic_overlap_1_train": f"{_OZF_SYN_ROOT}/overlap_1/train_info.csv",
            "OZF_synthetic_overlap_1_val": f"{_OZF_SYN_ROOT}/overlap_1/val_info.csv",
            "OZF_synthetic_overlap_1_test": f"{_OZF_SYN_ROOT}/overlap_1/test_info.csv",
        },
        version="0.1.0",
        description="Voxaboxen events dataset for acoustic sound event detection",
        sources=[
            "Anuraset",
            "BV",
            "MT",
            "OZF",
            "Hawaii",
            "Humpback",
            "Katydids",
            "Powdermill",
        ],
        license="CC BY",
    )

    def __init__(
        self,
        split: str = "Anuraset_train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int = 16000,
        data_root: str | AnyPathT | None = None,
        stereo_or_mono: Literal["stereo", "mono"] = "stereo",
        mono_method: Literal["average", "keep_first"] = "average",
        clip_duration: float = 10.0,
        clip_hop: float = 5.0,
        clip_start_offset: float = 0.0,
        omit_empty_clip_prob: float = 0.0,
        scale_factor: int = 1,
        segmentation_based: bool = True,
        unknown_label: str = "Unknown",
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the VoxaboxenEvents dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is optionally appended to the
            path item of a sample in the dataset.
            If None, the default is the parent directory of the split path.
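        stereo_or_mono : str, optional
            Whether audio is converted to mono ("mono") or kept multi-channel
            ("stereo"). Defaults to "stereo".
        mono_method : str, optional
            The method used to convert stereo audio to mono when
            stereo_or_mono is "mono". One of "average" (default) or "keep_first".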
        clip_duration : float, optional
            Duration of each audio clip in seconds.
        clip_hop : float, optional
            Hop size in seconds between the start times of consecutive audio clips.
        clip_start_offset : float, optional
            Offset in seconds to start the first clip. Defaults to 0.0 seconds.
            This is useful for skipping a portion of the audio before the first clip.
        scale_factor : int, optional
            Factor by which the annotation sequence length is reduced relative
            to the audio length. This is needed when representations have been
            downsampled by encoders like AVES.
        omit_empty_clip_prob : float, optional
            Probability of omitting empty clips (no annotations).
            Defaults to 0.0, meaning no empty clips are omitted.
        segmentation_based : bool, optional
            If True, the dataset is segmented based on the selection table.
            If False, the entire audio file is treated as a single segment.
            Defaults to True.
        unknown_label : str, optional
            The label used for unknown annotations. Defaults to "Unknown".
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self._load()  # Load the dataset (fills self._data)
        self.sample_rate = sample_rate
        self.stereo_or_mono = stereo_or_mono
        self.mono_method = mono_method
        self.clip_duration = clip_duration
        self.clip_hop = clip_hop
        self.clip_start_offset = clip_start_offset
        self.scale_factor = scale_factor
        self.segmentation_based = segmentation_based
        self.unknown_label = unknown_label

        self.omit_empty_clip_prob = omit_empty_clip_prob
        self.rng = default_rng()

        if data_root is None:
            # we assume that parent dir of the split path is the data root
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = data_root

        self.label_mapping: dict = None
        if split in LABEL_SETS:
            self.label_set: list = LABEL_SETS[split]
        else:
            self.label_set = []

        self.n_classes = 0
        self._create_label_map()

        self._make_metadata()

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["VoxaboxenEvents", dict[str, Any]]:
        """Create a Dataset instance from a DatasetConfig.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration object containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.

        Raises
        ------
        LookupError
            If the specified split is not available in the dataset info.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        split = cfg.get("split", None)
        if not split or split not in cls.info.split_paths:
            raise LookupError(
                f"Invalid split '{split}'. "
                f"Available splits: {', '.join(cls.info.split_paths.keys())}"
            )

        ds = cls(
            split=split,
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            mono_method=cfg.get("mono_method", "average"),
            clip_duration=cfg["clip_duration"],
            clip_hop=cfg["clip_hop"],
            clip_start_offset=cfg["clip_start_offset"],
            omit_empty_clip_prob=cfg["omit_empty_clip_prob"],
            scale_factor=cfg["scale_factor"],
            segmentation_based=cfg["segmentation_based"],
            unknown_label=cfg["unknown_label"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.
        """
        return len(self._metadata)

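    def _create_label_map(self) -> None:
        """Build the label set and an identity label mapping.

        Labels are collected from the selection tables unless a predefined
        label set exists for the split. Class indices are the positions of
        the labels in ``label_set``.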

        Raises
        ------
        ValueError
            If no labels are found in the dataset.
        """
        # load all selection tables and create a label set
        if not self.label_set:
            self.label_set = set()
            for row in self._data.itertuples():
                selection_table = pd.read_csv(StringIO(row.selection_table_str), sep="\t")
                # Accumulate the unique annotation labels of every selection table
                self.label_set.update(selection_table["Annotation"].unique().tolist())

            self.label_set.add(self.unknown_label)
            self.label_set = list(self.label_set)

        if not self.label_set:
            raise ValueError("No labels found in the dataset.")

        self.label_mapping = {label: label for label in self.label_set}
        self.n_classes = len(self.label_set)

    def _process_selection_table(self, selection_table_str: str) -> IntervalTree:
        """
        Process annotation file into interval tree format.

        Parameters
        ----------
        selection_table_str : str
            String representation of the selection table in TSV format.

        Returns
        -------
        IntervalTree
            Tree containing labeled time intervals
        """
        selection_table = pd.read_csv(StringIO(selection_table_str), sep="\t")
        tree = IntervalTree()

        for _, row in selection_table.iterrows():
            start = row["Begin Time (s)"]
            end = row["End Time (s)"]
            label = row["Annotation"]

            if end <= start:
                continue

            if label in self.label_mapping:
                label = self.label_mapping[label]
            else:
                continue

            if label == self.unknown_label:
                label_idx = -1
            else:
                label_idx = self.label_set.index(label)
            tree.addi(start, end, label_idx)

        return tree

    def _make_metadata(self) -> None:
        """Generate dataset metadata including clip boundaries."""

        selection_table_dict = dict()
        metadata = []

        for _ii, row in enumerate(self._data):
            fn = row["fn"]
            audio_fp = anypath(self.data_root) / row["audio_fp"]

            selection_table = self._process_selection_table(row["selection_table_str"])
            selection_table_dict[fn] = selection_table

            # Determine number of clips based on audio duration
            duration = row["audio_duration"]
            num_clips = max(
                0,
                int(
                    np.floor(
                        (duration - self.clip_duration - self.clip_start_offset) // self.clip_hop
                    )
                ),
            )

            for tt in range(num_clips):
                start = tt * self.clip_hop + self.clip_start_offset
                end = start + self.clip_duration

                ivs: IntervalTree = selection_table[start:end]
                # if no annotated intervals, skip with specified probability
                if not ivs:
                    if self.omit_empty_clip_prob > self.rng.uniform():
                        continue

                metadata.append([fn, str(audio_fp), start, end])

        self._selection_table_dict = selection_table_dict
        self._metadata = metadata

    def _get_pos_intervals(
        self, fn: str, start: float, end: float
    ) -> list[tuple[float, float, int]]:
        """
        Get annotated intervals within specified time range.

        Parameters
        ----------
        fn : str
            Filename identifier
        start : float
            Start time in seconds
        end : float
            End time in seconds

        Returns
        -------
        list[tuple[float, float, int]]
            List of (start, end, label_idx) tuples, one per interval. Times are
            in seconds relative to ``start`` and clipped to the requested range.
        """

        tree = self._selection_table_dict[fn]

        intervals = tree[start:end]
        intervals = [
            (max(iv.begin, start) - start, min(iv.end, end) - start, iv.data) for iv in intervals
        ]

        return intervals

    def _get_class_proportions(self) -> np.ndarray:
        """
        Calculate class distribution in dataset.

        Returns
        -------
        numpy.ndarray
            Array of class proportions
        """

        counts = np.zeros((self.n_classes,))

        for k in self._selection_table_dict:
            st = self._selection_table_dict[k]
            for interval in st:
                annot = interval.data
                if annot == -1:
                    continue
                else:
                    counts[annot] += 1

        total_count = np.sum(counts)
        proportions = counts / total_count

        return proportions

    def _get_annotation(
        self, pos_intervals: list[tuple[float, float, int]], audio: np.ndarray
    ) -> tuple[
        np.ndarray,  # anchor_annos
        np.ndarray,  # regression_annos
        np.ndarray,  # class_annos
        np.ndarray,  # rev_anchor_annos
        np.ndarray,  # rev_regression_annos
        np.ndarray,  # rev_class_annos
    ]:
        """
        Generate target annotations from positive intervals.

        Parameters
        ----------
        pos_intervals : list
            List of (start, end, label_idx) tuples
        audio : np.ndarray
            Input audio tensor

        Returns
        -------
        tuple
            Tuple containing:
            - anchor_annos: Anchor point annotations
            - regression_annos: Duration annotations
            - class_annos: Class probability annotations
            - rev_anchor_annos: Reverse anchor points
            - rev_regression_annos: Reverse duration
            - rev_class_annos: Reverse class probabilities
        """

        raw_seq_len = audio.shape[-1]
        seq_len = int(math.ceil(raw_seq_len / self.scale_factor))

        regression_annos = np.zeros((seq_len,))
        class_annos = np.zeros((seq_len, self.n_classes))
        anchor_annos = [
            np.zeros(
                seq_len,
            )
        ]
        rev_regression_annos = np.zeros((seq_len,))
        rev_class_annos = np.zeros((seq_len, self.n_classes))
        rev_anchor_annos = [
            np.zeros(
                seq_len,
            )
        ]

        for iv in pos_intervals:
            start, end, class_idx = iv
            dur = end - start

            start_idx = int(math.floor(start * self.sample_rate))
            start_idx = max(min(start_idx, seq_len - 1), 0)

            end_idx = int(math.ceil(end * self.sample_rate))
            end_idx = max(min(end_idx, seq_len - 1), 0)
            dur_samples = int(np.ceil(dur * self.sample_rate))

            anchor_anno = _get_anchor_anno(start_idx, dur_samples, seq_len)
            anchor_annos.append(anchor_anno)
            regression_annos[start_idx] = dur

            rev_anchor_anno = _get_anchor_anno(end_idx, dur_samples, seq_len)
            rev_anchor_annos.append(rev_anchor_anno)
            rev_regression_annos[end_idx] = dur

            if self.segmentation_based:
                if class_idx == -1:
                    pass
                else:
                    class_annos[start_idx : start_idx + dur_samples, class_idx] = 1.0

            else:
                if class_idx != -1:
                    class_annos[start_idx, class_idx] = 1.0
                    rev_class_annos[end_idx, class_idx] = 1.0
                else:
                    class_annos[start_idx, :] = (
                        1.0 / self.n_classes
                    )  # if unknown, enforce uncertainty
                    rev_class_annos[end_idx, :] = (
                        1.0 / self.n_classes
                    )  # if unknown, enforce uncertainty

        anchor_annos = np.stack(anchor_annos)
        anchor_annos = np.amax(anchor_annos, axis=0)
        rev_anchor_annos = np.stack(rev_anchor_annos)
        rev_anchor_annos = np.amax(rev_anchor_annos, axis=0)
        # shapes [time_steps, ], [time_steps, ], [time_steps, n_classes] (times two)
        return (
            anchor_annos,
            regression_annos,
            class_annos,
            rev_anchor_annos,
            rev_regression_annos,
            rev_class_annos,
        )

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
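            A dictionary containing the audio clip, its file name and path, the
            clip start and end times, and the anchor, regression and class
            annotations (forward and reversed). If ``output_take_and_give`` is
            set, only the mapped keys are returned, under their new names.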

        Raises
        ------
        IndexError
            If the index is out of bounds.
        """
        if idx >= len(self):
            raise IndexError(f"Index {idx} out of bounds for dataset of length {len(self)}.")

        fn, audio_fp, start, end = self._metadata[idx]

        # Read audio clip
        audio, sr = read_audio(audio_fp, start_time=start, end_time=end)
        audio = audio.astype(np.float32)

        if self.stereo_or_mono == "mono":
            audio = audio_stereo_to_mono(audio, mono_method=self.mono_method)
        else:
            channel_dim = np.argmin(audio.shape)
            if channel_dim != 0:
                audio = audio.T

        if self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )

        pos_intervals = self._get_pos_intervals(fn, start, end)
        (
            anchor_anno,
            regression_anno,
            class_anno,
            rev_anchor_anno,
            rev_regression_anno,
            rev_class_anno,
        ) = self._get_annotation(pos_intervals, audio)

        row = {
            "audio": audio,
            "fn": fn,
            "audio_fp": audio_fp,
            "start": start,
            "end": end,
            "anchor_anno": anchor_anno,
            "regression_anno": regression_anno,
            "class_anno": class_anno,
            "rev_regression_anno": rev_regression_anno,
            "rev_anchor_anno": rev_anchor_anno,
            "rev_class_anno": rev_class_anno,
        }

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for idx in range(len(self)):
            yield self[idx]

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split={self.split}"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
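
The interval handling in `_get_pos_intervals` can be illustrated without the dataset. The sketch below assumes a plain list of `(begin, end, label_idx)` events in place of the `IntervalTree` used by the class, and a hypothetical helper name `clip_relative_intervals`. It keeps the events that overlap a clip, trims them to the clip boundaries and shifts them so that times are relative to the clip start.

```python
def clip_relative_intervals(events, start, end):
    """Trim events to the clip [start, end) and express them in clip time.

    events: list of (begin, end, label_idx) tuples in recording time (seconds).
    Hypothetical stand-in for the IntervalTree lookup used by the class.
    """
    clipped = []
    for ev_begin, ev_end, label_idx in events:
        # Keep only events that overlap the clip window
        if ev_begin < end and ev_end > start:
            clipped.append(
                (max(ev_begin, start) - start, min(ev_end, end) - start, label_idx)
            )
    return clipped


events = [(2.0, 4.0, 0), (3.0, 6.0, 2), (9.0, 12.0, 1), (14.0, 18.0, 0)]
clipped = clip_relative_intervals(events, start=5.0, end=15.0)
print(clipped)
```

For the clip from 5 s to 15 s, the first event is dropped, the second and fourth are trimmed at the clip edges, and the third is only shifted.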

`WABAD`

📊 Dataset Information

Name	`wabad`
Version	`0.1.0`
Owner	benjamin
License	CC-BY-4.0
Sources	zenodo.org
Available Splits	`all`, `CAT`, `POZO`, `BRE`, `EFFOR`, `MONTEB`, `CB`, `FEU`, `BIAL`, `SPMCO`, ... (73 total)

Description:

WABAD: This database includes 5,047 minutes of audio files annotated to species-level by local experts with the start and end time, and the upper and lower frequencies of each identified bird vocalisation in the recordings. The database has a wide taxonomic and spatial coverage, including information on 91,931 vocalisations from 1,192 bird species recorded at 72 recording sites in 29 recording locations

WABAD Dataset

Description

This class makes WABAD dataset available. Each entry is an audio recording, plus a selection table. Each row of the selection table has annotations at different taxonomic granularities (stored in annotation_columns attribute). Taxonomy has been coerced into GBIF.

This class was included in alp-data (initially) for use as a zero-shot detection evaluation dataset.

Description from publication: https://www.researchgate.net/publication/387711208_WABAD_A_World_Annotated_Bird_Acoustic_Dataset_for_Passive_Acoustic_Monitoring

Under the current global biodiversity crisis, there is a need for automated and non-invasive monitoring techniques that can gather large amounts of data cost-effectively at various ecological scales, from local to large spatial scales. This data can then be analyzed to inform stakeholders and decision makers. One such technique is passive acoustic monitoring, which is commonly coupled with automatic identification of animal species based on their sound. Automated sound analyses usually require the training of sound detection and identification algorithms. These algorithms are based on annotated acoustic datasets which mark the occurrence of sounds of species inside sound recordings. However, compiling large annotated acoustic datasets is time-consuming and requires experts, and therefore they normally cover reduced spatial, temporal and taxonomic scales. This data paper presents WABAD, the World Annotated Bird Acoustic Dataset for passive acoustic monitoring. WABAD is designed to provide the public, the research community, and conservation managers with a novel and globally representative annotated acoustic dataset. This database includes 5,047 minutes of audio files annotated to species-level by local experts with the start and end time, and the upper and lower frequencies of each identified bird vocalisation in the recordings. The database has a wide taxonomic and spatial coverage, including information on 91,931 vocalisations from 1,192 bird species recorded at 72 recording sites in 29 recording locations (mainly countries) and distributed across 13 biomes. WABAD can be used, for example, for developing and/or validating automatic species detection algorithms, answering ecological questions, such as assessing geographical variations on bird vocalisations, or comparing acoustic diversity indices with species-based diversity indices. The dataset is published under a Creative Commons Attribution Non Commercial 4.0 International copyright.

Pre-resampled Audio

Pre-resampled audio is available at 16 kHz and 32 kHz. When sample_rate matches one of these rates, the pre-resampled files are loaded directly (no on-the-fly resampling). For any other target rate, audio is resampled on-the-fly using librosa's kaiser_best method.
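
A minimal sketch of this selection rule follows. The column names come from the `_sample_rate_paths` and `_originals_path_column` class attributes in the source listing; the helper `choose_audio_source`, the example paths, and the treatment of `sample_rate=None` as "keep the original rate" are assumptions made for illustration, not the class's actual implementation.

```python
# Column names taken from the WABAD class attributes
SAMPLE_RATE_PATHS = {16000: "16khz_path", 32000: "32khz_path"}
ORIGINALS_PATH_COLUMN = "audio_fp"


def choose_audio_source(row, sample_rate):
    """Return (path, needs_resampling) for one metadata row given as a dict.

    Hypothetical helper illustrating the rule described above.
    """
    column = SAMPLE_RATE_PATHS.get(sample_rate)
    if column is not None:
        # A pre-resampled copy exists at this rate: load it directly
        return row[column], False
    # Otherwise fall back to the original file; resampling is only needed
    # when a target rate was requested (assumption: None keeps the native rate)
    return row[ORIGINALS_PATH_COLUMN], sample_rate is not None


# Example row with made-up paths
row = {"audio_fp": "raw/rec.wav", "16khz_path": "16k/rec.wav", "32khz_path": "32k/rec.wav"}
print(choose_audio_source(row, 16000))
print(choose_audio_source(row, 44100))
```

Requesting 16000 or 32000 Hz picks the matching pre-resampled file, while any other rate falls back to the original file and flags it for on-the-fly resampling.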

References

https://zenodo.org/records/15629388
https://www.researchgate.net/publication/387711208_WABAD_A_World_Annotated_Bird_Acoustic_Dataset_for_Passive_Acoustic_Monitoring

Source code in alp_data/datasets/wabad.py

@register_dataset
class WABAD(Dataset):
    """WABAD Dataset

    Description
    -----------
    This class makes WABAD dataset available. Each entry is an audio recording,
    plus a selection table. Each row of the selection table has annotations at
    different taxonomic granularities (stored in annotation_columns attribute).
    Taxonomy has been coerced into GBIF.

    This class was included in alp-data (initially) for use as a zero-shot
    detection evaluation dataset.

    Description from publication:
    https://www.researchgate.net/publication/387711208_WABAD_A_World_Annotated_Bird_Acoustic_Dataset_for_Passive_Acoustic_Monitoring

    Under the current global biodiversity crisis, there is a need for automated
    and non-invasive monitoring techniques that can gather large amounts of data
    cost-effectively at various ecological scales, from local to large spatial
    scales. This data can then be analyzed to inform stakeholders and decision
    makers. One such technique is passive acoustic monitoring, which is commonly
    coupled with automatic identification of animal species based on their sound.
    Automated sound analyses usually require the training of sound detection and
    identification algorithms. These algorithms are based on annotated acoustic
    datasets which mark the occurrence of sounds of species inside sound
    recordings. However, compiling large annotated acoustic datasets is
    time-consuming and requires experts, and therefore they normally cover reduced
    spatial, temporal and taxonomic scales. This data paper presents WABAD, the
    World Annotated Bird Acoustic Dataset for passive acoustic monitoring. WABAD
    is designed to provide the public, the research community, and conservation
    managers with a novel and globally representative annotated acoustic dataset.
    This database includes 5,047 minutes of audio files annotated to species-level
    by local experts with the start and end time, and the upper and lower
    frequencies of each identified bird vocalisation in the recordings. The
    database has a wide taxonomic and spatial coverage, including information on
    91,931 vocalisations from 1,192 bird species recorded at 72 recording sites in
    29 recording locations (mainly countries) and distributed across 13 biomes.
    WABAD can be used, for example, for developing and/or validating automatic
    species detection algorithms, answering ecological questions, such as assessing
    geographical variations on bird vocalisations, or comparing acoustic diversity
    indices with species-based diversity indices. The dataset is published under a
    Creative Commons Attribution Non Commercial 4.0 International copyright.

    Pre-resampled Audio
    -------------------
    Pre-resampled audio is available at 16 kHz and 32 kHz. When
    ``sample_rate`` matches one of these rates, the pre-resampled files are
    loaded directly (no on-the-fly resampling). For any other target rate,
    audio is resampled on-the-fly using librosa's ``kaiser_best`` method.

    References
    ----------
    https://zenodo.org/records/15629388
    https://www.researchgate.net/publication/387711208_WABAD_A_World_Annotated_Bird_Acoustic_Dataset_for_Passive_Acoustic_Monitoring

    """

    info = DatasetInfo(
        name="wabad",
        owner="benjamin",
        split_paths={
            "all": f"{_RAW_ROOT}/all_info_gbif_v3.csv",
            "CAT": f"{_RAW_ROOT}/CAT_info_gbif_v2.csv",
            "POZO": f"{_RAW_ROOT}/POZO_info_gbif_v2.csv",
            "BRE": f"{_RAW_ROOT}/BRE_info_gbif_v2.csv",
            "EFFOR": f"{_RAW_ROOT}/EFFOR_info_gbif_v2.csv",
            "MONTEB": f"{_RAW_ROOT}/MONTEB_info_gbif_v2.csv",
            "CB": f"{_RAW_ROOT}/CB_info_gbif_v2.csv",
            "FEU": f"{_RAW_ROOT}/FEU_info_gbif_v2.csv",
            "BIAL": f"{_RAW_ROOT}/BIAL_info_gbif_v2.csv",
            "SPMCO": f"{_RAW_ROOT}/SPMCO_info_gbif_v2.csv",
            "OIO": f"{_RAW_ROOT}/OIO_info_gbif_v2.csv",
            "OESF": f"{_RAW_ROOT}/OESF_info_gbif_v2.csv",
            "QR": f"{_RAW_ROOT}/QR_info_gbif_v2.csv",
            "HAG": f"{_RAW_ROOT}/HAG_info_gbif_v2.csv",
            "VIL": f"{_RAW_ROOT}/VIL_info_gbif_v2.csv",
            "RFP": f"{_RAW_ROOT}/RFP_info_gbif_v2.csv",
            "HAK": f"{_RAW_ROOT}/HAK_info_gbif_v2.csv",
            "SLOB": f"{_RAW_ROOT}/SLOB_info_gbif_v2.csv",
            "BERB": f"{_RAW_ROOT}/BERB_info_gbif_v2.csv",
            "COU": f"{_RAW_ROOT}/COU_info_gbif_v2.csv",
            "OLIV": f"{_RAW_ROOT}/OLIV_info_gbif_v2.csv",
            "EVROS": f"{_RAW_ROOT}/EVROS_info_gbif_v2.csv",
            "FNCA": f"{_RAW_ROOT}/FNCA_info_gbif_v2.csv",
            "RGU": f"{_RAW_ROOT}/RGU_info_gbif_v2.csv",
            "CRUZ": f"{_RAW_ROOT}/CRUZ_info_gbif_v2.csv",
            "JUNCA": f"{_RAW_ROOT}/JUNCA_info_gbif_v2.csv",
            "PINA": f"{_RAW_ROOT}/PINA_info_gbif_v2.csv",
            "GTLU": f"{_RAW_ROOT}/GTLU_info_gbif_v2.csv",
            "MAPIMI": f"{_RAW_ROOT}/MAPIMI_info_gbif_v2.csv",
            "SAL": f"{_RAW_ROOT}/SAL_info_gbif_v2.csv",
            "ARD": f"{_RAW_ROOT}/ARD_info_gbif_v2.csv",
            "MARTI": f"{_RAW_ROOT}/MARTI_info_gbif_v2.csv",
            "DYOM": f"{_RAW_ROOT}/DYOM_info_gbif_v2.csv",
            "VER": f"{_RAW_ROOT}/VER_info_gbif_v2.csv",
            "SCHG": f"{_RAW_ROOT}/SCHG_info_gbif_v2.csv",
            "GLEN": f"{_RAW_ROOT}/GLEN_info_gbif_v2.csv",
            "HONDO": f"{_RAW_ROOT}/HONDO_info_gbif_v2.csv",
            "NL": f"{_RAW_ROOT}/NL_info_gbif_v2.csv",
            "BRCAS": f"{_RAW_ROOT}/BRCAS_info_gbif_v2.csv",
            "NAV": f"{_RAW_ROOT}/NAV_info_gbif_v2.csv",
            "KAR": f"{_RAW_ROOT}/KAR_info_gbif_v2.csv",
            "BUR": f"{_RAW_ROOT}/BUR_info_gbif_v2.csv",
            "KIB": f"{_RAW_ROOT}/KIB_info_gbif_v2.csv",
            "SCHF": f"{_RAW_ROOT}/SCHF_info_gbif_v2.csv",
            "TAM": f"{_RAW_ROOT}/TAM_info_gbif_v2.csv",
            "HUAP": f"{_RAW_ROOT}/HUAP_info_gbif_v2.csv",
            "DONG": f"{_RAW_ROOT}/DONG_info_gbif_v2.csv",
            "CLH": f"{_RAW_ROOT}/CLH_info_gbif_v2.csv",
            "HAR": f"{_RAW_ROOT}/HAR_info_gbif_v2.csv",
            "BOLIN": f"{_RAW_ROOT}/BOLIN_info_gbif_v2.csv",
            "SITH": f"{_RAW_ROOT}/SITH_info_gbif_v2.csv",
            "RBA": f"{_RAW_ROOT}/RBA_info_gbif_v2.csv",
            "MOPU": f"{_RAW_ROOT}/MOPU_info_gbif_v2.csv",
            "CRAT": f"{_RAW_ROOT}/CRAT_info_gbif_v2.csv",
            "PGF": f"{_RAW_ROOT}/PGF_info_gbif_v2.csv",
            "PUUL": f"{_RAW_ROOT}/PUUL_info_gbif_v2.csv",
            "MILLAN": f"{_RAW_ROOT}/MILLAN_info_gbif_v2.csv",
            "BMT": f"{_RAW_ROOT}/BMT_info_gbif_v2.csv",
            "SD": f"{_RAW_ROOT}/SD_info_gbif_v2.csv",
            "UNI": f"{_RAW_ROOT}/UNI_info_gbif_v2.csv",
            "SBN": f"{_RAW_ROOT}/SBN_info_gbif_v2.csv",
            "DUNAS": f"{_RAW_ROOT}/DUNAS_info_gbif_v2.csv",
            "PETI": f"{_RAW_ROOT}/PETI_info_gbif_v2.csv",
            "LIM": f"{_RAW_ROOT}/LIM_info_gbif_v2.csv",
            "BAM": f"{_RAW_ROOT}/BAM_info_gbif_v2.csv",
            "DEVA": f"{_RAW_ROOT}/DEVA_info_gbif_v2.csv",
            "ROTOK": f"{_RAW_ROOT}/ROTOK_info_gbif_v2.csv",
            "CARI": f"{_RAW_ROOT}/CARI_info_gbif_v2.csv",
            "PITI": f"{_RAW_ROOT}/PITI_info_gbif_v2.csv",
            "RME": f"{_RAW_ROOT}/RME_info_gbif_v2.csv",
            "MABI": f"{_RAW_ROOT}/MABI_info_gbif_v2.csv",
            "EMP": f"{_RAW_ROOT}/EMP_info_gbif_v2.csv",
            "EFFOU": f"{_RAW_ROOT}/EFFOU_info_gbif_v2.csv",
        },
        version="0.1.0",
        description="WABAD: This database includes 5,047 minutes of audio files "
        "annotated to species-level by local experts with the start and end time, "
        "and the upper and lower frequencies of each identified bird vocalisation "
        "in the recordings. The database has a wide taxonomic and spatial coverage, "
        "including information on 91,931 vocalisations from 1,192 bird species "
        "recorded at 72 recording sites in 29 recording locations",
        sources="zenodo.org",
        license="CC-BY-4.0",
    )

    _sample_rate_paths: dict[int, str] = {16000: "16khz_path", 32000: "32khz_path"}
    _originals_path_column = "audio_fp"

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "pandas",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
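            Target sample rate. If a pre-resampled version exists (16kHz or
            32kHz), it is loaded directly; otherwise audio is resampled
            on the fly. ``None`` keeps the original sample rate.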
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_fp'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "pandas"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species"]
        self.unknown_label = "Unknown"
        self.sample_rate = sample_rate

        self.full_dataset_available_labels = None  # placeholder for labels if split == all

        self._load()

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return pre-resampled sample rates whose path columns exist in the data."""
        return [sr for sr, col in self._sample_rate_paths.items() if col in self._data.columns]

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location,
            streaming=self._streaming,
            keep_default_na=False,
            na_values=[""],
        )

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        NotImplementedError
            If the dataset was loaded in streaming mode.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                audio_path = anypath(self.data_root) / row[path_column]
                use_presampled = True

        if not use_presampled:
            audio_path = anypath(self.data_root) / row[self._originals_path_column]

        audio, sr = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        if not use_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
        audio_dur = len(audio) / float(sr)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        row["audio"] = audio
        row["sample_rate"] = sr
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            The processed row, including the ``audio``, ``sample_rate``, and
            parsed ``selection_table`` entries (or the renamed subset when
            ``output_take_and_give`` is set).
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        ------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["WABAD", dict[str, Any]]:
        """Create a WABAD instance from a dataset configuration.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration object containing the dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta

        return ds, {}

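    def get_available_labels(self, anno_column: str | None = "Species") -> list[str]:
        """Return all available labels for an annotation column.

        For the ``"all"`` split, labels are read from the species info file and
        cached; for other splits, they are collected from each row's selection table.

        Parameters
        ----------
        anno_column : str | None
            Annotation column to collect labels from, by default "Species".

        Returns
        -------
        list[str]
            The available labels for ``anno_column``.
        """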
        if self.split == "all":
            if self.full_dataset_available_labels is None:
                self.full_dataset_available_labels = pd.read_csv(SPECIES_INFO_PATH)[
                    anno_column
                ].to_list()
            return self.full_dataset_available_labels
        else:
            available_labels = set()
            for row in self._data:
                st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
                available_labels.update(st[anno_column].astype(str).tolist())
            return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
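
The selection-table handling in `WABAD._process` and `get_available_labels` above can be illustrated without the library. The snippet below is a minimal sketch, not the library's implementation: it uses the standard-library `csv` module in place of `pandas.read_csv`, and the helper name `filter_selection_table`, the `End Time (s)` column, and the sample rows are assumptions made for illustration. Only the tab separator and the `Begin Time (s)` and `Species` column names come from the source.

```python
import csv
from io import StringIO


def filter_selection_table(tsv_text: str, audio_duration_s: float) -> list[dict[str, str]]:
    """Parse a tab-separated selection table and keep rows that start before the audio ends."""
    reader = csv.DictReader(StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["Begin Time (s)"]) < audio_duration_s]


# Hypothetical selection table: the last annotation starts after a 10 s clip ends.
example_tsv = (
    "Begin Time (s)\tEnd Time (s)\tSpecies\n"
    "0.5\t1.2\tTurdus merula\n"
    "4.0\t5.5\tParus major\n"
    "12.0\t13.0\tTurdus merula\n"
)

kept = filter_selection_table(example_tsv, audio_duration_s=10.0)

# Collect the distinct labels, similar in spirit to get_available_labels.
labels = sorted({row["Species"] for row in kept})
print(len(kept), labels)
```

Filtering on the start time alone means an annotation that begins inside the clip but ends after it is still kept, which matches the comparison used in `_process`.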

`Watkins`

📊 Dataset Information

Name	`watkins`
Version	`0.1.0`
Owner	david
License	LicenseRef-WHOI-Public
Sources	https://cis.whoi.edu/science/B/whalesounds/index.cfm
Available Splits	`train`

Description:

Watkins Marine Mammal Sound Database — 2018 remastered release. ~13,700 audio clips spanning ~50 species of cetaceans and pinnipeds with GBIF-resolved taxonomy. Original audio at variable sample rates; pre-resampled 16kHz and 32kHz versions available.

Watkins Marine Mammal Sound Database (2018 remaster).

Description

The Watkins Marine Mammal Sound Database is the largest publicly available collection of marine mammal vocalisations, originally compiled by William A. Watkins at Woods Hole Oceanographic Institution. This dataset uses the 2018 remastered FLAC release and includes GBIF-resolved taxonomic metadata.

The dataset spans ~50 species across cetaceans (baleen whales, toothed whales, dolphins) and pinnipeds (seals, sea lions, walrus), with ~13,700 audio clips at variable original sample rates.

Available Metadata Fields

Taxonomic Information:
- species: Scientific species name (as labelled in the original dataset)
- canonical_name: GBIF-resolved canonical species name
- species_common: Common/vernacular species name
- genus, family, order, class, phylum, kingdom: Taxonomic hierarchy
- gbifID: GBIF identifier

Vocalisation Labels:
- call_type: Semicolon-separated fine-grained vocalisation types (e.g. "click;whistle", "moan", "pulsed_click;click;squeal"). Populated for ~84% of rows.
- coarse_call_type: Semicolon-separated coarse categories (e.g. "click;whistle", "call", "burst_pulse;click").
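
Because `call_type` and `coarse_call_type` hold several labels in one semicolon-separated string, downstream code usually needs to split them. The following is a minimal sketch under stated assumptions: the helper name `split_call_types` is hypothetical, and how missing values appear (for example `None` or an empty string) depends on the backend, so both are treated as "no labels" here.

```python
from __future__ import annotations


def split_call_types(value: str | None) -> list[str]:
    """Split a semicolon-separated label string into a list, skipping empty parts."""
    if value is None:
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


fine = split_call_types("pulsed_click;click;squeal")
single = split_call_types("moan")
missing = split_call_types(None)
print(fine, single, missing)
```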

Audio File Paths:
- audio_path: Path to original FLAC audio relative to data_root (variable sample rate)
- 16khz_path: Path to pre-resampled 16kHz WAV audio
- 32khz_path: Path to pre-resampled 32kHz WAV audio
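
Which of these path columns is read depends on the requested `sample_rate`. The sketch below mirrors the selection logic of `Watkins._process` in plain Python so it can be run without the library; the helper name `pick_audio_column` and the sample row are hypothetical, while the column names and the fallback rule follow the source code.

```python
from __future__ import annotations

# Same mapping as Watkins._sample_rate_paths / _originals_path_column.
SAMPLE_RATE_PATHS = {16000: "16khz_path", 32000: "32khz_path"}
ORIGINALS_COLUMN = "audio_path"


def pick_audio_column(row: dict, sample_rate: int | None) -> tuple[str, bool]:
    """Return (column name, use_presampled) for the requested sample rate."""
    if sample_rate is not None and sample_rate in SAMPLE_RATE_PATHS:
        col = SAMPLE_RATE_PATHS[sample_rate]
        value = row.get(col)
        # An absent or blank pre-resampled path falls back to the original file.
        if value is not None and str(value).strip():
            return col, True
    return ORIGINALS_COLUMN, False


# Hypothetical row: a 16kHz copy exists, the 32kHz entry is blank.
row = {"audio_path": "flac/a.flac", "16khz_path": "16k/a.wav", "32khz_path": ""}

choice_16k = pick_audio_column(row, 16000)
choice_32k = pick_audio_column(row, 32000)
choice_other = pick_audio_column(row, 22050)
choice_none = pick_audio_column(row, None)
print(choice_16k, choice_32k, choice_other, choice_none)
```

When the original file is selected and a target rate is set, the library then resamples on the fly with `librosa.resample`; with `sample_rate=None` the audio keeps its original rate.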

Audio Metadata:
- sample_rate_hz: Original sample rate of the recording (Hz)
- duration_s: Duration of the recording (seconds)

References

Watkins Marine Mammal Sound Database: https://cis.whoi.edu/science/B/whalesounds/index.cfm DOI: 10.1575/1912/7270

Examples:

>>> from alp_data.datasets import Watkins
>>> ds = Watkins(split="train")
>>> print(len(ds))
13693

>>> ds_16k = Watkins(split="train", sample_rate=16000)

Source code in alp_data/datasets/watkins.py

@register_dataset
class Watkins(Dataset):
    """Watkins Marine Mammal Sound Database (2018 remaster).

    Description
    -----------
    The Watkins Marine Mammal Sound Database is the largest publicly available
    collection of marine mammal vocalisations, originally compiled by William
    A. Watkins at Woods Hole Oceanographic Institution.  This dataset uses the
    2018 remastered FLAC release and includes GBIF-resolved taxonomic metadata.

    The dataset spans ~50 species across cetaceans (baleen whales, toothed
    whales, dolphins) and pinnipeds (seals, sea lions, walrus), with ~13,700
    audio clips at variable original sample rates.

    Available Metadata Fields
    -------------------------
    **Taxonomic Information:**
        - ``species``: Scientific species name (as labelled in the original dataset)
        - ``canonical_name``: GBIF-resolved canonical species name
        - ``species_common``: Common/vernacular species name
        - ``genus``, ``family``, ``order``, ``class``, ``phylum``, ``kingdom``:
          Taxonomic hierarchy
        - ``gbifID``: GBIF identifier

    **Vocalisation Labels:**
        - ``call_type``: Semicolon-separated fine-grained vocalisation types
          (e.g. ``"click;whistle"``, ``"moan"``, ``"pulsed_click;click;squeal"``).
          Populated for ~84% of rows.
        - ``coarse_call_type``: Semicolon-separated coarse categories
          (e.g. ``"click;whistle"``, ``"call"``, ``"burst_pulse;click"``).

    **Audio File Paths:**
        - ``audio_path``: Path to original FLAC audio relative to data_root
          (variable sample rate)
        - ``16khz_path``: Path to pre-resampled 16kHz WAV audio
        - ``32khz_path``: Path to pre-resampled 32kHz WAV audio

    **Audio Metadata:**
        - ``sample_rate_hz``: Original sample rate of the recording (Hz)
        - ``duration_s``: Duration of the recording (seconds)

    References
    ----------
    Watkins Marine Mammal Sound Database:
        https://cis.whoi.edu/science/B/whalesounds/index.cfm
    DOI: 10.1575/1912/7270

    Examples
    --------
    >>> from alp_data.datasets import Watkins
    >>> ds = Watkins(split="train")
    >>> print(len(ds))
    13693

    >>> ds_16k = Watkins(split="train", sample_rate=16000)
    """

    info = DatasetInfo(
        name="watkins",
        owner="david",
        split_paths={
            "train": f"{_RAW_ROOT}/watkins.csv",
        },
        version="0.1.0",
        description=(
            "Watkins Marine Mammal Sound Database — 2018 remastered release.  "
            "~13,700 audio clips spanning ~50 species of cetaceans and "
            "pinnipeds with GBIF-resolved taxonomy.  Original audio at "
            "variable sample rates; pre-resampled 16kHz and 32kHz versions "
            "available."
        ),
        sources=["https://cis.whoi.edu/science/B/whalesounds/index.cfm"],
        license="LicenseRef-WHOI-Public",
    )

    _sample_rate_paths = {
        16000: "16khz_path",
        32000: "32khz_path",
    }

    _originals_path_column = "audio_path"

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialise the Watkins dataset.

        Parameters
        ----------
        split : str
            The split to load (default ``"train"``).
        output_take_and_give : dict[str, str] | None
            Column renaming / selection mapping.
        sample_rate : int | None
            Target sample rate.  If a pre-resampled version exists (16kHz or
            32kHz), it will be loaded directly; otherwise audio is resampled
            on-the-fly.  ``None`` returns original sample rate.
        data_root : str | AnyPathT | None
            Root directory prepended to audio paths.  Defaults to the GCS
            bucket holding the original FLAC files.
        backend : BackendType
            DataFrame backend (``"polars"`` or ``"pandas"``).
        streaming : bool
            Whether to use streaming mode.
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate
        self._data = None

        if data_root is None:
            self.data_root = anypath(_RAW_ROOT)
        else:
            self.data_root = anypath(data_root)

        self._load()

    # ── Loading ────────────────────────────────────────────────────────

    def _load(self) -> None:
        """Load the Watkins CSV from the configured split path.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    # ── Properties ─────────────────────────────────────────────────────

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns) if self._data is not None else []

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Pre-resampled sample rates available in the loaded data."""
        available = []
        if self._data is not None:
            for sr, col in self._sample_rate_paths.items():
                if col in self._data.columns:
                    available.append(sr)
        return available

    # ── Factory ────────────────────────────────────────────────────────

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["Watkins", dict[str, Any]]:
        """Create a Watkins instance from a config.

        Returns
        -------
        tuple[Watkins, dict[str, Any]]
            The dataset instance and transformation metadata (empty if none).
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            sample_rate=cfg["sample_rate"],
            data_root=cfg["data_root"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

    # ── Iteration / indexing ───────────────────────────────────────────

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row: load audio and optionally resample.

        Returns
        -------
        dict[str, Any]
            The processed row with ``audio`` and ``sample_rate`` keys added.
        """
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            col = self._sample_rate_paths[self.sample_rate]
            if col in row and row[col] is not None and str(row[col]).strip():
                audio_path = self.data_root / row[col]
                use_presampled = True

        if not use_presampled:
            audio_path = self.data_root / row[self._originals_path_column]

        audio, sr = read_audio(audio_path)
        audio = audio.astype(np.float32)
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if not use_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sr

        if self.output_take_and_give:
            return {new: row[old] for old, new in self.output_take_and_give.items()}
        return row

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No data loaded.")
        if self._streaming:
            raise NotImplementedError("Length unavailable in streaming mode.")
        return len(self._data)

    def __getitem__(self, idx: int) -> dict[str, Any]:
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        n = len(self) if self._data is not None and not self._streaming else "?"
        return (
            f"{self.info.name} (v{self.info.version}), split='{self.split}'\n"
            f"  Rows: {n}\n"
            f"  Description: {self.info.description}\n"
            f"  License: {self.info.license}\n"
            f"  Available splits: {', '.join(self.info.split_paths.keys())}"
        )

`XenoCanto`

📊 Dataset Information

Name	`xeno-canto`
Version	`0.1.0`
Owner	david; gagan
License	CC BY-NC-SA 4.0, CC BY-NC 4.0, CC BY-SA, CC0
Sources	Xeno-canto
Available Splits	`train`, `validation`, `all`, `train_unseen`, `validation_unseen`, `all_unseen`

Description:

Xeno-canto audio dataset with taxonomic metadata. Available at original (variable) sample rates and pre-resampled to 32kHz and 16kHz. Pre-resampled audio uses librosa's kaiser_best resampling method. Xeno-canto dump as of Oct 2025. Train/val split is 90%/10% with random seed 42.

Xeno-canto audio dataset.

Description

Xeno-canto is a website dedicated to sharing wildlife sounds from around the world. This dataset includes audio recordings from Xeno-canto with associated metadata about species, locations, and other observation details.

The dataset contains audio recordings with rich taxonomic information, including species scientific and common names, family, genus, order, and other metadata such as location, date, and recordist information.

Available Metadata Fields

Taxonomic Information:
- canonical_name: Canonical species name (primary identifier). Linked to the GBIF backbone taxonomy.
- species_common: Common/vernacular species name
- species: Species scientific name
- scientificName: Scientific species name with author (legacy field)
- scientific_name_unified: Scientific name (before GBIF standardization)
- genus, family, order, class, phylum, kingdom: Taxonomic hierarchy
- gbifID, taxonKey, speciesKey: GBIF identifiers
- xc_clade: Xeno-canto clade classification (e.g., "aves")
- Associated Taxa: Background species in the recording

Audio File Paths:
- relative_path: Path to original audio relative to data_root (variable sample rate)
- gcs_path: Full GCS path to original audio
- 32khz_path: Path to pre-resampled 32kHz audio (if available)
- 16khz_path: Path to pre-resampled 16kHz audio (if available)

Recording Metadata:
- eventDate, eventTime: When the recording was made
- year, month, day: Date components
- behavior: Behavior being recorded (e.g., "calling song")
- sex: Sex of the recorded animal(s)
- lifeStage: Life stage (e.g., "adult")
- recordedBy: Name of the recordist

Location:
- latitudeDecimal, longitudeDecimal: GPS coordinates (also decimalLatitude, decimalLongitude)
- coordinateUncertaintyInMeters: Coordinate precision
- locality, location: Geographic location information
- country_code, countryCode: ISO country code
- continent: Continent name
- verbatimElevation: Elevation as recorded

Rights & Attribution:
- rightsHolder: Copyright holder
- recordedBy: Name of the recordist
- license, license_text, license_url: License information (e.g., CC BY-SA 4.0)
- media_license, media_license_url: Media-specific license
- media_url: Direct audio file URL

Additional Fields:
- fieldNotes, description: Observer's notes about the recording
- caption, caption2: Recording captions
- xc_id: Xeno-canto recording ID
- dataset, source_version, source_dataset: Data source information
- occurrenceID: GBIF occurrence identifier

Available Splits

train: Training set (90% of data, random split)
validation: Validation set (10% of data, random split)
all: Complete dataset (train + validation)
train_unseen: Training set excluding unseen taxa evaluated in BEANS-Zero benchmark
validation_unseen: Validation set excluding unseen taxa evaluated in BEANS-Zero benchmark
all_unseen: Complete dataset excluding BEANS-Zero unseen taxa

The _unseen splits are designed for training models that will be evaluated on BEANS-Zero's unseen taxa benchmark, ensuring no test taxa leak into the training data.
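
The idea behind the `_unseen` splits can be sketched as a simple exclusion filter. How the published CSVs were actually built is not shown in this page, so the snippet below is only an illustration under stated assumptions: the helper name `exclude_held_out_taxa`, the sample rows, and the held-out set are hypothetical, and matching on `canonical_name` is an assumed choice of key.

```python
from __future__ import annotations


def exclude_held_out_taxa(
    rows: list[dict[str, str]],
    held_out: set[str],
    key: str = "canonical_name",
) -> list[dict[str, str]]:
    """Keep only rows whose taxon is not in the held-out set."""
    return [row for row in rows if row[key] not in held_out]


# Hypothetical rows and held-out taxa, for illustration only.
rows = [
    {"xc_id": "1", "canonical_name": "Turdus merula"},
    {"xc_id": "2", "canonical_name": "Parus major"},
    {"xc_id": "3", "canonical_name": "Turdus merula"},
]
held_out_taxa = {"Parus major"}

train_unseen_rows = exclude_held_out_taxa(rows, held_out_taxa)
print([row["xc_id"] for row in train_unseen_rows])
```

Filtering by taxon rather than by recording removes every example of a held-out species, which is what prevents test taxa from leaking into training data.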

Note that all splits exclude examples overlapping with the following benchmark datasets:
- cbi (See the beans dataset)
- BEANS-Zero call-type, lifestage, and captioning test sets (See the beans_zero dataset)
- xeno-canto Jeantet et al. 2023 dataset (See the XenoCantoAnnotatedJeantet23 dataset)

References

Xeno-canto: https://www.xeno-canto.org/

Examples:

>>> from alp_data.datasets import XenoCanto
>>> dataset = XenoCanto(
...     split="train",
...     output_take_and_give={"canonical_name": "species"}
... )
>>> print(dataset.info.name)
xeno-canto
>>> print(dataset.available_sample_rates)
[32000, 16000]

Load with pre-resampled 32kHz audio (when available)

>>> dataset_32k = XenoCanto(split="train", sample_rate=32000, streaming=True)

Load with pre-resampled 16kHz audio (when available)

>>> dataset_16k = XenoCanto(split="train", sample_rate=16000, streaming=True)

Source code in alp_data/datasets/xeno_canto.py

@register_dataset
class XenoCanto(Dataset):
    """Xeno-canto audio dataset.

    Description
    -----------
    Xeno-canto is a website dedicated to sharing wildlife sounds from around
    the world. This dataset includes audio recordings from Xeno-canto with
    associated metadata about species, locations, and other observation details.

    The dataset contains audio recordings with rich taxonomic information,
    including species scientific and common names, family, genus, order,
    and other metadata such as location, date, and recordist information.

    Available Metadata Fields
    -------------------------
    **Taxonomic Information:**
        - ``canonical_name``: Canonical species name (primary identifier).
            Linked to the GBIF backbone taxonomy.
        - ``species_common``: Common/vernacular species name
        - ``species``: Species scientific name
        - ``scientificName``: Scientific species name with author (legacy field)
        - ``scientific_name_unified``: Scientific name (before GBIF standardization)
        - ``genus``, ``family``, ``order``, ``class``, ``phylum``, ``kingdom``: Taxonomic hierarchy
        - ``gbifID``, ``taxonKey``, ``speciesKey``: GBIF identifiers
        - ``xc_clade``: Xeno-canto clade classification (e.g., "aves")
        - ``Associated Taxa``: Background species in the recording

    **Audio File Paths:**
        - ``relative_path``: Path to original audio relative to data_root (variable sample rate)
        - ``gcs_path``: Full GCS path to original audio
        - ``32khz_path``: Path to pre-resampled 32kHz audio (if available)
        - ``16khz_path``: Path to pre-resampled 16kHz audio (if available)

    **Recording Metadata:**
        - ``eventDate``, ``eventTime``: When the recording was made
        - ``year``, ``month``, ``day``: Date components
        - ``behavior``: Behavior being recorded (e.g., "calling song")
        - ``sex``: Sex of the recorded animal(s)
        - ``lifeStage``: Life stage (e.g., "adult")
        - ``recordedBy``: Name of the recordist

    **Location:**
        - ``latitudeDecimal``, ``longitudeDecimal``: GPS coordinates
            (also ``decimalLatitude``, ``decimalLongitude``)
        - ``coordinateUncertaintyInMeters``: Coordinate precision
        - ``locality``, ``location``: Geographic location information
        - ``country_code``, ``countryCode``: ISO country code
        - ``continent``: Continent name
        - ``verbatimElevation``: Elevation as recorded

    **Rights & Attribution:**
        - ``rightsHolder``: Copyright holder
        - ``recordedBy``: Name of the recordist
        - ``license``, ``license_text``, ``license_url``: License information (e.g., CC BY-SA 4.0)
        - ``media_license``, ``media_license_url``: Media-specific license
        - ``media_url``: Direct audio file URL

    **Additional Fields:**
        - ``fieldNotes``, ``description``: Observer's notes about the recording
        - ``caption``, ``caption2``: Recording captions
        - ``xc_id``: Xeno-canto recording ID
        - ``dataset``, ``source_version``, ``source_dataset``: Data source information
        - ``occurrenceID``: GBIF occurrence identifier

    Available Splits
    ----------------
    - ``train``: Training set (90% of data, random split)
    - ``validation``: Validation set (10% of data, random split)
    - ``all``: Complete dataset (train + validation)
    - ``train_unseen``: Training set excluding unseen taxa evaluated in BEANS-Zero benchmark
    - ``validation_unseen``: Validation set excluding unseen taxa evaluated in BEANS-Zero benchmark
    - ``all_unseen``: Complete dataset excluding BEANS-Zero unseen taxa

    The ``_unseen`` splits are designed for training models that will be evaluated
    on BEANS-Zero's unseen taxa benchmark, ensuring no test taxa leak into the training data.

    Note that all splits exclude examples overlapping with the following benchmark datasets:
    - cbi (See the beans dataset)
    - BEANS-Zero call-type, lifestage, and captioning test sets (See the beans_zero dataset)
    - xeno-canto Jeantet et al. 2023 dataset (See the XenoCantoAnnotatedJeantet23 dataset)

    References
    ----------
    Xeno-canto: https://www.xeno-canto.org/

    Examples
    --------
    >>> from alp_data.datasets import XenoCanto
    >>> dataset = XenoCanto(
    ...     split="train",
    ...     output_take_and_give={"canonical_name": "species"}
    ... )
    >>> print(dataset.info.name)
    xeno-canto
    >>> print(dataset.available_sample_rates)
    [32000, 16000]

    Load with pre-resampled 32kHz audio (when available)
    >>> dataset_32k = XenoCanto(split="train", sample_rate=32000, streaming=True)

    Load with pre-resampled 16kHz audio (when available)
    >>> dataset_16k = XenoCanto(split="train", sample_rate=16000, streaming=True)
    """

    info = DatasetInfo(
        name="xeno-canto",
        owner="david; gagan",
        split_paths={
            "train": f"{DATA_HOME}/xeno-canto/v0.1.0/raw/train_20260203_v2.csv",
            "validation": f"{DATA_HOME}/xeno-canto/v0.1.0/raw/val_20260203_v2.csv",
            "all": f"{DATA_HOME}/xeno-canto/v0.1.0/raw/all_20260203_v2.csv",
            "train_unseen": f"{DATA_HOME}/xeno-canto/v0.1.0/raw/train_unseen_20260203_v2.csv",
            "validation_unseen": f"{DATA_HOME}/xeno-canto/v0.1.0/raw/val_unseen_20260203_v2.csv",
            "all_unseen": f"{DATA_HOME}/xeno-canto/v0.1.0/raw/all_unseen_20260203_v2.csv",
        },
        version="0.1.0",
        description="Xeno-canto audio dataset with taxonomic metadata. "
        "Available at original (variable) sample rates and pre-resampled to 32kHz and 16kHz. "
        "Pre-resampled audio uses librosa's kaiser_best resampling method. "
        "Xeno-canto dump as of Oct 2025. "
        "Train/val split is 90%/10% with random seed 42.",
        sources=["Xeno-canto"],
        license="CC BY-NC-SA 4.0, CC BY-NC 4.0, CC BY-SA, CC0",
    )

    # Mapping of sample rates to their corresponding path columns
    _sample_rate_paths = {
        32000: "32khz_path",  # Pre-resampled to 32kHz
        16000: "16khz_path",  # Pre-resampled to 16kHz
    }

    # Column name for original variable-rate audio files
    _originals_path_column = "relative_path"

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the Xeno-canto dataset.

        Parameters
        ----------
        split : str, default="train"
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str], optional
            A dictionary mapping the original column names to the new column names.
        sample_rate : int, optional
            The sample rate to which audio files should be resampled. If the requested
            sample rate is available as pre-resampled audio (see `available_sample_rates`),
            the pre-resampled version will be loaded directly. Otherwise, audio will be
            resampled on-the-fly from the original files (at variable sample rates) using
            librosa's kaiser_best method. If None, audio is returned at its original
            (variable) sample rate.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is prepended to the path
            column value to construct the full path to audio files. If None, defaults
            to the GCS bucket path for this dataset.
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self._load()
        self.sample_rate = sample_rate

        if data_root is None:
            self.data_root = anypath(f"{DATA_HOME}/xeno-canto/v0.1.0/raw/")
            self._data_root_32k = anypath(f"{DATA_HOME}/xeno-canto/v0.1.0/raw/audio_32k/")
            self._data_root_16k = anypath(f"{DATA_HOME}/xeno-canto/v0.1.0/raw/audio_16k/")
        else:
            self.data_root = anypath(data_root)
            self._data_root_32k = anypath(data_root)
            self._data_root_16k = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return the available pre-resampled sample rates.

        Returns
        -------
        list[int]
            List of sample rates (in Hz) for which pre-resampled audio is available.
            Audio at these sample rates can be loaded directly without on-the-fly resampling.
            This checks which path columns actually exist in the loaded data.
        """
        available = []
        for sr, path_column in self._sample_rate_paths.items():
            # Check if the path column exists in the loaded data
            if path_column in self._data.columns:
                available.append(sr)
        return available

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        # Read CSV directly from GCS path to avoid memory issues
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> tuple["XenoCanto", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters.

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise an empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        # Determine which path column to use based on requested sample rate
        # If a pre-resampled version is available, use it; otherwise resample on-the-fly
        # TODO (gagan): this logic is a bit convoluted - can we simplify it?
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            # Check if the pre-resampled path column exists in the data
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                # Use pre-resampled audio with appropriate data root
                if self.sample_rate == 16000:
                    audio_path = self._data_root_16k / row[path_column]
                else:
                    audio_path = self._data_root_32k / row[path_column]
                use_presampled = True

        if use_presampled:
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")
            # Audio is already at the correct sample rate, no resampling needed
        else:
            # Use original variable-rate files and resample on-the-fly if needed
            # For original files, relative_path needs audio/ prefix if not already present
            rel_path = row[self._originals_path_column]
            if not rel_path.startswith("audio/"):
                audio_path = anypath(self.data_root) / "audio" / rel_path
            else:
                audio_path = anypath(self.data_root) / rel_path
            audio, sample_rate = read_audio(audio_path)
            audio = audio.astype(np.float32)
            audio = audio_stereo_to_mono(audio, mono_method="average")

            if self.sample_rate is not None and sample_rate != self.sample_rate:
                audio = librosa.resample(
                    y=audio,
                    orig_sr=sample_rate,
                    target_sr=self.sample_rate,
                    scale=True,
                    res_type="kaiser_best",
                )
                sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version})"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
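The key remapping at the end of `_process` can be sketched in isolation. This is a minimal illustration, not the library's API: `apply_take_and_give` is a hypothetical helper name, and the row contents are made up. It shows that `output_take_and_give` both renames and filters keys, so any column missing from the mapping is dropped from the returned sample.

```python
from __future__ import annotations

from typing import Any


def apply_take_and_give(row: dict[str, Any], mapping: dict[str, str] | None) -> dict[str, Any]:
    """Rename and filter the keys of a row, mirroring the tail of `_process`."""
    if not mapping:
        # No mapping (None or empty): the row is returned unchanged
        return row
    # Only mapped keys survive; each value is stored under its new name
    return {new_key: row[old_key] for old_key, new_key in mapping.items()}


# Made-up row for illustration only
row = {"audio": [0.0, 0.1], "sample_rate": 32000, "species_common": "Barn Owl"}
item = apply_take_and_give(row, {"audio": "audio", "species_common": "label"})
print(item)
```

Note that `sample_rate` is absent from `item` because it was not listed in the mapping.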

`XenoCantoAnnotatedJeantet23`

📊 Dataset Information

Name	`xeno_canto_annotated_jeantet_23`
Version	`0.1.0`
Owner	benjamin
License	CC-BY-4.0
Sources	XenoCanto
Available Splits	`all`

Description:

Bird song detection dataset consisting of xeno canto recordings annotated with start- and stop-times

XenoCantoAnnotatedJeantet23 Dataset

Description

Bird song detection dataset consisting of xeno canto recordings annotated with start- and stop-times. The species were chosen specifically to be those for which adding location information would improve performance.

From the article "Improving deep learning acoustic classifiers with contextual information for wildlife monitoring" by Jeantet and Dufourq (2023):

"Firstly, we selected the ten most recorded families in the Passeriformes order, the most represented order in the Xeno-canto database. From each of the ten families, we again sub-sampled the ten most recorded genera. For each genus, we observed the countries of the recordings and the number of available recordings per species and country. From the information gathered, and by visually analyzing the spectrograms, we conducted a self-selection process of genera that comprised species with similar songs recorded in different regions. Our aim was to ensure that there were sufficient recordings available for each species and country, allowing us to form a comprehensive dataset. In the end, 5 genera were selected containing 22 species (Table 1). Due to the significant variation in the number of available recordings across different species, we needed to determine a suitable allocation of segments for each species. To address this, we calculated the average number of records per species and per country. For species/country pairs with a higher number of recordings than this average, we set an upper limit on the number of assigned segments to this average value. The recordings were downloaded from the Xeno-canto database in .wav format and each recording was manually annotated by labelling the start and stop time for every vocalisation occurrence using Sonic Visualiser (Suppl. Fig. 1, Cannam et al. (2010)). In total, we obtained 6,537 occurrences of bird songs of various lengths from 967 file recordings (Table 1)."

Each entry consists of:

- an audio recording
- a selection table (Raven format), with Species labels
- the id of the xeno canto asset
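The selection table is stored as a tab-separated string and, in `_process`, rows whose begin time falls beyond the end of the audio are dropped. The sketch below reproduces that step with the standard library only (the dataset itself uses pandas). The table contents, the `End Time (s)` column, and the `parse_selection_table` helper are assumptions for illustration; only `Begin Time (s)` and `Species` are column names taken from the source.

```python
import csv
from io import StringIO

# Hypothetical selection table; real tables may carry other columns
selection_table = (
    "Begin Time (s)\tEnd Time (s)\tSpecies\n"
    "0.5\t1.2\tspecies_a\n"
    "3.0\t4.1\tspecies_a\n"
    "12.0\t13.5\tspecies_b\n"
)


def parse_selection_table(text: str, audio_duration_s: float) -> list:
    """Parse a tab-separated selection table and keep events that start within the audio."""
    reader = csv.DictReader(StringIO(text), delimiter="\t")
    events = []
    for record in reader:
        # Drop annotations that begin at or after the end of the clip
        if float(record["Begin Time (s)"]) < audio_duration_s:
            events.append(record)
    return events


# Assume a 10-second clip: the event starting at 12.0 s is discarded
events = parse_selection_table(selection_table, audio_duration_s=10.0)
print([event["Species"] for event in events])
```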

Pre-resampled Audio

Pre-resampled audio is available at 16 kHz and 32 kHz. When sample_rate matches one of these rates, the pre-resampled files are loaded directly (no on-the-fly resampling). For any other target rate, audio is resampled on-the-fly using librosa's kaiser_best method.
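The choice between pre-resampled and original files can be sketched as a small pure function. This is an illustrative approximation of the logic in `_process`, not the class itself: `select_audio_path` is a hypothetical helper, the row values and `/data` root are made up, and `PurePosixPath` stands in for the library's `anypath`.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Column names as defined on the dataset class
SAMPLE_RATE_PATHS = {16000: "16khz_path", 32000: "32khz_path"}
ORIGINALS_PATH_COLUMN = "audio_path"


def select_audio_path(
    row: dict, sample_rate: int | None, data_root: str
) -> tuple[PurePosixPath, bool]:
    """Return the audio path to read and whether it points to pre-resampled audio."""
    column = SAMPLE_RATE_PATHS.get(sample_rate)
    # Use the pre-resampled file only if its column exists and is non-empty
    if column is not None and row.get(column) not in (None, ""):
        return PurePosixPath(data_root) / row[column], True
    # Otherwise fall back to the original file (resampled later if needed)
    return PurePosixPath(data_root) / row[ORIGINALS_PATH_COLUMN], False


# Made-up row: a 16 kHz copy exists, the 32 kHz entry is empty
row = {"audio_path": "audio/a.wav", "16khz_path": "audio_16k/a.wav", "32khz_path": ""}
print(select_audio_path(row, 16000, "/data"))
print(select_audio_path(row, 32000, "/data"))
print(select_audio_path(row, 44100, "/data"))
```

When the fallback branch is taken, the class resamples with librosa only if the file's native rate differs from the requested one.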

References

https://www.sciencedirect.com/science/article/pii/S1574954123002856

Source code in alp_data/datasets/xeno_canto_annotated_jeantet_23.py

@register_dataset
class XenoCantoAnnotatedJeantet23(Dataset):
    """XenoCantoAnnotatedJeantet23 Dataset

    Description
    -----------
    Bird song detection dataset consisting of xeno canto recordings annotated
    with start- and stop-times. The species were chosen specifically to be
    those for which adding location information would improve performance.

    From the article "Improving deep learning acoustic classifiers with contextual
    information for wildlife monitoring" by Jeantet and Dufourq (2023):

    "Firstly, we selected the ten most recorded families in the Passeriformes order,
    the most represented order in the Xeno-canto database. From each of the ten
    families, we again sub-sampled the ten most recorded genera. For each genus, we
    observed the countries of the recordings and the number of available recordings
    per species and country. From the information gathered, and by visually
    analyzing the spectrograms, we conducted a self-selection process of genera that
    comprised species with similar songs recorded in different regions. Our aim was
    to ensure that there were sufficient recordings available for each species and
    country, allowing us to form a comprehensive dataset. In the end, 5 genera were
    selected containing 22 species (Table 1). Due to the significant variation in
    the number of available recordings across different species, we needed to
    determine a suitable allocation of segments for each species. To address this,
    we calculated the average number of records per species and per country. For
    species/country pairs with a higher number of recordings than this average, we
    set an upper limit on the number of assigned segments to this average value. The
    recordings were downloaded from the Xeno-canto database in .wav format and each
    recording was manually annotated by labelling the start and stop time for every
    vocalisation occurrence using Sonic Visualiser (Suppl. Fig. 1, Cannam et al.
    (2010)). In total, we obtained 6,537 occurrences of bird songs of various
    lengths from 967 file recordings (Table 1)."


    Each entry consists of:
    - an audio recording
    - a selection table (Raven format), with Species labels
    - the id of the xeno canto asset

    Pre-resampled Audio
    -------------------
    Pre-resampled audio is available at 16 kHz and 32 kHz. When
    ``sample_rate`` matches one of these rates, the pre-resampled files are
    loaded directly (no on-the-fly resampling). For any other target rate,
    audio is resampled on-the-fly using librosa's ``kaiser_best`` method.

    References
    ----------
    https://www.sciencedirect.com/science/article/pii/S1574954123002856

    """

    info = DatasetInfo(
        name="xeno_canto_annotated_jeantet_23",
        owner="benjamin",
        split_paths={
            "all": f"{DATA_HOME}/xeno_canto_annotated_jeantet_2023/all_gbif_v2_1.csv",
        },
        version="0.1.0",
        description="Bird song detection dataset consisting of xeno canto recordings annotated "
        "with start- and stop-times",
        sources="XenoCanto",
        license="CC-BY-4.0",
    )

    _sample_rate_paths: dict[int, str] = {16000: "16khz_path", 32000: "32khz_path"}
    _originals_path_column = "audio_path"

    def __init__(
        self,
        split: str = "all",
        output_take_and_give: Dict[str, str] | None = None,
        sample_rate: int | None = 16000,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """
        Parameters
        ----------
        split : str
            Split to load (key in info.split_paths).
        output_take_and_give : dict[str, str] | None
            Optional mapping of original → new output keys (filters columns as well).
        sample_rate : int | None
            If set, audio is resampled to this rate.
        data_root : str | AnyPathT | None
            Optional root directory to prepend to each row['audio_path'].
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self._data = None
        self.annotation_columns = ["Species"]

        self.sample_rate = sample_rate

        self._load()

        if data_root is None:
            self.data_root = anypath(self.info.split_paths[self.split]).parent
        else:
            self.data_root = anypath(data_root)

    @property
    def columns(self) -> list[str]:
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        return list(self.info.split_paths.keys())

    @property
    def available_sample_rates(self) -> list[int]:
        """Return pre-resampled sample rates whose path columns exist in the data."""
        return [sr for sr, col in self._sample_rate_paths.items() if col in self._data.columns]

    def _load(self) -> None:
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )
        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(
            location, streaming=self._streaming, keep_default_na=False, na_values=[""]
        )

    def __len__(self) -> int:
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError(
                "Length is not available in streaming mode. Iterate over the dataset instead."
            )
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of the dataset.

        Parameters
        ----------
        row : dict[str, Any]
            A dictionary representing a single row of the dataset.

        Returns
        -------
        dict[str, Any]
            The processed row.
        """
        use_presampled = False
        if self.sample_rate is not None and self.sample_rate in self._sample_rate_paths:
            path_column = self._sample_rate_paths[self.sample_rate]
            if path_column in row and row[path_column] is not None and row[path_column] != "":
                audio_path = anypath(self.data_root) / row[path_column]
                use_presampled = True

        if not use_presampled:
            audio_path = anypath(self.data_root) / row[self._originals_path_column]

        audio, sr = read_audio(audio_path)
        audio = audio_stereo_to_mono(audio, mono_method="average").astype(np.float32)

        if not use_presampled and self.sample_rate is not None and sr != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sr,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sr = self.sample_rate

        st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
        audio_dur = len(audio) / float(sr)
        st = st[st["Begin Time (s)"] < audio_dur].copy()

        row["audio"] = audio
        row["selection_table"] = st

        if self.output_take_and_give:
            item = {}
            for old_key, new_key in self.output_take_and_give.items():
                item[new_key] = row[old_key]
            return item

        return row

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the processed data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["XenoCantoAnnotatedJeantet23", dict[str, Any]]:
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )
        if dataset_config.transformations:
            meta = ds.apply_transformations(dataset_config.transformations)
            return ds, meta
        return ds, {}

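    def get_available_labels(self, anno_column: str = "Species") -> List[str]:
        """Return all labels found in a given annotation column.

        Parameters
        ----------
        anno_column : str, default="Species"
            Column of the selection tables to collect labels from.

        Returns
        -------
        List[str]
            Sorted list of the unique labels found in anno_column.
        """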
        available_labels = set()
        for row in self._data:
            st = pd.read_csv(StringIO(row["selection_table"]), sep="\t")
            available_labels.update(st[anno_column].astype(str).tolist())
        return sorted(available_labels)

    def __str__(self) -> str:
        base = f"{self.info.name} (v{self.info.version})"
        return (
            f"{base}\n"
            f"Sources: {self.info.sources}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
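Because `__len__` raises `NotImplementedError` in streaming mode, code that needs a sample count has to fall back to iteration. The following is a minimal sketch of that pattern. `StreamingRows` is a hypothetical stand-in for a dataset opened with `streaming=True`, and `count_samples` is not part of the library.

```python
class StreamingRows:
    """Hypothetical stand-in for a dataset opened with streaming=True."""

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        # Same behaviour as the dataset classes in streaming mode
        raise NotImplementedError("Length is not available in streaming mode.")

    def __iter__(self):
        return iter(self._rows)


def count_samples(dataset) -> int:
    """Return len(dataset), falling back to a full pass when length is unavailable."""
    try:
        return len(dataset)
    except NotImplementedError:
        # A generator avoids any length lookup on the dataset itself
        return sum(1 for _ in dataset)


print(count_samples(StreamingRows(["a", "b", "c"])))
```

With the real datasets, iterating runs `_process` on every row, which reads and decodes audio, so counting this way can be slow.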

`ZebraFinchJulieElie`

📊 Dataset Information

Name	`zebra_finch_julie_elie`
Version	`0.1.0`
Owner	marius
License	CC-BY-4.0, CC0
Sources	Julie Elie
Available Splits	`test`, `train`, `val`, `full_dataset`

Description:

Vocal repertoires from adult and chick, male and female zebra finches (Taeniopygia guttata)

Zebra Finch Julie Elie dataset

Description

Vocal repertoires from adult and chick, male and female zebra finches (Taeniopygia guttata) including bird id, call type, age.

References

Elie JE and Theunissen FE. The vocal repertoire of the domesticated zebra finch: a data driven approach to decipher the information-bearing acoustic features of communication signals. Animal Cognition. 2016. 19(2) 285-315

DOI 10.1007/s10071-015-0933-6

https://figshare.com/articles/dataset/Vocal_repertoires_from_adult_and_chick_male_and_female_zebra_finches_Taeniopygia_guttata_/11905533/1

Examples:

>>> from alp_data.datasets import ZebraFinchJulieElie
>>> dataset = ZebraFinchJulieElie(
...     split="test",
...     output_take_and_give={"label": "label"},
...     sample_rate=16000,
... )

Source code in alp_data/datasets/zebra_finch_julie_elie.py

@register_dataset
class ZebraFinchJulieElie(Dataset):
    """Zebra Finch Julie Elie dataset

    Description
    -----------
    Vocal repertoires from adult and chick, male and female zebra finches
    (Taeniopygia guttata) including bird id, call type, age.

    References
    ----------
    Elie JE and Theunissen FE. The vocal repertoire of the domesticated zebra finch:
    a data driven approach to decipher the information-bearing acoustic features of
    communication signals. Animal Cognition. 2016. 19(2) 285-315

    DOI 10.1007/s10071-015-0933-6

    https://figshare.com/articles/dataset/Vocal_repertoires_from_adult_and_chick_male_and_female_zebra_finches_Taeniopygia_guttata_/11905533/1

    Examples
    --------
    >>> from alp_data.datasets import ZebraFinchJulieElie
    >>> dataset = ZebraFinchJulieElie(
    ...     split="test",
    ...     output_take_and_give={"label": "label"},
    ...     sample_rate=16000,
    ... )
    """

    info = DatasetInfo(
        name="zebra_finch_julie_elie",
        owner="marius",
        split_paths={
            "test": f"{_CSV_ROOT}/test.csv",
            "train": f"{_CSV_ROOT}/train.csv",
            "val": f"{_CSV_ROOT}/val.csv",
            "full_dataset": f"{_CSV_ROOT}/full_dataset.csv",
        },
        version="0.1.0",
        description=(
            "Vocal repertoires from adult and chick, male and female zebra finches "
            "(Taeniopygia guttata)"
        ),
        sources=["Julie Elie"],
        license="CC-BY-4.0, CC0",
    )

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: dict[str, str] | None = None,
        sample_rate: int | None = None,
        data_root: str | AnyPathT | None = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        """Initialize the Zebra Finch Julie Elie dataset.

        Parameters
        ----------
        split : str
            The split to load. One of info.split_paths keys.
        output_take_and_give : dict[str, str]
            A dictionary mapping the original column names to the new column names.
            It acts as a filter as well.
        sample_rate : int
            The sample rate to which audio files should be resampled.
        data_root : str | AnyPathT, optional
            The root directory for the dataset. This is prepended to the
            `local_path` value of each sample to build the full audio path.
            If None, defaults to the grandparent directory of the split CSV
            (the parent of its `csv_data` directory).
        backend : BackendType, optional
            The backend to use ("pandas" or "polars"), by default "polars"
        streaming : bool, optional
            Whether to use streaming mode, by default False
        """
        super().__init__(output_take_and_give, backend=backend, streaming=streaming)
        self.split = split
        self.sample_rate = sample_rate
        self.data_root = data_root
        if self.data_root is None:
            # we assume that parent dir of the csv_data directory is the data root
            # The split path is: .../raw/csv_data/test.csv
            # We want the data root to be: .../raw/
            split_path = anypath(self.info.split_paths[self.split])
            self.data_root = split_path.parent.parent

        self._data: pd.DataFrame = None
        self._load()  # Load the dataset (fills self._data)

    @property
    def columns(self) -> list[str]:
        """Return the columns of the dataset."""
        return list(self._data.columns)

    @property
    def available_splits(self) -> list[str]:
        """Return the available splits of the dataset."""
        return list(self.info.split_paths.keys())

    def _load(self) -> None:
        """Load the dataset.

        Raises
        ------
        LookupError
            If the split is not valid.
        """
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. Expected one of {list(self.info.split_paths.keys())}"
            )

        location = self.info.split_paths[self.split]
        self._data = self._backend_class.from_csv(location, streaming=self._streaming)

    @classmethod
    def from_config(
        cls, dataset_config: DatasetConfig
    ) -> tuple["ZebraFinchJulieElie", dict[str, Any]]:
        """Create a Dataset instance from a configuration dictionary.

        Parameters
        ----------
        dataset_config : DatasetConfig
            Configuration dictionary containing dataset parameters

        Returns
        -------
        tuple[Dataset, dict[str, Any]]
            A tuple containing the dataset instance and metadata.
            If the dataset_config contains transformations, they will be applied
            and the metadata will be returned as dict, otherwise empty dict.
        """
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        ds = cls(
            split=cfg["split"],
            output_take_and_give=cfg["output_take_and_give"],
            data_root=cfg["data_root"],
            sample_rate=cfg["sample_rate"],
            backend=cfg["backend"],
            streaming=cfg["streaming"],
        )

        if dataset_config.transformations:
            transform_metadata = ds.apply_transformations(dataset_config.transformations)
            return ds, transform_metadata

        return ds, {}

    def __len__(self) -> int:
        """Return the number of samples in the dataset.

        Returns
        -------
        int
            Number of samples in the current split.

        Raises
        ------
        RuntimeError
            If no split has been loaded yet.
        """
        if self._data is None:
            raise RuntimeError("No split has been loaded yet. Call _load() first.")
        if self._streaming:
            raise NotImplementedError("Length is not available in streaming mode.")
        return len(self._data)

    def _process(self, row: dict[str, Any]) -> dict[str, Any]:
        # Build the full audio path from the data root and the row's local path
        audio_path = anypath(self.data_root) / row["local_path"]

        # Read the audio clip
        audio, sample_rate = read_audio(audio_path)
        audio = audio.astype(np.float32)
        # Stereo to mono if necessary.
        audio = audio_stereo_to_mono(audio, mono_method="average")

        if self.sample_rate is not None and sample_rate != self.sample_rate:
            audio = librosa.resample(
                y=audio,
                orig_sr=sample_rate,
                target_sr=self.sample_rate,
                scale=True,
                res_type="kaiser_best",
            )
            sample_rate = self.sample_rate

        row["audio"] = audio
        row["sample_rate"] = sample_rate

        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    def __getitem__(self, idx: int) -> dict[str, Any]:
        """Get a specific sample from the dataset.

        Parameters
        ----------
        idx : int
            Index of the sample to get.

        Returns
        -------
        dict[str, Any]
            A dictionary containing the data.
        """
        row = self._data[idx]
        return self._process(row)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Iterate over samples in the dataset.

        Yields
        -------
        Dict[str, Any]
            Each sample in the dataset.
        """
        for row in self._data:
            yield self._process(row)

    def __str__(self) -> str:
        """Return a string representation of the dataset.

        Returns
        -------
        str
            A string representation of the dataset including its name, version,
            split, description, sources, license, and available splits.
        """
        base_info = f"{self.info.name} (v{self.info.version}), split='{self.split}'"

        return (
            f"{base_info}\n"
            f"Description: {self.info.description}\n"
            f"Sources: {', '.join(self.info.sources)}\n"
            f"License: {self.info.license}\n"
            f"Available splits: {', '.join(self.info.split_paths.keys())}"
        )
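The default data root used by `ZebraFinchJulieElie.__init__` is two levels above the split CSV, as the comments in the constructor describe. A minimal sketch of that derivation follows. The `/data/zebra_finch` prefix and the clip file name are made up, `default_data_root` is a hypothetical helper, and `PurePosixPath` stands in for `anypath`.

```python
from pathlib import PurePosixPath


def default_data_root(split_path: str) -> PurePosixPath:
    """Return the directory two levels above the split CSV (.../raw/csv_data/x.csv -> .../raw)."""
    return PurePosixPath(split_path).parent.parent


# Hypothetical location following the .../raw/csv_data/test.csv layout
split_path = "/data/zebra_finch/raw/csv_data/test.csv"
root = default_data_root(split_path)

# The row's local_path is then joined onto the data root
audio_path = root / "audio/clip_0001.wav"
print(root)
print(audio_path)
```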

Using your own dataset

First of all, you must answer an important question: is this new dataset relatively stable over time and potentially useful to others? If yes, then you should talk to the engineering team to add it as an official ESP Dataset. If not, you can just follow the next steps!

To create a new dataset, you need to subclass the base Dataset class and implement several key components. Here's a step-by-step guide:

1. Basic Structure

from alp_data import Dataset, DatasetConfig, DatasetInfo, register_dataset
from alp_data.io import anypath, AnyPathT
from typing import Any, Dict, Optional
import pandas as pd

@register_dataset
class MyCustomDataset(Dataset):
    """My custom dataset description.

    Parameters
    ----------
    split : str
        The split to load. One of info.split_paths keys.
    output_take_and_give : dict[str, str], optional
        A dictionary mapping the original column names to the new column names.
    data_root : str | AnyPathT, optional
        Custom data root directory.
    """

    # Define dataset metadata
    info = DatasetInfo(
        name="my_custom_dataset",
        owner="your_name",
        split_paths={
            "train": "path/to/train.csv",
            "validation": "path/to/validation.csv",
        },
        version="0.1.0",
        description="Description of your dataset",
        sources=["Source 1", "Source 2"],
        license="Your License",
    )

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: Optional[dict[str, str]] = None,
        data_root: Optional[str | AnyPathT] = None,
    ) -> None:
        """Initialize the dataset."""
        super().__init__(output_take_and_give)
        self.split = split
        # Set data_root before loading so _load can rely on it if needed
        self.data_root = data_root
        self._data = None
        self._load()

    def _load(self) -> None:
        """Load the dataset data."""
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )

        # Implement your data loading logic here
        location = self.info.split_paths[self.split]
        # Example: Load CSV data
        self._data = pd.read_csv(anypath(location))

    def __len__(self) -> int:
        """Return the number of samples in the dataset."""
        if self._data is None:
            raise RuntimeError("No split has been loaded yet.")
        return len(self._data)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """Get a specific sample from the dataset."""
        if idx < 0 or idx >= len(self._data):
            raise IndexError(f"Index {idx} out of bounds.")

        # Implement your sample loading logic here
        row = self._data.iloc[idx].to_dict()

        # Example: Load and process data
        if self.data_root:
            data_path = anypath(self.data_root) / row["path"]
        else:
            data_path = anypath(row["path"])

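        # Load your data (e.g., image, audio, text) from data_path
        data = None  # placeholder: replace with your loading code
        # Store the loaded data in the row so it is returned with the sample
        row["data"] = data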

        # Apply output_take_and_give if specified
        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row

        return item

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> "MyCustomDataset":
        """Create a Dataset instance from a configuration."""
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})

        split = cfg.get("split", None)
        if not split or split not in cls.info.split_paths:
            raise LookupError(
                f"Invalid split '{split}'. "
                f"Available splits: {', '.join(cls.info.split_paths.keys())}"
            )

        return cls(
            split=split,
            output_take_and_give=cfg.get("output_take_and_give", None),
            data_root=cfg.get("data_root"),
        )

2. Key Components to Implement

DatasetInfo:

- name: Unique identifier for your dataset
- owner: Dataset maintainer
- split_paths: Dictionary mapping split names to data paths
- version: Dataset version
- description: Brief description
- sources: List of data sources
- license: Dataset license

Required Methods:

- __init__: Initialize the dataset with split and configuration
- _load: Load the dataset data
- __len__: Return dataset size
- __getitem__: Get a specific sample
- from_config: Create dataset from configuration

Optional Methods:

- __iter__: Iterate over samples
- __str__: String representation
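
As a quick sanity check against the list of required methods, the sketch below reports which of them a class does not define. The helper `missing_methods` is hypothetical and not part of `alp_data`. It only checks that the names exist somewhere in the class hierarchy (ignoring `object`), so a placeholder inherited from a base class would also count as defined.

```python
REQUIRED_METHODS = ("__init__", "_load", "__len__", "__getitem__", "from_config")


def missing_methods(cls: type) -> list[str]:
    """Return the required method names that `cls` and its bases do not define.

    `object` is skipped because it provides a default `__init__`.
    """
    defined: set[str] = set()
    for klass in cls.__mro__:
        if klass is not object:
            defined.update(vars(klass))
    return [name for name in REQUIRED_METHODS if name not in defined]


class Incomplete:
    """Toy class that implements only part of the interface."""

    def __init__(self) -> None:
        self._data: list[dict] = []

    def __len__(self) -> int:
        return len(self._data)


print(missing_methods(Incomplete))  # ['_load', '__getitem__', 'from_config']
```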

3. Registration

Use the @register_dataset decorator to register your dataset:

@register_dataset
class MyCustomDataset(Dataset):
    # Your implementation
    pass
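
To illustrate what a registration decorator of this kind typically does, here is a minimal sketch of a name-to-class registry. This is not the `alp_data` implementation: the names `register`, `lookup` and `_REGISTRY`, as well as the case-insensitive lookup by class name, are assumptions made for the example.

```python
from typing import TypeVar

T = TypeVar("T", bound=type)

# Hypothetical registry: maps a lowercased dataset name to its class.
_REGISTRY: dict[str, type] = {}


def register(cls: T) -> T:
    """Store `cls` under its lowercased class name and return it unchanged."""
    key = cls.__name__.lower()
    if key in _REGISTRY:
        raise ValueError(f"Dataset '{key}' is already registered.")
    _REGISTRY[key] = cls
    return cls


def lookup(name: str) -> type:
    """Return the class registered under `name` (case-insensitive)."""
    try:
        return _REGISTRY[name.lower()]
    except KeyError:
        raise LookupError(
            f"Unknown dataset '{name}'. Known: {sorted(_REGISTRY)}"
        ) from None


@register
class ToyDataset:
    """Stand-in for a Dataset subclass."""


print(lookup("ToyDataset") is ToyDataset)  # True
```

Because the decorator returns the class unchanged, the decorated class can still be imported and instantiated directly, while a registry of this kind makes it possible to resolve a class from a name given in a configuration.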

4. Example Usage

Now, here is an example of how to use your new dataset:

# Create dataset instance
dataset = MyCustomDataset(
    split="train",
    output_take_and_give={"original_col": "new_col"}
)

# Access data
sample = dataset[0]
print(len(dataset))

# Use with transforms
from alp_data.transforms import Filter
filter_transform = Filter(property="category", values=["A", "B"])
dataset.apply_transformations([filter_transform])
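
To make the effect of the filter above concrete, here is a sketch on plain Python dictionaries. It assumes that `Filter(property=..., values=[...])` keeps the rows whose property value is one of the listed values; the actual transform may support other modes or options. The helper `filter_rows` is hypothetical and not part of `alp_data`.

```python
from typing import Any


def filter_rows(
    rows: list[dict[str, Any]], prop: str, values: list[Any]
) -> list[dict[str, Any]]:
    """Keep rows whose `prop` value is one of `values` (hypothetical helper)."""
    return [row for row in rows if prop in row and row[prop] in values]


rows = [
    {"path": "a.wav", "category": "A"},
    {"path": "b.wav", "category": "C"},
    {"path": "c.wav", "category": "B"},
    {"path": "d.wav"},  # rows without the property are dropped
]
kept = filter_rows(rows, prop="category", values=["A", "B"])
print([row["path"] for row in kept])  # ['a.wav', 'c.wav']
```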

