alp_data.datasets module
What are ESP Datasets?
The datasets module provides a collection of datasets validated by the engineering team for ESP projects. In short, this is the module to use to download, load and manipulate official ESP datasets. Each dataset is implemented as a class that inherits from the base Dataset class, providing a consistent interface for data loading and access.
More technically, an ESP Dataset is a class that:
- Inherits from the base Dataset class
- Has a defined DatasetInfo containing metadata
- Provides methods for loading and accessing data and splits
- Can be configured through a DatasetConfig
How to Load Datasets?
Datasets can be loaded in three different ways:
- Direct instantiation:
- Using configuration:
from alp_data import DatasetConfig
from alp_data.datasets import AnimalSpeak
# Create a configuration
config = DatasetConfig(
    dataset_name="animalspeak",
    split="validation",
)
# Create dataset from config
# This returns a tuple: the dataset and a dictionary of metadata.
# The metadata is generated by any transforms in the config which
# are applied to the dataset
dataset, _ = AnimalSpeak.from_config(config)
- From a config yaml file:
Your yaml config file should look like this for a single dataset (see Concatenate for multiple datasets):
Note the dataset key at the top level is required.
dataset:
  dataset_name: AnimalSpeak
  split: validation
  output_take_and_give:
    labels: label
  data_root: null
  transformations:
    - type: deduplicate
      subset: null
    - type: label_from_feature
      feature: species_common
      output_feature: label
      override: true
from alp_data import dataset_from_config
ds, transform_metadata = dataset_from_config("path/to/config.yaml")
print(len(ds))
Dataset Configuration
Datasets can be configured further through parameters, some of which are common to all datasets and some of which are specific to a single dataset. The common arguments are:
- split: The data split to use (e.g., "train", "validation").
- output_take_and_give: Column picker and name mappings. This is used to:
  - Pick the columns you want in the output dictionary returned when __getitem__ is called via x = sample[0].
  - Rename the columns in the output dictionary. For example, if you want to rename the "audio" column to "raw_wav", you can specify {"audio": "raw_wav"}.
- sample_rate: Target audio sample rate (for audio datasets, the audio is resampled to this rate).
- data_root: Custom root directory for data files. If not specified, data_root is set to the parent directory of the path to the split. The idea here is that the data may be copied from its original location (usually a bucket) to a local disk or a folder on the shared NFS.
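The picking and renaming behaviour of output_take_and_give can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the helper name take_and_give and the sample row are hypothetical, and it assumes the mapping goes from source column name to output name, as in the {"audio": "raw_wav"} example.

```python
def take_and_give(row: dict, mapping: dict) -> dict:
    """Keep only the columns named in `mapping` and rename them (hypothetical helper)."""
    return {new_name: row[old_name] for old_name, new_name in mapping.items()}


# A made-up row with three columns.
row = {
    "audio": [0.0, 0.1, -0.1],
    "species_common": "barn owl",
    "source": "xeno-canto",
}

# Pick two columns and rename both; "source" is dropped because it is not listed.
out = take_and_give(row, {"audio": "raw_wav", "species_common": "label"})
print(out)
```

Columns that are not listed in the mapping do not appear in the output, which is why the same argument serves as both a picker and a renamer.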
Using Transforms with Datasets
Datasets can be combined with Transforms to modify or enhance the data during loading. Transforms modify the data in place, so the resulting dataset is effectively a different version of the original data.
Basic Usage with Transforms
Transforms can be applied sequentially: first get the original dataset, then apply the transforms:
Remark
The order of the transforms is important. Multiple transforms are applied in the order in which they are defined in the configuration. For example, if you change the name of a column with LabelFromFeatureTransform, it will affect the Filter transform.
from alp_data.datasets import AnimalSpeak
from alp_data.transforms import FilterConfig, LabelFromFeatureConfig
# Create a dataset
aspeak_output_map = {
    "audio": "raw_wav"  # maps the "audio" column to "raw_wav" in output
}
dataset = AnimalSpeak(split="validation", output_take_and_give=aspeak_output_map)
# Create transform configurations
filter_config = FilterConfig(
    type="filter",
    property="source",
    values=["xeno-canto", "iNaturalist"],
    mode="include"
)
label_from_feature_config = LabelFromFeatureConfig(
    type="label_from_feature",
    feature="canonical_name",
    output_feature="label"
)
# Transforms are applied in list order: filter first, then label_from_feature
dataset.apply_transformations([filter_config, label_from_feature_config])
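To see why the order of transforms matters, here is a small self-contained sketch using plain dictionaries. The two functions are hypothetical stand-ins for a filter step and a column-renaming step; they are not alp_data code, and the rows are made up.

```python
def filter_rows(rows, prop, values):
    """Keep rows whose `prop` value is in `values` (an "include" filter)."""
    return [r for r in rows if r.get(prop) in values]


def rename_column(rows, old, new):
    """Return copies of the rows with column `old` renamed to `new`."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]


rows = [
    {"source": "xeno-canto", "canonical_name": "Tyto alba"},
    {"source": "Watkins", "canonical_name": "Orcinus orca"},
]

# Filter first, then rename: the filter still sees "canonical_name".
a = rename_column(
    filter_rows(rows, "canonical_name", ["Tyto alba"]), "canonical_name", "label"
)

# Rename first, then filter: "canonical_name" no longer exists, so nothing matches.
b = filter_rows(
    rename_column(rows, "canonical_name", "label"), "canonical_name", ["Tyto alba"]
)

print(len(a), len(b))
```

The same two steps give one row in the first ordering and none in the second, which is the kind of interaction the remark above warns about.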
Using Transforms in Dataset Configuration
Transforms can also be specified in the dataset configuration to be automatically applied when the dataset is instantiated.
from alp_data import DatasetConfig
from alp_data.datasets import AnimalSpeak
from alp_data.transforms import FilterConfig, LabelFromFeatureConfig
# Create transform configurations
filter_config = FilterConfig(
    type="filter",
    property="source",
    values=["xeno-canto", "iNaturalist"],
    mode="include"
)
label_config = LabelFromFeatureConfig(
    type="label_from_feature",
    feature="canonical_name",
    output_feature="label"
)
# Create dataset configuration with transforms
config = DatasetConfig(
    dataset_name="animalspeak",
    split="validation",
    transformations=[filter_config, label_config]
)
# Create dataset with transforms
dataset, metadata = AnimalSpeak.from_config(config)
print(metadata.keys())
# dict_keys(['filter', 'label_from_feature'])
print(metadata["label_from_feature"].keys())
# dict_keys(['label_feature', 'label_map', 'num_classes'])
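The metadata keys printed above hint at what a label-from-feature step records. The sketch below is only an assumption about the general idea, not the library's code: it builds an integer label map from the sorted unique values of a feature and reports the number of classes. The function name, the sorted ordering, and the integer encoding are all hypothetical.

```python
def label_from_feature(rows, feature, output_feature="label"):
    """Hypothetical stand-in: encode `feature` as integer labels and return metadata."""
    classes = sorted({r[feature] for r in rows})
    label_map = {name: idx for idx, name in enumerate(classes)}
    labelled = [{**r, output_feature: label_map[r[feature]]} for r in rows]
    metadata = {
        "label_feature": feature,
        "label_map": label_map,
        "num_classes": len(label_map),
    }
    return labelled, metadata


# Made-up rows for illustration.
rows = [
    {"canonical_name": "Tyto alba"},
    {"canonical_name": "Corvus corax"},
    {"canonical_name": "Tyto alba"},
]
labelled, metadata = label_from_feature(rows, "canonical_name")
print(metadata)
```

Keeping the label map in the returned metadata makes it possible to decode predictions back to class names later.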
Available Datasets
The list of available datasets will grow over time. Please refer to the next section if you wish to use your own Dataset or add a new one to the list of officially supported ones.
AnimalSoundArchive
📊 Dataset Information
| Name | animal-sound-archive |
| Version | 0.1.0 |
| Owner | david |
| License | mostly CC-BY-NC-SA (unversioned) |
| Sources | Tierstimmenarchiv (Museum für Naturkunde Berlin) |
| Available Splits | train, validation, all, train_excl_beanszero, validation_excl_beanszero, all_excl_beanszero |
Animal Sound Archive (Tierstimmenarchiv) audio dataset with taxonomic metadata. ~46k recordings of birds, mammals, insects, amphibians and other taxa from Museum für Naturkunde Berlin. Available at original (variable) sample rates, 16kHz, and 32kHz (pre-resampled). Pre-resampled audio uses librosa's kaiser_best resampling method. Train/val split: val_size=3000, random seed 42.
Animal Sound Archive (Tierstimmenarchiv) audio dataset.
Description
The Tierstimmenarchiv (Animal Sound Archive) at the Museum für Naturkunde Berlin hosts ~46k downloadable recordings covering birds, mammals, insects, amphibians, and other taxa. Recordings are linked to GBIF backbone taxonomy.
Audio is available as original MP3 files (variable sample rate) and pre-resampled WAV at 16kHz and 32kHz using librosa's kaiser_best method.
Available Splits
- train: Training set (all minus 3000 held-out samples, random split)
- validation: Validation set (3000 samples, random split)
- all: Complete dataset (train + validation)
- train_excl_beanszero: Training set excluding taxa evaluated in BEANS-Zero benchmark
- validation_excl_beanszero: Validation set excluding taxa evaluated in BEANS-Zero benchmark
- all_excl_beanszero: Complete dataset excluding BEANS-Zero taxa
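A seeded random hold-out of this kind can be sketched with the standard library. This is an illustration of the idea only: the exact procedure used to build the released splits is not documented here, the function name is hypothetical, and the integer ids stand in for the roughly 46k recordings.

```python
import random


def holdout_split(ids, val_size, seed):
    """Shuffle ids with a fixed seed and hold out `val_size` of them for validation."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[val_size:], shuffled[:val_size]


ids = list(range(46_000))  # stand-in for recording identifiers
train_ids, val_ids = holdout_split(ids, val_size=3000, seed=42)
print(len(train_ids), len(val_ids))
```

Because the generator is seeded, calling the function again with the same arguments yields the same partition.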
References
Tierstimmenarchiv: https://www.tierstimmenarchiv.de/
Examples:
>>> from alp_data.datasets import AnimalSoundArchive
>>> dataset = AnimalSoundArchive(
... split="train",
... output_take_and_give={"canonical_name": "species"},
... streaming=True
... )
>>> print(dataset.info.name)
animal-sound-archive
>>> print(dataset.available_sample_rates)
[32000, 16000]
Source code in alp_data/datasets/animal_sound_archive.py
AnimalSpeak
📊 Dataset Information
| Name | animalspeak |
| Version | 0.1.0 |
| Owner | david; marius; masato |
| License | CC BY |
| Sources | Xeno-canto, iNaturalist, Watkins |
| Available Splits | train, validation |
AnimalSpeak dataset
AnimalSpeak dataset.
Description
AnimalSpeak, which is part of the training data for NatureLM and BioLingual, has over a million audio-caption pairs holding information on species, vocalization context, and animal behavior.
References
TRANSFERABLE MODELS FOR BIOACOUSTICS WITH HUMAN LANGUAGE SUPERVISION Robinson et al 2023 https://arxiv.org/pdf/2308.04978
Examples:
>>> from alp_data.datasets import AnimalSpeak
>>> dataset = AnimalSpeak(
... split="validation",
... output_take_and_give={"species_common": "comm"}
... )
>>> print(dataset.info.name)
animalspeak
Source code in alp_data/datasets/animalspeak.py
AnuraSetStrong
📊 Dataset Information
| Name | anuraset_strong |
| Version | 0.1.0 |
| Owner | benjamin |
| License | CC BY 1.0 |
| Sources | Zenodo |
| Available Splits | all |
AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring by Canas et al. (2023): We introduce a large-scale multi-species dataset of anuran amphibians calls recorded by PAM, that comprises 27 hours of expert annotations for 42 different species from two Brazilian biomes.
AnuraSetStrong Dataset
Description
This is the strongly labeled portion of AnuraSet, i.e. the portion with start- and stop-times annotated.
Description from "AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring" by Canas et al. (2023)
"We introduce a large-scale multi-species dataset of anuran amphibians calls recorded by PAM, that comprises 27 hours of expert annotations for 42 different species from two Brazilian biomes.
To provide precise annotations, we identified bouts of advertisement calls within each audio file and generated strong labels for them (step 1). Using Audacity 3.2 software, we conducted a detailed visual and aural inspection of the spectrogram to identify temporal limits (beginning and end) of audio segments containing species-specific calls with an inter-call interval of less than 1 second. These annotations ensured fine-scale specificity (Figure 3). For longer intervals, we split the calls into different time boxes and labeled them independently. Detailed labels assigned to time boxes were composed of (i) the species ID, tagged with a unique 6-letter code built from the scientific name of each identified species (Table 2), and (ii) the perceived quality of the recorded signal, included as a single letter indicating a Low ('L'), Medium ('M'), or High ('H') quality (Figure 4). To ensure consistency among the perceptual quality labels, we set up the following criteria: A high-quality call has a high signal-to-noise ratio, no overlap with other sounds, has a well-identifiable structure on the spectrogram, and can be easily visualized on the oscillogram. A medium-quality call can be visually identified on the spectrogram but may overlap with other sounds that can be difficult to identify in the oscillogram. A low-quality call shows a low signal-to-noise ratio, is partially masked by other sounds, appears with low intensity on the spectrogram, and cannot be easily identified on the oscillogram. This information was used to increase the usability of the data and improve the error analysis of the learning model."
Note that we omitted the quality assessments.
Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
Pre-resampled Audio
Pre-resampled audio is available at 16 kHz and 32 kHz. When
sample_rate matches one of these rates, the pre-resampled files are
loaded directly (no on-the-fly resampling). For any other target rate,
audio is resampled on-the-fly using librosa's kaiser_best method.
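The rule described above amounts to a simple membership check. The sketch below only illustrates that decision; the function name and the returned strings are hypothetical, and no audio is actually loaded or resampled.

```python
PRE_RESAMPLED_RATES = (16000, 32000)  # rates stated in the text above


def loading_plan(sample_rate: int, available=PRE_RESAMPLED_RATES) -> str:
    """Describe how audio would be obtained for a requested sample rate."""
    if sample_rate in available:
        # A pre-resampled copy exists, so no resampling is needed at load time.
        return f"load pre-resampled {sample_rate} Hz files"
    # Otherwise fall back to resampling while loading.
    return f"load original audio and resample on the fly to {sample_rate} Hz"


for rate in (16000, 32000, 22050):
    print(rate, "->", loading_plan(rate))
```

Requesting one of the pre-resampled rates avoids the per-sample resampling cost, which matters most when iterating over a dataset many times.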
References
https://arxiv.org/pdf/2307.06860
Source code in alp_data/datasets/anuraset.py
ArcticBirdSounds
📊 Dataset Information
| Name | arctic_bird_sounds |
| Version | 0.1.0 |
| Owner | benjamin |
| License | CC-BY-4.0 |
| Sources | OSF |
| Available Splits | all |
ArcticBirdSounds Dataset
Description
Recordings of birds in the arctic. Bird vocalizations are boxed (start and stop, high and low freq) and labeled with species. Description from the original publication:
"Tracking biodiversity shifts is central to understanding past, present, and future global changes. Recent advances in bioacoustics and the low cost of high-quality automatic recorders are revolutionizing studies in biogeography and community and behavioral ecology with a robust assessment of phenology, species occurrence, and individual activity. This large volume of acoustic recordings has recently generated a plethora of datasets that can now be handled automatically, mostly via big data methods such as deep learning. These approaches need high-quality annotations to classify and detect recorded sounds efficiently. However, very few strongly annotated datasets—that is, with detailed information on start and end time of each vocalization—are openly accessible to the public. Moreover, these datasets mostly cover temperate species and are usually limited to a single year of recordings. Here, we present ArcticBirdSounds, the first open- access, multisite, and multiyear strongly annotated dataset of arctic bird vocalizations. ArcticBirdSounds offers 20 h of annotated recordings over 2 years (2018, 2019), taken from 15 distinct plots within six locations across the Arctic, from Alaska to Greenland. Recordings cover the arctic vertebrates' breeding period and are evenly spaced during the day; they capture most species breeding there with 12,933 temporal annotations in 49 classes of sounds. While these data can be used for many pressing ecological questions, it is also a unique resource for methodological development to help meet the challenges of fast ecosystem transformations such as those happening in the Arctic. All data, including audio files, annotation files, and companion spreadsheets, are available in an Open Science Framework repository published under a CC BY 4.0 License."
Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
Note that some species labels are unknown, and labeled as "Unknown"
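As a rough illustration of working with such a table, the sketch below parses a tab-separated selection table with the standard library and drops the "Unknown" rows. The rows are made up, and the column names (in particular "Species") are assumptions; the columns actually returned by the dataset may differ.

```python
import csv
import io

# A made-up, tab-separated selection table in the style of a Raven export.
table_text = (
    "Selection\tBegin Time (s)\tEnd Time (s)\tLow Freq (Hz)\tHigh Freq (Hz)\tSpecies\n"
    "1\t0.50\t1.75\t2000\t6000\tLapland Longspur\n"
    "2\t3.10\t3.90\t1500\t4000\tUnknown\n"
)

rows = list(csv.DictReader(io.StringIO(table_text), delimiter="\t"))

# Keep only annotations with a known species label.
known = [r for r in rows if r["Species"] != "Unknown"]

# Duration of each remaining annotation, in seconds.
durations = [float(r["End Time (s)"]) - float(r["Begin Time (s)"]) for r in known]
print(len(rows), len(known), durations)
```

Whether to drop or keep the "Unknown" annotations depends on the task; for detection they may still be useful as generic events.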
References
https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.4047
https://osf.io/b9trx/overview
Source code in alp_data/datasets/arctic_bird_sounds.py
AudioSet
📊 Dataset Information
| Name | audioset |
| Version | 0.1.0 |
| Owner | david; marius; masato |
| License | CC BY 4.0 |
| Sources | YouTube |
| Available Splits |
AudioSet dataset
AudioSet dataset.
Description
AudioSet is a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research, using a carefully structured hierarchical ontology of 632 audio classes in 10-second segments of YouTube videos.
References
AUDIO SET: AN ONTOLOGY AND HUMAN-LABELED DATASET FOR AUDIO EVENTS Gemmeke et al 2017 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45857.pdf
The train and validation splits (balanced and unbalanced) correspond to the official ones in the paper (https://research.google.com/audioset/download.html). The train-animal, train-noise, validation-animal, and validation-noise splits are created for animal and non-animal (noise) classes in the ontology.
The "caption" column contains the caption from AudioSetCaps when available.
AudioSetCaps Paper: https://arxiv.org/abs/2411.18953
AudioSetCaps Dataset: https://huggingface.co/datasets/baijs/AudioSetCaps
Note these are empty with the exception of the unbalanced_train split of the V1 dataset.
Note that AudioSet contains different files depending on YouTube video availability at time of download. Version 0.1.0 contains a dump of AudioSet pulled in 2021 and resampled to 16 kHz. Version 0.2.0 contains a larger set of audio files pulled from this HuggingFace release https://huggingface.co/datasets/agkphysics/AudioSet and maintains the sample rates of the original files.
Pre-resampled Audio
Version 0.2.0 includes pre-resampled 32kHz audio that can be loaded directly without on-the-fly resampling for faster data loading:
>>> # Load with pre-resampled 32kHz audio (v0.2.0, no resampling needed)
>>> dataset_32k = AudioSet(split="validation", version="0.2.0", sample_rate=32000,
...     streaming=True)
>>> print(dataset_32k.available_sample_rates)
[32000]
>>> # Load with on-the-fly resampling to 16kHz
>>> dataset_16k = AudioSet(split="validation", version="0.2.0", sample_rate=16000,
...     streaming=True)
Examples:
>>> from alp_data.datasets import AudioSet
>>> dataset = AudioSet(
... split="train",
... output_take_and_give={"label": "audio_label"},
... version="0.1.0",
... streaming=True
... )
>>> print(dataset.info.name)
audioset
Source code in alp_data/datasets/audioset.py
AudioSetStrong
📊 Dataset Information
| Name | audioset_strong |
| Version | 0.1.0 |
| Owner | david; marius; masato |
| License | CC BY 4.0 |
| Sources | YouTube |
| Available Splits | train, train-environmental |
AudioSet Strong: Strongly-labeled subset with temporal annotations
AudioSet Strong Dataset
Description
AudioSet Strong is a strongly-labeled subset of AudioSet with temporal annotations (start and end times) for sound events. This dataset provides precise timing information for when each sound event occurs within the 10-second audio clips.
AudioSet is a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research, using a carefully structured hierarchical ontology of 632 audio classes in 10-second segments of YouTube videos.
This class makes the AudioSet Strong subset available in the alp-data strongly-labeled format, where each entry consists of:
- An audio recording (10 seconds, pre-resampled to 32kHz)
- A selection table with temporal annotations (begin time, end time, label)
The strong labels provide temporal boundaries for sound events, making this dataset suitable for sound event detection and temporal localization tasks.
AudioSet recordings include those available in this huggingface dataset: https://huggingface.co/datasets/agkphysics/AudioSet
Available Splits
- train: AudioSet Strong training set with 32kHz pre-resampled audio (8115 rows).
- train-environmental: Filtered to rows where ALL labels are environmental sounds (from AudioSet's environmental subset). 1109 rows.
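The "ALL labels are environmental" condition can be sketched as a filter over rows. This is an illustration under stated assumptions: the label names in ENVIRONMENTAL are a made-up stand-in for AudioSet's environmental subset, the rows are invented, and the function name is hypothetical.

```python
# Hypothetical stand-in for the set of environmental label names.
ENVIRONMENTAL = {"Wind", "Rain", "Stream"}


def all_environmental(labels, env=ENVIRONMENTAL):
    """True only if the row has labels and every one of them is environmental."""
    return len(labels) > 0 and all(label in env for label in labels)


rows = [
    {"id": "a", "labels": ["Wind", "Rain"]},
    {"id": "b", "labels": ["Wind", "Speech"]},  # one non-environmental label
    {"id": "c", "labels": ["Stream"]},
]

env_rows = [r for r in rows if all_environmental(r["labels"])]
print([r["id"] for r in env_rows])
```

A row with a single non-environmental event is excluded entirely, which explains why the filtered split is much smaller than the full training set.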
References
AUDIO SET: AN ONTOLOGY AND HUMAN-LABELED DATASET FOR AUDIO EVENTS Gemmeke et al. 2017 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45857.pdf
AudioSet Homepage: https://research.google.com/audioset/
Examples:
>>> from alp_data.datasets import AudioSetStrong
>>> dataset = AudioSetStrong(split="train", sample_rate=32000)
>>> print(len(dataset))
8115
>>> item = dataset[0]
>>> keys = sorted([k for k in item.keys() if k != '32khz_path'])
>>> len(keys)
7
>>> 'sample_rate' in keys and 'audio' in keys
True
>>> print(list(item['selection_table'].columns))
['Selection', 'Begin Time (s)', 'End Time (s)', 'Label']
>>> env_dataset = AudioSetStrong(split="train-environmental", sample_rate=32000)
>>> print(len(env_dataset))
1109
Source code in alp_data/datasets/audioset_strong.py
Beans
📊 Dataset Information
| Name | beans |
| Version | 0.1.0 |
| Owner | gagan |
| License | CC-BY-4.0, CC0 |
| Sources | cbi, watkins, dogs, egyptian_fruit_bats, hiceas, dcase, enabirds, esc50, speech_commands, humbugdb, rfcx, hainan_gibbons |
| Available Splits | train, validation, test, cbi_test, cbi_validation, cbi_train, watkins_test, watkins_validation, watkins_train, dogs_test, ... (39 total) |
BEANS benchmark dataset
BEANS dataset
Description
BEANS (the BEnchmark of ANimal Sounds) is a collection of bioacoustics tasks and public datasets, specifically designed to measure the performance of machine learning algorithms in the field of bioacoustics. The benchmark proposed here consists of two common tasks in bioacoustics: classification and detection. It includes 12 datasets covering various species, including birds, land and marine mammals, anurans, and insects.
References
BEANS: The Benchmark of Animal Sounds Masato Hagiwara et al 2022 https://arxiv.org/abs/2210.12300 https://github.com/earthspecies/beans
Examples:
>>> from alp_data.datasets import Beans
>>> dataset = Beans(
... split="validation",
... output_take_and_give={"species_scientific": "species"},
... sample_rate=16000,
... streaming=True,
... )
Source code in alp_data/datasets/beans.py
BeansZero
📊 Dataset Information
| Name | beans_zero |
| Version | 0.1.0 |
| Owner | gagan, masato, david, marius |
| License | CC-BY-4.0, CC0 |
| Sources | Xeno-canto, iNaturalist, Animal Sound Archive, Elie and Theunissen 2016, Beans, esc50, rfcx, CBI, HumBugDB, Enabirds, HICEAS, Watkins, Gibbons, DCASE-2021-Task-5 |
| Available Splits | test, cbi, watkins, hiceas, dcase, enabirds, esc50, humbugdb, rfcx, gibbons, ... (23 total) |
BEANS-Zero benchmark dataset
BEANS-Zero dataset
Description
BEANS-Zero is a bioacoustics benchmark designed to evaluate multimodal audio-language models in zero-shot settings. Introduced in the NatureLM-audio paper (Robinson et al., 2025), it brings together tasks from both existing datasets and newly curated resources. The benchmark focuses on models that take a bioacoustic audio input (e.g., bird or mammal vocalizations) and a text instruction (e.g., "What species is in this audio?"), and return a textual output (e.g., "Taeniopygia guttata"). As a zero-shot benchmark, BEANS-Zero contains only a test split; no training or in-context examples are provided.
References
NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin https://openreview.net/forum?id=hJVdwBpWjt
Huggingface Dataset: https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero
Examples:
>>> from alp_data.datasets import BeansZero
>>> dataset = BeansZero(
... split="test",
... output_take_and_give={"output": "species"},
... sample_rate=16000,
... streaming=True,
... )
>>> sample = next(iter(dataset))
>>> print(sample["species"])
None
Source code in alp_data/datasets/beans_zero.py
BengaleseFinchCalls
📊 Dataset Information
| Name | Bengalese Finch Calls |
| Version | 0.1.0 |
| Owner | david |
| License | CC-BY-4.0, CC0 |
| Sources | BanglaJp_whistle |
| Available Splits | Bird0, Bird1, Bird2, Bird3, Bird4, Bird5, Bird6, Bird7, Bird8, Bird9, ... (55 total) |
Bengalese Finch calls annotated with call-type and individual IDs, organized by individual birds.
Bengalese Finch call-type dataset with individual bird splits.
Source code in alp_data/datasets/bengalese_finch_calls.py
BirdSet
📊 Dataset Information
| Name | birdset |
| Version | 0.1.0 |
| Owner | marius; gagan; david |
| License | CC-BY-4.0, CC0 |
| Sources | HSN, NBP, NES, PER, POW, SSW, SNE, UHH |
| Available Splits | HSN-test, HSN-test_5s, NBP-test, NBP-test_5s, NES-test, NES-test_5s, PER-test, PER-test_5s, POW-test, POW-test_5s, ... (17 total) |
BirdSet avian bioacoustics benchmark with GBIF-linked taxonomy. Pre-resampled audio available at 16 kHz and 32 kHz (WAV). Original audio is 32 kHz OGG from the BirdSet HuggingFace repository.
BirdSet avian bioacoustics benchmark dataset.
Description
BirdSet is a large-scale benchmark dataset for audio classification focusing on avian bioacoustics. It includes over 6,800 recording hours from nearly 10,000 species for training and more than 400 hours across eight strongly labeled evaluation datasets. This version (v0.1.0) contains the eight evaluation subsets with test and test_5s splits, GBIF-linked taxonomy, and pre-resampled 16 kHz / 32 kHz WAV audio. The training data is not included in this dataset, but is a subset of the Xeno-canto dataset.
Available Splits
Each of the eight evaluation subsets has two splits:
- {SUBSET}-test: Full-length soundscape recordings (variable duration)
- {SUBSET}-test_5s: 5-second clips extracted from test recordings
Subsets: HSN, NBP, NES, PER, POW, SSW, SNE, UHH.
all: Combined dataset across all subsets and splits.
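The split naming scheme above can be enumerated with a few lines of plain Python. This is only a sketch of the naming convention described here, not code taken from the BirdSet class; the helper name `birdset_split_names` is hypothetical.

```python
# Subset codes and split suffixes as listed in the description above.
SUBSETS = ["HSN", "NBP", "NES", "PER", "POW", "SSW", "SNE", "UHH"]
SUFFIXES = ["test", "test_5s"]


def birdset_split_names(subsets=SUBSETS, suffixes=SUFFIXES):
    """Build every '{SUBSET}-{suffix}' split name, plus the combined 'all' split."""
    names = [f"{subset}-{suffix}" for subset in subsets for suffix in suffixes]
    names.append("all")
    return names


splits = birdset_split_names()
print(len(splits))   # 8 subsets x 2 suffixes + "all"
print(splits[:4])
```

The resulting count agrees with the "17 total" shown in the Available Splits row of the information table.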
References
Rauch, Lukas, et al. "BirdSet: A multi-task benchmark for classification in avian bioacoustics." https://arxiv.org/abs/2403.10380
https://github.com/DBD-research-group/BirdSet
Examples:
>>> from alp_data.datasets import BirdSet
>>> dataset = BirdSet(split="HSN-test_5s", sample_rate=16000)
>>> print(dataset.available_sample_rates)
[16000, 32000]
Load with pre-resampled 16 kHz audio (no on-the-fly resampling):
Load original 32 kHz OGG (returned at native sample rate):
Source code in alp_data/datasets/birdset.py
Birdeep
📊 Dataset Information
| Name | birdeep |
| Version | 0.1.0 |
| Owner | benjamin |
| License | MIT |
| Sources | HuggingFace |
| Available Splits | train, val, test, all |
Dataset of bird vocalizations with bounding boxes, originally released in: A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana by Alba Márquez-Rodríguez et al. (2025)
Birdeep Dataset
Description
Dataset of bird vocalizations with bounding boxes, originally released in: "A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana" by Alba Márquez-Rodríguez et al. (2025)
Description from the GitHub repository:
"Data was collected using automatic audio recording devices (AudioMoths) in three different habitats in Doñana National Park. Approximately 500 minutes of audio data were recorded. There are 9 recorders in 3 different habitats (marshland, scrubland, and ecotone), which are constantly running, recording 1 minute and leaving 9 minutes between recordings. That is, 1 minute is recorded for every 10 minutes, with a sampling rate of 32 kHz. The recordings were made prioritising those times when the birds are most active in order to try to have as many audio recordings of songs as possible, specifically a few hours before dawn until midday.
Expert annotators labeled 461 minutes of audio data, identifying bird vocalizations and other relevant sounds. Annotations are provided in a standard format with start time, end time, and frequency range for each bird vocalization."
Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
Note that some birds were not identifiable to species, and are annotated as "Unknown".
The dataset splits are the same as in the original publication.
Pre-resampled Audio
Pre-resampled audio is available at 16 kHz. When sample_rate=16000 is
passed, the pre-resampled files are loaded directly (no on-the-fly
resampling). For any other target rate, audio is resampled on-the-fly from
the native 32 kHz files using librosa's kaiser_best method.
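The loading rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the Birdeep implementation: the function name `plan_audio_loading` and the returned dictionary layout are hypothetical, and only the rates and the resampling method come from the text.

```python
NATIVE_SAMPLE_RATE = 32_000      # native rate of the Birdeep recordings
PRE_RESAMPLED_RATES = {16_000}   # rates for which pre-resampled files exist


def plan_audio_loading(target_sr):
    """Decide which files to read and whether on-the-fly resampling is needed."""
    if target_sr in PRE_RESAMPLED_RATES:
        # Pre-resampled files are loaded directly.
        return {"load_rate": target_sr, "resample": False}
    if target_sr == NATIVE_SAMPLE_RATE:
        # Native files already match the requested rate.
        return {"load_rate": NATIVE_SAMPLE_RATE, "resample": False}
    # Any other rate: read the native 32 kHz file and resample on the fly.
    # With librosa this would be a call such as
    #   librosa.resample(y, orig_sr=32000, target_sr=target_sr, res_type="kaiser_best")
    return {"load_rate": NATIVE_SAMPLE_RATE, "resample": True, "res_type": "kaiser_best"}


print(plan_audio_loading(16_000))
print(plan_audio_loading(22_050))
```

Keeping the decision separate from the actual resampling call makes the rule easy to check without any audio files at hand.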
References
https://huggingface.co/datasets/GrunCrow/BIRDeep_AudioAnnotations
https://www.sciencedirect.com/science/article/pii/S1574954125002638?via%3Dihub
Source code in alp_data/datasets/birdeep.py
ChiffchaffId
📊 Dataset Information
| Name | chiffchaff_id |
| Version | 0.1.0 |
| Owner | david |
| License | CC-BY-4.0 |
| Sources | https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940 |
| Available Splits | train_within_year, test_within_year, train_across_year, test_across_year |
Individual identity of common chiffchaffs
Chiffchaff ID dataset
Description
Vocalisations released by Stowell et al. for individual Chiffchaff males (Phylloscopus collybita). Provides both within-year and across-year evaluation schemes. https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
This dataset includes train and test splits within year (train_within_year, test_within_year) and across year (train_across_year, test_across_year). The within-year test split uses recordings from the same year as the training data but from different days. The across-year test split uses recordings from different years, which gives harder test conditions because the acoustic environment or vocalisation characteristics may differ.
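To make the difference between the two evaluation schemes concrete, here is a toy sketch on made-up recordings. It is not how the published splits were built: the `Recording` fields, the helper names, and all values are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    individual: str
    year: int
    day: int  # day of year; hypothetical metadata field


def within_year_split(recordings, year, test_days):
    """Train and test come from the same year but from different days."""
    same_year = [r for r in recordings if r.year == year]
    train = [r for r in same_year if r.day not in test_days]
    test = [r for r in same_year if r.day in test_days]
    return train, test


def across_year_split(recordings, train_year, test_year):
    """Train on one year, test on a different year."""
    train = [r for r in recordings if r.year == train_year]
    test = [r for r in recordings if r.year == test_year]
    return train, test


# Toy data, invented for illustration only.
recordings = [
    Recording("bird_a", 2016, 120),
    Recording("bird_a", 2016, 121),
    Recording("bird_b", 2016, 120),
    Recording("bird_a", 2017, 118),
    Recording("bird_b", 2017, 119),
]

train_w, test_w = within_year_split(recordings, year=2016, test_days={121})
train_a, test_a = across_year_split(recordings, train_year=2016, test_year=2017)
print(len(train_w), len(test_w), len(train_a), len(test_a))
```

In the within-year case the test recordings share a year with the training data, while in the across-year case no training recording comes from the test year.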
References
https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
Zenodo: https://zenodo.org/records/1413495
Examples:
>>> from alp_data.datasets import ChiffchaffId
>>> dataset = ChiffchaffId(
... split="test_within_year",
... sample_rate=16000,
... )
Source code in alp_data/datasets/chiffchaff_id.py
CorvidWascher
📊 Dataset Information
| Name | corvid_wascher |
| Version | 0.1.0 |
| Owner | benjamin |
| License | private |
| Sources | XenoCanto, Claudia Wascher |
| Available Splits | all |
Corvid Dataset from Claudia Wascher
Description
This dataset consists of recordings of corvids, taken from Xeno-canto. Claudia Wascher provided annotations of vocalization boundaries. Annotations should not be considered exhaustive within a file, i.e. there may exist non-boxed vocalizations.
This data was originally provided, with an MOU, for work on comparison between vocal repertoires of different corvid species.
Each entry consists of:
- an audio recording
- a selection table with start- and stop-times of vocalizations
- file-level metadata columns
References
https://www.biorxiv.org/content/biorxiv/early/2024/07/17/2024.07.15.603339.full.pdf
Source code in alp_data/datasets/corvid_wascher.py
DCLDE2026
📊 Dataset Information
| Name | dclde2026 |
| Version | 0.1.0 |
| Owner | david |
| License | CC-BY-4.0 |
| Sources | Palmer et al. (2025) doi:10.1038/s41597-025-05281-5 |
| Available Splits | all |
DCLDE 2026 killer whale dataset with species, ecotype, call type, pod, clan, and acoustic behavior annotations across 9 providers
DCLDE 2026 Killer Whale Dataset.
Description
Multi-provider annotated acoustic recordings of killer whales, humpback whales, and bowhead whales from Alaska, British Columbia, and Washington (2011–2024). Each entry is an audio file plus an enriched selection table containing detection/call-level annotations with species, ecotype, call type, pod, clan, and acoustic behavior labels — all human-annotated.
Columns
audio_path : str
Relative path to source audio.
selection_table : str
TSV-serialised selection table with columns:
Begin Time (s), End Time (s), Low Freq (Hz),
High Freq (Hz), species, canonical_name,
sound_detail, ecotype, call_type,
acoustic_behavior, pod, clan,
annotation_level, confidence, coarse_call_type.
provider : str
Data provider name (see PROVIDERS).
16khz_path, 32khz_path : str | None
Paths to pre-resampled audio (when available).
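Because the selection table is stored as a TSV-serialised string, it has to be parsed before use. The sketch below does this with the standard library only. The example string is invented, it carries only a subset of the columns listed above, and the helper name `parse_selection_table` is hypothetical.

```python
import csv
import io

# Invented example string with a subset of the documented columns.
example_selection_table = (
    "Begin Time (s)\tEnd Time (s)\tLow Freq (Hz)\tHigh Freq (Hz)\tspecies\tcall_type\n"
    "1.50\t2.25\t500\t8000\tKiller whale\tS04\n"
    "10.00\t10.40\t2000\t12000\tKiller whale\twhistle\n"
)


def parse_selection_table(tsv_text):
    """Parse a TSV-serialised selection table into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        # Convert the time bounds to floats so durations can be computed.
        row["Begin Time (s)"] = float(row["Begin Time (s)"])
        row["End Time (s)"] = float(row["End Time (s)"])
        rows.append(row)
    return rows


rows = parse_selection_table(example_selection_table)
durations = [round(r["End Time (s)"] - r["Begin Time (s)"], 3) for r in rows]
print(durations)
```

The same string could also be read with a dataframe library if one is already in use; the standard-library version is shown only to keep the sketch dependency-free.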
Splits
- "all": All data (default)
Available tasks
- Species classification: Killer whale / Humpback whale / Bowhead whale / Unknown biological
- KW detection (binary): presence / absence of killer whale
- Ecotype classification: SRKW / TKW / NRKW / SAR / OKW
- Call type classification (fine-grained): S04, N24ii, T01, whistle, BP, EL, etc.
- Call type classification (coarse): call / whistle / click / burst_pulse (see COARSE_CALL_TYPE_LABELS below; aligned with Watkins taxonomy)
- Pod identification: J / K / L pods (Southern Resident)
- Clan identification: A / G clans (Northern Resident)
Provider Notes
Data providers: DFO_CRP, JASCO_VFPA, DFO_WDLP, SIMRES, SIO, ONC, OrcaSound, JASCO_VFPA_ONC, SMRUConsulting.
UAF_NGOS (University of Alaska Fairbanks / North Gulf Oceanic Society) is excluded from the dataset.
Providers differ in annotation precision and coverage. Combining data
from multiple providers should be done carefully — consider filtering
by provider using Transforms (e.g. filter_isin on the provider
column) when training or evaluating.
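As a rough illustration of restricting data to selected providers, the sketch below filters plain dictionaries on their provider value. It is a stand-in written in plain Python, not the `filter_isin` transform itself, whose configuration is not shown in this section. The rows and file names are invented, and `filter_by_provider` is a hypothetical helper.

```python
def filter_by_provider(rows, allowed_providers):
    """Keep only rows whose 'provider' value is in the allowed set."""
    allowed = set(allowed_providers)
    return [row for row in rows if row["provider"] in allowed]


# Invented rows; provider names are taken from the list above.
rows = [
    {"audio_path": "a.wav", "provider": "SIMRES"},
    {"audio_path": "b.wav", "provider": "SMRUConsulting"},
    {"audio_path": "c.wav", "provider": "JASCO_VFPA"},
]

kept = filter_by_provider(rows, ["SIMRES", "JASCO_VFPA"])
print([row["audio_path"] for row in kept])
```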
Per-provider observations (detection focus):
- SMRU: Not very temporally precise; selections sometimes cover large segments around the call. Multiple faint calls may be grouped.
- SIMRES: Consistent annotations.
- VFPA: Generally good; a few missed calls and slightly less consistency with faint calls.
Coarse call types (aligned with Watkins taxonomy):
Mapping rules:
- call: Discrete pulsed calls (S-series, N-series, T-series, OFF-series, NS) and variable vocalizations (tone, moan, upsweep, chirp, groan, knock, shriek, whup, creak, grunt, scream, rasp, growl).
- whistle: Whistle-labeled signals (whistle, whistle/tone, W).
- click: Echolocation clicks (EL) and rapid click trains (buzz, BZ).
- burst_pulse: Burst-pulse signals (BP).
Unknown / ambiguous labels (Unk, Multiple overlapping, etc.) map to empty
string and are dropped when drop_empty_windows=True in windowing.
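The mapping rules above can be approximated by a small lookup function. This is a sketch, not the dataset's own mapping: the function name `coarse_call_type` is hypothetical, labels are matched case-sensitively, and the regular expression assumes that discrete call labels consist of a series prefix followed by digits (as in S04, N24ii, T01).

```python
import re

VARIABLE_VOCALIZATIONS = {
    "tone", "moan", "upsweep", "chirp", "groan", "knock", "shriek",
    "whup", "creak", "grunt", "scream", "rasp", "growl",
}
WHISTLE_LABELS = {"whistle", "whistle/tone", "W"}
CLICK_LABELS = {"EL", "buzz", "BZ"}
BURST_PULSE_LABELS = {"BP"}

# Assumption: S-, N-, T- and OFF-series labels are a prefix followed by digits.
DISCRETE_CALL_PATTERN = re.compile(r"^(?:S|N|T|OFF)\d+")


def coarse_call_type(label):
    """Map a fine-grained call label to a coarse type, or '' if unknown."""
    label = label.strip()
    if label in WHISTLE_LABELS:
        return "whistle"
    if label in CLICK_LABELS:
        return "click"
    if label in BURST_PULSE_LABELS:
        return "burst_pulse"
    if (
        label == "NS"
        or label in VARIABLE_VOCALIZATIONS
        or DISCRETE_CALL_PATTERN.match(label)
    ):
        return "call"
    # Unknown or ambiguous labels map to the empty string.
    return ""


for fine_label in ["S04", "N24ii", "whistle", "EL", "BP", "Unk"]:
    print(fine_label, "->", repr(coarse_call_type(fine_label)))
```

Explicit sets are checked before the pattern so that a label such as whistle/tone is resolved by its exact entry rather than by any prefix rule.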
Examples:
>>> from alp_data.datasets import DCLDE2026
>>> dataset = DCLDE2026(split="all")
>>> print(dataset.info.name)
dclde2026
References
Palmer et al. (2025) doi:10.1038/s41597-025-05281-5 License: CC-BY-4.0
Source code in alp_data/datasets/dclde2026.py
DinardoDolphinWhistles
📊 Dataset Information
| Name | dinardo_dolphin_whistles |
| Version | 0.1.0 |
| Owner | gagan |
| License | CC-BY-4.0 |
| Sources | Nature Scientific Data |
| Available Splits | all |
Dolphin whistles dataset, Di Nardo et al 2023
Description
Authors: Francesco Di Nardo, Rocco De Marco, Alessandro Lucchetti & David Scaradozzi
Globally, interactions between fishing activities and dolphins are cause for concern due to their negative effects on both mammals and fishermen. The recording of acoustic emissions could aid in detecting the presence of dolphins in close proximity to fishing gear, elucidating their behavior, and guiding potential management measures designed to limit this harmful phenomenon. This data descriptor presents a dataset of acoustic recordings (WAV files) collected during interactions between common bottlenose dolphins (Tursiops truncatus) and fishing activities in the Adriatic Sea. This dataset is distinguished by the high complexity of its repertoire, which includes various different typologies of dolphin emission. Specifically, a group of free-ranging dolphins was found to emit frequency-modulated whistles, echolocation clicks, and burst pulse signals, including feeding buzzes. An analysis of signal quality based on the signal-to-noise ratio was conducted to validate the dataset. The signal digital files and corresponding features make this dataset suitable for studying dolphin behavior in order to gain a deeper understanding of their communication and interaction with fishing gear (trawl).
References
A WAV file dataset of bottlenose dolphin whistles, clicks, and pulse sounds during trawling interactions https://doi.org/10.1038/s41597-023-02547-8
Examples:
>>> from alp_data.datasets import DinardoDolphinWhistles
>>> dataset = DinardoDolphinWhistles(
... split="all",
... sample_rate=16000,
... streaming=True)
Source code in alp_data/datasets/dinardo_dolphin_whistles.py
ESPRaincoast
📊 Dataset Information
| Name | esp_raincoast |
| Version | 0.1.0 |
| Owner | emmanuel; gagan; dylansmyth; maddie |
| License | private |
| Sources | esp-raincoast |
| Available Splits | full |
Orca vocal repertoire dataset
ESP Raincoast.org dataset. Recorded by Dylan Smyth, Valeria Vergara lab.
Source code in alp_data/datasets/esp_raincoast.py
Geladas
📊 Dataset Information
| Name | geladas |
| Version | 0.1.0 |
| Owner | gagan |
| License | CC-BY-4.0 |
| Sources | PNAS |
| Available Splits | all |
Gelada vocal sequences dataset, Gustison et al 2016
Gelada vocal sequences follow Menzerath's linguistic law, Gustison et al 2016
Description
Identifying universal principles underpinning diverse natural systems is a key goal of the life sciences. A powerful approach in addressing this goal has been to test whether patterns consistent with linguistic laws are found in nonhuman animals. Menzerath's law is a linguistic law that states that, the larger the construct, the smaller the size of its constituents. Here, to our knowledge, we present the first evidence that Menzerath's law holds in the vocal communication of a nonhuman species. We show that, in vocal sequences of wild male geladas (Theropithecus gelada), construct size (sequence size in number of calls) is negatively correlated with constituent size (duration of calls). Call duration does not vary significantly with position in the sequence, but call sequence composition does change with sequence size and most call types are abbreviated in larger sequences. We also find that intercall intervals follow the same relationship with sequence size as do calls. Finally, we provide formal mathematical support for the idea that Menzerath's law reflects compression—the principle of minimizing the expected length of a code. Our findings suggest that a common principle underpins human and gelada vocal communication, highlighting the value of exploring the applicability of linguistic laws in vocal systems outside the realm of language.
References
Gelada vocal sequences follow Menzerath's linguistic law (PNAS, 2016) https://doi.org/10.1073/pnas.1522072113 Also: Morgan L. Gustison, Thore J. Bergman, Divergent acoustic properties of gelada and baboon vocalizations and their implications for the evolution of human speech. Journal of Language Evolution. https://doi.org/10.1093/jole/lzx015
Examples:
>>> from alp_data.datasets import Geladas
>>> dataset = Geladas(
... split="all",
... sample_rate=16000,
... streaming=True)
>>> print(dataset.info.name)
geladas
Source code in alp_data/datasets/geladas.py
GiantOtters
📊 Dataset Information
| Name | giant_otters |
| Version | 0.1.0 |
| Owner | david |
| License | CC-BY-4.0, CC0 |
| Sources | PLOS ONE |
| Available Splits | test |
Giant Otters vocal repertoire dataset
Giant Otters dataset
Description
Vocal repertoire of giant otters. 22 vocalization types from adults, 17 from neonates, annotated based on behavioral function and sound.
References
The Vocal Repertoire of Adult and Neonate Giant Otters (Pteronura brasiliensis) https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112562#s5
Examples:
>>> from alp_data.datasets import GiantOtters
>>> dataset = GiantOtters(
... split="test",
... output_take_and_give={"label": "label"},
... sample_rate=16000,
... streaming=True
... )
Source code in alp_data/datasets/giant_otters.py
GibbonSolos
📊 Dataset Information
| Name | gibbon_solos |
| Version | 0.1.0 |
| Owner | gagan |
| License | CC0 |
| Sources | Royalsocietypublishing.org, Dryad |
| Available Splits | all |
Gibbon solos Clink 2020
Description
Title: Brevity is not a universal in animal communication: evidence for compression depends on the unit of analysis in small ape vocalizations
Evidence for compression, or minimization of code length, has been found across biological systems from genomes to human language and music. Two linguistic laws—Menzerath's Law (which states that longer sequences consist of shorter constituents) and Zipf's Law of abbreviation (a negative relationship between signal length and frequency of use)—are predictions of compression. It has been proposed that compression is a universal in animal communication, but there have been mixed results, particularly in reference to Zipf's Law of abbreviation. Like songbirds, male gibbons (Hylobates muelleri) engage in long solo bouts with unique combinations of notes which combine into phrases. We found strong support for Menzerath's Law as the longer a phrase, the shorter the notes. To identify phrase types, we used state-of-the-art affinity propagation clustering, and were able to predict phrase types using support vector machines with a mean accuracy of 74%. Based on unsupervised phrase type classification, we did not find support for Zipf's Law of abbreviation. Our results indicate that adherence to linguistic laws in male gibbon solos depends on the unit of analysis. We conclude that principles of compression are applicable outside of human language, but may act differently across levels of organization in biological systems.
References
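Brevity is not a universal in animal communication: evidence for compression depends on the unit of analysis in small ape vocalizations (Royal Society Open Science, 2020)
https://doi.org/10.1098/rsos.200151 (paper)
https://doi.org/10.5061/dryad.wstqjq2h8 (dataset)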
Examples:
>>> from alp_data.datasets import GibbonSolos
>>> dataset = GibbonSolos(
... split="all",
... sample_rate=16000,
... streaming=True)
>>> print(dataset.info.name)
gibbon_solos
Source code in alp_data/datasets/gibbon_solos.py
HawaiianBirds
📊 Dataset Information
| Name | hawaiian_birds |
| Version | 0.1.0 |
| Owner | benjamin |
| License | CC-BY-4.0 |
| Sources | Zenodo |
| Available Splits | all |
HawaiianBirds Dataset
Description
Annotated soundscapes from Hawaii, provided by Cornell Lab of Ornithology
Description from the Zenodo record:
"This collection contains 635 soundscape recordings with a total duration of almost 51 hours, which have been annotated by expert ornithologists who provided 59,583 bounding box labels for 27 different bird species from the Hawaiian Islands, including 6 threatened or endangered native birds. The data were recorded between 2016 and 2022 at four sites across Hawaii Island. This collection has partially been featured as test data in the 2022 BirdCLEF competition and can primarily be used for training and evaluation of machine learning algorithms.
Data collection
Soundscapes for this collection were recorded for various research projects by the Listening Observatory for Hawaiian Ecosystems (LOHE) at the University of Hawaii at Hilo. The recordings were collected using Wildlife Acoustics Inc. Song Meters (models 2, 4, or Mini), as 16-bit wav files at a sampling rate of 44.1 kHz, using the default gain settings of each model. Further specifics for each recording, such as recording location and habitat type, can be found in the metadata provided. Soundscapes in this collection vary in length, ranging from just under a minute to 9 minutes in duration. All audio was unified, converted to FLAC, and resampled to 32 kHz for this collection. Parts of this dataset have previously been used in the 2022 BirdCLEF competition.
Sampling and annotation protocol
This collection is a subset of the files recorded over the course of the LOHE lab’s respective studies. The data were subsampled for annotation by aurally scanning the recordings and visually scanning spectrograms generated using Raven Pro software for target species of interest to the individual research project for which each recording was collected. Recordings that did not contain vocalizations of the species of interest were excluded from full annotation and thus this collection.
Using Raven Pro, annotators were asked to create a selection box around every bird call they could recognize, ignoring those that were too faint or unidentifiable at a spectrogram window size of 700 points. Provided labels contain full bird calls that are boxed in time and frequency. Annotators were allowed to combine multiple consecutive calls of the same species into one bounding box label if pauses between calls were shorter than 0.5 seconds. We converted labels to eBird species codes, following the 2021 eBird taxonomy (Clements list)."
Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
References
https://zenodo.org/records/7078499
Source code in alp_data/datasets/hawaiian_birds.py
INaturalist
📊 Dataset Information
| Name | inaturalist |
| Version | 0.1.0 |
| Owner | gagan; david |
| License | CC BY-NC 4.0, CC BY 4.0, CC0 1.0 |
| Sources | iNaturalist |
| Available Splits | train, train_unseen, val, val_unseen, all, all_unseen |
iNaturalist audio dataset with taxonomic metadata. Available at original (variable) sample rates and 32kHz (pre-resampled). Pre-resampled audio uses librosa's kaiser_best resampling method.
iNaturalist audio dataset.
Description
iNaturalist is a citizen science platform and biodiversity database containing observations of organisms. This dataset includes audio recordings from iNaturalist with associated metadata about species, locations, and other observation details. Recordings are linked to taxonomic information following ESP's taxonomy app (GBIF backbone), including species scientific and common names, family, genus, order. There is additional metadata including location, date, and recordist information. The current version 0.1.0 includes iNaturalist data up to July 2025.
Available Splits
- train: Training set (random split)
- val: Validation set (random split)
- all: Complete dataset (train + val)
- train_unseen: Training set excluding unseen taxa evaluated in BEANS-Zero benchmark
- val_unseen: Validation set excluding unseen taxa evaluated in BEANS-Zero benchmark
- all_unseen: Complete dataset excluding BEANS-Zero unseen taxa
The _unseen splits are designed for training models that will be evaluated
on BEANS-Zero's unseen taxa benchmark, ensuring no test taxa leak into the training data.
Note that all splits exclude examples overlapping with the following benchmark datasets:
- BEANS-Zero captioning test set (see the beans_zero dataset)
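The idea behind the _unseen splits can be sketched as a plain filter that drops every example whose taxon belongs to the held-out set. This is only an illustration under stated assumptions, not the library's implementation: the record layout, the `species` key, and the taxa names are hypothetical.

```python
def exclude_unseen_taxa(examples, unseen_taxa, key="species"):
    """Return only the examples whose taxon is not in the held-out set."""
    unseen = set(unseen_taxa)
    return [ex for ex in examples if ex[key] not in unseen]


# Hypothetical records and held-out taxa, for illustration only.
train = [
    {"id": 1, "species": "Turdus merula"},
    {"id": 2, "species": "Strix aluco"},
    {"id": 3, "species": "Turdus merula"},
]
unseen_taxa = ["Strix aluco"]

train_unseen = exclude_unseen_taxa(train, unseen_taxa)
print([ex["id"] for ex in train_unseen])  # [1, 3]
```

Filtering on the taxon rather than on individual recordings is what prevents held-out taxa from leaking into training.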
Remarks
⚠️ Some original audio files in m4a format were converted to WAV. This does not resolve the issues with m4a as a bioacoustic recording format, and the conversion to WAV via soundfile.write (see scripts/data_preprocessing_scripts/inat_m4a_to_wav.py) may introduce decoder-specific metadata.
⚠️ MP3 audio files that were unreadable by soundfile were also converted to WAV using librosa and ffmpeg (see scripts/data_preprocessing_scripts/inat_mp3_to_wav.py). This may introduce decoder-specific metadata and potential quality issues.
References
iNaturalist: https://www.inaturalist.org/
Examples:
>>> from alp_data.datasets import INaturalist
>>> dataset = INaturalist(
... split="train",
... output_take_and_give={"canonical_name": "species"}
... )
>>> print(dataset.info.name)
inaturalist
>>> print(dataset.available_sample_rates)
[32000, 16000]
Load with pre-resampled 32kHz audio (no on-the-fly resampling needed)
Load with pre-resampled 16kHz audio (no on-the-fly resampling needed)
Source code in alp_data/datasets/inaturalist.py
InfantMarmosetsVox
📊 Dataset Information
| Name | InfantMarmosetsVox |
| Version | 0.1.0 |
| Owner | eklavya |
| License | CC-BY-4.0 |
| Sources | Zenodo |
| Available Splits | all |
InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. Available at original 44.1 kHz and pre-resampled 16kHz. Pre-resampled audio uses librosa's kaiser_best resampling method. Contains approx. 73k vocalization segments of infant marmoset vocalizations. Each vocalization segment sample has a calltypeID and callerID label. There are 11 calltype classes (0-10) and 10 caller identity classes (0-9). A calltypeID index can be associated with its calltype through `CALLTYPE_NAMES`. An unfiltered version (labels_raw.csv) with silence/noise is also available.
InfantMarmosetsVox dataset
Description
InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of different individual marmosets and their call-types. The dataset contains a total of 350 files of precisely labelled 10-minute audio recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labeled by an experienced researcher.
Each entry in the dataset corresponds to a single vocalization segment, with the audio loaded from the corresponding time range in the source file.
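Loading a segment from a time range amounts to converting the start and end times (in seconds) to sample indices and slicing the waveform. The sketch below illustrates that arithmetic with a plain Python list standing in for the audio; the helper names are hypothetical and this is not the class's actual loading code.

```python
def segment_bounds(start_s, end_s, sample_rate):
    """Convert a time range in seconds to (first, last) sample indices."""
    if end_s < start_s:
        raise ValueError("end must not precede start")
    return int(round(start_s * sample_rate)), int(round(end_s * sample_rate))


def extract_segment(waveform, start_s, end_s, sample_rate):
    """Slice the samples that fall inside the given time range."""
    first, last = segment_bounds(start_s, end_s, sample_rate)
    return waveform[first:last]


sample_rate = 44100
recording = [0.0] * (sample_rate * 2)  # two seconds of silence as a stand-in
segment = extract_segment(recording, 0.5, 1.25, sample_rate)
print(len(segment))  # 33075 samples = 0.75 s at 44.1 kHz
```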
Labels
- calltypeID: Call type (0-10, 11 classes)
- callerID: Caller identity (0-9, 10 individuals from 5 twin pairs)
Additional Fields
- audio: Vocalization segment waveform (numpy array).
- path: Relative path to audio file.
- start: Start time in seconds.
- end: End time in seconds.
- duration: Vocalization segment duration in seconds.
- twinID: Marmoset twin pair (1-5).
- vocID: Unique vocalization ID (row index).
References
Sarkar, E., Magimai.-Doss, M. (2023) "Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?" Proc. Interspeech 2023, 1189-1193. doi: 10.21437/Interspeech.2023-1968 https://www.isca-speech.org/archive/interspeech_2023/sarkar23_interspeech.html
Examples:
>>> from alp_data.datasets import InfantMarmosetsVox
>>> dataset = InfantMarmosetsVox(
... split="all",
... output_take_and_give={"calltypeID": "label", "audio": "audio"},
... sample_rate=16000,
... )
>>> sample = dataset[0]
>>> print(dataset.info.name)
InfantMarmosetsVox
>>> print(dataset.available_sample_rates)
[44100, 16000]
Source code in alp_data/datasets/infant_marmosets_vox.py
InsectSet459
📊 Dataset Information
| Name | insectset_459 |
| Version | 0.1.0 |
| Owner | gagan |
| License | CC-BY-4.0, CC0 |
| Sources | Xeno-canto, iNaturalist, Bioacoustica |
| Available Splits | train, validation |
InsectSet459 dataset
Description
Excerpt from the original publication Abstract: "...Automatic recognition of insect sound could help us understand changing biodiversity trends around the world—but insect sounds are challenging to recognize even for deep learning. We present a new dataset comprised of 26399 audio files, from 459 species of Orthoptera and Cicadidae..."
References
Faiss, Ghani, Stowell 2025. https://arxiv.org/abs/2503.15074 Dataset DOI: https://zenodo.org/records/8252141
Examples:
>>> from alp_data.datasets import InsectSet459
>>> dataset = InsectSet459(
... split="validation",
... output_take_and_give={"species_scientific": "species"},
... sample_rate=16000,
... )
Source code in alp_data/datasets/insectset_459.py
LittleOwlId
📊 Dataset Information
| Name | littleowl_id |
| Version | 0.1.0 |
| Owner | david |
| License | CC-BY-4.0 |
| Sources | https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940 |
| Available Splits | train_across_year, test_across_year |
Individual identification of little owls (Athene noctua)
Little Owl individual ID dataset.
Description
Vocalisations released by Stowell et al. for individual Little Owls (Athene noctua). Provides both within-year and across-year evaluation schemes. https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940
For this dataset, the train and test splits (train_across_year, test_across_year) are drawn from different years, giving harder test conditions, with potential differences in acoustic environment or vocalisation characteristics.
References
https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940 Zenodo: https://zenodo.org/records/1413495
Examples:
>>> from alp_data.datasets import LittleOwlId
>>> dataset = LittleOwlId(
... split="test_across_year",
... sample_rate=16000,
... )
Source code in alp_data/datasets/littleowl_id.py
MacaquesCooCalls
📊 Dataset Information
| Name | macaques_coo_calls |
| Version | 0.1.0 |
| Owner | marius |
| License | CC0 1.0 Universal |
| Sources | archive.org |
| Available Splits | test, train, val |
Coo calls from male and female macaques (Macaca mulatta) including id, sex, weight_kg
Macaques Coo Calls dataset
Description
Coo calls from male and female macaques (Macaca mulatta), including macaque id, sex, and weight.
References
https://archive.org/details/macaque_coo_calls
Examples:
>>> from alp_data.datasets import MacaquesCooCalls
>>> dataset = MacaquesCooCalls(
... split="test",
... output_take_and_give={"id": "label"},
... sample_rate=16000,
... )
Source code in alp_data/datasets/macaques_coo_calls.py
NocturnalBirdMigration
📊 Dataset Information
| Name | nocturnal_bird_migration |
| Version | 0.1.0 |
| Owner | benjamin |
| License | CC-BY-NC-SA 4.0 |
| Sources | Zenodo, xeno-canto |
| Available Splits | train, train_nonxc, train_xc, test |
Dataset of nocturnal vocalizations from migratory birds in Europe. Vocalizations are annotated with start and end times, as well as high and low frequencies.
NocturnalBirdMigration Dataset
Description
Dataset of nocturnal vocalizations from migratory birds in Europe. Vocalizations are annotated with start and end times, as well as high and low frequencies.
The dataset consists of a train split and a test split. The test split consists entirely of xeno-canto recordings which the dataset authors annotated. The train split consists of recordings submitted by French citizen-scientists, as well as xeno-canto recordings annotated by the dataset authors.
Note that the license is mixed (due to origins on xeno-canto); most restrictive is CC-BY-NC-SA 4.0. See zenodo page for full license details.
Description from the paper:
The persisting threats on migratory bird populations highlight the urgent need for effective monitoring techniques that could assist in their conservation. Among these, passive acoustic monitoring is an essential tool, particularly for nocturnal migratory species that are difficult to track otherwise. This work presents the Nocturnal Bird Migration (NBM) dataset, a collection of 13,359 annotated vocalizations from 117 species of the Western Palearctic. The dataset includes precise time and frequency annotations, gathered by dozens of bird enthusiasts across France, enabling novel downstream acoustic analysis. In particular, we prove the utility of this database by training an original two-stage deep object detection model tailored for the processing of audio data. While allowing the precise localization of bird calls in spectrograms, this model shows competitive accuracy on the 45 main species of the dataset with state-of-the-art systems trained on much larger audio collections. These results highlight the interest of fostering similar open-science initiatives to acquire costly but valuable fine-grained annotations of audio files. All data and code are made openly available.
Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
- xeno-canto id, if applicable (else, empty string)
Pre-resampled Audio
Pre-resampled audio is available at 16 kHz and 32 kHz. When
sample_rate matches one of these rates, the pre-resampled files are
loaded directly (no on-the-fly resampling). For any other target rate,
audio is resampled on-the-fly using librosa's kaiser_best method.
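The rule described above can be sketched as a small decision function. This is a hedged illustration rather than the dataset's actual code: the function name, the treatment of `None` as "keep the original rate", and the choice to resample from the original files for other rates are assumptions.

```python
PRE_RESAMPLED_RATES = (16000, 32000)  # rates stated in the text above


def choose_audio_source(target_rate, original_rate):
    """Return (rate_of_file_to_load, needs_on_the_fly_resampling).

    Assumption: target_rate=None means "use the original rate", and any
    rate without a pre-resampled copy is resampled from the original file.
    """
    if target_rate is None or target_rate == original_rate:
        return original_rate, False
    if target_rate in PRE_RESAMPLED_RATES:
        return target_rate, False
    return original_rate, True


print(choose_audio_source(16000, 44100))  # (16000, False)
print(choose_audio_source(22050, 44100))  # (44100, True)
```

In the second case the caller would still have to resample the 44.1 kHz audio down to 22.05 kHz itself.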
References
https://zenodo.org/records/17573913 https://arxiv.org/pdf/2412.03633
Source code in alp_data/datasets/nocturnal_bird_migration.py
PipitId
📊 Dataset Information
| Name | pipit_id |
| Version | 0.1.0 |
| Owner | david |
| License | CC-BY-4.0 |
| Sources | https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940 |
| Available Splits | train_within_year, test_within_year, train_across_year, test_across_year |
Individual identification of tree pipits (Anthus trivialis)
Tree Pipit individual ID dataset.
Description
Vocalisations released by Stowell et al. for individual male Tree Pipits (Anthus trivialis). Provides both within-year and across-year evaluation schemes.
This dataset includes train and test splits within year (train_within_year, test_within_year) and across year (train_across_year, test_across_year). The within-year test split uses recordings from the same year as the training data, though from different days. The across-year test split uses recordings from different years, giving harder test conditions, with potential differences in acoustic environment or vocalisation characteristics.
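The difference between the two evaluation schemes can be illustrated with a toy partition. The sketch below is not how the published splits were built; the `year` and `day` fields and the helper names are hypothetical.

```python
def split_across_year(recordings, train_years):
    """Across-year scheme: train on some years, test on the remaining ones."""
    train = [r for r in recordings if r["year"] in train_years]
    test = [r for r in recordings if r["year"] not in train_years]
    return train, test


def split_within_year(recordings, year, test_days):
    """Within-year scheme: same year, but disjoint recording days."""
    same_year = [r for r in recordings if r["year"] == year]
    train = [r for r in same_year if r["day"] not in test_days]
    test = [r for r in same_year if r["day"] in test_days]
    return train, test


# Hypothetical recordings, for illustration only.
recordings = [
    {"id": "a", "year": 2015, "day": 1},
    {"id": "b", "year": 2015, "day": 2},
    {"id": "c", "year": 2016, "day": 1},
]

across_train, across_test = split_across_year(recordings, {2015})
within_train, within_test = split_within_year(recordings, 2015, {2})
print([r["id"] for r in across_test])  # ['c']
print([r["id"] for r in within_test])  # ['b']
```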
References
https://royalsocietypublishing.org/doi/10.1098/rsif.2018.0940 Zenodo: https://zenodo.org/records/1413495
Examples:
>>> from alp_data.datasets import PipitId
>>> dataset = PipitId(
... split="test_within_year",
... sample_rate=16000,
... streaming=True,
... )
Source code in alp_data/datasets/pipit_id.py
Powdermill
📊 Dataset Information
| Name | powdermill |
| Version | 0.1.0 |
| Owner | benjamin |
| License | Public Domain |
| Sources | Dryad |
| Available Splits | all |
Powdermill Dataset
Description
Dataset of bird vocalizations with bounding boxes, originally released in: "An annotated set of audio recordings of Eastern North American birds containing frequency, time, and species information" by Lauren Chronister et al. (2021).
Description from the original:
"Acoustic recordings of soundscapes are an important category of audio data that can be useful for answering a variety of questions, and an entire discipline within ecology, dubbed “soundscape ecology,” has risen to study them. Bird sound is often the focus of studies of soundscapes due to the ubiquitousness of birds in most terrestrial environments and their high vocal activity. Autonomous acoustic recorders have increased the quantity and availability of recordings of natural soundscapes while mitigating the impact of human observers on community behavior. However, such recordings are of little use without analysis of the sounds they contain. Manual analysis currently stands as the best means of processing this form of data for use in certain applications within soundscape ecology, but it is a laborious task, sometimes requiring many hours of human review to process comparatively few hours of recording. For this reason, few annotated data sets of soundscape recordings are publicly available. Further still, there are no publicly available strongly labeled soundscape recordings of bird sounds that contain information on timing, frequency, and species. Therefore, we present the first data set of strongly labeled bird sound soundscape recordings under free use license. These data were collected in the Northeastern United States at Powdermill Nature Reserve, Rector, Pennsylvania, USA. Recordings encompass 385 minutes of dawn chorus recordings collected by autonomous acoustic recorders between the months of April through July 2018. Recordings were collected in continuous bouts on four days during the study period and contain 48 species and 16,052 annotations. Applications of this data set may be numerous and include the training, validation, and testing of certain advanced machine-learning models that detect or classify bird sounds. There are no copyright or propriety restrictions; please cite this paper when using materials within."
Note that this data was included in the BEANS "detection", i.e. multi-label classification, benchmark, under the name ENABirds.
Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
References
https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.3329
Source code in alp_data/datasets/powdermill.py
Subsegmentation
📊 Dataset Information
| Name | subsegmentation |
| Version | 0.1.0 |
| Owner | benjamin |
| License | private |
| Sources | Logan James |
| Available Splits | all, train, val, test, single_song_all, single_song_train, single_song_val, single_song_test |
Bird Song Subsegmentation Dataset
Description
Bird Song subsegmentation dataset from Logan James' paper "Pervasive patterns in the songs of passerine birds resemble human music universals and are linked with production and cognitive mechanisms"
Currently, this dataset is for internal use but we hope to release it publicly. The recordings come from xeno-canto and the annotations come from the paper.
Each entry consists of:
- an audio recording
- a selection table with start- and stop-times of song syllables
- a boolean indicating if it passed quality control (i.e. if it was sub-segmentable)
- annotations of Species, Genus, Order, and Family.
Splits
- Original (multi-song recordings): "all", "train", "val", "test"
- Single-song (one song per item, times re-zeroed): "single_song_all", "single_song_train", "single_song_val", "single_song_test"
Each selection table has, for each syllable, columns for the Species, Genus, Order, and Family, as well as an Annotation:
- 'a' indicates a syllable that is the beginning of a song (defined as at least 500 ms of silence before it)
- 'z' indicates a syllable that is the end of a song (defined as at least 500 ms of silence after it)
- 's' indicates all other syllables
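The annotation rule above can be sketched from syllable start and stop times alone. This is a minimal illustration under stated assumptions, not the procedure used to produce the dataset: it assumes the start and end of the recording count as silence, and that a syllable meeting both conditions is labelled 'a'.

```python
SILENCE_S = 0.5  # 500 ms threshold from the definition above


def label_syllables(syllables, min_gap=SILENCE_S):
    """Label (start, end) syllables, sorted by start time, as 'a', 'z' or 's'."""
    labels = []
    for i, (start, end) in enumerate(syllables):
        # Assumption: recording boundaries are treated as unbounded silence.
        gap_before = start - syllables[i - 1][1] if i > 0 else float("inf")
        gap_after = (
            syllables[i + 1][0] - end if i + 1 < len(syllables) else float("inf")
        )
        if gap_before >= min_gap:
            labels.append("a")  # assumption: 'a' wins for isolated syllables
        elif gap_after >= min_gap:
            labels.append("z")
        else:
            labels.append("s")
    return labels


# Hypothetical syllable times in seconds: two songs separated by 0.7 s.
syllables = [(0.0, 0.1), (0.2, 0.3), (0.35, 0.5), (1.2, 1.3), (1.4, 1.6)]
print(label_syllables(syllables))  # ['a', 's', 'z', 'a', 'z']
```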
References
https://www.biorxiv.org/content/biorxiv/early/2024/07/17/2024.07.15.603339.full.pdf
Source code in alp_data/datasets/subsegmentation.py
SuperbStarling
📊 Dataset Information
| Name | superb_starling |
| Version | 0.1.0 |
| Owner | Sara |
| License | CC0 1.0 |
| Sources | Kenya field recordings |
| Available Splits | all |
Superb starling flight calls with individual ID and group ID annotations
Superb Starling Dataset
Description
Dataset of superb starling (Lamprotornis superbus) flight calls with precise time bounds, individual ID, and social group ID.
Each entry includes:
- An audio clip containing one flight call
- Annotations for exact start/stop of the call in audio clip
- Metadata (bird ID, group, sex, ring, timestamp)
The metadata file is a tab-separated text file that is formatted as a Raven selection table. This lets you open all sound files in Raven and see the annotations aligned for every selection, each of which corresponds to a single flight call.
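Because a Raven selection table is plain tab-separated text, it can be read with Python's standard csv module. The sketch below uses Raven's usual time and frequency column headers; the inline sample rows and the `Bird ID` column name are hypothetical and may not match the actual metadata file.

```python
import csv
import io

# Hypothetical selection table contents, for illustration only.
RAVEN_TEXT = (
    "Selection\tView\tChannel\tBegin Time (s)\tEnd Time (s)"
    "\tLow Freq (Hz)\tHigh Freq (Hz)\tBird ID\n"
    "1\tSpectrogram 1\t1\t0.120\t0.480\t2000.0\t6500.0\tbird_A\n"
    "2\tSpectrogram 1\t1\t1.050\t1.400\t2100.0\t6400.0\tbird_B\n"
)


def read_selection_table(handle):
    """Parse a tab-separated selection table into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Convert the time bounds from text to floats (seconds).
        row["Begin Time (s)"] = float(row["Begin Time (s)"])
        row["End Time (s)"] = float(row["End Time (s)"])
        rows.append(row)
    return rows


table = read_selection_table(io.StringIO(RAVEN_TEXT))
for row in table:
    print(row["Selection"], row["Bird ID"], row["Begin Time (s)"], row["End Time (s)"])
```

For a file on disk, the same function accepts a handle from `open(path, newline="")`.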
References
Keen, S. C., Meliza, C. D., & Rubenstein, D. R. (2013). Flight calls signal group and individual identity but not kinship in a cooperatively breeding bird. Behavioral Ecology, 24(6), 1279-1285. https://doi.org/10.5061/dryad.p1n88
Source code in alp_data/datasets/superb_starling.py
Voxaboxen
📊 Dataset Information
| Name | voxaboxen |
| Version | 0.1.0 |
| Owner | benjamin; gagan |
| License | CC BY |
| Sources | Anuraset, BV, MT, OZF, Hawaii, Humpback, Katydids, Powdermill |
| Available Splits | Anuraset_train, Anuraset_val, Anuraset_test, BV_train, BV_val, BV_test, MT_train, MT_val, MT_test, OZF_train, ... (39 total) |
Voxaboxen dataset for acoustic sound event detection
Voxaboxen dataset.
Description
Voxaboxen is the dataset used in the Voxaboxen project. It consists of several datasets with annotated call start and end times via selection tables. Excerpt from paper: "...a method for accurately detecting bioacoustic sound events that is robust to overlapping events... We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally-strong labels and showing frequent overlaps. We test Voxaboxen on seven existing data sets and on our new data set."
References
Robust detection of overlapping bioacoustic sound events Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano, Masato Hagiwara, Sarah C Woolley, Olivier Pietquin https://arxiv.org/abs/2503.02389
Examples:
>>> from alp_data.datasets import Voxaboxen
>>> dataset = Voxaboxen(
... split="BV_val",
... output_take_and_give={"selection_table": "st"}
... )
>>> print(dataset.info.name)
voxaboxen
Source code in alp_data/datasets/voxaboxen.py
VoxaboxenEvents
📊 Dataset Information
| Name | voxaboxen_events |
| Version | 0.1.0 |
| Owner | benjamin; gagan |
| License | CC BY |
| Sources | Anuraset, BV, MT, OZF, Hawaii, Humpback, Katydids, Powdermill |
| Available Splits | Anuraset_train, Anuraset_val, Anuraset_test, BV_train, BV_val, BV_test, MT_train, MT_val, MT_test, OZF_train, ... (39 total) |
Voxaboxen events dataset for acoustic sound event detection
Voxaboxen dataset as events
Description
Same as Voxaboxen, but the audio is split according to the information in the selection table.
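Splitting a recording by its selection table can be pictured as emitting one item per annotated row, so overlapping events each receive their own clip. The sketch below is an assumption-laden illustration, not the VoxaboxenEvents implementation: the `start`, `end`, and `label` keys are hypothetical, and a list of integers stands in for the waveform.

```python
def split_into_events(recording, sample_rate, selection_table):
    """Return one {'audio', 'label'} item per selection-table row."""
    events = []
    for row in selection_table:
        first = int(round(row["start"] * sample_rate))
        last = int(round(row["end"] * sample_rate))
        events.append({"audio": recording[first:last], "label": row["label"]})
    return events


sample_rate = 1000  # deliberately small so the arithmetic is easy to follow
recording = list(range(3000))  # three seconds of stand-in samples
selection_table = [
    {"start": 0.5, "end": 1.0, "label": "x"},
    {"start": 0.8, "end": 1.5, "label": "y"},  # overlaps the first event
]

events = split_into_events(recording, sample_rate, selection_table)
print([(e["label"], len(e["audio"])) for e in events])  # [('x', 500), ('y', 700)]
```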
References
Robust detection of overlapping bioacoustic sound events Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano, Masato Hagiwara, Sarah C Woolley, Olivier Pietquin https://arxiv.org/abs/2503.02389
Examples:
>>> from alp_data.datasets import VoxaboxenEvents
>>> dataset = VoxaboxenEvents(
... split="BV_val",
... output_take_and_give={"selection_table": "st"}
... )
>>> print(dataset.info.name)
voxaboxen_events
Source code in alp_data/datasets/voxaboxen.py
WABAD
📊 Dataset Information
| Name | wabad |
| Version | 0.1.0 |
| Owner | benjamin |
| License | CC-BY-4.0 |
| Sources | zenodo.org |
| Available Splits | all, CAT, POZO, BRE, EFFOR, MONTEB, CB, FEU, BIAL, SPMCO, ... (73 total) |
WABAD: This database includes 5,047 minutes of audio files annotated to species-level by local experts with the start and end time, and the upper and lower frequencies of each identified bird vocalisation in the recordings. The database has a wide taxonomic and spatial coverage, including information on 91,931 vocalisations from 1,192 bird species recorded at 72 recording sites in 29 recording locations.
WABAD Dataset
Description
This class makes the WABAD dataset available. Each entry is an audio recording plus a selection table. Each row of the selection table has annotations at different taxonomic granularities (stored in the annotation_columns attribute). Taxonomy has been coerced into GBIF.
This class was initially included in alp-data for use as a zero-shot detection evaluation dataset.
Description from publication: https://www.researchgate.net/publication/387711208_WABAD_A_World_Annotated_Bird_Acoustic_Dataset_for_Passive_Acoustic_Monitoring
Under the current global biodiversity crisis, there is a need for automated and non-invasive monitoring techniques that can gather large amounts of data cost-effectively at various ecological scales, from local to large spatial scales. This data can then be analyzed to inform stakeholders and decision makers. One such technique is passive acoustic monitoring, which is commonly coupled with automatic identification of animal species based on their sound. Automated sound analyses usually require the training of sound detection and identification algorithms. These algorithms are based on annotated acoustic datasets which mark the occurrence of sounds of species inside sound recordings. However, compiling large annotated acoustic datasets is time-consuming and requires experts, and therefore they normally cover reduced spatial, temporal and taxonomic scales. This data paper presents WABAD, the World Annotated Bird Acoustic Dataset for passive acoustic monitoring. WABAD is designed to provide the public, the research community, and conservation managers with a novel and globally representative annotated acoustic dataset. This database includes 5,047 minutes of audio files annotated to species-level by local experts with the start and end time, and the upper and lower frequencies of each identified bird vocalisation in the recordings. The database has a wide taxonomic and spatial coverage, including information on 91,931 vocalisations from 1,192 bird species recorded at 72 recording sites in 29 recording locations (mainly countries) and distributed across 13 biomes. WABAD can be used, for example, for developing and/or validating automatic species detection algorithms, answering ecological questions, such as assessing geographical variations on bird vocalisations, or comparing acoustic diversity indices with species-based diversity indices. The dataset is published under a Creative Commons Attribution Non Commercial 4.0 International copyright.
Pre-resampled Audio
Pre-resampled audio is available at 16 kHz and 32 kHz. When
sample_rate matches one of these rates, the pre-resampled files are
loaded directly (no on-the-fly resampling). For any other target rate,
audio is resampled on-the-fly using librosa's kaiser_best method.
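The selection rule described above can be sketched in plain Python. This is only an illustration: `PRE_RESAMPLED_RATES` and `pick_audio_source` are hypothetical names, not part of alp_data, and the real loader may be organized differently.

```python
# Hypothetical illustration of the rule above; not alp_data's actual code.
PRE_RESAMPLED_RATES = (16_000, 32_000)  # rates stated in this section


def pick_audio_source(sample_rate: int) -> str:
    """Describe which audio files would be read for a requested rate."""
    if sample_rate in PRE_RESAMPLED_RATES:
        # Matching rate: read the stored pre-resampled files directly.
        return f"pre-resampled@{sample_rate}"
    # Any other rate: load the original audio and resample on the fly.
    return f"on-the-fly@{sample_rate}"


for rate in (16_000, 32_000, 22_050):
    print(rate, "->", pick_audio_source(rate))
```

In practice this means that requesting 16 kHz or 32 kHz avoids the resampling cost at load time, while any other rate pays it on every access.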
References
https://zenodo.org/records/15629388
https://www.researchgate.net/publication/387711208_WABAD_A_World_Annotated_Bird_Acoustic_Dataset_for_Passive_Acoustic_Monitoring
Source code in alp_data/datasets/wabad.py
Watkins
📊 Dataset Information
| Name | watkins |
| Version | 0.1.0 |
| Owner | david |
| License | LicenseRef-WHOI-Public |
| Sources | https://cis.whoi.edu/science/B/whalesounds/index.cfm |
| Available Splits | train |
Watkins Marine Mammal Sound Database — 2018 remastered release. ~13,700 audio clips spanning ~50 species of cetaceans and pinnipeds with GBIF-resolved taxonomy. Original audio at variable sample rates; pre-resampled 16kHz and 32kHz versions available.
Watkins Marine Mammal Sound Database (2018 remaster).
Description
The Watkins Marine Mammal Sound Database is the largest publicly available collection of marine mammal vocalisations, originally compiled by William A. Watkins at Woods Hole Oceanographic Institution. This dataset uses the 2018 remastered FLAC release and includes GBIF-resolved taxonomic metadata.
The dataset spans ~50 species across cetaceans (baleen whales, toothed whales, dolphins) and pinnipeds (seals, sea lions, walrus), with ~13,700 audio clips at variable original sample rates.
References
Watkins Marine Mammal Sound Database: https://cis.whoi.edu/science/B/whalesounds/index.cfm DOI: 10.1575/1912/7270
Examples:
Source code in alp_data/datasets/watkins.py
XenoCanto
📊 Dataset Information
| Name | xeno-canto |
| Version | 0.1.0 |
| Owner | david; gagan |
| License | CC BY-NC-SA 4.0, CC BY-NC 4.0, CC BY-SA, CC0 |
| Sources | Xeno-canto |
| Available Splits | train, validation, all, train_unseen, validation_unseen, all_unseen |
Xeno-canto audio dataset with taxonomic metadata. Available at original (variable) sample rates and 32kHz (pre-resampled). Pre-resampled audio uses librosa's kaiser_best resampling method. Xeno-canto dump as of Oct 2025. Train/val split is 90%/10% with random seed 42.
Xeno-canto audio dataset.
Description
Xeno-canto is a website dedicated to sharing wildlife sounds from around the world. This dataset includes audio recordings from Xeno-canto with associated metadata about species, locations, and other observation details.
The dataset contains audio recordings with rich taxonomic information, including species scientific and common names, family, genus, order, and other metadata such as location, date, and recordist information.
Available Splits
- train: Training set (90% of data, random split)
- validation: Validation set (10% of data, random split)
- all: Complete dataset (train + validation)
- train_unseen: Training set excluding unseen taxa evaluated in BEANS-Zero benchmark
- validation_unseen: Validation set excluding unseen taxa evaluated in BEANS-Zero benchmark
- all_unseen: Complete dataset excluding BEANS-Zero unseen taxa
The _unseen splits are designed for training models that will be evaluated
on BEANS-Zero's unseen taxa benchmark, ensuring no test taxa leak into the training data.
Note that all splits exclude examples overlapping with the following benchmark datasets:
- cbi (See the beans dataset)
- BEANS-Zero call-type, lifestage, and captioning test sets (See the beans_zero dataset)
- xeno-canto Jeantet et al. 2023 dataset (See the XenoCantoAnnotatedJeantet23 dataset)
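The split scheme described above (a seeded 90%/10% random split, plus variants that drop held-out taxa) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that produced the published splits: the function names, the toy records, and the placeholder species names are all hypothetical.

```python
import random


def split_train_validation(ids, seed=42, train_fraction=0.9):
    """Shuffle ids with a fixed seed and cut them into train/validation."""
    shuffled = sorted(ids)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def drop_taxa(records, held_out_taxa):
    """Keep only records whose species is not in the held-out set."""
    return [r for r in records if r["species"] not in held_out_taxa]


# Toy records; the species names are placeholders, not the BEANS-Zero taxa.
records = [{"id": i, "species": f"species_{i % 5}"} for i in range(100)]

train_ids, val_ids = split_train_validation([r["id"] for r in records])
train_id_set = set(train_ids)
train = [r for r in records if r["id"] in train_id_set]

# An "_unseen"-style variant: same split, minus the held-out taxa.
train_unseen = drop_taxa(train, held_out_taxa={"species_3"})
print(len(train_ids), len(val_ids), len(train_unseen))
```

Fixing the seed makes the split reproducible, and filtering after splitting keeps the `_unseen` variants as strict subsets of the corresponding regular splits.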
References
Xeno-canto: https://www.xeno-canto.org/
Examples:
>>> from alp_data.datasets import XenoCanto
>>> dataset = XenoCanto(
... split="train",
... output_take_and_give={"canonical_name": "species"}
... )
>>> print(dataset.info.name)
xeno-canto
>>> print(dataset.available_sample_rates)
[32000, 16000]
Load with pre-resampled 32kHz audio (when available)
Load with pre-resampled 16kHz audio (when available)
Source code in alp_data/datasets/xeno_canto.py
XenoCantoAnnotatedJeantet23
📊 Dataset Information
| Name | xeno_canto_annotated_jeantet_23 |
| Version | 0.1.0 |
| Owner | benjamin |
| License | CC-BY-4.0 |
| Sources | XenoCanto |
| Available Splits | all |
Bird song detection dataset consisting of xeno canto recordings annotated with start- and stop-times.
XenoCantoAnnotatedJeantet23 Dataset
Description
Bird song detection dataset consisting of xeno canto recordings annotated with start- and stop-times. The species were chosen specifically to be those for which adding location information would improve performance.
From the article "Improving deep learning acoustic classifiers with contextual information for wildlife monitoring" by Jeantet and Dufourq (2023):
"Firstly, we selected the ten most recorded families in the Passeriformes order, the most represented order in the Xeno-canto database. From each of the ten families, we again sub-sampled the ten most recorded genera. For each genus, we observed the countries of the recordings and the number of available recordings per species and country. From the information gathered, and by visually analyzing the spectrograms, we conducted a self-selection process of genera that comprised species with similar songs recorded in different regions. Our aim was to ensure that there were sufficient recordings available for each species and country, allowing us to form a comprehensive dataset. In the end, 5 genera were selected containing 22 species (Table 1). Due to the significant variation in the number of available recordings across different species, we needed to determine a suitable allocation of segments for each species. To address this, we calculated the average number of records per species and per country. For species/country pairs with a higher number of recordings than this average, we set an upper limit on the number of assigned segments to this average value. The recordings were downloaded from the Xeno-canto database in.wav format and each recording was manually annotated by labelling the start and stop time for every vocalisation occurrence using Sonic Visualiser (Suppl. Fig. 1, Cannam et al. (2010)). In total, we obtained 6,537 occurrences of bird songs of various lengths from 967 file recordings (Table 1)."
Each entry consists of:
- an audio recording
- a selection table (Raven format), with Species labels
- the id of the xeno canto asset
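To make the selection-table part of an entry concrete, here is a minimal sketch that parses a Raven-style table with the standard library. The column headers follow Raven's usual tab-separated export, but the exact columns in this dataset, the toy values, and the `read_selection_table` helper are assumptions made for illustration.

```python
import csv
import io

# Toy Raven-style selection table (tab-separated). The values and species
# names are made up; the real tables may carry additional columns.
RAVEN_TEXT = (
    "Selection\tBegin Time (s)\tEnd Time (s)\tLow Freq (Hz)\tHigh Freq (Hz)\tSpecies\n"
    "1\t0.50\t1.75\t2000\t6000\tspecies_a\n"
    "2\t3.10\t4.00\t1500\t5500\tspecies_b\n"
)


def read_selection_table(text):
    """Return one dict per annotated event with start, end and species."""
    events = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        events.append(
            {
                "start": float(row["Begin Time (s)"]),
                "end": float(row["End Time (s)"]),
                "species": row["Species"],
            }
        )
    return events


events = read_selection_table(RAVEN_TEXT)
for event in events:
    print(event["species"], event["end"] - event["start"])
```

Each row marks one vocalisation, so a detection model can be evaluated by comparing its predicted intervals with these start and end times.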
Pre-resampled Audio
Pre-resampled audio is available at 16 kHz and 32 kHz. When
sample_rate matches one of these rates, the pre-resampled files are
loaded directly (no on-the-fly resampling). For any other target rate,
audio is resampled on-the-fly using librosa's kaiser_best method.
References
https://www.sciencedirect.com/science/article/pii/S1574954123002856
Source code in alp_data/datasets/xeno_canto_annotated_jeantet_23.py
ZebraFinchJulieElie
📊 Dataset Information
| Name | zebra_finch_julie_elie |
| Version | 0.1.0 |
| Owner | marius |
| License | CC-BY-4.0, CC0 |
| Sources | Julie Elie |
| Available Splits | test, train, val, full_dataset |
Vocal repertoires from adult and chick, male and female zebra finches (Taeniopygia guttata)
Zebra Finch Julie Elie dataset
Description
Vocal repertoires from adult and chick, male and female zebra finches (Taeniopygia guttata) including bird id, call type, age.
References
Elie JE and Theunissen FE. The vocal repertoire of the domesticated zebra finch: a data driven approach to decipher the information-bearing acoustic features of communication signals. Animal Cognition. 2016. 19(2) 285-315
DOI 10.1007/s10071-015-0933-6
https://figshare.com/articles/dataset/Vocal_repertoires_from_adult_and_chick_male_and_female_zebra_finches_Taeniopygia_guttata_/11905533/1
Examples:
>>> from alp_data.datasets import ZebraFinchJulieElie
>>> dataset = ZebraFinchJulieElie(
... split="test",
... output_take_and_give={"label": "label"},
... sample_rate=16000,
... )
Source code in alp_data/datasets/zebra_finch_julie_elie.py
Using your own dataset
First of all, you must answer an important question: is this new dataset relatively stable over time and potentially useful to others? If yes, you should talk to the engineering team about adding it as an official ESP Dataset. If not, follow the steps below.
To create a new dataset, you need to subclass the base Dataset class and implement several key components. Here's a step-by-step guide:
1. Basic Structure
from alp_data import Dataset, DatasetConfig, DatasetInfo, register_dataset
from alp_data.io import anypath, AnyPathT
from typing import Any, Dict, Optional
import pandas as pd


@register_dataset
class MyCustomDataset(Dataset):
    """My custom dataset description.

    Parameters
    ----------
    split : str
        The split to load. One of info.split_paths keys.
    output_take_and_give : dict[str, str], optional
        A dictionary mapping the original column names to the new column names.
    data_root : str | AnyPathT, optional
        Custom data root directory.
    """

    # Define dataset metadata
    info = DatasetInfo(
        name="my_custom_dataset",
        owner="your_name",
        split_paths={
            "train": "path/to/train.csv",
            "validation": "path/to/validation.csv",
        },
        version="0.1.0",
        description="Description of your dataset",
        sources=["Source 1", "Source 2"],
        license="Your License",
    )

    def __init__(
        self,
        split: str = "train",
        output_take_and_give: Optional[dict[str, str]] = None,
        data_root: Optional[str | AnyPathT] = None,
    ) -> None:
        """Initialize the dataset."""
        super().__init__(output_take_and_give)
        self.split = split
        # Set data_root before loading so _load can rely on it if needed
        self.data_root = data_root
        self._data = None
        self._load()

    def _load(self) -> None:
        """Load the dataset data."""
        if self.split not in self.info.split_paths:
            raise LookupError(
                f"Invalid split: {self.split}. "
                f"Expected one of {list(self.info.split_paths.keys())}"
            )
        # Implement your data loading logic here
        location = self.info.split_paths[self.split]
        # Example: Load CSV data
        self._data = pd.read_csv(anypath(location))

    def __len__(self) -> int:
        """Return the number of samples in the dataset."""
        if self._data is None:
            raise RuntimeError("No split has been loaded yet.")
        return len(self._data)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """Get a specific sample from the dataset."""
        if idx < 0 or idx >= len(self._data):
            raise IndexError(f"Index {idx} out of bounds.")
        # Implement your sample loading logic here
        row = self._data.iloc[idx].to_dict()
        # Example: Load and process data
        if self.data_root:
            data_path = anypath(self.data_root) / row["path"]
        else:
            data_path = anypath(row["path"])
        # Load your data (e.g., image, audio, text) from data_path
        data = None  # your code goes here
        # Apply output_take_and_give if specified
        if self.output_take_and_give:
            item = {}
            for key, value in self.output_take_and_give.items():
                item[value] = row[key]
        else:
            item = row
        return item

    @classmethod
    def from_config(cls, dataset_config: DatasetConfig) -> "MyCustomDataset":
        """Create a Dataset instance from a configuration."""
        cfg = dataset_config.model_dump(exclude={"dataset_name", "transformations"})
        split = cfg.get("split", None)
        if not split or split not in cls.info.split_paths:
            raise LookupError(
                f"Invalid split '{split}'. "
                f"Available splits: {', '.join(cls.info.split_paths.keys())}"
            )
        return cls(
            split=split,
            output_take_and_give=cfg.get("output_take_and_give", None),
            data_root=cfg.get("data_root"),
        )
2. Key Components to Implement
- DatasetInfo:
    - name: Unique identifier for your dataset
    - owner: Dataset maintainer
    - split_paths: Dictionary mapping split names to data paths
    - version: Dataset version
    - description: Brief description
    - sources: List of data sources
    - license: Dataset license
- Required Methods:
    - __init__: Initialize the dataset with split and configuration
    - _load: Load the dataset data
    - __len__: Return dataset size
    - __getitem__: Get a specific sample
    - from_config: Create dataset from configuration
- Optional Methods:
    - __iter__: Iterate over samples
    - __str__: String representation
3. Registration
Use the @register_dataset decorator to register your dataset, as done for MyCustomDataset in the Basic Structure section above.
4. Example Usage
Here is an example of how to use your new dataset:
# Create dataset instance
dataset = MyCustomDataset(
split="train",
output_take_and_give={"original_col": "new_col"}
)
# Access data
sample = dataset[0]
print(len(dataset))
# Use with transforms
from alp_data.transforms import Filter
filter_transform = Filter(property="category", values=["A", "B"])
dataset.apply_transformations([filter_transform])