alp_data.transforms module
What are Transforms?
Transforms are operations that can be applied to an ESP dataset to modify, filter, or enhance the data in various ways. In short, Transforms are callable objects that take a pandas DataFrame as input and return a tuple containing:
- The transformed DataFrame
- A dictionary of metadata about the transformation (empty if no metadata is needed)
Each transform is defined by two main components:
- A configuration class (inheriting from pydantic.BaseModel)
- A transform class that implements the actual transformation logic
How to Use Transforms
Basic Usage
Transforms can be used in two ways:
- Direct instantiation:
- Using configuration:

    from alp_data.transforms import FilterConfig, transform_from_config

    # Create a configuration
    config = FilterConfig(
        type="filter",
        property="category",
        values=["A", "B"],
        mode="include",
    )

    # Assume a dataframe called 'data' is already defined
    transform = transform_from_config(config)
    transformed_data, metadata = transform(data)
Transform Configuration
Each transform has its own configuration class that defines its parameters. For example, the FilterConfig has:
- type: The type of transform ("filter")
- mode: Either "include" or "exclude"
- property: The property to filter on
- values: List of values to filter by
Creating Custom Transforms
To create a custom transform:
- Create a configuration class:
- Create the transform class:

    import pandas as pd

    class MyTransform:
        def __init__(self, **kwargs):
            # Initialize your transform
            pass

        @classmethod
        def from_config(cls, cfg: MyTransformConfig) -> "MyTransform":
            # Pass every config field except the "type" discriminator
            return cls(**cfg.model_dump(exclude={"type"}))

        def __call__(self, data: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
            # Implement your transformation logic
            transformed_data = data  # Your transformation here
            return transformed_data, {}

- Register your transform:
Available Transforms
The transforms system uses a registry pattern to manage available transforms. The registry ensures that each transform type is unique and properly configured before use. The module provides several built-in transforms to handle common data transformation tasks. Here's an overview of each transform and its functionality:
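As a rough illustration of the registry pattern described above, the sketch below maps a config's type string to a transform class and rejects duplicate registrations. It does not use alp_data: the names register, build_from_config, Uppercase, and UppercaseConfig are hypothetical, a dataclass stands in for the pydantic config, and a list of dicts stands in for the DataFrame.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical registry: maps a config "type" string to a transform class.
_REGISTRY: dict[str, type] = {}


def register(type_name: str) -> Callable[[type], type]:
    """Class decorator that registers a transform under a unique type name."""
    def decorator(cls: type) -> type:
        if type_name in _REGISTRY:
            raise ValueError(f"Transform type {type_name!r} is already registered")
        _REGISTRY[type_name] = cls
        return cls
    return decorator


@dataclass
class UppercaseConfig:
    """Stand-in for a pydantic config model; `type` selects the transform."""
    column: str
    type: str = "uppercase"


@register("uppercase")
class Uppercase:
    def __init__(self, column: str):
        self.column = column

    @classmethod
    def from_config(cls, cfg: UppercaseConfig) -> "Uppercase":
        return cls(column=cfg.column)

    def __call__(self, rows: list[dict]) -> tuple[list[dict], dict]:
        # Return the transformed data plus an (empty) metadata dictionary.
        out = [{**row, self.column: str(row[self.column]).upper()} for row in rows]
        return out, {}


def build_from_config(cfg):
    """Look up the transform class by cfg.type and build it from the config."""
    try:
        cls = _REGISTRY[cfg.type]
    except KeyError:
        raise KeyError(f"Unknown transform type {cfg.type!r}") from None
    return cls.from_config(cfg)


transform = build_from_config(UppercaseConfig(column="species"))
rows, metadata = transform([{"species": "bee"}])
```

Keying the registry on the type string lets a configuration file select a transform by name, and the duplicate check is one way to keep each type unique.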
Filter Transform
The Filter transform allows you to selectively include or exclude rows from your dataset based on specific property values.
alp_data.transforms.Filter
Filter data based on property values.
This transform filters a DataFrame based on the values of a specified property. It can either include or exclude rows based on the specified values. The property is a column in the DataFrame, and the values are the values to filter by.
Works with any backend (pandas, polars) through the DataBackend protocol.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| property | str | The name of the property (column) to filter by. | required |
| values | list[str] | The values to include or exclude from the DataFrame. | required |
| mode | Literal['include', 'exclude'] | The mode of filtering. If "include", only rows with the specified values in the property will be kept. If "exclude", rows with the specified values will be removed from the DataFrame. | 'include' |
Examples:
>>> from alp_data.transforms import Filter
>>> from alp_data.backends import PandasBackend
>>> import pandas as pd
>>> filter_transform = Filter(property="species", values=["bee", "butterfly"],
... mode="include")
>>> df = pd.DataFrame({"species": ["bee", "ant", "butterfly", "spider"],
... "count": [10, 5, 8, 2]})
>>> backend = PandasBackend(df)
>>> filtered_backend, _ = filter_transform(backend)
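The include and exclude modes can be sketched in plain Python on the same data as the example above. This is only an illustration of the described semantics, not the library's implementation; filter_rows is a hypothetical helper and a list of dicts stands in for the backend.

```python
def filter_rows(rows, prop, values, mode="include"):
    """Keep rows whose `prop` is in `values` (include) or not in them (exclude)."""
    if mode not in ("include", "exclude"):
        raise ValueError(f"Unknown mode: {mode!r}")
    wanted = set(values)
    keep_if_member = mode == "include"
    return [row for row in rows if (row[prop] in wanted) == keep_if_member]


rows = [
    {"species": "bee", "count": 10},
    {"species": "ant", "count": 5},
    {"species": "butterfly", "count": 8},
    {"species": "spider", "count": 2},
]
included = filter_rows(rows, "species", ["bee", "butterfly"], mode="include")
excluded = filter_rows(rows, "species", ["bee", "butterfly"], mode="exclude")
```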
Source code in alp_data/transforms/filter.py
__call__(backend)
Filter the data based on property values.
Args: backend: The backend wrapping the dataframe to filter
Returns: The filtered backend (same type as input) and empty metadata dict.
__init__(*, property, values, mode='include')
Initialize the filter.
LabelFromFeature Transform
The LabelFromFeature transform converts categorical features into numerical labels. Example use case: Converting a 'species' column with values like 'dog', 'cat', 'bird' into numerical labels 0, 1, 2.
alp_data.transforms.LabelFromFeature
Transform to create a label feature from an existing feature in a DataFrame.
This transform maps the values of a specified feature to integer labels.
Works with any backend (pandas, polars) through the DataBackend protocol.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature | str | The name of the feature in the DataFrame from which to create labels. | required |
| label_map | dict[Any, int] \| None | A mapping of feature values to integer labels. If None, the labels will be created from the unique values in the feature. | None |
| output_feature | str | The name of the new feature to store the labels. Defaults to "label". | 'label' |
| override | bool | If True, will override the output feature if it already exists in the DataFrame. If False, will raise an AssertionError if the output feature already exists. | False |
Examples:
>>> from alp_data.transforms import LabelFromFeature
>>> from alp_data.backends import PandasBackend
>>> import pandas as pd
>>> df = pd.DataFrame({"species": ["cat", "dog", "bird", "cat"]})
>>> backend = PandasBackend(df)
>>> transform = LabelFromFeature(feature="species", output_feature="label")
>>> transformed_backend, metadata = transform(backend)
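To make the mapping concrete, here is a plain-Python sketch of the behaviour described above. It is not the library's code: label_from_feature is a hypothetical helper, the sorted ordering of automatically generated labels and the "label_map" metadata key are assumptions, and a list of dicts stands in for the backend.

```python
def label_from_feature(rows, feature, label_map=None,
                       output_feature="label", override=False):
    """Map each value of `feature` to an integer label stored in `output_feature`."""
    if not override and any(output_feature in row for row in rows):
        raise AssertionError(f"Feature {output_feature!r} already exists")
    if label_map is None:
        # Assumption: labels are assigned in sorted order of the unique values.
        unique_values = sorted({row[feature] for row in rows})
        label_map = {value: index for index, value in enumerate(unique_values)}
    out = [{**row, output_feature: label_map[row[feature]]} for row in rows]
    return out, {"label_map": label_map}


rows = [{"species": s} for s in ["cat", "dog", "bird", "cat"]]
labelled, metadata = label_from_feature(rows, "species")
```

Passing an explicit label_map keeps label IDs stable across datasets, which matters when training and evaluation data are processed separately.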
Source code in alp_data/transforms/label_from_feature.py
__call__(backend)
Apply the transformation to the backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| backend | DataBackend | The backend wrapping the DataFrame to transform. | required |

Returns:

| Type | Description |
|---|---|
| tuple[DataBackend, dict] | A tuple containing the transformed backend and metadata about the labels. |

Raises:

| Type | Description |
|---|---|
| AssertionError | If the output feature already exists and override is False. |
MultiLabelFromFeatures Transform
The MultiLabelFromFeatures transform extends the functionality of LabelFromFeature to handle multiple features simultaneously. Example use case: Creating labels from multiple categorical columns like 'species', 'breed', and 'color' in a single operation.
alp_data.transforms.MultiLabelFromFeatures
A transform that generates multi-label targets from one or more feature columns.
This class goes through one or more specified columns and generates a mapping of unique values to integer IDs. It then uses this mapping to generate a new column where each row contains a list of integer label IDs corresponding to the unique values found in the specified feature columns. It is useful for preparing data for multi-label classification tasks, where each sample may be associated with multiple labels.
Notes
If element values are themselves lists, the transform will explode them first before constructing the mapping dictionary and converting the values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features | list[str] | The names of the columns in the DataFrame to use as sources for the labels. Each column can contain a single value or a list of values per row. | required |
| label_map | dict[Any, int] \| None | A mapping of unique values to integer IDs. If not provided, the transform will generate a mapping based on the unique values in the specified feature columns. | None |
| output_feature | str | The name of the output column to store the generated label lists. | "label" |
| override | bool | If False and the output_feature already exists in the dataset, an error is raised. If True, the output_feature will be overwritten. | False |
| allow_missing_labels | bool | If True, rows with no labels will be included in the output. If False, rows with no labels will be dropped. | True |

Methods:

| Name | Description |
|---|---|
| from_config | Instantiates the transform from a configuration object. |
| __call__ | Applies the transform to the DataFrame, returning the modified DataFrame and metadata about the label mapping. |
Examples:
>>> import pandas as pd
>>> from alp_data.transforms import MultiLabelFromFeatures, MultiLabelFromFeaturesConfig
>>> config = MultiLabelFromFeaturesConfig(
... type="labels_from_features",
... features=["tags", "categories"],
... label_map=None,
... output_feature="labels",
... override=False
... )
>>> df = pd.DataFrame({
... "tags": [["cat", "dog"], ["bird"], ["cat"]],
... "categories": [["mammal"], ["avian"], []]
... })
>>> from alp_data.backends import PandasBackend
>>> backend = PandasBackend(df)
>>> transform = MultiLabelFromFeatures.from_config(config)
>>> transformed_backend, metadata = transform(backend)
>>> metadata["label_map"]
{'avian': 0, 'bird': 1, 'cat': 2, 'dog': 3, 'mammal': 4}
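The label map shown above can be reproduced with a short plain-Python sketch that flattens list-valued cells before building the mapping. This is an illustration under assumptions, not the library's implementation: multilabel_from_features is a hypothetical helper, sorted ordering is inferred from the example output, and a list of dicts stands in for the backend.

```python
def multilabel_from_features(rows, features, label_map=None, output_feature="label"):
    """Collect values from several columns and encode them as lists of label IDs."""

    def values_of(row):
        collected = []
        for feature in features:
            value = row.get(feature)
            if value is None:
                continue
            if isinstance(value, (list, tuple)):
                collected.extend(value)   # list-valued cells are flattened
            else:
                collected.append(value)   # scalar cells contribute one value
        return collected

    if label_map is None:
        # Assumption: IDs follow the sorted order of all unique values.
        unique_values = sorted({v for row in rows for v in values_of(row)})
        label_map = {value: index for index, value in enumerate(unique_values)}

    out = [
        {**row, output_feature: [label_map[v] for v in values_of(row)]}
        for row in rows
    ]
    return out, {"label_map": label_map}


rows = [
    {"tags": ["cat", "dog"], "categories": ["mammal"]},
    {"tags": ["bird"], "categories": ["avian"]},
    {"tags": ["cat"], "categories": []},
]
encoded, metadata = multilabel_from_features(
    rows, ["tags", "categories"], output_feature="labels"
)
```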
Source code in alp_data/transforms/multilabel_from_features.py
Subsample Transform
The Subsample transform reduces the size of your dataset by sampling a subset of the data. Example use case: Creating a 10% random sample of a large dataset for initial testing.
alp_data.transforms.Subsample
Subsample data based on property ratios.
This transform subsamples a DataFrame based on the specified ratios for each value of a given property. It allows for controlling the representation of different categories in the dataset by specifying how much of each category to keep. The property is a column in the DataFrame, and the ratios are specified as a dictionary where keys are property values and values are the ratios of samples to keep for each property value. The "other" category can be used to specify a ratio for all other values not explicitly listed in the ratios dictionary.
Works with any backend (pandas, polars) through the DataBackend protocol.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| property | str | The name of the property (column) to subsample by. | required |
| ratios | dict[str, float] | A dictionary where keys are the values of the property and values are the ratios of samples to keep for each value. The ratios should be in the range [0, 1]. If "other" is included as a key, it will subsample all other values not explicitly listed in the ratios dictionary. | required |
Examples:
>>> from alp_data.transforms import Subsample, SubsampleConfig
>>> from alp_data.backends import PandasBackend
>>> import pandas as pd
>>> config = SubsampleConfig(
... type="subsample",
... property="species",
... ratios={
... "bee": 0.5,
... "butterfly": 0.3,
... "other": 0.1
... })
>>> subsample_transform = Subsample.from_config(config)
>>> df = pd.DataFrame({
... "species": ["bee", "bee", "butterfly", "ant", "butterfly", "spider"],
... "count": [10, 5, 8, 2, 3, 1]
... })
>>> backend = PandasBackend(df)
>>> subsampled_backend, _ = subsample_transform(backend)
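The per-value ratios and the "other" fallback can be sketched as follows. This is a minimal sketch rather than the library's implementation: subsample and its seed argument are hypothetical, rounding the kept count with round() and keeping unlisted values in full when "other" is absent are assumptions, and a list of dicts stands in for the backend.

```python
import random
from collections import Counter, defaultdict


def subsample(rows, prop, ratios, seed=0):
    """Keep a fraction of the rows for each value of `prop`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[prop]].append(index)

    kept = []
    for value, indices in groups.items():
        # Unlisted values use the "other" ratio; assumption: kept in full otherwise.
        ratio = ratios.get(value, ratios.get("other", 1.0))
        n_keep = round(len(indices) * ratio)
        kept.extend(rng.sample(indices, n_keep))
    return [rows[i] for i in sorted(kept)]


rows = [{"species": s} for s in ["bee"] * 10 + ["ant"] * 10 + ["wasp"] * 10]
sampled = subsample(rows, "species", {"bee": 0.5, "other": 0.1})
sampled_counts = Counter(row["species"] for row in sampled)
```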
Source code in alp_data/transforms/subsample.py
__call__(backend)
Apply the subsample transformation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| backend | DataBackend | The backend wrapping the dataframe to subsample. | required |

Returns:

| Type | Description |
|---|---|
| tuple[DataBackend, dict] | A tuple containing the subsampled backend (same type as input) and the metadata dictionary (empty placeholder for future use). |

Raises:

| Type | Description |
|---|---|
| KeyError | If the specified property is not found in the DataFrame columns. |
BalancedSample Transform
The BalancedSample transform performs balanced sampling of the data, ensuring balanced representation across different categories.
alp_data.transforms.BalancedSample
Balance data by sampling to equalize category counts.
This transform balances a DataFrame based on a specified property to ensure that the resulting DataFrame has a balanced distribution of the specified property across the samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| property | str | The name of the property (column) to sample by. | required |
| strategy | Literal['min', 'max', 'median', 'mean', 'median_with_range'] | The balancing strategy to use. Options are: "min" samples all categories to the minimum count (downsamples larger); "max" samples all categories to the maximum count (upsamples smaller); "median" samples all categories to the median count (default); "mean" samples all categories to the mean count; "median_with_range" clamps each category to a range around the median. | 'median' |
| range_fraction | float | Only used with "median_with_range" strategy. The fraction of the median to use as the range. E.g., 0.2 means targets are clamped to [median * 0.8, median * 1.2]. Defaults to 0.2. | 0.2 |
| seed | int | Random seed for reproducibility. Defaults to 42. | 42 |
Examples:
>>> from alp_data.transforms import BalancedSample, BalancedSampleConfig
>>> from alp_data.backends import PandasBackend
>>> import pandas as pd
>>> config = BalancedSampleConfig(
... type="balanced_sample",
... property="species",
... strategy="median",
... seed=42
... )
>>> balanced_sample_transform = BalancedSample.from_config(config)
>>> df = pd.DataFrame({
... "species": ["bee", "bee", "butterfly", "ant", "butterfly", "spider"],
... "count": [10, 5, 8, 2, 3, 1]
... })
>>> backend = PandasBackend(df)
>>> sampled_backend, _ = balanced_sample_transform(backend)
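The strategies differ only in how the per-category target count is chosen. The sketch below computes those targets from category counts; it is an illustration, not the library's implementation. balanced_targets is a hypothetical helper, and rounding fractional targets with round() is an assumption.

```python
import statistics


def balanced_targets(counts, strategy="median", range_fraction=0.2):
    """Return the target number of samples per category for a strategy."""
    values = list(counts.values())
    if strategy == "median_with_range":
        median = statistics.median(values)
        low = median * (1 - range_fraction)
        high = median * (1 + range_fraction)
        # Each category is clamped into [low, high] around the median.
        return {key: round(min(max(count, low), high)) for key, count in counts.items()}

    reducers = {
        "min": min,
        "max": max,
        "median": statistics.median,
        "mean": statistics.mean,
    }
    if strategy not in reducers:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    target = round(reducers[strategy](values))
    return {key: target for key in counts}


counts = {"bee": 100, "ant": 40, "wasp": 10}
median_targets = balanced_targets(counts, "median")
ranged_targets = balanced_targets(counts, "median_with_range", range_fraction=0.2)
```

With these counts, "median" moves every category to 40, while "median_with_range" only pulls the extremes into the band from 32 to 48.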
Source code in alp_data/transforms/balanced_sample.py
__call__(backend)
Apply the balanced sample transformation.
This transform creates a balanced distribution by sampling categories based on the selected strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| backend | DataBackend | The backend wrapping the dataframe to sample. | required |

Returns:

| Type | Description |
|---|---|
| tuple[DataBackend, dict] | A tuple containing the sampled backend (same type as input) and the metadata dictionary (empty placeholder for future use). |

Raises:

| Type | Description |
|---|---|
| KeyError | If the specified property is not found in the DataFrame columns. |
Deduplicate Transform
The Deduplicate transform removes duplicate rows from your dataset based on specified columns. Example use case: Ensuring that each entry in a dataset is unique based on a combination of 'species' and 'location'.
alp_data.transforms.Deduplicate
A transform to remove duplicate rows from a DataFrame.
This transform removes duplicate rows based on specified columns or all columns if none are specified. It can keep either the first or last occurrence of duplicates.
Works with any backend (pandas, polars) through the DataBackend protocol.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subset | list[str] \| None | List of column names to consider for deduplication. If empty, all columns are considered. | None |
| keep_first | bool | If True, keeps the first occurrence of duplicates. If False, keeps the last occurrence. | True |
Examples:
>>> from alp_data.transforms import Deduplicate
>>> from alp_data.backends import PandasBackend
>>> import pandas as pd
>>> df = pd.DataFrame({
... "species": ["bee", "bee", "butterfly", "bee"],
... "count": [10, 10, 5, 10]
... })
>>> backend = PandasBackend(df)
>>> transform = Deduplicate(subset=["species"], keep_first=True)
>>> deduplicated_backend, _ = transform(backend)
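The effect of subset and keep_first can be sketched in plain Python. This illustrates the described behaviour only; deduplicate is a hypothetical helper, a list of dicts stands in for the backend, and the id column is added here just to show which occurrence survives.

```python
def deduplicate(rows, subset=None, keep_first=True):
    """Drop rows whose values in `subset` (or all columns) were already seen."""
    ordered = rows if keep_first else list(reversed(rows))
    seen = set()
    kept = []
    for row in ordered:
        columns = subset or sorted(row)   # no subset: compare every column
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    # Restore the original row order when scanning from the end.
    return kept if keep_first else list(reversed(kept))


rows = [
    {"id": 0, "species": "bee"},
    {"id": 1, "species": "bee"},
    {"id": 2, "species": "butterfly"},
    {"id": 3, "species": "bee"},
]
first_kept = deduplicate(rows, subset=["species"], keep_first=True)
last_kept = deduplicate(rows, subset=["species"], keep_first=False)
```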
Source code in alp_data/transforms/deduplicate.py
__call__(backend)
Remove duplicate rows from the backend.
Args: backend: The backend wrapping the dataframe to deduplicate
Returns: The deduplicated backend (same type as input) and empty metadata dict.
SelectColumns Transform
The SelectColumns transform allows you to select a subset of columns from your dataset. Example use case: Keeping only the 'audio' and 'label' columns for a machine learning task.
alp_data.transforms.SelectColumns
Select a subset of columns from the dataset.
This transform keeps only the specified columns and drops all others.
Works with any backend (pandas, polars) through the DataBackend protocol.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | list[str] | List of column names to keep. | required |
Examples:
>>> from alp_data.transforms import SelectColumns, SelectColumnsConfig
>>> from alp_data.backends import PandasBackend
>>> import pandas as pd
>>> config = SelectColumnsConfig(
... type="select_columns",
... columns=["species", "audio"],
... )
>>> transform = SelectColumns.from_config(config)
>>> df = pd.DataFrame({
... "species": ["bee", "ant"],
... "audio": ["/a.wav", "/b.wav"],
... "extra": [1, 2],
... })
>>> backend = PandasBackend(df)
>>> result, _ = transform(backend)
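For the data in the example above, the result would contain only the species and audio columns. A plain-Python sketch of that behaviour, including the KeyError for unknown columns, might look like this; select_columns is a hypothetical helper and a list of dicts stands in for the backend.

```python
def select_columns(rows, columns):
    """Keep only `columns` in every row; raise KeyError if any is missing."""
    for row in rows:
        missing = [column for column in columns if column not in row]
        if missing:
            raise KeyError(f"Columns not found: {missing}")
    return [{column: row[column] for column in columns} for row in rows]


rows = [
    {"species": "bee", "audio": "/a.wav", "extra": 1},
    {"species": "ant", "audio": "/b.wav", "extra": 2},
]
selected = select_columns(rows, ["species", "audio"])
```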
Source code in alp_data/transforms/select_columns.py
__call__(backend)
Select the specified columns from the backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
backend
|
DataBackend
|
The backend wrapping the dataframe to transform. |
required |
Returns:
| Type | Description |
|---|---|
tuple[DataBackend, dict]
|
A tuple containing the transformed backend with only the selected columns and an empty metadata dictionary. |
Raises:
| Type | Description |
|---|---|
KeyError
|
If any of the specified columns are not found in the backend. |
LongTailUpsample Transform
The LongTailUpsample transform performs upsampling of underrepresented classes in a long-tailed distribution. Example use case: Increasing the number of samples for rare species in a biodiversity dataset.
alp_data.transforms.LongTailUpsample
Upsample under-represented categories without excessive repetition.
Designed for long-tail distributions (e.g. bioacoustic species counts) where
a few categories dominate and many categories have very few examples. This
transform lifts the tail towards a sufficient_threshold while capping how
many times any single example can be repeated via max_repeats.
For each category with count c:
- If c >= sufficient_threshold: the category is left untouched.
- If c < sufficient_threshold: the target becomes min(sufficient_threshold, c * max_repeats).
This produces a gradual compression of the distribution: well-represented
categories keep all their data, moderately-represented categories are boosted
to the threshold, and very rare categories are boosted as much as possible
without repeating any single example more than max_repeats times.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| property | str | The name of the property (column) to balance on. | required |
| sufficient_threshold | int | Categories with at least this many examples are left as-is. Categories below this count are upsampled towards it, subject to | required |
| max_repeats | int | Maximum number of times any individual example may appear in the output. Prevents over-fitting on very rare categories. | required |
| seed | int | Random seed for reproducibility. Defaults to 42. | 42 |
Examples:
>>> from alp_data.transforms import LongTailUpsample, LongTailUpsampleConfig
>>> from alp_data.backends import PandasBackend
>>> import pandas as pd
>>> config = LongTailUpsampleConfig(
... type="long_tail_upsample",
... property="species",
... sufficient_threshold=6,
... max_repeats=3,
... seed=42,
... )
>>> transform = LongTailUpsample.from_config(config)
>>> df = pd.DataFrame({
... "species": (
... ["common"] * 10
... + ["moderate"] * 4
... + ["rare"] * 1
... ),
... })
>>> backend = PandasBackend(df)
>>> result, _ = transform(backend)
>>> # common (10): untouched — already >= 6
>>> # moderate (4): upsampled to min(6, 4*3)=6
>>> # rare (1): upsampled to min(6, 1*3)=3
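The target rule min(sufficient_threshold, c * max_repeats) can be checked against the numbers in the example with a small plain-Python sketch. It is not the library's implementation: long_tail_target and long_tail_upsample are hypothetical helpers, and the sampling scheme (keep every original, then draw the extras from a pool that holds each example max_repeats - 1 more times) is one assumed way to respect the repeat cap.

```python
import random
from collections import Counter, defaultdict


def long_tail_target(count, sufficient_threshold, max_repeats):
    """Target size for a category with `count` examples."""
    if count >= sufficient_threshold:
        return count
    return min(sufficient_threshold, count * max_repeats)


def long_tail_upsample(labels, sufficient_threshold, max_repeats, seed=42):
    """Return row indices after upsampling; no index appears more than max_repeats times."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, label in enumerate(labels):
        by_category[label].append(index)

    selected = []
    for indices in by_category.values():
        target = long_tail_target(len(indices), sufficient_threshold, max_repeats)
        extra = target - len(indices)
        # Each original may be drawn at most (max_repeats - 1) additional times.
        pool = indices * (max_repeats - 1)
        selected.extend(indices + rng.sample(pool, extra))
    return sorted(selected)


labels = ["common"] * 10 + ["moderate"] * 4 + ["rare"] * 1
selected = long_tail_upsample(labels, sufficient_threshold=6, max_repeats=3)
category_counts = Counter(labels[i] for i in selected)
max_times_repeated = max(Counter(selected).values())
```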
Source code in alp_data/transforms/long_tail_upsample.py
__call__(backend)
Apply the long-tail upsample transformation.
Categories below sufficient_threshold are upsampled towards it,
bounded by max_repeats. Categories at or above the threshold are
left unchanged.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| backend | DataBackend | The backend wrapping the dataframe to transform. | required |

Returns:

| Type | Description |
|---|---|
| tuple[DataBackend, dict] | A tuple containing the transformed backend (same type as input) and a metadata dictionary with keys |

Raises:

| Type | Description |
|---|---|
| KeyError | If the specified property is not found in the DataFrame columns. |
AddTaxonomy Transform
The AddTaxonomy transform adds precomputed GBIF taxonomic information to your dataset based on existing features. Example use case: Adding family and order information to a dataset with a 'species' column using GBIF taxonomy.
alp_data.discover.AddTaxonomy
Transform that adds resolved GBIF taxonomy info to each row.
Uses GBIFConverter to resolve scientific names in a specified column to their accepted species-level taxonomic records. New columns are added for each taxonomy rank: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus'. An extra column 'taxonomic_name' is also added, which concatenates the higher ranks with the canonical name e.g. "Animalia Chordata Aves Passeriformes Corvidae Corvus corax".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature | str | Column name containing scientific names to look up. | 'scientific_name' |
| gbif_precomputed_taxonomy_path | str \| AnyPathT | Path to GBIF taxonomy json file, preprocessed via scripts/cache_gbif_taxonomy_conversion.py | DEFAULT_PRECOMPUTED_LOCATION |
| add_taxonomic_name | bool | Whether to add a 'taxonomic_name' column with the full taxonomic name. | False |
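The 'taxonomic_name' value described above can be illustrated with a short sketch that joins the higher ranks with the canonical name. This is an assumption-laden illustration, not the GBIFConverter logic: taxonomic_name is a hypothetical helper, the record is a hand-written dict, the "canonical_name" key is hypothetical, and the choice to join kingdom through family is inferred from the example string.

```python
# Ranks placed before the canonical name (inferred from the documented example).
HIGHER_RANKS = ("kingdom", "phylum", "class", "order", "family")


def taxonomic_name(record):
    """Join the higher ranks and the canonical name into one space-separated string."""
    parts = [record[rank] for rank in HIGHER_RANKS if record.get(rank)]
    parts.append(record["canonical_name"])
    return " ".join(parts)


raven = {
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Corvidae",
    "genus": "Corvus",
    "canonical_name": "Corvus corax",  # hypothetical key for the resolved name
}
raven_name = taxonomic_name(raven)
```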
Source code in alp_data/discover/gbif_taxonomy.py
__call__(backend)
Apply the transform to add taxonomy columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| backend | DataBackend | The backend wrapping the DataFrame to transform. | required |

Returns:

| Type | Description |
|---|---|
| tuple[DataBackend, dict] | A tuple containing the transformed backend with taxonomy columns added, and metadata about the resolution (success/failure counts). |

Raises:

| Type | Description |
|---|---|
| ValueError | If the specified feature column is not found in the DataFrame. |