alp_data.backends Module

What are data backends?

A DataBackend is a Python Protocol that defines an interface for any library that is used to read data and perform common operations on data. It's an abstraction that allows alp-data to support multiple data libraries without being tightly coupled to any specific one.
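To make the idea of a Protocol-based abstraction concrete, here is a minimal sketch using only the standard library. `MiniBackend`, `ListOfDictsBackend`, and `count_rows` are hypothetical names invented for illustration; the real DataBackend protocol defines many more members than the three shown here.

```python
from typing import Any, Iterator, Protocol, runtime_checkable


@runtime_checkable
class MiniBackend(Protocol):
    """Hypothetical, cut-down stand-in for a backend protocol."""

    @property
    def columns(self) -> list[str]: ...

    def __len__(self) -> int: ...

    def __iter__(self) -> Iterator[dict[str, Any]]: ...


class ListOfDictsBackend:
    """Toy backend over a list of row dicts. It does not inherit from MiniBackend."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    @property
    def columns(self) -> list[str]:
        return list(self._rows[0]) if self._rows else []

    def __len__(self) -> int:
        return len(self._rows)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        return iter(self._rows)


def count_rows(backend: MiniBackend) -> int:
    """Library-agnostic code depends only on the protocol, not on a concrete class."""
    return sum(1 for _ in backend)


toy = ListOfDictsBackend([{"species": "cat"}, {"species": "dog"}])
print(isinstance(toy, MiniBackend), count_rows(toy))
```

Because protocols use structural typing, any class that provides the required members satisfies the interface without subclassing it, which is what keeps calling code decoupled from a specific data library.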

An example of such a library is pandas, and the corresponding backend class is alp_data.backends.PandasBackend. Up to version 1.3.0, all Dataset and Transform classes in alp-data used pandas to load the underlying annotation CSV / JSONL files. This meant calling functions like pd.read_csv and manipulating pandas DataFrames directly within class methods. We want to reduce this dependence on pandas and allow ourselves the freedom to use other libraries such as polars, duckdb, pyarrow, or webdataset to load and manipulate data, because each library has its own strengths and weaknesses for different ML use cases.

Available Backends

Currently, alp-data provides two backend implementations:

| Backend | Class | Description |
|---|---|---|
| pandas | PandasBackend | Uses pandas DataFrames for data operations. Supports streaming via chunked reading. |
| polars | PolarsBackend | Uses polars DataFrames/LazyFrames. Supports streaming via LazyFrame for memory-efficient processing. |

How to Use Backends

Specifying a Backend in Dataset Configuration

When loading a dataset, you can specify which backend to use via the backend parameter:

from alp_data.datasets import BirdSet

# Use polars backend (default)
dataset = BirdSet(split="HSN-train", backend="polars")

# Use pandas backend
dataset = BirdSet(split="HSN-train", backend="pandas")

Or via YAML configuration:

dataset:
  dataset_name: birdset
  split: HSN-train
  backend: polars  # or "pandas"
  streaming: false

Direct Backend Usage

You can also use backends directly for standalone data operations:

from alp_data.backends import PandasBackend, PolarsBackend

# Load data with PandasBackend
backend = PandasBackend.from_csv("path/to/data.csv")
print(len(backend))  # Number of rows
print(backend.columns)  # Column names

# Load data with PolarsBackend
backend = PolarsBackend.from_parquet("path/to/data.parquet")
row = backend[0]  # Get first row as dict

Streaming Mode

Both backends support streaming mode for memory-efficient processing of large datasets:

from alp_data.datasets import BirdSet

# Enable streaming mode
dataset = BirdSet(split="HSN-train", backend="polars", streaming=True)

# Iterate over rows without loading entire dataset into memory
for sample in dataset:
    process(sample)

Streaming Limitations

In streaming mode, __getitem__ indexing is disabled. Use iteration instead. Additionally, len() is not available until the stream is consumed.
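The restrictions above can be sketched with a toy class. `ToyStreamingBackend` and `row_source` are hypothetical and are not part of alp-data; the sketch only mimics the documented behaviour (indexing and `len()` raise `RuntimeError` in streaming mode, iteration works).

```python
from typing import Any, Iterable, Iterator


class ToyStreamingBackend:
    """Hypothetical stand-in that mimics the documented streaming restrictions."""

    def __init__(self, rows: Iterable[dict[str, Any]], streaming: bool = False) -> None:
        self._streaming = streaming
        # Eager mode materializes all rows; streaming mode keeps the lazy iterable.
        self._rows = rows if streaming else list(rows)

    def __getitem__(self, index: int) -> dict[str, Any]:
        if self._streaming:
            raise RuntimeError("indexing is disabled in streaming mode; iterate instead")
        return self._rows[index]

    def __len__(self) -> int:
        if self._streaming:
            raise RuntimeError("length is unknown until the stream is consumed")
        return len(self._rows)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        return iter(self._rows)


def row_source() -> Iterator[dict[str, Any]]:
    """Lazy row producer standing in for a file that is read incrementally."""
    for i in range(3):
        yield {"id": i}


stream = ToyStreamingBackend(row_source(), streaming=True)
seen = [row["id"] for row in stream]  # iteration is the only supported access
```

Code that must work in both modes can check whether the backend is streaming (the backends expose an `is_streaming` property) before relying on indexing or `len()`.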

Accessing the Underlying Data Object

If you need to perform library-specific operations, use the unwrap property to access the underlying data object:

from alp_data.backends import PandasBackend

backend = PandasBackend.from_csv("data.csv")
df = backend.unwrap  # Returns pd.DataFrame

For more details, see Accessing the Underlying Data.

Pandas vs Polars: When to Use Which

| Use Case | Recommended Backend | Reason |
|---|---|---|
| Small to medium datasets | Either | Both perform well |
| Large datasets (> 1GB) | polars | Better memory efficiency and performance |
| Streaming/lazy evaluation | polars | Native LazyFrame support |
| Compatibility with existing pandas code | pandas | Direct access to pd.DataFrame via unwrap |
| Parquet file streaming | polars | Pandas doesn't support streaming parquet |

Accessing the Underlying Data

If you need to perform operations not covered by the backend interface, use the unwrap property:

import polars as pl

from alp_data.backends import PandasBackend, PolarsBackend

# Pandas backend
pandas_backend = PandasBackend.from_csv("data.csv")
df = pandas_backend.unwrap  # Returns pd.DataFrame
# Now use pandas-specific operations
df.describe()

# Polars backend
polars_backend = PolarsBackend.from_csv("data.csv")
df = polars_backend.unwrap  # Returns pl.DataFrame or pl.LazyFrame
# Now use polars-specific operations
df.select(pl.col("species").value_counts())

Tip

When using unwrap, be aware that you lose the backend abstraction. Operations on the unwrapped object won't automatically work with other backends.

The DataBackend Protocol

The DataBackend protocol defines a common interface that all backend implementations must follow. This enables alp-data to work uniformly with different data libraries.

Core Interface

The protocol defines these key operations:

Data Loading (Class Methods)

| Method | Description |
|---|---|
| from_csv(path, streaming=False) | Load data from a CSV file |
| from_json(path, lines=False, streaming=False) | Load data from a JSON file (supports JSON lines format) |
| from_parquet(path, streaming=False) | Load data from a Parquet file |

Data Access

| Method | Description |
|---|---|
| __getitem__(key) | Get row(s) by index (int returns dict, list/slice returns new backend) |
| __len__() | Get number of rows |
| __iter__() | Iterate over rows as dictionaries |
| columns | Property returning list of column names |
| column_exists(column) | Check if a column exists |
| unwrap | Property returning the underlying data object (e.g., pd.DataFrame, pl.DataFrame) |

Data Manipulation

| Method | Description |
|---|---|
| filter_isin(column, values, negate=False) | Filter rows by column values |
| drop_duplicates(subset=None, keep="first") | Remove duplicate rows |
| dropna(subset=None) | Remove rows with missing values |
| get_unique(column) | Get sorted unique values from a column |
| map_column(column, mapping, output_column) | Create new column by mapping values |
| rename_columns(mapping) | Rename columns |
| add_column(column, values) | Add a new column |
| select_columns(columns) | Select subset of columns |
| concat(backends, ignore_index=True) | Concatenate multiple backends vertically |

Sampling

| Method | Description |
|---|---|
| sample_rows(n, seed=42, replace=False) | Randomly sample n rows |
| subsample_by_column(column, ratios, seed=42) | Subsample by column values with specified ratios |

Advanced Operations

| Method | Description |
|---|---|
| copy() | Create a copy of the backend |
| apply_fn(fn, fn_kwargs, apply_kwargs) | Apply a custom function to the data |
| multilabel_from_features(input_features, output_feature, ...) | Create multilabel column from multiple features |

Backend Integration in alp-data

Integration with Datasets

All dataset classes use backends internally to manage their data. The backend is selected at instantiation time:

# alp_data/dataset.py
class Dataset(ABC):
    def __init__(
        self,
        output_take_and_give: dict[str, str] = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        self._backend_class = get_backend(backend)

Datasets then use the backend class to load data:

# Example from BirdSet._load()
self._data = self._backend_class.from_json(
    location, lines=True, streaming=self._streaming
)

Integration with Transforms

Transforms operate directly on backend instances rather than raw DataFrames. This makes transforms backend-agnostic:

from alp_data.transforms import Filter
from alp_data.backends import PandasBackend

# Create backend
backend = PandasBackend.from_csv("data.csv")

# Apply transform - works with any backend
filter_transform = Filter(property="species", values=["cat", "dog"], mode="include")
filtered_backend, metadata = filter_transform(backend)

Transforms use the backend's methods rather than library-specific operations:

# alp_data/transforms/filter.py
class Filter:
    def __call__(self, backend: DataBackend) -> tuple[DataBackend, dict]:
        # Uses backend.filter_isin() instead of pandas-specific code
        negate = self.mode == "exclude"
        filtered_backend = backend.filter_isin(self.property, self.values, negate=negate)
        return filtered_backend, {}
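The include/exclude logic that Filter delegates to `filter_isin` can be illustrated on plain lists of dicts. This is a sketch of the semantics only, not the library's implementation: the free functions `filter_isin` and `apply_filter` below are hypothetical, and the sketch assumes the values are hashable.

```python
from typing import Any


def filter_isin(
    rows: list[dict[str, Any]], column: str, values: list[Any], *, negate: bool = False
) -> list[dict[str, Any]]:
    """Keep rows whose `column` value is in `values`; with negate=True keep the rest."""
    wanted = set(values)
    # (membership != negate) flips the test when negate is True.
    return [row for row in rows if (row[column] in wanted) != negate]


def apply_filter(
    rows: list[dict[str, Any]], prop: str, values: list[Any], mode: str
) -> list[dict[str, Any]]:
    """Mirror the mode-to-negate translation shown in the Filter excerpt."""
    return filter_isin(rows, prop, values, negate=(mode == "exclude"))


animals = [{"species": "cat"}, {"species": "dog"}, {"species": "owl"}]
included = apply_filter(animals, "species", ["cat", "dog"], mode="include")
excluded = apply_filter(animals, "species", ["cat", "dog"], mode="exclude")
```

The transform only needs to know about `negate`; how membership is tested is left to each backend.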

Integration with Dataset Concatenation

The ConcatenatedDataset class uses backend operations to merge multiple datasets:

from alp_data.datasets import InsectSet459, BirdSet
from alp_data.concat import ConcatenatedDataset

dataset1 = InsectSet459(split="validation", backend="polars")
dataset2 = BirdSet(split="HSN-test", backend="polars")

# All datasets must use the same backend type
concat_ds = ConcatenatedDataset([dataset1, dataset2], merge_level="soft")

The concatenation uses the backend's concat class method internally.

API Reference

alp_data.backends.protocol.DataBackend

Protocol defining the interface all data backends must implement.

This protocol uses the Adapter pattern: the backend wraps a data object and provides a unified interface to it. The wrapped data is stored as an instance attribute, which keeps the API clean and Pythonic.

is_streaming property

Check if backend is in streaming mode.

Returns:

Type Description
bool

True if in streaming mode, False otherwise

columns property

Get the list of column names.

Returns:

Type Description
list[str]

List of column names

unwrap property

Get the underlying data object.

This is useful when you need to access backend-specific functionality or pass the data to functions that expect the native type.

Returns:

Type Description
Any

The underlying data (e.g., pd.DataFrame, pl.DataFrame)

from_csv(path, *, streaming=False, **kwargs) classmethod

Read a CSV file and return a wrapped data backend.

Parameters:

Name Type Description Default
path str

Path to the CSV file (supports local and cloud paths via cloudpathlib)

required
streaming bool

If True, use streaming mode (lazy evaluation). In streaming mode, __getitem__ is disabled and data is processed via iteration. By default False.

False
**kwargs Any

Additional backend-specific arguments

{}

Returns:

Type Description
DataBackend

Backend instance wrapping the loaded data

from_json(path, *, lines=False, streaming=False, **kwargs) classmethod

Read a JSON file and return a wrapped data backend.

Parameters:

Name Type Description Default
path str

Path to the JSON file

required
lines bool

If True, read file as JSON lines (one JSON object per line), by default False

False
streaming bool

If True, use streaming mode (lazy evaluation), by default False

False
**kwargs Any

Additional backend-specific arguments

{}

Returns:

Type Description
DataBackend

Backend instance wrapping the loaded data

from_parquet(path, *, streaming=False, **kwargs) classmethod

Read a Parquet file and return a wrapped data backend.

Parameters:

Name Type Description Default
path str

Path to the Parquet file

required
streaming bool

If True, use streaming mode (lazy evaluation), by default False

False
**kwargs Any

Additional backend-specific arguments

{}

Returns:

Type Description
DataBackend

Backend instance wrapping the loaded data object

from_path(path, *, streaming=False, **kwargs) classmethod

Load a tabular file, dispatching on extension.

Parameters:

Name Type Description Default
path str

Path to a .parquet, .csv, .json, .jsonl, or .ndjson file.

required
streaming bool

Whether to use streaming mode, by default False.

False
**kwargs Any

Additional backend-specific arguments.

{}

Returns:

Type Description
DataBackend

Backend instance wrapping the loaded data.

Raises:

Type Description
ValueError

If the file extension is not supported.
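Dispatching on the file extension could look roughly like the sketch below. The `_LOADERS` table and `resolve_loader` are hypothetical. In particular, mapping `.jsonl` and `.ndjson` to `lines=True` and lowercasing the suffix are assumptions rather than documented behaviour.

```python
from pathlib import PurePosixPath
from typing import Any

# Hypothetical mapping from file suffix to (loader name, extra keyword arguments).
_LOADERS: dict[str, tuple[str, dict[str, Any]]] = {
    ".parquet": ("from_parquet", {}),
    ".csv": ("from_csv", {}),
    ".json": ("from_json", {"lines": False}),
    ".jsonl": ("from_json", {"lines": True}),   # assumption: JSON lines
    ".ndjson": ("from_json", {"lines": True}),  # assumption: JSON lines
}


def resolve_loader(path: str) -> tuple[str, dict[str, Any]]:
    """Return the loader name and kwargs for `path`, or raise ValueError."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in _LOADERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    name, kwargs = _LOADERS[suffix]
    return name, dict(kwargs)  # copy so callers cannot mutate the table


print(resolve_loader("annotations/train.jsonl"))
```

A real implementation would then call the resolved class method, for example via `getattr(cls, name)(path, streaming=streaming, **kwargs)`.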

__init__(df, *, streaming=False)

Wrap an existing data object.

Parameters:

Name Type Description Default
df Any

The data to wrap (e.g., pd.DataFrame, pl.DataFrame).

required
streaming bool

If True, use streaming mode where __getitem__ is disabled and iteration processes data in chunks, by default False

False

__getitem__(key)

__getitem__(key: int) -> dict[str, Any]
__getitem__(key: list[int]) -> 'DataBackend'
__getitem__(key: slice) -> 'DataBackend'

Get row(s) from the dataset using Pythonic indexing.

Parameters:

Name Type Description Default
key int | list[int] | slice
  • int: Get single row as dict
  • list[int]: Get multiple rows as new backend
  • slice: Get row range as new backend
required

Returns:

Type Description
dict[str, Any] | DataBackend
  • dict if key is int (single row)
  • DataBackend if key is list or slice (multiple rows)

Raises:

Type Description
IndexError

If index is out of bounds

TypeError

If key type is not supported

RuntimeError

If backend is in streaming mode (use iteration instead)

Note

In streaming mode, __getitem__ is disabled. Use iteration instead: for row in backend: process(row)
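The three indexing forms can be sketched with a toy class over a list of row dicts. `ToyBackend` is hypothetical and omits streaming. It only shows how the return type depends on the key type.

```python
from typing import Any, Union


class ToyBackend:
    """Hypothetical eager backend used to illustrate the indexing contract."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(
        self, key: Union[int, list[int], slice]
    ) -> Union[dict[str, Any], "ToyBackend"]:
        if isinstance(key, int):
            return self._rows[key]  # raises IndexError when out of bounds
        if isinstance(key, slice):
            return ToyBackend(self._rows[key])  # row range as a new backend
        if isinstance(key, list):
            return ToyBackend([self._rows[i] for i in key])  # chosen rows
        raise TypeError(f"Unsupported key type: {type(key).__name__}")


backend = ToyBackend([{"i": n} for n in range(10)])
first = backend[0]          # a single row as a dict
picked = backend[[0, 5]]    # a new backend with two rows
tail = backend[5:]          # a new backend with rows 5..9
```

Returning a new backend for lists and slices keeps chained operations inside the abstraction, while an integer key gives direct access to one row.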

__len__()

Get the number of rows in the dataset.

Returns:

Type Description
int

Number of rows

__iter__()

Iterate over rows as dictionaries.

Yields:

Type Description
dict[str, Any]

Dictionary for each row mapping column names to values

filter_isin(column, values, *, negate=False)

Filter rows where column values are in (or not in) a list.

Parameters:

Name Type Description Default
column str

Column name to filter on

required
values list[Any]

List of values to match

required
negate bool

If True, keep rows NOT in values list, by default False

False

Returns:

Type Description
DataBackend

New backend with filtered data

drop_duplicates(subset=None, *, keep='first')

Remove duplicate rows from the data.

Parameters:

Name Type Description Default
subset list[str] | None

Column names to consider for identifying duplicates. If None, use all columns, by default None

None
keep Literal['first', 'last']

Which duplicate to keep, by default "first"

'first'

Returns:

Type Description
DataBackend

New backend with duplicates removed
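The effect of `subset` and `keep` can be sketched on plain row dicts. The free function below is a hypothetical illustration of the semantics, assuming hashable cell values; it is not how either backend implements the method.

```python
from typing import Any, Optional


def drop_duplicates(
    rows: list[dict[str, Any]],
    subset: Optional[list[str]] = None,
    *,
    keep: str = "first",
) -> list[dict[str, Any]]:
    """Drop rows whose key (the subset columns, or all columns) was already seen."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    # For keep="last", scan from the end so the last occurrence is seen first.
    ordered = rows if keep == "first" else list(reversed(rows))
    seen: set[tuple] = set()
    kept = []
    for row in ordered:
        cols = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    # Restore the original relative order after a reversed scan.
    return kept if keep == "first" else list(reversed(kept))


records = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
print(drop_duplicates(records, ["id"]))
print(drop_duplicates(records, ["id"], keep="last"))
```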

dropna(subset=None)

Remove rows with missing values.

Parameters:

Name Type Description Default
subset list[str] | None

Column names to consider for null detection. If None, check all columns, by default None

None

Returns:

Type Description
DataBackend

New backend with null rows removed

get_unique(column)

Get sorted unique values from a column.

Parameters:

Name Type Description Default
column str

Column name

required

Returns:

Type Description
list[Any]

Sorted list of unique values (nulls excluded)

histogram(column)

Get value counts (histogram) for a column.

Parameters:

Name Type Description Default
column str

Column name

required

Returns:

Type Description
dict[Any, int]

Dictionary mapping unique values to their counts (nulls excluded)
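Both `get_unique` and `histogram` exclude nulls. A small sketch of that behaviour on plain row dicts follows. The free functions are hypothetical, and the sketch treats only `None` as null, which is an assumption (real backends may also treat NaN as missing).

```python
from collections import Counter
from typing import Any


def get_unique(rows: list[dict[str, Any]], column: str) -> list[Any]:
    """Sorted unique values of `column`, skipping None."""
    return sorted({row[column] for row in rows if row[column] is not None})


def histogram(rows: list[dict[str, Any]], column: str) -> dict[Any, int]:
    """Value counts of `column`, skipping None."""
    return dict(Counter(row[column] for row in rows if row[column] is not None))


calls = [{"species": "owl"}, {"species": None}, {"species": "cat"}, {"species": "owl"}]
print(get_unique(calls, "species"))
print(histogram(calls, "species"))
```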

map_column(column, mapping, output_column, *, default=None)

Create a new column by mapping values from an existing column.

Parameters:

Name Type Description Default
column str

Source column name

required
mapping dict[Any, Any]

Dictionary mapping source values to output values

required
output_column str

Name of the new column to create

required
default Any

Value to use for unmapped keys, by default None

None

Returns:

Type Description
DataBackend

New backend with mapped column added
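The role of `default` is easiest to see in a small sketch. The free function below is hypothetical and works on plain row dicts. It returns new rows rather than modifying the input, in line with the "new backend" return value described above.

```python
from typing import Any


def map_column(
    rows: list[dict[str, Any]],
    column: str,
    mapping: dict[Any, Any],
    output_column: str,
    *,
    default: Any = None,
) -> list[dict[str, Any]]:
    """Add `output_column` by looking up each `column` value in `mapping`."""
    return [
        {**row, output_column: mapping.get(row[column], default)}
        for row in rows
    ]


sightings = [{"species": "cat"}, {"species": "owl"}]
mapped = map_column(sightings, "species", {"cat": "mammal"}, "taxon", default="unknown")
```

Values without an entry in `mapping` receive `default`, so the output column is defined for every row.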

rename_columns(mapping)

Rename data columns.

Parameters:

Name Type Description Default
mapping dict[str, str]

Dictionary mapping old column names to new names

required

Returns:

Type Description
DataBackend

New backend with renamed columns

add_column(column, values)

Add a new column to the data.

Parameters:

Name Type Description Default
column str

Name of the new column

required
values Any

Values for the new column (scalar or array-like)

required

Returns:

Type Description
DataBackend

New backend with new column added

select_columns(columns)

Select a subset of columns from the data.

Parameters:

Name Type Description Default
columns list[str]

List of column names to keep

required

Returns:

Type Description
DataBackend

New backend with only specified columns

concat(backends, *, ignore_index=True, sort=False) classmethod

Concatenate multiple backend instances vertically (row-wise).

Parameters:

Name Type Description Default
backends list[DataBackend]

List of backend instances to concatenate

required
ignore_index bool

If True, reset index in result, by default True

True
sort bool

If True, sort columns alphabetically, by default False

False

Returns:

Type Description
DataBackend

New backend with concatenated data

column_exists(column)

Check if a column exists in the data.

Parameters:

Name Type Description Default
column str

Column name to look for

required

subsample_by_column(column, ratios, *, seed=42)

Subsample rows by column values with specified ratios.

For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.

Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.

Parameters:

Name Type Description Default
column str

Column name to group by

required
ratios dict[str, float]

Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together).

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
DataBackend

New backend with subsampled rows

Raises:

Type Description
KeyError

If the specified column does not exist in the DataFrame

ValueError

If any ratio is negative or greater than 1.0
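The pooling behaviour of the "other" key can be sketched as follows. This is an illustrative reading of the description, not the library's code. Rounding down with `int()`, keeping unlisted rows unchanged when no "other" ratio is given, and the output row order are all assumptions.

```python
import random
from typing import Any


def subsample_by_column(
    rows: list[dict[str, Any]], column: str, ratios: dict[str, float], *, seed: int = 42
) -> list[dict[str, Any]]:
    """Sample a fraction of rows per listed value; pool unlisted values as 'other'."""
    for value, ratio in ratios.items():
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"ratio for {value!r} must be in [0.0, 1.0], got {ratio}")
    rng = random.Random(seed)
    listed = set(ratios) - {"other"}
    groups: dict[str, list[dict[str, Any]]] = {}
    for row in rows:
        value = row[column]  # raises KeyError if the column is missing
        key = value if value in listed else "other"
        groups.setdefault(key, []).append(row)
    sampled: list[dict[str, Any]] = []
    for key, group in groups.items():
        if key in ratios:
            k = int(len(group) * ratios[key])  # assumption: round down
            sampled.extend(rng.sample(group, k))
        else:
            sampled.extend(group)  # assumption: no "other" ratio keeps these rows
    return sampled


table = [{"col1": i, "col2": "abcdefghij"[i % 10]} for i in range(100)]
sub = subsample_by_column(table, "col2", {"a": 0.5, "b": 0.5, "other": 0.1})
```

With 10 rows per letter, "a" and "b" each contribute 5 rows, while the 80 pooled "other" rows contribute 8 in total. Individual unlisted letters may therefore end up with zero rows, which is the consequence of pooling described in the note above.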

upsample_by_column(column, target_counts, *, seed=42)

Upsample rows by column values to target counts with replacement.

For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.

Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.

Parameters:

Name Type Description Default
column str

Column name to group by

required
target_counts dict[str, int]

Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together).

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
DataBackend

New backend with upsampled/downsampled rows

Raises:

Type Description
KeyError

If the specified column does not exist in the DataFrame

ValueError

If any target count is negative
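For a single group, the up/down-sampling rule can be sketched as below. `resample_to_target` is a hypothetical helper. Drawing all target rows with replacement (rather than keeping the originals and topping up) is one possible reading of the description and is an assumption.

```python
import random
from typing import Any


def resample_to_target(group: list[Any], target: int, *, seed: int = 42) -> list[Any]:
    """Resample one group of rows to exactly `target` rows."""
    if target < 0:
        raise ValueError("target count must be non-negative")
    rng = random.Random(seed)
    if len(group) >= target:
        # Enough rows already: downsample without replacement.
        return rng.sample(group, target)
    # Too few rows: draw with replacement (assumes the group is non-empty).
    return rng.choices(group, k=target)


upsampled = resample_to_target(["r0", "r1", "r2"], 10)
downsampled = resample_to_target(["r0", "r1", "r2"], 2)
```

Applying this helper per listed value, and once to the pooled "other" rows, would give the behaviour described above.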

sample_rows(n, *, seed=42, replace=False)

Randomly sample n rows from the data.

Parameters:

Name Type Description Default
n int

Number of rows to sample

required
seed int

Random seed for reproducibility, by default 42

42
replace bool

Whether to sample with replacement, by default False

False

Returns:

Type Description
DataBackend

New backend with sampled rows

copy()

Create a copy of the backend that wraps a copy of the underlying data.

Returns:

Type Description
DataBackend

New backend instance with copied data

apply_fn(fn, **fn_kwargs)

Apply a custom function to the underlying data.

Parameters:

Name Type Description Default
fn Callable

Function to apply. It should accept the underlying data type (e.g., pd.DataFrame, pl.DataFrame) as the first argument.

required
**fn_kwargs Any

Keyword arguments to pass to the function

{}

Returns:

Type Description
DataBackend

New backend wrapping the result of the function application

multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=False)

Create a multilabel column from multiple feature columns.

Parameters:

Name Type Description Default
input_features list[str]

List of input feature column names to combine

required
output_feature str

Name of the output multilabel column

required
label_map dict[str, Any] | None

Optional mapping from input feature values to output labels, by default None

None
allow_missing_labels bool

If True, ignore missing labels in input features, by default False

False

Returns:

Type Description
tuple[DataBackend, dict]

New backend with multilabel column and metadata dictionary

alp_data.backends.PandasBackend

Pandas implementation of the DataBackend protocol.

This backend wraps a pandas DataFrame and provides a unified interface for DataFrame operations that can work across different backend implementations.

Supports both eager (in-memory) and streaming (chunked) modes.

Parameters:

Name Type Description Default
df DataFrame | TextFileReader

The pandas DataFrame to wrap, or TextFileReader for streaming

required
streaming bool

Whether the backend is in streaming mode

False
streaming_chunk_size int

Number of rows per chunk in streaming mode

1000

Examples:

>>> import pandas as pd
>>> from alp_data.backends import PandasBackend
>>> df = pd.DataFrame({"col1": range(100), "col2": list("abcdefghij") * 10})
>>> backend = PandasBackend(df, streaming=False)
>>> backend[0]  # Get first row as dict
{'col1': 0, 'col2': 'a'}
>>> backend[[0, 5, 10]]  # Get rows 0, 5, 10 as new backend
PandasBackend(shape=(3, 2))
>>> filtered = backend.filter_isin("col2", ["a", "b"])
>>> len(filtered)  # Number of rows with col2 in ['a', 'b']
20
>>> backend.columns  # List of column names
['col1', 'col2']
>>> backend.column_exists("col1")  # Check if column exists
True
>>> sub = backend.subsample_by_column("col2", {"a": 0.5, "b": 0.5, "other": 0.1})
>>> counts = sub.unwrap["col2"].count()  # Subsampled counts
>>> assert counts <= 20

columns property

Get the list of column names.

Returns:

Type Description
list[str]

List of column names

is_streaming property

Check if backend is in streaming mode.

Returns:

Type Description
bool

True if in streaming mode, False otherwise

unwrap property

Get the underlying DataFrame object.

Returns:

Type Description
DataFrame

The underlying pandas DataFrame

__getitem__(key)

Get row(s) from the DataFrame using Pythonic indexing.

Parameters:

Name Type Description Default
key int | list[int] | slice
  • int: Get single row as dict
  • list[int]: Get multiple rows as new backend
  • slice: Get row range as new backend
required

Returns:

Type Description
dict[str, Any] | PandasBackend
  • dict if key is int (single row)
  • PandasBackend if key is list or slice (multiple rows)

Raises:

Type Description
IndexError

If index is out of bounds

TypeError

If key type is not supported

RuntimeError

If backend is in streaming mode

Examples:

>>> import pandas as pd
>>> df = pd.DataFrame({"col1": range(100), "col2": list("abcdefghij") * 10})
>>> from alp_data.backends import PandasBackend
>>> backend = PandasBackend(df)
>>> backend[0]  # Get first row as dict
{'col1': 0, 'col2': 'a'}
>>> backend[[0, 5, 10]]  # Get rows 0, 5, 10 as new backend
PandasBackend(shape=(3, 2))
>>> backend[5:]  # Get rows from index 5 to end
PandasBackend(shape=(95, 2))
>>> backend[:10]  # Get first 10 rows
PandasBackend(shape=(10, 2))

__init__(df, *, streaming=False, streaming_chunk_size=1000)

Initialize the backend with a pandas DataFrame.

Parameters:

Name Type Description Default
df DataFrame | TextFileReader

The DataFrame to wrap, or TextFileReader for streaming mode

required
streaming bool

Whether to use streaming mode, by default False

False
streaming_chunk_size int

Number of rows per chunk in streaming mode, by default 1000

1000

__iter__()

Iterate over DataFrame rows as dictionaries.

In streaming mode, yields rows from chunks as they are read. In eager mode, yields rows from the loaded DataFrame.

Yields:

Type Description
dict[str, Any]

Dictionary for each row mapping column names to values
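Chunked iteration can be sketched with the standard library alone. `iter_rows_chunked` is a hypothetical helper that reads at most `chunk_size` rows at a time and yields them one by one. With pandas, the comparable mechanism is `pd.read_csv(..., chunksize=...)`, which returns a TextFileReader that yields DataFrame chunks.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_rows_chunked(handle: TextIO, chunk_size: int = 1000) -> Iterator[dict[str, str]]:
    """Yield CSV rows as dicts while holding at most `chunk_size` rows in memory."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))  # read the next chunk
        if not chunk:
            return  # input exhausted
        yield from chunk


sample_csv = io.StringIO("file,species\na.wav,owl\nb.wav,cat\nc.wav,owl\n")
streamed = list(iter_rows_chunked(sample_csv, chunk_size=2))
```

Note that `csv.DictReader` yields every value as a string. A DataFrame-based reader would infer column types per chunk instead.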

__len__()

Get the number of rows in the DataFrame.

Returns:

Type Description
int

Number of rows

Raises:

Type Description
RuntimeError

If backend is in streaming mode (length unknown until consumed)

__repr__()

Return string representation of the backend.

Returns:

Type Description
str

String representation showing backend type and DataFrame shape

add_column(column, values)

Add a new column to the DataFrame.

Parameters:

Name Type Description Default
column str

Name of the new column

required
values Any

Values for the new column (scalar or array-like)

required

Returns:

Type Description
PandasBackend

New backend with new column added

apply_fn(fn, fn_kwargs, apply_kwargs)

Apply a function to the DataFrame.

Parameters:

Name Type Description Default
fn Callable

Function to apply to the DataFrame. Should accept a DataFrame as the first argument and return a modified DataFrame.

required
apply_kwargs dict

Additional keyword arguments to pass to pandas.DataFrame.apply(), e.g. engine="numba"

required
fn_kwargs Any

Additional keyword arguments to pass to the function

required

Returns:

Type Description
PandasBackend

New backend with modified DataFrame

Raises:

Type Description
ValueError

If the function does not return a pandas DataFrame

column_exists(column)

Check if a column exists in the DataFrame.

Parameters:

Name Type Description Default
column str

Column name to look for

required

Returns:

Type Description
bool

True if column exists, False otherwise

concat(backends, *, ignore_index=True, sort=False) classmethod

Concatenate multiple backend instances vertically (row-wise).

Parameters:

Name Type Description Default
backends list[PandasBackend]

List of backend instances to concatenate

required
ignore_index bool

If True, reset index in result, by default True

True
sort bool

If True, sort columns alphabetically, by default False

False

Returns:

Type Description
PandasBackend

New backend with concatenated data

copy()

Create a copy of the backend with a copied DataFrame.

Returns:

Type Description
PandasBackend

New backend instance with copied DataFrame

drop_duplicates(subset=None, *, keep='first')

Remove duplicate rows from the DataFrame.

Parameters:

Name Type Description Default
subset list[str] | None

Column names to consider for identifying duplicates. If None, use all columns, by default None

None
keep Literal['first', 'last']

Which duplicate to keep, by default "first"

'first'

Returns:

Type Description
PandasBackend

New backend with duplicates removed

dropna(subset=None)

Remove rows with missing values.

Parameters:

Name Type Description Default
subset list[str] | None

Column names to consider for null detection. If None, check all columns, by default None

None

Returns:

Type Description
PandasBackend

New backend with null rows removed

filter_isin(column, values, *, negate=False)

Filter DataFrame rows where column values are in (or not in) a list.

Parameters:

Name Type Description Default
column str

Column name to filter on

required
values list[Any]

List of values to match

required
negate bool

If True, keep rows NOT in values list, by default False

False

Returns:

Type Description
PandasBackend

New backend with filtered DataFrame

from_csv(path, *, streaming=False, streaming_chunk_size=1000, **kwargs) classmethod

Read a CSV file and return a wrapped DataFrame backend.

Parameters:

Name Type Description Default
path str

Path to the CSV file (supports local and cloud paths via cloudpathlib)

required
streaming bool

If True, use streaming mode with chunked reading, by default False

False
streaming_chunk_size int

Number of rows per chunk in streaming mode, by default 1000

1000
**kwargs Any

Additional pandas-specific arguments

{}

Returns:

Type Description
PandasBackend

Backend instance wrapping the loaded DataFrame

from_json(path, *, lines=False, streaming=False, streaming_chunk_size=1000, **kwargs) classmethod

Read a JSON file and return a wrapped DataFrame backend.

Parameters:

Name Type Description Default
path str

Path to the JSON file

required
lines bool

If True, read file as JSON lines (one JSON object per line), by default False

False
streaming bool

If True, use streaming mode with chunked reading, by default False

False
streaming_chunk_size int

Number of rows per chunk in streaming mode, by default 1000

1000
**kwargs Any

Additional pandas-specific arguments

{}

Returns:

Type Description
PandasBackend

Backend instance wrapping the loaded DataFrame

from_parquet(path, *, streaming=False, **kwargs) classmethod

Read a Parquet file and return a wrapped DataFrame backend.

Parameters:

Name Type Description Default
path str

Path to the Parquet file

required
streaming bool

If True, use streaming mode (not supported for parquet in pandas), by default False

False
**kwargs Any

Additional pandas-specific arguments

{}

Returns:

Type Description
PandasBackend

Backend instance wrapping the loaded DataFrame

Note

Pandas does not natively support streaming parquet files. Consider using polars backend for large parquet files.

get_unique(column)

Get sorted unique values from a column.

Parameters:

Name Type Description Default
column str

Column name

required

Returns:

Type Description
list[Any]

Sorted list of unique values (nulls excluded)

histogram(column)

Get value counts (histogram) for a column.

Parameters:

Name Type Description Default
column str

Column name

required

Returns:

Type Description
dict[Any, int]

Dictionary mapping unique values to their counts (nulls excluded)

map_column(column, mapping, output_column)

Create a new column by mapping values from an existing column.

Parameters:

Name Type Description Default
column str

Source column name

required
mapping dict[Any, Any]

Dictionary mapping source values to output values

required
output_column str

Name of the new column to create

required

Returns:

Type Description
PandasBackend

New backend with mapped column added

multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=False)

Create a multi-label column by combining multiple input feature columns. Each row in the output column will contain a sorted list of integer IDs corresponding to the labels found in the specified input feature columns.

Parameters:

Name Type Description Default
input_features list[str]

List of column names to use as sources for labels. Each column can contain single values or lists of values.

required
output_feature str

Name of the output column to store the generated label lists.

required
label_map dict[str, Any] | None

Mapping of unique label values to integer IDs. If None, a mapping will be generated from the unique values in the input features.

None
allow_missing_labels bool

If True, rows with no labels will be included in the output. If False, rows with no labels will be dropped. Default is False.

False

Returns:

Type Description
tuple[PandasBackend, dict]

A tuple containing:
  • New PandasBackend instance with the added multi-label column
  • The label_map used for mapping labels to IDs
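The following sketch illustrates the described behaviour on plain row dicts. It is not the PandasBackend implementation. Assigning IDs from 0 in sorted label order, skipping labels that are absent from a supplied label_map, and collapsing repeated labels are all assumptions made for the example.

```python
from typing import Any, Optional


def multilabel_from_features(
    rows: list[dict[str, Any]],
    input_features: list[str],
    output_feature: str,
    label_map: Optional[dict[str, int]] = None,
    allow_missing_labels: bool = False,
) -> tuple[list[dict[str, Any]], dict[str, int]]:
    def labels_of(row: dict[str, Any]) -> list[str]:
        """Collect labels from all input columns (single values or lists)."""
        found: list[str] = []
        for feature in input_features:
            value = row.get(feature)
            if value is None:
                continue
            found.extend(value if isinstance(value, list) else [value])
        return found

    if label_map is None:
        # Assumption: IDs follow the sorted order of the unique labels.
        unique = sorted({label for row in rows for label in labels_of(row)})
        label_map = {label: idx for idx, label in enumerate(unique)}

    out = []
    for row in rows:
        ids = sorted({label_map[lab] for lab in labels_of(row) if lab in label_map})
        if ids or allow_missing_labels:  # drop unlabeled rows unless allowed
            out.append({**row, output_feature: ids})
    return out, label_map


clips = [
    {"species": "cat", "tags": ["dog", "cat"]},
    {"species": None, "tags": []},
    {"species": "bird", "tags": None},
]
labeled, generated_map = multilabel_from_features(clips, ["species", "tags"], "labels")
```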

rename_columns(mapping)

Rename DataFrame columns.

Parameters:

Name Type Description Default
mapping dict[str, str]

Dictionary mapping old column names to new names

required

Returns:

Type Description
PandasBackend

New backend with renamed columns

sample_rows(n, *, seed=42, replace=False)

Randomly sample n rows from the DataFrame.

Parameters:

Name Type Description Default
n int

Number of rows to sample

required
seed int

Random seed for reproducibility, by default 42

42
replace bool

Whether to sample with replacement, by default False

False

Returns:

Type Description
PandasBackend

New backend with sampled rows

select_columns(columns)

Select a subset of columns from the DataFrame.

Parameters:

Name Type Description Default
columns list[str]

List of column names to keep

required

Returns:

Type Description
PandasBackend

New backend with only specified columns

subsample_by_column(column, ratios, *, seed=42)

Subsample rows by column values with specified ratios.

For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.

Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.

Parameters:

Name Type Description Default
column str

Column name to group by

required
ratios dict[str, float]

Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together).

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
PandasBackend

New backend with subsampled rows

Raises:

Type Description
KeyError

If the specified column does not exist in the DataFrame

ValueError

If any ratio is negative or greater than 1.0
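
The pooling rule for "other" is easy to misread, so here is a minimal pure-Python sketch of the semantics described above. It works on a list of dict rows rather than a DataFrame. The helper name subsample_row_dicts is hypothetical, and two details are assumptions: group sizes are rounded down with int(), and unlisted values pass through unchanged when no "other" key is given.

```python
import random
from collections import defaultdict


def subsample_row_dicts(rows, column, ratios, seed=42):
    """Hypothetical stand-in for subsample_by_column on a list of dicts."""
    if any(ratio < 0.0 or ratio > 1.0 for ratio in ratios.values()):
        raise ValueError("ratios must lie between 0.0 and 1.0")
    if rows and column not in rows[0]:
        raise KeyError(column)

    rng = random.Random(seed)
    listed = {key for key in ratios if key != "other"}
    per_value = defaultdict(list)  # one group per explicitly listed value
    pooled = []                    # ALL unlisted values share this one group
    untouched = []                 # assumption: kept as-is without "other"

    for row in rows:
        value = row[column]
        if value in listed:
            per_value[value].append(row)
        elif "other" in ratios:
            pooled.append(row)
        else:
            untouched.append(row)

    result = list(untouched)
    for value, group in per_value.items():
        # Assumption: the sample size is rounded down.
        result.extend(rng.sample(group, int(len(group) * ratios[value])))
    if pooled:
        result.extend(rng.sample(pooled, int(len(pooled) * ratios["other"])))
    return result


species = ["cat"] * 10 + ["dog"] * 4 + ["fish"] * 3 + ["bird"] * 3
rows = [{"id": i, "species": s} for i, s in enumerate(species)]
sampled = subsample_row_dicts(rows, "species", {"cat": 0.5, "other": 0.5})
```

Here the 10 pooled dog, fish and bird rows contribute 5 rows in total. Applying the 0.5 ratio per unlisted category with the same rounding would instead give 2 + 1 + 1 = 4.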

upsample_by_column(column, target_counts, *, seed=42)

Upsample rows by column values to target counts with replacement.

For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.

Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.

Parameters:

Name Type Description Default
column str

Column name to group by

required
target_counts dict[str, int]

Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together).

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
PandasBackend

New backend with upsampled/downsampled rows

Raises:

Type Description
KeyError

If the specified column does not exist in the DataFrame

ValueError

If any target count is negative

TypeError

If any target count is not an integer
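
The up/down-sampling rule can be sketched the same way in plain Python. This is not the library's code: the helper name upsample_row_dicts is hypothetical, and the sketch assumes that upsampling keeps every original row and draws only the shortfall with replacement, and that unlisted values pass through unchanged when no "other" key is given.

```python
import random
from collections import defaultdict


def upsample_row_dicts(rows, column, target_counts, seed=42):
    """Hypothetical stand-in for upsample_by_column on a list of dicts."""
    for target in target_counts.values():
        if not isinstance(target, int):
            raise TypeError("target counts must be integers")
        if target < 0:
            raise ValueError("target counts must be non-negative")

    rng = random.Random(seed)
    listed = {key for key in target_counts if key != "other"}
    groups = defaultdict(list)
    untouched = []  # assumption: kept as-is when "other" is absent

    for row in rows:
        value = row[column]
        if value in listed:
            groups[value].append(row)
        elif "other" in target_counts:
            groups["other"].append(row)  # unlisted values are pooled
        else:
            untouched.append(row)

    result = list(untouched)
    for key, group in groups.items():
        target = target_counts[key]
        if len(group) >= target:
            # Enough rows already: downsample without replacement.
            result.extend(rng.sample(group, target))
        else:
            # Assumption: keep the originals, then draw the shortfall
            # with replacement.
            result.extend(group)
            result.extend(rng.choices(group, k=target - len(group)))
    return result


species = ["cat"] * 2 + ["dog"] * 6 + ["fish", "bird"]
rows = [{"id": i, "species": s} for i, s in enumerate(species)]
balanced = upsample_row_dicts(rows, "species", {"cat": 5, "dog": 3, "other": 4})
```

The result holds 5 cat rows (2 originals plus 3 repeats), 3 dog rows, and 4 rows drawn from the pooled fish and bird rows.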

alp_data.backends.PolarsBackend

Polars implementation of the DataFrameBackend protocol.

This backend wraps a polars DataFrame or LazyFrame and provides a unified interface for DataFrame operations that can work across different backend implementations.

Supports both eager (DataFrame) and streaming (LazyFrame) modes.

Parameters:

Name Type Description Default
df DataFrame | LazyFrame

The polars DataFrame or LazyFrame to wrap

required
streaming bool

Whether the backend is in streaming mode (LazyFrame)

False
streaming_chunk_size int

Number of rows per batch when iterating in streaming mode (default: 1000). 1000 is a reasonable value: it is high enough to reduce I/O overhead, and larger values bring little benefit because the main source of latency in Dataset __getitem__ calls is audio loading.

1000

Examples:

>>> import polars as pl
>>> from alp_data.backends import PolarsBackend
>>> df = pl.DataFrame({
...     "species": ["cat", "dog", "fish", "cat", "dog", None],
...     "count": [5, 3, 8, 2, 7, 1]
... })
>>> backend = PolarsBackend(df)
>>> row = backend[0]
>>> filtered = backend.filter_isin("species", ["cat", "dog"])
>>> assert filtered.unwrap["species"].to_list() == ["cat", "dog", "cat", "dog"]
>>> # Streaming mode with LazyFrame
>>> df = pl.LazyFrame({
...     "species": ["cat", "dog", "fish", "cat", "dog", None],
...     "count": [5, 3, 8, 2, 7, 1]
... })
>>> backend = PolarsBackend(df, streaming=True)
>>> assert isinstance(backend.unwrap, pl.LazyFrame)
>>> print(backend.columns)
['species', 'count']
>>> for row in backend:
...     print(row)
...     break
{'species': 'cat', 'count': 5}
>>> collected = backend.collect()
>>> assert isinstance(collected.unwrap, pl.DataFrame)

columns property

Get the list of column names.

Returns:

Type Description
list[str]

List of column names

is_streaming property

Check if backend is in streaming mode.

Returns:

Type Description
bool

True if in streaming mode (LazyFrame), False otherwise

unwrap property

Get the underlying DataFrame object.

Returns:

Type Description
DataFrame | LazyFrame

The underlying polars DataFrame or LazyFrame

__getitem__(key)

Get row(s) from the DataFrame using Pythonic indexing.

Parameters:

Name Type Description Default
key int | list[int] | slice
  • int: Get single row as dict
  • list[int]: Get multiple rows as new backend
  • slice: Get row range as new backend
required

Returns:

Type Description
dict[str, Any] | PolarsBackend
  • dict if key is int (single row)
  • PolarsBackend if key is list or slice (multiple rows)

Raises:

Type Description
IndexError

If index is out of bounds

TypeError

If key type is not supported
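
The key-type dispatch described above can be illustrated with a small stand-in class. RowStore is hypothetical and wraps a plain list of dicts; it mirrors the documented return types and exceptions but makes no claim about how PolarsBackend indexes its DataFrame.

```python
class RowStore:
    """Hypothetical stand-in for a backend; wraps a list of dict rows."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, key):
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(key, int) and not isinstance(key, bool):
            return dict(self._rows[key])  # raises IndexError if out of bounds
        if isinstance(key, slice):
            return RowStore(self._rows[key])
        if isinstance(key, list):
            return RowStore([self._rows[i] for i in key])
        raise TypeError(f"Unsupported key type: {type(key).__name__}")


store = RowStore(
    [
        {"species": "cat", "count": 5},
        {"species": "dog", "count": 3},
        {"species": "fish", "count": 8},
    ]
)
row = store[0]          # int   -> single row as a dict
picked = store[[0, 2]]  # list  -> new RowStore with two rows
head = store[:2]        # slice -> new RowStore with two rows
```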

__init__(df, *, streaming=False, streaming_chunk_size=1000)

Initialize the backend with a polars DataFrame or LazyFrame.

Parameters:

Name Type Description Default
df DataFrame | LazyFrame

The DataFrame or LazyFrame to wrap

required
streaming bool

Whether to use streaming mode (LazyFrame), by default False

False
streaming_chunk_size int

Number of rows per batch when iterating in streaming mode, by default 1000

1000

__iter__()

Iterate over DataFrame rows as dictionaries.

In streaming mode (LazyFrame), uses LazyFrame.collect_batches() to materialize the query one chunk at a time, so the full result never needs to live in memory at once. In eager mode (DataFrame), yields rows directly.

Yields:

Type Description
dict[str, Any]

Dictionary for each row mapping column names to values
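
The memory behaviour of chunked row iteration can be sketched without polars. In this illustration a plain generator stands in for the lazy query and itertools.islice stands in for the batch collection; the names read_chunks, iter_rows and row_source are hypothetical.

```python
from itertools import islice


def read_chunks(source, chunk_size):
    """Yield lists of at most chunk_size rows; one chunk is held at a time."""
    iterator = iter(source)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def iter_rows(source, chunk_size=1000):
    """Yield rows one by one while materializing only the current chunk."""
    for chunk in read_chunks(source, chunk_size):
        yield from chunk


produced = []  # records which rows the source was actually asked for


def row_source(n):
    for i in range(n):
        produced.append(i)
        yield {"id": i}


# Taking a single row forces only the first chunk, not all 10,000 rows.
first = next(iter_rows(row_source(10_000), chunk_size=1000))
```

After the single next() call, produced holds 1000 entries: the first chunk was materialized and the remaining rows were never requested.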

__len__()

Get the number of rows in the DataFrame.

Returns:

Type Description
int

Number of rows

__repr__()

Return string representation of the backend.

Returns:

Type Description
str

String representation showing backend type and DataFrame shape

add_column(column, values)

Add a new column to the DataFrame.

Parameters:

Name Type Description Default
column str

Name of the new column

required
values Any

Values for the new column (scalar or array-like)

required

Returns:

Type Description
PolarsBackend

New backend with new column added

apply_fn(fn, fn_kwargs, apply_kwargs)

Apply a custom function to rows and create a new column.

Parameters:

Name Type Description Default
fn Any

Function to apply to each row. Should accept a dict of column values.

required
fn_kwargs dict

Additional keyword arguments to pass to the function

required
apply_kwargs dict

Additional keyword arguments to pass to polars.DataFrame.map_rows()

required

Returns:

Type Description
PolarsBackend

New backend with the new column added

Notes

This method collects the DataFrame if in streaming mode, as polars does not support arbitrary row-wise functions in LazyFrame. The returned backend will be in eager mode (streaming=False).

collect()

Materialize the LazyFrame and return an eager backend.

This method collects the LazyFrame into a DataFrame and returns a new backend with streaming mode disabled.

Returns:

Type Description
PolarsBackend

New backend in eager mode with materialized DataFrame

Notes

If the backend is already in eager mode, returns a copy of the backend.

column_exists(column)

Check if a column exists in the DataFrame.

Parameters:

Name Type Description Default
column str

Column name to look for

required

Returns:

Type Description
bool

True if column exists, False otherwise

concat(backends, *, ignore_index=True, sort=False) classmethod

Concatenate multiple backend instances vertically (row-wise).

Parameters:

Name Type Description Default
backends list[PolarsBackend]

List of backend instances to concatenate

required
sort bool

If True, sort columns alphabetically, by default False

False

Returns:

Type Description
PolarsBackend

New backend with concatenated data

copy()

Create a copy of the backend with a copied DataFrame.

Returns:

Type Description
PolarsBackend

New backend instance with copied DataFrame

drop_duplicates(subset=None, *, keep='first')

Remove duplicate rows from the DataFrame.

Parameters:

Name Type Description Default
subset list[str] | None

Column names to consider for identifying duplicates. If None, use all columns, by default None

None
keep Literal['first', 'last']

Which duplicate to keep, by default "first"

'first'

Returns:

Type Description
PolarsBackend

New backend with duplicates removed

Notes

In streaming mode (LazyFrame), this operation preserves the lazy computation. Call .collect() to materialize the deduplicated result into an eager backend.
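
How subset and keep interact is shown below in a pure-Python sketch on a list of dict rows. The helper name drop_duplicate_rows is hypothetical, and preserving the original row order in the output is an assumption of the sketch, not something stated for the backend.

```python
def drop_duplicate_rows(rows, subset=None, keep="first"):
    """Hypothetical stand-in for drop_duplicates on a list of dicts."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    # Scanning in reverse turns "keep the last" into "keep the first seen".
    ordered = rows if keep == "first" else list(reversed(rows))
    seen = set()
    kept = []
    for row in ordered:
        columns = subset if subset is not None else list(row)
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    # Assumption: the output keeps the original row order.
    return kept if keep == "first" else list(reversed(kept))


rows = [
    {"file": "a.wav", "label": "cat"},
    {"file": "a.wav", "label": "dog"},
    {"file": "b.wav", "label": "cat"},
]
by_file_first = drop_duplicate_rows(rows, subset=["file"])
by_file_last = drop_duplicate_rows(rows, subset=["file"], keep="last")
all_columns = drop_duplicate_rows(rows)
```

With subset=["file"], the two a.wav rows count as duplicates and keep decides which one survives. With subset=None, all three rows differ and nothing is removed.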

dropna(subset=None)

Remove rows with missing values.

Parameters:

Name Type Description Default
subset list[str] | None

Column names to consider for null detection. If None, check all columns, by default None

None

Returns:

Type Description
PolarsBackend

New backend with null rows removed

Notes

In streaming mode (LazyFrame), this operation preserves the lazy computation. Call .collect() to materialize the cleaned result into an eager backend.

filter_isin(column, values, *, negate=False)

Filter DataFrame rows where column values are in (or not in) a list.

Parameters:

Name Type Description Default
column str

Column name to filter on

required
values list[Any]

List of values to match

required
negate bool

If True, keep rows NOT in values list, by default False

False

Returns:

Type Description
PolarsBackend

New backend with filtered DataFrame

Notes

In streaming mode (LazyFrame), this operation preserves the lazy computation. Call .collect() to materialize the filtered result into an eager backend.

from_csv(path, *, streaming=False, **kwargs) classmethod

Read a CSV file and return a wrapped DataFrame backend.

Parameters:

Name Type Description Default
path str

Path to the CSV file (supports local and cloud paths via cloudpathlib)

required
streaming bool

If True, use streaming mode with LazyFrame, by default False

False
**kwargs Any

Additional polars-specific arguments

{}

Returns:

Type Description
PolarsBackend

Backend instance wrapping the loaded DataFrame or LazyFrame

from_json(path, *, lines=False, streaming=False, **kwargs) classmethod

Read a JSON file and return a wrapped DataFrame backend.

Parameters:

Name Type Description Default
path str

Path to the JSON file

required
lines bool

If True, read file as JSON lines (one JSON object per line), by default False

False
streaming bool

If True, use streaming mode with LazyFrame, by default False

False
**kwargs Any

Additional polars-specific arguments

{}

Returns:

Type Description
PolarsBackend

Backend instance wrapping the loaded DataFrame

from_parquet(path, *, streaming=False, **kwargs) classmethod

Read a Parquet file and return a wrapped DataFrame backend.

Parameters:

Name Type Description Default
path str

Path to the Parquet file

required
streaming bool

If True, use streaming mode with LazyFrame, by default False

False
**kwargs Any

Additional polars-specific arguments

{}

Returns:

Type Description
PolarsBackend

Backend instance wrapping the loaded DataFrame or LazyFrame

get_unique(column)

Get sorted unique values from a column.

Parameters:

Name Type Description Default
column str

Column name

required

Returns:

Type Description
list[Any]

Sorted list of unique values (nulls excluded)

Notes

In streaming mode (LazyFrame), materializes the full column to compute uniques. A UserWarning is emitted because this forces collection of the underlying query.

histogram(column)

Get value counts (histogram) for a column.

Parameters:

Name Type Description Default
column str

Column name

required

Returns:

Type Description
dict[Any, int]

Dictionary mapping unique values to their counts (nulls excluded)

Notes

In streaming mode (LazyFrame), materializes the full column to compute counts. A UserWarning is emitted because this forces collection of the underlying query.
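
The return shapes of histogram and get_unique can be illustrated with the species column from the class example. This sketch uses collections.Counter on a plain list; the helper names value_counts and unique_values are hypothetical.

```python
from collections import Counter


def value_counts(values):
    """Map each non-null value to its number of occurrences."""
    return dict(Counter(v for v in values if v is not None))


def unique_values(values):
    """Return the sorted distinct values, with nulls excluded."""
    return sorted({v for v in values if v is not None})


species = ["cat", "dog", "fish", "cat", "dog", None]
counts = value_counts(species)
uniques = unique_values(species)
```

The None entry appears in neither result: counts has the keys cat, dog and fish, and uniques lists the same three values in sorted order.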

iter_batches(batch_size=1000)

Iterate over DataFrame in batches.

Parameters:

Name Type Description Default
batch_size int

Number of rows per batch, by default 1000

1000

Yields:

Type Description
PolarsBackend

Backend instances wrapping batches of up to batch_size rows. Yielded backends are always in eager mode.

Notes

In streaming mode, uses LazyFrame.collect_batches(chunk_size=batch_size) to produce batches incrementally, so the full result never needs to live in memory at once. Polars treats chunk_size as a hint rather than an exact size, so individual batches may contain fewer rows than batch_size; do not rely on every batch having exactly batch_size rows.
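
Because batch sizes are not guaranteed, code that needs fixed-size groups has to regroup the batches itself. The sketch below shows one way to do that; plain lists stand in for the yielded backends, and the helper name rebatch is hypothetical.

```python
def rebatch(batches, size):
    """Regroup variable-length batches into lists of exactly `size` rows.

    Only the final list may be shorter than `size`.
    """
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[size:]
    if buffer:
        yield buffer


# Uneven batches, as a streaming source might produce them.
uneven = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
fixed = list(rebatch(uneven, size=4))
```

The three uneven batches become two full groups of four rows and one trailing group with the remaining row.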

map_column(column, mapping, output_column, *, default=None)

Create a new column by mapping values from an existing column.

Parameters:

Name Type Description Default
column str

Source column name

required
mapping dict[Any, Any]

Dictionary mapping source values to output values

required
output_column str

Name of the new column to create

required
default Any

Value to use for unmapped keys, by default None

None

Returns:

Type Description
PolarsBackend

New backend with mapped column added
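
The role of default is shown below in a pure-Python sketch on a list of dict rows. The helper name map_row_values is hypothetical; the sketch only mirrors the documented behaviour of keeping the source column and filling unmapped keys with default.

```python
def map_row_values(rows, column, mapping, output_column, default=None):
    """Hypothetical stand-in for map_column on a list of dicts."""
    return [
        # The source column is kept; the mapped value goes in a new column.
        {**row, output_column: mapping.get(row[column], default)}
        for row in rows
    ]


rows = [{"species": "cat"}, {"species": "dog"}, {"species": "fish"}]
mapping = {"cat": "mammal", "dog": "mammal"}
with_default = map_row_values(rows, "species", mapping, "group", default="unknown")
without_default = map_row_values(rows, "species", mapping, "group")
```

The fish row has no entry in mapping, so it receives "unknown" in the first call and None in the second.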

multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=True)

Create a multi-label column by combining multiple input feature columns. Each row in the output column will contain a sorted list of integer IDs corresponding to the labels found in the specified input feature columns.

Parameters:

Name Type Description Default
input_features list[str]

List of column names to use as sources for labels. Each column can contain single values or lists of values.

required
output_feature str

Name of the output column to store the generated label lists.

required
label_map dict[str, Any] | None

Mapping of unique label values to integer IDs. If None, a mapping will be generated from the unique values in the input features.

None
allow_missing_labels bool

If True, rows with no labels will be included in the output. If False, rows with no labels will be dropped. Default is True.

True

Returns:

Type Description
tuple[PolarsBackend, dict]

A tuple containing:
  • New PolarsBackend instance with the added multi-label column
  • The label_map used for mapping labels to IDs

Raises:

Type Description
ValueError

If any input feature does not exist or is not of type List.
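
The following pure-Python sketch illustrates how a multi-label column and its label_map relate. It is not the backend's implementation. The helper name build_multilabel is hypothetical, and several details are assumptions: a generated label_map assigns IDs in sorted label order, labels missing from a supplied label_map are ignored, and scalar cells are accepted alongside lists.

```python
def build_multilabel(rows, input_features, output_feature,
                     label_map=None, allow_missing_labels=True):
    """Hypothetical stand-in for multilabel_from_features on a list of dicts."""

    def labels_of(row):
        found = set()
        for feature in input_features:
            value = row[feature]
            if value is None:
                continue
            # Assumption: a scalar cell is treated as a one-element list.
            found.update(value if isinstance(value, list) else [value])
        return found

    if label_map is None:
        # Assumption: IDs follow the sorted order of the unique labels.
        vocabulary = sorted(set().union(*(labels_of(row) for row in rows)))
        label_map = {label: idx for idx, label in enumerate(vocabulary)}

    result = []
    for row in rows:
        ids = sorted(label_map[label] for label in labels_of(row)
                     if label in label_map)
        if not ids and not allow_missing_labels:
            continue  # drop rows that carry no label at all
        result.append({**row, output_feature: ids})
    return result, label_map


rows = [
    {"species": ["cat", "dog"], "call": ["meow"]},
    {"species": ["dog"], "call": []},
    {"species": [], "call": []},
]
labelled, label_map = build_multilabel(rows, ["species", "call"], "labels")
strict, _ = build_multilabel(rows, ["species", "call"], "labels",
                             allow_missing_labels=False)
```

With allow_missing_labels=True the third row is kept with an empty label list. With allow_missing_labels=False it is dropped.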

rename_columns(mapping)

Rename DataFrame columns.

Parameters:

Name Type Description Default
mapping dict[str, str]

Dictionary mapping old column names to new names

required

Returns:

Type Description
PolarsBackend

New backend with renamed columns

sample_rows(n, *, seed=42, replace=False)

Randomly sample n rows from the DataFrame.

Parameters:

Name Type Description Default
n int

Number of rows to sample

required
seed int

Random seed for reproducibility, by default 42

42
replace bool

Whether to sample with replacement, by default False

False

Returns:

Type Description
PolarsBackend

New backend with sampled rows

select_columns(columns)

Select a subset of columns from the DataFrame.

Parameters:

Name Type Description Default
columns list[str]

List of column names to keep

required

Returns:

Type Description
PolarsBackend

New backend with only specified columns

subsample_by_column(column, ratios, *, seed=42)

Subsample rows by column values with specified ratios.

For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.

If the backend is in streaming mode, a UserWarning will be issued and the LazyFrame will be collected since sampling requires materialization.

Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.

Parameters:

Name Type Description Default
column str

Column name to group by

required
ratios dict[str, float]

Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together).

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
PolarsBackend

New backend with subsampled rows

Raises:

Type Description
KeyError

If the specified column does not exist in the DataFrame

ValueError

If any ratio is negative or greater than 1.0

upsample_by_column(column, target_counts, *, seed=42)

Upsample rows by column values to target counts with replacement.

For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.

If the backend is in streaming mode, a UserWarning will be issued and the LazyFrame will be collected since sampling requires materialization.

Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.

Parameters:

Name Type Description Default
column str

Column name to group by

required
target_counts dict[str, int]

Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together).

required
seed int

Random seed for reproducibility, by default 42

42

Returns:

Type Description
PolarsBackend

New backend with upsampled/downsampled rows

Raises:

Type Description
KeyError

If the specified column does not exist in the DataFrame

ValueError

If any target count is negative

TypeError

If any target count is not an integer

alp_data.backends.get_backend(backend)

Get the backend class for the specified backend type.

Parameters:

Name Type Description Default
backend BackendType

Name of the backend ("pandas" or "polars")

required

Returns:

Type Description
Type[DataBackend | StreamingDataBackend]

The backend class (not an instance)

Raises:

Type Description
ValueError

If the backend name is not recognized

Examples:

>>> backend_cls = get_backend("pandas")
>>> assert backend_cls is PandasBackend
>>> backend_cls = get_backend("polars")
>>> assert backend_cls is PolarsBackend
Source code in alp_data/backends/backends.py
def get_backend(backend: BackendType) -> Type[DataBackend | StreamingDataBackend]:
    """Get the backend class for the specified backend type.

    Parameters
    ----------
    backend : BackendType
        Name of the backend ("pandas" or "polars")

    Returns
    -------
    Type[DataBackend | StreamingDataBackend]
        The backend class (not an instance)

    Raises
    ------
    ValueError
        If the backend name is not recognized

    Examples
    --------
    >>> backend_cls = get_backend("pandas")
    >>> assert backend_cls is PandasBackend
    >>> backend_cls = get_backend("polars")
    >>> assert backend_cls is PolarsBackend
    """
    if backend not in _BACKEND_REGISTRY:
        raise ValueError(
            f"Unknown backend: {backend}. Supported backends: {list(_BACKEND_REGISTRY.keys())}"
        )
    return _BACKEND_REGISTRY[backend]
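
The source above relies on a module-level _BACKEND_REGISTRY whose definition is not shown. The sketch below assumes it is a plain dict from backend name to class and uses empty placeholder classes, so it illustrates the lookup pattern and the error message rather than the actual alp_data module.

```python
class PandasBackend:
    """Placeholder for the real backend class."""


class PolarsBackend:
    """Placeholder for the real backend class."""


# Assumption: the registry is a simple name -> class mapping.
_BACKEND_REGISTRY = {"pandas": PandasBackend, "polars": PolarsBackend}


def get_backend(backend):
    if backend not in _BACKEND_REGISTRY:
        raise ValueError(
            f"Unknown backend: {backend}. "
            f"Supported backends: {list(_BACKEND_REGISTRY.keys())}"
        )
    return _BACKEND_REGISTRY[backend]


backend_cls = get_backend("polars")  # the class itself, not an instance

try:
    get_backend("duckdb")
except ValueError as exc:
    message = str(exc)
```

Because get_backend returns a class, callers still construct the backend themselves, for example through one of the from_csv, from_json or from_parquet classmethods documented above.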