alp_data.backends Module
What are data backends?
A DataBackend is a Python Protocol that defines an interface for any library that is used to read data and perform common operations on data. It's an abstraction that allows alp-data to support multiple data libraries without being tightly coupled to any specific one.
pandas is one such library, and its backend class is alp_data.backends.PandasBackend. Up to version 1.3.0, every Dataset and Transform class in alp-data used pandas to load the underlying annotation CSV / JSONL files, which meant calling functions such as pd.read_csv and manipulating pandas DataFrames directly inside class methods. Backends reduce this dependence on pandas and leave room for other libraries such as polars, duckdb, pyarrow, or webdataset, each of which has its own strengths and weaknesses for different ML use cases.
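The idea of coding against a protocol rather than a concrete library can be sketched with the standard library alone. The names `MiniBackend`, `ListBackend`, and `count_rows` below are hypothetical and are not part of alp-data; the real DataBackend protocol has many more members.

```python
from typing import Any, Iterator, Protocol, runtime_checkable


@runtime_checkable
class MiniBackend(Protocol):
    """Hypothetical, trimmed-down stand-in for the DataBackend protocol."""

    @property
    def columns(self) -> list[str]: ...

    def __len__(self) -> int: ...

    def __iter__(self) -> Iterator[dict[str, Any]]: ...


class ListBackend:
    """Toy backend wrapping a plain list of row dicts (no pandas or polars)."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    @property
    def columns(self) -> list[str]:
        return list(self._rows[0]) if self._rows else []

    def __len__(self) -> int:
        return len(self._rows)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        return iter(self._rows)


def count_rows(backend: MiniBackend) -> int:
    # Depends only on the protocol, not on a concrete data library.
    return len(backend)


toy = ListBackend([{"species": "cat"}, {"species": "dog"}])
```

`ListBackend` never inherits from `MiniBackend`; it satisfies the protocol structurally, which is what lets a new data library be plugged in without touching callers such as `count_rows`.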
Available Backends
Currently, alp-data provides two backend implementations:
| Backend | Class | Description |
|---|---|---|
| pandas | PandasBackend | Uses pandas DataFrames for data operations. Supports streaming via chunked reading. |
| polars | PolarsBackend | Uses polars DataFrames/LazyFrames. Supports streaming via LazyFrame for memory-efficient processing. |
How to Use Backends
Specifying a Backend in Dataset Configuration
When loading a dataset, you can specify which backend to use via the backend parameter:
```python
from alp_data.datasets import BirdSet

# Use polars backend (default)
dataset = BirdSet(split="HSN-train", backend="polars")

# Use pandas backend
dataset = BirdSet(split="HSN-train", backend="pandas")
```
Or via YAML configuration:
Direct Backend Usage
You can also use backends directly for standalone data operations:
```python
from alp_data.backends import PandasBackend, PolarsBackend

# Load data with PandasBackend
backend = PandasBackend.from_csv("path/to/data.csv")
print(len(backend))  # Number of rows
print(backend.columns)  # Column names

# Load data with PolarsBackend
backend = PolarsBackend.from_parquet("path/to/data.parquet")
row = backend[0]  # Get first row as dict
```
Streaming Mode
Both backends support streaming mode for memory-efficient processing of large datasets:
```python
from alp_data.datasets import BirdSet

# Enable streaming mode
dataset = BirdSet(split="HSN-train", backend="polars", streaming=True)

# Iterate over rows without loading entire dataset into memory
for sample in dataset:
    process(sample)  # process is a placeholder for your own per-sample logic
```
Streaming Limitations
In streaming mode, __getitem__ indexing is disabled. Use iteration instead. Additionally, len() is not available until the stream is consumed.
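These restrictions can be illustrated with a small standard-library sketch. `StreamingRows` is a hypothetical class, not the alp-data implementation; it only mirrors the documented behaviour that iteration works while indexing and `len()` raise.

```python
import csv
import io
from typing import Any, Iterator


class StreamingRows:
    """Hypothetical streaming wrapper: iteration only, no indexing or len()."""

    def __init__(self, handle: io.TextIOBase) -> None:
        self._reader = csv.DictReader(handle)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        # Rows are parsed lazily, one at a time, as the caller iterates.
        for row in self._reader:
            yield dict(row)

    def __getitem__(self, key: int) -> dict[str, Any]:
        raise RuntimeError("indexing is disabled in streaming mode; iterate instead")

    def __len__(self) -> int:
        raise RuntimeError("length is unknown in streaming mode")


stream = StreamingRows(io.StringIO("species,count\ncat,3\ndog,5\n"))
species = [row["species"] for row in stream]
```

If you need the row count in streaming mode, count rows while iterating rather than calling `len()`.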
Accessing the Underlying Data Object
If you need to perform library-specific operations, use the unwrap property to access the underlying data object:
```python
from alp_data.backends import PandasBackend

backend = PandasBackend.from_csv("data.csv")
df = backend.unwrap  # Returns pd.DataFrame
```
For more details, see Accessing the Underlying Data.
Pandas vs Polars: When to Use Which
| Use Case | Recommended Backend | Reason |
|---|---|---|
| Small to medium datasets | Either | Both perform well |
| Large datasets (> 1 GB) | polars | Better memory efficiency and performance |
| Streaming/lazy evaluation | polars | Native LazyFrame support |
| Compatibility with existing pandas code | pandas | Direct access to pd.DataFrame via unwrap |
| Parquet file streaming | polars | Pandas doesn't support streaming parquet |
Accessing the Underlying Data
If you need to perform operations not covered by the backend interface, use the unwrap property:
```python
import polars as pl

from alp_data.backends import PandasBackend, PolarsBackend

# Pandas backend
pandas_backend = PandasBackend.from_csv("data.csv")
df = pandas_backend.unwrap  # Returns pd.DataFrame
# Now use pandas-specific operations
df.describe()

# Polars backend
polars_backend = PolarsBackend.from_csv("data.csv")
df = polars_backend.unwrap  # Returns pl.DataFrame or pl.LazyFrame
# Now use polars-specific operations
df.select(pl.col("species").value_counts())
```
Tip
When using unwrap, be aware that you lose the backend abstraction. Operations on the unwrapped object won't automatically work with other backends.
The DataBackend Protocol
The DataBackend protocol defines a common interface that all backend implementations must follow. This enables alp-data to work uniformly with different data libraries.
Core Interface
The protocol defines these key operations:
Data Loading (Class Methods)
| Method | Description |
|---|---|
| from_csv(path, streaming=False) | Load data from a CSV file |
| from_json(path, lines=False, streaming=False) | Load data from a JSON file (supports JSON lines format) |
| from_parquet(path, streaming=False) | Load data from a Parquet file |
Data Access
| Method | Description |
|---|---|
| \_\_getitem\_\_(key) | Get row(s) by index (int returns dict, list/slice returns new backend) |
| \_\_len\_\_() | Get number of rows |
| \_\_iter\_\_() | Iterate over rows as dictionaries |
| columns | Property returning list of column names |
| column_exists(column) | Check if a column exists |
| unwrap | Property returning the underlying data object (e.g., pd.DataFrame, pl.DataFrame) |
Data Manipulation
| Method | Description |
|---|---|
| filter_isin(column, values, negate=False) | Filter rows by column values |
| drop_duplicates(subset=None, keep="first") | Remove duplicate rows |
| dropna(subset=None) | Remove rows with missing values |
| get_unique(column) | Get sorted unique values from a column |
| map_column(column, mapping, output_column) | Create new column by mapping values |
| rename_columns(mapping) | Rename columns |
| add_column(column, values) | Add a new column |
| select_columns(columns) | Select subset of columns |
| concat(backends, ignore_index=True) | Concatenate multiple backends vertically |
Sampling
| Method | Description |
|---|---|
| sample_rows(n, seed=42, replace=False) | Randomly sample n rows |
| subsample_by_column(column, ratios, seed=42) | Subsample by column values with specified ratios |
Advanced Operations
| Method | Description |
|---|---|
| copy() | Create a copy of the backend |
| apply_fn(fn, fn_kwargs, apply_kwargs) | Apply a custom function to the data |
| multilabel_from_features(input_features, output_feature, ...) | Create multilabel column from multiple features |
Backend Integration in alp-data
Integration with Datasets
All dataset classes use backends internally to manage their data. The backend is selected at instantiation time:
```python
# alp_data/dataset.py
class Dataset(ABC):
    def __init__(
        self,
        output_take_and_give: dict[str, str] = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        self._backend_class = get_backend(backend)
```
Datasets then use the backend class to load data:
```python
# Example from BirdSet._load()
self._data = self._backend_class.from_json(
    location, lines=True, streaming=self._streaming
)
```
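How a string such as "polars" might be resolved to a backend class can be sketched as a simple registry lookup. This is an assumption about the shape of `get_backend`, not its actual implementation; the stub classes and the `_BACKENDS` dictionary are hypothetical.

```python
from typing import Literal

BackendType = Literal["pandas", "polars"]


class PandasBackendStub:
    """Placeholder standing in for alp_data.backends.PandasBackend."""


class PolarsBackendStub:
    """Placeholder standing in for alp_data.backends.PolarsBackend."""


# Hypothetical registry; the real lookup inside alp-data may differ.
_BACKENDS: dict[str, type] = {
    "pandas": PandasBackendStub,
    "polars": PolarsBackendStub,
}


def get_backend(name: str) -> type:
    """Return the backend class registered under `name`."""
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {sorted(_BACKENDS)}"
        ) from None


backend_class = get_backend("polars")
```

Because the dataset stores the class rather than an instance, it can later call whichever loader classmethod matches the file it needs to read.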
Integration with Transforms
Transforms operate directly on backend instances rather than raw DataFrames. This makes transforms backend-agnostic:
```python
from alp_data.transforms import Filter
from alp_data.backends import PandasBackend

# Create backend
backend = PandasBackend.from_csv("data.csv")

# Apply transform - works with any backend
filter_transform = Filter(property="species", values=["cat", "dog"], mode="include")
filtered_backend, metadata = filter_transform(backend)
```
Transforms use the backend's methods rather than library-specific operations:
```python
# alp_data/transforms/filter.py
class Filter:
    def __call__(self, backend: DataBackend) -> tuple[DataBackend, dict]:
        # Uses backend.filter_isin() instead of pandas-specific code
        negate = self.mode == "exclude"
        filtered_backend = backend.filter_isin(self.property, self.values, negate=negate)
        return filtered_backend, {}
```
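To see why calling only backend methods keeps a transform backend-agnostic, consider two toy backends with different internal layouts. `RowBackend`, `ColumnBackend`, and `keep_species` are hypothetical names used purely for illustration; neither class exists in alp-data.

```python
from typing import Any


class RowBackend:
    """Toy row-oriented backend (list of dicts). Hypothetical."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows

    def filter_isin(self, column: str, values: list[Any], *, negate: bool = False) -> "RowBackend":
        wanted = set(values)
        # `!= negate` flips the membership test when negate is True.
        return RowBackend([r for r in self.rows if (r[column] in wanted) != negate])

    def __len__(self) -> int:
        return len(self.rows)


class ColumnBackend:
    """Toy column-oriented backend (dict of lists). Hypothetical."""

    def __init__(self, data: dict[str, list[Any]]) -> None:
        self.data = data

    def filter_isin(self, column: str, values: list[Any], *, negate: bool = False) -> "ColumnBackend":
        wanted = set(values)
        keep = [(v in wanted) != negate for v in self.data[column]]
        return ColumnBackend(
            {name: [v for v, k in zip(col, keep) if k] for name, col in self.data.items()}
        )

    def __len__(self) -> int:
        return len(next(iter(self.data.values()))) if self.data else 0


def keep_species(backend, species: list[str]):
    # The transform calls only filter_isin, so either backend works.
    return backend.filter_isin("species", species)


rows = RowBackend([{"species": "cat"}, {"species": "owl"}, {"species": "dog"}])
cols = ColumnBackend({"species": ["cat", "owl", "dog"]})
n_rows = len(keep_species(rows, ["cat", "dog"]))
n_cols = len(keep_species(cols, ["cat", "dog"]))
```

Both calls keep the same two rows even though the storage layout differs, which is the property the Filter transform relies on.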
Integration with Dataset Concatenation
The ConcatenatedDataset class uses backend operations to merge multiple datasets:
```python
from alp_data.datasets import InsectSet459, BirdSet
from alp_data.concat import ConcatenatedDataset

dataset1 = InsectSet459(split="validation", backend="polars")
dataset2 = BirdSet(split="HSN-test", backend="polars")

# All datasets must use the same backend type
concat_ds = ConcatenatedDataset([dataset1, dataset2], merge_level="soft")
```
The concatenation uses the backend's concat class method internally.
API Reference
alp_data.backends.protocol.DataBackend
Protocol defining the interface all data backends must implement.
This protocol follows the Adapter pattern: the backend wraps a data object and provides a unified interface to it. The wrapped data is stored as an instance attribute, which keeps the API clean and Pythonic.
is_streaming
property
Check if backend is in streaming mode.
Returns:
| Type | Description |
|---|---|
| bool | True if in streaming mode, False otherwise |
columns
property
Get the list of column names.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
unwrap
property
Get the underlying data object.
This is useful when you need to access backend-specific functionality or pass the data to functions that expect the native type.
Returns:
| Type | Description |
|---|---|
| Any | The underlying data (e.g., pd.DataFrame, pl.DataFrame) |
from_csv(path, *, streaming=False, **kwargs)
classmethod
Read a CSV file and return a wrapped data backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the CSV file (supports local and cloud paths via cloudpathlib) | required |
| streaming | bool | If True, use streaming mode (lazy evaluation). In streaming mode, \_\_getitem\_\_ is disabled and data is processed via iteration. By default False. | False |
| **kwargs | Any | Additional backend-specific arguments | {} |
Returns:
| Type | Description |
|---|---|
| DataBackend | Backend instance wrapping the loaded data |
from_json(path, *, lines=False, streaming=False, **kwargs)
classmethod
Read a JSON file and return a wrapped data backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the JSON file | required |
| lines | bool | If True, read file as JSON lines (one JSON object per line), by default False | False |
| streaming | bool | If True, use streaming mode (lazy evaluation), by default False | False |
| **kwargs | Any | Additional backend-specific arguments | {} |
Returns:
| Type | Description |
|---|---|
| DataBackend | Backend instance wrapping the loaded data |
from_parquet(path, *, streaming=False, **kwargs)
classmethod
Read a Parquet file and return a wrapped data backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the Parquet file | required |
| streaming | bool | If True, use streaming mode (lazy evaluation), by default False | False |
| **kwargs | Any | Additional backend-specific arguments | {} |
Returns:
| Type | Description |
|---|---|
| DataBackend | Backend instance wrapping the loaded data object |
from_path(path, *, streaming=False, **kwargs)
classmethod
Load a tabular file, dispatching on extension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to a | required |
| streaming | bool | Whether to use streaming mode, by default False. | False |
| **kwargs | Any | Additional backend-specific arguments. | {} |
Returns:
| Type | Description |
|---|---|
| DataBackend | Backend instance wrapping the loaded data. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the file extension is not supported. |
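Dispatching on the file extension can be sketched as a lookup from suffix to loader name. The set of extensions below is an assumption made for illustration (the list of supported formats is not given here), and `pick_loader` and `_LOADERS` are hypothetical names.

```python
from pathlib import PurePath

# Assumed extension-to-loader mapping; the supported set in alp-data may differ.
_LOADERS: dict[str, str] = {
    ".csv": "from_csv",
    ".json": "from_json",
    ".jsonl": "from_json",
    ".parquet": "from_parquet",
}


def pick_loader(path: str) -> str:
    """Return the name of the loader classmethod to use for `path`."""
    suffix = PurePath(path).suffix.lower()
    if suffix not in _LOADERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return _LOADERS[suffix]


loader_name = pick_loader("annotations/train.CSV")
```

A real implementation would then call the chosen classmethod, for example via `getattr(cls, loader_name)(path, streaming=streaming, **kwargs)`.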
__init__(df, *, streaming=False)
Wrap an existing data object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Any | The data to wrap (e.g., pd.DataFrame, pl.DataFrame). | required |
| streaming | bool | If True, use streaming mode where \_\_getitem\_\_ is disabled and iteration processes data in chunks, by default False | False |
__getitem__(key)
Get row(s) from the dataset using Pythonic indexing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | int \| list[int] \| slice | Row index (int), list of row indices, or slice | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] \| DataBackend | A dict for an int key; a new backend for a list or slice key |
Raises:
| Type | Description |
|---|---|
| IndexError | If index is out of bounds |
| TypeError | If key type is not supported |
| RuntimeError | If backend is in streaming mode (use iteration instead) |
Note
In streaming mode, \_\_getitem\_\_ is disabled. Use iteration instead: for row in backend: process(row)
__len__()
Get the number of rows in the dataset.
Returns:
| Type | Description |
|---|---|
| int | Number of rows |
__iter__()
Iterate over rows as dictionaries.
Yields:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary for each row mapping column names to values |
filter_isin(column, values, *, negate=False)
Filter rows where column values are in (or not in) a list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name to filter on | required |
| values | list[Any] | List of values to match | required |
| negate | bool | If True, keep rows NOT in values list, by default False | False |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with filtered data |
drop_duplicates(subset=None, *, keep='first')
Remove duplicate rows from the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subset | list[str] \| None | Column names to consider for identifying duplicates. If None, use all columns, by default None | None |
| keep | Literal['first', 'last'] | Which duplicate to keep, by default "first" | 'first' |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with duplicates removed |
dropna(subset=None)
Remove rows with missing values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subset | list[str] \| None | Column names to consider for null detection. If None, check all columns, by default None | None |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with null rows removed |
get_unique(column)
Get sorted unique values from a column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name | required |
Returns:
| Type | Description |
|---|---|
| list[Any] | Sorted list of unique values (nulls excluded) |
histogram(column)
Get value counts (histogram) for a column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name | required |
Returns:
| Type | Description |
|---|---|
| dict[Any, int] | Dictionary mapping unique values to their counts (nulls excluded) |
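The described result, a value-to-count dictionary with nulls excluded, can be sketched over plain row dicts with `collections.Counter`. This is an illustration of the semantics only; the function below operates on a list of dicts rather than a backend, and treating `None` as the null marker is an assumption.

```python
from collections import Counter
from typing import Any


def histogram(rows: list[dict[str, Any]], column: str) -> dict[Any, int]:
    """Count occurrences of each non-null value in `column`."""
    # None stands in for a null/missing value in this sketch.
    return dict(Counter(row[column] for row in rows if row[column] is not None))


rows = [{"species": "cat"}, {"species": None}, {"species": "cat"}, {"species": "dog"}]
counts = histogram(rows, "species")
```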
map_column(column, mapping, output_column, *, default=None)
Create a new column by mapping values from an existing column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Source column name | required |
| mapping | dict[Any, Any] | Dictionary mapping source values to output values | required |
| output_column | str | Name of the new column to create | required |
| default | Any | Value to use for unmapped keys, by default None | None |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with mapped column added |
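The role of `default` is easiest to see on plain row dicts. The sketch below is not the backend implementation; it assumes that unmapped source values receive `default` and that the input rows are left unmodified.

```python
from typing import Any


def map_column(
    rows: list[dict[str, Any]],
    column: str,
    mapping: dict[Any, Any],
    output_column: str,
    *,
    default: Any = None,
) -> list[dict[str, Any]]:
    """Return new rows with `output_column` derived from `column` via `mapping`."""
    # dict.get falls back to `default` for values that have no mapping entry.
    return [{**row, output_column: mapping.get(row[column], default)} for row in rows]


rows = [{"species": "cat"}, {"species": "owl"}]
mapped = map_column(rows, "species", {"cat": "mammal"}, "group", default="unknown")
```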
rename_columns(mapping)
Rename data columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mapping | dict[str, str] | Dictionary mapping old column names to new names | required |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with renamed columns |
add_column(column, values)
Add a new column to the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Name of the new column | required |
| values | Any | Values for the new column (scalar or array-like) | required |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with new column added |
select_columns(columns)
Select a subset of columns from the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | list[str] | List of column names to keep | required |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with only specified columns |
concat(backends, *, ignore_index=True, sort=False)
classmethod
Concatenate multiple backend instances vertically (row-wise).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| backends | list[DataBackend] | List of backend instances to concatenate | required |
| ignore_index | bool | If True, reset index in result, by default True | True |
| sort | bool | If True, sort columns alphabetically, by default False | False |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with concatenated data |
column_exists(column)
Check if a column exists in the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name to look for | required |
subsample_by_column(column, ratios, *, seed=42)
Subsample rows by column values with specified ratios.
For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.
Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name to group by | required |
| ratios | dict[str, float] | Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together). | required |
| seed | int | Random seed for reproducibility, by default 42 | 42 |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with subsampled rows |
Raises:
| Type | Description |
|---|---|
| KeyError | If the specified column does not exist in the DataFrame |
| ValueError | If any ratio is negative or greater than 1.0 |
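The pooled "other" semantics can be sketched over plain row dicts with the standard library. This is an illustration under stated assumptions, not the backend code: the per-group sample size is truncated to an integer, and rows with unlisted values are kept unchanged when no "other" ratio is supplied.

```python
import random
from typing import Any


def subsample_by_column(rows, column, ratios, *, seed=42):
    """Ratio-based subsampling over a list of row dicts (illustrative only)."""
    if any(not 0.0 <= ratio <= 1.0 for ratio in ratios.values()):
        raise ValueError("every ratio must lie between 0.0 and 1.0")
    rng = random.Random(seed)
    groups: dict[Any, list[dict[str, Any]]] = {}
    for row in rows:
        value = row[column]  # raises KeyError when the column is missing
        # Every unlisted value goes into one shared "other" pool.
        key = value if value in ratios else "other"
        groups.setdefault(key, []).append(row)
    sampled = []
    for key, members in groups.items():
        if key not in ratios:
            # Assumption: without an "other" ratio, unlisted rows are kept as-is.
            sampled.extend(members)
            continue
        # Assumption: the per-group sample size is truncated to an integer.
        sampled.extend(rng.sample(members, int(len(members) * ratios[key])))
    return sampled


rows = [{"col2": letter} for letter in "abcdefghij" * 10]
sub = subsample_by_column(rows, "col2", {"a": 0.5, "b": 0.5, "other": 0.1})
```

With 10 rows per letter, "a" and "b" each keep 5 rows, while the 80 remaining rows are pooled and 8 are drawn from the pool as a whole, so an individual unlisted letter may end up with no rows at all.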
upsample_by_column(column, target_counts, *, seed=42)
Upsample rows by column values to target counts with replacement.
For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.
Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name to group by | required |
| target_counts | dict[str, int] | Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together). | required |
| seed | int | Random seed for reproducibility, by default 42 | 42 |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with upsampled/downsampled rows |
Raises:
| Type | Description |
|---|---|
| KeyError | If the specified column does not exist in the DataFrame |
| ValueError | If any target count is negative |
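The per-group rule (upsample with replacement, downsample without) can be sketched for a single group as follows. `resample_to_target` is a hypothetical helper; in particular, drawing all `target` rows with replacement when upsampling, rather than keeping the originals and adding extras, is an assumption about the behaviour.

```python
import random
from typing import Any


def resample_to_target(members: list[Any], target: int, rng: random.Random) -> list[Any]:
    """Bring one group to exactly `target` rows."""
    if target < 0:
        raise ValueError("target counts must be non-negative")
    if len(members) >= target:
        # Enough rows already: downsample without replacement.
        return rng.sample(members, target)
    # Too few rows: draw `target` rows with replacement.
    return rng.choices(members, k=target)


rng = random.Random(42)
small_group = ["x1", "x2"]
large_group = list(range(10))
upsampled = resample_to_target(small_group, 5, rng)
downsampled = resample_to_target(large_group, 3, rng)
```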
sample_rows(n, *, seed=42, replace=False)
Randomly sample n rows from the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of rows to sample | required |
| seed | int | Random seed for reproducibility, by default 42 | 42 |
| replace | bool | Whether to sample with replacement, by default False | False |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend with sampled rows |
copy()
Create a copy of the backend, including a copy of the underlying data.
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend instance with copied data |
apply_fn(fn, **fn_kwargs)
Apply a custom function to the underlying data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fn | Callable | Function to apply. It should accept the underlying data type (e.g., pd.DataFrame, pl.DataFrame) as the first argument. | required |
| **fn_kwargs | Any | Keyword arguments to pass to the function | {} |
Returns:
| Type | Description |
|---|---|
| DataBackend | New backend wrapping the result of the function application |
multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=False)
Create a multilabel column from multiple feature columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | List of input feature column names to combine | required |
| output_feature | str | Name of the output multilabel column | required |
| label_map | dict[str, Any] \| None | Optional mapping from input feature values to output labels, by default None | None |
| allow_missing_labels | bool | If True, ignore missing labels in input features, by default False | False |
Returns:
| Type | Description |
|---|---|
| tuple[DataBackend, dict] | New backend with multilabel column and metadata dictionary |
alp_data.backends.PandasBackend
Pandas implementation of the DataBackend protocol.
This backend wraps a pandas DataFrame and provides a unified interface for DataFrame operations that can work across different backend implementations.
Supports both eager (in-memory) and streaming (chunked) modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame \| TextFileReader | The pandas DataFrame to wrap, or TextFileReader for streaming | required |
| streaming | bool | Whether the backend is in streaming mode | False |
| streaming_chunk_size | int | Number of rows per chunk in streaming mode | 1000 |
Examples:
>>> import pandas as pd
>>> from alp_data.backends import PandasBackend
>>> df = pd.DataFrame({"col1": range(100), "col2": list("abcdefghij") * 10})
>>> backend = PandasBackend(df, streaming=False)
>>> backend[0] # Get first row as dict
{'col1': 0, 'col2': 'a'}
>>> backend[[0, 5, 10]] # Get rows 0, 5, 10 as new backend
PandasBackend(shape=(3, 2))
>>> filtered = backend.filter_isin("col2", ["a", "b"])
>>> len(filtered) # Number of rows with col2 in ['a', 'b']
20
>>> backend.columns # List of column names
['col1', 'col2']
>>> backend.column_exists("col1") # Check if column exists
True
>>> sub = backend.subsample_by_column("col2", {"a": 0.5, "b": 0.5, "other": 0.1})
>>> counts = sub.unwrap["col2"].count() # Subsampled counts
>>> assert counts <= 20
columns
property
Get the list of column names.
Returns:
| Type | Description |
|---|---|
| list[str] | List of column names |
is_streaming
property
Check if backend is in streaming mode.
Returns:
| Type | Description |
|---|---|
| bool | True if in streaming mode, False otherwise |
unwrap
property
Get the underlying DataFrame object.
Returns:
| Type | Description |
|---|---|
| DataFrame | The underlying pandas DataFrame |
__getitem__(key)
Get row(s) from the DataFrame using Pythonic indexing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | int \| list[int] \| slice | Row index (int), list of row indices, or slice | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] \| PandasBackend | A dict for an int key; a new PandasBackend for a list or slice key |
Raises:
| Type | Description |
|---|---|
| IndexError | If index is out of bounds |
| TypeError | If key type is not supported |
| RuntimeError | If backend is in streaming mode |
Examples:
>>> import pandas as pd
>>> df = pd.DataFrame({"col1": range(100), "col2": list("abcdefghij") * 10})
>>> from alp_data.backends import PandasBackend
>>> backend = PandasBackend(df)
>>> backend[0] # Get first row as dict
{'col1': 0, 'col2': 'a'}
>>> backend[[0, 5, 10]] # Get rows 0, 5, 10 as new backend
PandasBackend(shape=(3, 2))
>>> backend[5:] # Get rows from index 5 to end
PandasBackend(shape=(95, 2))
>>> backend[:10] # Get first 10 rows
PandasBackend(shape=(10, 2))
__init__(df, *, streaming=False, streaming_chunk_size=1000)
Initialize the backend with a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame \| TextFileReader | The DataFrame to wrap, or TextFileReader for streaming mode | required |
| streaming | bool | Whether to use streaming mode, by default False | False |
| streaming_chunk_size | int | Number of rows per chunk in streaming mode, by default 1000 | 1000 |
__iter__()
Iterate over DataFrame rows as dictionaries.
In streaming mode, yields rows from chunks as they are read. In eager mode, yields rows from the loaded DataFrame.
Yields:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary for each row mapping column names to values |
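Chunked row iteration can be sketched with the standard library. The example below does not use pandas; `iter_chunks` and `iter_rows` are hypothetical helpers that mimic the idea of reading a fixed number of rows at a time and yielding them one by one.

```python
import csv
import io
from itertools import islice
from typing import Any, Iterator


def iter_chunks(reader: Iterator[dict[str, Any]], chunk_size: int) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of at most `chunk_size` rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def iter_rows(reader: Iterator[dict[str, Any]], chunk_size: int = 1000) -> Iterator[dict[str, Any]]:
    # Only one chunk is held in memory at a time.
    for chunk in iter_chunks(reader, chunk_size):
        yield from chunk


text = "id\n" + "\n".join(str(i) for i in range(5)) + "\n"
chunk_sizes = [len(c) for c in iter_chunks(csv.DictReader(io.StringIO(text)), 2)]
ids = [row["id"] for row in iter_rows(csv.DictReader(io.StringIO(text)), chunk_size=2)]
```

A larger chunk size means fewer reads but more rows in memory at once, which is the trade-off that streaming_chunk_size controls.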
__len__()
Get the number of rows in the DataFrame.
Returns:
| Type | Description |
|---|---|
| int | Number of rows |
Raises:
| Type | Description |
|---|---|
| RuntimeError | If backend is in streaming mode (length unknown until consumed) |
__repr__()
Return string representation of the backend.
Returns:
| Type | Description |
|---|---|
| str | String representation showing backend type and DataFrame shape |
add_column(column, values)
Add a new column to the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Name of the new column | required |
| values | Any | Values for the new column (scalar or array-like) | required |
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend with new column added |
apply_fn(fn, fn_kwargs, apply_kwargs)
Apply a function to the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fn | Callable | Function to apply to the DataFrame. Should accept a DataFrame as the first argument and return a modified DataFrame. | required |
| apply_kwargs | dict | Additional keyword arguments to pass to pandas.DataFrame.apply(), for example engine="numba" | required |
| fn_kwargs | Any | Additional keyword arguments to pass to the function | required |
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend with modified DataFrame |
Raises:
| Type | Description |
|---|---|
| ValueError | If the function does not return a pandas DataFrame |
column_exists(column)
Check if a column exists in the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name to look for | required |
Returns:
| Type | Description |
|---|---|
| bool | True if column exists, False otherwise |
concat(backends, *, ignore_index=True, sort=False)
classmethod
Concatenate multiple backend instances vertically (row-wise).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| backends | list[PandasBackend] | List of backend instances to concatenate | required |
| ignore_index | bool | If True, reset index in result, by default True | True |
| sort | bool | If True, sort columns alphabetically, by default False | False |
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend with concatenated data |
copy()
Create a copy of the backend with a copied DataFrame.
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend instance with copied DataFrame |
drop_duplicates(subset=None, *, keep='first')
Remove duplicate rows from the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subset | list[str] \| None | Column names to consider for identifying duplicates. If None, use all columns, by default None | None |
| keep | Literal['first', 'last'] | Which duplicate to keep, by default "first" | 'first' |
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend with duplicates removed |
dropna(subset=None)
Remove rows with missing values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| subset | list[str] \| None | Column names to consider for null detection. If None, check all columns, by default None | None |
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend with null rows removed |
filter_isin(column, values, *, negate=False)
Filter DataFrame rows where column values are in (or not in) a list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name to filter on | required |
| values | list[Any] | List of values to match | required |
| negate | bool | If True, keep rows NOT in values list, by default False | False |
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend with filtered DataFrame |
from_csv(path, *, streaming=False, streaming_chunk_size=1000, **kwargs)
classmethod
Read a CSV file and return a wrapped DataFrame backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the CSV file (supports local and cloud paths via cloudpathlib) | required |
| streaming | bool | If True, use streaming mode with chunked reading, by default False | False |
| streaming_chunk_size | int | Number of rows per chunk in streaming mode, by default 1000 | 1000 |
| **kwargs | Any | Additional pandas-specific arguments | {} |
Returns:
| Type | Description |
|---|---|
| PandasBackend | Backend instance wrapping the loaded DataFrame |
from_json(path, *, lines=False, streaming=False, streaming_chunk_size=1000, **kwargs)
classmethod
Read a JSON file and return a wrapped DataFrame backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the JSON file | required |
| lines | bool | If True, read file as JSON lines (one JSON object per line), by default False | False |
| streaming | bool | If True, use streaming mode with chunked reading, by default False | False |
| streaming_chunk_size | int | Number of rows per chunk in streaming mode, by default 1000 | 1000 |
| **kwargs | Any | Additional pandas-specific arguments | {} |
Returns:
| Type | Description |
|---|---|
| PandasBackend | Backend instance wrapping the loaded DataFrame |
from_parquet(path, *, streaming=False, **kwargs)
classmethod
Read a Parquet file and return a wrapped DataFrame backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the Parquet file | required |
| streaming | bool | If True, use streaming mode (not supported for parquet in pandas), by default False | False |
| **kwargs | Any | Additional pandas-specific arguments | {} |
Returns:
| Type | Description |
|---|---|
| PandasBackend | Backend instance wrapping the loaded DataFrame |
Note
Pandas does not natively support streaming parquet files. Consider using polars backend for large parquet files.
get_unique(column)
Get sorted unique values from a column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name | required |
Returns:
| Type | Description |
|---|---|
| list[Any] | Sorted list of unique values (nulls excluded) |
histogram(column)
Get value counts (histogram) for a column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Column name | required |
Returns:
| Type | Description |
|---|---|
| dict[Any, int] | Dictionary mapping unique values to their counts (nulls excluded) |
map_column(column, mapping, output_column)
Create a new column by mapping values from an existing column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | Source column name | required |
| mapping | dict[Any, Any] | Dictionary mapping source values to output values | required |
| output_column | str | Name of the new column to create | required |
Returns:
| Type | Description |
|---|---|
| PandasBackend | New backend with mapped column added |
multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=False)
Create a multi-label column by combining multiple input feature columns. Each row in the output column will contain a sorted list of integer IDs corresponding to the labels found in the specified input feature columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | List of column names to use as sources for labels. Each column can contain single values or lists of values. | required |
| output_feature | str | Name of the output column to store the generated label lists. | required |
| label_map | dict[str, Any] \| None | Mapping of unique label values to integer IDs. If None, a mapping will be generated from the unique values in the input features. | None |
| allow_missing_labels | bool | If True, rows with no labels will be included in the output. If False, rows with no labels will be dropped. Default is False. | False |
Returns:
| Type | Description |
|---|---|
| tuple[PandasBackend, dict] | A tuple containing the new PandasBackend instance with the added multi-label column, and the label_map used for mapping labels to IDs |
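The behaviour described above can be sketched over plain row dicts. This is an illustration rather than the PandasBackend code: it assumes that `None` marks a missing value and that an auto-generated label_map assigns IDs in sorted order of the unique labels.

```python
from typing import Any


def multilabel_from_features(
    rows, input_features, output_feature, label_map=None, allow_missing_labels=False
):
    """Combine several feature columns into one sorted list of integer label IDs."""

    def labels_of(row: dict[str, Any]) -> list[Any]:
        found = []
        for feature in input_features:
            value = row.get(feature)
            if value is None:
                continue
            # A cell may hold a single value or a list of values.
            found.extend(value if isinstance(value, list) else [value])
        return found

    if label_map is None:
        # Assumption: IDs follow the sorted order of the unique labels.
        unique = sorted({label for row in rows for label in labels_of(row)})
        label_map = {label: idx for idx, label in enumerate(unique)}

    out = []
    for row in rows:
        ids = sorted({label_map[label] for label in labels_of(row)})
        if not ids and not allow_missing_labels:
            continue  # rows without any label are dropped
        out.append({**row, output_feature: ids})
    return out, label_map


rows = [
    {"primary": "cat", "secondary": ["dog"]},
    {"primary": None, "secondary": []},
    {"primary": "dog", "secondary": None},
]
labelled, label_map = multilabel_from_features(rows, ["primary", "secondary"], "labels")
```

Passing the returned label_map back in on a later call keeps the label IDs consistent across splits.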
rename_columns(mapping)
Rename DataFrame columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mapping` | `dict[str, str]` | Dictionary mapping old column names to new names | required |
Returns:
| Type | Description |
|---|---|
| `PandasBackend` | New backend with renamed columns |
sample_rows(n, *, seed=42, replace=False)
Randomly sample n rows from the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | Number of rows to sample | required |
| `seed` | `int` | Random seed for reproducibility, by default 42 | `42` |
| `replace` | `bool` | Whether to sample with replacement, by default False | `False` |
Returns:
| Type | Description |
|---|---|
| `PandasBackend` | New backend with sampled rows |
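The interaction between `seed` and `replace` can be shown with the standard library alone. This is a sketch of the semantics, not the backend's implementation; `sample_rows_sketch` is a hypothetical helper and integers stand in for rows.

```python
import random
from typing import Any


def sample_rows_sketch(
    rows: list[Any], n: int, *, seed: int = 42, replace: bool = False
) -> list[Any]:
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    if replace:
        # With replacement: a row may appear several times, and n may exceed len(rows).
        return rng.choices(rows, k=n)
    # Without replacement: every drawn row is distinct, so n must not exceed len(rows).
    return rng.sample(rows, n)


rows = list(range(10))
first = sample_rows_sketch(rows, 5)
second = sample_rows_sketch(rows, 5)
oversampled = sample_rows_sketch(rows, 25, replace=True)
print(first == second, len(oversampled))
```

Two calls with the same seed return the same rows, and only the `replace=True` call can return more rows than the input contains.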
select_columns(columns)
Select a subset of columns from the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `list[str]` | List of column names to keep | required |
Returns:
| Type | Description |
|---|---|
| `PandasBackend` | New backend with only specified columns |
subsample_by_column(column, ratios, *, seed=42)
Subsample rows by column values with specified ratios.
For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.
Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name to group by | required |
| `ratios` | `dict[str, float]` | Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together). | required |
| `seed` | `int` | Random seed for reproducibility, by default 42 | `42` |
Returns:
| Type | Description |
|---|---|
| `PandasBackend` | New backend with subsampled rows |
Raises:
| Type | Description |
|---|---|
| `KeyError` | If the specified column does not exist in the DataFrame |
| `ValueError` | If any ratio is negative or greater than 1.0 |
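The pooling behaviour of the "other" key is easiest to see on a small example. The sketch below is plain Python over row dictionaries, not the backend's code. Two details are assumptions because the documentation does not state them: the number of rows kept is `int(len(group) * ratio)`, and unlisted values are kept whole when no "other" ratio is given.

```python
import random
from typing import Any


def subsample_sketch(
    rows: list[dict[str, Any]],
    column: str,
    ratios: dict[str, float],
    *,
    seed: int = 42,
) -> list[dict[str, Any]]:
    if any(column not in row for row in rows):
        raise KeyError(column)
    if any(not 0.0 <= ratio <= 1.0 for ratio in ratios.values()):
        raise ValueError("ratios must lie between 0.0 and 1.0")

    rng = random.Random(seed)
    groups: dict[Any, list[dict[str, Any]]] = {}
    for row in rows:
        # Values without their own ratio all land in one shared "other" pool.
        key = row[column] if row[column] in ratios else "other"
        groups.setdefault(key, []).append(row)

    kept: list[dict[str, Any]] = []
    for key, members in groups.items():
        # Assumption: with no "other" ratio, the pooled rows are kept whole.
        ratio = ratios.get(key, 1.0)
        kept.extend(rng.sample(members, int(len(members) * ratio)))
    return kept


rows = [{"species": s} for s in "aaaabbbbccdd"]
kept = subsample_sketch(rows, "species", {"a": 0.5, "other": 0.25})
print(len(kept))  # 4
```

Here "a" keeps 2 of its 4 rows, while "b", "c" and "d" are pooled into 8 rows of which 2 are kept in total; which of the three species those 2 rows come from depends on the random draw, not on a per-species quota.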
upsample_by_column(column, target_counts, *, seed=42)
Upsample rows by column values to target counts with replacement.
For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.
Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name to group by | required |
| `target_counts` | `dict[str, int]` | Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together). | required |
| `seed` | `int` | Random seed for reproducibility, by default 42 | `42` |
Returns:
| Type | Description |
|---|---|
| `PandasBackend` | New backend with upsampled/downsampled rows |
Raises:
| Type | Description |
|---|---|
| `KeyError` | If the specified column does not exist in the DataFrame |
| `ValueError` | If any target count is negative |
| `TypeError` | If any target count is not an integer |
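The up/down-sampling rule can be sketched the same way. This is an illustration of the documented behaviour in plain Python, not the backend's implementation. It assumes that an upsampled group is drawn entirely with replacement (the original rows are not guaranteed to all survive) and that unlisted values are left untouched when no "other" target is given; neither detail is confirmed by the documentation.

```python
import random
from collections import Counter
from typing import Any


def upsample_sketch(
    rows: list[dict[str, Any]],
    column: str,
    target_counts: dict[str, int],
    *,
    seed: int = 42,
) -> list[dict[str, Any]]:
    if any(column not in row for row in rows):
        raise KeyError(column)
    for count in target_counts.values():
        if not isinstance(count, int):
            raise TypeError("target counts must be integers")
        if count < 0:
            raise ValueError("target counts must be non-negative")

    rng = random.Random(seed)
    groups: dict[Any, list[dict[str, Any]]] = {}
    for row in rows:
        # Unlisted values are pooled under the "other" key.
        key = row[column] if row[column] in target_counts else "other"
        groups.setdefault(key, []).append(row)

    out: list[dict[str, Any]] = []
    for key, members in groups.items():
        target = target_counts.get(key)
        if target is None:
            out.extend(members)  # assumption: no target means no resampling
        elif len(members) >= target:
            out.extend(rng.sample(members, target))     # downsample, no replacement
        else:
            out.extend(rng.choices(members, k=target))  # upsample, with replacement
    return out


rows = [{"species": s} for s in "aabbbbb"]
balanced = upsample_sketch(rows, "species", {"a": 4, "b": 3})
print(Counter(r["species"] for r in balanced))  # Counter({'a': 4, 'b': 3})
```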
alp_data.backends.PolarsBackend
Polars implementation of the DataFrameBackend protocol.
This backend wraps a polars DataFrame or LazyFrame and provides a unified interface for DataFrame operations that can work across different backend implementations.
Supports both eager (DataFrame) and streaming (LazyFrame) modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame \| LazyFrame` | The polars DataFrame or LazyFrame to wrap | required |
| `streaming` | `bool` | Whether the backend is in streaming mode (LazyFrame) | `False` |
| `streaming_chunk_size` | `int` | Number of rows per batch when iterating in streaming mode. The default of 1000 is high enough to reduce I/O overhead, and larger values help little because the main source of latency in Dataset `__getitem__` calls is audio loading. | `1000` |
Examples:
>>> import polars as pl
>>> from alp_data.backends import PolarsBackend
>>> df = pl.DataFrame({
... "species": ["cat", "dog", "fish", "cat", "dog", None],
... "count": [5, 3, 8, 2, 7, 1]
... })
>>> backend = PolarsBackend(df)
>>> row = backend[0]
>>> filtered = backend.filter_isin("species", ["cat", "dog"])
>>> assert filtered.unwrap["species"].to_list() == ["cat", "dog", "cat", "dog"]
>>> # Streaming mode with LazyFrame
>>> df = pl.LazyFrame({
... "species": ["cat", "dog", "fish", "cat", "dog", None],
... "count": [5, 3, 8, 2, 7, 1]
... })
>>> backend = PolarsBackend(df, streaming=True)
>>> assert isinstance(backend.unwrap, pl.LazyFrame)
>>> print(backend.columns)
['species', 'count']
>>> for row in backend:
... print(row)
... break
{'species': 'cat', 'count': 5}
>>> collected = backend.collect()
>>> assert isinstance(collected.unwrap, pl.DataFrame)
columns
property
Get the list of column names.
Returns:
| Type | Description |
|---|---|
| `list[str]` | List of column names |
is_streaming
property
Check if backend is in streaming mode.
Returns:
| Type | Description |
|---|---|
| `bool` | True if in streaming mode (LazyFrame), False otherwise |
unwrap
property
Get the underlying DataFrame object.
Returns:
| Type | Description |
|---|---|
| `DataFrame \| LazyFrame` | The underlying polars DataFrame or LazyFrame |
__getitem__(key)
Get row(s) from the DataFrame using Pythonic indexing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `key` | `int \| list[int] \| slice` | | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any] \| PolarsBackend` | |
Raises:
| Type | Description |
|---|---|
| `IndexError` | If index is out of bounds |
| `TypeError` | If key type is not supported |
__init__(df, *, streaming=False, streaming_chunk_size=1000)
Initialize the backend with a polars DataFrame or LazyFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame \| LazyFrame` | The DataFrame or LazyFrame to wrap | required |
| `streaming` | `bool` | Whether to use streaming mode (LazyFrame), by default False | `False` |
| `streaming_chunk_size` | `int` | Number of rows per batch when iterating in streaming mode, by default 1000 | `1000` |
__iter__()
Iterate over DataFrame rows as dictionaries.
In streaming mode (LazyFrame), uses LazyFrame.collect_batches() to
materialize the query one chunk at a time, so the full result never
needs to live in memory at once.
In eager mode (DataFrame), yields rows directly.
Yields:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary for each row mapping column names to values |
__len__()
Get the number of rows in the DataFrame.
Returns:
| Type | Description |
|---|---|
| `int` | Number of rows |
__repr__()
Return string representation of the backend.
Returns:
| Type | Description |
|---|---|
| `str` | String representation showing backend type and DataFrame shape |
add_column(column, values)
Add a new column to the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Name of the new column | required |
| `values` | `Any` | Values for the new column (scalar or array-like) | required |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with new column added |
apply_fn(fn, fn_kwargs, apply_kwargs)
Apply a custom function to rows and create a new column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Any` | Function to apply to each row. Should accept a dict of column values. | required |
| `fn_kwargs` | `dict` | Additional keyword arguments to pass to the function | required |
| `apply_kwargs` | `dict` | Additional keyword arguments to pass to polars.DataFrame.map_rows() | required |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with the new column added |
Notes
This method collects the DataFrame if in streaming mode, as polars does not support arbitrary row-wise functions in LazyFrame. The returned backend will be in eager mode (streaming=False).
collect()
Materialize the LazyFrame and return an eager backend.
This method collects the LazyFrame into a DataFrame and returns a new backend with streaming mode disabled.
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend in eager mode with materialized DataFrame |
Notes
If the backend is already in eager mode, returns a copy of the backend.
column_exists(column)
Check if a column exists in the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name to look for | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if column exists, False otherwise |
concat(backends, *, ignore_index=True, sort=False)
classmethod
Concatenate multiple backend instances vertically (row-wise).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `backends` | `list[PolarsBackend]` | List of backend instances to concatenate | required |
| `sort` | `bool` | If True, sort columns alphabetically, by default False | `False` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with concatenated data |
copy()
Create a copy of the backend with a copied DataFrame.
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend instance with copied DataFrame |
drop_duplicates(subset=None, *, keep='first')
Remove duplicate rows from the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `subset` | `list[str] \| None` | Column names to consider for identifying duplicates. If None, use all columns, by default None | `None` |
| `keep` | `Literal['first', 'last']` | Which duplicate to keep, by default "first" | `'first'` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with duplicates removed |
Notes
In streaming mode (LazyFrame), this operation preserves the lazy computation.
Call .collect() to materialize the deduplicated result into an eager backend.
dropna(subset=None)
Remove rows with missing values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `subset` | `list[str] \| None` | Column names to consider for null detection. If None, check all columns, by default None | `None` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with null rows removed |
Notes
In streaming mode (LazyFrame), this operation preserves the lazy computation.
Call .collect() to materialize the cleaned result into an eager backend.
filter_isin(column, values, *, negate=False)
Filter DataFrame rows where column values are in (or not in) a list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name to filter on | required |
| `values` | `list[Any]` | List of values to match | required |
| `negate` | `bool` | If True, keep rows NOT in values list, by default False | `False` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with filtered DataFrame |
Notes
In streaming mode (LazyFrame), this operation preserves the lazy computation.
Call .collect() to materialize the filtered result into an eager backend.
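The `negate` flag simply inverts the membership test. A minimal plain-Python sketch of that logic follows; `filter_isin_sketch` is a hypothetical helper, and the example deliberately contains no nulls because this section does not specify how null values behave under `negate=True`.

```python
from typing import Any


def filter_isin_sketch(
    rows: list[dict[str, Any]],
    column: str,
    values: list[Any],
    *,
    negate: bool = False,
) -> list[dict[str, Any]]:
    allowed = set(values)
    # Keep a row when its membership result differs from `negate`.
    return [row for row in rows if (row[column] in allowed) != negate]


rows = [{"species": s} for s in ["cat", "dog", "fish", "cat"]]
inside = filter_isin_sketch(rows, "species", ["cat", "dog"])
outside = filter_isin_sketch(rows, "species", ["cat", "dog"], negate=True)
print([r["species"] for r in inside])   # ['cat', 'dog', 'cat']
print([r["species"] for r in outside])  # ['fish']
```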
from_csv(path, *, streaming=False, **kwargs)
classmethod
Read a CSV file and return a wrapped DataFrame backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the CSV file (supports local and cloud paths via cloudpathlib) | required |
| `streaming` | `bool` | If True, use streaming mode with LazyFrame, by default False | `False` |
| `**kwargs` | `Any` | Additional polars-specific arguments | `{}` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | Backend instance wrapping the loaded DataFrame or LazyFrame |
from_json(path, *, lines=False, streaming=False, **kwargs)
classmethod
Read a JSON file and return a wrapped DataFrame backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the JSON file | required |
| `lines` | `bool` | If True, read file as JSON lines (one JSON object per line), by default False | `False` |
| `streaming` | `bool` | If True, use streaming mode with LazyFrame, by default False | `False` |
| `**kwargs` | `Any` | Additional polars-specific arguments | `{}` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | Backend instance wrapping the loaded DataFrame |
from_parquet(path, *, streaming=False, **kwargs)
classmethod
Read a Parquet file and return a wrapped DataFrame backend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the Parquet file | required |
| `streaming` | `bool` | If True, use streaming mode with LazyFrame, by default False | `False` |
| `**kwargs` | `Any` | Additional polars-specific arguments | `{}` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | Backend instance wrapping the loaded DataFrame or LazyFrame |
get_unique(column)
Get sorted unique values from a column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name | required |
Returns:
| Type | Description |
|---|---|
| `list[Any]` | Sorted list of unique values (nulls excluded) |
Notes
In streaming mode (LazyFrame), materializes the full column to compute uniques. A UserWarning is emitted because this forces collection of the underlying query.
histogram(column)
Get value counts (histogram) for a column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name | required |
Returns:
| Type | Description |
|---|---|
| `dict[Any, int]` | Dictionary mapping unique values to their counts (nulls excluded) |
Notes
In streaming mode (LazyFrame), materializes the full column to compute counts. A UserWarning is emitted because this forces collection of the underlying query.
iter_batches(batch_size=1000)
Iterate over DataFrame in batches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Number of rows per batch, by default 1000 | `1000` |
Yields:
| Type | Description |
|---|---|
| `PolarsBackend` | Backend instances wrapping batches of up to batch_size rows. Yielded backends are always in eager mode. |
Notes
In streaming mode, uses LazyFrame.collect_batches(chunk_size=batch_size)
to produce batches incrementally, so the full result never needs to
live in memory at once. Polars treats chunk_size as a hint rather than
a strict size, so the chunks it returns may be smaller than
batch_size.
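The idea of producing batches incrementally, without holding the full result in memory, can be shown with a generator over any iterable. This sketch uses only the standard library and yields plain lists; it is not how PolarsBackend or `LazyFrame.collect_batches` is implemented, and `iter_batches_sketch` is a hypothetical name.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def iter_batches_sketch(
    rows: Iterable[dict[str, Any]], batch_size: int = 1000
) -> Iterator[list[dict[str, Any]]]:
    iterator = iter(rows)
    # Pull at most batch_size rows at a time; stop when the source is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# A generator expression stands in for a lazily evaluated source.
lazy_rows = ({"idx": i} for i in range(2500))
sizes = [len(batch) for batch in iter_batches_sketch(lazy_rows, batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```

Only one batch exists at a time while iterating, and the last batch is shorter than `batch_size` when the row count is not a multiple of it.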
map_column(column, mapping, output_column, *, default=None)
Create a new column by mapping values from an existing column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Source column name | required |
| `mapping` | `dict[Any, Any]` | Dictionary mapping source values to output values | required |
| `output_column` | `str` | Name of the new column to create | required |
| `default` | `Any` | Value to use for unmapped keys, by default None | `None` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with mapped column added |
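The role of `default` for values missing from `mapping` can be illustrated with row dictionaries. This is a plain-Python sketch of the described behaviour, not the PolarsBackend implementation; `map_column_sketch` is a hypothetical helper.

```python
from typing import Any


def map_column_sketch(
    rows: list[dict[str, Any]],
    column: str,
    mapping: dict[Any, Any],
    output_column: str,
    *,
    default: Any = None,
) -> list[dict[str, Any]]:
    # New dicts are built so the input rows stay untouched,
    # mirroring the "returns a new backend" contract.
    return [
        {**row, output_column: mapping.get(row[column], default)}
        for row in rows
    ]


rows = [{"species": "cat"}, {"species": "dog"}, {"species": "fish"}]
mapped = map_column_sketch(rows, "species", {"cat": 0, "dog": 1}, "label", default=-1)
print([r["label"] for r in mapped])  # [0, 1, -1]
```

"fish" has no entry in the mapping, so it receives the `default` value; with the default of `None` it would be mapped to a null instead.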
multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=True)
Create a multi-label column by combining multiple input feature columns. Each row in the output column will contain a sorted list of integer IDs corresponding to the labels found in the specified input feature columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_features` | `list[str]` | List of column names to use as sources for labels. Each column can contain single values or lists of values. | required |
| `output_feature` | `str` | Name of the output column to store the generated label lists. | required |
| `label_map` | `dict[str, Any] \| None` | Mapping of unique label values to integer IDs. If None, a mapping will be generated from the unique values in the input features. | `None` |
| `allow_missing_labels` | `bool` | If True, rows with no labels will be included in the output. If False, rows with no labels will be dropped. Default is True. | `True` |
Returns:
| Type | Description |
|---|---|
| `tuple[PolarsBackend, dict]` | A tuple containing (1) a new PolarsBackend instance with the added multi-label column and (2) the label_map used for mapping labels to IDs |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If any input feature does not exist or is not of type List. |
rename_columns(mapping)
Rename DataFrame columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mapping` | `dict[str, str]` | Dictionary mapping old column names to new names | required |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with renamed columns |
sample_rows(n, *, seed=42, replace=False)
Randomly sample n rows from the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | Number of rows to sample | required |
| `seed` | `int` | Random seed for reproducibility, by default 42 | `42` |
| `replace` | `bool` | Whether to sample with replacement, by default False | `False` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with sampled rows |
select_columns(columns)
Select a subset of columns from the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `list[str]` | List of column names to keep | required |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with only specified columns |
subsample_by_column(column, ratios, *, seed=42)
Subsample rows by column values with specified ratios.
For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.
If the backend is in streaming mode, a UserWarning will be issued and the LazyFrame will be collected since sampling requires materialization.
Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name to group by | required |
| `ratios` | `dict[str, float]` | Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together). | required |
| `seed` | `int` | Random seed for reproducibility, by default 42 | `42` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with subsampled rows |
Raises:
| Type | Description |
|---|---|
| `KeyError` | If the specified column does not exist in the DataFrame |
| `ValueError` | If any ratio is negative or greater than 1.0 |
upsample_by_column(column, target_counts, *, seed=42)
Upsample rows by column values to target counts with replacement.
For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.
If the backend is in streaming mode, a UserWarning will be issued and the LazyFrame will be collected since sampling requires materialization.
Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | Column name to group by | required |
| `target_counts` | `dict[str, int]` | Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together). | required |
| `seed` | `int` | Random seed for reproducibility, by default 42 | `42` |
Returns:
| Type | Description |
|---|---|
| `PolarsBackend` | New backend with upsampled/downsampled rows |
Raises:
| Type | Description |
|---|---|
| `KeyError` | If the specified column does not exist in the DataFrame |
| `ValueError` | If any target count is negative |
| `TypeError` | If any target count is not an integer |
alp_data.backends.get_backend(backend)
Get the backend class for the specified backend type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `backend` | `BackendType` | Name of the backend ("pandas" or "polars") | required |
Returns:
| Type | Description |
|---|---|
| `Type[DataBackend \| StreamingDataBackend]` | The backend class (not an instance) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the backend name is not recognized |
Examples:
>>> backend_cls = get_backend("pandas")
>>> assert backend_cls is PandasBackend
>>> backend_cls = get_backend("polars")
>>> assert backend_cls is PolarsBackend
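A name-to-class lookup of this kind is commonly written as a small registry. The sketch below shows one way such a function could behave, including the ValueError for unknown names. It is an assumption about the general pattern, not the actual alp_data source; the stub classes and the names `_REGISTRY` and `get_backend_sketch` are hypothetical.

```python
class PandasBackendStub:
    """Stand-in for a pandas-based backend class (hypothetical)."""


class PolarsBackendStub:
    """Stand-in for a polars-based backend class (hypothetical)."""


# Maps the public backend name to the class that implements it.
_REGISTRY: dict[str, type] = {
    "pandas": PandasBackendStub,
    "polars": PolarsBackendStub,
}


def get_backend_sketch(backend: str) -> type:
    """Return the backend class (not an instance) registered under `backend`."""
    try:
        return _REGISTRY[backend]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown backend {backend!r}; expected one of: {known}") from None


backend_cls = get_backend_sketch("polars")
print(backend_cls.__name__)  # PolarsBackendStub
```

Returning the class instead of an instance lets the caller pick the constructor, for example a `from_csv`-style classmethod, after the lookup.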