`alp_data.backends` Module

What are data backends?

A DataBackend is a Python Protocol that defines an interface for any library that is used to read data and perform common operations on data. It's an abstraction that allows alp-data to support multiple data libraries without being tightly coupled to any specific one.
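
As a rough illustration of the idea, the sketch below shows how structural typing with typing.Protocol lets a helper work with any object that offers the right methods. The names MiniBackend, ListBackend, and count_rows_with are hypothetical and exist only for this example; the real DataBackend protocol has a much larger interface, described in the API reference.

```python
from typing import Any, Iterator, Protocol, runtime_checkable


@runtime_checkable
class MiniBackend(Protocol):
    """Hypothetical, cut-down stand-in for a data backend protocol."""

    @property
    def columns(self) -> list[str]: ...

    def __len__(self) -> int: ...

    def __iter__(self) -> Iterator[dict[str, Any]]: ...


class ListBackend:
    """Toy backend wrapping a list of row dicts; it does not inherit from MiniBackend."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    @property
    def columns(self) -> list[str]:
        return list(self._rows[0]) if self._rows else []

    def __len__(self) -> int:
        return len(self._rows)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        return iter(self._rows)


def count_rows_with(backend: MiniBackend, column: str, value: Any) -> int:
    """Library-agnostic helper: it relies only on the protocol's methods."""
    return sum(1 for row in backend if row.get(column) == value)


backend = ListBackend([{"species": "cat"}, {"species": "dog"}, {"species": "cat"}])
n_cats = count_rows_with(backend, "species", "cat")
```

Because the check is structural, a second toy backend built on another library would work with count_rows_with unchanged, as long as it exposes the same three members.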

One such library is pandas, and its backend class is alp_data.backends.PandasBackend. Up to version 1.3.0, all Dataset and Transform classes in alp-data used pandas to load the underlying annotation CSV/JSONL files, which meant calling functions like pd.read_csv and manipulating pandas DataFrames directly inside class methods. We want to reduce this dependence on pandas and keep the freedom to load and manipulate data with other libraries such as polars, duckdb, pyarrow, or webdataset, because each library has its own strengths and weaknesses for different ML use cases.

Available Backends

Currently, alp-data provides two backend implementations:

Backend	Class	Description
`pandas`	`PandasBackend`	Uses pandas DataFrames for data operations. Supports streaming via chunked reading.
`polars`	`PolarsBackend`	Uses polars DataFrames/LazyFrames. Supports streaming via LazyFrame for memory-efficient processing.

How to Use Backends

Specifying a Backend in Dataset Configuration

When loading a dataset, you can specify which backend to use via the backend parameter:

from alp_data.datasets import BirdSet

# Use polars backend (default)
dataset = BirdSet(split="HSN-train", backend="polars")

# Use pandas backend
dataset = BirdSet(split="HSN-train", backend="pandas")

Or via YAML configuration:

dataset:
  dataset_name: birdset
  split: HSN-train
  backend: polars  # or "pandas"
  streaming: false

Direct Backend Usage

You can also use backends directly for standalone data operations:

from alp_data.backends import PandasBackend, PolarsBackend

# Load data with PandasBackend
backend = PandasBackend.from_csv("path/to/data.csv")
print(len(backend))  # Number of rows
print(backend.columns)  # Column names

# Load data with PolarsBackend
backend = PolarsBackend.from_parquet("path/to/data.parquet")
row = backend[0]  # Get first row as dict

Streaming Mode

Both backends support streaming mode for memory-efficient processing of large datasets:

from alp_data.datasets import BirdSet

# Enable streaming mode
dataset = BirdSet(split="HSN-train", backend="polars", streaming=True)

# Iterate over rows without loading entire dataset into memory
for sample in dataset:
    process(sample)

Streaming Limitations

In streaming mode, __getitem__ indexing is disabled. Use iteration instead. Additionally, len() is not available until the stream is consumed.
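
The following sketch illustrates these limitations with a toy wrapper around the standard library's csv.DictReader. The class StreamingRows is hypothetical and is not how either backend is implemented; it only mirrors the behavior described above.

```python
import csv
import io
from typing import Any, Iterator


class StreamingRows:
    """Hypothetical streaming wrapper: rows are only reachable by iteration."""

    def __init__(self, text_stream: io.TextIOBase) -> None:
        self._reader = csv.DictReader(text_stream)

    def __iter__(self) -> Iterator[dict[str, Any]]:
        # Rows are parsed lazily, one at a time, as the caller iterates.
        yield from self._reader

    def __getitem__(self, key: Any) -> Any:
        raise RuntimeError("Indexing is disabled in streaming mode; iterate instead.")

    def __len__(self) -> int:
        raise RuntimeError("Length is unknown until the stream has been consumed.")


stream = StreamingRows(io.StringIO("species,count\ncat,3\ndog,5\n"))
# A comprehension only calls __iter__, so it never touches __len__.
rows = [row for row in stream]
```

One practical consequence of this design: in CPython, list(stream) asks the object for a length hint first, so an object whose __len__ raises RuntimeError is safer to consume with a for loop or a comprehension.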

Accessing the Underlying Data Object

If you need to perform library-specific operations, use the unwrap property to access the underlying data object:

from alp_data.backends import PandasBackend

backend = PandasBackend.from_csv("data.csv")
df = backend.unwrap  # Returns pd.DataFrame

For more details, see Accessing the Underlying Data.

Pandas vs Polars: When to Use Which

Use Case	Recommended Backend	Reason
Small to medium datasets	Either	Both perform well
Large datasets (> 1GB)	`polars`	Better memory efficiency and performance
Streaming/lazy evaluation	`polars`	Native LazyFrame support
Compatibility with existing pandas code	`pandas`	Direct access to `pd.DataFrame` via `unwrap`
Parquet file streaming	`polars`	Pandas doesn't support streaming parquet

Accessing the Underlying Data

If you need to perform operations not covered by the backend interface, use the unwrap property:

import polars as pl
from alp_data.backends import PandasBackend, PolarsBackend

# Pandas backend
pandas_backend = PandasBackend.from_csv("data.csv")
df = pandas_backend.unwrap  # Returns pd.DataFrame
# Now use pandas-specific operations
df.describe()

# Polars backend
polars_backend = PolarsBackend.from_csv("data.csv")
df = polars_backend.unwrap  # Returns pl.DataFrame or pl.LazyFrame
# Now use polars-specific operations
df.select(pl.col("species").value_counts())

Tip

When using unwrap, be aware that you lose the backend abstraction. Operations on the unwrapped object won't automatically work with other backends.

The DataBackend Protocol

The DataBackend protocol defines a common interface that all backend implementations must follow. This enables alp-data to work uniformly with different data libraries.

Core Interface

The protocol defines these key operations:

Data Loading (Class Methods)

Method	Description
`from_csv(path, streaming=False)`	Load data from a CSV file
`from_json(path, lines=False, streaming=False)`	Load data from a JSON file (supports JSON lines format)
`from_parquet(path, streaming=False)`	Load data from a Parquet file

Data Access

Method	Description
`__getitem__(key)`	Get row(s) by index (int returns dict, list/slice returns new backend)
`__len__()`	Get number of rows
`__iter__()`	Iterate over rows as dictionaries
`columns`	Property returning list of column names
`column_exists(column)`	Check if a column exists
`unwrap`	Property returning the underlying data object (e.g., `pd.DataFrame`, `pl.DataFrame`)

Data Manipulation

Method	Description
`filter_isin(column, values, negate=False)`	Filter rows by column values
`drop_duplicates(subset=None, keep="first")`	Remove duplicate rows
`dropna(subset=None)`	Remove rows with missing values
`get_unique(column)`	Get sorted unique values from a column
`map_column(column, mapping, output_column)`	Create new column by mapping values
`rename_columns(mapping)`	Rename columns
`add_column(column, values)`	Add a new column
`select_columns(columns)`	Select subset of columns
`concat(backends, ignore_index=True)`	Concatenate multiple backends vertically

Sampling

Method	Description
`sample_rows(n, seed=42, replace=False)`	Randomly sample n rows
`subsample_by_column(column, ratios, seed=42)`	Subsample by column values with specified ratios

Advanced Operations

Method	Description
`copy()`	Create a copy of the backend
`apply_fn(fn, fn_kwargs, apply_kwargs)`	Apply a custom function to the data
`multilabel_from_features(input_features, output_feature, ...)`	Create multilabel column from multiple features

Backend Integration in alp-data

Integration with Datasets

All dataset classes use backends internally to manage their data. The backend is selected at instantiation time:

# alp_data/dataset.py
class Dataset(ABC):
    def __init__(
        self,
        output_take_and_give: dict[str, str] = None,
        backend: BackendType = "polars",
        streaming: bool = False,
    ) -> None:
        self._backend_class = get_backend(backend)

Datasets then use the backend class to load data:

# Example from BirdSet._load()
self._data = self._backend_class.from_json(
    location, lines=True, streaming=self._streaming
)
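
A name-to-class lookup of this kind can be sketched with a small registry. The code below is an assumption about the general shape of such a lookup, not the actual get_backend implementation; resolve_backend, PandasLike, and PolarsLike are hypothetical names.

```python
class PandasLike:
    """Stand-in class; not the real PandasBackend."""


class PolarsLike:
    """Stand-in class; not the real PolarsBackend."""


# Hypothetical registry mapping configuration strings to backend classes.
_REGISTRY: dict[str, type] = {"pandas": PandasLike, "polars": PolarsLike}


def resolve_backend(name: str) -> type:
    """Return the backend class registered under `name`."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown backend {name!r}; expected one of: {known}") from None


backend_class = resolve_backend("polars")
```

Returning the class rather than an instance lets the caller pick the loader (CSV, JSON, or Parquet) later, which matches how the dataset excerpt above calls a class method on the stored class.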

Integration with Transforms

Transforms operate directly on backend instances rather than raw DataFrames. This makes transforms backend-agnostic:

from alp_data.transforms import Filter
from alp_data.backends import PandasBackend

# Create backend
backend = PandasBackend.from_csv("data.csv")

# Apply transform - works with any backend
filter_transform = Filter(property="species", values=["cat", "dog"], mode="include")
filtered_backend, metadata = filter_transform(backend)

Transforms use the backend's methods rather than library-specific operations:

# alp_data/transforms/filter.py
class Filter:
    def __call__(self, backend: DataBackend) -> tuple[DataBackend, dict]:
        # Uses backend.filter_isin() instead of pandas-specific code
        negate = self.mode == "exclude"
        filtered_backend = backend.filter_isin(self.property, self.values, negate=negate)
        return filtered_backend, {}

Integration with Dataset Concatenation

The ConcatenatedDataset class uses backend operations to merge multiple datasets:

from alp_data.datasets import InsectSet459, BirdSet
from alp_data.concat import ConcatenatedDataset

dataset1 = InsectSet459(split="validation", backend="polars")
dataset2 = BirdSet(split="HSN-test", backend="polars")

# All datasets must use the same backend type
concat_ds = ConcatenatedDataset([dataset1, dataset2], merge_level="soft")

The concatenation uses the backend's concat class method internally.

API Reference

`alp_data.backends.protocol.DataBackend`

Protocol defining the interface all data backends must implement.

This protocol uses the Adapter pattern: the backend wraps a data object and provides a unified interface to it. The wrapped data is stored as an instance attribute, making the API more Pythonic and cleaner.

`is_streaming` `property`

Check if backend is in streaming mode.

Returns:

Type	Description
`bool`	True if in streaming mode, False otherwise

`columns` `property`

Get the list of column names.

Returns:

Type	Description
`list[str]`	List of column names

`unwrap` `property`

Get the underlying data object.

This is useful when you need to access backend-specific functionality or pass the data to functions that expect the native type.

Returns:

Type	Description
`Any`	The underlying data (e.g., pd.DataFrame, pl.DataFrame)

`from_csv(path, *, streaming=False, **kwargs)` `classmethod`

Read a CSV file and return a wrapped data backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the CSV file (supports local and cloud paths via cloudpathlib)	required
`streaming`	`bool`	If True, use streaming mode (lazy evaluation). In streaming mode, __getitem__ is disabled and data is processed via iteration. By default False.	`False`
`**kwargs`	`Any`	Additional backend-specific arguments	`{}`

Returns:

Type	Description
`DataBackend`	Backend instance wrapping the loaded data

`from_json(path, *, lines=False, streaming=False, **kwargs)` `classmethod`

Read a JSON file and return a wrapped data backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the JSON file	required
`lines`	`bool`	If True, read file as JSON lines (one JSON object per line), by default False	`False`
`streaming`	`bool`	If True, use streaming mode (lazy evaluation), by default False	`False`
`**kwargs`	`Any`	Additional backend-specific arguments	`{}`

Returns:

Type	Description
`DataBackend`	Backend instance wrapping the loaded data

`from_parquet(path, *, streaming=False, **kwargs)` `classmethod`

Read a Parquet file and return a wrapped data backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the Parquet file	required
`streaming`	`bool`	If True, use streaming mode (lazy evaluation), by default False	`False`
`**kwargs`	`Any`	Additional backend-specific arguments	`{}`

Returns:

Type	Description
`DataBackend`	Backend instance wrapping the loaded data object

`from_path(path, *, streaming=False, **kwargs)` `classmethod`

Load a tabular file, dispatching on extension.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to a `.parquet`, `.csv`, `.json`, `.jsonl`, or `.ndjson` file.	required
`streaming`	`bool`	Whether to use streaming mode, by default False.	`False`
`**kwargs`	`Any`	Additional backend-specific arguments.	`{}`

Returns:

Type	Description
`DataBackend`	Backend instance wrapping the loaded data.

Raises:

Type	Description
`ValueError`	If the file extension is not supported.
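
Extension-based dispatch can be sketched as a lookup from file suffix to loader name. This is a minimal illustration under stated assumptions: pick_loader and both lookup tables are hypothetical, and treating `.jsonl` and `.ndjson` as JSON lines input (and lower-casing the suffix) is a guess about sensible behavior rather than documented fact.

```python
from pathlib import PurePath

# Hypothetical mapping from file suffix to the loader that would handle it.
_LOADERS = {
    ".parquet": "from_parquet",
    ".csv": "from_csv",
    ".json": "from_json",
    ".jsonl": "from_json",
    ".ndjson": "from_json",
}
# Suffixes assumed to hold one JSON object per line.
_JSON_LINES_SUFFIXES = {".jsonl", ".ndjson"}


def pick_loader(path: str) -> tuple[str, dict[str, bool]]:
    """Return the loader name and extra keyword arguments for `path`."""
    suffix = PurePath(path).suffix.lower()
    if suffix not in _LOADERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    extra = {"lines": True} if suffix in _JSON_LINES_SUFFIXES else {}
    return _LOADERS[suffix], extra


loader_name, loader_kwargs = pick_loader("annotations/train.jsonl")
```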

`__init__(df, *, streaming=False)`

Wrap an existing data object.

Parameters:

Name	Type	Description	Default
`df`	`Any`	The data to wrap (e.g., pd.DataFrame, pl.DataFrame).	required
`streaming`	`bool`	If True, use streaming mode where __getitem__ is disabled and iteration processes data in chunks, by default False	`False`

`__getitem__(key)`

__getitem__(key: int) -> dict[str, Any]

__getitem__(key: list[int]) -> 'DataBackend'

__getitem__(key: slice) -> 'DataBackend'

Get row(s) from the dataset using Pythonic indexing.

Parameters:

Name	Type	Description	Default
`key`	`int \| list[int] \| slice`	`int`: get a single row as a dict; `list[int]`: get multiple rows as a new backend; `slice`: get a row range as a new backend	required

Returns:

Type	Description
`dict[str, Any] \| DataBackend`	`dict` if key is an int (single row); `DataBackend` if key is a list or slice (multiple rows)

Raises:

Type	Description
`IndexError`	If index is out of bounds
`TypeError`	If key type is not supported
`RuntimeError`	If backend is in streaming mode (use iteration instead)

Note

In streaming mode, __getitem__ is disabled. Use iteration instead: for row in backend: process(row)
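
The indexing rules above can be sketched on a plain list of row dictionaries. RowStore is a hypothetical toy class used only to show the three key types and the three error cases; it is not the library's implementation.

```python
from typing import Any, Union


class RowStore:
    """Toy container mirroring the documented indexing rules."""

    def __init__(self, rows: list[dict[str, Any]], streaming: bool = False) -> None:
        self._rows = rows
        self._streaming = streaming

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(
        self, key: Union[int, list[int], slice]
    ) -> Union[dict[str, Any], "RowStore"]:
        if self._streaming:
            raise RuntimeError("Indexing is disabled in streaming mode; iterate instead.")
        if isinstance(key, int):
            return self._rows[key]  # Raises IndexError when out of bounds.
        if isinstance(key, slice):
            return RowStore(self._rows[key])
        if isinstance(key, list):
            return RowStore([self._rows[i] for i in key])
        raise TypeError(f"Unsupported key type: {type(key).__name__}")


store = RowStore([{"id": i} for i in range(5)])
first = store[0]       # Single row as a dict.
subset = store[[0, 2]]  # New RowStore holding rows 0 and 2.
tail = store[3:]        # New RowStore holding rows 3 and 4.
```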

`__len__()`

Get the number of rows in the dataset.

Returns:

Type	Description
`int`	Number of rows

`__iter__()`

Iterate over rows as dictionaries.

Yields:

Type	Description
`dict[str, Any]`	Dictionary for each row mapping column names to values

`filter_isin(column, values, *, negate=False)`

Filter rows where column values are in (or not in) a list.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to filter on	required
`values`	`list[Any]`	List of values to match	required
`negate`	`bool`	If True, keep rows NOT in values list, by default False	`False`

Returns:

Type	Description
`DataBackend`	New backend with filtered data
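
The effect of negate can be sketched on a list of row dictionaries. The free function below is a hypothetical illustration of the semantics, not a backend method, and it assumes the values are hashable.

```python
from typing import Any


def filter_isin(
    rows: list[dict[str, Any]], column: str, values: list[Any], negate: bool = False
) -> list[dict[str, Any]]:
    """Keep rows whose `column` value is in `values`, or not in it when negate=True."""
    allowed = set(values)
    # The membership test is flipped when negate is True.
    return [row for row in rows if (row[column] in allowed) != negate]


rows = [{"species": "cat"}, {"species": "dog"}, {"species": "owl"}]
included = filter_isin(rows, "species", ["cat", "dog"])
excluded = filter_isin(rows, "species", ["cat", "dog"], negate=True)
```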

`drop_duplicates(subset=None, *, keep='first')`

Remove duplicate rows from the data.

Parameters:

Name	Type	Description	Default
`subset`	`list[str] \| None`	Column names to consider for identifying duplicates. If None, use all columns, by default None	`None`
`keep`	`Literal['first', 'last']`	Which duplicate to keep, by default "first"	`'first'`

Returns:

Type	Description
`DataBackend`	New backend with duplicates removed
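
The difference between keep="first" and keep="last" can be sketched as follows. This hypothetical helper works on a list of row dictionaries with hashable values and is meant only to illustrate the semantics; it preserves the original row order in both modes, which is an assumption modeled on pandas behavior.

```python
from typing import Any, Optional


def drop_duplicates(
    rows: list[dict[str, Any]], subset: Optional[list[str]] = None, keep: str = "first"
) -> list[dict[str, Any]]:
    """Drop rows that repeat an earlier (or later) row on the `subset` columns."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    # For keep="last", scan from the end so the last occurrence is seen first.
    ordered = rows if keep == "first" else list(reversed(rows))
    seen: set[tuple] = set()
    kept: list[dict[str, Any]] = []
    for row in ordered:
        key = tuple(row[col] for col in (subset or sorted(row)))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept if keep == "first" else list(reversed(kept))


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
first_kept = drop_duplicates(rows, subset=["id"])
last_kept = drop_duplicates(rows, subset=["id"], keep="last")
```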

`dropna(subset=None)`

Remove rows with missing values.

Parameters:

Name	Type	Description	Default
`subset`	`list[str] \| None`	Column names to consider for null detection. If None, check all columns, by default None	`None`

Returns:

Type	Description
`DataBackend`	New backend with null rows removed

`get_unique(column)`

Get sorted unique values from a column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name	required

Returns:

Type	Description
`list[Any]`	Sorted list of unique values (nulls excluded)

`histogram(column)`

Get value counts (histogram) for a column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name	required

Returns:

Type	Description
`dict[Any, int]`	Dictionary mapping unique values to their counts (nulls excluded)
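
Both get_unique and histogram exclude nulls. A minimal sketch of that behavior on a list of row dictionaries is shown below; the two free functions are hypothetical, and they treat only None as null, whereas a DataFrame library may also count NaN as missing.

```python
from collections import Counter
from typing import Any


def get_unique(rows: list[dict[str, Any]], column: str) -> list[Any]:
    """Sorted unique non-null values of `column`."""
    return sorted({row[column] for row in rows if row[column] is not None})


def histogram(rows: list[dict[str, Any]], column: str) -> dict[Any, int]:
    """Count occurrences of each non-null value of `column`."""
    return dict(Counter(row[column] for row in rows if row[column] is not None))


rows = [{"species": "dog"}, {"species": "cat"}, {"species": None}, {"species": "dog"}]
unique_values = get_unique(rows, "species")
value_counts = histogram(rows, "species")
```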

`map_column(column, mapping, output_column, *, default=None)`

Create a new column by mapping values from an existing column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Source column name	required
`mapping`	`dict[Any, Any]`	Dictionary mapping source values to output values	required
`output_column`	`str`	Name of the new column to create	required
`default`	`Any`	Value to use for unmapped keys, by default None	`None`

Returns:

Type	Description
`DataBackend`	New backend with mapped column added
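
The role of default can be sketched with dict.get on a list of row dictionaries. The function below is a hypothetical illustration of the semantics rather than the backend's code; it returns new rows and leaves the input untouched.

```python
from typing import Any


def map_column(
    rows: list[dict[str, Any]],
    column: str,
    mapping: dict[Any, Any],
    output_column: str,
    default: Any = None,
) -> list[dict[str, Any]]:
    """Add `output_column` by looking up each `column` value in `mapping`."""
    # Values missing from the mapping fall back to `default`.
    return [{**row, output_column: mapping.get(row[column], default)} for row in rows]


rows = [{"species": "cat"}, {"species": "owl"}]
mapped = map_column(rows, "species", {"cat": "mammal"}, "group", default="unknown")
```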

`rename_columns(mapping)`

Rename data columns.

Parameters:

Name	Type	Description	Default
`mapping`	`dict[str, str]`	Dictionary mapping old column names to new names	required

Returns:

Type	Description
`DataBackend`	New backend with renamed columns

`add_column(column, values)`

Add a new column to the data.

Parameters:

Name	Type	Description	Default
`column`	`str`	Name of the new column	required
`values`	`Any`	Values for the new column (scalar or array-like)	required

Returns:

Type	Description
`DataBackend`	New backend with new column added

`select_columns(columns)`

Select a subset of columns from the data.

Parameters:

Name	Type	Description	Default
`columns`	`list[str]`	List of column names to keep	required

Returns:

Type	Description
`DataBackend`	New backend with only specified columns

`concat(backends, *, ignore_index=True, sort=False)` `classmethod`

Concatenate multiple backend instances vertically (row-wise).

Parameters:

Name	Type	Description	Default
`backends`	`list[DataBackend]`	List of backend instances to concatenate	required
`ignore_index`	`bool`	If True, reset index in result, by default True	`True`
`sort`	`bool`	If True, sort columns alphabetically, by default False	`False`

Returns:

Type	Description
`DataBackend`	New backend with concatenated data

`column_exists(column)`

Check if a column exists in the data.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to look for	required

Returns:

Type	Description
`bool`	True if column exists, False otherwise

`subsample_by_column(column, ratios, *, seed=42)`

Subsample rows by column values with specified ratios.

For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.

Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to group by	required
`ratios`	`dict[str, float]`	Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together).	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`

Returns:

Type	Description
`DataBackend`	New backend with subsampled rows

Raises:

Type	Description
`KeyError`	If the specified column does not exist in the DataFrame
`ValueError`	If any ratio is negative or greater than 1.0
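
The pooling behavior of the "other" key can be sketched on a list of row dictionaries. This is a hypothetical illustration, not the backend's implementation. Two details are assumptions: the per-group sample size is truncated with int(), and values with no ratio are kept in full when no "other" key is given.

```python
import random
from typing import Any


def subsample_by_column(
    rows: list[dict[str, Any]],
    column: str,
    ratios: dict[str, float],
    seed: int = 42,
) -> list[dict[str, Any]]:
    """Sample a fraction of rows per column value, pooling unlisted values."""
    if any(not 0.0 <= ratio <= 1.0 for ratio in ratios.values()):
        raise ValueError("Ratios must lie between 0.0 and 1.0")
    if any(column not in row for row in rows):
        raise KeyError(column)

    # Values without their own ratio share a single pooled "other" group.
    groups: dict[Any, list[dict[str, Any]]] = {}
    for row in rows:
        value = row[column]
        key = value if value in ratios else "other"
        groups.setdefault(key, []).append(row)

    rng = random.Random(seed)
    sampled: list[dict[str, Any]] = []
    for key, members in groups.items():
        if key not in ratios:
            sampled.extend(members)  # Assumption: no ratio means keep everything.
            continue
        sampled.extend(rng.sample(members, int(len(members) * ratios[key])))
    return sampled


rows = [{"col1": i, "col2": "abcdefghij"[i % 10]} for i in range(100)]
sub = subsample_by_column(rows, "col2", {"a": 0.5, "b": 0.5, "other": 0.25})
```

With 10 rows each for "a" and "b" and 80 pooled rows for the remaining letters, this keeps 5 + 5 + 20 = 30 rows. Because the 80 rows are pooled before sampling, individual unlisted letters are not guaranteed an equal share of the 20.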

`upsample_by_column(column, target_counts, *, seed=42)`

Upsample rows by column values to target counts with replacement.

For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.

Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to group by	required
`target_counts`	`dict[str, int]`	Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together).	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`

Returns:

Type	Description
`DataBackend`	New backend with upsampled/downsampled rows

Raises:

Type	Description
`KeyError`	If the specified column does not exist in the DataFrame
`ValueError`	If any target count is negative
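
A rough sketch of the upsampling logic on a list of row dictionaries follows. It is hypothetical and makes two assumptions: groups below the target are drawn entirely with replacement (an implementation could instead keep all original rows and draw only the shortfall), and values with no target are kept unchanged when no "other" key is given.

```python
import random
from typing import Any


def upsample_by_column(
    rows: list[dict[str, Any]],
    column: str,
    target_counts: dict[str, int],
    seed: int = 42,
) -> list[dict[str, Any]]:
    """Resample each group to its target count, pooling unlisted values."""
    if any(count < 0 for count in target_counts.values()):
        raise ValueError("Target counts must be non-negative")
    if any(column not in row for row in rows):
        raise KeyError(column)

    groups: dict[Any, list[dict[str, Any]]] = {}
    for row in rows:
        value = row[column]
        key = value if value in target_counts else "other"
        groups.setdefault(key, []).append(row)

    rng = random.Random(seed)
    result: list[dict[str, Any]] = []
    for key, members in groups.items():
        target = target_counts.get(key)
        if target is None:
            result.extend(members)  # Assumption: no target means keep as-is.
        elif target <= len(members):
            result.extend(rng.sample(members, target))  # Downsample, no replacement.
        else:
            result.extend(rng.choices(members, k=target))  # Upsample with replacement.
    return result


rows = [{"label": "rare"}] * 2 + [{"label": "common"}] * 10
balanced = upsample_by_column(rows, "label", {"rare": 6, "common": 6})
```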

`sample_rows(n, *, seed=42, replace=False)`

Randomly sample n rows from the data.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of rows to sample	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`
`replace`	`bool`	Whether to sample with replacement, by default False	`False`

Returns:

Type	Description
`DataBackend`	New backend with sampled rows

`copy()`

Create a copy of the backend with copied data.

Returns:

Type	Description
`DataBackend`	New backend instance with copied data

`apply_fn(fn, **fn_kwargs)`

Apply a custom function to the underlying data.

Parameters:

Name	Type	Description	Default
`fn`	`Callable`	Function to apply. It should accept the underlying data type (e.g., pd.DataFrame, pl.DataFrame) as the first argument.	required
`**fn_kwargs`	`Any`	Keyword arguments to pass to the function	`{}`

Returns:

Type	Description
`DataBackend`	New backend wrapping the result of the function application

`multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=False)`

Create a multilabel column from multiple feature columns.

Parameters:

Name	Type	Description	Default
`input_features`	`list[str]`	List of input feature column names to combine	required
`output_feature`	`str`	Name of the output multilabel column	required
`label_map`	`dict[str, Any] \| None`	Optional mapping from input feature values to output labels, by default None	`None`
`allow_missing_labels`	`bool`	If True, ignore missing labels in input features, by default False	`False`

Returns:

Type	Description
`tuple[DataBackend, dict]`	New backend with multilabel column and metadata dictionary

`alp_data.backends.PandasBackend`

Pandas implementation of the DataBackend protocol.

This backend wraps a pandas DataFrame and provides a unified interface for DataFrame operations that can work across different backend implementations.

Supports both eager (in-memory) and streaming (chunked) modes.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame \| TextFileReader`	The pandas DataFrame to wrap, or TextFileReader for streaming	required
`streaming`	`bool`	Whether the backend is in streaming mode	`False`
`streaming_chunk_size`	`int`	Number of rows per chunk in streaming mode	`1000`

Examples:

>>> import pandas as pd
>>> from alp_data.backends import PandasBackend
>>> df = pd.DataFrame({"col1": range(100), "col2": list("abcdefghij") * 10})
>>> backend = PandasBackend(df, streaming=False)
>>> backend[0]  # Get first row as dict
{'col1': 0, 'col2': 'a'}
>>> backend[[0, 5, 10]]  # Get rows 0, 5, 10 as new backend
PandasBackend(shape=(3, 2))
>>> filtered = backend.filter_isin("col2", ["a", "b"])
>>> len(filtered)  # Number of rows with col2 in ['a', 'b']
20
>>> backend.columns  # List of column names
['col1', 'col2']
>>> backend.column_exists("col1")  # Check if column exists
True
>>> sub = backend.subsample_by_column("col2", {"a": 0.5, "b": 0.5, "other": 0.1})
>>> counts = sub.unwrap["col2"].count()  # Subsampled counts
>>> assert counts <= 20

`columns` `property`

Get the list of column names.

Returns:

Type	Description
`list[str]`	List of column names

`is_streaming` `property`

Check if backend is in streaming mode.

Returns:

Type	Description
`bool`	True if in streaming mode, False otherwise

`unwrap` `property`

Get the underlying DataFrame object.

Returns:

Type	Description
`DataFrame`	The underlying pandas DataFrame

`__getitem__(key)`

Get row(s) from the DataFrame using Pythonic indexing.

Parameters:

Name	Type	Description	Default
`key`	`int \| list[int] \| slice`	`int`: get a single row as a dict; `list[int]`: get multiple rows as a new backend; `slice`: get a row range as a new backend	required

Returns:

Type	Description
`dict[str, Any] \| PandasBackend`	`dict` if key is an int (single row); `PandasBackend` if key is a list or slice (multiple rows)

Raises:

Type	Description
`IndexError`	If index is out of bounds
`TypeError`	If key type is not supported
`RuntimeError`	If backend is in streaming mode

Examples:

>>> import pandas as pd
>>> df = pd.DataFrame({"col1": range(100), "col2": list("abcdefghij") * 10})
>>> from alp_data.backends import PandasBackend
>>> backend = PandasBackend(df)
>>> backend[0]  # Get first row as dict
{'col1': 0, 'col2': 'a'}
>>> backend[[0, 5, 10]]  # Get rows 0, 5, 10 as new backend
PandasBackend(shape=(3, 2))
>>> backend[5:]  # Get rows from index 5 to end
PandasBackend(shape=(95, 2))
>>> backend[:10]  # Get first 10 rows
PandasBackend(shape=(10, 2))

`__init__(df, *, streaming=False, streaming_chunk_size=1000)`

Initialize the backend with a pandas DataFrame.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame \| TextFileReader`	The DataFrame to wrap, or TextFileReader for streaming mode	required
`streaming`	`bool`	Whether to use streaming mode, by default False	`False`
`streaming_chunk_size`	`int`	Number of rows per chunk in streaming mode, by default 1000	`1000`

`__iter__()`

Iterate over DataFrame rows as dictionaries.

In streaming mode, yields rows from chunks as they are read. In eager mode, yields rows from the loaded DataFrame.

Yields:

Type	Description
`dict[str, Any]`	Dictionary for each row mapping column names to values

`__len__()`

Get the number of rows in the DataFrame.

Returns:

Type	Description
`int`	Number of rows

Raises:

Type	Description
`RuntimeError`	If backend is in streaming mode (length unknown until consumed)

`__repr__()`

Return string representation of the backend.

Returns:

Type	Description
`str`	String representation showing backend type and DataFrame shape

`add_column(column, values)`

Add a new column to the DataFrame.

Parameters:

Name	Type	Description	Default
`column`	`str`	Name of the new column	required
`values`	`Any`	Values for the new column (scalar or array-like)	required

Returns:

Type	Description
`PandasBackend`	New backend with new column added

`apply_fn(fn, fn_kwargs, apply_kwargs)`

Apply a function to the DataFrame.

Parameters:

Name	Type	Description	Default
`fn`	`Callable`	Function to apply to the DataFrame. Should accept a DataFrame as the first argument and return a modified DataFrame.	required
`fn_kwargs`	`Any`	Additional keyword arguments to pass to the function	required
`apply_kwargs`	`dict`	Additional keyword arguments to pass to pandas.DataFrame.apply(), e.g. engine="numba"	required

Returns:

Type	Description
`PandasBackend`	New backend with modified DataFrame

Raises:

Type	Description
`ValueError`	If the function does not return a pandas DataFrame

`column_exists(column)`

Check if a column exists in the DataFrame.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to look for	required

Returns:

Type	Description
`bool`	True if column exists, False otherwise

`concat(backends, *, ignore_index=True, sort=False)` `classmethod`

Concatenate multiple backend instances vertically (row-wise).

Parameters:

Name	Type	Description	Default
`backends`	`list[PandasBackend]`	List of backend instances to concatenate	required
`ignore_index`	`bool`	If True, reset index in result, by default True	`True`
`sort`	`bool`	If True, sort columns alphabetically, by default False	`False`

Returns:

Type	Description
`PandasBackend`	New backend with concatenated data

`copy()`

Create a copy of the backend with a copied DataFrame.

Returns:

Type	Description
`PandasBackend`	New backend instance with copied DataFrame

`drop_duplicates(subset=None, *, keep='first')`

Remove duplicate rows from the DataFrame.

Parameters:

Name	Type	Description	Default
`subset`	`list[str] \| None`	Column names to consider for identifying duplicates. If None, use all columns, by default None	`None`
`keep`	`Literal['first', 'last']`	Which duplicate to keep, by default "first"	`'first'`

Returns:

Type	Description
`PandasBackend`	New backend with duplicates removed

`dropna(subset=None)`

Remove rows with missing values.

Parameters:

Name	Type	Description	Default
`subset`	`list[str] \| None`	Column names to consider for null detection. If None, check all columns, by default None	`None`

Returns:

Type	Description
`PandasBackend`	New backend with null rows removed

`filter_isin(column, values, *, negate=False)`

Filter DataFrame rows where column values are in (or not in) a list.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to filter on	required
`values`	`list[Any]`	List of values to match	required
`negate`	`bool`	If True, keep rows NOT in values list, by default False	`False`

Returns:

Type	Description
`PandasBackend`	New backend with filtered DataFrame

`from_csv(path, *, streaming=False, streaming_chunk_size=1000, **kwargs)` `classmethod`

Read a CSV file and return a wrapped DataFrame backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the CSV file (supports local and cloud paths via cloudpathlib)	required
`streaming`	`bool`	If True, use streaming mode with chunked reading, by default False	`False`
`streaming_chunk_size`	`int`	Number of rows per chunk in streaming mode, by default 1000	`1000`
`**kwargs`	`Any`	Additional pandas-specific arguments	`{}`

Returns:

Type	Description
`PandasBackend`	Backend instance wrapping the loaded DataFrame

`from_json(path, *, lines=False, streaming=False, streaming_chunk_size=1000, **kwargs)` `classmethod`

Read a JSON file and return a wrapped DataFrame backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the JSON file	required
`lines`	`bool`	If True, read file as JSON lines (one JSON object per line), by default False	`False`
`streaming`	`bool`	If True, use streaming mode with chunked reading, by default False	`False`
`streaming_chunk_size`	`int`	Number of rows per chunk in streaming mode, by default 1000	`1000`
`**kwargs`	`Any`	Additional pandas-specific arguments	`{}`

Returns:

Type	Description
`PandasBackend`	Backend instance wrapping the loaded DataFrame

`from_parquet(path, *, streaming=False, **kwargs)` `classmethod`

Read a Parquet file and return a wrapped DataFrame backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the Parquet file	required
`streaming`	`bool`	If True, use streaming mode (not supported for parquet in pandas), by default False	`False`
`**kwargs`	`Any`	Additional pandas-specific arguments	`{}`

Returns:

Type	Description
`PandasBackend`	Backend instance wrapping the loaded DataFrame

Note

Pandas does not natively support streaming parquet files. Consider using polars backend for large parquet files.

`get_unique(column)`

Get sorted unique values from a column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name	required

Returns:

Type	Description
`list[Any]`	Sorted list of unique values (nulls excluded)

`histogram(column)`

Get value counts (histogram) for a column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name	required

Returns:

Type	Description
`dict[Any, int]`	Dictionary mapping unique values to their counts (nulls excluded)

`map_column(column, mapping, output_column)`

Create a new column by mapping values from an existing column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Source column name	required
`mapping`	`dict[Any, Any]`	Dictionary mapping source values to output values	required
`output_column`	`str`	Name of the new column to create	required

Returns:

Type	Description
`PandasBackend`	New backend with mapped column added

`multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=False)`

Create a multi-label column by combining multiple input feature columns. Each row in the output column will contain a sorted list of integer IDs corresponding to the labels found in the specified input feature columns.

Parameters:

Name	Type	Description	Default
`input_features`	`list[str]`	List of column names to use as sources for labels. Each column can contain single values or lists of values.	required
`output_feature`	`str`	Name of the output column to store the generated label lists.	required
`label_map`	`dict[str, Any] \| None`	Mapping of unique label values to integer IDs. If None, a mapping will be generated from the unique values in the input features.	`None`
`allow_missing_labels`	`bool`	If True, rows with no labels will be included in the output. If False, rows with no labels will be dropped. Default is False.	`False`

Returns:

Type	Description
`tuple[PandasBackend, dict]`	A tuple containing (1) a new PandasBackend instance with the added multi-label column and (2) the label_map used for mapping labels to IDs
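
The behavior described above can be sketched on a list of row dictionaries. This is a hypothetical illustration, not the PandasBackend code. It assumes that generated IDs start at 0 and follow the sorted order of the labels, and that only None counts as a missing value.

```python
from typing import Any, Optional


def multilabel_from_features(
    rows: list[dict[str, Any]],
    input_features: list[str],
    output_feature: str,
    label_map: Optional[dict[str, int]] = None,
    allow_missing_labels: bool = False,
) -> tuple[list[dict[str, Any]], dict[str, int]]:
    """Combine several label columns into one column of sorted integer IDs."""

    def labels_of(row: dict[str, Any]) -> list[str]:
        found: list[str] = []
        for feature in input_features:
            value = row.get(feature)
            if value is None:
                continue
            # A cell may hold a single label or a list of labels.
            found.extend(value if isinstance(value, list) else [value])
        return found

    if label_map is None:
        unique = sorted({label for row in rows for label in labels_of(row)})
        label_map = {label: idx for idx, label in enumerate(unique)}

    result: list[dict[str, Any]] = []
    for row in rows:
        ids = sorted({label_map[label] for label in labels_of(row)})
        # Rows without any label are dropped unless explicitly allowed.
        if ids or allow_missing_labels:
            result.append({**row, output_feature: ids})
    return result, label_map


rows = [
    {"primary": "owl", "secondary": ["wren", "owl"]},
    {"primary": None, "secondary": []},
    {"primary": "wren", "secondary": None},
]
labeled, used_map = multilabel_from_features(rows, ["primary", "secondary"], "labels")
```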

`rename_columns(mapping)`

Rename DataFrame columns.

Parameters:

Name	Type	Description	Default
`mapping`	`dict[str, str]`	Dictionary mapping old column names to new names	required

Returns:

Type	Description
`PandasBackend`	New backend with renamed columns

`sample_rows(n, *, seed=42, replace=False)`

Randomly sample n rows from the DataFrame.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of rows to sample	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`
`replace`	`bool`	Whether to sample with replacement, by default False	`False`

Returns:

Type	Description
`PandasBackend`	New backend with sampled rows

`select_columns(columns)`

Select a subset of columns from the DataFrame.

Parameters:

Name	Type	Description	Default
`columns`	`list[str]`	List of column names to keep	required

Returns:

Type	Description
`PandasBackend`	New backend with only specified columns

`subsample_by_column(column, ratios, *, seed=42)`

Subsample rows by column values with specified ratios.

For each unique value in the column, sample the specified ratio of rows. Special key "other" can be used to subsample all values not explicitly listed.

Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to group by	required
`ratios`	`dict[str, float]`	Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together).	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`

Returns:

Type	Description
`PandasBackend`	New backend with subsampled rows

Raises:

Type	Description
`KeyError`	If the specified column does not exist in the DataFrame
`ValueError`	If any ratio is negative or greater than 1.0
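
The pooled behaviour of the "other" key is easy to misread, so here is a minimal pure-Python sketch of the semantics described above. It operates on a list of row dicts, not on a DataFrame. The rounding rule (`int()` truncation) and the choice to keep unlisted rows untouched when no "other" ratio is given are assumptions of this sketch, not documented behaviour.

```python
import random
from collections import defaultdict
from typing import Any


def subsample_by_column(
    rows: list[dict[str, Any]], column: str, ratios: dict[str, float], seed: int = 42
) -> list[dict[str, Any]]:
    """Hypothetical stand-in for ratio-based subsampling with a pooled "other" group."""
    for value, ratio in ratios.items():
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"ratio for {value!r} must be within [0.0, 1.0], got {ratio}")

    rng = random.Random(seed)
    groups: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        if column not in row:
            raise KeyError(column)
        value = row[column]
        # Every value without its own ratio lands in ONE shared pool keyed by "other".
        groups[value if value in ratios else "other"].append(row)

    sampled: list[dict[str, Any]] = []
    for key, members in groups.items():
        if key not in ratios:
            # Assumption: with no "other" ratio, unlisted rows are kept as they are.
            sampled.extend(members)
            continue
        k = int(len(members) * ratios[key])  # assumption: truncate to a whole number of rows
        sampled.extend(rng.sample(members, k))
    return sampled


rows = [{"species": "cat"}] * 10 + [{"species": "dog"}] * 4 + [{"species": "fish"}] * 6
result = subsample_by_column(rows, "species", {"cat": 0.5, "other": 0.5})
```

Here "dog" and "fish" are pooled into ten rows and five of them are drawn in total. The split between the two species is left to chance instead of being fixed at two and three.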

`upsample_by_column(column, target_counts, *, seed=42)`

Upsample rows by column values to target counts with replacement.

For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.

Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to group by	required
`target_counts`	`dict[str, int]`	Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together).	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`

Returns:

Type	Description
`PandasBackend`	New backend with upsampled/downsampled rows

Raises:

Type	Description
`KeyError`	If the specified column does not exist in the DataFrame
`ValueError`	If any target count is negative
`TypeError`	If any target count is not an integer
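
The per-group rule (upsample with replacement, downsample without) can be sketched in pure Python. This is a hypothetical helper for a single group of rows. In particular, drawing all `target` rows with replacement when upsampling (rather than keeping every original row and adding extras) is an assumption of this sketch.

```python
import random
from typing import Any


def resample_group(members: list[Any], target: int, rng: random.Random) -> list[Any]:
    """Hypothetical helper: bring one group of rows to exactly `target` rows."""
    if isinstance(target, bool) or not isinstance(target, int):
        raise TypeError(f"target count must be an integer, got {type(target).__name__}")
    if target < 0:
        raise ValueError(f"target count must be non-negative, got {target}")
    if not members:
        return []  # nothing to draw from
    if len(members) >= target:
        # The group is already large enough: downsample WITHOUT replacement.
        return rng.sample(members, target)
    # The group is too small: draw WITH replacement (assumption: all rows are redrawn).
    return rng.choices(members, k=target)


rng = random.Random(42)
upsampled = resample_group(["a", "b"], 5, rng)
downsampled = resample_group(list(range(10)), 3, rng)
```

`upsampled` has five entries drawn from two originals, so it necessarily contains repeats. `downsampled` has three distinct entries.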

`alp_data.backends.PolarsBackend`

Polars implementation of the DataBackend protocol.

This backend wraps a polars DataFrame or LazyFrame and provides a unified interface for DataFrame operations that can work across different backend implementations.

Supports both eager (DataFrame) and streaming (LazyFrame) modes.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame \| LazyFrame`	The polars DataFrame or LazyFrame to wrap	required
`streaming`	`bool`	Whether the backend is in streaming mode (LazyFrame)	`False`
`streaming_chunk_size`	`int`	Number of rows per batch when iterating in streaming mode. The default of 1000 is high enough to reduce I/O overhead; larger values bring little benefit because the main source of latency in Dataset `__getitem__` calls is audio loading.	`1000`

Examples:

>>> import polars as pl
>>> from alp_data.backends import PolarsBackend
>>> df = pl.DataFrame({
...     "species": ["cat", "dog", "fish", "cat", "dog", None],
...     "count": [5, 3, 8, 2, 7, 1]
... })
>>> backend = PolarsBackend(df)
>>> row = backend[0]
>>> filtered = backend.filter_isin("species", ["cat", "dog"])
>>> assert filtered.unwrap["species"].to_list() == ["cat", "dog", "cat", "dog"]
>>> # Streaming mode with LazyFrame
>>> df = pl.LazyFrame({
...     "species": ["cat", "dog", "fish", "cat", "dog", None],
...     "count": [5, 3, 8, 2, 7, 1]
... })
>>> backend = PolarsBackend(df, streaming=True)
>>> assert isinstance(backend.unwrap, pl.LazyFrame)
>>> print(backend.columns)
['species', 'count']
>>> for row in backend:
...     print(row)
...     break
{'species': 'cat', 'count': 5}
>>> collected = backend.collect()
>>> assert isinstance(collected.unwrap, pl.DataFrame)

`columns` `property`

Get the list of column names.

Returns:

Type	Description
`list[str]`	List of column names

`is_streaming` `property`

Check if backend is in streaming mode.

Returns:

Type	Description
`bool`	True if in streaming mode (LazyFrame), False otherwise

`unwrap` `property`

Get the underlying DataFrame object.

Returns:

Type	Description
`DataFrame \| LazyFrame`	The underlying polars DataFrame or LazyFrame

`__getitem__(key)`

Get row(s) from the DataFrame using Pythonic indexing.

Parameters:

Name	Type	Description	Default
`key`	`int \| list[int] \| slice`	int: get a single row as a dict; list[int]: get multiple rows as a new backend; slice: get a row range as a new backend.	required

Returns:

Type	Description
`dict[str, Any] \| PolarsBackend`	A dict if key is an int (single row); a PolarsBackend if key is a list or slice (multiple rows).

Raises:

Type	Description
`IndexError`	If index is out of bounds
`TypeError`	If key type is not supported
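
The key-type dispatch described above can be sketched with a small stand-in class. `RowStore` is hypothetical and wraps a plain list of dicts. It only mirrors the documented contract (int gives a dict, list or slice gives a new wrapper, anything else raises `TypeError`) and says nothing about how the real backend indexes a polars frame.

```python
from typing import Any


class RowStore:
    """Hypothetical stand-in that mimics the documented indexing contract."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = list(rows)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, key: "int | list[int] | slice") -> "dict[str, Any] | RowStore":
        if isinstance(key, int):
            # Single row as a dict; list indexing raises IndexError when out of bounds.
            return dict(self._rows[key])
        if isinstance(key, slice):
            # Row range wrapped in a new store.
            return RowStore(self._rows[key])
        if isinstance(key, list):
            # Explicit row positions wrapped in a new store.
            return RowStore([self._rows[i] for i in key])
        raise TypeError(f"unsupported key type: {type(key).__name__}")


store = RowStore([{"species": "cat"}, {"species": "dog"}, {"species": "fish"}])
single = store[0]
window = store[1:3]
picked = store[[0, 2]]
```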

`__init__(df, *, streaming=False, streaming_chunk_size=1000)`

Initialize the backend with a polars DataFrame or LazyFrame.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame \| LazyFrame`	The DataFrame or LazyFrame to wrap	required
`streaming`	`bool`	Whether to use streaming mode (LazyFrame), by default False	`False`
`streaming_chunk_size`	`int`	Number of rows per batch when iterating in streaming mode, by default 1000	`1000`

`__iter__()`

Iterate over DataFrame rows as dictionaries.

In streaming mode (LazyFrame), uses LazyFrame.collect_batches() to materialize the query one chunk at a time, so the full result never needs to live in memory at once. In eager mode (DataFrame), yields rows directly.

Yields:

Type	Description
`dict[str, Any]`	Dictionary for each row mapping column names to values
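
As a rough illustration of chunk-wise row iteration, the sketch below turns a stream of columnar batches into row dicts one batch at a time. Representing a batch as a dict of equal-length column lists is an assumption made for this example; the real backend works on polars objects.

```python
from typing import Any, Iterable, Iterator


def iter_rows(batches: Iterable[dict[str, list[Any]]]) -> Iterator[dict[str, Any]]:
    """Hypothetical helper: yield row dicts from columnar batches, one batch at a time."""
    for batch in batches:
        columns = list(batch)
        # zip(*columns) walks the batch row by row; only the current batch is in memory.
        for values in zip(*(batch[name] for name in columns)):
            yield dict(zip(columns, values))


def fake_batches() -> Iterator[dict[str, list[Any]]]:
    """Stand-in for a lazily produced stream of batches."""
    yield {"species": ["cat", "dog"], "count": [5, 3]}
    yield {"species": ["fish"], "count": [8]}


first_row = next(iter_rows(fake_batches()))
all_rows = list(iter_rows(fake_batches()))
```

Because `iter_rows` is a generator, asking for the first row consumes only the first batch.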

`__len__()`

Get the number of rows in the DataFrame.

Returns:

Type	Description
`int`	Number of rows

`__repr__()`

Return string representation of the backend.

Returns:

Type	Description
`str`	String representation showing backend type and DataFrame shape

`add_column(column, values)`

Add a new column to the DataFrame.

Parameters:

Name	Type	Description	Default
`column`	`str`	Name of the new column	required
`values`	`Any`	Values for the new column (scalar or array-like)	required

Returns:

Type	Description
`PolarsBackend`	New backend with new column added

`apply_fn(fn, fn_kwargs, apply_kwargs)`

Apply a custom function to rows and create a new column.

Parameters:

Name	Type	Description	Default
`fn`	`Any`	Function to apply to each row. Should accept a dict of column values.	required
`fn_kwargs`	`dict`	Additional keyword arguments to pass to the function	required
`apply_kwargs`	`dict`	Additional keyword arguments to pass to polars.DataFrame.map_rows()	required

Returns:

Type	Description
`PolarsBackend`	New backend with the new column added

Notes

This method collects the DataFrame if in streaming mode, as polars does not support arbitrary row-wise functions in LazyFrame. The returned backend will be in eager mode (streaming=False).

`collect()`

Materialize the LazyFrame and return an eager backend.

This method collects the LazyFrame into a DataFrame and returns a new backend with streaming mode disabled.

Returns:

Type	Description
`PolarsBackend`	New backend in eager mode with materialized DataFrame

Notes

If the backend is already in eager mode, returns a copy of the backend.

`column_exists(column)`

Check if a column exists in the DataFrame.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to look for	required

Returns:

Type	Description
`bool`	True if column exists, False otherwise

`concat(backends, *, ignore_index=True, sort=False)` `classmethod`

Concatenate multiple backend instances vertically (row-wise).

Parameters:

Name	Type	Description	Default
`backends`	`list[PolarsBackend]`	List of backend instances to concatenate	required
`sort`	`bool`	If True, sort columns alphabetically, by default False	`False`

Returns:

Type	Description
`PolarsBackend`	New backend with concatenated data

`copy()`

Create a copy of the backend with a copied DataFrame.

Returns:

Type	Description
`PolarsBackend`	New backend instance with copied DataFrame

`drop_duplicates(subset=None, *, keep='first')`

Remove duplicate rows from the DataFrame.

Parameters:

Name	Type	Description	Default
`subset`	`list[str] \| None`	Column names to consider for identifying duplicates. If None, use all columns, by default None	`None`
`keep`	`Literal['first', 'last']`	Which duplicate to keep, by default "first"	`'first'`

Returns:

Type	Description
`PolarsBackend`	New backend with duplicates removed

Notes

In streaming mode (LazyFrame), this operation preserves the lazy computation. Call .collect() to materialize the deduplicated result into an eager backend.
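
The effect of `subset` and `keep` can be shown with a pure-Python sketch over a list of row dicts. The helper is hypothetical. Preserving the original row order in the output is a choice made for this sketch and is not something the documentation above promises.

```python
from typing import Any, Literal, Optional


def drop_duplicates(
    rows: list[dict[str, Any]],
    subset: Optional[list[str]] = None,
    keep: Literal["first", "last"] = "first",
) -> list[dict[str, Any]]:
    """Hypothetical stand-in: remove rows whose key columns repeat an earlier (or later) row."""
    if keep not in ("first", "last"):
        raise ValueError(f"keep must be 'first' or 'last', got {keep!r}")
    # Walking the rows backwards turns "keep the last" into "keep the first seen".
    ordered = rows if keep == "first" else list(reversed(rows))
    seen: set[tuple[Any, ...]] = set()
    kept: list[dict[str, Any]] = []
    for row in ordered:
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[name] for name in columns)  # values must be hashable
        if key not in seen:
            seen.add(key)
            kept.append(row)
    if keep == "last":
        kept.reverse()  # restore the original ordering
    return kept


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
keep_first = drop_duplicates(rows, subset=["id"])
keep_last = drop_duplicates(rows, subset=["id"], keep="last")
```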

`dropna(subset=None)`

Remove rows with missing values.

Parameters:

Name	Type	Description	Default
`subset`	`list[str] \| None`	Column names to consider for null detection. If None, check all columns, by default None	`None`

Returns:

Type	Description
`PolarsBackend`	New backend with null rows removed

Notes

In streaming mode (LazyFrame), this operation preserves the lazy computation. Call .collect() to materialize the cleaned result into an eager backend.

`filter_isin(column, values, *, negate=False)`

Filter DataFrame rows where column values are in (or not in) a list.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to filter on	required
`values`	`list[Any]`	List of values to match	required
`negate`	`bool`	If True, keep rows NOT in values list, by default False	`False`

Returns:

Type	Description
`PolarsBackend`	New backend with filtered DataFrame

Notes

In streaming mode (LazyFrame), this operation preserves the lazy computation. Call .collect() to materialize the filtered result into an eager backend.
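
The `negate` flag simply inverts the membership test. A minimal pure-Python sketch over row dicts follows; the helper is hypothetical, and null handling is deliberately left out because the documentation above does not specify it for `negate=True`.

```python
from typing import Any


def filter_isin(
    rows: list[dict[str, Any]], column: str, values: list[Any], negate: bool = False
) -> list[dict[str, Any]]:
    """Hypothetical stand-in: keep rows whose column value is (or is not) in `values`."""
    allowed = set(values)  # set lookup keeps the membership test cheap
    # `!= negate` flips the test: with negate=True, rows that match are dropped instead.
    return [row for row in rows if (row[column] in allowed) != negate]


rows = [{"species": s} for s in ["cat", "dog", "fish", "cat", "dog"]]
matching = filter_isin(rows, "species", ["cat", "dog"])
remaining = filter_isin(rows, "species", ["cat", "dog"], negate=True)
```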

`from_csv(path, *, streaming=False, **kwargs)` `classmethod`

Read a CSV file and return a wrapped DataFrame backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the CSV file (supports local and cloud paths via cloudpathlib)	required
`streaming`	`bool`	If True, use streaming mode with LazyFrame, by default False	`False`
`**kwargs`	`Any`	Additional polars-specific arguments	`{}`

Returns:

Type	Description
`PolarsBackend`	Backend instance wrapping the loaded DataFrame or LazyFrame

`from_json(path, *, lines=False, streaming=False, **kwargs)` `classmethod`

Read a JSON file and return a wrapped DataFrame backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the JSON file	required
`lines`	`bool`	If True, read file as JSON lines (one JSON object per line), by default False	`False`
`streaming`	`bool`	If True, use streaming mode with LazyFrame, by default False	`False`
`**kwargs`	`Any`	Additional polars-specific arguments	`{}`

Returns:

Type	Description
`PolarsBackend`	Backend instance wrapping the loaded DataFrame

`from_parquet(path, *, streaming=False, **kwargs)` `classmethod`

Read a Parquet file and return a wrapped DataFrame backend.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to the Parquet file	required
`streaming`	`bool`	If True, use streaming mode with LazyFrame, by default False	`False`
`**kwargs`	`Any`	Additional polars-specific arguments	`{}`

Returns:

Type	Description
`PolarsBackend`	Backend instance wrapping the loaded DataFrame or LazyFrame

`get_unique(column)`

Get sorted unique values from a column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name	required

Returns:

Type	Description
`list[Any]`	Sorted list of unique values (nulls excluded)

Notes

In streaming mode (LazyFrame), materializes the full column to compute uniques. A UserWarning is emitted because this forces collection of the underlying query.
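
The documented result (sorted, nulls excluded) corresponds to the following pure-Python sketch over a list of column values. The helper is a hypothetical stand-in and assumes the non-null values are mutually comparable.

```python
from typing import Any


def get_unique(values: list[Any]) -> list[Any]:
    """Hypothetical stand-in: sorted unique values with None (null) entries dropped."""
    return sorted({value for value in values if value is not None})


species = ["cat", "dog", "fish", "cat", "dog", None]
unique_species = get_unique(species)  # ['cat', 'dog', 'fish']
```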

`histogram(column)`

Get value counts (histogram) for a column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name	required

Returns:

Type	Description
`dict[Any, int]`	Dictionary mapping unique values to their counts (nulls excluded)

Notes

In streaming mode (LazyFrame), materializes the full column to compute counts. A UserWarning is emitted because this forces collection of the underlying query.
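
The shape of the returned dictionary can be sketched with `collections.Counter`. The helper below is a hypothetical stand-in that works on a plain list of column values and, as documented above, leaves nulls out of the counts.

```python
from collections import Counter
from typing import Any


def histogram(values: list[Any]) -> dict[Any, int]:
    """Hypothetical stand-in: count occurrences of each non-null value."""
    return dict(Counter(value for value in values if value is not None))


species = ["cat", "dog", "fish", "cat", "dog", None]
counts = histogram(species)  # {'cat': 2, 'dog': 2, 'fish': 1}
```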

`iter_batches(batch_size=1000)`

Iterate over DataFrame in batches.

Parameters:

Name	Type	Description	Default
`batch_size`	`int`	Number of rows per batch, by default 1000	`1000`

Yields:

Type	Description
`PolarsBackend`	Backend instances wrapping batches of up to batch_size rows. Yielded backends are always in eager mode.

Notes

In streaming mode, uses LazyFrame.collect_batches(chunk_size=batch_size) to produce batches incrementally, so the full result never needs to live in memory at once. Polars treats the chunk size as a hint rather than a strict requirement, so a batch may contain fewer than batch_size rows.
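
Incremental batching can be sketched with `itertools.islice` over any row iterator. This hypothetical helper yields plain lists rather than backend instances and always fills each batch except the last, which is stricter than the hint-based behaviour described for polars.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def iter_batches(rows: Iterable[Any], batch_size: int = 1000) -> Iterator[list[Any]]:
    """Hypothetical stand-in: yield lists of up to `batch_size` rows without loading everything."""
    if batch_size <= 0:
        raise ValueError(f"batch_size must be positive, got {batch_size}")
    iterator = iter(rows)
    while True:
        # islice pulls at most batch_size items, so only one batch is held in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(range(5), batch_size=2))  # [[0, 1], [2, 3], [4]]
```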

`map_column(column, mapping, output_column, *, default=None)`

Create a new column by mapping values from an existing column.

Parameters:

Name	Type	Description	Default
`column`	`str`	Source column name	required
`mapping`	`dict[Any, Any]`	Dictionary mapping source values to output values	required
`output_column`	`str`	Name of the new column to create	required
`default`	`Any`	Value to use for unmapped keys, by default None	`None`

Returns:

Type	Description
`PolarsBackend`	New backend with mapped column added
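
The role of `default` is easiest to see on a small example. The sketch below is a hypothetical pure-Python stand-in operating on row dicts. It returns new rows and leaves the input untouched, in the same spirit as the method returning a new backend.

```python
from typing import Any


def map_column(
    rows: list[dict[str, Any]],
    column: str,
    mapping: dict[Any, Any],
    output_column: str,
    default: Any = None,
) -> list[dict[str, Any]]:
    """Hypothetical stand-in: add `output_column` by looking up each source value in `mapping`."""
    # dict.get supplies `default` for any source value that has no entry in the mapping.
    return [{**row, output_column: mapping.get(row[column], default)} for row in rows]


rows = [{"species": "cat"}, {"species": "dog"}, {"species": "fish"}]
mapped = map_column(rows, "species", {"cat": "mammal", "dog": "mammal"}, "group", default="unknown")
```

With this mapping, "fish" has no entry, so its `group` value falls back to `"unknown"`.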

`multilabel_from_features(input_features, output_feature, label_map=None, allow_missing_labels=True)`

Create a multi-label column by combining multiple input feature columns. Each row in the output column will contain a sorted list of integer IDs corresponding to the labels found in the specified input feature columns.

Parameters:

Name	Type	Description	Default
`input_features`	`list[str]`	List of column names to use as sources for labels. Each column can contain single values or lists of values.	required
`output_feature`	`str`	Name of the output column to store the generated label lists.	required
`label_map`	`dict[str, Any] \| None`	Mapping of unique label values to integer IDs. If None, a mapping will be generated from the unique values in the input features.	`None`
`allow_missing_labels`	`bool`	If True, rows with no labels will be included in the output. If False, rows with no labels will be dropped. Default is True.	`True`

Returns:

Type	Description
`tuple[PolarsBackend, dict]`	A tuple containing: (1) a new PolarsBackend instance with the added multi-label column; (2) the label_map used for mapping labels to IDs.

Raises:

Type	Description
`ValueError`	If any input feature does not exist or is not of type List.
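
The description above can be turned into a small pure-Python sketch over row dicts. Everything specific here is an assumption for illustration: IDs are assigned from 0 in sorted label order when no `label_map` is given, repeated labels within a row are collapsed, and a label missing from a supplied `label_map` raises `KeyError`. None of these details are confirmed by the documentation.

```python
from typing import Any, Optional


def multilabel_from_features(
    rows: list[dict[str, Any]],
    input_features: list[str],
    output_feature: str,
    label_map: Optional[dict[str, int]] = None,
    allow_missing_labels: bool = True,
) -> tuple[list[dict[str, Any]], dict[str, int]]:
    """Hypothetical stand-in: merge labels from several columns into a sorted list of IDs."""

    def labels_of(row: dict[str, Any]) -> list[str]:
        found: list[str] = []
        for feature in input_features:
            value = row[feature]
            if value is None:
                continue
            # A cell may hold one label or a list of labels.
            found.extend(value if isinstance(value, list) else [value])
        return found

    if label_map is None:
        # Assumption: IDs follow the sorted order of all labels seen in the data.
        unique = sorted({label for row in rows for label in labels_of(row)})
        label_map = {label: index for index, label in enumerate(unique)}

    result: list[dict[str, Any]] = []
    for row in rows:
        ids = sorted({label_map[label] for label in labels_of(row)})
        if not ids and not allow_missing_labels:
            continue  # drop rows that carry no label at all
        result.append({**row, output_feature: ids})
    return result, label_map


rows = [
    {"species": ["cat", "dog"], "call": "meow"},
    {"species": [], "call": None},
    {"species": ["dog"], "call": "bark"},
]
labelled, mapping = multilabel_from_features(rows, ["species", "call"], "labels")
strict, _ = multilabel_from_features(rows, ["species", "call"], "labels", allow_missing_labels=False)
```

In this example the second row has no labels, so it gets an empty list in `labelled` and is dropped from `strict`.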

`rename_columns(mapping)`

Rename DataFrame columns.

Parameters:

Name	Type	Description	Default
`mapping`	`dict[str, str]`	Dictionary mapping old column names to new names	required

Returns:

Type	Description
`PolarsBackend`	New backend with renamed columns

`sample_rows(n, *, seed=42, replace=False)`

Randomly sample n rows from the DataFrame.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of rows to sample	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`
`replace`	`bool`	Whether to sample with replacement, by default False	`False`

Returns:

Type	Description
`PolarsBackend`	New backend with sampled rows

`select_columns(columns)`

Select a subset of columns from the DataFrame.

Parameters:

Name	Type	Description	Default
`columns`	`list[str]`	List of column names to keep	required

Returns:

Type	Description
`PolarsBackend`	New backend with only specified columns

`subsample_by_column(column, ratios, *, seed=42)`

Subsample rows by column values with specified ratios.

For each unique value in the column, sample the specified ratio of rows. The special key "other" can be used to subsample all values not explicitly listed.

If the backend is in streaming mode, a UserWarning will be issued and the LazyFrame will be collected since sampling requires materialization.

Note: The "other" key pools all unlisted values together and samples from the pooled group, rather than applying the ratio per unlisted category.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to group by	required
`ratios`	`dict[str, float]`	Dictionary mapping column values to sampling ratios (0.0 to 1.0). Special key "other" applies to all unlisted values (pooled together).	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`

Returns:

Type	Description
`PolarsBackend`	New backend with subsampled rows

Raises:

Type	Description
`KeyError`	If the specified column does not exist in the DataFrame
`ValueError`	If any ratio is negative or greater than 1.0

`upsample_by_column(column, target_counts, *, seed=42)`

Upsample rows by column values to target counts with replacement.

For each unique value in the column, sample rows with replacement to reach the target count. If a category already has more rows than the target, it will be downsampled (without replacement) to the target count.

If the backend is in streaming mode, a UserWarning will be issued and the LazyFrame will be collected since sampling requires materialization.

Note: The "other" key pools all unlisted values together and samples from the pooled group to reach the target count, rather than applying the target per unlisted category.

Parameters:

Name	Type	Description	Default
`column`	`str`	Column name to group by	required
`target_counts`	`dict[str, int]`	Dictionary mapping column values to target sample counts. Special key "other" applies to all unlisted values (pooled together).	required
`seed`	`int`	Random seed for reproducibility, by default 42	`42`

Returns:

Type	Description
`PolarsBackend`	New backend with upsampled/downsampled rows

Raises:

Type	Description
`KeyError`	If the specified column does not exist in the DataFrame
`ValueError`	If any target count is negative
`TypeError`	If any target count is not an integer

`alp_data.backends.get_backend(backend)`

Get the backend class for the specified backend type.

Parameters:

Name	Type	Description	Default
`backend`	`BackendType`	Name of the backend ("pandas" or "polars")	required

Returns:

Type	Description
`Type[DataBackend \| StreamingDataBackend]`	The backend class (not an instance)

Raises:

Type	Description
`ValueError`	If the backend name is not recognized

Examples:

>>> backend_cls = get_backend("pandas")
>>> assert backend_cls is PandasBackend
>>> backend_cls = get_backend("polars")
>>> assert backend_cls is PolarsBackend

Source code in alp_data/backends/backends.py

def get_backend(backend: BackendType) -> Type[DataBackend | StreamingDataBackend]:
    """Get the backend class for the specified backend type.

    Parameters
    ----------
    backend : BackendType
        Name of the backend ("pandas" or "polars")

    Returns
    -------
    Type[DataBackend | StreamingDataBackend]
        The backend class (not an instance)

    Raises
    ------
    ValueError
        If the backend name is not recognized

    Examples
    --------
    >>> backend_cls = get_backend("pandas")
    >>> assert backend_cls is PandasBackend
    >>> backend_cls = get_backend("polars")
    >>> assert backend_cls is PolarsBackend
    """
    if backend not in _BACKEND_REGISTRY:
        raise ValueError(
            f"Unknown backend: {backend}. Supported backends: {list(_BACKEND_REGISTRY.keys())}"
        )
    return _BACKEND_REGISTRY[backend]

