Data

data

Dataset

Bases: BaseDataset

cache_dir = cache_dir instance-attribute

path: Path property

The path to the dataset.

Returns:

Type Description
Path

The path to the dataset.

format: str property

The format of the dataset.

Returns:

Type Description
str

The format of the dataset.

columns: List[str] property

Get the names of the columns in the dataset.

Returns:

Type Description
List[str]

The names of the columns in the dataset.

__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)

__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)
__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)
__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)

Parameters:

Name Type Description Default
data_or_loader list of dict, dict of list, DataFrame, BaseDatasetLoader, str

The data to load into the dataset or the (name of) loader to use.

None
format str

The format of the dataset.

DEFAULT_FORMAT
path (str, Path, None)

The path at which to store the loaded data.

None
cache_dir (str, Path, None)

The directory to use for caching.

None
loader_args (tuple, None)

The arguments to pass to the loader function if provided as the first argument.

None
loader_kwargs (dict, None)

The keyword arguments to pass to the loader function if provided as the first argument.

None
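
The `data_or_loader` parameter accepts in-memory data either as a list of dicts (one dict per row) or as a dict of lists (one list per column). The following is a minimal plain-Python sketch of how the two layouts relate. The helpers `rows_to_columns` and `columns_to_rows` are hypothetical names used only for illustration and are not part of Octoflow; the sketch also assumes every row has the same keys.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Convert a row-oriented list of dicts into a column-oriented dict of lists.

    Assumes every row carries the same keys.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Convert a column-oriented dict of lists back into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns)
```

Both layouts carry the same information, so either should be a reasonable starting point when building a dataset from in-memory data.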

count_rows() -> int

Count the number of rows in the dataset.

Returns:

Type Description
int

The number of rows in the dataset.

__len__() -> int

Get the number of rows in the dataset.

Returns:

Type Description
int

The number of rows in the dataset.

head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Get the first rows of the dataset as a pandas DataFrame.

Parameters:

Name Type Description Default
num_rows int

The number of rows to get.

5
columns str, list of str, None

Names of columns to get. If None, all columns are returned.

None
filter Expression

The filter expression.

None
batch_size int

Number of rows to get at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
DataFrame

A pandas DataFrame containing the first rows of the dataset.

__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]

__getitem__(indices: int) -> Dict[str, Any]
__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table

Get rows from the dataset.

take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]

take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]
take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Take rows (and, optionally, a subset of columns) from the dataset.

Parameters:

Name Type Description Default
indices int, slice, list of int, array-like

Indices of rows to take.

None
columns str, list of str, None

Names of columns to take. If None, all columns are taken.

None
batch_size int

Number of rows to take at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
(Document, Table)

The taken row (for a single integer index) or rows.
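
The overloads above distinguish a single integer index (one row, returned as a dict) from a slice, list, or array of indices (several rows). A plain-Python model of that distinction is sketched below. It works on a dict of lists rather than on a real dataset, and `take_rows` is a hypothetical helper, not the library's implementation.

```python
from typing import Any, Dict, List, Sequence, Union

Columns = Dict[str, List[Any]]


def take_rows(data: Columns, indices: Union[int, slice, Sequence[int]]) -> Dict[str, Any]:
    """Model of index handling: an int yields one row, anything else yields columns."""
    if isinstance(indices, int):
        # Single index: one scalar per column, i.e. a single row as a dict.
        return {name: values[indices] for name, values in data.items()}
    if isinstance(indices, slice):
        # Slice: a contiguous window of every column.
        return {name: values[indices] for name, values in data.items()}
    # Explicit list of indices: rows come back in the requested order.
    return {name: [values[i] for i in indices] for name, values in data.items()}


data = {"id": [10, 20, 30], "name": ["a", "b", "c"]}
single = take_rows(data, 1)
window = take_rows(data, slice(0, 2))
picked = take_rows(data, [2, 0])
```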

map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset

Map a function over the dataset.

Parameters:

Name Type Description Default
func Any

The function to map over the dataset.

required
batch_size int

Number of rows to map at a time.

DEFAULT_BATCH_SIZE
batched bool

Whether the function is batched.

False
verbose bool | int

Whether to show a progress bar.

1

Returns:

Type Description
Dataset

A new dataset containing the mapped rows.
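
The difference between `batched=False` and `batched=True` can be modelled in plain Python as follows. This is only a sketch: `map_rows` and `iter_batches` are hypothetical helpers, and the assumption that a batched function receives and returns a list of row dicts is made for illustration, since the batch representation used by Octoflow is not documented here.

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]


def iter_batches(rows: List[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive chunks of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def map_rows(rows: List[Row], func: Callable[..., Any], batch_size: int = 2,
             batched: bool = False) -> List[Row]:
    """Apply `func` row by row, or once per batch when `batched` is True."""
    out: List[Row] = []
    for batch in iter_batches(rows, batch_size):
        if batched:
            out.extend(func(batch))  # func sees the whole batch at once
        else:
            out.extend(func(row) for row in batch)  # func sees one row at a time
    return out


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
row_wise = map_rows(rows, lambda row: {"y": row["x"] * 2})
batch_wise = map_rows(rows, lambda batch: [{"y": r["x"] * 2} for r in batch], batched=True)
```

A batched function is usually worth writing when the per-call overhead dominates, because it is invoked once per batch instead of once per row.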

filter(expression: Expression = None) -> Dataset

Filter the dataset.

Parameters:

Name Type Description Default
expression Expression

The filter expression.

None

Returns:

Type Description
Dataset

A new dataset containing only the rows that match the filter expression.

select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Select columns from the dataset.

Parameters:

Name Type Description Default
columns str, list of str

Names of columns to select.

required

Returns:

Type Description
Dataset

A new dataset containing only the selected columns.

rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Rename columns in the dataset.

Parameters:

Name Type Description Default
columns dict

Mapping of old column names to new column names.

required

Returns:

Type Description
Dataset

A new dataset with the columns renamed.
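
The `columns` mapping goes from old names to new names. A plain-Python sketch of that mapping applied to column-oriented data is shown below. `rename_columns` is a hypothetical helper, and keeping columns that are not listed in the mapping is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any, Dict, List


def rename_columns(data: Dict[str, List[Any]], columns: Dict[str, str]) -> Dict[str, List[Any]]:
    """Return a new column dict; names missing from `columns` are kept unchanged."""
    return {columns.get(name, name): values for name, values in data.items()}


data = {"id": [1, 2], "name": ["Alice", "Bob"]}
renamed = rename_columns(data, {"name": "full_name"})
```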

project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Project columns in the dataset.

Parameters:

Name Type Description Default
columns dict

Mapping of column names to expressions.

required
batch_size int

Number of rows to project at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
Dataset

A new dataset with the columns projected.

load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset classmethod

Load an existing dataset.

Parameters:

Name Type Description Default
path (str, Path)

The path to the dataset.

required
format str

The format of the dataset.

DEFAULT_FORMAT

Returns:

Type Description
Dataset

The loaded dataset.

to_polars() -> pl.LazyFrame

Convert the dataset to a Polars LazyFrame.

Returns:

Type Description
LazyFrame

The Polars LazyFrame.

Expression

Bases: BaseExpression

A class representing an expression in Octoflow.

__init__(expression: Union[Expression, ds.Expression])

Parameters:

Name Type Description Default
expression Union[Expression, ds.Expression]

The (pyarrow) expression to wrap.

required

__eq__(other: Any) -> Expression

Compare two expressions for equality.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__ne__(other: Any) -> Expression

Compare two expressions for inequality.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__lt__(other: Any) -> Expression

Compare two expressions for less than.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__le__(other: Any) -> Expression

Compare two expressions for less than or equal to.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__gt__(other: Any) -> Expression

Compare two expressions for greater than.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__ge__(other: Any) -> Expression

Compare two expressions for greater than or equal to.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__and__(other: Any) -> Expression

Combine two expressions with a logical and.

Parameters:

Name Type Description Default
other Any

The other expression to combine with.

required

Returns:

Type Description
Expression

The expression representing the result of the combination.

__or__(other: Any) -> Expression

Combine two expressions with a logical or.

Parameters:

Name Type Description Default
other Any

The other expression to combine with.

required

Returns:

Type Description
Expression

The expression representing the result of the combination.

__invert__() -> Expression

Invert an expression.

Returns:

Type Description
Expression

The expression representing the inverted expression.

is_nan() -> Expression

Check if an expression is NaN.

Returns:

Type Description
Expression

The expression representing the result of the check.

is_null(nan_is_null: bool = False)

Check if an expression is null.

Parameters:

Name Type Description Default
nan_is_null bool

Whether to consider NaN values as null, by default False.

False

Returns:

Type Description
Expression

The expression representing the result of the check.
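
The effect of `nan_is_null` can be modelled on plain Python values, with `None` standing in for a null. The function below is a scalar sketch of the semantics only; the real method returns an expression that is evaluated lazily, and the helper name `is_null_value` is hypothetical.

```python
import math
from typing import Any


def is_null_value(value: Any, nan_is_null: bool = False) -> bool:
    """Scalar model: None is always null, NaN only when `nan_is_null` is set."""
    if value is None:
        return True
    if nan_is_null and isinstance(value, float) and math.isnan(value):
        return True
    return False


values = [1.0, None, float("nan")]
strict = [is_null_value(v) for v in values]
lenient = [is_null_value(v, nan_is_null=True) for v in values]
```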

is_valid() -> Expression

Check if an expression is valid.

Returns:

Type Description
Expression

The expression representing the result of the check.

isin(other: Expression) -> Expression

Check if an expression is in a set of values.

Parameters:

Name Type Description Default
other Expression

The set of values to check against.

required

Returns:

Type Description
Expression

The expression representing the result of the check.

equals(other: Expression) -> Expression

Check if an expression is equal to another expression.

Parameters:

Name Type Description Default
other Expression

The other expression to check against.

required

Returns:

Type Description
Expression

The expression representing the result of the check.

__hash__() -> int

Get the hash of the expression.

Returns:

Type Description
int

The hash of the expression.

__repr__() -> str

Get the representation of the expression.

Returns:

Type Description
str

The representation of the expression.

field(*args, **kwargs) -> Field

Create a new field getter.

scalar(value: Any) -> Expression

Create an expression from a scalar.

Parameters:

Name Type Description Default
value Any

The value of the scalar.

required

Returns:

Type Description
Expression

The expression representing the scalar.

dataloader(func: Union[F, str, None] = None, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[..., Any]] = None, path_arg: Optional[str] = None) -> Union[F, Callable[[F], F]]

dataloader(func: F, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> F
dataloader(name: str, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> Callable[[F], F]

Decorator to register a function as a dataset loader.

Parameters:

Name Type Description Default
func Union[Callable[..., Any], str, None]

The function to decorate, by default None.

None
name Optional[str]

The name of the loader, by default None.

None
extensions Optional[list[str]]

The extensions that the loader supports, by default None.

None
wraps Optional[Callable[..., Any]]

The function to wrap, by default None.

None
path_arg Optional[str]

The name of the argument that is the path, by default None.

None

Returns:

Type Description
DatasetLoader

The dataset loader.
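
The overloads show that `dataloader` can be applied either directly to a function or called first with a name and options. A generic sketch of that decorator pattern follows. Everything in it (`register_loader`, `REGISTRY`, the toy loaders) is hypothetical and only mirrors the call shapes documented above; it is not Octoflow's implementation.

```python
from typing import Any, Callable, Dict, List, Optional, Union

# Hypothetical registry used by this sketch only.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_loader(
    func: Union[Callable[..., Any], str, None] = None,
    name: Optional[str] = None,
    extensions: Optional[List[str]] = None,
):
    """Register a function, either as `@register_loader` or `@register_loader("name")`."""
    if isinstance(func, str):
        # Called with a name first: treat the string as the loader name.
        func, name = None, func

    def decorator(target: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[name or target.__name__] = target
        target.extensions = extensions or []
        return target

    if func is None:
        return decorator  # used with arguments: return the real decorator
    return decorator(func)  # used bare: decorate immediately


@register_loader
def split_lines(text: str) -> List[str]:
    return text.splitlines()


@register_loader("words", extensions=[".txt"])
def split_words(text: str) -> List[str]:
    return text.split()
```

Falling back to the function's own name when no name is given matches the `name or self.func.__name__` default listed for `DatasetLoader`.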

load_dataset(__loader: str, __path: Optional[str], __force: bool = False, __dataset_format: str = DEFAULT_FORMAT, __dataset_path: Union[Path, str, None] = None, /, *args, **kwargs) -> Dataset

Load a dataset from a path.

Parameters:

Name Type Description Default
__loader str

The name of the loader.

required
__path Optional[str]

The path to the data (to be passed to the loader).

required
__dataset_format str

The format of the dataset, by default DEFAULT_FORMAT.

DEFAULT_FORMAT
__dataset_path Union[Path, str, None]

The path where the dataset will be stored.

None
*args tuple

The arguments to pass to the loader.

()
**kwargs dict

The keyword arguments to pass to the loader.

{}

Returns:

Type Description
Dataset

The loaded dataset.
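
The `/` in the signature makes the leading parameters positional-only. Presumably this keeps them from colliding with keyword arguments that are forwarded to the loader. The sketch below illustrates that general Python mechanism with hypothetical names (`LOADERS`, `tag_loader`, `load`); it is not the library's code.

```python
from typing import Any, Callable, Dict


def tag_loader(path: str, loader_name: str = "none") -> Dict[str, str]:
    """Toy loader that happens to accept a keyword called `loader_name`."""
    return {"path": path, "tag": loader_name}


# Hypothetical registry mapping loader names to loader functions.
LOADERS: Dict[str, Callable[..., Any]] = {"tag": tag_loader}


def load(loader_name: str, path: str, /, *args: Any, **kwargs: Any) -> Any:
    # `loader_name` and `path` are positional-only, so `kwargs` may contain
    # a key called `loader_name` without clashing with this function's own
    # parameter of the same name.
    return LOADERS[loader_name](path, *args, **kwargs)


result = load("tag", "data.csv", loader_name="custom")
```

Without the `/`, the call above would raise a `TypeError` because `loader_name` would be supplied twice.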

base

ArrowType = TypeVar('ArrowType') module-attribute

P = ParamSpec('P') module-attribute

R = TypeVar('R') module-attribute

DEFAULT_BATCH_SIZE: Final[int] = 1048576 module-attribute

DEFAULT_FORMAT: Final[str] = 'arrow' module-attribute

BaseExpression = PyArrowWrapper[ds.Expression] module-attribute

BaseDataset = PyArrowWrapper[ds.Dataset] module-attribute

PyArrowWrapper

Bases: Generic[ArrowType]

__init__(wrapped: ArrowType) -> None
to_pyarrow() -> ArrowType

BaseDatasetLoader

Bases: Generic[P, R]

dataclass

T = TypeVar('T') module-attribute

Field

Bases: Field, Expression

name = name instance-attribute
__init__(name: Optional[str] = None, *, default=dc.MISSING, default_factory=dc.MISSING, init=True, repr=True, hash=None, compare=True, metadata=None, kw_only=dc.MISSING)
__call__(data: Mapping[str, Any]) -> Any

Get the value of the field.

Parameters:

Name Type Description Default
data dict

The data to be accessed.

required
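
Calling a field on a mapping returns the value stored under the field's name. A minimal stand-in that shows the idea is sketched below. `FieldGetter` is a hypothetical class written for this example, and how the real `Field` handles a missing key is not documented here.

```python
from typing import Any, Mapping


class FieldGetter:
    """Hypothetical stand-in for a field that can be called on a row."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __call__(self, data: Mapping[str, Any]) -> Any:
        # Look the field's name up in the mapping that represents one row.
        return data[self.name]


get_name = FieldGetter("name")
row = {"id": 1, "name": "Alice"}
value = get_name(row)
```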

FieldAccessor

Bases: tuple, Generic[T]

__new__(obj: Type[T]) -> FieldAccessor[T]
__getattr__(name: str) -> Field

ModelMeta

Bases: type

__new__(mcs, name, bases, attrs, **kwargs)
update_forward_refs(**kwargs: Any) -> None

BaseModel

__post_init__()

field(*args, **kwargs) -> Field

Create a new field getter.

field_from_dataclass_field(field: dc.Field) -> Field

Create a new field getter from a dataclass field.

fields(cls: Type[T]) -> Union[FieldAccessor[T], Type[T]]

dataset

SourceType = Union[str, List[str], Union[Path, List[Path]], 'Dataset', List['Dataset']] module-attribute

Dataset

Bases: BaseDataset

cache_dir = cache_dir instance-attribute
path: Path property

The path to the dataset.

Returns:

Type Description
Path

The path to the dataset.

format: str property

The format of the dataset.

Returns:

Type Description
str

The format of the dataset.

columns: List[str] property

Get the names of the columns in the dataset.

Returns:

Type Description
List[str]

The names of the columns in the dataset.

__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)
__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)
__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)
__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)

Parameters:

Name Type Description Default
data_or_loader list of dict, dict of list, DataFrame, BaseDatasetLoader, str

The data to load into the dataset or the (name of) loader to use.

None
format str

The format of the dataset.

DEFAULT_FORMAT
path (str, Path, None)

The path at which to store the loaded data.

None
cache_dir (str, Path, None)

The directory to use for caching.

None
loader_args (tuple, None)

The arguments to pass to the loader function if provided as the first argument.

None
loader_kwargs (dict, None)

The keyword arguments to pass to the loader function if provided as the first argument.

None
count_rows() -> int

Count the number of rows in the dataset.

Returns:

Type Description
int

The number of rows in the dataset.

__len__() -> int

Get the number of rows in the dataset.

Returns:

Type Description
int

The number of rows in the dataset.

head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Get the first rows of the dataset as a pandas DataFrame.

Parameters:

Name Type Description Default
num_rows int

The number of rows to get.

5
columns str, list of str, None

Names of columns to get. If None, all columns are returned.

None
filter Expression

The filter expression.

None
batch_size int

Number of rows to get at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
DataFrame

A pandas DataFrame containing the first rows of the dataset.

__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]
__getitem__(indices: int) -> Dict[str, Any]
__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table

Get rows from the dataset.

take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]
take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]
take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Take rows (and, optionally, a subset of columns) from the dataset.

Parameters:

Name Type Description Default
indices int, slice, list of int, array-like

Indices of rows to take.

None
columns str, list of str, None

Names of columns to take. If None, all columns are taken.

None
batch_size int

Number of rows to take at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
(Document, Table)

The taken row (for a single integer index) or rows.

map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset

Map a function over the dataset.

Parameters:

Name Type Description Default
func Any

The function to map over the dataset.

required
batch_size int

Number of rows to map at a time.

DEFAULT_BATCH_SIZE
batched bool

Whether the function is batched.

False
verbose bool | int

Whether to show a progress bar.

1

Returns:

Type Description
Dataset

A new dataset containing the mapped rows.

filter(expression: Expression = None) -> Dataset

Filter the dataset.

Parameters:

Name Type Description Default
expression Expression

The filter expression.

None

Returns:

Type Description
Dataset

A new dataset containing only the rows that match the filter expression.

select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Select columns from the dataset.

Parameters:

Name Type Description Default
columns str, list of str

Names of columns to select.

required

Returns:

Type Description
Dataset

A new dataset containing only the selected columns.

rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Rename columns in the dataset.

Parameters:

Name Type Description Default
columns dict

Mapping of old column names to new column names.

required

Returns:

Type Description
Dataset

A new dataset with the columns renamed.

project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Project columns in the dataset.

Parameters:

Name Type Description Default
columns dict

Mapping of column names to expressions.

required
batch_size int

Number of rows to project at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
Dataset

A new dataset with the columns projected.

load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset classmethod

Load an existing dataset.

Parameters:

Name Type Description Default
path (str, Path)

The path to the dataset.

required
format str

The format of the dataset.

DEFAULT_FORMAT

Returns:

Type Description
Dataset

The loaded dataset.

to_polars() -> pl.LazyFrame

Convert the dataset to a Polars LazyFrame.

Returns:

Type Description
LazyFrame

The Polars LazyFrame.

gen_unique_cached_path(*refs: Any, cache_dir: Union[str, Path, None] = None) -> Path

writable(data: Any, schema: Optional[pa.Schema] = None) -> Union[pa.RecordBatch, pa.Table, pa.RecordBatchReader]

write_dataset(path: Union[str, Path], data: Union[ds.Dataset, pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], pa.RecordBatchReader, pd.DataFrame, Mapping[str, List[Any]], Sequence[Mapping[str, Any]]], schema: pa.Schema = None, format: Optional[str] = None) -> bool

read_dataset(path: Union[str, Path], format: str) -> ds.dataset

to_batches(data: Union[pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], Iterable[pa.Table], pa.RecordBatchReader]) -> Generator[pa.RecordBatch, None, None]

create_mapped_table(data: Union[dict, list, pd.DataFrame, pa.RecordBatch, pa.Table], existing: Optional[pa.Table] = None, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Optional[List[str]] = None) -> pa.Table

expression

Expression

Bases: BaseExpression

A class representing an expression in Octoflow.

__init__(expression: Union[Expression, ds.Expression])

Parameters:

Name Type Description Default
expression Union[Expression, ds.Expression]

The (pyarrow) expression to wrap.

required
__eq__(other: Any) -> Expression

Compare two expressions for equality.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__ne__(other: Any) -> Expression

Compare two expressions for inequality.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__lt__(other: Any) -> Expression

Compare two expressions for less than.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__le__(other: Any) -> Expression

Compare two expressions for less than or equal to.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__gt__(other: Any) -> Expression

Compare two expressions for greater than.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__ge__(other: Any) -> Expression

Compare two expressions for greater than or equal to.

Parameters:

Name Type Description Default
other Any

The other expression to compare to.

required

Returns:

Type Description
Expression

The expression representing the result of the comparison.

__and__(other: Any) -> Expression

Combine two expressions with a logical and.

Parameters:

Name Type Description Default
other Any

The other expression to combine with.

required

Returns:

Type Description
Expression

The expression representing the result of the combination.

__or__(other: Any) -> Expression

Combine two expressions with a logical or.

Parameters:

Name Type Description Default
other Any

The other expression to combine with.

required

Returns:

Type Description
Expression

The expression representing the result of the combination.

__invert__() -> Expression

Invert an expression.

Returns:

Type Description
Expression

The expression representing the inverted expression.

is_nan() -> Expression

Check if an expression is NaN.

Returns:

Type Description
Expression

The expression representing the result of the check.

is_null(nan_is_null: bool = False)

Check if an expression is null.

Parameters:

Name Type Description Default
nan_is_null bool

Whether to consider NaN values as null, by default False.

False

Returns:

Type Description
Expression

The expression representing the result of the check.

is_valid() -> Expression

Check if an expression is valid.

Returns:

Type Description
Expression

The expression representing the result of the check.

isin(other: Expression) -> Expression

Check if an expression is in a set of values.

Parameters:

Name Type Description Default
other Expression

The set of values to check against.

required

Returns:

Type Description
Expression

The expression representing the result of the check.

equals(other: Expression) -> Expression

Check if an expression is equal to another expression.

Parameters:

Name Type Description Default
other Expression

The other expression to check against.

required

Returns:

Type Description
Expression

The expression representing the result of the check.

__hash__() -> int

Get the hash of the expression.

Returns:

Type Description
int

The hash of the expression.

__repr__() -> str

Get the representation of the expression.

Returns:

Type Description
str

The representation of the expression.

scalar(value: Any) -> Expression

Create an expression from a scalar.

Parameters:

Name Type Description Default
value Any

The value of the scalar.

required

Returns:

Type Description
Expression

The expression representing the scalar.

loaders

P = ParamSpec('P') module-attribute

R = TypeVar('R') module-attribute

F = TypeVar('F', bound=Callable[..., Any]) module-attribute

loaders: Dict[str, DatasetLoader] = {} module-attribute

DatasetLoader

Bases: BaseDatasetLoader

func = func instance-attribute
name = name or self.func.__name__ instance-attribute
extensions = extensions instance-attribute
path_arg = path_arg instance-attribute
wraps = wraps instance-attribute
__init__(func: Callable[..., Any], name: Optional[str] = None, extensions: Optional[list[str]] = None, path_arg: Optional[str] = None, wraps: Optional[Callable[P, R]] = None)

Parameters:

Name Type Description Default
func Callable[..., Any]

The function to decorate.

required
name Optional[str]

The name of the loader, by default None.

None
extensions Optional[list[str]]

The extensions that the loader supports, by default None.

None
path_arg Optional[str]

The name of the argument that is the path, by default None.

None
wraps Optional[Callable[..., Any]]

The function to wrap, by default None.

None
__call__(*args: P.args, **kwargs: P.kwargs) -> R

Call the loader function.

Parameters:

Name Type Description Default
args tuple

The arguments to pass to the function.

()
kwargs dict

The keyword arguments to pass to the function.

{}

Returns:

Type Description
R

The result of the function.

bind(*args: P.args, **kwargs: P.kwargs) -> Callable[..., R]

Bind arguments to the loader function.

Notes

Use this method to create a partial function with pre-filled positional and keyword arguments. Binding the arguments helps make the fingerprint of the dataset more distinctive.

Parameters:

Name Type Description Default
args tuple

The arguments to pre-fill.

()
kwargs dict

The keyword arguments to pre-fill.

{}

Returns:

Type Description
Callable[..., R]

The partial function.
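
Pre-filling arguments in this way resembles `functools.partial` from the standard library. The sketch below uses `functools.partial` on a toy function; whether `bind` is built on it is an assumption, and `parse_line` is a hypothetical example function.

```python
from functools import partial
from typing import List


def parse_line(line: str, delimiter: str = ",", strip: bool = False) -> List[str]:
    """Split one line of text into fields."""
    parts = line.split(delimiter)
    return [part.strip() for part in parts] if strip else parts


# Pre-fill the keyword arguments once; the result is called with the rest.
parse_tsv = partial(parse_line, delimiter="\t", strip=True)
fields = parse_tsv("a \t b")

# The bound arguments stay inspectable, so they could feed into a
# fingerprint that differs between differently configured loaders.
bound = parse_tsv.keywords
```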

dataloader(func: Union[F, str, None] = None, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[..., Any]] = None, path_arg: Optional[str] = None) -> Union[F, Callable[[F], F]]

dataloader(func: F, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> F
dataloader(name: str, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> Callable[[F], F]

Decorator to register a function as a dataset loader.

Parameters:

Name Type Description Default
func Union[Callable[..., Any], str, None]

The function to decorate, by default None.

None
name Optional[str]

The name of the loader, by default None.

None
extensions Optional[list[str]]

The extensions that the loader supports, by default None.

None
wraps Optional[Callable[..., Any]]

The function to wrap, by default None.

None
path_arg Optional[str]

The name of the argument that is the path, by default None.

None

Returns:

Type Description
DatasetLoader

The dataset loader.

load_json(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None]

Load a dataset from a JSON file.

Parameters:

Name Type Description Default
path (str, Path)

The path to the file.

required
encoding str

The encoding of the file, by default "utf-8".

'utf-8'

Returns:

Type Description
dict

The loaded dataset.

load_jsonl(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None]

Load a dataset from a JSONL file.

Parameters:

Name Type Description Default
path (str, Path)

The path to the file.

required
encoding str

The encoding of the file, by default "utf-8".

'utf-8'

Returns:

Type Description
list[dict]

The loaded dataset.
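
The signature indicates a generator that yields lists of dicts. A sketch of that shape for JSON Lines input is shown below. It reads from a text stream instead of a path so that it stays self-contained, and both the helper name `iter_jsonl_batches` and the `batch_size` parameter are assumptions made for this example.

```python
import io
import json
from typing import Any, Dict, Iterator, List, TextIO


def iter_jsonl_batches(stream: TextIO, batch_size: int = 2) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of parsed records, at most `batch_size` records per list."""
    batch: List[Dict[str, Any]] = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final, possibly shorter, batch


sample = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
batches = list(iter_jsonl_batches(sample))
```

Yielding batches rather than one full list keeps memory use bounded for large files.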

load_csv(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None]

Load a dataset from a CSV/TSV file.

Parameters:

Name Type Description Default
path (str, Path)

The path to the file.

required
encoding str

The encoding of the file, by default "utf-8".

'utf-8'

Returns:

Type Description
list[dict]

The loaded dataset.
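
Reading CSV and TSV into row dicts can be sketched with the standard library's `csv.DictReader`. How Octoflow distinguishes the two formats is not documented here, so the explicit `delimiter` parameter and the helper name `read_delimited` are assumptions of this example.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_delimited(stream: TextIO, delimiter: str = ",") -> List[Dict[str, str]]:
    """Parse delimited text into one dict per row, keyed by the header line."""
    return [dict(row) for row in csv.DictReader(stream, delimiter=delimiter)]


csv_rows = read_delimited(io.StringIO("id,name\n1,Alice\n2,Bob\n"))
tsv_rows = read_delimited(io.StringIO("id\tname\n1\tAlice\n"), delimiter="\t")
```

`csv.DictReader` returns every value as a string, so any numeric conversion would have to happen in a later step.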

metadata

unify_metadata(left: Any, right: Any) -> Optional[dict]

sampler

Sampler

args = (0, *list(columns.values())) instance-attribute
columns = list(columns.keys()) instance-attribute
__init__(columns: Mapping[str, int])
__call__(lst: Sequence[int])

schema

T = TypeVar('T') module-attribute

unify_schemas(this: pa.Schema, other: Optional[pa.Schema]) -> pa.Schema

infer_schema(data: Dict[str, Any], metadata: Optional[Dict[str, Any]] = None) -> Self

validate(schema: pa.Schema, data: dict) -> bool

Validates a dictionary against a PyArrow schema.

Parameters:

Name Type Description Default
schema Schema

The PyArrow schema to validate against.

required
data dict

The dictionary to validate.

required

Raises:

Type Description
ValidationError

If the dictionary does not match the schema.

Examples:

>>> schema = pa.schema([pa.field('id', pa.int64()), pa.field('name', pa.string())])
>>> valid_dict = {'id': 1, 'name': 'Alice'}
>>> validate(schema, valid_dict)
>>> invalid_dict = {'id': '1', 'name': 'Alice'}
>>> validate(schema, invalid_dict)
Traceback (most recent call last):
...
ValidationError: ...

get_schema(data: T) -> Tuple[T, pa.Schema]

Extracts the schema from a PyArrow schema or a generator of PyArrow record batches.

Parameters:

Name Type Description Default
data Any

The PyArrow schema or generator of record batches.

required

Returns:

Type Description
Tuple[Any, Schema]

The data and the schema.

from_dataclass(cls: T) -> pa.Schema

Converts a dataclass to a PyArrow schema.

Parameters:

Name Type Description Default
cls Type[T]

The dataclass to convert.

required

Returns:

Type Description
Schema

The PyArrow schema.

Examples:

>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     id: int
...     name: str
>>> from_dataclass(Record)
pyarrow.Schema([...])

get_schema_from_dataclass(*args, **kwargs) -> pa.Schema

Alias for from_dataclass.

Examples:

>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     id: int
...     name: str
>>> get_schema_from_dataclass(Record)
pyarrow.Schema([...])

types

UNDEFINED = undefined() module-attribute

MonthDayNano

Bases: NamedTuple

months: int instance-attribute
days: int instance-attribute
nanoseconds: int instance-attribute

Undefined

Bases: ExtensionType

__init__()
__arrow_ext_serialize__() -> bytes
__arrow_ext_deserialize__(storage_type, serialized) -> Undefined classmethod

undefined() -> Undefined

is_undefined(obj: pa.DataType) -> bool

from_dataclass(cls: type) -> pa.DataType

Return the PyArrow data type of a dataclass.

Parameters:

Name Type Description Default
cls type

The dataclass.

required

Returns:

Type Description
DataType

The PyArrow data type.

from_typed_dict(cls: _TypedDictMeta) -> pa.DataType

Return the PyArrow data type of a TypedDict.

Parameters:

Name Type Description Default
cls _TypedDictMeta

The TypedDict.

required

Returns:

Type Description
DataType

The PyArrow data type.

from_union(args: tuple[type, ...]) -> pa.DataType

from_dtype(dtype: Union[type, np.dtype, None]) -> pa.DataType

Return the PyArrow data type of a provided native/NumPy data type.

Parameters:

Name Type Description Default
dtype type | dtype | None

The native or NumPy data type.

required

Returns:

Type Description
DataType

The PyArrow data type.

unify_types(left: pa.DataType, right: pa.DataType) -> pa.DataType

Return the PyArrow data type that can represent both left and right.

Parameters:

Name Type Description Default
left DataType

The left PyArrow data type.

required
right DataType

The right PyArrow data type.

required

Returns:

Type Description
DataType

The PyArrow data type.

infer_type(obj: Any) -> pa.DataType

Return the PyArrow data type of an object.

Parameters:

Name Type Description Default
obj Any

The object.

required

Returns:

Type Description
DataType

The PyArrow data type.