Data

`data` ¶

`Dataset` ¶

Bases: BaseDataset

`cache_dir = cache_dir` `instance-attribute` ¶

`path: Path` `property` ¶

The path to the dataset.

Returns:

Type	Description
`Path`	The path to the dataset.

`format: str` `property` ¶

The format of the dataset.

Returns:

Type	Description
`str`	The format of the dataset.

`columns: List[str]` `property` ¶

Get the names of the columns in the dataset.

Returns:

Type	Description
`List[str]`	The names of the columns in the dataset.

`__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)` ¶

__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)

__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)

__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)

Parameters:

Name	Type	Description	Default
`data_or_loader`	`list of dict, dict of list, DataFrame, BaseDatasetLoader, str`	The data to load into the dataset or the (name of) loader to use.	`None`
`format`	`str`	The format of the dataset.	`DEFAULT_FORMAT`
`path`	`(str, Path, None)`	The path where the loaded data is stored.	`None`
`cache_dir`	`(str, Path, None)`	The directory to use for caching.	`None`
`loader_args`	`(tuple, None)`	Positional arguments passed to the loader. Used only when a loader is given as the first argument.	`None`
`loader_kwargs`	`(dict, None)`	Keyword arguments passed to the loader. Used only when a loader is given as the first argument.	`None`

`count_rows() -> int` ¶

Count the number of rows in the dataset.

Returns:

Type	Description
`int`	The number of rows in the dataset.

`__len__() -> int` ¶

Get the number of rows in the dataset.

Returns:

Type	Description
`int`	The number of rows in the dataset.

`head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame` ¶

Get the first rows of the dataset as a pandas DataFrame.

Parameters:

Name	Type	Description	Default
`num_rows`	`int`	The number of rows to get.	`5`
`columns`	`str, list of str, None`	Names of columns to get. If None, all columns are returned.	`None`
`filter`	`Expression`	The filter expression.	`None`
`batch_size`	`int`	Number of rows to get at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`DataFrame`	A pandas DataFrame containing the first rows of the dataset.

`__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]` ¶

__getitem__(indices: int) -> Dict[str, Any]

__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table

Get rows from the dataset.
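
The overloads above return a single row for an integer index and a collection of rows otherwise. The following is a minimal sketch of that dispatch pattern, assuming a plain in-memory list of dicts. `RowStore` is a hypothetical stand-in and not the library's `Dataset`, which returns a `pa.Table` for the multi-row case.

```python
from typing import Any, Dict, List, Union

Row = Dict[str, Any]


class RowStore:
    """Toy in-memory stand-in for a dataset; not the library's Dataset."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, indices: Union[int, slice, List[int]]) -> Union[Row, List[Row]]:
        # An integer index selects exactly one row.
        if isinstance(indices, int):
            return self._rows[indices]
        # A slice selects a contiguous (or strided) run of rows.
        if isinstance(indices, slice):
            return self._rows[indices]
        # Any other iterable is treated as a list of row positions.
        return [self._rows[i] for i in indices]


store = RowStore([{"id": 1}, {"id": 2}, {"id": 3}])
single = store[0]       # one row as a dict
window = store[1:]      # rows selected by a slice
picked = store[[0, 2]]  # rows selected by explicit positions
```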

`take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]` ¶

take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]

take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Take rows (and, optionally, a subset of columns) from the dataset.

Parameters:

Name	Type	Description	Default
`indices`	`int, slice, list of int, array-like`	Indices of rows to take.	`None`
`columns`	`str, list of str, None`	Names of columns to take. If None, all columns are taken.	`None`
`batch_size`	`int`	Number of rows to take at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`(dict, Table)`	A single row when `indices` is an int; otherwise the selected rows.

`map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset` ¶

Map a function over the dataset.

Parameters:

Name	Type	Description	Default
`func`	`Any`	The function to map over the dataset.	required
`batch_size`	`int`	Number of rows to map at a time.	`DEFAULT_BATCH_SIZE`
`batched`	`bool`	Whether the function is batched.	`False`
`verbose`	`bool \| int`	Whether to show a progress bar.	`1`

Returns:

Type	Description
`Dataset`	A new dataset containing the mapped rows.
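
The `batched` flag can be read as choosing between calling `func` once per row and once per batch. The sketch below illustrates that reading on a list of dicts. It is an assumption about the semantics rather than the library's implementation, and `map_rows` is a hypothetical helper.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_rows(
    rows: List[Row],
    func: Callable[..., Any],
    batch_size: int = 2,
    batched: bool = False,
) -> List[Row]:
    """Apply func to rows in chunks of batch_size (hypothetical helper)."""
    mapped: List[Row] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        if batched:
            # func receives the whole batch and returns one row per input row.
            mapped.extend(func(batch))
        else:
            # func receives one row at a time.
            mapped.extend(func(row) for row in batch)
    return mapped


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
per_row = map_rows(rows, lambda row: {"y": row["x"] * 2})
per_batch = map_rows(
    rows,
    lambda batch: [{"y": r["x"] * 2} for r in batch],
    batched=True,
)
```

Both calls produce the same rows; the batched form only changes how often `func` is invoked.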

`filter(expression: Expression = None) -> Dataset` ¶

Filter the dataset.

Parameters:

Name	Type	Description	Default
`expression`	`Expression`	The filter expression.	`None`

Returns:

Type	Description
`Dataset`	A new dataset containing only the rows that match the filter expression.

`select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Select columns from the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`str, list of str`	Names of columns to select.	required

Returns:

Type	Description
`Dataset`	A new dataset containing only the selected columns.

`rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Rename columns in the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`dict`	Mapping of old column names to new column names.	required

Returns:

Type	Description
`Dataset`	A new dataset with the columns renamed.

`project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Project columns in the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`dict`	Mapping of column names to expressions.	required
`batch_size`	`int`	Number of rows to project at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`Dataset`	A new dataset with the columns projected.

`load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset` `classmethod` ¶

Load an existing dataset.

Parameters:

Name	Type	Description	Default
`path`	`(str, Path)`	The path to the dataset.	required
`format`	`str`	The format of the dataset.	`DEFAULT_FORMAT`

Returns:

Type	Description
`Dataset`	The loaded dataset.

`to_polars() -> pl.LazyFrame` ¶

Convert the dataset to a Polars LazyFrame.

Returns:

Type	Description
`LazyFrame`	The Polars Lazy DataFrame.

`Expression` ¶

Bases: BaseExpression

A class representing an expression in Octoflow.

`__init__(expression: Union[Expression, ds.Expression])` ¶

Parameters:

Name	Type	Description	Default
`expression`	`Union[Expression, ds.Expression]`	The (pyarrow) expression to wrap.	required

`__eq__(other: Any) -> Expression` ¶

Compare two expressions for equality.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__ne__(other: Any) -> Expression` ¶

Compare two expressions for inequality.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__lt__(other: Any) -> Expression` ¶

Compare two expressions for less than.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__le__(other: Any) -> Expression` ¶

Compare two expressions for less than or equal to.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__gt__(other: Any) -> Expression` ¶

Compare two expressions for greater than.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__ge__(other: Any) -> Expression` ¶

Compare two expressions for greater than or equal to.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__and__(other: Any) -> Expression` ¶

Combine two expressions with a logical and.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to combine with.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the combination.

`__or__(other: Any) -> Expression` ¶

Combine two expressions with a logical or.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to combine with.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the combination.

`__invert__() -> Expression` ¶

Invert an expression.

Returns:

Type	Description
`Expression`	The expression representing the inverted expression.
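
The comparison and logical operators above return new expressions rather than booleans, so a predicate can be built with ordinary Python operators and evaluated later. The following is a minimal sketch of that pattern. `Expr`, its `evaluate` method, and this `field` helper are hypothetical stand-ins written for illustration; the library's `Expression` wraps a pyarrow expression instead.

```python
from typing import Any, Dict, Tuple


class Expr:
    """Toy expression node; a stand-in, not the library's Expression."""

    def __init__(self, op: str, *args: Any) -> None:
        self.op = op
        self.args: Tuple[Any, ...] = args

    # Comparison operators build new nodes instead of returning booleans.
    def __eq__(self, other: Any) -> "Expr":  # type: ignore[override]
        return Expr("eq", self, other)

    def __gt__(self, other: Any) -> "Expr":
        return Expr("gt", self, other)

    # `&`, `|` and `~` combine or negate expressions.
    def __and__(self, other: Any) -> "Expr":
        return Expr("and", self, other)

    def __or__(self, other: Any) -> "Expr":
        return Expr("or", self, other)

    def __invert__(self) -> "Expr":
        return Expr("not", self)

    # Defining __eq__ removes the default hash, so one is restored explicitly.
    def __hash__(self) -> int:
        return id(self)

    def __repr__(self) -> str:
        return f"{self.op}({', '.join(map(repr, self.args))})"

    def evaluate(self, row: Dict[str, Any]) -> Any:
        """Evaluate the expression tree against a single row."""
        values = [a.evaluate(row) if isinstance(a, Expr) else a for a in self.args]
        if self.op == "field":
            return row[values[0]]
        if self.op == "eq":
            return values[0] == values[1]
        if self.op == "gt":
            return values[0] > values[1]
        if self.op == "and":
            return values[0] and values[1]
        if self.op == "or":
            return values[0] or values[1]
        if self.op == "not":
            return not values[0]
        raise ValueError(f"unknown op: {self.op}")


def field(name: str) -> Expr:
    """Create a node that reads a column by name."""
    return Expr("field", name)


# Parentheses are needed because & binds tighter than comparisons.
predicate = (field("age") > 30) & ~(field("city") == "Oslo")
rows = [
    {"age": 41, "city": "Bergen"},
    {"age": 41, "city": "Oslo"},
    {"age": 25, "city": "Bergen"},
]
matches = [row for row in rows if predicate.evaluate(row)]
```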

`is_nan() -> Expression` ¶

Check if an expression is NaN.

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`is_null(nan_is_null: bool = False)` ¶

Check if an expression is null.

Parameters:

Name	Type	Description	Default
`nan_is_null`	`bool`	Whether to consider NaN values as null, by default False	`False`

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`is_valid() -> Expression` ¶

Check if an expression is valid.

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`isin(other: Expression) -> Expression` ¶

Check if an expression is in a set of values.

Parameters:

Name	Type	Description	Default
`other`	`Expression`	The set of values to check against.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`equals(other: Expression) -> Expression` ¶

Check if an expression is equal to another expression.

Parameters:

Name	Type	Description	Default
`other`	`Expression`	The other expression to check against.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`__hash__() -> int` ¶

Get the hash of the expression.

Returns:

Type	Description
`int`	The hash of the expression.

`__repr__() -> str` ¶

Get the representation of the expression.

Returns:

Type	Description
`str`	The representation of the expression.

`field(*args, **kwargs) -> Field` ¶

Create a new field getter.

`scalar(value: Any) -> Expression` ¶

Create an expression from a scalar.

Parameters:

Name	Type	Description	Default
`value`	`Any`	The value of the scalar.	required

Returns:

Type	Description
`Expression`	The expression representing the scalar.

`dataloader(func: Union[F, str, None] = None, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[..., Any]] = None, path_arg: Optional[str] = None) -> Union[F, Callable[[F], F]]` ¶

dataloader(func: F, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> F

dataloader(name: str, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> Callable[[F], F]

Decorator to register a function as a dataset loader.

Parameters:

Name	Type	Description	Default
`func`	`Union[Callable[..., Any], str, None]`	The function to decorate, by default None.	`None`
`name`	`Optional[str]`	The name of the loader, by default None.	`None`
`extensions`	`Optional[list[str]]`	The extensions that the loader supports, by default None.	`None`
`wraps`	`Optional[Callable[..., Any]]`	The function to wrap, by default None.	`None`
`path_arg`	`Optional[str]`	The name of the argument that is the path, by default None.	`None`

Returns:

Type	Description
`DatasetLoader`	The dataset loader.

`load_dataset(__loader: str, __path: Optional[str], __force: bool = False, __dataset_format: str = DEFAULT_FORMAT, __dataset_path: Union[Path, str, None] = None, /, *args, **kwargs) -> Dataset` ¶

Load a dataset from a path.

Parameters:

Name	Type	Description	Default
`__loader`	`str`	The name of the loader.	required
`__path`	`Optional[str]`	The path to the data (to be passed to the loader).	required
`__dataset_format`	`str`	The format of the dataset, by default DEFAULT_FORMAT.	`DEFAULT_FORMAT`
`__dataset_path`	`Union[Path, str, None]`	The path where the dataset will be stored.	`None`
`*args`	`tuple`	The arguments to pass to the loader.	`()`
`**kwargs`	`dict`	The keyword arguments to pass to the loader.	`{}`

Returns:

Type	Description
`Dataset`	The loaded dataset.

`base` ¶

`ArrowType = TypeVar('ArrowType')` `module-attribute` ¶

`P = ParamSpec('P')` `module-attribute` ¶

`R = TypeVar('R')` `module-attribute` ¶

`DEFAULT_BATCH_SIZE: Final[int] = 1048576` `module-attribute` ¶

`DEFAULT_FORMAT: Final[str] = 'arrow'` `module-attribute` ¶

`BaseExpression = PyArrowWrapper[ds.Expression]` `module-attribute` ¶

`BaseDataset = PyArrowWrapper[ds.Dataset]` `module-attribute` ¶

`PyArrowWrapper` ¶

Bases: Generic[ArrowType]

`__init__(wrapped: ArrowType) -> None` ¶

`to_pyarrow() -> ArrowType` ¶

`BaseDatasetLoader` ¶

Bases: Generic[P, R]

`dataclass` ¶

`T = TypeVar('T')` `module-attribute` ¶

`Field` ¶

Bases: Field, Expression

`name = name` `instance-attribute` ¶

`__init__(name: Optional[str] = None, *, default=dc.MISSING, default_factory=dc.MISSING, init=True, repr=True, hash=None, compare=True, metadata=None, kw_only=dc.MISSING)` ¶

`__call__(data: Mapping[str, Any]) -> Any` ¶

Get the value of the field.

Parameters:

Name	Type	Description	Default
`data`	`dict`	The data to be accessed.	required
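
Calling a field with a mapping returns the value stored under the field's name. A minimal sketch of such a callable getter follows. `Getter` is a hypothetical stand-in, and how the library's `Field` handles missing keys or nested names is not covered here.

```python
from typing import Any, Mapping, Optional


class Getter:
    """Toy field getter; a stand-in, not the library's Field."""

    def __init__(self, name: Optional[str] = None) -> None:
        self.name = name

    def __call__(self, data: Mapping[str, Any]) -> Any:
        # Look the stored name up in the mapping passed by the caller.
        return data[self.name]


get_title = Getter("title")
record = {"title": "Dune", "year": 1965}
title = get_title(record)
```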

`FieldAccessor` ¶

Bases: tuple, Generic[T]

`__new__(obj: Type[T]) -> FieldAccessor[T]` ¶

`__getattr__(name: str) -> Field` ¶

`ModelMeta` ¶

Bases: type

`__new__(mcs, name, bases, attrs, **kwargs)` ¶

`update_forward_refs(**kwargs: Any) -> None` ¶

`BaseModel` ¶

`__post_init__()` ¶

`field(*args, **kwargs) -> Field` ¶

Create a new field getter.

`field_from_dataclass_field(field: dc.Field) -> Field` ¶

Create a new field getter from a dataclass field.

`fields(cls: Type[T]) -> Union[FieldAccessor[T], Type[T]]` ¶

`dataset` ¶

`SourceType = Union[str, List[str], Union[Path, List[Path]], 'Dataset', List['Dataset']]` `module-attribute` ¶

`Dataset` ¶

Bases: BaseDataset

`cache_dir = cache_dir` `instance-attribute` ¶

`path: Path` `property` ¶

The path to the dataset.

Returns:

Type	Description
`Path`	The path to the dataset.

`format: str` `property` ¶

The format of the dataset.

Returns:

Type	Description
`str`	The format of the dataset.

`columns: List[str]` `property` ¶

Get the names of the columns in the dataset.

Returns:

Type	Description
`List[str]`	The names of the columns in the dataset.

`__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)` ¶

__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)

__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)

__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)

Parameters:

Name	Type	Description	Default
`data_or_loader`	`list of dict, dict of list, DataFrame, BaseDatasetLoader, str`	The data to load into the dataset or the (name of) loader to use.	`None`
`format`	`str`	The format of the dataset.	`DEFAULT_FORMAT`
`path`	`(str, Path, None)`	The path where the loaded data is stored.	`None`
`cache_dir`	`(str, Path, None)`	The directory to use for caching.	`None`
`loader_args`	`(tuple, None)`	Positional arguments passed to the loader. Used only when a loader is given as the first argument.	`None`
`loader_kwargs`	`(dict, None)`	Keyword arguments passed to the loader. Used only when a loader is given as the first argument.	`None`

`count_rows() -> int` ¶

Count the number of rows in the dataset.

Returns:

Type	Description
`int`	The number of rows in the dataset.

`__len__() -> int` ¶

Get the number of rows in the dataset.

Returns:

Type	Description
`int`	The number of rows in the dataset.

`head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame` ¶

Get the first rows of the dataset as a pandas DataFrame.

Parameters:

Name	Type	Description	Default
`num_rows`	`int`	The number of rows to get.	`5`
`columns`	`str, list of str, None`	Names of columns to get. If None, all columns are returned.	`None`
`filter`	`Expression`	The filter expression.	`None`
`batch_size`	`int`	Number of rows to get at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`DataFrame`	A pandas DataFrame containing the first rows of the dataset.

`__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]` ¶

__getitem__(indices: int) -> Dict[str, Any]

__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table

Get rows from the dataset.

`take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]` ¶

take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]

take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Take rows (and, optionally, a subset of columns) from the dataset.

Parameters:

Name	Type	Description	Default
`indices`	`int, slice, list of int, array-like`	Indices of rows to take.	`None`
`columns`	`str, list of str, None`	Names of columns to take. If None, all columns are taken.	`None`
`batch_size`	`int`	Number of rows to take at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`(dict, Table)`	A single row when `indices` is an int; otherwise the selected rows.

`map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset` ¶

Map a function over the dataset.

Parameters:

Name	Type	Description	Default
`func`	`Any`	The function to map over the dataset.	required
`batch_size`	`int`	Number of rows to map at a time.	`DEFAULT_BATCH_SIZE`
`batched`	`bool`	Whether the function is batched.	`False`
`verbose`	`bool \| int`	Whether to show a progress bar.	`1`

Returns:

Type	Description
`Dataset`	A new dataset containing the mapped rows.

`filter(expression: Expression = None) -> Dataset` ¶

Filter the dataset.

Parameters:

Name	Type	Description	Default
`expression`	`Expression`	The filter expression.	`None`

Returns:

Type	Description
`Dataset`	A new dataset containing only the rows that match the filter expression.

`select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Select columns from the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`str, list of str`	Names of columns to select.	required

Returns:

Type	Description
`Dataset`	A new dataset containing only the selected columns.

`rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Rename columns in the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`dict`	Mapping of old column names to new column names.	required

Returns:

Type	Description
`Dataset`	A new dataset with the columns renamed.

`project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Project columns in the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`dict`	Mapping of column names to expressions.	required
`batch_size`	`int`	Number of rows to project at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`Dataset`	A new dataset with the columns projected.

`load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset` `classmethod` ¶

Load an existing dataset.

Parameters:

Name	Type	Description	Default
`path`	`(str, Path)`	The path to the dataset.	required
`format`	`str`	The format of the dataset.	`DEFAULT_FORMAT`

Returns:

Type	Description
`Dataset`	The loaded dataset.

`to_polars() -> pl.LazyFrame` ¶

Convert the dataset to a Polars LazyFrame.

Returns:

Type	Description
`LazyFrame`	The Polars Lazy DataFrame.

`gen_unique_cached_path(*refs: Any, cache_dir: Union[str, Path, None] = None) -> Path` ¶

`writable(data: Any, schema: Optional[pa.Schema] = None) -> Union[pa.RecordBatch, pa.Table, pa.RecordBatchReader]` ¶

`write_dataset(path: Union[str, Path], data: Union[ds.Dataset, pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], pa.RecordBatchReader, pd.DataFrame, Mapping[str, List[Any]], Sequence[Mapping[str, Any]]], schema: pa.Schema = None, format: Optional[str] = None) -> bool` ¶

`read_dataset(path: Union[str, Path], format: str) -> ds.dataset` ¶

`to_batches(data: Union[pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], Iterable[pa.Table], pa.RecordBatchReader]) -> Generator[pa.RecordBatch, None, None]` ¶

`create_mapped_table(data: Union[dict, list, pd.DataFrame, pa.RecordBatch, pa.Table], existing: Optional[pa.Table] = None, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Optional[List[str]] = None) -> pa.Table` ¶

`expression` ¶

`Expression` ¶

Bases: BaseExpression

A class representing an expression in Octoflow.

`__init__(expression: Union[Expression, ds.Expression])` ¶

Parameters:

Name	Type	Description	Default
`expression`	`Union[Expression, ds.Expression]`	The (pyarrow) expression to wrap.	required

`__eq__(other: Any) -> Expression` ¶

Compare two expressions for equality.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__ne__(other: Any) -> Expression` ¶

Compare two expressions for inequality.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__lt__(other: Any) -> Expression` ¶

Compare two expressions for less than.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__le__(other: Any) -> Expression` ¶

Compare two expressions for less than or equal to.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__gt__(other: Any) -> Expression` ¶

Compare two expressions for greater than.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__ge__(other: Any) -> Expression` ¶

Compare two expressions for greater than or equal to.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to compare to.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the comparison.

`__and__(other: Any) -> Expression` ¶

Combine two expressions with a logical and.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to combine with.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the combination.

`__or__(other: Any) -> Expression` ¶

Combine two expressions with a logical or.

Parameters:

Name	Type	Description	Default
`other`	`Any`	The other expression to combine with.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the combination.

`__invert__() -> Expression` ¶

Invert an expression.

Returns:

Type	Description
`Expression`	The expression representing the inverted expression.

`is_nan() -> Expression` ¶

Check if an expression is NaN.

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`is_null(nan_is_null: bool = False)` ¶

Check if an expression is null.

Parameters:

Name	Type	Description	Default
`nan_is_null`	`bool`	Whether to consider NaN values as null, by default False	`False`

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`is_valid() -> Expression` ¶

Check if an expression is valid.

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`isin(other: Expression) -> Expression` ¶

Check if an expression is in a set of values.

Parameters:

Name	Type	Description	Default
`other`	`Expression`	The set of values to check against.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`equals(other: Expression) -> Expression` ¶

Check if an expression is equal to another expression.

Parameters:

Name	Type	Description	Default
`other`	`Expression`	The other expression to check against.	required

Returns:

Type	Description
`Expression`	The expression representing the result of the check.

`__hash__() -> int` ¶

Get the hash of the expression.

Returns:

Type	Description
`int`	The hash of the expression.

`__repr__() -> str` ¶

Get the representation of the expression.

Returns:

Type	Description
`str`	The representation of the expression.

`scalar(value: Any) -> Expression` ¶

Create an expression from a scalar.

Parameters:

Name	Type	Description	Default
`value`	`Any`	The value of the scalar.	required

Returns:

Type	Description
`Expression`	The expression representing the scalar.

`loaders` ¶

`P = ParamSpec('P')` `module-attribute` ¶

`R = TypeVar('R')` `module-attribute` ¶

`F = TypeVar('F', bound=Callable[..., Any])` `module-attribute` ¶

`loaders: Dict[str, DatasetLoader] = {}` `module-attribute` ¶

`DatasetLoader` ¶

Bases: BaseDatasetLoader

`func = func` `instance-attribute` ¶

`name = name or self.func.__name__` `instance-attribute` ¶

`extensions = extensions` `instance-attribute` ¶

`path_arg = path_arg` `instance-attribute` ¶

`wraps = wraps` `instance-attribute` ¶

`__init__(func: Callable[..., Any], name: Optional[str] = None, extensions: Optional[list[str]] = None, path_arg: Optional[str] = None, wraps: Optional[Callable[P, R]] = None)` ¶

Parameters:

Name	Type	Description	Default
`func`	`Callable[..., Any]`	The function to decorate.	required
`name`	`Optional[str]`	The name of the loader, by default None.	`None`
`extensions`	`Optional[list[str]]`	The extensions that the loader supports, by default None.	`None`
`path_arg`	`Optional[str]`	The name of the argument that is the path, by default None.	`None`
`wraps`	`Optional[Callable[..., Any]]`	The function to wrap, by default None.	`None`

`__call__(*args: P.args, **kwargs: P.kwargs) -> R` ¶

Call the loader function.

Parameters:

Name	Type	Description	Default
`args`	`tuple`	The arguments to pass to the function.	`()`
`kwargs`	`dict`	The keyword arguments to pass to the function.	`{}`

Returns:

Type	Description
`R`	The result of the function.

`bind(*args: P.args, **kwargs: P.kwargs) -> Callable[..., R]` ¶

Bind arguments to the loader function.

Notes

This method creates a partial function with pre-filled positional and keyword arguments. Binding arguments this way helps keep the fingerprint of the dataset unique.

Parameters:

Name	Type	Description	Default
`args`	`tuple`	The arguments to pre-fill.	`()`
`kwargs`	`dict`	The keyword arguments to pre-fill.	`{}`

Returns:

Type	Description
`Callable[..., R]`	The partial function.
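
The note on `bind` ties pre-filled arguments to the dataset fingerprint. The sketch below shows one way this could work, assuming the fingerprint is a hash over the function name and its bound arguments. The `bind` and `fingerprint` functions here are hypothetical illustrations built on `functools.partial`, not the library's own code.

```python
import functools
import hashlib
import json
from typing import Any, Callable, Dict, List


def bind(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Callable[..., Any]:
    """Pre-fill arguments, returning a partial function."""
    return functools.partial(func, *args, **kwargs)


def fingerprint(loader: Callable[..., Any]) -> str:
    """Hash the callable's name together with any pre-filled arguments."""
    if isinstance(loader, functools.partial):
        payload = {
            "name": loader.func.__name__,
            "args": list(loader.args),
            "kwargs": loader.keywords,
        }
    else:
        payload = {"name": loader.__name__, "args": [], "kwargs": {}}
    # sort_keys makes the hash independent of keyword order.
    encoded = json.dumps(payload, sort_keys=True, default=repr).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def load_range(start: int, stop: int) -> List[Dict[str, int]]:
    return [{"n": n} for n in range(start, stop)]


first = bind(load_range, 0, 3)
second = bind(load_range, 0, 5)
```

Because the arguments are visible on the partial object, `first` and `second` hash differently even though they wrap the same function.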

`dataloader(func: Union[F, str, None] = None, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[..., Any]] = None, path_arg: Optional[str] = None) -> Union[F, Callable[[F], F]]` ¶

dataloader(func: F, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> F

dataloader(name: str, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> Callable[[F], F]

Decorator to register a function as a dataset loader.

Parameters:

Name	Type	Description	Default
`func`	`Union[Callable[..., Any], str, None]`	The function to decorate, by default None.	`None`
`name`	`Optional[str]`	The name of the loader, by default None.	`None`
`extensions`	`Optional[list[str]]`	The extensions that the loader supports, by default None.	`None`
`wraps`	`Optional[Callable[..., Any]]`	The function to wrap, by default None.	`None`
`path_arg`	`Optional[str]`	The name of the argument that is the path, by default None.	`None`

Returns:

Type	Description
`DatasetLoader`	The dataset loader.
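
The two overloads correspond to using the decorator bare or with a loader name as its first argument. A minimal sketch of a decorator that supports both forms and records functions in a registry is shown below. `register_loader` and `registry` are hypothetical names; the sketch returns the original function, whereas the documented decorator is listed as returning a `DatasetLoader`.

```python
from typing import Any, Callable, Dict, List, Optional, Union

# Hypothetical registry, similar in spirit to the module-level `loaders` dict.
registry: Dict[str, Callable[..., Any]] = {}


def register_loader(
    func: Union[Callable[..., Any], str, None] = None,
    name: Optional[str] = None,
    extensions: Optional[List[str]] = None,
):
    # Used as @register_loader("name"): the first argument is the name.
    if isinstance(func, str):
        name, func = func, None

    def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
        key = name or f.__name__
        f.extensions = list(extensions or [])
        registry[key] = f
        return f

    # Bare @register_loader: decorate the function immediately.
    if callable(func):
        return decorator(func)
    # @register_loader(...) with arguments: return the decorator itself.
    return decorator


@register_loader
def split_lines(text: str) -> List[Dict[str, str]]:
    return [{"line": line} for line in text.splitlines()]


@register_loader("words", extensions=[".txt"])
def split_words(text: str) -> List[Dict[str, str]]:
    return [{"word": word} for word in text.split()]
```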

`load_json(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None]` ¶

Load a dataset from a JSON file.

Parameters:

Name	Type	Description	Default
`path`	`(str, Path)`	The path to the file.	required
`encoding`	`str`	The encoding of the file, by default "utf-8".	`'utf-8'`

Returns:

Type	Description
`dict`	The loaded dataset.

`load_jsonl(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None]` ¶

Load a dataset from a JSONL file.

Parameters:

Name	Type	Description	Default
`path`	`(str, Path)`	The path to the file.	required
`encoding`	`str`	The encoding of the file, by default "utf-8".	`'utf-8'`

Returns:

Type	Description
`list[dict]`	The loaded dataset.
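
The signature above is a generator of lists of dicts, which suggests the file is read in batches. The sketch below shows that shape using only the standard library. `read_jsonl_batches` and its `batch_size` parameter are hypothetical and are not part of the documented `load_jsonl` signature.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Generator, List, Union


def read_jsonl_batches(
    path: Union[str, Path],
    encoding: str = "utf-8",
    batch_size: int = 2,
) -> Generator[List[Dict], None, None]:
    """Yield lists of parsed records, at most batch_size per list."""
    batch: List[Dict] = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n', encoding="utf-8")
    batches = list(read_jsonl_batches(sample))
```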

`load_csv(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None]` ¶

Load a dataset from a CSV/TSV file.

Parameters:

Name	Type	Description	Default
`path`	`(str, Path)`	The path to the file.	required
`encoding`	`str`	The encoding of the file, by default "utf-8".	`'utf-8'`

Returns:

Type	Description
`list[dict]`	The loaded dataset.

`metadata` ¶

`unify_metadata(left: Any, right: Any) -> Optional[dict]` ¶

`sampler` ¶

`Sampler` ¶

`args = (0, *list(columns.values()))` `instance-attribute` ¶

`columns = list(columns.keys())` `instance-attribute` ¶

`__init__(columns: Mapping[str, int])` ¶

`__call__(lst: Sequence[int])` ¶

`schema` ¶

`T = TypeVar('T')` `module-attribute` ¶

`unify_schemas(this: pa.Schema, other: Optional[pa.Schema]) -> pa.Schema` ¶

`infer_schema(data: Dict[str, Any], metadata: Optional[Dict[str, Any]] = None) -> Self` ¶

`validate(schema: pa.Schema, data: dict) -> bool` ¶

Validates a dictionary against a PyArrow schema.

Parameters:

Name	Type	Description	Default
`schema`	`Schema`	The PyArrow schema to validate against.	required
`data`	`dict`	The dictionary to validate.	required

Raises:

Type	Description
`ValidationError`	If the dictionary does not match the schema.

Examples:

>>> schema = pa.schema([pa.field('id', pa.int64()), pa.field('name', pa.string())])
>>> valid_dict = {'id': 1, 'name': 'Alice'}
>>> validate(schema, valid_dict)
>>> invalid_dict = {'id': '1', 'name': 'Alice'}
>>> validate(schema, invalid_dict)
Traceback (most recent call last):
...
ValidationError: ...

`get_schema(data: T) -> Tuple[T, pa.Schema]` ¶

Extracts the schema from a PyArrow schema or a generator of PyArrow record batches.

Parameters:

Name	Type	Description	Default
`data`	`Any`	The PyArrow schema or generator of record batches.	required

Returns:

Type	Description
`Tuple[Any, Schema]`	The data and the schema.

`from_dataclass(cls: T) -> pa.Schema` ¶

Converts a dataclass to a PyArrow schema.

Parameters:

Name	Type	Description	Default
`cls`	`Type[T]`	The dataclass to convert.	required

Returns:

Type	Description
`Schema`	The PyArrow schema.

Examples:

>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     id: int
...     name: str
>>> from_dataclass(Record)
pyarrow.Schema([...])
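To make the conversion more concrete, the sketch below walks the fields of a dataclass and maps each annotation to the name of an Arrow type. It is only an approximation of the idea: the helper `describe_dataclass` and the `ARROW_TYPE_NAMES` table are hypothetical, and type names are given as plain strings so the example runs without PyArrow.

```python
import dataclasses
from typing import Dict, get_type_hints

# Hypothetical mapping from Python annotations to Arrow type names (as strings).
ARROW_TYPE_NAMES = {int: "int64", float: "double", str: "string", bool: "bool", bytes: "binary"}


def describe_dataclass(cls: type) -> Dict[str, str]:
    """Map each dataclass field name to the name of an Arrow type."""
    if not dataclasses.is_dataclass(cls):
        raise TypeError(f"{cls!r} is not a dataclass")
    hints = get_type_hints(cls)  # resolves string annotations to real types
    return {f.name: ARROW_TYPE_NAMES[hints[f.name]] for f in dataclasses.fields(cls)}


@dataclasses.dataclass
class Record:
    id: int
    name: str


print(describe_dataclass(Record))  # {'id': 'int64', 'name': 'string'}
```

Field order follows the order of declaration in the dataclass, which is also the order a schema built from it would most naturally use.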

`get_schema_from_dataclass(*args, **kwargs) -> pa.Schema` ¶

Alias for from_dataclass.

Examples:

>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     id: int
...     name: str
>>> get_schema_from_dataclass(Record)
pyarrow.Schema([...])

`types` ¶

`UNDEFINED = undefined()` `module-attribute` ¶

`MonthDayNano` ¶

Bases: NamedTuple

`months: int` `instance-attribute` ¶

`days: int` `instance-attribute` ¶

`nanoseconds: int` `instance-attribute` ¶
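Going by the base class and the three attributes listed above, the type can be approximated as a plain `NamedTuple`. This is a reconstruction from the listing, not the library's source.

```python
from typing import NamedTuple


class MonthDayNano(NamedTuple):
    """Interval split into calendar months, days, and nanoseconds."""

    months: int
    days: int
    nanoseconds: int


interval = MonthDayNano(months=1, days=15, nanoseconds=0)
months, days, nanoseconds = interval  # a NamedTuple unpacks like a regular tuple
```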

`Undefined` ¶

Bases: ExtensionType

`__init__()` ¶

`__arrow_ext_serialize__() -> bytes` ¶

`__arrow_ext_deserialize__(storage_type, serialized) -> Undefined` `classmethod` ¶

`undefined() -> Undefined` ¶

`is_undefined(obj: pa.DataType) -> bool` ¶

`from_dataclass(cls: type) -> pa.DataType` ¶

Return the PyArrow data type of a dataclass.

Parameters:

Name	Type	Description	Default
`cls`	`type`	The dataclass.	required

Returns:

Type	Description
`DataType`	The PyArrow data type.

`from_typed_dict(cls: _TypedDictMeta) -> pa.DataType` ¶

Return the PyArrow data type of a TypedDict.

Parameters:

Name	Type	Description	Default
`cls`	`_TypedDictMeta`	The TypedDict.	required

Returns:

Type	Description
`DataType`	The PyArrow data type.

`from_union(args: tuple[type, ...]) -> pa.DataType` ¶

`from_dtype(dtype: Union[type, np.dtype, None]) -> pa.DataType` ¶

Return the PyArrow data type of a provided native/NumPy data type.

Parameters:

Name	Type	Description	Default
`dtype`	`type \| dtype \| None`	The native or NumPy data type.	required

Returns:

Type	Description
`DataType`	The PyArrow data type.

`unify_types(left: pa.DataType, right: pa.DataType) -> pa.DataType` ¶

Return the PyArrow data type that can represent both left and right.

Parameters:

Name	Type	Description	Default
`left`	`DataType`	The left PyArrow data type.	required
`right`	`DataType`	The right PyArrow data type.	required

Returns:

Type	Description
`DataType`	The PyArrow data type.
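One common way to find a type that can represent two others is a fixed promotion order. The toy sketch below shows that idea on type names only; the `PROMOTION_ORDER` list and the helper `unify_type_names` are assumptions for illustration and do not describe the rules this function actually applies.

```python
# Hypothetical promotion order: each entry is treated as able to stand in
# for every entry before it.
PROMOTION_ORDER = ["null", "bool", "int64", "double", "string"]


def unify_type_names(left: str, right: str) -> str:
    """Return the wider of two type names under PROMOTION_ORDER."""
    # list.index raises ValueError for names outside the order.
    return max(left, right, key=PROMOTION_ORDER.index)
```

Under this order, `unify_type_names("int64", "double")` gives `"double"`, and unifying any type with `"null"` gives that type back.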

`infer_type(obj: Any) -> pa.DataType` ¶

Return the PyArrow data type of an object.

Parameters:

Name	Type	Description	Default
`obj`	`Any`	The object.	required

Returns:

Type	Description
`DataType`	The PyArrow data type.
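A rough sketch of type inference for Python scalars is shown below. It returns Arrow type names as strings so that it runs without PyArrow; the helper `infer_type_name` and the set of supported types are assumptions, not the library's behavior.

```python
from typing import Any


def infer_type_name(obj: Any) -> str:
    """Return a hypothetical Arrow type name for a Python scalar."""
    if obj is None:
        return "null"
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(obj, bool):
        return "bool"
    if isinstance(obj, int):
        return "int64"
    if isinstance(obj, float):
        return "double"
    if isinstance(obj, str):
        return "string"
    if isinstance(obj, bytes):
        return "binary"
    raise TypeError(f"unsupported type: {type(obj).__name__}")
```

The ordering of the checks matters: testing `int` first would classify `True` and `False` as integers.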


`data` ¶

`Dataset` ¶

`cache_dir = cache_dir` `instance-attribute` ¶

`path: Path` `property` ¶

`format: str` `property` ¶

`columns: List[str]` `property` ¶

`count_rows() -> int` ¶

`__len__() -> int` ¶

`head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame` ¶

`__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]` ¶

`take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]` ¶

`map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset` ¶

`filter(expression: Expression = None) -> Dataset` ¶

`select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

`rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

`project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

`load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset` `classmethod` ¶

`to_polars() -> pl.LazyFrame` ¶

`Expression` ¶

`__init__(expression: Union[Expression, ds.Expression])` ¶

`__eq__(other: Any) -> Expression` ¶

`__ne__(other: Any) -> Expression` ¶

`__lt__(other: Any) -> Expression` ¶

`__le__(other: Any) -> Expression` ¶

`__gt__(other: Any) -> Expression` ¶

`__ge__(other: Any) -> Expression` ¶

`__and__(other: Any) -> Expression` ¶

`__or__(other: Any) -> Expression` ¶

`__invert__() -> Expression` ¶

`is_nan() -> Expression` ¶

`is_null(nan_is_null: bool = False)` ¶

`is_valid() -> Expression` ¶

`isin(other: Expression) -> Expression` ¶

`equals(other: Expression) -> Expression` ¶

`__hash__() -> int` ¶

`__repr__() -> str` ¶

`field(*args, **kwargs) -> Field` ¶

`scalar(value: Any) -> Expression` ¶

`dataloader(func: Union[F, str, None] = None, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[..., Any]] = None, path_arg: Optional[str] = None) -> Union[F, Callable[[F], F]]` ¶

`load_dataset(__loader: str, __path: Optional[str], __force: bool = False, __dataset_format: str = DEFAULT_FORMAT, __dataset_path: Union[Path, str, None] = None, /, *args, **kwargs) -> Dataset` ¶

`base` ¶

`ArrowType = TypeVar('ArrowType')` `module-attribute` ¶

`P = ParamSpec('P')` `module-attribute` ¶

`R = TypeVar('R')` `module-attribute` ¶

`DEFAULT_BATCH_SIZE: Final[int] = 1048576` `module-attribute` ¶

`DEFAULT_FORMAT: Final[str] = 'arrow'` `module-attribute` ¶

`BaseExpression = PyArrowWrapper[ds.Expression]` `module-attribute` ¶

`BaseDataset = PyArrowWrapper[ds.Dataset]` `module-attribute` ¶

`PyArrowWrapper` ¶

`__init__(wrapped: ArrowType) -> None` ¶

`to_pyarrow() -> ArrowType` ¶

`BaseDatasetLoader` ¶

`dataclass` ¶

`T = TypeVar('T')` `module-attribute` ¶

`Field` ¶

`name = name` `instance-attribute` ¶

`__init__(name: Optional[str] = None, *, default=dc.MISSING, default_factory=dc.MISSING, init=True, repr=True, hash=None, compare=True, metadata=None, kw_only=dc.MISSING)` ¶

`__call__(data: Mapping[str, Any]) -> Any` ¶

`FieldAccessor` ¶

`__new__(obj: Type[T]) -> FieldAccessor[T]` ¶

`__getattr__(name: str) -> Field` ¶

`ModelMeta` ¶

`__new__(mcs, name, bases, attrs, **kwargs)` ¶

`update_forward_refs(**kwargs: Any) -> None` ¶

`BaseModel` ¶

`__post_init__()` ¶

`field(*args, **kwargs) -> Field` ¶

`field_from_dataclass_field(field: dc.Field) -> Field` ¶

`fields(cls: Type[T]) -> Union[FieldAccessor[T], Type[T]]` ¶

`dataset` ¶

`SourceType = Union[str, List[str], Union[Path, List[Path]], 'Dataset', List['Dataset']]` `module-attribute` ¶

`Dataset` ¶

`cache_dir = cache_dir` `instance-attribute` ¶

`path: Path` `property` ¶

`format: str` `property` ¶

`columns: List[str]` `property` ¶

`count_rows() -> int` ¶

`__len__() -> int` ¶

`head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame` ¶

`__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]` ¶