Data
data ¶
Dataset ¶
Bases: BaseDataset
cache_dir = cache_dir instance-attribute ¶
path: Path property ¶
The path to the dataset.
Returns:
| Type | Description |
|---|---|
Path | The path to the dataset. |
format: str property ¶
The format of the dataset.
Returns:
| Type | Description |
|---|---|
str | The format of the dataset. |
columns: List[str] property ¶
Get the names of the columns in the dataset.
Returns:
| Type | Description |
|---|---|
List[str] | The names of the columns in the dataset. |
__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False) ¶
__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)
__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)
__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_or_loader | list of dict, dict of list, DataFrame, BaseDatasetLoader, str | The data to load into the dataset or the (name of) loader to use. | None |
format | str | The format of the dataset. | DEFAULT_FORMAT |
path | (str, Path, None) | The path where the dataset will be stored. | None |
cache_dir | (str, Path, None) | The directory to use for caching. | None |
loader_args | (tuple, None) | Positional arguments to pass to the loader. Used only when a loader is given as the first argument. | None |
loader_kwargs | (dict, None) | Keyword arguments to pass to the loader. Used only when a loader is given as the first argument. | None |
count_rows() -> int ¶
Count the number of rows in the dataset.
Returns:
| Type | Description |
|---|---|
int | The number of rows in the dataset. |
__len__() -> int ¶
Get the number of rows in the dataset.
Returns:
| Type | Description |
|---|---|
int | The number of rows in the dataset. |
head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame ¶
Get the first rows of the dataset as a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
num_rows | int | The number of rows to get. | 5 |
columns | str, list of str, None | Names of columns to get. If None, all columns are returned. | None |
filter | Expression | The filter expression. | None |
batch_size | int | Number of rows to get at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
DataFrame | A pandas DataFrame containing the first rows of the dataset. |
__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table] ¶
__getitem__(indices: int) -> Dict[str, Any]
__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table
Get rows from the dataset.
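The overloads above distinguish a single integer index (one row as a dict) from a slice or list of indices (several rows as a table). The following is a minimal sketch of that dispatch using plain Python containers. `ToyColumnarStore` is a hypothetical stand-in, not part of this library, and a dict of lists stands in for a `pa.Table`.

```python
from typing import Any, Dict, List, Union


class ToyColumnarStore:
    """Hypothetical stand-in for a dataset; columns are stored as plain lists."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = columns

    def __getitem__(
        self, indices: Union[int, slice, List[int]]
    ) -> Union[Dict[str, Any], Dict[str, List[Any]]]:
        if isinstance(indices, int):
            # A single integer index yields one row as a dict of scalars.
            return {name: values[indices] for name, values in self._columns.items()}
        if isinstance(indices, slice):
            # A slice yields a column-oriented subset (the "table" case).
            return {name: values[indices] for name, values in self._columns.items()}
        # A list of positions yields the rows at those positions, in order.
        return {
            name: [values[i] for i in indices]
            for name, values in self._columns.items()
        }


store = ToyColumnarStore({"id": [1, 2, 3], "name": ["a", "b", "c"]})
row = store[0]          # one row
rows = store[1:3]       # contiguous rows
picked = store[[2, 0]]  # arbitrary rows, in the requested order
```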
take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table] ¶
take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]
take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame
Take rows (and optionally a subset of columns) from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
indices | int, slice, list of int, array-like | Indices of rows to take. | None |
columns | str, list of str, None | Names of columns to take. If None, all columns are taken. | None |
batch_size | int | Number of rows to take at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
(dict, Table) | The taken row (for a single integer index) or rows. |
map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset ¶
Map a function over the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
func | Any | The function to map over the dataset. | required |
batch_size | int | Number of rows to map at a time. | DEFAULT_BATCH_SIZE |
batched | bool | Whether the function is batched. | False |
verbose | bool or int | Whether to show a progress bar. | 1 |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset containing the mapped rows. |
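The difference between row-wise and batched mapping can be sketched with plain lists. This is an illustration under assumptions, not the library's implementation: `toy_map` and `iter_batches` are hypothetical names, and the sketch assumes that a batched function receives a list of rows and returns a list of rows. The real method may pass a different batch type.

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]


def iter_batches(rows: List[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive chunks of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def toy_map(
    rows: List[Row],
    func: Callable[..., Any],
    batch_size: int = 2,
    batched: bool = False,
) -> List[Row]:
    """Apply `func` per row, or per batch when `batched` is True (assumed semantics)."""
    out: List[Row] = []
    for batch in iter_batches(rows, batch_size):
        if batched:
            # The function sees the whole batch and returns the mapped rows.
            out.extend(func(batch))
        else:
            # The function sees one row at a time.
            out.extend(func(row) for row in batch)
    return out


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
rowwise = toy_map(rows, lambda row: {"y": row["x"] * 2})
batchwise = toy_map(
    rows,
    lambda batch: [{"y": row["x"] * 2} for row in batch],
    batched=True,
)
```

Both calls produce the same rows. The batched form mainly pays off when the function has a per-call overhead that is worth amortizing over many rows.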
filter(expression: Expression = None) -> Dataset ¶
Filter the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expression | Expression | The filter expression. | None |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset containing only the rows that match the filter expression. |
select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset ¶
Select columns from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
columns | str, list of str | Names of columns to select. | required |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset containing only the selected columns. |
rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset ¶
Rename columns in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
columns | dict | Mapping of old column names to new column names. | required |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset with the columns renamed. |
project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset ¶
Project columns in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
columns | dict | Mapping of column names to expressions. | required |
batch_size | int | Number of rows to project at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset with the columns projected. |
load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset classmethod ¶
Load an existing dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path | (str, Path) | The path to the dataset. | required |
format | str | The format of the dataset. | DEFAULT_FORMAT |
Returns:
| Type | Description |
|---|---|
Dataset | The loaded dataset. |
to_polars() -> pl.LazyFrame ¶
Convert the dataset to a Polars LazyFrame.
Returns:
| Type | Description |
|---|---|
LazyFrame | The Polars LazyFrame. |
Expression ¶
Bases: BaseExpression
A class representing an expression in Octoflow.
__init__(expression: Union[Expression, ds.Expression]) ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expression | Union[Expression, ds.Expression] | The (pyarrow) expression to wrap. | required |
__eq__(other: Any) -> Expression ¶
Compare two expressions for equality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__ne__(other: Any) -> Expression ¶
Compare two expressions for inequality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__lt__(other: Any) -> Expression ¶
Compare two expressions for less than.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__le__(other: Any) -> Expression ¶
Compare two expressions for less than or equal to.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__gt__(other: Any) -> Expression ¶
Compare two expressions for greater than.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__ge__(other: Any) -> Expression ¶
Compare two expressions for greater than or equal to.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__and__(other: Any) -> Expression ¶
Combine two expressions with a logical and.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to combine with. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the combination. |
__or__(other: Any) -> Expression ¶
Combine two expressions with a logical or.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to combine with. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the combination. |
__invert__() -> Expression ¶
Invert an expression.
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the inverted expression. |
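Because the comparison and logical operators return a new expression instead of a boolean, they can be chained into a predicate that is evaluated later. The sketch below shows that pattern with a small pure-Python class. `ToyExpr` and `toy_field` are hypothetical and only mimic the operator overloading; they do not wrap pyarrow.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ToyExpr:
    """Hypothetical lazy predicate; not the library's Expression class."""

    def __init__(self, evaluate: Callable[[Row], Any]) -> None:
        self._evaluate = evaluate

    def __call__(self, row: Row) -> Any:
        return self._evaluate(row)

    def __gt__(self, other: Any) -> "ToyExpr":
        return ToyExpr(lambda row: self(row) > other)

    def __eq__(self, other: Any) -> "ToyExpr":  # type: ignore[override]
        return ToyExpr(lambda row: self(row) == other)

    # Overriding __eq__ removes the inherited hash, so restore identity hashing.
    __hash__ = object.__hash__

    def __and__(self, other: "ToyExpr") -> "ToyExpr":
        return ToyExpr(lambda row: bool(self(row)) and bool(other(row)))

    def __or__(self, other: "ToyExpr") -> "ToyExpr":
        return ToyExpr(lambda row: bool(self(row)) or bool(other(row)))

    def __invert__(self) -> "ToyExpr":
        return ToyExpr(lambda row: not self(row))


def toy_field(name: str) -> ToyExpr:
    """Build an expression that reads one column from a row."""
    return ToyExpr(lambda row: row[name])


# Parentheses are required: `&` and `|` bind more tightly than comparisons.
predicate = (toy_field("age") > 30) & ~(toy_field("country") == "NL")

rows: List[Row] = [
    {"age": 41, "country": "NL"},
    {"age": 35, "country": "DE"},
    {"age": 22, "country": "DE"},
]
matches = [row for row in rows if predicate(row)]
```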
is_nan() -> Expression ¶
Check if an expression is NaN.
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
is_null(nan_is_null: bool = False) ¶
Check if an expression is null.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
nan_is_null | bool | Whether to treat NaN values as null. | False |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
is_valid() -> Expression ¶
Check if an expression is valid.
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
isin(other: Expression) -> Expression ¶
Check if an expression is in a set of values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Expression | The set of values to check against. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
equals(other: Expression) -> Expression ¶
Check if an expression is equal to another expression.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Expression | The other expression to check against. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
__hash__() -> int ¶
Get the hash of the expression.
Returns:
| Type | Description |
|---|---|
int | The hash of the expression. |
__repr__() -> str ¶
Get the representation of the expression.
Returns:
| Type | Description |
|---|---|
str | The representation of the expression. |
field(*args, **kwargs) -> Field ¶
Create a new field getter.
scalar(value: Any) -> Expression ¶
Create an expression from a scalar.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
value | Any | The value of the scalar. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the scalar. |
dataloader(func: Union[F, str, None] = None, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[..., Any]] = None, path_arg: Optional[str] = None) -> Union[F, Callable[[F], F]] ¶
dataloader(func: F, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> F
dataloader(name: str, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> Callable[[F], F]
Decorator to register a function as a dataset loader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
func | Union[Callable[..., Any], str, None] | The function to decorate, by default None. | None |
name | Optional[str] | The name of the loader, by default None. | None |
extensions | Optional[list[str]] | The extensions that the loader supports, by default None. | None |
wraps | Optional[Callable[..., Any]] | The function to wrap, by default None. | None |
path_arg | Optional[str] | The name of the argument that is the path, by default None. | None |
Returns:
| Type | Description |
|---|---|
DatasetLoader | The dataset loader. |
load_dataset(__loader: str, __path: Optional[str], __force: bool = False, __dataset_format: str = DEFAULT_FORMAT, __dataset_path: Union[Path, str, None] = None, /, *args, **kwargs) -> Dataset ¶
Load a dataset from a path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
__loader | str | The name of the loader. | required |
__path | Optional[str] | The path to the data (to be passed to the loader). | required |
__dataset_format | str | The format of the dataset, by default DEFAULT_FORMAT. | DEFAULT_FORMAT |
__dataset_path | Union[Path, str, None] | The path where the dataset will be stored. | None |
*args | tuple | The arguments to pass to the loader. | () |
**kwargs | dict | The keyword arguments to pass to the loader. | {} |
Returns:
| Type | Description |
|---|---|
Dataset | The loaded dataset. |
base ¶
ArrowType = TypeVar('ArrowType') module-attribute ¶
P = ParamSpec('P') module-attribute ¶
R = TypeVar('R') module-attribute ¶
DEFAULT_BATCH_SIZE: Final[int] = 1048576 module-attribute ¶
DEFAULT_FORMAT: Final[str] = 'arrow' module-attribute ¶
BaseExpression = PyArrowWrapper[ds.Expression] module-attribute ¶
BaseDataset = PyArrowWrapper[ds.Dataset] module-attribute ¶
PyArrowWrapper ¶
dataclass ¶
T = TypeVar('T') module-attribute ¶
Field ¶
Bases: Field, Expression
name = name instance-attribute ¶
__init__(name: Optional[str] = None, *, default=dc.MISSING, default_factory=dc.MISSING, init=True, repr=True, hash=None, compare=True, metadata=None, kw_only=dc.MISSING) ¶
__call__(data: Mapping[str, Any]) -> Any ¶
Get the value of the field.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data | dict | The data to be accessed. | required |
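A field getter of this kind is a named callable that looks itself up in a mapping. The following sketch shows only that idea. `ToyField` is a hypothetical class and ignores the dataclass and expression behaviour of the real `Field`.

```python
from typing import Any, Mapping


class ToyField:
    """Hypothetical field getter: calling it reads `name` from a mapping."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __call__(self, data: Mapping[str, Any]) -> Any:
        # Raises KeyError if the mapping has no entry for this field.
        return data[self.name]


get_title = ToyField("title")
record = {"title": "Octoflow", "stars": 3}
title = get_title(record)
```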
FieldAccessor ¶
ModelMeta ¶
field(*args, **kwargs) -> Field ¶
Create a new field getter.
field_from_dataclass_field(field: dc.Field) -> Field ¶
Create a new field getter from a dataclass field.
fields(cls: Type[T]) -> Union[FieldAccessor[T], Type[T]] ¶
dataset ¶
SourceType = Union[str, List[str], Union[Path, List[Path]], 'Dataset', List['Dataset']] module-attribute ¶
Dataset ¶
Bases: BaseDataset
cache_dir = cache_dir instance-attribute ¶
path: Path property ¶
The path to the dataset.
Returns:
| Type | Description |
|---|---|
Path | The path to the dataset. |
format: str property ¶
The format of the dataset.
Returns:
| Type | Description |
|---|---|
str | The format of the dataset. |
columns: List[str] property ¶
Get the names of the columns in the dataset.
Returns:
| Type | Description |
|---|---|
List[str] | The names of the columns in the dataset. |
__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False) ¶
__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)
__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)
__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_or_loader | list of dict, dict of list, DataFrame, BaseDatasetLoader, str | The data to load into the dataset or the (name of) loader to use. | None |
format | str | The format of the dataset. | DEFAULT_FORMAT |
path | (str, Path, None) | The path where the dataset will be stored. | None |
cache_dir | (str, Path, None) | The directory to use for caching. | None |
loader_args | (tuple, None) | Positional arguments to pass to the loader. Used only when a loader is given as the first argument. | None |
loader_kwargs | (dict, None) | Keyword arguments to pass to the loader. Used only when a loader is given as the first argument. | None |
count_rows() -> int ¶
Count the number of rows in the dataset.
Returns:
| Type | Description |
|---|---|
int | The number of rows in the dataset. |
__len__() -> int ¶
Get the number of rows in the dataset.
Returns:
| Type | Description |
|---|---|
int | The number of rows in the dataset. |
head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame ¶
Get the first rows of the dataset as a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
num_rows | int | The number of rows to get. | 5 |
columns | str, list of str, None | Names of columns to get. If None, all columns are returned. | None |
filter | Expression | The filter expression. | None |
batch_size | int | Number of rows to get at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
DataFrame | A pandas DataFrame containing the first rows of the dataset. |
__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table] ¶
__getitem__(indices: int) -> Dict[str, Any]
__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table
Get rows from the dataset.
take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table] ¶
take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]
take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame
Take rows (and optionally a subset of columns) from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
indices | int, slice, list of int, array-like | Indices of rows to take. | None |
columns | str, list of str, None | Names of columns to take. If None, all columns are taken. | None |
batch_size | int | Number of rows to take at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
(dict, Table) | The taken row (for a single integer index) or rows. |
map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset ¶
Map a function over the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
func | Any | The function to map over the dataset. | required |
batch_size | int | Number of rows to map at a time. | DEFAULT_BATCH_SIZE |
batched | bool | Whether the function is batched. | False |
verbose | bool or int | Whether to show a progress bar. | 1 |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset containing the mapped rows. |
filter(expression: Expression = None) -> Dataset ¶
Filter the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expression | Expression | The filter expression. | None |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset containing only the rows that match the filter expression. |
select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset ¶
Select columns from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
columns | str, list of str | Names of columns to select. | required |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset containing only the selected columns. |
rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset ¶
Rename columns in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
columns | dict | Mapping of old column names to new column names. | required |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset with the columns renamed. |
project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset ¶
Project columns in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
columns | dict | Mapping of column names to expressions. | required |
batch_size | int | Number of rows to project at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
Dataset | A new dataset with the columns projected. |
load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset classmethod ¶
Load an existing dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path | (str, Path) | The path to the dataset. | required |
format | str | The format of the dataset. | DEFAULT_FORMAT |
Returns:
| Type | Description |
|---|---|
Dataset | The loaded dataset. |
to_polars() -> pl.LazyFrame ¶
Convert the dataset to a Polars LazyFrame.
Returns:
| Type | Description |
|---|---|
LazyFrame | The Polars LazyFrame. |
gen_unique_cached_path(*refs: Any, cache_dir: Union[str, Path, None] = None) -> Path ¶
writable(data: Any, schema: Optional[pa.Schema] = None) -> Union[pa.RecordBatch, pa.Table, pa.RecordBatchReader] ¶
write_dataset(path: Union[str, Path], data: Union[ds.Dataset, pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], pa.RecordBatchReader, pd.DataFrame, Mapping[str, List[Any]], Sequence[Mapping[str, Any]]], schema: pa.Schema = None, format: Optional[str] = None) -> bool ¶
read_dataset(path: Union[str, Path], format: str) -> ds.dataset ¶
to_batches(data: Union[pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], Iterable[pa.Table], pa.RecordBatchReader]) -> Generator[pa.RecordBatch, None, None] ¶
create_mapped_table(data: Union[dict, list, pd.DataFrame, pa.RecordBatch, pa.Table], existing: Optional[pa.Table] = None, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Optional[List[str]] = None) -> pa.Table ¶
expression ¶
Expression ¶
Bases: BaseExpression
A class representing an expression in Octoflow.
__init__(expression: Union[Expression, ds.Expression]) ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
expression | Union[Expression, ds.Expression] | The (pyarrow) expression to wrap. | required |
__eq__(other: Any) -> Expression ¶
Compare two expressions for equality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__ne__(other: Any) -> Expression ¶
Compare two expressions for inequality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__lt__(other: Any) -> Expression ¶
Compare two expressions for less than.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__le__(other: Any) -> Expression ¶
Compare two expressions for less than or equal to.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__gt__(other: Any) -> Expression ¶
Compare two expressions for greater than.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__ge__(other: Any) -> Expression ¶
Compare two expressions for greater than or equal to.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to compare to. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the comparison. |
__and__(other: Any) -> Expression ¶
Combine two expressions with a logical and.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to combine with. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the combination. |
__or__(other: Any) -> Expression ¶
Combine two expressions with a logical or.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Any | The other expression to combine with. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the combination. |
__invert__() -> Expression ¶
Invert an expression.
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the inverted expression. |
is_nan() -> Expression ¶
Check if an expression is NaN.
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
is_null(nan_is_null: bool = False) ¶
Check if an expression is null.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
nan_is_null | bool | Whether to treat NaN values as null. | False |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
is_valid() -> Expression ¶
Check if an expression is valid.
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
isin(other: Expression) -> Expression ¶
Check if an expression is in a set of values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Expression | The set of values to check against. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
equals(other: Expression) -> Expression ¶
Check if an expression is equal to another expression.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | Expression | The other expression to check against. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the result of the check. |
__hash__() -> int ¶
Get the hash of the expression.
Returns:
| Type | Description |
|---|---|
int | The hash of the expression. |
__repr__() -> str ¶
Get the representation of the expression.
Returns:
| Type | Description |
|---|---|
str | The representation of the expression. |
scalar(value: Any) -> Expression ¶
Create an expression from a scalar.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
value | Any | The value of the scalar. | required |
Returns:
| Type | Description |
|---|---|
Expression | The expression representing the scalar. |
loaders ¶
P = ParamSpec('P') module-attribute ¶
R = TypeVar('R') module-attribute ¶
F = TypeVar('F', bound=Callable[..., Any]) module-attribute ¶
loaders: Dict[str, DatasetLoader] = {} module-attribute ¶
DatasetLoader ¶
Bases: BaseDatasetLoader
func = func instance-attribute ¶
name = name or self.func.__name__ instance-attribute ¶
extensions = extensions instance-attribute ¶
path_arg = path_arg instance-attribute ¶
wraps = wraps instance-attribute ¶
__init__(func: Callable[..., Any], name: Optional[str] = None, extensions: Optional[list[str]] = None, path_arg: Optional[str] = None, wraps: Optional[Callable[P, R]] = None) ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
func | Callable[..., Any] | The function to decorate. | required |
name | Optional[str] | The name of the loader, by default None. | None |
extensions | Optional[list[str]] | The extensions that the loader supports, by default None. | None |
path_arg | Optional[str] | The name of the argument that is the path, by default None. | None |
wraps | Optional[Callable[..., Any]] | The function to wrap, by default None. | None |
__call__(*args: P.args, **kwargs: P.kwargs) -> R ¶
Call the loader function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
args | tuple | The arguments to pass to the function. | () |
kwargs | dict | The keyword arguments to pass to the function. | {} |
Returns:
| Type | Description |
|---|---|
R | The result of the function. |
bind(*args: P.args, **kwargs: P.kwargs) -> Callable[..., R] ¶
Bind arguments to the loader function.
Notes
This method creates a partial function with pre-filled positional and keyword arguments. Binding the arguments helps make the fingerprint of the resulting dataset unique.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
args | tuple | The arguments to pre-fill. | () |
kwargs | dict | The keyword arguments to pre-fill. | {} |
Returns:
| Type | Description |
|---|---|
Callable[..., R] | The partial function. |
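One way to see why pre-filled arguments help with fingerprinting is to hash a callable together with whatever has been bound to it. The sketch below uses `functools.partial` and `hashlib` from the standard library. `toy_fingerprint` and `load_rows` are hypothetical, and the hashing scheme is an assumption for illustration, not the library's actual fingerprint.

```python
import hashlib
from functools import partial
from typing import Any, Callable, Dict, List


def toy_fingerprint(loader: Callable[..., Any]) -> str:
    """Hash the callable's name together with any pre-filled arguments."""
    if isinstance(loader, partial):
        parts = (
            loader.func.__name__,
            loader.args,
            sorted(loader.keywords.items()),
        )
    else:
        parts = (loader.__name__, (), [])
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()


def load_rows(path: str, limit: int = 10) -> List[Dict[str, Any]]:
    """Hypothetical loader that fabricates `limit` rows for the given path."""
    return [{"path": path, "row": i} for i in range(limit)]


# Same function, different bound arguments, therefore different fingerprints.
bound_a = partial(load_rows, "a.csv", limit=2)
bound_b = partial(load_rows, "b.csv", limit=2)
fingerprint_a = toy_fingerprint(bound_a)
fingerprint_b = toy_fingerprint(bound_b)
```

Without binding, both loaders would hash to the fingerprint of the bare function, so two datasets built from different files could not be told apart by the loader alone.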
dataloader(func: Union[F, str, None] = None, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[..., Any]] = None, path_arg: Optional[str] = None) -> Union[F, Callable[[F], F]] ¶
dataloader(func: F, name: Optional[str] = None, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> F
dataloader(name: str, extensions: Optional[list[str]] = None, wraps: Optional[Callable[P, R]] = None, path_arg: Optional[str] = None) -> Callable[[F], F]
Decorator to register a function as a dataset loader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
func | Union[Callable[..., Any], str, None] | The function to decorate, by default None. | None |
name | Optional[str] | The name of the loader, by default None. | None |
extensions | Optional[list[str]] | The extensions that the loader supports, by default None. | None |
wraps | Optional[Callable[..., Any]] | The function to wrap, by default None. | None |
path_arg | Optional[str] | The name of the argument that is the path, by default None. | None |
Returns:
| Type | Description |
|---|---|
DatasetLoader | The dataset loader. |
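The two overloads correspond to using the decorator bare (`@dataloader`) or with a name (`@dataloader("name", ...)`). A minimal sketch of a registry decorator that accepts both forms is shown below. `toy_dataloader`, `toy_loaders`, and `toy_extensions` are hypothetical names. The sketch registers plain functions and does not build a `DatasetLoader`.

```python
from typing import Any, Callable, Dict, List, Optional, Union

# Hypothetical registries, keyed by loader name.
toy_loaders: Dict[str, Callable[..., Any]] = {}
toy_extensions: Dict[str, List[str]] = {}


def toy_dataloader(
    func: Union[Callable[..., Any], str, None] = None,
    name: Optional[str] = None,
    extensions: Optional[List[str]] = None,
) -> Any:
    # Named form: @toy_dataloader("tsv") passes the name as the first argument.
    if isinstance(func, str):
        name, func = func, None

    def register(f: Callable[..., Any]) -> Callable[..., Any]:
        key = name or f.__name__
        toy_loaders[key] = f
        toy_extensions[key] = list(extensions or [])
        return f

    # Bare form: @toy_dataloader receives the function directly.
    if callable(func):
        return register(func)
    return register


@toy_dataloader
def load_lines(path: str) -> List[Dict[str, str]]:
    return [{"source": path}]


@toy_dataloader("tsv", extensions=[".tsv"])
def load_tsv(path: str) -> List[Dict[str, str]]:
    return [{"source": path, "sep": "\t"}]
```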
load_json(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None] ¶
Load a dataset from a JSON file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path | (str, Path) | The path to the file. | required |
encoding | str | The encoding of the file, by default "utf-8". | 'utf-8' |
Returns:
| Type | Description |
|---|---|
dict | The loaded dataset. |
load_jsonl(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None] ¶
Load a dataset from a JSONL file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path | (str, Path) | The path to the file. | required |
encoding | str | The encoding of the file, by default "utf-8". | 'utf-8' |
Returns:
| Type | Description |
|---|---|
Generator[List[Dict], None, None] | A generator that yields the loaded dataset as lists of dictionaries. |
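A loader with this signature can be pictured as a generator that reads one JSON object per line and yields the rows in batches. The sketch below is an assumption about that general shape, not the library's code; `iter_jsonl_batches` and its `batch_size` parameter are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Generator, List, Union


def iter_jsonl_batches(path: Union[str, Path], encoding: str = "utf-8",
                       batch_size: int = 2) -> Generator[List[Dict], None, None]:
    """Yield lists of parsed rows, at most batch_size rows per list."""
    batch: List[Dict] = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly shorter batch


# Usage: write a small sample file and read it back in batches.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n', encoding="utf-8")
    batches = list(iter_jsonl_batches(sample))
```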
load_csv(path: Union[str, Path], encoding: str = 'utf-8') -> Generator[List[Dict], None, None] ¶
Load a dataset from a CSV/TSV file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path | (str, Path) | The path to the file. | required |
encoding | str | The encoding of the file, by default "utf-8". | 'utf-8' |
Returns:
| Type | Description |
|---|---|
Generator[List[Dict], None, None] | A generator that yields the loaded dataset as lists of dictionaries. |
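One way to handle both CSV and TSV with a single loader is to pick the delimiter from the file extension. The sketch below uses the standard csv module; `iter_csv_rows` is a hypothetical name, and choosing the delimiter by suffix is an assumption, not documented behavior.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Generator, List, Union


def iter_csv_rows(path: Union[str, Path],
                  encoding: str = "utf-8") -> Generator[List[Dict], None, None]:
    """Yield all rows of a CSV/TSV file as one list of dictionaries."""
    path = Path(path)
    # Assumption: a .tsv suffix means tab-separated, anything else comma-separated.
    delimiter = "\t" if path.suffix.lower() == ".tsv" else ","
    with open(path, encoding=encoding, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter=delimiter))
    yield rows


# Usage: values come back as strings because csv does no type conversion.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "people.tsv"
    sample.write_text("id\tname\n1\tAlice\n2\tBob\n", encoding="utf-8")
    rows = next(iter_csv_rows(sample))
```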
sampler ¶
schema ¶
T = TypeVar('T') module-attribute ¶
unify_schemas(this: pa.Schema, other: Optional[pa.Schema]) -> pa.Schema ¶
infer_schema(data: Dict[str, Any], metadata: Optional[Dict[str, Any]] = None) -> Self ¶
validate(schema: pa.Schema, data: dict) -> bool ¶
Validates a dictionary against a PyArrow schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
schema | Schema | The PyArrow schema to validate against. | required |
data | dict | The dictionary to validate. | required |
Raises:
| Type | Description |
|---|---|
ValidationError | If the dictionary does not match the schema. |
Examples:
>>> import pyarrow as pa
>>> schema = pa.schema([pa.field('id', pa.int64()), pa.field('name', pa.string())])
>>> valid_dict = {'id': 1, 'name': 'Alice'}
>>> validate(schema, valid_dict)
>>> invalid_dict = {'id': '1', 'name': 'Alice'}
>>> validate(schema, invalid_dict)
Traceback (most recent call last):
...
ValidationError: ...
get_schema(data: T) -> Tuple[T, pa.Schema] ¶
Extracts the schema from a PyArrow schema or a generator of PyArrow record batches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data | Any | The PyArrow schema or generator of record batches. | required |
Returns:
| Type | Description |
|---|---|
Tuple[Any, Schema] | The data and the schema. |
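Returning the data together with the schema is plausibly needed because inspecting a generator consumes its first item. A common way to handle that is to peek at the first batch and chain it back in front of the rest. The sketch below shows only that technique with plain Python values; `peek` and `make_batches` are hypothetical, and sorted dictionary keys stand in for a real schema.

```python
import itertools
from typing import Any, Iterable, Iterator, Tuple


def peek(batches: Iterable[Any]) -> Tuple[Any, Iterator[Any]]:
    """Return the first item and an iterator that still yields every item."""
    iterator = iter(batches)
    first = next(iterator)  # raises StopIteration on empty input
    return first, itertools.chain([first], iterator)


def make_batches():
    yield [{"id": 1, "name": "Alice"}]
    yield [{"id": 2, "name": "Bob"}]


first, restored = peek(make_batches())
column_names = sorted(first[0])  # stand-in for reading a real schema
all_batches = list(restored)     # nothing was lost by peeking
```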
from_dataclass(cls: T) -> pa.Schema ¶
Converts a dataclass to a PyArrow schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls | Type[T] | The dataclass to convert. | required |
Returns:
| Type | Description |
|---|---|
Schema | The PyArrow schema. |
Examples:
>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     id: int
...     name: str
>>> from_dataclass(Record)
pyarrow.Schema([...])
get_schema_from_dataclass(*args, **kwargs) -> pa.Schema ¶
Alias for from_dataclass.
Examples:
>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     id: int
...     name: str
>>> get_schema_from_dataclass(Record)
pyarrow.Schema([...])
types ¶
UNDEFINED = undefined() module-attribute ¶
MonthDayNano ¶
Undefined ¶
undefined() -> Undefined ¶
is_undefined(obj: pa.DataType) -> bool ¶
from_dataclass(cls: type) -> pa.DataType ¶
Return the PyArrow data type of a dataclass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls | type | The dataclass. | required |
Returns:
| Type | Description |
|---|---|
DataType | The PyArrow data type. |
from_typed_dict(cls: _TypedDictMeta) -> pa.DataType ¶
Return the PyArrow data type of a TypedDict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls | _TypedDictMeta | The TypedDict. | required |
Returns:
| Type | Description |
|---|---|
DataType | The PyArrow data type. |
from_union(args: tuple[type, ...]) -> pa.DataType ¶
from_dtype(dtype: Union[type, np.dtype, None]) -> pa.DataType ¶
Return the PyArrow data type of a provided native/NumPy data type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dtype | Union[type, np.dtype, None] | The native or NumPy data type. | required |
Returns:
| Type | Description |
|---|---|
DataType | The PyArrow data type. |
unify_types(left: pa.DataType, right: pa.DataType) -> pa.DataType ¶
Return the PyArrow data type that can represent both left and right.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
left | DataType | The left PyArrow data type. | required |
right | DataType | The right PyArrow data type. | required |
Returns:
| Type | Description |
|---|---|
DataType | The PyArrow data type. |
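Finding a type that can represent both inputs can be thought of as widening along a promotion order. The sketch below illustrates that idea with plain type names rather than PyArrow objects. The order in `PROMOTION_ORDER` and the function `unify_type_names` are assumptions made for illustration; the library's actual rules are not documented here and are likely richer.

```python
# Hypothetical widening order, narrowest first.
PROMOTION_ORDER = ["null", "int64", "float64", "string"]


def unify_type_names(left: str, right: str) -> str:
    """Return whichever of the two names sits later in the promotion order."""
    for name in (left, right):
        if name not in PROMOTION_ORDER:
            raise ValueError(f"unsupported type name: {name}")
    return max(left, right, key=PROMOTION_ORDER.index)


widened = unify_type_names("int64", "float64")
same = unify_type_names("string", "string")
with_null = unify_type_names("null", "int64")
```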
infer_type(obj: Any) -> pa.DataType ¶
Return the PyArrow data type of an object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
obj | Any | The object. | required |
Returns:
| Type | Description |
|---|---|
DataType | The PyArrow data type. |