Dataset

`dataset` ¶

`SourceType = Union[str, List[str], Union[Path, List[Path]], 'Dataset', List['Dataset']]` `module-attribute` ¶

`Dataset` ¶

Bases: BaseDataset

`cache_dir = cache_dir` `instance-attribute` ¶

`path: Path` `property` ¶

The path to the dataset.

Returns:

Type	Description
`Path`	The path to the dataset.

`format: str` `property` ¶

The format of the dataset.

Returns:

Type	Description
`str`	The format of the dataset.

`columns: List[str]` `property` ¶

Get the names of the columns in the dataset.

Returns:

Type	Description
`List[str]`	The names of the columns in the dataset.

`__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)` ¶

`__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)`

`__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)`

`__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)`

Parameters:

Name	Type	Description	Default
`data_or_loader`	`list of dict, dict of list, DataFrame, BaseDatasetLoader, str`	The data to load into the dataset or the (name of) loader to use.	`None`
`format`	`str`	The format of the dataset.	`DEFAULT_FORMAT`
`path`	`(str, Path, None)`	The path to load the data to.	`None`
`cache_dir`	`(str, Path, None)`	The directory to use for caching.	`None`
`loader_args`	`(tuple, None)`	Positional arguments to pass to the loader, if a loader is given as the first argument.	`None`
`loader_kwargs`	`(dict, None)`	Keyword arguments to pass to the loader, if a loader is given as the first argument.	`None`
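
The constructor accepts in-memory data either as a list of dicts (one dict per row) or as a dict of lists (one list per column). The following is a minimal sketch of how the two layouts relate, using only the standard library. `to_columns` is a hypothetical helper written for illustration; it is not part of this module, and the real class may normalize its input differently.

```python
from typing import Any, Dict, List, Union

Rows = List[Dict[str, Any]]
Columns = Dict[str, List[Any]]


def to_columns(data: Union[Rows, Columns]) -> Columns:
    """Normalize row-oriented or column-oriented data to a dict of lists."""
    if isinstance(data, dict):
        # Column-oriented input: every column must have the same length.
        lengths = {len(values) for values in data.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        return {name: list(values) for name, values in data.items()}
    # Row-oriented input: collect column names in first-seen order.
    names: List[str] = []
    for row in data:
        for name in row:
            if name not in names:
                names.append(name)
    # Missing keys become None so every column has one value per row.
    return {name: [row.get(name) for row in data] for name in names}


rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
cols = {"id": [1, 2], "text": ["a", "b"]}
assert to_columns(rows) == to_columns(cols)
```

Both layouts describe the same table, which is why either can be passed as the first argument.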

`count_rows() -> int` ¶

Count the number of rows in the dataset.

Returns:

Type	Description
`int`	The number of rows in the dataset.

`__len__() -> int` ¶

Get the number of rows in the dataset.

Returns:

Type	Description
`int`	The number of rows in the dataset.

`head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame` ¶

Get the first rows of the dataset as a pandas DataFrame.

Parameters:

Name	Type	Description	Default
`num_rows`	`int`	The number of rows to get.	`5`
`columns`	`str, list of str, None`	Names of columns to get. If None, all columns are returned.	`None`
`filter`	`Expression`	The filter expression.	`None`
`batch_size`	`int`	Number of rows to get at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`DataFrame`	A pandas DataFrame containing the first rows of the dataset.
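
The interaction between `num_rows` and `batch_size` can be sketched as follows: rows are read one batch at a time, and reading stops as soon as enough rows have been collected. This is an illustration under assumptions, not the library's implementation. `iter_batches` and `head_rows` are hypothetical names, plain dicts stand in for the dataset, and the default of 2 stands in for `DEFAULT_BATCH_SIZE`.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def iter_batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive chunks of at most `batch_size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def head_rows(rows: Iterable[Row], num_rows: int = 5, batch_size: int = 2) -> List[Row]:
    """Collect the first `num_rows` rows, reading no more batches than needed."""
    out: List[Row] = []
    for batch in iter_batches(rows, batch_size):
        # Only keep as many rows from this batch as are still missing.
        out.extend(batch[: num_rows - len(out)])
        if len(out) >= num_rows:
            break
    return out


source = ({"i": i} for i in range(100))
first = head_rows(source, num_rows=3, batch_size=2)
```

Because the source is consumed lazily, only the first two batches are read to obtain three rows.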

`__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]` ¶

`__getitem__(indices: int) -> Dict[str, Any]`

`__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table`

Get rows from the dataset.
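
The overloads above distinguish a single integer index (one row, returned as a dict) from a slice or sequence of integers (several rows). The sketch below mirrors that dispatch on a plain list of dicts. `RowStore` is a hypothetical stand-in written for illustration, and it returns a list of dicts where the documented API returns a `pa.Table`.

```python
from typing import Any, Dict, List, Sequence, Union

Row = Dict[str, Any]
Index = Union[int, slice, Sequence[int]]


class RowStore:
    """Hypothetical stand-in that mirrors the documented index dispatch."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, indices: Index) -> Union[Row, List[Row]]:
        if isinstance(indices, int):
            # A single integer yields one row as a dict.
            return self._rows[indices]
        if isinstance(indices, slice):
            # A slice yields several rows.
            return self._rows[indices]
        # Any other sequence of integers also yields several rows.
        return [self._rows[i] for i in indices]


store = RowStore([{"id": i} for i in range(5)])
one = store[0]          # a single row
some = store[1:3]       # rows 1 and 2
picked = store[[0, 4]]  # rows 0 and 4
```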

`take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]` ¶

`take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]`

`take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame`

Take rows, optionally restricted to a subset of columns, from the dataset.

Parameters:

Name	Type	Description	Default
`indices`	`int, slice, list of int, array-like`	Indices of rows to take.	`None`
`columns`	`str, list of str, None`	Names of columns to take. If None, all columns are taken.	`None`
`batch_size`	`int`	Number of rows to take at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`(Document, Table)`	A single row when `indices` is an integer, otherwise the taken rows.
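
How `indices`, `columns`, and `batch_size` combine can be sketched on a dict of lists. This is an illustration only: `take_rows` is a hypothetical function, the default of 2 stands in for `DEFAULT_BATCH_SIZE`, and treating a single column name as a one-item list is an assumption about how a `str` value for `columns` is handled.

```python
from typing import Any, Dict, Iterator, List, Sequence, Union


def take_rows(
    table: Dict[str, List[Any]],
    *,
    indices: Sequence[int],
    columns: Union[str, List[str], None] = None,
    batch_size: int = 2,
) -> Iterator[Dict[str, List[Any]]]:
    """Yield the requested rows in chunks of at most `batch_size` rows."""
    if columns is None:
        names = list(table)   # None means every column is taken
    elif isinstance(columns, str):
        names = [columns]     # assumption: one name behaves like a one-item list
    else:
        names = list(columns)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield {name: [table[name][i] for i in chunk] for name in names}


table = {"id": [10, 11, 12, 13], "text": ["a", "b", "c", "d"]}
batches = list(take_rows(table, indices=[0, 2, 3], columns="id", batch_size=2))
```

Three indices with a batch size of two produce two chunks, the second holding the single remaining row.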

`map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset` ¶

Map a function over the dataset.

Parameters:

Name	Type	Description	Default
`func`	`Any`	The function to map over the dataset.	required
`batch_size`	`int`	Number of rows to map at a time.	`DEFAULT_BATCH_SIZE`
`batched`	`bool`	Whether the function is batched.	`False`
`verbose`	`bool | int`	Whether to show a progress bar.	`1`

Returns:

Type	Description
`Dataset`	A new dataset containing the mapped rows.
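
The difference between `batched=False` and `batched=True` can be sketched as below. What `func` receives in each mode is an assumption made for illustration (one row dict per call versus a list of row dicts per call); the library's actual calling convention is not stated here. `map_rows` is a hypothetical function, and the default of 2 stands in for `DEFAULT_BATCH_SIZE`.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_rows(
    rows: List[Row],
    func: Callable[..., Any],
    batch_size: int = 2,
    batched: bool = False,
) -> List[Row]:
    """Apply `func` row by row, or once per chunk when `batched` is True."""
    out: List[Row] = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if batched:
            # Assumed contract: func maps a list of rows to a list of rows.
            out.extend(func(chunk))
        else:
            # Assumed contract: func maps one row to one row.
            out.extend(func(row) for row in chunk)
    return out


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
per_row = map_rows(rows, lambda r: {"y": r["x"] * 2})
per_batch = map_rows(rows, lambda rs: [{"y": r["x"] * 2} for r in rs], batched=True)
```

Both calls build a new list and leave `rows` untouched, in line with `map` returning a new dataset.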

`filter(expression: Expression = None) -> Dataset` ¶

Filter the dataset.

Parameters:

Name	Type	Description	Default
`expression`	`Expression`	The filter expression.	`None`

Returns:

Type	Description
`Dataset`	A new dataset containing only the rows that match the filter expression.

`select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Select columns from the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`str, list of str`	Names of columns to select.	required

Returns:

Type	Description
`Dataset`	A new dataset containing only the selected columns.

`rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Rename columns in the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`dict`	Mapping of old column names to new column names.	required

Returns:

Type	Description
`Dataset`	A new dataset with the columns renamed.
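
The mapping passed to `rename` goes from old names to new names, and the result is a new dataset rather than an in-place change. A minimal sketch of that behavior on a dict of lists follows. `rename_columns` is a hypothetical helper, and raising `KeyError` for unknown names is an assumption, not documented behavior.

```python
from typing import Any, Dict, List


def rename_columns(
    table: Dict[str, List[Any]], columns: Dict[str, str]
) -> Dict[str, List[Any]]:
    """Return a new table with columns renamed according to `columns`."""
    unknown = set(columns) - set(table)
    if unknown:
        # Assumption: renaming a column that does not exist is an error.
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    # Columns not mentioned in the mapping keep their names and order.
    return {columns.get(name, name): values for name, values in table.items()}


table = {"id": [1, 2], "txt": ["a", "b"]}
renamed = rename_columns(table, {"txt": "text"})
```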

`project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset` ¶

Project columns in the dataset.

Parameters:

Name	Type	Description	Default
`columns`	`dict`	Mapping of column names to expressions.	required
`batch_size`	`int`	Number of rows to project at a time.	`DEFAULT_BATCH_SIZE`

Returns:

Type	Description
`Dataset`	A new dataset with the columns projected.

`load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset` `classmethod` ¶

Load an existing dataset.

Parameters:

Name	Type	Description	Default
`path`	`(str, Path)`	The path to the dataset.	required
`format`	`str`	The format of the dataset.	`DEFAULT_FORMAT`

Returns:

Type	Description
`Dataset`	The loaded dataset.

`to_polars() -> pl.LazyFrame` ¶

Convert the dataset to a Polars LazyFrame.

Returns:

Type	Description
`LazyFrame`	The Polars LazyFrame.

`gen_unique_cached_path(*refs: Any, cache_dir: Union[str, Path, None] = None) -> Path` ¶

`writable(data: Any, schema: Optional[pa.Schema] = None) -> Union[pa.RecordBatch, pa.Table, pa.RecordBatchReader]` ¶

`write_dataset(path: Union[str, Path], data: Union[ds.Dataset, pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], pa.RecordBatchReader, pd.DataFrame, Mapping[str, List[Any]], Sequence[Mapping[str, Any]]], schema: pa.Schema = None, format: Optional[str] = None) -> bool` ¶

`read_dataset(path: Union[str, Path], format: str) -> ds.dataset` ¶

`to_batches(data: Union[pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], Iterable[pa.Table], pa.RecordBatchReader]) -> Generator[pa.RecordBatch, None, None]` ¶

`create_mapped_table(data: Union[dict, list, pd.DataFrame, pa.RecordBatch, pa.Table], existing: Optional[pa.Table] = None, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Optional[List[str]] = None) -> pa.Table` ¶
