Dataset
dataset
SourceType = Union[str, List[str], Union[Path, List[Path]], 'Dataset', List['Dataset']] module-attribute
Dataset
Bases: BaseDataset
cache_dir = cache_dir instance-attribute
path: Path property
The path to the dataset.
Returns:
| Type | Description |
|---|---|
| Path | The path to the dataset. |
format: str property
The format of the dataset.
Returns:
| Type | Description |
|---|---|
| str | The format of the dataset. |
columns: List[str] property
Get the names of the columns in the dataset.
Returns:
| Type | Description |
|---|---|
| List[str] | The names of the columns in the dataset. |
__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)
__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)
__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)
__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_or_loader | list of dict, dict of list, DataFrame, BaseDatasetLoader, str | The data to load into the dataset, or the loader (or the name of the loader) to use. | None |
| format | str | The format of the dataset. | DEFAULT_FORMAT |
| path | (str, Path, None) | The path to load the data into. | None |
| cache_dir | (str, Path, None) | The directory to use for caching. | None |
| loader_args | (tuple, None) | Positional arguments to pass to the loader, if a loader is given as the first argument. | None |
| loader_kwargs | (dict, None) | Keyword arguments to pass to the loader, if a loader is given as the first argument. | None |
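The first argument accepts either row-oriented data (a list of dicts) or column-oriented data (a dict of lists). The following is a minimal plain-Python sketch of how those two layouts describe the same table. It does not use this library, and `rows_to_columns` and `columns_to_rows` are hypothetical helper names introduced only for illustration; the sketch assumes every row has the same keys.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Columns = Dict[str, List[Any]]


def rows_to_columns(rows: List[Row]) -> Columns:
    """Convert a list of row dicts into a dict of column lists (hypothetical helper)."""
    if not rows:
        return {}
    names = list(rows[0])  # assumes all rows share the keys of the first row
    return {name: [row[name] for row in rows] for name in names}


def columns_to_rows(columns: Columns) -> List[Row]:
    """Convert a dict of equally long column lists into a list of row dicts."""
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


# The same two-row table in both layouts.
rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
columns = {"id": [1, 2], "text": ["a", "b"]}
```

Either layout carries the same information, so which one to pass is mostly a matter of how the data is already held in memory.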
count_rows() -> int
Count the number of rows in the dataset.
Returns:
| Type | Description |
|---|---|
| int | The number of rows in the dataset. |
__len__() -> int
Get the number of rows in the dataset.
Returns:
| Type | Description |
|---|---|
| int | The number of rows in the dataset. |
head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame
Get the first rows of the dataset as a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_rows | int | The number of rows to get. | 5 |
| columns | str, list of str, None | Names of columns to get. If None, all columns are returned. | None |
| filter | Expression | The filter expression. | None |
| batch_size | int | Number of rows to get at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
| DataFrame | A pandas DataFrame containing the first rows of the dataset. |
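The interplay of `num_rows`, `columns`, `filter`, and `batch_size` can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `head_sketch` and `iter_batches` are hypothetical names, a Python callable stands in for the `Expression` filter, and the result is a list of dicts rather than a pandas DataFrame.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Union

Row = Dict[str, Any]


def iter_batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive batches of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def head_sketch(
    rows: Iterable[Row],
    num_rows: int = 5,
    columns: Union[str, List[str], None] = None,
    predicate: Optional[Callable[[Row], bool]] = None,  # stand-in for an Expression
    batch_size: int = 2,
) -> List[Row]:
    """Collect the first num_rows matching rows, reading one batch at a time."""
    if num_rows <= 0:
        return []
    if isinstance(columns, str):
        columns = [columns]
    result: List[Row] = []
    for batch in iter_batches(rows, batch_size):
        for row in batch:
            if predicate is not None and not predicate(row):
                continue
            if columns is not None:
                row = {name: row[name] for name in columns}
            result.append(row)
            if len(result) >= num_rows:
                return result  # stop early; later batches are never read
    return result


data = [{"id": i, "score": i * 10} for i in range(10)]
first = head_sketch(data, num_rows=3, columns="id", predicate=lambda r: r["score"] >= 40)
# first == [{"id": 4}, {"id": 5}, {"id": 6}]
```

Reading in batches and stopping as soon as enough rows are collected is what keeps a call like this cheap on a large dataset in the sketch; whether the library applies the same strategy is an assumption.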
__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]
__getitem__(indices: int) -> Dict[str, Any]
__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table
Get rows from the dataset. An int index returns a single row as a dict; a slice, list of ints, or array-like returns a pa.Table.
take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]
take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]
take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame
Take rows from the dataset, optionally restricted to a subset of columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| indices | int, slice, list of int, array-like | Indices of rows to take. | None |
| columns | str, list of str, None | Names of columns to take. If None, all columns are taken. | None |
| batch_size | int | Number of rows to take at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
| dict, Table | The taken row (for an int index) or rows. |
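The overloads above distinguish a single int index (one row as a dict) from a slice or a list of indices (several rows). A minimal plain-Python sketch of that dispatch follows. `take_sketch` is a hypothetical name, and a list of dicts stands in for the table type the library returns.

```python
from typing import Any, Dict, List, Sequence, Union

Row = Dict[str, Any]


def take_sketch(
    rows: Sequence[Row],
    indices: Union[int, slice, List[int]],
    columns: Union[str, List[str], None] = None,
) -> Union[Row, List[Row]]:
    """Return one row for an int index, or a list of rows otherwise."""
    if isinstance(columns, str):
        columns = [columns]

    def pick(row: Row) -> Row:
        # Copy the row, keeping only the requested columns if any were given.
        if columns is None:
            return dict(row)
        return {name: row[name] for name in columns}

    if isinstance(indices, int):
        return pick(rows[indices])
    if isinstance(indices, slice):
        return [pick(row) for row in rows[indices]]
    return [pick(rows[i]) for i in indices]


people = [{"id": 0, "name": "a"}, {"id": 1, "name": "b"}, {"id": 2, "name": "c"}]
single = take_sketch(people, 1)                        # one row, as a dict
window = take_sketch(people, slice(0, 2), columns="name")
picked = take_sketch(people, [2, 0])                   # order follows the indices
```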
map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset
Map a function over the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| func | Any | The function to map over the dataset. | required |
| batch_size | int | Number of rows to map at a time. | DEFAULT_BATCH_SIZE |
| batched | bool | Whether the function is batched. | False |
| verbose | bool, int | Whether to show a progress bar. | 1 |
Returns:
| Type | Description |
|---|---|
| Dataset | A new dataset containing the mapped rows. |
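The difference between a row-wise and a batched function can be sketched in plain Python. This is only an illustration: `map_sketch` is a hypothetical name, and it assumes a batched function receives a list of row dicts and returns a list of row dicts, which may differ from the batch type the library actually passes.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_sketch(
    rows: List[Row],
    func: Callable[..., Any],
    batch_size: int = 2,
    batched: bool = False,
) -> List[Row]:
    """Apply func per row, or once per batch when batched is True."""
    mapped: List[Row] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if batched:
            mapped.extend(func(batch))  # func sees the whole batch at once
        else:
            mapped.extend(func(row) for row in batch)  # func sees one row at a time
    return mapped


def add_length(row: Row) -> Row:
    return {**row, "length": len(row["text"])}


def add_length_batched(batch: List[Row]) -> List[Row]:
    return [add_length(row) for row in batch]


texts = [{"text": "a"}, {"text": "bb"}, {"text": "ccc"}]
row_wise = map_sketch(texts, add_length)
batch_wise = map_sketch(texts, add_length_batched, batched=True)
```

Both calls produce the same rows here; a batched function mainly pays off when the per-call overhead is high, for example when each call runs a model over many inputs at once.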
filter(expression: Expression = None) -> Dataset
Filter the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expression | Expression | The filter expression. | None |
Returns:
| Type | Description |
|---|---|
| Dataset | A new dataset containing only the rows that match the filter expression. |
select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset
Select columns from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | str, list of str | Names of columns to select. | required |
| batch_size | int | Number of rows to select at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
| Dataset | A new dataset containing only the selected columns. |
rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset
Rename columns in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | dict | Mapping of old column names to new column names. | required |
| batch_size | int | Number of rows to rename at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
| Dataset | A new dataset with the columns renamed. |
project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset
Project columns in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| columns | dict, list of str | Mapping of column names to expressions, or a list of column names. | required |
| batch_size | int | Number of rows to project at a time. | DEFAULT_BATCH_SIZE |
Returns:
| Type | Description |
|---|---|
| Dataset | A new dataset with the columns projected. |
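`select`, `rename`, and `project` differ in what they do to the column set. The plain-Python sketch below contrasts the three on a column-oriented dict. It does not use this library: the `*_sketch` names are hypothetical, and a Python callable over a row stands in for an `Expression`.

```python
from typing import Any, Callable, Dict, List, Union

Columns = Dict[str, List[Any]]
Row = Dict[str, Any]


def select_sketch(table: Columns, columns: Union[str, List[str]]) -> Columns:
    """Keep only the named columns."""
    if isinstance(columns, str):
        columns = [columns]
    return {name: table[name] for name in columns}


def rename_sketch(table: Columns, columns: Dict[str, str]) -> Columns:
    """Rename columns using an old-name to new-name mapping; others are kept as-is."""
    return {columns.get(name, name): values for name, values in table.items()}


def project_sketch(table: Columns, columns: Dict[str, Callable[[Row], Any]]) -> Columns:
    """Build new columns by evaluating one callable per output column on every row."""
    rows = [dict(zip(table, values)) for values in zip(*table.values())]
    return {name: [expr(row) for row in rows] for name, expr in columns.items()}


orders = {"price": [2.0, 4.0], "qty": [3, 1]}
only_price = select_sketch(orders, "price")
renamed = rename_sketch(orders, {"qty": "quantity"})
projected = project_sketch(
    orders,
    {"total": lambda r: r["price"] * r["qty"], "qty": lambda r: r["qty"]},
)
```

In this sketch each function returns a new dict and leaves `orders` untouched, mirroring the documented behavior that each method returns a new dataset.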
load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset classmethod
Load an existing dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | (str, Path) | The path to the dataset. | required |
| format | str | The format of the dataset. | DEFAULT_FORMAT |
| cache_dir | (str, Path, None) | The directory to use for caching. | None |
Returns:
| Type | Description |
|---|---|
| Dataset | The loaded dataset. |
to_polars() -> pl.LazyFrame
Convert the dataset to a Polars LazyFrame.
Returns:
| Type | Description |
|---|---|
| LazyFrame | The Polars LazyFrame. |