Dataset

dataset

SourceType = Union[str, List[str], Union[Path, List[Path]], 'Dataset', List['Dataset']] module-attribute

Dataset

Bases: BaseDataset

cache_dir = cache_dir instance-attribute

path: Path property

The path to the dataset.

Returns:

Type Description
Path

The path to the dataset.

format: str property

The format of the dataset.

Returns:

Type Description
str

The format of the dataset.

columns: List[str] property

Get the names of the columns in the dataset.

Returns:

Type Description
List[str]

The names of the columns in the dataset.

__init__(data_or_loader: Union[List[dict], Dict[str, list], DataFrame, DatasetLoader, str] = None, format: str = DEFAULT_FORMAT, *, schema: Union[pa.Schema, BaseModel, None] = None, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)

__init__(data: Union[List[dict], Dict[str, list], DataFrame] = None, format: str = DEFAULT_FORMAT)
__init__(data: Union[List[dict], Dict[str, list], DataFrame], format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None)
__init__(loader: DatasetLoader, format: str = DEFAULT_FORMAT, *, path: Optional[Union[str, Path]] = None, cache_dir: Optional[Union[str, Path]] = None, loader_args: Optional[Tuple[Any, ...]] = None, loader_kwargs: Optional[Dict[str, Any]] = None, force: bool = False)

Parameters:

Name Type Description Default
data_or_loader list of dict, dict of list, DataFrame, DatasetLoader, str

The data to load into the dataset or the (name of) loader to use.

None
format str

The format of the dataset.

DEFAULT_FORMAT
path (str, Path, None)

Load the data to this path.

None
cache_dir (str, Path, None)

The directory to use for caching.

None
loader_args (tuple, None)

The arguments to pass to the loader function if provided as the first argument.

None
loader_kwargs (dict, None)

The keyword arguments to pass to the loader function if provided as the first argument.

None
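
The two plain-Python input shapes accepted by `data_or_loader`, a list of dicts (row-oriented) and a dict of lists (column-oriented), carry the same information. The sketch below converts between them using only the standard library. `rows_to_columns` and `columns_to_rows` are hypothetical helper names for illustration, not part of this module, and how the class itself normalises its input is not documented here.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns: Dict[str, List[Any]] = {}
    for i, row in enumerate(rows):
        for name in row:
            # A column first seen in a later row is padded with None for earlier rows.
            columns.setdefault(name, [None] * i)
        for name, values in columns.items():
            # Keys missing from this row become None.
            values.append(row.get(name))
    return columns


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a dict of equal-length column lists into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
cols = rows_to_columns(rows)
```

Either shape could then be passed as the first argument of the constructor, according to the signature above.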

count_rows() -> int

Count the number of rows in the dataset.

Returns:

Type Description
int

The number of rows in the dataset.

__len__() -> int

Get the number of rows in the dataset.

Returns:

Type Description
int

The number of rows in the dataset.

head(num_rows: int = 5, columns: Union[str, List[str], None] = None, filter: Expression = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Get the first rows of the dataset as a pandas DataFrame.

Parameters:

Name Type Description Default
num_rows int

The number of rows to get.

5
columns str, list of str, None

Names of columns to get. If None, all columns are returned.

None
filter Expression

The filter expression.

None
batch_size int

Number of rows to get at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
DataFrame

A pandas DataFrame containing the first rows of the dataset.
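
To illustrate how `num_rows`, `columns` and `batch_size` interact, here is a minimal standard-library sketch that reads rows in batches and stops as soon as enough rows have been collected. It is an assumption about the general technique, not the library's implementation: `iter_batches` and `head_rows` are hypothetical names, rows are plain dicts rather than Arrow batches, and the `filter` argument is left out.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Optional, Union

Row = Dict[str, Any]


def iter_batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive batches of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def head_rows(
    rows: Iterable[Row],
    num_rows: int = 5,
    columns: Union[str, List[str], None] = None,
    batch_size: int = 2,
) -> List[Row]:
    """Collect the first num_rows rows, optionally keeping only some columns."""
    if isinstance(columns, str):
        columns = [columns]  # a single column name is treated as a one-item list
    collected: List[Row] = []
    if num_rows <= 0:
        return collected
    for batch in iter_batches(rows, batch_size):
        for row in batch[: num_rows - len(collected)]:
            if columns is not None:
                row = {name: row[name] for name in columns}
            collected.append(row)
        if len(collected) >= num_rows:
            break  # stop before pulling another batch
    return collected


data = [{"id": i, "sq": i * i} for i in range(10)]
first = head_rows(data, num_rows=3, columns="id", batch_size=2)
```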

__getitem__(indices: Union[int, slice, List[int], ArrayLike]) -> Union[dict, pa.Table]

__getitem__(indices: int) -> Dict[str, Any]
__getitem__(indices: Union[slice, List[int], ArrayLike]) -> pa.Table

Get rows from the dataset.
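
The overloads above say that an integer index yields a single row as a dict, while a slice or a list of indices yields a table. The sketch below mimics that dispatch on a plain dict of lists. It is only an illustration: `get_rows` is a hypothetical function, and a dict of lists stands in for `pa.Table`.

```python
from typing import Any, Dict, List, Sequence, Union

Columns = Dict[str, List[Any]]


def get_rows(
    table: Columns, indices: Union[int, slice, Sequence[int]]
) -> Union[Dict[str, Any], Columns]:
    """Return one row as a dict for an int, otherwise a column-oriented subset."""
    if isinstance(indices, int):
        # Single row: one scalar per column.
        return {name: values[indices] for name, values in table.items()}
    if isinstance(indices, slice):
        # Contiguous range: slice every column the same way.
        return {name: values[indices] for name, values in table.items()}
    # Arbitrary positions: gather them in the order requested.
    return {name: [values[i] for i in indices] for name, values in table.items()}


table = {"id": [10, 20, 30], "text": ["a", "b", "c"]}
one = get_rows(table, 1)
some = get_rows(table, slice(0, 2))
picked = get_rows(table, [2, 0])
```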

take(*, indices: Union[int, slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Union[dict, pa.Table]

take(*, indices: Optional[int] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> Dict[str, Any]
take(*, indices: Union[slice, List[int], ArrayLike] = None, columns: Union[str, List[str], None] = None, batch_size: int = DEFAULT_BATCH_SIZE) -> DataFrame

Take rows, optionally restricted to a subset of columns, from the dataset.

Parameters:

Name Type Description Default
indices int, slice, list of int, array-like

Indices of rows to take.

None
columns str, list of str, None

Names of columns to take. If None, all columns are taken.

None
batch_size int

Number of rows to take at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
(dict, Table)

A single row when `indices` is an int; otherwise the taken rows.

map(func: Any, batch_size: int = DEFAULT_BATCH_SIZE, batched: bool = False, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Union[List[str], None] = None, verbose: Union[bool, int] = 1) -> Dataset

Map a function over the dataset.

Parameters:

Name Type Description Default
func Any

The function to map over the dataset.

required
batch_size int

Number of rows to map at a time.

DEFAULT_BATCH_SIZE
batched bool

Whether the function operates on a batch of rows at a time rather than on a single row.

False
verbose bool | int

Whether to show a progress bar.

1

Returns:

Type Description
Dataset

A new dataset containing the mapped rows.
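
The difference between `batched=False` and `batched=True` can be sketched with plain Python lists. This is an assumption-laden illustration, not the library's code: `map_rows` is a hypothetical function, and it assumes a batched function receives a list of row dicts and returns a list of row dicts, whereas the real method may pass a different batch type.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_rows(
    rows: List[Row],
    func: Callable[..., Any],
    batch_size: int = 2,
    batched: bool = False,
) -> List[Row]:
    """Apply func row by row, or once per batch when batched is True."""
    out: List[Row] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        if batched:
            # func sees the whole batch and returns the mapped rows.
            out.extend(func(batch))
        else:
            # func sees one row at a time.
            out.extend(func(row) for row in batch)
    return out


def double(row: Row) -> Row:
    return {"x": row["x"], "y": row["x"] * 2}


def double_batch(batch: List[Row]) -> List[Row]:
    return [double(row) for row in batch]


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
per_row = map_rows(rows, double)
per_batch = map_rows(rows, double_batch, batched=True)
```

Both calls produce the same rows; the batched form simply makes fewer function calls, which tends to matter when the function has a high per-call overhead.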

filter(expression: Expression = None) -> Dataset

Filter the dataset.

Parameters:

Name Type Description Default
expression Expression

The filter expression.

None

Returns:

Type Description
Dataset

A new dataset containing only the rows that match the filter expression.

select(columns: Union[str, List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Select columns from the dataset.

Parameters:

Name Type Description Default
columns str, list of str

Names of columns to select.

required

Returns:

Type Description
Dataset

A new dataset containing only the selected columns.

rename(columns: Dict[str, str], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Rename columns in the dataset.

Parameters:

Name Type Description Default
columns dict

Mapping of old column names to new column names.

required

Returns:

Type Description
Dataset

A new dataset with the columns renamed.
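
As a small illustration of the old-name to new-name mapping, the sketch below renames columns of a plain dict of lists while leaving unmapped columns and the column order untouched. `rename_columns` is a hypothetical function, and raising `KeyError` for unknown column names is an assumption; this page does not say how the real method treats them.

```python
from typing import Any, Dict, List

Columns = Dict[str, List[Any]]


def rename_columns(table: Columns, mapping: Dict[str, str]) -> Columns:
    """Return a new table with columns renamed according to mapping."""
    unknown = set(mapping) - set(table)
    if unknown:
        # Assumption: renaming a column that does not exist is an error.
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    # Columns without an entry in mapping keep their name.
    return {mapping.get(name, name): values for name, values in table.items()}


table = {"id": [1, 2], "txt": ["a", "b"]}
renamed = rename_columns(table, {"txt": "text"})
```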

project(columns: Union[Dict[str, Expression], Dict[str, str], List[str]], batch_size: int = DEFAULT_BATCH_SIZE) -> Dataset

Project columns in the dataset.

Parameters:

Name Type Description Default
columns dict, list of str

Mapping of column names to expressions (or to column names), or a list of column names.

required
batch_size int

Number of rows to project at a time.

DEFAULT_BATCH_SIZE

Returns:

Type Description
Dataset

A new dataset with the columns projected.

load_dataset(path: Union[Path, str], format: str = DEFAULT_FORMAT, cache_dir: Union[Path, str, None] = None) -> Dataset classmethod

Load an existing dataset.

Parameters:

Name Type Description Default
path (str, Path)

The path to the dataset.

required
format str

The format of the dataset.

DEFAULT_FORMAT
cache_dir (str, Path, None)

The directory to use for caching.

None

Returns:

Type Description
Dataset

The loaded dataset.

to_polars() -> pl.LazyFrame

Convert the dataset to a Polars LazyFrame.

Returns:

Type Description
LazyFrame

The Polars LazyFrame.

gen_unique_cached_path(*refs: Any, cache_dir: Union[str, Path, None] = None) -> Path

writable(data: Any, schema: Optional[pa.Schema] = None) -> Union[pa.RecordBatch, pa.Table, pa.RecordBatchReader]

write_dataset(path: Union[str, Path], data: Union[ds.Dataset, pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], pa.RecordBatchReader, pd.DataFrame, Mapping[str, List[Any]], Sequence[Mapping[str, Any]]], schema: pa.Schema = None, format: Optional[str] = None) -> bool

read_dataset(path: Union[str, Path], format: str) -> ds.dataset

to_batches(data: Union[pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], Iterable[pa.Table], pa.RecordBatchReader]) -> Generator[pa.RecordBatch, None, None]

create_mapped_table(data: Union[dict, list, pd.DataFrame, pa.RecordBatch, pa.Table], existing: Optional[pa.Table] = None, keep_cols: Union[bool, List[str], None] = True, exclude_cols: Optional[List[str]] = None) -> pa.Table