FileDataSource

File-based datasource for CSV, JSON, and JSONL files. Extends MemoryDataSource.

Configure loading behavior with FileSourceConfig. See the DataSource Guide for usage examples.

Bases: MemoryDataSource

File-based datasource with MemoryDataSource API and advanced loading strategies.

Extends MemoryDataSource to load data from CSV, JSON, and JSONL files, with multiple loading strategies to support large files. Provides a transformation pipeline for data preprocessing.

Selects a loading strategy automatically based on file size and configuration. Supports background loading with progress callbacks for UI integration.

Parameters:

Name	Type	Description	Default
`filepath`	`str \| Path`	Path to data file	required
`config`	`Optional[FileSourceConfig]`	FileSourceConfig object (uses defaults if None)	`None`
`page_size`	`int`	Records per page for pagination	`10`

Attributes:

Name	Type	Description
`filepath`	`Path`	Path object for the data file
`config`	`FileSourceConfig`	Active configuration
`is_loaded`	`bool`	Whether the file has been loaded

Loading Strategies

Eager: Load entire file into memory
- Fastest access after initial load
- High memory usage
- Best for: Small files (< 100k rows)

Lazy: Load data on-demand per page
- Slowest access (re-parses on each request)
- Minimal memory usage
- Best for: Very large files (> 1M rows)

Chunked: Load file in batches
- Balanced performance and memory
- Shows progress during load
- Best for: Medium files (100k-500k rows)

Hybrid: Index in memory, lazy-load records
- Fast filtering/sorting
- Moderate memory usage
- Best for: Large files (500k-1M rows)

Auto: Automatically select based on file size
- < 100k rows: Eager
- 100k-500k rows: Chunked
- > 500k rows: Hybrid
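
The Auto thresholds listed above can be read as a simple decision function. The following is a minimal sketch, not the library's implementation: the `Strategy` enum and `choose_strategy` are hypothetical names, and treating exactly 500k rows as Chunked is an assumption about how the boundary is handled.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical names mirroring the strategies described above."""

    EAGER = "eager"
    CHUNKED = "chunked"
    HYBRID = "hybrid"


def choose_strategy(row_count: int) -> Strategy:
    """Pick a strategy from a row count using the documented Auto thresholds."""
    if row_count < 100_000:
        return Strategy.EAGER
    if row_count <= 500_000:  # assumption: the upper bound is inclusive
        return Strategy.CHUNKED
    return Strategy.HYBRID
```

Note that, as documented, Auto never selects Lazy; that strategy has to be requested through the configuration.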

Example

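ds = FileDataSource("data.csv")  # uses the default FileSourceConfig
ds.load()                        # load with the configured strategy
ds.set_filter("age > 25")        # MemoryDataSource method, available after load
page = ds.get_page(0)            # fetch a page of filtered records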

Note

The file is re-parsed on reload().
Threaded loading uses daemon threads, which are cleaned up automatically at interpreter exit.
All MemoryDataSource methods are available after load.
The Lazy strategy re-parses the file on filter/sort changes.
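
To make the cost of the Lazy strategy concrete, here is a minimal sketch of on-demand page loading for a CSV file. It is an illustration of the general technique, not FileDataSource's code; `read_page_lazily` is a hypothetical helper, and zero-based page numbering is an assumption.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import Dict, List


def read_page_lazily(path: Path, page: int, page_size: int = 10) -> List[Dict[str, str]]:
    """Re-parse the CSV from the top and return only the requested page."""
    start = page * page_size
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # islice still parses the skipped rows but never stores them,
        # so memory stays flat while later pages take longer to reach.
        return list(islice(reader, start, start + page_size))
```

Only one page is ever held in memory, but every call reads the file again from the first row, which is why access is slow compared with the other strategies.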

__init__

__init__(
    filepath: str | Path,
    config: Optional[FileSourceConfig] = None,
    page_size: int = 10,
)

Configure a file-backed datasource and detect file format.

Parameters:

Name	Type	Description	Default
`filepath`	`str \| Path`	Location of the data file to be read.	required
`config`	`Optional[FileSourceConfig]`	Optional overrides for parsing, transforms, and threading.	`None`
`page_size`	`int`	Number of records returned per page after loading.	`10`

Raises:

Type	Description
`FileNotFoundError`	If the supplied file path cannot be found.
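
The constructor's two documented behaviors, format detection and the FileNotFoundError, could look roughly like the sketch below. Everything here is an assumption for illustration: `detect_format`, the extension table, and the ValueError for unknown extensions are hypothetical, and the real detection rules may differ.

```python
from pathlib import Path
from typing import Union

# Hypothetical extension-to-format table covering the three documented formats.
_FORMATS = {".csv": "csv", ".json": "json", ".jsonl": "jsonl"}


def detect_format(filepath: Union[str, Path]) -> str:
    """Validate that the file exists, then infer its format from the suffix."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"Data file not found: {path}")
    try:
        return _FORMATS[path.suffix.lower()]
    except KeyError:
        # Assumption: unsupported extensions are rejected rather than guessed.
        raise ValueError(f"Unsupported file extension: {path.suffix!r}") from None
```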

load

load(on_complete: Optional[Callable] = None) -> None

Load file with configured strategy.

For threaded loading, returns immediately and calls on_complete when done. For synchronous loading, blocks until complete.

Parameters:

Name	Type	Description	Default
`on_complete`	`Optional[Callable]`	Optional callback invoked when loading finishes (overrides the callback set in config)	`None`
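
The difference between threaded and synchronous loading can be sketched with a small stand-in class. `BackgroundLoader` is hypothetical and is not FileDataSource; it only shows the pattern of starting a daemon thread, returning immediately, and calling `on_complete` when the work is done. That the callback takes no arguments and runs on the worker thread are both assumptions.

```python
import threading
from typing import Callable, List, Optional


class BackgroundLoader:
    """Hypothetical stand-in that mimics the load/on_complete pattern."""

    def __init__(self, work: Callable[[], List], threaded: bool = True) -> None:
        self._work = work
        self._threaded = threaded
        self._thread: Optional[threading.Thread] = None
        self.records: List = []

    def load(self, on_complete: Optional[Callable[[], None]] = None) -> None:
        def run() -> None:
            self.records = self._work()
            if on_complete is not None:
                on_complete()  # assumption: called on the thread that did the work

        if self._threaded:
            # A daemon thread does not keep the interpreter alive at exit.
            self._thread = threading.Thread(target=run, daemon=True)
            self._thread.start()  # load() returns immediately
        else:
            run()  # load() blocks until the work is finished


# Usage: signal completion through an Event instead of touching UI state directly.
done = threading.Event()
loader = BackgroundLoader(lambda: [{"id": 1}, {"id": 2}])
loader.load(on_complete=done.set)
done.wait(timeout=5)
```

If the callback does run on a worker thread, a UI toolkit would typically need it to hand results back to the main thread rather than update widgets itself.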

reload

reload() -> None

Reload from file, clearing current data.

is_loading

is_loading() -> bool

Check if file is currently loading.

get_load_progress

get_load_progress() -> tuple[int, int]

Get loading progress as (current, total) rows.
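
A `(current, total)` tuple is convenient for driving a progress indicator. The helper below is a hedged sketch of one way to render it; `format_progress` is a hypothetical name, and the handling of a zero total (for example, before the row count is known) is an assumption about what the method may return.

```python
from typing import Tuple


def format_progress(progress: Tuple[int, int]) -> str:
    """Render a (current, total) row count as a short status string."""
    current, total = progress
    if total <= 0:
        # Assumption: a non-positive total means the size is not known yet.
        return "loading..."
    percent = min(100.0, 100.0 * current / total)
    return f"{current}/{total} rows ({percent:.0f}%)"
```

With a datasource instance `ds`, a UI timer could call `format_progress(ds.get_load_progress())` for as long as `ds.is_loading()` returns True.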

wait_for_load

wait_for_load(timeout: Optional[float] = None) -> bool

Wait for loading to complete (if threaded).

Parameters:

Name	Type	Description	Default
`timeout`	`Optional[float]`	Max seconds to wait (None = wait forever)	`None`

Returns:

Type	Description
`bool`	True if load completed, False if timed out
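
The documented return value (True on completion, False on timeout) matches what a bounded join on a worker thread provides. The sketch below shows that pattern with the standard library only; `wait_for` is a hypothetical helper, and whether FileDataSource is built this way is not stated in this reference.

```python
import threading
from typing import Optional


def wait_for(thread: threading.Thread, timeout: Optional[float] = None) -> bool:
    """Return True if the thread finished, False if the timeout expired first."""
    thread.join(timeout)  # join(None) waits without a time limit
    return not thread.is_alive()


# Usage: a worker that blocks until released, so the first wait times out.
gate = threading.Event()
worker = threading.Thread(target=gate.wait, daemon=True)
worker.start()
timed_out_result = wait_for(worker, timeout=0.05)  # worker is still blocked
gate.set()
finished_result = wait_for(worker)  # worker can now exit
```

A caller would typically check the boolean and either proceed to query the data or keep showing a loading state.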