FileDataSource

File-based datasource for CSV, JSON, and JSONL files. Extends MemoryDataSource.

Configure loading behavior with FileSourceConfig. See the DataSource Guide for usage examples.

Bases: MemoryDataSource

File-based datasource with MemoryDataSource API and advanced loading strategies.

Extends MemoryDataSource to load data from files (CSV, JSON, JSONL), with support for large files via multiple loading strategies. Provides an extensive transformation pipeline for data preprocessing.

The datasource automatically selects an optimal loading strategy based on file size and configuration, and supports background loading with progress callbacks for UI integration.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `filepath` | `str \| Path` | Path to data file | *required* |
| `config` | `Optional[FileSourceConfig]` | `FileSourceConfig` object (uses defaults if `None`) | `None` |
| `page_size` | `int` | Records per page for pagination | `10` |

Attributes:

| Name | Description |
|------|-------------|
| `filepath` | `Path` object for the data file |
| `config` | Active configuration |
| `is_loaded` | Whether the file has been loaded |

Loading Strategies

- **Eager**: Load the entire file into memory. Fastest access after the initial load; high memory usage. Best for small files (< 100k rows).
- **Lazy**: Load data on demand per page. Slowest access (re-parses on each request); minimal memory usage. Best for very large files (> 1M rows).
- **Chunked**: Load the file in batches. Balanced performance and memory; shows progress during load. Best for medium files (100k-500k rows).
- **Hybrid**: Keep an index in memory and lazy-load records. Fast filtering/sorting; moderate memory usage. Best for large files (500k-1M rows).
- **Auto**: Select automatically based on file size: < 100k rows uses Eager, 100k-500k uses Chunked, > 500k uses Hybrid.
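The Auto thresholds above can be sketched as a simple row-count dispatch (a minimal illustration of the cutoffs in the list above; the function name is hypothetical, not part of the API):

```python
def pick_strategy(row_count: int) -> str:
    """Mirror the Auto strategy's row-count thresholds."""
    if row_count < 100_000:
        return "eager"    # small files: load everything up front
    if row_count <= 500_000:
        return "chunked"  # medium files: batch loading with progress
    return "hybrid"       # large files: in-memory index, lazy records
```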

Example

```python
ds = FileDataSource("data.csv")
ds.load()
ds.set_filter("age > 25")
page = ds.get_page(0)
```
Note
  • File is re-parsed on reload()
  • Threading uses daemon threads (auto-cleanup)
  • All MemoryDataSource methods available after load
  • Lazy strategy re-parses file on filter/sort changes

__init__

```python
__init__(
    filepath: str | Path,
    config: Optional[FileSourceConfig] = None,
    page_size: int = 10,
)
```

Configure a file-backed datasource and detect file format.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `filepath` | `str \| Path` | Location of the data file to be read. | *required* |
| `config` | `Optional[FileSourceConfig]` | Optional overrides for parsing, transforms, and threading. | `None` |
| `page_size` | `int` | Number of records returned per page after loading. | `10` |

Raises:

| Type | Description |
|------|-------------|
| `FileNotFoundError` | If the supplied file path cannot be found. |

load

```python
load(on_complete: Optional[Callable] = None) -> None
```

Load file with configured strategy.

For threaded loading, returns immediately and calls on_complete when done. For synchronous loading, blocks until complete.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `on_complete` | `Optional[Callable]` | Optional callback invoked when loading finishes (overrides config). | `None` |
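The threaded path described above (return immediately, fire a callback on completion, daemon thread for auto-cleanup) follows a standard pattern. A stdlib-only sketch with hypothetical names, not this class's actual implementation:

```python
import threading
from typing import Callable, Optional

class BackgroundLoader:
    """Minimal threaded-load skeleton: start() returns immediately,
    on_complete fires once the work is done."""

    def __init__(self) -> None:
        self._done = threading.Event()
        self.data: list = []

    def start(self, on_complete: Optional[Callable[[], None]] = None) -> None:
        def work() -> None:
            self.data = list(range(100))  # stand-in for file parsing
            if on_complete:
                on_complete()
            self._done.set()
        # daemon thread: won't block interpreter exit (auto-cleanup)
        threading.Thread(target=work, daemon=True).start()

    def wait_for_load(self, timeout: Optional[float] = None) -> bool:
        """True if the load finished, False if the wait timed out."""
        return self._done.wait(timeout)
```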

reload

```python
reload() -> None
```

Reload from file, clearing current data.

is_loading

```python
is_loading() -> bool
```

Check if file is currently loading.

get_load_progress

```python
get_load_progress() -> tuple[int, int]
```

Get loading progress as (current, total) rows.
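A `(current, total)` tuple like the one returned here is convenient for driving a progress bar. A small hypothetical helper (not part of the API) that guards against the total being unknown early in the load:

```python
def format_progress(current: int, total: int) -> str:
    """Render a (current, total) row count as a percentage string."""
    if total <= 0:
        return "0%"  # avoid division by zero before the total is known
    return f"{current * 100 // total}%"
```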

wait_for_load

```python
wait_for_load(timeout: Optional[float] = None) -> bool
```

Wait for loading to complete (if threaded).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `timeout` | `Optional[float]` | Maximum seconds to wait (`None` = wait forever). | `None` |

Returns:

| Type | Description |
|------|-------------|
| `bool` | `True` if the load completed, `False` if it timed out. |
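These timeout semantics (`None` waits indefinitely, a boolean reports success vs. timeout) match the standard library's `threading.Event.wait`, which a method like this is typically built on. A quick stdlib demonstration of that primitive, independent of this class:

```python
import threading

event = threading.Event()
# Not yet set: waiting with a short timeout returns False (timed out).
timed_out_result = event.wait(timeout=0.01)

event.set()
# Already set: wait returns True immediately, even with a timeout.
completed_result = event.wait(timeout=0.01)
```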