FileDataSource
File-based datasource for CSV, JSON, and JSONL files. Extends MemoryDataSource.
Configure loading behavior with FileSourceConfig. See the DataSource Guide for usage examples.
Bases: MemoryDataSource
File-based datasource with MemoryDataSource API and advanced loading strategies.
Extends MemoryDataSource to load data from CSV, JSON, and JSONL files, with several loading strategies to support large files. Provides a transformation pipeline for preprocessing data.
The datasource selects a loading strategy automatically based on file size and configuration. It also supports background loading with progress callbacks for UI integration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str \| Path | Path to the data file. | required |
| config | Optional[FileSourceConfig] | FileSourceConfig object (uses defaults if None). | None |
| page_size | int | Records per page for pagination. | 10 |
Attributes:
| Name | Type | Description |
|---|---|---|
| filepath | | Path object for the data file. |
| config | | Active configuration. |
| is_loaded | | Whether the file has been loaded. |
Loading Strategies
Eager: Load entire file into memory
- Fastest access after initial load
- High memory usage
- Best for: Small files (< 100k rows)

Lazy: Load data on-demand per page
- Slowest access (re-parses on each request)
- Minimal memory usage
- Best for: Very large files (> 1M rows)

Chunked: Load file in batches
- Balanced performance and memory
- Shows progress during load
- Best for: Medium files (100k-500k rows)

Hybrid: Index in memory, lazy-load records
- Fast filtering/sorting
- Moderate memory usage
- Best for: Large files (500k-1M rows)

Auto: Automatically select based on file size
- < 100k rows: Eager
- 100k-500k rows: Chunked
- > 500k rows: Hybrid
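
The Auto thresholds listed above can be sketched as a small decision function. This is only an illustration of the documented row-count boundaries: `select_strategy` is a hypothetical helper, not part of the FileDataSource API, and the way the library actually estimates row counts is not described here.

```python
def select_strategy(row_count: int) -> str:
    """Map a row count to a strategy name using the documented Auto thresholds.

    Hypothetical helper for illustration; not part of FileDataSource.
    """
    if row_count < 100_000:
        return "eager"      # small files: load everything into memory
    if row_count <= 500_000:
        return "chunked"    # medium files: load in batches
    # Lazy does not appear in the Auto thresholds, so it is never returned here.
    return "hybrid"         # large files: index in memory, lazy-load records
```

Under these assumptions, a 250k-row file maps to `"chunked"`, while Lazy would have to be requested through configuration rather than chosen by Auto.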
Example
ds = FileDataSource("data.csv")
ds.load()
ds.set_filter("age > 25")
page = ds.get_page(0)
Note
- The file is re-parsed on reload()
- Threaded loading uses daemon threads, which do not keep the process alive at exit
- All MemoryDataSource methods are available after load
- The Lazy strategy re-parses the file on filter/sort changes
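
To make the cost of the Lazy strategy concrete, here is a minimal sketch of on-demand paging over a CSV file using only the standard library. It is not the library's implementation: `read_page_lazily` is a hypothetical helper, and it assumes a UTF-8 CSV file with a header row.

```python
import csv
from itertools import islice


def read_page_lazily(path, page, page_size=10):
    """Re-parse the CSV file and return only the rows for one page.

    Hypothetical helper for illustration; not part of FileDataSource.
    """
    start = page * page_size
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # islice still parses every row before the requested page, but it
        # discards them instead of keeping them in memory.
        return list(islice(reader, start, start + page_size))
```

Each call reads the file from the beginning, which is why this style of access keeps memory usage low but gets slower for later pages.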
__init__
__init__(
filepath: str | Path,
config: Optional[FileSourceConfig] = None,
page_size: int = 10,
)
Configure a file-backed datasource and detect file format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str \| Path | Location of the data file to be read. | required |
| config | Optional[FileSourceConfig] | Optional overrides for parsing, transforms, and threading. | None |
| page_size | int | Number of records returned per page after loading. | 10 |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the supplied file path cannot be found. |
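
The constructor is documented as detecting the file format and raising FileNotFoundError for a missing path, but the detection mechanism is not described. The sketch below assumes detection by file extension. `detect_format`, `SUPPORTED_FORMATS`, and the ValueError for unknown extensions are all hypothetical.

```python
from pathlib import Path

# Assumed mapping from file extension to format name (not from the library).
SUPPORTED_FORMATS = {".csv": "csv", ".json": "json", ".jsonl": "jsonl"}


def detect_format(filepath):
    """Return a format name for an existing data file.

    Hypothetical helper for illustration; not part of FileDataSource.
    """
    path = Path(filepath)
    if not path.is_file():
        # Mirrors the documented behaviour for a missing file.
        raise FileNotFoundError(f"Data file not found: {path}")
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        # Assumption: how unsupported extensions are handled is not documented.
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return SUPPORTED_FORMATS[suffix]
```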
load
load(on_complete: Optional[Callable] = None) -> None
Load the file using the configured strategy.
For threaded loading, returns immediately and calls on_complete when done. For synchronous loading, blocks until complete.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| on_complete | Optional[Callable] | Optional callback when loading finishes (overrides config). | None |
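
The threaded behaviour described for load (return immediately, call on_complete when done) follows a common pattern that can be sketched with the standard library. `load_in_background` and its `parse` argument are hypothetical, and passing the parsed result to the callback is an assumption, since the callback's signature is not documented here.

```python
import threading


def load_in_background(parse, on_complete=None):
    """Run parse() on a daemon thread and invoke on_complete when it finishes.

    Hypothetical helper for illustration; not part of FileDataSource.
    """
    def worker():
        result = parse()
        if on_complete is not None:
            # Assumption: the callback receives the parsed result.
            on_complete(result)

    # A daemon thread does not keep the process alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # returned immediately; parsing continues in the background
```

In this sketch the callback runs on the worker thread, so a UI application would typically need to hand the result back to its main thread before touching widgets.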
reload
reload() -> None
Reload from file, clearing current data.
is_loading
is_loading() -> bool
Check if file is currently loading.
get_load_progress
get_load_progress() -> tuple[int, int]
Get loading progress as (current, total) rows.
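A (current, total) tuple is convenient for driving a progress bar. The helper below is a minimal sketch of converting such a tuple to a percentage. `progress_percent` is hypothetical, and the guard for a zero total assumes the total may be unknown before loading starts, which the documentation does not state.

```python
def progress_percent(progress):
    """Convert a (current, total) row tuple to a percentage from 0.0 to 100.0.

    Hypothetical helper for illustration; not part of FileDataSource.
    """
    current, total = progress
    if total <= 0:
        # Assumption: a zero total means the row count is not known yet.
        return 0.0
    # Clamp in case current briefly exceeds an estimated total.
    return min(100.0, 100.0 * current / total)
```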
wait_for_load
wait_for_load(timeout: Optional[float] = None) -> bool
Wait for loading to complete (if threaded).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| timeout | Optional[float] | Max seconds to wait (None = wait forever). | None |
Returns:
| Type | Description |
|---|---|
| bool | True if load completed, False if timed out. |
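
The timeout semantics above match those of `threading.Event.wait`, which returns True once the event is set and False if the timeout expires first. The class below is a minimal sketch under that assumption. `LoadWaiter` and `mark_loaded` are hypothetical names, and the library's internals may differ.

```python
import threading


class LoadWaiter:
    """Illustrative stand-in for the waiting logic; not part of FileDataSource."""

    def __init__(self):
        self._done = threading.Event()

    def mark_loaded(self):
        # A loader thread would call this once parsing has finished.
        self._done.set()

    def wait_for_load(self, timeout=None):
        # Event.wait blocks until the flag is set or the timeout elapses.
        # timeout=None waits indefinitely.
        return self._done.wait(timeout)
```

A caller could use the boolean result to decide between showing the loaded data and reporting that loading is still in progress.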