FileSourceConfig

Configuration for FileDataSource loading and transformations.

See the DataSource Guide for usage examples.

Attribute Categories

| Category | Attributes |
| --- | --- |
| File Format | `file_format`, `encoding` |
| CSV/TSV | `delimiter`, `quotechar`, `skip_rows`, `header_row`, `has_header` |
| JSON | `json_lines`, `json_orient` |
| Column Transforms | `column_renames`, `column_types`, `column_transforms`, `columns_to_load`, `default_values` |
| Row Processing | `row_filter`, `row_transform` |
| Large Files | `loading_strategy`, `chunk_size`, `max_memory_rows` |
| Threading | `use_threading`, `progress_callback`, `on_complete`, `on_error` |

Class Attributes

File Format

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `file_format` | `Literal['auto', 'csv', 'tsv', 'json', 'jsonl']` | `'auto'` | File format; detected from the file extension when set to `'auto'`. |
| `encoding` | `str` | `'utf-8'` | Character encoding for reading the file. |
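
To illustrate what extension-based detection for `file_format='auto'` could look like, here is a minimal sketch. The `detect_format` helper and its extension map are hypothetical and not part of the documented API; the library's actual rules may differ.

```python
from pathlib import Path

# Hypothetical extension map; the real library may recognise other suffixes.
_EXTENSION_FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "jsonl",
}


def detect_format(path: str, file_format: str = "auto") -> str:
    """Return an explicit format, or infer one from the extension for 'auto'."""
    if file_format != "auto":
        return file_format  # an explicit setting always wins
    suffix = Path(path).suffix.lower()
    if suffix not in _EXTENSION_FORMATS:
        raise ValueError(f"cannot infer format from extension {suffix!r}")
    return _EXTENSION_FORMATS[suffix]


print(detect_format("employees.TSV"))        # tsv
print(detect_format("events.log", "jsonl"))  # jsonl
```

Setting `file_format` explicitly is the safer choice for files whose extension does not reflect their content.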

CSV/TSV

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `delimiter` | `Optional[str]` | `None` | Field separator (None = auto-detect). |
| `quotechar` | `str` | `'"'` | Quote character for fields containing the delimiter. |
| `skip_rows` | `int` | `0` | Number of header rows to skip. |
| `header_row` | `Optional[int]` | `0` | Row index containing column names (None = no header). |
| `has_header` | `bool` | `True` | Whether the first row contains column names. |
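
The sketch below shows how these options might map onto Python's standard `csv` module. The `read_rows` helper is hypothetical; it assumes `skip_rows` drops leading lines before the header and does not model `header_row`. The library's own parser may behave differently.

```python
import csv
import io


def read_rows(text, delimiter=None, quotechar='"', skip_rows=0, has_header=True):
    """Parse delimited text into a list of dicts (hypothetical helper)."""
    # Drop the first skip_rows lines before any parsing happens.
    body = "".join(text.splitlines(keepends=True)[skip_rows:])
    if delimiter is None:
        # delimiter=None: guess the separator from the remaining text.
        delimiter = csv.Sniffer().sniff(body, delimiters=",\t;|").delimiter
    reader = csv.reader(io.StringIO(body), delimiter=delimiter, quotechar=quotechar)
    rows = list(reader)
    if not rows:
        return []
    if has_header:
        header, rows = rows[0], rows[1:]
    else:
        # No header row: fall back to positional column names.
        header = [f"col_{i}" for i in range(len(rows[0]))]
    return [dict(zip(header, row)) for row in rows]


sample = 'exported 2024-01-01\nname;city\n"Doe; Jane";Oslo\nLee;Paris\n'
for row in read_rows(sample, skip_rows=1):
    print(row)
```

The quoted field `"Doe; Jane"` stays intact even though it contains the delimiter, which is the case `quotechar` exists for.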

JSON

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `json_lines` | `bool` | `False` | True for line-delimited JSON (JSONL/NDJSON format). |
| `json_orient` | `Literal['records', 'index', 'columns', 'values']` | `'records'` | Pandas-like orientation for JSON arrays. |
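
As a rough illustration of how `json_lines` and `json_orient` change the expected input shape, the sketch below normalises three layouts into the same list of row dicts. The `parse_json` helper is hypothetical; the orientations follow the pandas conventions the table refers to, and `'values'` is left out of this sketch.

```python
import json


def parse_json(text, json_lines=False, json_orient="records"):
    """Return a list of row dicts from JSON text (hypothetical helper)."""
    if json_lines:
        # JSONL/NDJSON: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    if json_orient == "records":
        return list(data)  # already [{column: value}, ...]
    if json_orient == "index":
        return list(data.values())  # {index: {column: value}}
    if json_orient == "columns":
        # {column: {index: value}} -> one dict per index label
        labels = list(next(iter(data.values()), {}))
        return [{col: vals[label] for col, vals in data.items()} for label in labels]
    raise NotImplementedError(f"orient {json_orient!r} is not covered by this sketch")


records = parse_json('[{"id": 1}, {"id": 2}]')
lines = parse_json('{"id": 1}\n{"id": 2}\n', json_lines=True)
columns = parse_json('{"id": {"0": 1, "1": 2}}', json_orient="columns")
print(records == lines == columns)  # True
```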

Column Transforms

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `column_renames` | `Optional[Dict[str, str]]` | `None` | Map {old_name: new_name} for renaming columns. |
| `column_types` | `Optional[Dict[str, type]]` | `None` | Map {column: type} for type conversions. |
| `column_transforms` | `Optional[Dict[str, Callable]]` | `None` | Map {column: func} for custom transformations. |
| `columns_to_load` | `Optional[List[str]]` | `None` | List of columns to load (None = all columns). |
| `default_values` | `Optional[Dict[str, Any]]` | `None` | Map {column: value} for missing/null values. |
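
The following sketch shows one way these five options could combine on a single row. The `apply_column_options` helper is hypothetical, and two points are assumptions not stated in the source: the order of application (select, rename, defaults, types, transforms) and the use of the new, post-rename names as keys in the later maps.

```python
def apply_column_options(row, columns_to_load=None, column_renames=None,
                         default_values=None, column_types=None,
                         column_transforms=None):
    """Apply the column options to one row dict; the order here is assumed."""
    row = dict(row)  # work on a copy so the caller's row is untouched
    if columns_to_load is not None:
        row = {k: v for k, v in row.items() if k in columns_to_load}
    for old, new in (column_renames or {}).items():
        if old in row:
            row[new] = row.pop(old)
    for col, value in (default_values or {}).items():
        if row.get(col) is None or row.get(col) == "":
            row[col] = value  # fill missing, null, or empty cells
    for col, caster in (column_types or {}).items():
        if row.get(col) is not None:
            row[col] = caster(row[col])
    for col, func in (column_transforms or {}).items():
        if col in row:
            row[col] = func(row[col])
    return row


raw = {"emp_id": "17", "age": "", "name": " ada ", "notes": "x"}
clean = apply_column_options(
    raw,
    columns_to_load=["emp_id", "age", "name"],
    column_renames={"emp_id": "id"},
    default_values={"age": "0"},
    column_types={"id": int, "age": int},
    column_transforms={"name": lambda s: s.strip().title()},
)
print(clean == {"id": 17, "age": 0, "name": "Ada"})  # True
```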

Row Processing

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `row_filter` | `Optional[Callable[[Dict], bool]]` | `None` | Function(row_dict) -> bool to filter rows during load. |
| `row_transform` | `Optional[Callable[[Dict], Dict]]` | `None` | Function(row_dict) -> row_dict for row-level transforms. |
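
Both callbacks receive a row as a dict. A minimal sketch of how they might be applied during loading follows; the `process_rows` helper is hypothetical, and running the filter before the transform is an assumption.

```python
def process_rows(rows, row_filter=None, row_transform=None):
    """Yield rows that pass row_filter, after row_transform (order assumed)."""
    for row in rows:
        if row_filter is not None and not row_filter(row):
            continue  # a falsy filter result drops the row
        yield row_transform(row) if row_transform is not None else row


rows = [
    {"name": "ada", "active": "yes"},
    {"name": "bob", "active": "no"},
]
kept = list(
    process_rows(
        rows,
        row_filter=lambda r: r["active"] == "yes",
        row_transform=lambda r: {**r, "name": r["name"].upper()},
    )
)
print(kept)  # [{'name': 'ADA', 'active': 'yes'}]
```

Returning a new dict from the transform, as the lambda does here, avoids mutating the source rows.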

Large Files

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `loading_strategy` | `Literal['eager', 'lazy', 'chunked', 'hybrid', 'auto']` | `'auto'` | Strategy used to load the file. |
| `chunk_size` | `int` | `10000` | Rows per chunk for chunked/lazy loading. |
| `max_memory_rows` | `int` | `100000` | Row-count threshold for auto-switching loading strategies. |
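
To make `chunk_size` and `max_memory_rows` concrete, the sketch below splits a row stream into fixed-size chunks and picks a strategy from a row count. Both helpers are hypothetical. In particular, the `choose_strategy` rule (eager below the threshold, chunked otherwise) is only an assumption about what `'auto'` might do; the source does not describe the actual rule or the `'lazy'` and `'hybrid'` strategies.

```python
from itertools import islice


def iter_chunks(rows, chunk_size=10_000):
    """Yield lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def choose_strategy(row_count, max_memory_rows=100_000):
    """Assumed 'auto' rule: eager below the threshold, chunked at or above it."""
    return "eager" if row_count < max_memory_rows else "chunked"


sizes = [len(chunk) for chunk in iter_chunks(range(25), chunk_size=10)]
print(sizes)                       # [10, 10, 5]
print(choose_strategy(500))        # eager
print(choose_strategy(2_000_000))  # chunked
```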

Threading

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `use_threading` | `bool` | `False` | Load file in background thread (non-blocking). |
| `progress_callback` | `Optional[Callable[[int, int], None]]` | `None` | Function(current, total) called during load. |
| `on_complete` | `Optional[Callable[[], None]]` | `None` | Function() called when loading completes. |
| `on_error` | `Optional[Callable[[Exception], None]]` | `None` | Function(exception) called if loading fails. |
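
The callback signatures above can be exercised with a small stand-in loader built on the standard `threading` module. The `load_in_background` helper is hypothetical. In this sketch the callbacks run on the worker thread; the source does not say which thread the library invokes them on.

```python
import threading


def load_in_background(rows, progress_callback=None, on_complete=None, on_error=None):
    """Consume rows on a worker thread, reporting through the callbacks."""
    result = []

    def worker():
        try:
            total = len(rows)
            for current, row in enumerate(rows, start=1):
                result.append(row)
                if progress_callback is not None:
                    progress_callback(current, total)
        except Exception as exc:  # any failure is routed to on_error
            if on_error is not None:
                on_error(exc)
        else:
            if on_complete is not None:
                on_complete()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()  # returns immediately; the caller is not blocked
    return thread, result


progress = []
done = threading.Event()
thread, loaded = load_in_background(
    ["a", "b", "c"],
    progress_callback=lambda current, total: progress.append((current, total)),
    on_complete=done.set,
)
done.wait(timeout=5)
print(progress)  # [(1, 3), (2, 3), (3, 3)]
```

If the callbacks update a GUI, they typically need to hand the update over to the UI thread rather than touch widgets directly.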

FileSourceConfig controls how files are parsed, loaded, and transformed: format and encoding selection, CSV/TSV and JSON parsing options, column and row transforms, loading strategies for large files, and background loading with callbacks.

Example
```python
config = FileSourceConfig(
    column_renames={'emp_id': 'id'},
    column_types={'age': int},
)
```