bootstack.data.FileSourceConfig

class bootstack.data.FileSourceConfig(file_format='auto', encoding='utf-8', delimiter=None, quotechar='"', skip_rows=0, header_row=0, has_header=True, json_lines=False, json_records_key=None, xml_record_tag=None, hdf5_key=None, column_renames=None, column_types=None, column_transforms=None, columns_to_load=None, default_values=None, row_filter=None, row_transform=None, chunk_size=10000, progress_callback=None)

Bases: object

Configuration for file parsing and the per-record transformation pipeline.

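Example

from bootstack.data import FileSourceConfig

config = FileSourceConfig(
    column_renames={'emp_id': 'id'},  # rename the 'emp_id' column to 'id'
    column_types={'age': int},        # convert 'age' values to int
)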

chunk_size: int = 10000#: Rows ingested per batch (bounds memory during load).

column_renames: Dict[str, str] | None = None#: Mapping from each existing column name to its replacement.

column_transforms: Dict[str, Callable[[Any], Any]] | None = None#: Mapping from a column name to a transform applied to its values.

column_types: Dict[str, Type] | None = None#: Mapping from a column name to the target type to convert its values to.

columns_to_load: List[str] | None = None#: List of columns to keep (None = all columns).

default_values: Dict[str, Any] | None = None#: Mapping from a column name to a fill value for missing or null entries.

delimiter: str | None = None#: Field separator. None auto-selects ',' for CSV and a tab for TSV.
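The None behaviour can be pictured with a small stand-alone helper. This is only a sketch: `resolve_delimiter` is a hypothetical name and not part of bootstack, and falling back to a comma for every format other than TSV is an assumption.

```python
from typing import Optional


def resolve_delimiter(file_format: str, delimiter: Optional[str] = None) -> str:
    """Hypothetical helper: choose the field separator for a delimited file."""
    if delimiter is not None:
        # An explicit delimiter always wins.
        return delimiter
    # Assumption: a tab for 'tsv', a comma for everything else.
    return "\t" if file_format == "tsv" else ","


print(repr(resolve_delimiter("csv")))
print(repr(resolve_delimiter("tsv")))
print(repr(resolve_delimiter("csv", ";")))
```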

encoding: str = 'utf-8'#: Character encoding for reading the file.

file_format: Literal['auto', 'csv', 'tsv', 'json', 'jsonl', 'ndjson', 'xml', 'parquet', 'feather', 'hdf5'] = 'auto'#: Format override; auto-detected from the extension when 'auto'.

has_header: bool = True#: Whether the first row contains column names.

hdf5_key: str | None = None#: Dataset/table key to read from an HDF5 file (None = the first key).

header_row: int | None = 0#: Row index containing column names (None = no header).

json_lines: bool = False#: True for line-delimited JSON (JSONL/NDJSON).

json_records_key: str | None = None#: Key whose value is the records list in a JSON object (e.g. 'data' for {'data': [...]}); None = a top-level array, or the object itself as one record.
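The three cases in the json_records_key description can be sketched in plain Python. `extract_records` is a hypothetical helper written for illustration; how the library itself handles a missing key or nested keys is not stated here, so the sketch does not cover it.

```python
import json
from typing import Any, Dict, List, Optional


def extract_records(payload: Any, records_key: Optional[str] = None) -> List[Dict[str, Any]]:
    """Hypothetical helper mirroring the json_records_key description."""
    if records_key is not None:
        # The records list sits under the given key, e.g. {'data': [...]}.
        return list(payload[records_key])
    if isinstance(payload, list):
        # A top-level array is already the list of records.
        return payload
    # A bare object is treated as a single record.
    return [payload]


wrapped = json.loads('{"data": [{"id": 1}, {"id": 2}]}')
top_level = json.loads('[{"id": 1}]')
single = json.loads('{"id": 3}')

print(extract_records(wrapped, "data"))
print(extract_records(top_level))
print(extract_records(single))
```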

progress_callback: Callable[[int], None] | None = None#: Function (count) called after each ingested chunk with the running total of rows loaded so far.
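A minimal sketch of how chunk_size and progress_callback might interact, using only the standard library. `ingest_in_chunks` is a hypothetical stand-in for the loader, not a bootstack function; it shows the callback receiving the running total once per chunk, as described above.

```python
from itertools import islice
from typing import Any, Callable, Iterable, List, Optional


def ingest_in_chunks(
    rows: Iterable[Any],
    chunk_size: int = 10000,
    progress_callback: Optional[Callable[[int], None]] = None,
) -> int:
    """Hypothetical loader loop: consume rows in batches and report progress."""
    iterator = iter(rows)
    total = 0
    while True:
        # Only one chunk is materialised at a time, which bounds memory use.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        total += len(chunk)
        if progress_callback is not None:
            # The callback receives the running total, not the chunk length.
            progress_callback(total)
    return total


seen: List[int] = []
loaded = ingest_in_chunks(range(25), chunk_size=10, progress_callback=seen.append)
print(seen, loaded)
```

With 25 rows and a chunk size of 10, the callback in this sketch fires three times, with 10, 20, and 25.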

quotechar: str = '"'#: Quote character for fields containing the delimiter.

row_filter: Callable[[Dict[str, Any]], bool] | None = None#: Function (row_dict) -> bool to filter rows during load.

row_transform: Callable[[Dict[str, Any]], Dict[str, Any]] | None = None#: Function (row_dict) -> row_dict for row-level transforms.
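The callables for row_filter and row_transform might look like the following. The function names and sample rows are invented for illustration. Two points are assumptions rather than documented behaviour: that a True result from the filter keeps the row, and that the filter runs before the transform.

```python
from typing import Any, Dict, List


def keep_adults(row: Dict[str, Any]) -> bool:
    """Candidate row_filter. Assumption: returning True keeps the row."""
    return row["age"] >= 18


def add_full_name(row: Dict[str, Any]) -> Dict[str, Any]:
    """Candidate row_transform: return a new dict instead of mutating the input."""
    out = dict(row)
    out["full_name"] = f"{row['first']} {row['last']}"
    return out


rows: List[Dict[str, Any]] = [
    {"first": "Ada", "last": "Lovelace", "age": 36},
    {"first": "Tim", "last": "Young", "age": 12},
]

# Assumption: the filter is applied first, then the transform.
result = [add_full_name(row) for row in rows if keep_adults(row)]
print(result)
```

Functions like these would be passed as `row_filter=keep_adults` and `row_transform=add_full_name`.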

skip_rows: int = 0#: Number of leading rows to skip before the header (CSV/TSV).
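The effect of skip_rows can be illustrated with the standard csv module. This sketch does not use bootstack; `read_after_skipping` and the sample text are hypothetical, and it assumes the header is the first row after the skipped ones.

```python
import csv
import io
from itertools import islice
from typing import Dict, List

# Two preamble lines precede the real header row.
raw = "exported 2024-01-01\n# internal use\nid,name\n1,Ada\n2,Tim\n"


def read_after_skipping(text: str, skip_rows: int = 0) -> List[Dict[str, str]]:
    """Hypothetical helper: drop leading lines, then parse header and rows."""
    lines = islice(io.StringIO(text), skip_rows, None)
    # DictReader takes the first remaining line as the header.
    return list(csv.DictReader(lines))


records = read_after_skipping(raw, skip_rows=2)
print(records)
```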

xml_record_tag: str | None = None#: Element tag that marks one record (None = direct children of the root).
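Record selection by tag can be sketched with `xml.etree.ElementTree`. `iter_records` is a hypothetical helper, not the library's parser. Matching the tag at any depth and flattening each record's child elements into a dict are both assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def iter_records(xml_text: str, record_tag: Optional[str] = None) -> List[Dict[str, Optional[str]]]:
    """Hypothetical helper: turn record elements into flat dicts."""
    root = ET.fromstring(xml_text)
    if record_tag is not None:
        # Assumption: every element with this tag is a record, at any depth.
        elements = list(root.iter(record_tag))
    else:
        # None: each direct child of the root is one record.
        elements = list(root)
    return [{child.tag: child.text for child in element} for element in elements]


nested = "<staff><group><emp><id>1</id></emp><emp><id>2</id></emp></group></staff>"
flat = "<rows><row><id>1</id></row><row><id>2</id></row></rows>"

print(iter_records(nested, "emp"))
print(iter_records(flat))
```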