exetera.io package
Submodules
exetera.io.importer module
-
exetera.io.importer.import_with_schema(session: Session, dataset_filename: Union[str, _io.BytesIO], dataset_alias: str, schema_file: Union[str, _io.BytesIO, _io.StringIO], files: Union[str, dict], overwrite: bool, include: Optional[Dict] = None, exclude: Optional[Dict] = None, timestamp: Union[str, datetime.datetime] = None, chunk_row_size: int = 1048576)
Imports the source data described by ‘files’ into the dataset specified by ‘dataset_filename’ and registers it in the session under the alias ‘dataset_alias’. The source data must conform to the schema specified in ‘schema_file’.
If ‘dataset_filename’ refers to an existing dataset, an error is raised unless ‘overwrite’ is set to True. If no such dataset exists, ‘overwrite’ has no effect.
- Parameters
session – The exetera Session object used to hold the resulting open dataset
dataset_filename – A relative or absolute path and name for the dataset. If this refers to an existing file and the caller has not set ‘overwrite’ to True, an error will be raised. Otherwise, a dataset will be created at this location. This can also be a BytesIO object, primarily for testing purposes.
dataset_alias – An alias for the dataset in the session. This is required so that the dataset can be easily retrieved from the session subsequently.
schema_file – The path / name of an exetera schema file that describes the data in the data sources specified by ‘files’.
include – An optional parameter that specifies fields to be included from the data sources. Only one of ‘include’ or ‘exclude’ may be used for each data source.
exclude – An optional parameter that specifies fields to be excluded from the data sources. Only one of ‘include’ or ‘exclude’ may be used for each data source.
timestamp – An optional parameter that specifies an official timestamp for the dataset. If this is not set, a timestamp will be generated at the moment this method is called.
chunk_row_size – An optional parameter that tunes import performance. Larger values use more memory but improve import speed. Typically this should be left at its default value.
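The rules above (existing dataset versus ‘overwrite’, one of ‘include’ or ‘exclude’ per data source, generated timestamp) can be sketched as a small pre-flight check. This is an illustration only, not ExeTera's own code: check_import_args and import_checked are hypothetical helpers, the include / exclude dictionaries are assumed to be keyed by data source name, and Session is assumed to be importable from exetera.core.session and usable as a context manager.

```python
import os
from datetime import datetime, timezone


def check_import_args(dataset_filename, overwrite, include=None, exclude=None,
                      timestamp=None):
    """Hypothetical helper: validate arguments against the documented rules.

    Returns the timestamp to pass on (a fresh UTC datetime if none was given).
    """
    # Documented rule: an existing dataset is an error unless overwrite is True.
    if (isinstance(dataset_filename, str)
            and os.path.exists(dataset_filename) and not overwrite):
        raise FileExistsError(
            f"{dataset_filename!r} exists; pass overwrite=True to replace it")

    # Documented rule: only one of include / exclude per data source.
    # Assumption: both dictionaries are keyed by data source name.
    clashes = sorted(set(include or {}) & set(exclude or {}))
    if clashes:
        raise ValueError(f"both include and exclude given for: {clashes}")

    return timestamp if timestamp is not None else datetime.now(timezone.utc)


def import_checked(dataset_filename, dataset_alias, schema_file, files,
                   overwrite=False, include=None, exclude=None, timestamp=None):
    """Hypothetical wrapper: run the checks, then call import_with_schema."""
    timestamp = check_import_args(dataset_filename, overwrite,
                                  include, exclude, timestamp)
    # Imported lazily so the checks above work without ExeTera installed.
    # Assumption: Session lives in exetera.core.session.
    from exetera.core.session import Session
    from exetera.io.importer import import_with_schema
    with Session() as session:
        import_with_schema(session, dataset_filename, dataset_alias,
                           schema_file, files, overwrite,
                           include=include, exclude=exclude,
                           timestamp=timestamp)
```

Failing before the import starts keeps an accidental overwrite or a conflicting field selection from surfacing only after a long-running import is under way.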
exetera.io.load_schema module
-
exetera.io.load_schema.load_schema(source: Union[str, _io.StringIO], verbosity=0)
-
exetera.io.load_schema.load_schema_file(source: Union[str, _io.StringIO], verbosity=0)
-
exetera.io.load_schema.schema_file_to_dict(schema)
exetera.io.parsers module
-
exetera.io.parsers.read_csv(csv_file: str, ddf: exetera.core.abstract_types.DataFrame, schema_dictionary: Mapping[str, exetera.io.field_importers.ImporterDefinition] = None, schema_file: Union[str, _io.StringIO] = None, schema_key: str = None, include: List[str] = None, exclude: List[str] = None, chunk_row_size: int = 1048576, timestamp=1649178756.969108)
Read a comma-separated values (csv) file into an HDF5DataFrame.
- Params csv_file
string path for csv file.
- Params ddf
destination dataframe.
- Params schema_dictionary
provide schema in dictionary format. Default is None.
- Params schema_file
provide schema in file. Default is None.
- Params include
a list of included field names. Default is None.
- Params exclude
a list of excluded field names. Default is None.
- Params chunk_row_size
read the file chunk by chunk. Each chunk contains chunk_row_size rows. Default is 1 << 20.
- Params timestamp
timestamp. Default is the timestamp of the current time.
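As an illustration of how the include and exclude lists narrow the set of imported fields, here is a minimal sketch in plain Python. It is not ExeTera's implementation: select_fields is a hypothetical helper, and it assumes that include and exclude are not combined, mirroring the rule stated for import_with_schema.

```python
def select_fields(header, include=None, exclude=None):
    """Hypothetical helper: return the fields to import, in header order."""
    # Assumption: include and exclude are mutually exclusive.
    if include is not None and exclude is not None:
        raise ValueError("pass either include or exclude, not both")
    if include is not None:
        wanted = set(include)
        return [name for name in header if name in wanted]
    if exclude is not None:
        dropped = set(exclude)
        return [name for name in header if name not in dropped]
    # Neither list given: every field in the header is kept.
    return list(header)


header = ["id", "patient_id", "created_at", "age"]
print(select_fields(header, include=["age", "id"]))   # ['id', 'age']
print(select_fields(header, exclude=["created_at"]))  # ['id', 'patient_id', 'age']
```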
-
exetera.io.parsers.read_csv_with_schema_dict(csv_file: str, ddf: exetera.core.abstract_types.DataFrame, schema_dictionary: Mapping[str, exetera.io.field_importers.ImporterDefinition], timestamp: float, include: List[str] = None, exclude: List[str] = None, chunk_row_size: int = 1048576, stop_after_rows=None)
Read a comma-separated values (csv) file into an HDF5DataFrame, with the schema provided as a dictionary.
- Params csv_file
string path for csv file.
- Params ddf
destination dataframe.
- Params schema_dictionary
provide schema in dictionary format.
- Params timestamp
timestamp.
- Params include
a list of included field names. Default is None.
- Params exclude
a list of excluded field names. Default is None.
- Params chunk_row_size
read the file chunk by chunk. Each chunk contains chunk_row_size rows. Default is 1 << 20.
- Params stop_after_rows
stop after the given number of rows. Default is None.
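The interplay of chunk_row_size and stop_after_rows can be pictured with a small standard-library sketch. It only illustrates the idea of reading in fixed-size chunks with an optional row limit and is not how ExeTera parses files: iter_chunks is a hypothetical helper, and it assumes the first csv line is a header row.

```python
import csv
import io


def iter_chunks(source, chunk_row_size=1 << 20, stop_after_rows=None):
    """Yield lists of at most chunk_row_size data rows from a csv text stream."""
    reader = csv.reader(source)
    next(reader, None)  # assumption: the first line is a header row
    chunk = []
    rows_read = 0
    for row in reader:
        # Stop once the requested number of data rows has been read.
        if stop_after_rows is not None and rows_read >= stop_after_rows:
            break
        chunk.append(row)
        rows_read += 1
        if len(chunk) == chunk_row_size:
            yield chunk
            chunk = []
    # Emit the final, possibly shorter, chunk.
    if chunk:
        yield chunk


data = io.StringIO("id,age\n1,30\n2,41\n3,25\n4,52\n5,67\n")
sizes = [len(c) for c in iter_chunks(data, chunk_row_size=2, stop_after_rows=3)]
print(sizes)  # [2, 1]
```

Only one chunk is held in memory at a time, which is why larger chunk sizes trade memory for fewer passes through the loop.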