exetera.core package¶

Submodules¶

exetera.core.data_writer module¶

class exetera.core.data_writer.DataWriter¶

Bases: object

static clear_dataset(parent_group, name)¶

static create_group(parent_group, name, attrs)¶

static flush(group)¶

static write(group, name, field, count, dtype=None)¶

static write_additional(group, name, field, count)¶

static write_first(group, name, field, count, dtype=None)¶

exetera.core.dataset module¶

class exetera.core.dataset.HDF5Dataset(session, dataset_path, mode, name)¶

Bases: exetera.core.abstract_types.Dataset

Dataset is the means by which you interact with an ExeTera datastore. Datasets are created and loaded through Session.open_dataset, rather than being constructed directly.

A Dataset is composed of one or more DataFrame objects and provides the means of creating, retrieving, and deleting them.

For a detailed explanation of Dataset along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/Dataset-API

Parameters

session – The session instance that this dataset belongs to.
dataset_path – The path of the HDF5 file.
mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.
name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling get_dataset().

Returns

A HDF5Dataset instance.

close()¶: Close the underlying HDF5 file.

contains_dataframe(dataframe: exetera.core.abstract_types.DataFrame)¶

Check if a dataframe is contained in this dataset by the dataframe object itself.

Parameters: dataframe – the dataframe object to check
Returns: True if the dataframe is contained in this dataset, False otherwise

copy(dataframe, name)¶

Add an existing dataframe (from another dataset) to this dataset, writing its group attributes and HDF5 datasets to this dataset.

Parameters

dataframe – the dataframe to copy to this dataset.
name – optional - a new name for the dataframe in this dataset.

Returns

None if the operation is successful; otherwise an error is raised.

create_dataframe(name: str, dataframe: Optional[exetera.core.abstract_types.DataFrame] = None)¶

Create a new DataFrame object as a part of this Dataset.

Parameters

name – name of the dataframe
dataframe – if set, this is a dataframe object whose contents are duplicated

Returns

a dataframe object

create_group(name: str)¶: This method is a wrapper around create_dataframe().

delete_dataframe(dataframe: exetera.core.abstract_types.DataFrame)¶

Remove dataframe from this dataset by the dataframe object.

Parameters: dataframe – The dataframe instance to delete.
Returns: A boolean indicating whether the dataframe was deleted.

drop(name: str)¶

get_dataframe(name: str)¶

Get a dataframe by name, e.g. dataset.get_dataframe(dataframe_name).

Parameters: name – The name of the dataframe.
Returns: The dataframe; an error is raised if the name does not exist in this dataset.

items()¶: Return the (name, dataframe) pairs in this dataset.

keys()¶: Return all dataframe names in this dataset.

require_dataframe(name)¶

Get a dataframe, creating it if it doesn’t exist.

Parameters: name – name of the dataframe
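
The get-or-create behaviour described here can be pictured with a toy registry. This is a minimal sketch in plain Python, not ExeTera's implementation: ToyDataset and its internal dict are hypothetical stand-ins for a dataset and its dataframes, and the error raised by the toy create_dataframe is an assumption made for the illustration.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that holds named dataframes."""

    def __init__(self):
        self._frames = {}

    def create_dataframe(self, name):
        # Assumption for this toy: creating an existing name is an error.
        if name in self._frames:
            raise ValueError(f"dataframe {name!r} already exists")
        self._frames[name] = {}  # an empty dict plays the role of a dataframe
        return self._frames[name]

    def require_dataframe(self, name):
        # Return the existing dataframe, or create it on first request.
        if name in self._frames:
            return self._frames[name]
        return self.create_dataframe(name)


ds = ToyDataset()
first = ds.require_dataframe("patients")
second = ds.require_dataframe("patients")
print(first is second)  # True: the second call returns the same object
```

The point of the pattern is that callers do not need to check for existence before asking for a dataframe.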

property session¶

The session property interface.

Returns: The _session instance.

values()¶: Return all dataframe instances in this dataset.

exetera.core.dataset.copy(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)¶

Copy a dataframe to another dataset, e.g. HDF5DataFrame.copy(ds1[‘df1’], ds2, ‘df1’).

Parameters

dataframe – The dataframe to copy.
dataset – The destination dataset.
name – The name of dataframe in destination dataset.

exetera.core.dataset.move(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)¶

Move a dataframe to another dataset, e.g. HDF5DataFrame.move(ds1[‘df1’], ds2, ‘df1’). When moving within the same dataset, e.g. HDF5DataFrame.move(ds1[‘df1’], ds1, ‘df2’), this functions as a rename for both the dataframe and the HDF5Group. However, to

Parameters

dataframe – The dataframe to move.
dataset – The destination dataset.
name – The name of dataframe in destination dataset.

exetera.core.dataframe module¶

class exetera.core.dataframe.HDF5DataFrame(dataset: exetera.core.abstract_types.Dataset, name: str, h5group: h5py._hl.group.Group)¶

Bases: exetera.core.abstract_types.DataFrame

DataFrame is the means by which you interact with the fields of an ExeTera datastore. DataFrames are created and loaded through Dataset.create_dataframe and other methods, rather than being constructed directly.

DataFrames closely resemble Pandas DataFrames, but with a number of key differences:

1. Instead of Series, DataFrames are composed of Field objects.
2. DataFrames can store fields of differing lengths, although all fields must be of the same length when performing certain operations such as merges.
3. ExeTera DataFrames do not (yet) have the ability to create filtered views onto an underlying DataFrame, although this functionality will be added in upcoming releases.

For a detailed explanation of DataFrame along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/DataFrame-API

Parameters

name – name of the dataframe.
dataset – a dataset object, where this dataframe belongs to.
h5group – the h5group object that stores the fields. If the h5group is not empty, data is acquired from the h5group object directly. The h5group structure is h5group<-h5group-dataset; the latter group has a ‘fieldtype’ attribute and only one dataset, named ‘values’, so that the structure maps to Dataframe<-Field-Field.data automatically.
dataframe – optional - replicate data from another dictionary of (name:str, field: Field).

add(field: exetera.core.abstract_types.Field)¶

Add a field to this dataframe as well as the HDF5 Group.

Parameters: field – the field to add to this dataframe; the underlying dataset is copied

apply_filter(filter_to_apply, ddf=None)¶

Apply the filter to all the fields in this dataframe, return a dataframe with filtered fields.

Parameters

filter_to_apply – the filter to be applied to the source field, an array of boolean
ddf – optional- the destination data frame

Returns

a dataframe containing all the filtered fields; self if ddf is not set
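
As a rough picture of what applying one boolean filter to every field means, the sketch below uses plain Python lists in place of fields and a dict in place of a dataframe. It is an illustration of the semantics only, under those assumptions; the function name filter_columns and the dict-based layout are hypothetical and say nothing about how ExeTera stores or processes data.

```python
def filter_columns(columns, filter_to_apply, dest=None):
    """Keep row i of every column where filter_to_apply[i] is truthy.

    `columns` is a dict of name -> list (a stand-in for a dataframe).
    If `dest` is given, the filtered columns are written there and `columns`
    is left untouched; otherwise `columns` is overwritten and returned.
    """
    for name, values in columns.items():
        if len(values) != len(filter_to_apply):
            raise ValueError(f"column {name!r} and filter differ in length")
    filtered = {
        name: [v for v, keep in zip(values, filter_to_apply) if keep]
        for name, values in columns.items()
    }
    target = columns if dest is None else dest
    target.update(filtered)
    return target


src = {"id": [1, 2, 3, 4], "age": [30, 41, 25, 57]}
dst = {}
out = filter_columns(src, [True, False, True, False], dst)
print(out)  # {'id': [1, 3], 'age': [30, 25]}
```

Because the same filter is applied to every column, rows stay aligned across fields in the result.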

apply_index(index_to_apply, ddf=None)¶

Apply the index to all the fields in this dataframe, return a dataframe with indexed fields.

Parameters

index_to_apply – the index to be applied to the fields, an ndarray of integers
ddf – optional- the destination data frame

Returns

a dataframe containing all the re-indexed fields; self if ddf is not set
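
Applying an index differs from applying a filter in that the index can reorder and repeat rows. A minimal sketch with plain Python lists follows; reindex_columns and the dict-of-lists layout are hypothetical names used only for this illustration.

```python
def reindex_columns(columns, index_to_apply, dest=None):
    """Build each output column by picking values[i] for every i in the index."""
    reindexed = {
        name: [values[i] for i in index_to_apply]
        for name, values in columns.items()
    }
    target = columns if dest is None else dest
    target.update(reindexed)
    return target


src = {"id": [10, 20, 30], "label": ["a", "b", "c"]}
dst = {}
reindex_columns(src, [2, 0, 0], dst)
print(dst)  # {'id': [30, 10, 10], 'label': ['c', 'a', 'a']}
```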

property columns¶: The columns property interface. Columns is a dictionary to store the fields by (field_name, field_object). The field_name is field.name without prefix ‘/’ and HDF5 group name.

contains_field(field)¶

Check whether the dataframe contains a field, by the field object.

Parameters: field – the field object to check
Returns: a tuple (bool, str); the str is the name under which the field is stored in the dataframe

create_categorical(name: str, nformat: int, key: dict, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶: Create a categorical type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#categoricalfield for a detailed description of categorical fields

create_fixed_string(name: str, length: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶: Create a fixed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#fixedstringfield for a detailed description of fixed string fields

create_group(name: str)¶

Create a group object in HDF5 file for field to use. Please note, this function is for backwards compatibility with older scripts and should not be used in the general case.

Parameters: name – the name of the group and field
Returns: a hdf5 group object

create_indexed_string(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶: Create an indexed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#indexedstringfield for a detailed description of indexed string fields

create_numeric(name: str, nformat: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶: Create a numeric type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#numericfield for a detailed description of numeric fields

create_timestamp(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶: Create a timestamp type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#timestampfield for a detailed description of timestamp fields

property dataset¶: The dataset property interface.

delete_field(field)¶

Remove field from dataframe by field.

Parameters: field – The field to delete from this dataframe.

describe(include=None, exclude=None, output='terminal')¶

Show the basic statistics of the data in each field.

Parameters

include – The field name or data type or simply ‘all’ to indicate the fields included in the calculation.
exclude – The field name or data type to exclude from the calculation.
output – Display the result in stdout if set to terminal, otherwise silent.

Returns

A dataframe containing the statistical results.

drop(name: str)¶: Drop a field from this dataframe as well as the HDF5 Group

drop_duplicates(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, hint_keys_is_sorted=False)¶

Get the distinct values of a field or a list of fields; returns a dataframe with the distinct values.

Parameters

by – Name (str) or list of names (str) of the fields to make distinct.
ddf – optional - the destination dataframe

Returns

DataFrame with distinct values.

get_field(name)¶

Get a field stored by the field name.

Parameters: name – The name of field to get.

groupby(by: Union[str, List[str]], hint_keys_is_sorted=False)¶

Group the DataFrame using a field or a list of fields; returns a groupby object.

Parameters

by – Name (str) or list of names (str) to group by.
hint_keys_is_sorted – an optional flag that users can set to skip the sorted check. Note that groupby runs faster and uses less memory when the dataframe is already sorted on the keys and hint_keys_is_sorted=True is passed.

Returns

A groupby object that contains information about the groups.
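
One way to see why sorted keys are cheaper to group: when equal keys are adjacent, every group is a contiguous run, so the groups can be described by a list of boundaries found in a single pass. The sketch below illustrates that idea in plain Python. The boundary layout used here (one start offset per group plus a final end offset) is an assumption made for the illustration, and spans_from_sorted_keys is a hypothetical helper, not part of the ExeTera API.

```python
def spans_from_sorted_keys(keys):
    """Return boundaries [s0, s1, ..., sn] so that keys[s_k:s_k+1] is one group.

    `keys` must already be sorted (equal keys adjacent).
    """
    if not keys:
        return [0]
    spans = [0]
    for i in range(1, len(keys)):
        if keys[i] != keys[i - 1]:
            spans.append(i)  # a new group starts at row i
    spans.append(len(keys))  # closing boundary of the last group
    return spans


keys = [1, 1, 2, 5, 5, 5]
spans = spans_from_sorted_keys(keys)
print(spans)  # [0, 2, 3, 6]
```

Grouping by several fields works the same way if each row's key is the tuple of its values, for example list(zip(field_a, field_b)).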

property h5group¶: The h5group property interface, used to handle underlying storage.

items()¶: Return all the field names and their corresponding field values

keys()¶: Return all the field names

rename(field: Union[str, Mapping[str, str]], field_to: Optional[str] = None) → None¶

Rename provides you with the means to rename fields within a dataframe. You can specify either a single field to be renamed or you can provide a dictionary with a set of fields to be renamed.

Example:

# rename a single field
df.rename('a', 'b')

# rename multiple fields
df.rename({'a': 'b', 'b': 'c', 'c': 'a'})

Field renaming can fail if the resulting set of renamed fields would have name clashes. If this is the case, none of the rename operations go ahead and the dataframe remains unmodified.

Parameters

field – Either a string or a dictionary of name pairs, each of which is the existing field name and the destination field name
field_to – Optional parameter containing a string, if field is a string. If ‘field’ is a dictionary, parameter should not be set. Field references remain valid after this operation and reflect their renaming.

Returns

None
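
The all-or-nothing behaviour described above (a clash aborts every rename and leaves the dataframe unmodified) can be modelled by building the renamed mapping separately and only returning it if no clash occurs. This is a sketch of the idea in plain Python under that assumption; rename_all_or_nothing is a hypothetical helper operating on a dict, and the exception types are choices made for the illustration.

```python
def rename_all_or_nothing(fields, mapping):
    """Return a new dict with keys renamed per `mapping`.

    Raises KeyError if a source name is missing and ValueError on a name
    clash. `fields` itself is never modified, so a failure changes nothing.
    """
    missing = [old for old in mapping if old not in fields]
    if missing:
        raise KeyError(f"unknown field(s): {missing}")
    renamed = {}
    for name, value in fields.items():
        new_name = mapping.get(name, name)
        if new_name in renamed:
            raise ValueError(f"name clash on {new_name!r}")
        renamed[new_name] = value
    return renamed


fields = {"a": 1, "b": 2, "c": 3}

# A cyclic rename is fine: the resulting set of names has no duplicates.
rotated = rename_all_or_nothing(fields, {"a": "b", "b": "c", "c": "a"})
print(rotated)  # {'b': 1, 'c': 2, 'a': 3}

# Renaming 'a' to 'b' while 'b' still exists is a clash.
try:
    rename_all_or_nothing(fields, {"a": "b"})
except ValueError as err:
    print("rejected:", err)
```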

sort_values(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, axis=0, ascending=True, kind='stable')¶

Sort by the values of a field or a list of fields

Parameters

by – Name (str) or list of names (str) to sort by.
ddf – optional - the destination data frame
axis – Axis to be sorted. Currently only supports 0
ascending – Sort ascending vs. descending. Currently only supports ascending=True.
kind – Choice of sorting algorithm. Currently only supports “stable”

Returns

DataFrame with sorted values or None if ddf=None.
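
A stable, ascending, multi-key sort over columns can be pictured as computing one permutation of row positions and then applying it to every field. The sketch below shows this with Python's built-in sorted, which is stable; sort_order and the dict-of-lists layout are hypothetical names for the illustration and do not describe ExeTera's internals.

```python
def sort_order(columns, by):
    """Return row positions in ascending order of the `by` column(s).

    Rows with equal keys keep their original relative order (stable sort).
    """
    if isinstance(by, str):
        by = [by]
    n_rows = len(columns[by[0]])
    return sorted(range(n_rows), key=lambda i: tuple(columns[name][i] for name in by))


columns = {"pid": [2, 1, 2, 1], "day": [5, 9, 3, 9], "val": ["a", "b", "c", "d"]}
order = sort_order(columns, ["pid", "day"])
print(order)  # [1, 3, 2, 0]

# Applying the same permutation to every column keeps rows aligned.
sorted_columns = {name: [values[i] for i in order] for name, values in columns.items()}
print(sorted_columns["val"])  # ['b', 'd', 'c', 'a']
```

Rows 1 and 3 share the key (1, 9); stability is what guarantees that row 1 stays ahead of row 3.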

to_csv(filepath: str, row_filter: Union[numpy.ndarray, exetera.core.abstract_types.Field] = None, column_filter: Union[str, List[str]] = None, chunk_row_size: int = 32768)¶

Write object to a comma-separated values (csv) file.

Parameters

filepath – File path.
row_filter – A boolean array / field. Only select rows when filter value is True.
column_filter – A sequence of string names for the fields.
chunk_row_size – Write rows for every chunk which has maximum chunk_row_size rows. The default is 1<<15.
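
The interaction of the row filter, the column filter, and the chunk size can be sketched with the standard csv module. This is a plain-Python illustration under stated assumptions, not the library's implementation: write_csv is a hypothetical helper, columns are plain lists, and output goes to any file-like object rather than a path.

```python
import csv
import io


def write_csv(out, columns, row_filter=None, column_filter=None, chunk_row_size=1 << 15):
    """Write selected columns and rows to `out`, at most chunk_row_size rows at a time."""
    if isinstance(column_filter, str):
        column_filter = [column_filter]
    names = list(columns) if column_filter is None else list(column_filter)
    n_rows = len(columns[names[0]]) if names else 0

    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)  # header row
    for start in range(0, n_rows, chunk_row_size):
        end = min(start + chunk_row_size, n_rows)
        # Only the rows of the current chunk are materialised before writing.
        rows = [
            [columns[name][i] for name in names]
            for i in range(start, end)
            if row_filter is None or row_filter[i]
        ]
        writer.writerows(rows)


buffer = io.StringIO()
write_csv(
    buffer,
    {"id": [1, 2, 3], "age": [30, 41, 25]},
    row_filter=[True, False, True],
    column_filter=["id"],
    chunk_row_size=2,
)
print(buffer.getvalue())
```

Chunking bounds how many rows are held in memory at once, which matters when the fields are much larger than RAM.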

to_pandas(row_filter: List[bool] = None, col_filter: Union[str, List[str]] = None)¶

Convert an ExeTera dataframe to a Pandas DataFrame.

Parameters

row_filter – A boolean array indicating which rows to export.
col_filter – String or list of strings indicating which columns to export.

Returns

A pandas dataframe.

Example:

pandas_df = df.to_pandas()

values()¶: Return all the field values

class exetera.core.dataframe.HDF5DataFrameGroupBy(columns, by, sorted_index, spans)¶

Bases: exetera.core.abstract_types.DataFrameGroupBy

count(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶

Compute count of group values.

Parameters

target – Name (str) or list of names (str) to compute count.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with count of group values

distinct(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶

first(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶

Get first of group values.

Parameters

target – Name (str) or list of names (str) to get first value.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with first of group values

last(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶

Get last of group values.

Parameters

target – Name (str) or list of names (str) to get last value.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with last of group values

max(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶

Compute max of group values.

Parameters

target – Name (str) or list of names (str) to compute max.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with max of group values

min(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶

Compute min of group values.

Parameters

target – Name (str) or list of names (str) to compute min.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with min of group values
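
The count, first, last, max and min operations all reduce each group of a target field to a single value. Assuming groups are contiguous runs described by a list of boundaries (an assumption for this sketch, with hypothetical names throughout), the five reductions can be illustrated in plain Python as follows.

```python
def aggregate(values, spans, how):
    """Reduce each slice values[spans[k]:spans[k + 1]] to one value."""
    reducers = {
        "count": len,
        "first": lambda group: group[0],
        "last": lambda group: group[-1],
        "max": max,
        "min": min,
    }
    reducer = reducers[how]
    return [reducer(values[start:end]) for start, end in zip(spans[:-1], spans[1:])]


# Three groups: rows 0-1, row 2, rows 3-5.
spans = [0, 2, 3, 6]
weights = [70, 72, 55, 90, 88, 91]

results = {how: aggregate(weights, spans, how) for how in ("count", "first", "last", "max", "min")}
print(results["count"])  # [2, 1, 3]
print(results["min"])    # [70, 55, 88]
```

Each result has one entry per group, which is why it can be written to a destination dataframe alongside the group keys.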

exetera.core.dataframe.copy(field: exetera.core.abstract_types.Field, dataframe: exetera.core.abstract_types.DataFrame, name: str)¶

Copy a field to another dataframe, along with its underlying dataset.

Parameters

field – The source field to copy.
dataframe – The destination dataframe to copy to.
name – The name of field under destination dataframe.

exetera.core.dataframe.merge(left: exetera.core.abstract_types.DataFrame, right: exetera.core.abstract_types.DataFrame, dest: exetera.core.abstract_types.DataFrame, left_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], right_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], left_fields: Optional[Sequence[str]] = None, right_fields: Optional[Sequence[str]] = None, left_suffix: str = '_l', right_suffix: str = '_r', how='left', hint_left_keys_ordered: Optional[bool] = None, hint_left_keys_unique: Optional[bool] = None, hint_right_keys_ordered: Optional[bool] = None, hint_right_keys_unique: Optional[bool] = None, chunk_size=1048576)¶

Merge ‘left’ and ‘right’ DataFrames into a destination dataframe. The merge is a database-style join operation, in any of the following modes (“left”, “right”, “inner”, “outer”). This method closely follows the Pandas ‘merge’ functionality.

The join is performed using the fields specified by ‘left_on’ and ‘right_on’; these can either be strings or fields; if they are strings then they refer to fields that must exist in the corresponding dataframe.

You can optionally set ‘left_fields’ and / or ‘right_fields’ if you want to have only a subset of fields joined from the left and right dataframes. If you don’t want any fields to be joined from a given dataframe, you can pass an empty list.

Fields are written to the destination dataframe. If the field names clash, they will get appended with the strings specified in ‘left_suffix’ and ‘right_suffix’ respectively.

Parameters

left – The left dataframe
right – The right dataframe
left_on – The field corresponding to the left key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys
right_on – The field corresponding to the right key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys
left_fields – Optional parameter listing which fields are to be joined from the left table. If this is not set, all fields from the left table are joined
right_fields – Optional parameter listing which fields are to be joined from the right table. If this is not set, all fields from the right table are joined
left_suffix – A string to be appended to fields from the left table if they clash with fields from the right table.
right_suffix – A string to be appended to fields from the right table if they clash with fields from the left table.
how – Optional parameter specifying the merge mode. It must be one of (‘left’, ‘right’, ‘inner’, ‘outer’ or ‘cross’). If not set, the ‘left’ join is performed.
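
To make the join and suffix rules concrete, here is a minimal sketch of a ‘left’ join over dict-of-lists tables in plain Python. It is an illustration only: left_join is a hypothetical helper, unmatched right-hand values are shown as None (how ExeTera represents them is not covered here), and suffixing every clashing name, including the key, is an assumption of this sketch.

```python
def left_join(left, right, left_on, right_on, left_suffix="_l", right_suffix="_r"):
    """Join two dict-of-lists tables, keeping every row of `left`."""
    # Map each right key to the row positions that hold it.
    positions = {}
    for j, key in enumerate(right[right_on]):
        positions.setdefault(key, []).append(j)

    # One (left_row, right_row) pair per output row; right_row is None if unmatched.
    pairs = []
    for i, key in enumerate(left[left_on]):
        matches = positions.get(key)
        if matches:
            pairs.extend((i, j) for j in matches)
        else:
            pairs.append((i, None))

    clashes = set(left) & set(right)
    dest = {}
    for name, values in left.items():
        out_name = name + left_suffix if name in clashes else name
        dest[out_name] = [values[i] for i, _ in pairs]
    for name, values in right.items():
        out_name = name + right_suffix if name in clashes else name
        dest[out_name] = [None if j is None else values[j] for _, j in pairs]
    return dest


left = {"id": [1, 2, 3], "score": [10, 20, 30]}
right = {"id": [1, 3, 3], "visit": [7, 8, 9]}
merged = left_join(left, right, "id", "id")
print(merged)
```

Left row 2 has no match and appears once with None on the right; left row 3 matches twice and is therefore repeated.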

exetera.core.dataframe.move(field: exetera.core.abstract_types.Field, dest_df: exetera.core.abstract_types.DataFrame, name: str)¶

Move a field to another dataframe, along with its underlying dataset.

Parameters

src_df – The source dataframe where the field is located.
field – The field to move.
dest_df – The destination dataframe to move to.
name – The name of field under destination dataframe.

exetera.core.exporter module¶

exetera.core.exporter.export_schema(destination, readers)¶

exetera.core.exporter.export_to_csv(destination, datastore, fields)¶: Export selected fields of selected dataframe to csv file.

exetera.core.exporter.schema_from_reader_type(reader)¶

exetera.core.exporter.transform_from_reader_type(reader)¶

exetera.core.fields module¶

class exetera.core.fields.CategoricalField(session, group, dataframe, write_enabled=False)¶

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
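
The three possible outcomes (a new field, the target, or this field) follow from the rule that ‘target’ and ‘in_place’ are mutually exclusive. The sketch below models that contract with plain Python lists standing in for fields; filter_field is a hypothetical helper and the choice of ValueError is an assumption for the illustration.

```python
def filter_field(source, filter_to_apply, target=None, in_place=False):
    """Return filtered data as a new list, in `target`, or in `source` itself."""
    if in_place and target is not None:
        raise ValueError("set either 'target' or 'in_place', not both")
    filtered = [v for v, keep in zip(source, filter_to_apply) if keep]
    if in_place:
        source[:] = filtered  # destructive: overwrite the source
        return source
    if target is not None:
        target[:] = filtered  # write into the caller-supplied destination
        return target
    return filtered  # neither set: hand back a fresh object


data = [5, 6, 7, 8]
flt = [True, False, False, True]

fresh = filter_field(data, flt)
print(fresh, data)  # [5, 8] [5, 6, 7, 8]

dest = []
print(filter_field(data, flt, target=dest) is dest)  # True

print(filter_field(data, flt, in_place=True) is data)  # True
```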

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

property keys¶

property nformat¶

remap(key_map, new_key)¶

Remap the key names and key values.

Parameters

key_map – The mapping rule for converting the old key into the new key.
new_key – The new key.

Returns

A CategoricalMemField with the new key.
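
As a rough illustration of remapping, the sketch below assumes that key_map is a sequence of (old_value, new_value) pairs and that new_key is a dict from label to new value. Both shapes are assumptions made for this example rather than something stated above, and remap_values is a hypothetical helper working on a plain list.

```python
def remap_values(values, key_map, new_key):
    """Translate stored category values via key_map and pair them with new_key.

    Assumed shapes (hypothetical): key_map is [(old_value, new_value), ...],
    new_key is {label: new_value}.
    """
    lookup = dict(key_map)
    unknown = set(values) - set(lookup)
    if unknown:
        raise ValueError(f"no mapping for value(s): {sorted(unknown)}")
    remapped = [lookup[v] for v in values]
    return remapped, dict(new_key)


# Old coding: 0 = "no", 1 = "yes", 2 = "definitely"; collapse 1 and 2 together.
values = [0, 1, 2, 1]
key_map = [(0, 10), (1, 20), (2, 20)]
new_key = {"no": 10, "yes": 20}

remapped, key = remap_values(values, key_map, new_key)
print(remapped)  # [10, 20, 20, 20]
```

Looking every value up in a single table avoids chained substitutions, where a value that has already been remapped is accidentally remapped again.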

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of CategoricalField

writeable()¶

class exetera.core.fields.CategoricalMemField(session, nformat, keys)¶

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

property keys¶

remap(key_map, new_key)¶

Remap the key names and key values.

Parameters

key_map – The mapping rule for converting the old key into the new key.
new_key – The new key.

Returns

A CategoricalMemField with the new key.

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of CategoricalMemField

writeable()¶

class exetera.core.fields.FieldDataOps¶

Bases: object

static apply_filter_to_field(source, filter_to_apply, target=None, in_place=False)¶

static apply_filter_to_indexed_field(source, filter_to_apply, target=None, in_place=False)¶

static apply_index_to_field(source, index_to_apply, target=None, in_place=False)¶

static apply_index_to_indexed_field(source, index_to_apply, target=None, in_place=False)¶

static apply_isin(source: exetera.core.abstract_types.Field, test_elements: Union[list, set, numpy.ndarray])¶

static apply_spans_first(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶

static apply_spans_last(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶

static apply_spans_max(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶

static apply_spans_min(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶

static apply_unique(src: exetera.core.abstract_types.Field, return_index=False, return_inverse=False, return_counts=False) → numpy.ndarray¶

static categorical_field_create_like(source, group, name, timestamp)¶

classmethod equal(session, first, second)¶

static fixed_string_field_create_like(source, group, name, timestamp)¶

classmethod greater_than(session, first, second)¶

classmethod greater_than_equal(session, first, second)¶

static indexed_string_create_like(source, group, name, timestamp)¶

classmethod invert(session, first)¶

classmethod less_than(session, first, second)¶

classmethod less_than_equal(session, first, second)¶

classmethod logical_not(session, first)¶

classmethod not_equal(session, first, second)¶

classmethod numeric_add(session, first, second)¶

classmethod numeric_and(session, first, second)¶

classmethod numeric_divmod(session, first, second)¶

static numeric_field_create_like(source, group, name, timestamp)¶

classmethod numeric_floordiv(session, first, second)¶

classmethod numeric_mod(session, first, second)¶

classmethod numeric_mul(session, first, second)¶

classmethod numeric_or(session, first, second)¶

classmethod numeric_sub(session, first, second)¶

classmethod numeric_truediv(session, first, second)¶

classmethod numeric_xor(session, first, second)¶

static timestamp_field_create_like(source, group, name, timestamp)¶

class exetera.core.fields.FixedStringField(session, group, dataframe, write_enabled=False)¶

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of FixedStringField

writeable()¶

class exetera.core.fields.FixedStringMemField(session, length)¶

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of FixedStringMemField

writeable()¶

class exetera.core.fields.HDF5Field(session, group, dataframe, write_enabled=False)¶

Bases: exetera.core.abstract_types.Field

apply_filter(filter_to_apply, dstfld=None)¶

apply_index(index_to_apply, dstfld=None)¶

property chunksize¶: The chunksize for the field. This is not generally required for users, and may be ignored depending on the storage medium.

property dataframe¶: The owning dataframe of this field, or None if the field is not owned by a dataframe

get_spans()¶

property indexed¶: Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.

property name¶: The name of the field within a dataframe, if the field belongs to a dataframe

property timestamp¶: The timestamp representing the field creation time. This is the time at which the data for this field was added to the dataset, rather than the point at which the field wrapper was created.

property valid¶: Returns whether the field is a valid field object. Fields can become invalid as a result of certain operations, such as a field being moved from one dataframe to another. A field that is invalid will throw exceptions if any other operation is performed on it.

class exetera.core.fields.IndexedStringField(session, group, dataframe, write_enabled=False)¶

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
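
For an indexed field, a boolean filter has to be applied to both the index and the value arrays. The sketch below illustrates the idea on plain numpy arrays. It assumes a layout in which indices holds n + 1 byte offsets into a flat uint8 values array; that layout, and the helper names encode_strings, decode_strings and filter_indexed, are assumptions made for illustration and are not taken from this page.

```python
import numpy as np

def encode_strings(strings):
    """Pack strings into (indices, values): n + 1 offsets plus one flat byte array."""
    encoded = [s.encode("utf-8") for s in strings]
    lengths = np.array([len(b) for b in encoded], dtype=np.int64)
    indices = np.zeros(len(encoded) + 1, dtype=np.int64)
    indices[1:] = np.cumsum(lengths)
    values = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return indices, values

def decode_strings(indices, values):
    """Recover the list of strings from the packed representation."""
    raw = values.tobytes()
    return [raw[indices[k]:indices[k + 1]].decode("utf-8")
            for k in range(len(indices) - 1)]

def filter_indexed(mask, indices, values):
    """Keep only the rows where mask is True, rebuilding the offsets."""
    lengths = np.diff(indices)                 # byte length of each row
    kept = lengths[mask]
    new_indices = np.zeros(len(kept) + 1, dtype=indices.dtype)
    new_indices[1:] = np.cumsum(kept)
    byte_mask = np.repeat(mask, lengths)       # expand the row mask to a per-byte mask
    return new_indices, values[byte_mask]

indices, values = encode_strings(["a", "bb", "", "ccc"])
mask = np.array([True, False, True, True])
f_indices, f_values = filter_indexed(mask, indices, values)
filtered = decode_strings(f_indices, f_values)
```

The empty string survives the filter as a zero-length row, which is why the offsets array needs one more entry than there are rows.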

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶: Create an empty field of the same type as this field.

property data¶

get_spans()¶

property indexed¶: Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.

property indices¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of IndexedStringField

property values¶

writeable()¶: Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets

class exetera.core.fields.IndexedStringMemField(session, chunksize=None)¶

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

property indexed¶

property indices¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of IndexedStringMemField

property values¶

writeable()¶

class exetera.core.fields.MemoryField(session)¶

Bases: exetera.core.abstract_types.Field

apply_filter(filter_to_apply, dstfld=None)¶

apply_index(index_to_apply, dstfld=None)¶

property chunksize¶

property dataframe¶

property indexed¶

property name¶

property timestamp¶

property valid¶

class exetera.core.fields.MemoryFieldArray(dtype)¶

Bases: object

clear()¶

complete()¶

property dtype¶

write(part)¶

write_part(part, move_mem=False)¶

class exetera.core.fields.NumericField(session, group, dataframe, write_enabled=False)¶

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

astype(dtype: str, casting='unsafe')¶

Convert the field data type to dtype parameter given.

Parameters

dtype – The new datatype, given as a str object. The dtype must be a subtype of np.number, e.g. int, float, etc.
casting – Similar to the casting parameter in numpy ndarray.astype, can be ‘no’, ‘equiv’, ‘safe’, ‘same_kind’, or ‘unsafe’.

Returns

The field with new datatype.
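
Since the casting parameter is described as similar to the one in numpy ndarray.astype, the behaviour of the different casting rules can be explored on a plain numpy array. The example below uses numpy directly rather than NumericField.astype, so it shows numpy's rules only; whether the field method behaves identically in every case is an assumption.

```python
import numpy as np

readings = np.array([1.9, -2.5, 300.0])    # float64

# 'unsafe' permits any conversion; float -> int truncates toward zero
as_ints = readings.astype("int32", casting="unsafe")

# 'safe' only permits conversions that preserve values, so float64 -> int32 is rejected
try:
    readings.astype("int32", casting="safe")
    safe_cast_allowed = True
except TypeError:
    safe_cast_allowed = False

# widening float32 -> float64 is accepted under 'safe'
widened = readings.astype("float32").astype("float64", casting="safe")
```

With the default of ‘unsafe’, lossy conversions succeed silently, so a stricter rule such as ‘safe’ or ‘same_kind’ is worth considering when data loss would be a problem.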

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

logical_not()¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of NumericField

writeable()¶

class exetera.core.fields.NumericMemField(session, nformat)¶

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

logical_not()¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of NumericMemField

writeable()¶

class exetera.core.fields.ReadOnlyFieldArray(field, dataset_name)¶

Bases: object

clear()¶

complete()¶

property dtype¶

write(part)¶

write_part(part)¶

class exetera.core.fields.ReadOnlyIndexedFieldArray(field, indices, values)¶

Bases: object

clear()¶

complete()¶

property dtype¶

write(part)¶

write_part(part)¶

class exetera.core.fields.TimestampField(session, group, dataframe, write_enabled=False)¶

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of TimestampField

writeable()¶

class exetera.core.fields.TimestampMemField(session)¶

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)¶

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters

filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)¶

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters

index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)¶

apply_spans_last(spans_to_apply, target=None, in_place=False)¶

apply_spans_max(spans_to_apply, target=None, in_place=False)¶

apply_spans_min(spans_to_apply, target=None, in_place=False)¶

create_like(group=None, name=None, timestamp=None)¶

property data¶

get_spans()¶

is_sorted()¶

isin(test_elements: Union[list, set, numpy.ndarray])¶

unique(return_index=False, return_inverse=False, return_counts=False)¶: Find the unique elements of TimestampMemField

writeable()¶

class exetera.core.fields.WriteableFieldArray(field, dataset_name)¶

Bases: object

clear()¶

complete()¶

property dtype¶

write(part)¶

write_part(part)¶

class exetera.core.fields.WriteableIndexedFieldArray(chunksize, indices, values)¶

Bases: object

clear()¶

complete()¶

property dtype¶

write(part)¶

write_part(part)¶

exetera.core.fields.argsort(field: exetera.core.abstract_types.Field, dtype: str = None)¶

exetera.core.fields.as_field(data, key=None)¶

exetera.core.fields.base_field_contructor(session, group, name, timestamp=None, chunksize=None)¶: Constructors 1) create the field (an HDF5 group), 2) add basic attributes such as chunksize, timestamp and field type, and 3) add the dataset to the field (HDF5 group) under the name ‘values’

exetera.core.fields.categorical_field_constructor(session, group, name, nformat, key, timestamp=None, chunksize=None)¶

exetera.core.fields.dtype_to_str(dtype)¶

exetera.core.fields.fixed_string_field_constructor(session, group, name, length, timestamp=None, chunksize=None)¶

exetera.core.fields.indexed_string_field_constructor(session, group, name, timestamp=None, chunksize=None)¶

exetera.core.fields.isin(field, test_elements)¶

exetera.core.fields.numeric_field_constructor(session, group, name, nformat, timestamp=None, chunksize=None)¶

exetera.core.fields.timestamp_field_constructor(session, group, name, timestamp=None, chunksize=None)¶

exetera.core.filtered_field module¶

class exetera.core.filtered_field.FilteredField(field, filter)¶: Bases: object

exetera.core.filtered_field.filtered_field(field, filter)¶

exetera.core.indexed_array module¶

class exetera.core.indexed_array.IndexedArray¶: Bases: object

exetera.core.journal module¶

exetera.core.journal.journal_table(session, schema, old_src, new_src, src_pk, result)¶

exetera.core.journal.journal_test_harness(session, schema, old_file, new_file, dest_file)¶

exetera.core.operations module¶

exetera.core.operations.apply_filter_to_index_values(index_filter, indices, values)¶

exetera.core.operations.apply_indices_to_index_values(indices_to_apply, indices, values)¶

exetera.core.operations.apply_spans_concat(spans, src_index, src_values, dest_index, dest_values, max_index_i, max_value_i, s_start)¶

exetera.core.operations.apply_spans_count(spans, dest_array=None)¶

exetera.core.operations.apply_spans_first(spans, src_array, dest_array=None)¶

exetera.core.operations.apply_spans_index_of_first(spans, dest_array=None)¶

exetera.core.operations.apply_spans_index_of_first_filter(spans, dest_array, filter_array)¶

exetera.core.operations.apply_spans_index_of_last(spans, dest_array=None)¶

exetera.core.operations.apply_spans_index_of_last_filter(spans, dest_array, filter_array)¶

exetera.core.operations.apply_spans_index_of_max(spans, src_array, dest_array=None)¶

exetera.core.operations.apply_spans_index_of_max_filter(spans, src_array, dest_array, filter_array)¶

exetera.core.operations.apply_spans_index_of_max_indexed(spans, src_indices, src_values, dest_array=None)¶

exetera.core.operations.apply_spans_index_of_min(spans, src_array, dest_array=None)¶

exetera.core.operations.apply_spans_index_of_min_filter(spans, src_array, dest_array, filter_array)¶

exetera.core.operations.apply_spans_index_of_min_indexed(spans, src_indices, src_values, dest_array=None)¶

exetera.core.operations.apply_spans_last(spans, src_array, dest_array=None)¶

exetera.core.operations.apply_spans_max(spans, src_array, dest_array=None)¶

exetera.core.operations.apply_spans_min(spans, src_array, dest_array=None)¶

exetera.core.operations.calculate_chunk_decomposition(s_start, s_end, indices, value_chunk_size, sub_chunks)¶

exetera.core.operations.categorical_transform(chunk, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)¶: Transform method for categorical importer in readerwriter.py

exetera.core.operations.check_if_sorted_for_multi_fields(fields_data)¶

Check if the input fields data is sorted. Note that fields_data should be treated as a group key

pre_row[j] < cur_row[j] means these two rows are sorted; move to the next row => i + 1
pre_row[j] = cur_row[j] means we need to check if the next element is sorted => j + 1
pre_row[j] > cur_row[j] means the input data is not sorted
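
The three cases above amount to a lexicographic comparison of consecutive rows. A minimal pure-Python sketch of that check follows; is_sorted_multi is a hypothetical name, fields_data is assumed to be a non-empty sequence of equal-length columns, and identical consecutive rows are assumed to count as sorted.

```python
def is_sorted_multi(fields_data):
    """Return True if the rows formed by zipping the columns are in non-decreasing order."""
    n_rows = len(fields_data[0])
    for i in range(1, n_rows):
        for column in fields_data:
            if column[i - 1] < column[i]:
                break              # rows i-1 and i are in order; move to the next row
            if column[i - 1] > column[i]:
                return False       # the input data is not sorted
            # equal in this column: compare the next column of the key
    return True

in_order = is_sorted_multi([[1, 1, 2], [5, 6, 0]])
out_of_order = is_sorted_multi([[1, 1, 2], [6, 5, 0]])
```

In the first call the tie on the first column between rows 0 and 1 is resolved by the second column (5 < 6); in the second call the same tie is resolved the other way, so the data is reported as unsorted.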

exetera.core.operations.chunked_copy(src_field, dest_field, chunksize=1048576)¶

exetera.core.operations.chunks(length, chunksize=1048576)¶

exetera.core.operations.compare_arrays(source[s1: s2], target[t1: t2])¶

exetera.core.operations.compare_indexed_rows_for_journalling(old_map, new_map, old_indices, old_values, new_indices, new_values, to_keep)¶

exetera.core.operations.compare_rows_for_journalling(old_map, new_map, old_field, new_field, to_keep)¶

exetera.core.operations.count_back(array)¶

This is a helper function that provides functionality specific to streaming ordered merges. It takes an array in sorted order and calculates a trimmed length that excludes the final sequence of equal values. Example:

[10, 20, 30, 40, 50] -> 4 ([10, 20, 30, 40])
[10, 20, 30, 40, 40] -> 3 ([10, 20, 30])
[10, 20, 30, 30, 30] -> 2 ([10, 20])
[10, 20, 20, 20, 20] -> 1 ([10])
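
The trimming rule can be reproduced with a short backwards scan. The function below is a sketch written to match the examples above, not the library's implementation; the name count_back_sketch and the handling of an empty input are assumptions.

```python
def count_back_sketch(array):
    """Length of `array` once its final run of equal values has been dropped."""
    if len(array) == 0:
        return 0                   # assumption: an empty input trims to length 0
    i = len(array) - 1
    # walk backwards while the previous element equals the current one
    while i > 0 and array[i - 1] == array[i]:
        i -= 1
    return i

trimmed = [count_back_sketch(a) for a in (
    [10, 20, 30, 40, 50],
    [10, 20, 30, 40, 40],
    [10, 20, 30, 30, 30],
    [10, 20, 20, 20, 20],
)]
```

Dropping the final run means a chunk boundary never splits a group of equal keys, which is what a streaming merge needs in order to treat each key's rows together.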

exetera.core.operations.data_iterator(data_field, chunksize=1048576)¶

exetera.core.operations.dtype_to_str(dtype)¶

exetera.core.operations.element_chunked_copy(src_elem, dest_elem, chunksize)¶

exetera.core.operations.first_trimmed_chunk(field, chunk_size)¶

exetera.core.operations.first_untrimmed_chunk(field, chunk_size)¶

exetera.core.operations.fixed_string_transform(column_inds, column_vals, column_offsets, col_idx, written_row_count, strlen, memory)¶: Transform method for fixed string importer in field_importer.py

exetera.core.operations.generate_ordered_map_to_inner_both_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)¶

exetera.core.operations.generate_ordered_map_to_inner_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

exetera.core.operations.generate_ordered_map_to_inner_left_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)¶

exetera.core.operations.generate_ordered_map_to_inner_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

exetera.core.operations.generate_ordered_map_to_inner_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)¶

This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.

Example:

left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]

i  j  op  r   lres  rres
0  0  <   0   0     INV
1  0  =   1   1     0
2  1  =   2   2     1
2  2      3   2     2
3  3      4   3     3
3  4      5   3     4
3  5      6   3     5
4  3      7   4     3
4  4      8   4     4
4  5      9   4     5
5  6      10  5     INV
6  6      11  6     INV

left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 0, 1, 2, 3, 4, 5, 3, 4, 5, INV, INV]

Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…

i and i_max are used to track the index of the left source
j and j_max are used to track the index of the right source

exetera.core.operations.generate_ordered_map_to_inner_right_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)¶

exetera.core.operations.generate_ordered_map_to_inner_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

exetera.core.operations.generate_ordered_map_to_inner_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

This function performs the most generic type of left to right mapping calculation in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping the left and right collections.

As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.

This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.

Please refer to the documentation for the _partial function to understand the role that the various finite state machine parameters play.

We have to make some adjustments to the finite state machine between calls to _partial:

if the call used all the left_ data, add the size of that data chunk to i_off
if the call used all of the right_ data, add the size of that data chunk to j_off
write the accumulated result_ data to the result field, and reset r to 0

exetera.core.operations.generate_ordered_map_to_left_both_unique(first, second, result, invalid)¶

exetera.core.operations.generate_ordered_map_to_left_both_unique_partial(left, right, r_result, invalid, j_off, i, j, r)¶

exetera.core.operations.generate_ordered_map_to_left_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

exetera.core.operations.generate_ordered_map_to_left_left_unique_partial(left, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r)¶

exetera.core.operations.generate_ordered_map_to_left_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

exetera.core.operations.generate_ordered_map_to_left_partial(left, i_max, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)¶

This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.

Example:

left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]

i  j  op  r   lres  rres
0  0  <   0   0     INV
1  0  =   1   1     0
2  1  =   2   2     1
2  2      3   2     2
3  3      4   3     3
3  4      5   3     4
3  5      6   3     5
4  3      7   4     3
4  4      8   4     4
4  5      9   4     5
5  6      10  5     INV
6  6      11  6     INV

left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 0, 1, 2, 3, 4, 5, 3, 4, 5, INV, INV]

Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…

i and i_max are used to track the index of the left source
j and j_max are used to track the index of the right source
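
To make the example concrete, the mapping can be reproduced with a plain Python loop over the two sorted keys. This is a sketch of the semantics only: it is not the njit-optimised, resumable state machine described above. The name ordered_map_to_left_sketch is hypothetical and INV = -1 is a stand-in for the caller-supplied invalid value. Each left row is paired with every right row that has the same key, or with INV if there is none, following the lres and rres columns of the table.

```python
INV = -1  # stand-in for the 'invalid' marker; the real value is supplied by the caller

def ordered_map_to_left_sketch(left, right):
    """Pair every left row with each equal-keyed right row, or with INV if none exists."""
    left_map, right_map = [], []
    j = 0
    for i, key in enumerate(left):
        # skip right keys that are smaller than the current left key
        while j < len(right) and right[j] < key:
            j += 1
        # emit one result row per matching right key; j stays at the start of the
        # run so that a repeated left key can scan the same run again
        jj = j
        while jj < len(right) and right[jj] == key:
            left_map.append(i)
            right_map.append(jj)
            jj += 1
        if jj == j:                # no right key matched this left row
            left_map.append(i)
            right_map.append(INV)
    return left_map, right_map

left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]
left_map, right_map = ordered_map_to_left_sketch(left, right)
```

The two repeated 40s on the left each pair with all three 40s on the right, which is where the six consecutive result rows for that key come from.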

exetera.core.operations.generate_ordered_map_to_left_remaining(i_max, l_result, r_result, i_off, i, r, invalid)¶

exetera.core.operations.generate_ordered_map_to_left_right_unique(first, second, result, invalid)¶

exetera.core.operations.generate_ordered_map_to_left_right_unique_partial(left, i_max, right, r_result, invalid, j_off, i, j, r)¶

exetera.core.operations.generate_ordered_map_to_left_right_unique_partial_old(d_j, left, right, left_to_right, invalid)¶: Returns: [0]: how many positions forward i moved [1]: how many positions forward j moved [2]: how many elements were written

exetera.core.operations.generate_ordered_map_to_left_right_unique_remaining(i_max, r_result, i, r, invalid)¶

exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed_old(left, right, left_to_right, invalid=-1, chunksize=1048576)¶

exetera.core.operations.generate_ordered_map_to_left_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶

This function performs the most generic type of left to right mapping calculation in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping the left and right collections.

As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.

This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.

Please refer to the documentation for the _partial function to understand the role that the various finite state machine parameters play.

We have to make some adjustments to the finite state machine between calls to _partial:

if the call used all the left_ data, add the size of that data chunk to i_off
if the call used all of the right_ data, add the size of that data chunk to j_off
write the accumulated result_ data to the result field, and reset r to 0
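
The division of labour between a streamed driver and its _partial function can be sketched in a few lines. To keep it short, the sketch below handles the simpler case in which the right key is unique, and it works on Python lists rather than Fields. All names (partial_sketch, streamed_sketch, INVALID) are hypothetical, and the code is a simplified illustration of the pattern, not the library's implementation.

```python
INVALID = -1  # stand-in for the 'invalid' value passed to the real functions

def partial_sketch(left, right, result, j_off, i, j, r):
    """Run the state machine until the left, right or result chunk is exhausted."""
    while i < len(left) and j < len(right) and r < len(result):
        if left[i] < right[j]:
            result[r] = INVALID        # no right key matches left[i]
            i += 1
            r += 1
        elif left[i] > right[j]:
            j += 1                     # the right key is behind; advance it
        else:
            result[r] = j + j_off      # global index of the matching right key
            i += 1
            r += 1
    return i, j, r

def streamed_sketch(left, right, chunksize=4):
    """Map each left row to its right row by feeding chunks to partial_sketch."""
    out = []
    result = [0] * chunksize           # reusable result buffer
    i_off = j_off = 0
    i = j = 0
    left_chunk = left[:chunksize]
    right_chunk = right[:chunksize]
    while len(left_chunk) > 0:
        if len(right_chunk) == 0:
            # right is used up: every remaining left row is unmatched
            out.extend([INVALID] * (len(left) - i_off - i))
            break
        i, j, r = partial_sketch(left_chunk, right_chunk, result, j_off, i, j, 0)
        out.extend(result[:r])         # flush the buffer; r restarts at 0 next call
        if i == len(left_chunk):       # left chunk consumed: advance i_off
            i_off += len(left_chunk)
            left_chunk = left[i_off:i_off + chunksize]
            i = 0
        if j == len(right_chunk):      # right chunk consumed: advance j_off
            j_off += len(right_chunk)
            right_chunk = right[j_off:j_off + chunksize]
            j = 0
    return out

mapping = streamed_sketch([10, 20, 30, 40, 40, 50, 50], [20, 30, 40, 60, 70], chunksize=2)
```

Because the state (i, j and the offsets) survives between calls, the result does not depend on the chunk size; only the number of calls to partial_sketch changes.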

exetera.core.operations.get_byte_map(string_map)¶: Getting byte indices and byte values from categorical key-value pair

exetera.core.operations.get_map_datatype_based_on_lengths(left_len, right_len)¶

exetera.core.operations.get_map_datatype_str_based_on_lengths(left_len, right_len)¶

exetera.core.operations.get_map_subchunks_based_on_index_lengths(map_, invalid, chunksize)¶

exetera.core.operations.get_next_chunk(start: int, chunk_size: int, field: exetera.core.abstract_types.Field)¶

This is a helper function that provides functionality specific to streaming ordered merges. It assumes that field is in sorted order.

This function is used to fetch chunks of memory from a field to be consumed by streaming merges. It first fetches the chunk of a given chunk size, or the size of the remaining memory, whichever is smaller. It then ‘trims’ that memory by removing the last sequence of equal values from the valid range.

Parameters

start – The start of the chunk to be returned
chunk_size – The size of the chunk to be considered. The returned chunk will always be shorter than this unless it is the final chunk of the field data
field – The field from which data should be fetched. This field must be in sorted order

Returns

A tuple representing the range (inclusive, exclusive) and a numpy ndarray containing the data. Note that this is typically longer than the range returned, as we do not trim the data for performance reasons.
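
The trimming behaviour described above can be sketched as follows. This is a minimal illustration, assuming a plain sorted list in place of a Field; `trimmed_chunk` is a hypothetical name and not part of ExeTera.

```python
def trimmed_chunk(data, start, chunk_size):
    """Return ((start, valid_end), chunk) for a sorted list.

    The final run of equal values is excluded from the valid range unless
    the chunk reaches the end of the data, so that a run is never split
    across two chunks. The returned chunk itself is left untrimmed.
    """
    end = min(start + chunk_size, len(data))
    chunk = data[start:end]
    if end == len(data):
        return (start, end), chunk
    valid_end = end
    while valid_end > start and data[valid_end - 1] == data[end - 1]:
        valid_end -= 1
    return (start, valid_end), chunk


data = [1, 1, 2, 2, 2, 3, 3, 4]
ranges = []
start = 0
while start < len(data):
    (lo, hi), chunk = trimmed_chunk(data, start, 4)
    ranges.append((lo, hi))
    # In this sketch a chunk made of a single run would give hi == lo;
    # the sample data avoids that case.
    start = hi
```

With a chunk size of 4, the valid ranges are (0, 2), (2, 5) and (5, 8): each one ends on a boundary between distinct values.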

exetera.core.operations.get_spans_for_field(ndarray)¶

exetera.core.operations.get_valid_value_extents(chunk, start, end, invalid=- 1)¶

exetera.core.operations.indexed_string_unique(indices, values, unique_result, unique_index, unique_inverse, unique_counts)¶: Find the unique elements for indexed string field using njit function.

exetera.core.operations.is_ordered(field)¶

exetera.core.operations.isin_for_indexed_string_field(test_elements, indices, values)¶

exetera.core.operations.isin_indexed_string_speedup(test_elements, indices, values)¶

exetera.core.operations.leaky_categorical_transform(chunk, freetext_indices, freetext_values, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)¶: Transform method for categorical importer in readerwriter.py

exetera.core.operations.map_valid(data_field, map_field, result=None, invalid=- 1)¶

exetera.core.operations.merge_entries_segment(i_start, cur_old_start, old_map, new_map, to_keep, old_src, new_src, dest)¶

Parameters

i_start – the initial value to apply to ‘i’
cur_old_start – the initial value to apply to ‘cur_old’
old_map – the map (in i-space) for the existing records
new_map – the map (in i-space) for the new records
to_keep – the flags (in i-space) indicating whether the new record should be kept
old_src – the source for the existing records
new_src – the source for the new records
dest – the sink for the merged sources

Returns

exetera.core.operations.merge_indexed_journalled_entries(old_map, new_map, to_keep, old_src_inds, old_src_vals, new_src_inds, new_src_vals, dest_inds, dest_vals)¶

exetera.core.operations.merge_indexed_journalled_entries_count(old_map, new_map, to_keep, old_src_inds, new_src_inds)¶

exetera.core.operations.merge_journalled_entries(old_map, new_map, to_keep, old_src, new_src, dest)¶

exetera.core.operations.next_chunk(current: int, length: int, desired: int)¶

This is a helper function that can be used whenever you want to access a large sequence of data in chunks. It simply carries out the calculation that returns the extents of the next chunk, taking into account the length of the sequence. The sequence itself is not required here, only the length.

Parameters

current – the starting point of the chunk
length – the length of the sequence being chunked
desired – the requested length of the chunk

Returns

A tuple of the chunk extents. The first value is inclusive; the second is exclusive

exetera.core.operations.next_map_subchunk(map_, sm, invalid, chunksize)¶

exetera.core.operations.next_trimmed_chunk(field, chunk, chunk_size)¶

exetera.core.operations.next_untrimmed_chunk(field, chunk, chunk_size)¶

exetera.core.operations.numeric_bool_transform(elements, validity, column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, field_name)¶: Transform method for numeric importer (bool) in readerwriter.py

exetera.core.operations.ordered_generate_journalling_indices(old, new)¶

exetera.core.operations.ordered_get_last_as_filter(field)¶

exetera.core.operations.ordered_inner_map(left, right, left_to_inner, right_to_inner)¶

exetera.core.operations.ordered_inner_map_both_unique(left, right, left_to_inner, right_to_inner)¶

exetera.core.operations.ordered_inner_map_left_unique(left, right, left_to_inner, right_to_inner)¶

exetera.core.operations.ordered_inner_map_left_unique_partial(d_i, d_j, left, right, left_to_inner, right_to_inner)¶

Returns

A tuple: [0] how many positions forward i moved, [1] how many positions forward j moved, [2] how many elements were written

exetera.core.operations.ordered_inner_map_left_unique_streamed(left, right, left_to_inner, right_to_inner, chunksize=1048576)¶

exetera.core.operations.ordered_inner_map_result_size(left, right)¶

exetera.core.operations.ordered_left_map_result_size(left, right)¶

exetera.core.operations.ordered_map_valid_indexed_partial(sm_values, sm_start, sm_end, indices, i_start, i_max, values, mv_start, result_indices, result_values, invalid, sm, ri, rv, ri_accum)¶

exetera.core.operations.ordered_map_valid_indexed_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576, value_factor=8)¶

exetera.core.operations.ordered_map_valid_partial(values, map_values, sm_start, sm_end, d_start, result_data, invalid, invalid_value)¶

exetera.core.operations.ordered_map_valid_partial_old(d, data_field, map_field, result, invalid)¶

exetera.core.operations.ordered_map_valid_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)¶

for each map chunk:
  calculate sub-chunks based on indices
  for each sub-chunk:
    map indices for sub-chunk

exetera.core.operations.ordered_map_valid_stream_old(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)¶

exetera.core.operations.ordered_outer_map_result_size_both_unique(left, right)¶

exetera.core.operations.raiseNumericException(exception_message, exception_args)¶

exetera.core.operations.safe_map_indexed_values(data_indices, data_values, map_field, map_filter, empty_value=None)¶

exetera.core.operations.safe_map_values(data_field, map_field, map_filter, empty_value=None)¶

exetera.core.operations.str_to_dtype(str_dtype)¶

exetera.core.operations.streaming_sort_merge(src_index_f, src_value_f, tgt_index_f, tgt_value_f, segment_length, chunk_length)¶

exetera.core.operations.streaming_sort_partial(in_chunk_indices, in_chunk_lengths, src_value_chunks, src_index_chunks, dest_value_chunk, dest_index_chunk)¶

exetera.core.operations.transform_float(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)¶: Transform float method for numeric importer in field_importer.py

exetera.core.operations.transform_int(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)¶: Transform int method for numeric importer in field_importer.py

exetera.core.operations.transform_to_values(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶: Transform method for byte data from np.int to np.bytes_

exetera.core.operations.unique_for_indexed_string(indices, values, return_index, return_inverse, return_counts)¶: Find the unique elements for indexed string field.

exetera.core.persistence module¶

class exetera.core.persistence.DataStore(chunksize=1048576, timestamp='2022-04-05 17:12:36.942412+00:00')¶

Bases: object

aggregate_count(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶

aggregate_custom(predicate, fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶

aggregate_first(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶

aggregate_last(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶

aggregate_max(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶

aggregate_min(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶

apply_filter(filter_to_apply, reader, writer=None)¶

apply_indices(indices_to_apply, reader, writer=None)¶

apply_sort(index, reader, writer=None)¶

apply_spans_concat(spans, reader, writer)¶

apply_spans_count(spans, _, writer=None)¶

apply_spans_first(spans, reader, writer)¶

apply_spans_index_of_first(spans, writer=None)¶

apply_spans_index_of_last(spans, writer=None)¶

apply_spans_index_of_max(spans, reader, writer=None)¶

apply_spans_index_of_min(spans, reader, writer=None)¶

apply_spans_last(spans, reader, writer)¶

apply_spans_max(spans, reader, writer)¶

apply_spans_min(spans, reader, writer)¶

chunks(length, chunksize=None)¶

dataset_sort(readers, index=None)¶

distinct(field=None, fields=None, filter=None)¶

get_categorical_writer(group, name, categories, timestamp=None, writemode='write')¶

get_compatible_writer(field, dest_group, dest_name, timestamp=None, writemode='write')¶

get_existing_writer(field, timestamp=None)¶

get_fixed_string_writer(group, name, width, timestamp=None, writemode='write')¶

get_index(target, foreign_key, destination=None)¶

get_indexed_string_writer(group, name, timestamp=None, writemode='write')¶

get_numeric_writer(group, name, dtype, timestamp=None, writemode='write')¶

get_or_create_group(group, name)¶

get_reader(field)¶

get_shared_index(keys)¶

get_spans(field=None, fields=None)¶

get_timestamp_writer(group, name, timestamp=None, writemode='write')¶

get_trash_group(group)¶

index_spans(spans)¶

join(destination_pkey, fkey_indices, values_to_join, writer=None, fkey_index_spans=None)¶

predicate_and_join(predicate, destination_pkey, fkey_indices, reader=None, writer=None, fkey_index_spans=None)¶

process(inputs, outputs, predicate)¶

set_timestamp(timestamp='2022-04-05 17:12:36.942424+00:00')¶

sort_on(src_group, dest_group, keys, fields=None, timestamp=None, write_mode='write')¶

temp_filename()¶

exetera.core.persistence.dataset_merge_sort(group, index, fields)¶

exetera.core.persistence.filter_duplicate_fields(field)¶

exetera.core.persistence.filtered_iterator(values, filter, default=nan)¶

exetera.core.persistence.foreign_key_is_in_primary_key(primary_key, foreign_key)¶

exetera.core.persistence.temp_dataset()¶

exetera.core.persistence.timestamp_to_date(values)¶

exetera.core.persistence.try_str_to_bool(value, invalid=0)¶

exetera.core.persistence.try_str_to_float(value, invalid=0)¶

exetera.core.persistence.try_str_to_float_to_int(value, invalid=0)¶

exetera.core.persistence.try_str_to_int(value, invalid=0)¶

exetera.core.readerwriter module¶

class exetera.core.readerwriter.CategoricalImporter(datastore, group, name, categories, timestamp=None, write_mode='write')¶

Bases: object

chunk_factory(length)¶

flush()¶

import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶

write(values)¶

write_part(values)¶

write_strings(values)¶

class exetera.core.readerwriter.CategoricalReader(datastore, field)¶

Bases: exetera.core.readerwriter.Reader

dtype()¶

get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶

class exetera.core.readerwriter.CategoricalWriter(datastore, group, name, categories, timestamp=None, write_mode='write')¶

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)¶

flush()¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.DateTimeImporter(datastore, group, name, create_day_field=False, optional=True, timestamp=None, write_mode='write')¶

Bases: object

chunk_factory(length)¶

flush()¶

import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.DateTimeWriter(datastore, group, name, timestamp=None, write_mode='write')¶

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)¶

flush()¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.DateWriter(datastore, group, name, timestamp=None, write_mode='write')¶

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)¶

flush()¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.FixedStringReader(datastore, field)¶

Bases: exetera.core.readerwriter.Reader

dtype()¶

get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶

class exetera.core.readerwriter.FixedStringWriter(datastore, group, name, strlen, timestamp=None, write_mode='write')¶

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)¶

flush()¶

import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.IndexedStringReader(datastore, field)¶

Bases: exetera.core.readerwriter.Reader

dtype()¶

get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶

sort(index, writer)¶

class exetera.core.readerwriter.IndexedStringWriter(datastore, group, name, timestamp=None, write_mode='write')¶

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)¶

flush()¶

import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶

write(values)¶

write_part(values)¶

Writes a list of strings in indexed string form to a field.

Parameters: values – a list of utf8 strings
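
As an illustration of what "indexed string form" means, here is a minimal sketch. It assumes the layout is a list of offsets plus a single flat byte sequence, with one more offset than there are strings; `to_indexed_string` is a hypothetical helper, not an ExeTera function.

```python
def to_indexed_string(strings):
    """Encode a list of strings as (indices, values).

    Assumed layout: entry k occupies values[indices[k]:indices[k + 1]].
    """
    indices = [0]
    values = bytearray()
    for s in strings:
        values.extend(s.encode("utf-8"))
        indices.append(len(values))
    return indices, bytes(values)


indices, values = to_indexed_string(["a", "bb", "", "ccc"])
third = values[indices[3]:indices[4]].decode("utf-8")
```

An empty string simply repeats the previous offset, so variable-length entries need no padding.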

write_part_raw(index, values)¶

write_raw(index, values)¶

class exetera.core.readerwriter.LeakyCategoricalImporter(datastore, group, name, categories, out_of_range, timestamp=None, write_mode='write')¶

Bases: object

chunk_factory(length)¶

flush()¶

import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.NumericImporter(datastore, group, name, nformat, parser, invalid_value=0, validation_mode='allow_empty', create_flag_field=True, flag_field_suffix='_valid', timestamp=None, write_mode='write')¶

Bases: object

chunk_factory(length)¶

flush()¶

import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶

write(values)¶

write_part(values)¶

Given a list of strings, parse the strings and write the parsed values. Values that cannot be parsed are written out as zero for the values, and zero for the flags to indicate that that entry is not valid.

Parameters: values – a list of strings to be parsed
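
The parse-and-flag behaviour described above can be sketched as follows. This is an assumption-laden illustration rather than the importer itself: `parse_with_flags` is a hypothetical name, and `float` stands in for whatever parser is supplied.

```python
def parse_with_flags(values, parser=float, invalid_value=0):
    """Parse each string; on failure store invalid_value and a flag of 0."""
    parsed, flags = [], []
    for text in values:
        try:
            parsed.append(parser(text))
            flags.append(1)
        except ValueError:
            parsed.append(invalid_value)
            flags.append(0)
    return parsed, flags


parsed, flags = parse_with_flags(["1.5", "", "abc", "42"])
```

Keeping the flags alongside the values lets a genuine zero be told apart from an entry that failed to parse.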

class exetera.core.readerwriter.NumericReader(datastore, field)¶

Bases: exetera.core.readerwriter.Reader

dtype()¶

get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶

class exetera.core.readerwriter.NumericWriter(datastore, group, name, nformat, timestamp=None, write_mode='write')¶

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)¶

flush()¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.OptionalDateImporter(datastore, group, name, create_day_field=False, optional=True, timestamp=None, write_mode='write')¶

Bases: object

chunk_factory(length)¶

flush()¶

import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.Reader(field)¶: Bases: object

class exetera.core.readerwriter.TimestampReader(datastore, field)¶

Bases: exetera.core.readerwriter.Reader

dtype()¶

get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶

class exetera.core.readerwriter.TimestampWriter(datastore, group, name, timestamp=None, write_mode='write')¶

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)¶

flush()¶

write(values)¶

write_part(values)¶

class exetera.core.readerwriter.Writer(datastore, group, name, write_mode, attributes)¶

Bases: object

property chunksize¶

flush()¶

exetera.core.regression module¶

exetera.core.regression.check_row(exp_ds, exp_index, act_ds, act_index, keys, custom_checks)¶

exetera.core.regression.datetime_compare_to_secs(value1, value2)¶

exetera.core.regression.na_compare(value1, value2)¶

exetera.core.regression.na_or_value(value)¶

exetera.core.session module¶

class exetera.core.session.Session(chunksize: int = 1048576, timestamp: str = '2022-04-05 17:12:36.959786+00:00')¶

Bases: exetera.core.abstract_types.AbstractSession

Session is the top-level object that is used to create and open ExeTera Datasets. It also provides operations that can be performed on Fields. For a more detailed explanation of Session and examples of its usage, please refer to https://github.com/KCL-BMEIS/ExeTera/wiki/Session-API

Parameters

chunksize – Change the default chunksize that fields created with this dataset use. Note this is a hint parameter and future versions of Session may choose to ignore it if it is no longer required. In general, it should only be changed for testing.
timestamp – Set the official timestamp for the Session’s creation rather than taking the current date/time.

aggregate_count(index, dest=None)¶

Finds the number of entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Result: 3     2   1 1 2   3

Parameters

index – A numpy array or Field containing the index that defines the ranges over which count is applied.
dest – If set, a Field to which the resulting counts are written

Returns

A numpy array containing the resulting values
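
The example above treats each run of contiguous equal values as a sub-group. A minimal pure-Python sketch of that counting, using a hypothetical helper `count_runs` rather than the Session method:

```python
from itertools import groupby


def count_runs(index):
    """Count the entries in each run of contiguous equal values."""
    return [len(list(run)) for _, run in groupby(index)]


counts = count_runs(list("aaabbxaccddd"))
```

Note that the second run of 'a' is counted separately from the first, since only adjacent equal values are grouped.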

aggregate_custom(predicate, index, target=None, dest=None)¶

aggregate_first(index, target=None, dest=None)¶

Finds the first entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 1     4   6 7 8   0

Parameters

index – A numpy array or Field containing the index that defines the ranges over which first is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_last(index, target=None, dest=None)¶

Finds the last entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 3     5   6 7 9   2

Parameters

index – A numpy array or Field containing the index that defines the ranges over which last is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values
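
The first and last aggregations can be sketched together in plain Python. This is an illustration of the semantics shown in the examples, using a hypothetical helper `first_and_last`; it is not the Session implementation.

```python
from itertools import groupby


def first_and_last(index, target):
    """Return the first and last target value for each run of equal index values."""
    firsts, lasts = [], []
    for _, run in groupby(zip(index, target), key=lambda pair: pair[0]):
        values = [value for _, value in run]
        firsts.append(values[0])
        lasts.append(values[-1])
    return firsts, lasts


index = list("aaabbxaccddd")
target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
firsts, lasts = first_and_last(index, target)
```

For runs of length one, such as 'x', the first and last values coincide.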

aggregate_max(index, target=None, dest=None)¶

Finds the maximum value within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 3     5   6 7 9   2

Parameters

index – A numpy array or Field containing the index that defines the ranges over which max is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_min(index, target=None, dest=None)¶

Finds the minimum value within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 1     4   6 7 8   0

Parameters

index – A numpy array or Field containing the index that defines the ranges over which min is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

apply_filter(filter_to_apply, src, dest=None)¶

Apply a filter to a src field. The filtered field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.

Parameters

filter_to_apply – the filter to be applied to the source field, an array of boolean
src – the field to be filtered
dest – optional - a field to write the filtered data to

Returns

the filtered values
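
A minimal sketch of both cases, in plain Python. The helpers `filter_values` and `filter_indexed_string` are hypothetical, and the indexed string layout (a list of offsets plus a flat byte sequence, where entry k occupies values[indices[k]:indices[k + 1]]) is an assumption made for illustration.

```python
def filter_values(flags, values):
    """Keep the entries whose flag is True."""
    return [value for keep, value in zip(flags, values) if keep]


def filter_indexed_string(flags, indices, values):
    """Filter an indexed string, rebuilding both the offsets and the bytes."""
    out_indices = [0]
    out_values = bytearray()
    for k, keep in enumerate(flags):
        if keep:
            out_values.extend(values[indices[k]:indices[k + 1]])
            out_indices.append(len(out_values))
    return out_indices, bytes(out_values)


kept = filter_values([True, False, True, True], [10, 20, 30, 40])
kept_indices, kept_values = filter_indexed_string(
    [True, False, True], [0, 1, 3, 6], b"abbccc")
```

The indexed case returns two results because removing an entry shifts every later offset, so the offsets cannot be filtered independently of the bytes.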

apply_index(index_to_apply, src, dest=None)¶

Apply an index to a src field. The indexed field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.

Parameters

index_to_apply – the index to be applied to the source field, must be one of Group, Field, or ndarray
src – the field to be indexed
dest – optional - a field to write the indexed data to

Returns

the indexed values

apply_spans_concat(spans, target, dest, src_chunksize=None, dest_chunksize=None, chunksize_mult=None)¶

apply_spans_count(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the number of entries within each span.

Parameters

spans – the numpy array of spans to be applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_first(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the first entry within each span on a target field.

Parameters

spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_first(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the index of the first entry within each span.

Parameters

spans – the numpy array of spans to be applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_last(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the index of the last entry within each span.

Parameters

spans – the numpy array of spans to be applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the index of the maximum value within each span on a target field.

Parameters

spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the index of the minimum value within each span on a target field.

Parameters

spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_last(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the last entry within each span on a target field.

Parameters

spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the maximum value within each span on a target field.

Parameters

spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶

Finds the minimum value within each span on a target field.

Parameters

spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values
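
The apply_spans_* family can be pictured as one reduction applied to each span in turn. A minimal sketch, assuming spans are given as boundaries where the ith span runs from spans[i] (inclusive) to spans[i + 1] (exclusive); `apply_spans` is a hypothetical helper, not the Session API.

```python
def apply_spans(spans, target, reducer):
    """Apply reducer to target[start:end] for each consecutive pair in spans."""
    return [reducer(target[spans[k]:spans[k + 1]])
            for k in range(len(spans) - 1)]


spans = [0, 3, 5, 6, 7, 9, 12]
target = [1, 2, 3, 5, 4, 6, 7, 8, 9, 2, 1, 0]

span_min = apply_spans(spans, target, min)
span_max = apply_spans(spans, target, max)
span_first = apply_spans(spans, target, lambda run: run[0])
span_count = apply_spans(spans, target, len)
```

Computing the spans once and reusing them for several reductions avoids rescanning the index for each aggregate.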

chunks(length: int, chunksize: Optional[int] = None)¶

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

‘chunks’ is a convenience method that, given an overall length and a chunksize, will yield a set of ranges for the chunks in question, i.e. chunks(1048576, 500000) -> (0, 500000), (500000, 1000000), (1000000, 1048576)

Parameters

length – The range to be split into chunks
chunksize – Optional parameter detailing the size of each chunk. If not set, the chunksize that the Session was initialized with is used.
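
The range calculation can be sketched as a small generator. `chunk_ranges` is a hypothetical stand-in written for illustration; it is not the Session method.

```python
def chunk_ranges(length, chunksize):
    """Yield (start, end) ranges covering [0, length) in steps of chunksize."""
    start = 0
    while start < length:
        end = min(start + chunksize, length)
        yield start, end
        start = end


ranges = list(chunk_ranges(1048576, 500000))
```

Each range is inclusive at the start and exclusive at the end, and the final range is shortened to fit the overall length.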

close()¶

Close all open datasets.

Returns: None

close_dataset(name: str)¶

Close the dataset with the given name. If there is no dataset with that name, do nothing.

Parameters: name – The name of the dataset to be closed
Returns: None

create_categorical(group, name, nformat, key, timestamp=None, chunksize=None)¶

Create a categorical field in the given DataFrame with the given name. This function also takes a numerical format for the numeric representation of the categories, and a key that maps numeric values to their string descriptions.

Parameters

group – The group in which the new field should be created
name – The name of the new field
nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64). It is recommended to use ‘int8’.
key – A dictionary that maps numerical values to their string representations
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_fixed_string(group, name, length, timestamp=None, chunksize=None)¶

Create a fixed string field in the given DataFrame, given name, and given max string length per entry.

Parameters

group – The group in which the new field should be created
name – The name of the new field
length – The maximum length in bytes that each entry can have.
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_indexed_string(group, name, timestamp=None, chunksize=None)¶

Create an indexed string field in the given DataFrame with the given name.

Parameters

group – The group in which the new field should be created
name – The name of the new field
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_like(field, dest_group, dest_name, timestamp=None, chunksize=None)¶

Create a field of the same type as an existing field, in the location and with the name provided.

Example:

with Session() as s:
  ...
  a = s.get(table_1['a'])
  b = s.create_like(a, table_2, 'a_times_2')
  b.data.write(a.data[:] * 2)

Parameters

field – The Field whose type is to be copied
dest_group – The group in which the new field should be created
dest_name – The name of the new field

create_numeric(group, name, nformat, timestamp=None, chunksize=None)¶

Create a numeric field in the given DataFrame with the given name.

Parameters

group – The group in which the new field should be created
name – The name of the new field
nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64). It is recommended to avoid uint64 as certain operations in numpy cause conversions to floating point values.
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_timestamp(group, name, timestamp=None, chunksize=None)¶: Create a timestamp field in the given group with the given name.

dataset_sort_index(sort_indices, index=None)¶

Generate a sorted index based on a set of fields upon which to sort and an optional index to apply to the sort_indices.

Parameters

sort_indices – a tuple or list of indices that determine the sorted order
index – optional - the index by which the initial field should be permuted

Returns

the resulting index that can be used to permute unsorted fields
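
A minimal sketch of building a sort index from several keys and then using it to permute fields. It assumes the first key is the primary sort key and that ties keep their original order; `sort_index` and `apply_index` are hypothetical helpers written for illustration.

```python
def sort_index(sort_keys, index=None):
    """Return positions ordered by the given keys, first key most significant."""
    n = len(sort_keys[0])
    positions = list(range(n)) if index is None else list(index)
    # sorted() is stable, so entries with equal keys keep their relative order.
    return sorted(positions, key=lambda i: tuple(key[i] for key in sort_keys))


def apply_index(index, src):
    """Permute src by the given index."""
    return [src[i] for i in index]


ids = [2, 1, 2, 1]
days = [5, 9, 3, 4]
order = sort_index((ids, days))
sorted_ids = apply_index(order, ids)
sorted_days = apply_index(order, days)
```

The same index can be applied to every field of a table, which keeps rows aligned after sorting.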

distinct(field=None, fields=None, filter=None)¶

get(field: Union[exetera.core.abstract_types.Field, h5py._hl.group.Group])¶

Get a Field from a h5py Group.

Example:

# this code for context
with Session() as s:

  # open a dataset about wildlife
  src = s.open_dataset("/my/wildlife/dataset.hdf5", "r", "src")

  # fetch the group containing bird data
  birds = src['birds']

  # get the bird decibel field
  bird_decibels = s.get(birds['decibels'])

Parameters: field – The Field or Group object to retrieve.

get_dataset(name: str)¶

Get the dataset with the given name. If there is no dataset with that name, raise a KeyError indicating that the dataset with that name is not present.

Parameters: name – Name of the dataset to be fetched. This is the name that was given to it when it was opened through open_dataset().
Returns: Dataset with that name.

get_index(target, foreign_key, destination=None)¶

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please make use of DataFrame.merge functionality instead. This method can be emulated by adding an index (via np.arange) to a dataframe, performing a merge and then fetching the mapped index field.

‘get_index’ maps a primary key (‘target’) into the space of a foreign key (‘foreign_key’).

get_or_create_group(group: Union[h5py._hl.group.Group, h5py._hl.files.File], name: str)¶: Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

get_shared_index(keys: Tuple[numpy.ndarray])¶

Create a shared index based on a tuple of numpy arrays containing keys. This function generates the sorted union of a tuple of key fields and then maps the individual arrays to their corresponding indices in the sorted union.

Parameters: keys – a tuple of groups, fields or ndarrays whose contents represent keys

Example:

key_1 = ['a', 'b', 'e', 'g', 'i']
key_2 = ['b', 'b', 'c', 'c', 'e', 'g', 'j']
key_3 = ['a', 'c', 'd', 'e', 'g', 'h', 'h', 'i']

sorted_union = ['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j']

key_1_index = [0, 1, 4, 5, 7]
key_2_index = [1, 1, 2, 2, 4, 5, 8]
key_3_index = [0, 2, 3, 4, 5, 6, 6, 7]
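
The example above can be reproduced with a short pure-Python sketch. `shared_index` is a hypothetical helper that works on lists, not the Session method.

```python
def shared_index(keys):
    """Map each key sequence to positions in the sorted union of all keys."""
    sorted_union = sorted(set().union(*keys))
    position = {value: i for i, value in enumerate(sorted_union)}
    return sorted_union, tuple([position[v] for v in key] for key in keys)


key_1 = ['a', 'b', 'e', 'g', 'i']
key_2 = ['b', 'b', 'c', 'c', 'e', 'g', 'j']
key_3 = ['a', 'c', 'd', 'e', 'g', 'h', 'h', 'i']

sorted_union, (key_1_index, key_2_index, key_3_index) = shared_index(
    (key_1, key_2, key_3))
```

Once every key is expressed in the same integer space, the arrays can be compared or joined without further string comparisons.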

get_spans(field: Union[exetera.core.abstract_types.Field, numpy.ndarray] = None, dest: exetera.core.abstract_types.Field = None, **kwargs)¶

Calculate a set of spans that indicate contiguous equal values. The entries in the result array correspond to the inclusive start and exclusive end of the span (the ith span is represented by element i and element i+1 of the result array). The last entry of the result array is the length of the source field.

Only one of ‘field’ or ‘fields’ may be set. If ‘fields’ is used and more than one field specified, the fields are effectively zipped and the check for spans is carried out on each corresponding tuple in the zipped field.

Example:

field: [1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2]
result: [0, 1, 3, 6, 7, 10, 15]

Parameters

field – A Field or numpy array to be evaluated for spans
dest – A destination Field to store the result
**kwargs – See below. For parameters set in both argument and kwargs, use kwargs

Keyword Arguments

field – Same as the field parameter, for callers who pass field as a keyword
fields – A tuple of Fields or tuple of numpy arrays to be evaluated for spans
dest – Same as the dest parameter, for callers who pass dest as a keyword

Returns

The resulting set of spans as a numpy array
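
A minimal sketch of the span calculation on plain lists, including the zipped multi-field case. `spans_of` is a hypothetical helper and assumes a non-empty input.

```python
def spans_of(values):
    """Return span boundaries for runs of contiguous equal values."""
    spans = [0]
    for k in range(1, len(values)):
        if values[k] != values[k - 1]:
            spans.append(k)
    spans.append(len(values))
    return spans


field = [1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2]
result = spans_of(field)

# Several fields: zip them and look for runs of equal tuples.
field_a = [1, 1, 1, 2]
field_b = [0, 0, 1, 1]
zipped_result = spans_of(list(zip(field_a, field_b)))
```

In the zipped case a new span starts whenever any one of the fields changes value.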

join(destination_pkey, fkey_indices, values_to_join, writer=None, fkey_index_spans=None)¶: This method is due for removal and should not be used. Please use the merge or ordered_merge functions instead.

list_datasets()¶

List the open datasets for this Session object. This is returned as a tuple of strings rather than the datasets themselves. The individual datasets can be fetched using get_dataset().

Example:

names = s.list_datasets()
datasets = [s.get_dataset(n) for n in names]

Returns: A tuple containing the names of the currently open datasets for this Session object

merge_inner(left_on, right_on, left_fields=None, left_writers=None, right_fields=None, right_writers=None)¶

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style inner join on left_fields, outputting the result to left_writers, if set.

Parameters

left_on – The key to join on for the left hand side
right_on – The key to join on for the right hand side
left_fields – The fields to be mapped from left to inner
left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.
right_fields – The fields to be mapped from right to inner
right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.
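
To make the database-style inner join concrete, the following sketch computes the matching row pairs with a dictionary lookup and then uses them to map one field from each side. It is an illustration of the join semantics only, not ExeTera's code: inner_join_indices and the sample arrays are hypothetical, and the row order produced here is not claimed to match the order that merge_inner produces.

```python
import numpy as np

def inner_join_indices(left_keys, right_keys):
    """Return (left_idx, right_idx) arrays of matching row pairs.

    Hypothetical helper for illustration only; not part of ExeTera.
    """
    # Index the right-hand rows by key; a key may occur more than once.
    right_rows = {}
    for j, key in enumerate(right_keys):
        right_rows.setdefault(key, []).append(j)
    left_idx, right_idx = [], []
    for i, key in enumerate(left_keys):
        # Rows whose key has no partner on the other side are dropped.
        for j in right_rows.get(key, ()):
            left_idx.append(i)
            right_idx.append(j)
    return (np.array(left_idx, dtype=np.int64),
            np.array(right_idx, dtype=np.int64))

left_on = [10, 20, 30]
right_on = [20, 20, 40, 10]
left_values = np.array(["a", "b", "c"])
right_values = np.array([1.5, 2.5, 3.5, 4.5])

li, ri = inner_join_indices(left_on, right_on)
joined_left = left_values[li]    # left field mapped into the joined space
joined_right = right_values[ri]  # right field mapped into the joined space
```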

merge_left(left_on, right_on, right_fields=(), right_writers=None)¶

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style left join on right_fields, outputting the result to right_writers, if set.

Parameters

left_on – The key to join on for the left hand side
right_on – The key to join on for the right hand side
right_fields – The fields to be mapped from right to left
right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

merge_right(left_on, right_on, left_fields=(), left_writers=None)¶

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style right join on left_fields, outputting the result to left_writers, if set.

Parameters

left_on – The key to join on for the left hand side
right_on – The key to join on for the right hand side
left_fields – The fields to be mapped from left to right
left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

open_dataset(dataset_path: Union[str, IO[bytes]], mode: str, name: str)¶

Open a dataset with the given access mode.

Parameters

dataset_path – the path to the dataset
mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.
name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling get_dataset().

Returns

The top-level dataset object

ordered_merge_inner(left_on, right_on, left_field_sources=(), left_field_sinks=None, right_field_sources=(), right_field_sinks=None, left_unique=False, right_unique=False)¶

Generate the results of an inner join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters

left_on – the group/field/numpy array that contains the left key values
right_on – the group/field/numpy array that contains the right key values
right_to_left_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge
right_field_sources – a tuple of group/fields/numpy arrays that contain the fields to be joined
right_field_sinks – optional - a tuple of group/fields/numpy arrays that the mapped fields should be written to
left_unique – a hint to indicate whether the ‘left_on’ field contains unique values
right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If right_field_sinks is not set, a tuple of the output fields is returned
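
An ordered merge relies on both key fields already being sorted, which allows the join to be computed in a single forward pass. The sketch below shows that idea for the simplest case, where both sides hold unique keys (the situation that left_unique=True and right_unique=True hint at). The function name is hypothetical and this is not ExeTera's implementation.

```python
def ordered_inner_indices(left_keys, right_keys):
    """Two-pointer inner merge of two sorted sequences of unique keys.

    Hypothetical helper for illustration only; not part of ExeTera.
    """
    i = j = 0
    left_idx, right_idx = [], []
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] < right_keys[j]:
            i += 1  # left key has no partner; skip it
        elif left_keys[i] > right_keys[j]:
            j += 1  # right key has no partner; skip it
        else:
            # Keys match: record the pair and advance both sides.
            left_idx.append(i)
            right_idx.append(j)
            i += 1
            j += 1
    return left_idx, right_idx

left_idx, right_idx = ordered_inner_indices([1, 3, 5, 7], [3, 4, 7, 9])
print(left_idx, right_idx)
```

Because each pointer only moves forward, the keys can be consumed in chunks, which is what makes a streaming merge possible.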

ordered_merge_left(left_on, right_on, right_field_sources=(), left_field_sinks=None, left_to_right_map=None, left_unique=False, right_unique=False)¶

Generate the results of a left join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘left_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to left_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters

left_on – the group/field/numpy array that contains the left key values
right_on – the group/field/numpy array that contains the right key values
left_to_right_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge
right_field_sources – a tuple of group/fields/numpy arrays that contain the fields to be joined
left_field_sinks – optional - a tuple of group/fields/numpy arrays that the mapped fields should be written to
left_unique – a hint to indicate whether the ‘left_on’ field contains unique values
right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If left_field_sinks is not set, a tuple of the output fields is returned
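
The following sketch illustrates what a left-to-right map might look like for sorted keys when the right-hand keys are unique and the left-hand keys may repeat. It is an assumption-laden illustration rather than ExeTera's code: the function name is hypothetical, and using -1 to mark left rows without a match is a convention chosen for this sketch.

```python
def ordered_left_map(left_keys, right_keys):
    """For each left row, the index of the matching right row, or -1.

    Assumes both sequences are sorted and right_keys holds unique values.
    Hypothetical helper for illustration only; not part of ExeTera.
    """
    j = 0
    mapping = []
    for key in left_keys:
        # Advance the right pointer past keys smaller than the current one.
        while j < len(right_keys) and right_keys[j] < key:
            j += 1
        if j < len(right_keys) and right_keys[j] == key:
            mapping.append(j)
        else:
            mapping.append(-1)  # sketch convention for "no match"
    return mapping

mapping = ordered_left_map([1, 1, 2, 4, 4, 5], [1, 3, 4])
print(mapping)
```

Every left row appears exactly once in the result, so the map has the same length as the left key field.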

ordered_merge_right(left_on, right_on, left_field_sources=(), right_field_sinks=None, right_to_left_map=None, left_unique=False, right_unique=False)¶

Generate the results of a right join and apply it to the fields described in the tuple ‘left_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters

left_on – the group/field/numpy array that contains the left key values
right_on – the group/field/numpy array that contains the right key values
right_to_left_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge
left_field_sources – a tuple of group/fields/numpy arrays that contain the fields to be joined
right_field_sinks – optional - a tuple of group/fields/numpy arrays that the mapped fields should be written to
left_unique – a hint to indicate whether the ‘left_on’ field contains unique values
right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If right_field_sinks is not set, a tuple of the output fields is returned

predicate_and_join(predicate, destination_pkey, fkey_indices, reader=None, writer=None, fkey_index_spans=None)¶: This method is due for removal and should not be used. Please use the merge or ordered_merge functions instead.

set_timestamp(timestamp: str = '2022-04-05 17:12:36.959841+00:00')¶

Set the default timestamp to be used when creating fields without specifying an explicit timestamp.

Parameters: timestamp – a string representing a valid Datetime
Returns: None
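
The default value shown in the signature has the form produced by calling str() on a timezone-aware datetime. The sketch below builds a string of that form with the standard library; it assumes such a string is acceptable to set_timestamp, since the exact set of accepted formats is not stated here.

```python
from datetime import datetime, timezone

# Build a UTC timestamp string of the same form as the documented default,
# e.g. 'YYYY-MM-DD HH:MM:SS.ffffff+00:00'.
timestamp = str(datetime.now(timezone.utc))

# Round-trip through the standard library to confirm the string is a valid
# ISO 8601 style datetime.
parsed = datetime.fromisoformat(timestamp)
```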

sort_on(src_group: h5py._hl.group.Group, dest_group: h5py._hl.group.Group, keys: Union[tuple, list], timestamp=datetime.datetime(2022, 4, 5, 17, 12, 36, 959847, tzinfo=datetime.timezone.utc), write_mode='write', verbose=True)¶

Sort a group (src_group) of fields by the specified set of keys, and write the sorted fields to dest_group.

Parameters

src_group – the group of fields that are to be sorted
dest_group – the group into which sorted fields are written
keys – fields to sort on
timestamp – optional - timestamp to write on the sorted fields
write_mode – optional - write mode to use if the destination fields already exist

Returns

None
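
Sorting a group of fields by several keys amounts to computing one permutation from the keys and applying it to every field. The sketch below shows this with numpy arrays held in a dictionary; it is not ExeTera's implementation. The function name and sample data are hypothetical, and treating the first key as the most significant is an assumption of the sketch.

```python
import numpy as np

def sort_fields_sketch(fields, keys):
    """Return a copy of every array in `fields`, sorted by `keys`.

    Hypothetical helper for illustration only; not part of ExeTera.
    """
    # np.lexsort treats the LAST key it is given as the primary key,
    # so the key list is reversed to make keys[0] the most significant.
    order = np.lexsort([np.asarray(fields[k]) for k in reversed(keys)])
    # Apply the same permutation to every field so rows stay aligned.
    return {name: np.asarray(values)[order] for name, values in fields.items()}

src = {
    "patient_id": np.array([2, 1, 2, 1]),
    "day": np.array([5, 9, 3, 1]),
    "score": np.array([0.2, 0.9, 0.3, 0.1]),
}
dest = sort_fields_sketch(src, ("patient_id", "day"))
```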

temp_filename()¶

exetera.core.split module¶

exetera.core.split.assessment_splitter(input_filename, output_filename, assessment_buckets, bucket)¶

exetera.core.split.patient_splitter(input_filename, output_filenames, sorted_indices, bucket_size)¶

exetera.core.split.split_data(patient_data, assessment_data, bucket_size=500000, territories=None)¶

exetera.core.utils module¶

class exetera.core.utils.Timer(start_msg, new_line=False, end_msg='completed in')¶: Bases: object

exetera.core.utils.build_histogram(dataset, filtered_records=None, tx=None)¶

exetera.core.utils.bytearray_to_escaped(srcbytearray, destbytearray, src_start=0, src_end=None, dest_start=0, separator=b',', delimiter=b'"')¶

exetera.core.utils.check_input_lengths(names, fields)¶

exetera.core.utils.chunks(length, chunksize)¶

exetera.core.utils.clear_set_flag(values, to_clear)¶

exetera.core.utils.concatenate_maybe_strs(sequence, value, separator=',', delimiter='"')¶

exetera.core.utils.count_flag_empty(flags)¶

exetera.core.utils.count_flag_not_set(flags, flag_to_test)¶

exetera.core.utils.count_flag_set(flags, flag_to_test)¶

exetera.core.utils.datetime_to_seconds(dt)¶

exetera.core.utils.filter_field(fields, filter_list, f_missing, f_bad, is_type_fn, type_fn, valid_fn)¶

exetera.core.utils.find_longest_sequence_of(string, char)¶

exetera.core.utils.from_escaped(string)¶

exetera.core.utils.get_min_max(value_type)¶

exetera.core.utils.guess_encoding(filename)¶

Attempt to determine the encoding of the given text file by reading the byte order mark, defaulting to utf-8 if none is found.

Parameters: filename – path to a text file containing possible UTF-8, UTF-16, or UTF-32 text
Returns: encoding name, one of utf-8, utf-8-sig, utf-16, utf-32
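
Byte order mark detection can be sketched with the constants in the standard codecs module. This is an illustration of the technique under the return values listed above, not the body of guess_encoding itself; the function name guess_encoding_sketch is hypothetical.

```python
import codecs

def guess_encoding_sketch(filename):
    """Guess a text file's encoding from its byte order mark (BOM).

    Hypothetical helper for illustration only; not part of ExeTera.
    """
    with open(filename, "rb") as f:
        head = f.read(4)  # the longest BOM (UTF-32) is four bytes
    # Test UTF-32 before UTF-16: the UTF-32 LE BOM begins with the
    # same two bytes as the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: fall back to utf-8.
    return "utf-8"
```

The returned name can be passed directly as the encoding argument of open(), since Python's utf-8-sig, utf-16 and utf-32 codecs consume the byte order mark when decoding.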

exetera.core.utils.is_float(value)¶

exetera.core.utils.is_int(value)¶

exetera.core.utils.list_to_escaped(strings)¶

exetera.core.utils.map_between_categories(first_map, second_map)¶

exetera.core.utils.one_dim_data_to_indexed_for_test(data, field_size)¶

exetera.core.utils.print_diagnostic_row(preamble, ds, ir, keys, fns=None)¶

exetera.core.utils.replace_if_invalid(replacement)¶

exetera.core.utils.sort_mixed_list(values, check_fn, sort_fn)¶

exetera.core.utils.string_to_datetime(field)¶

exetera.core.utils.timestamp_to_day(field)¶

exetera.core.utils.to_categorical(field, transform)¶

exetera.core.utils.to_escaped(string, separator=',', delimiter='"')¶

exetera.core.utils.to_float(value)¶

exetera.core.utils.to_int(value)¶

exetera.core.utils.valid_range_fac(f_min, f_max, default_value='')¶

exetera.core.utils.valid_range_fac_inc(f_min, f_max, default_value='')¶

exetera.core.utils.validate_file_exists(file_name)¶

exetera.core.validation module¶

exetera.core.validation.all_same_basic_type(name, fields)¶

exetera.core.validation.array_from_field_or_lower(name, field)¶

exetera.core.validation.array_from_parameter(session, name, field)¶

exetera.core.validation.ensure_valid_field(name, field)¶

exetera.core.validation.ensure_valid_field_like(name, field)¶

exetera.core.validation.field_from_parameter(session, name, field)¶

exetera.core.validation.is_field_parameter(field)¶

exetera.core.validation.raw_array_from_parameter(datastore, name, field)¶

exetera.core.validation.validate_all_field_length_in_df(df: exetera.core.abstract_types.DataFrame)¶

exetera.core.validation.validate_and_get_key_fields(side, df, key)¶

exetera.core.validation.validate_and_normalize_categorical_key(param_name, key)¶

exetera.core.validation.validate_boolean_row_filter(name, field)¶

exetera.core.validation.validate_chunk_size(chunk_size_name, chunk_size)¶

exetera.core.validation.validate_field_lengths(side, lens, df, names=None)¶

exetera.core.validation.validate_filter(filter_to_apply)¶

exetera.core.validation.validate_groupby_target(target, by, all)¶

exetera.core.validation.validate_key_field_consistency(lname, rname, lkey, rkey)¶

exetera.core.validation.validate_key_lengths(side, df, key)¶

exetera.core.validation.validate_require_key(context, key, dictionary)¶

exetera.core.validation.validate_selected_keys(by, all)¶

Module contents¶