exetera.core package

Submodules

exetera.core.data_writer module

class exetera.core.data_writer.DataWriter

Bases: object

static clear_dataset(parent_group, name)
static create_group(parent_group, name, attrs)
static flush(group)
static write(group, name, field, count, dtype=None)
static write_additional(group, name, field, count)
static write_first(group, name, field, count, dtype=None)

exetera.core.dataset module

class exetera.core.dataset.HDF5Dataset(session, dataset_path, mode, name)

Bases: exetera.core.abstract_types.Dataset

Dataset is the means by which you interact with an ExeTera datastore. Datasets are created and loaded through Session.open_dataset, rather than being constructed directly.

A Dataset is composed of one or more DataFrame objects and is the means by which you interact with those DataFrames.

For a detailed explanation of Dataset along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/Dataset-API

Parameters
  • session – The session instance to which this dataset belongs.

  • dataset_path – The path of the HDF5 file.

  • mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.

  • name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling get_dataset().

Returns

An HDF5Dataset instance.

close()

Close the underlying HDF5 file.

contains_dataframe(dataframe: exetera.core.abstract_types.DataFrame)

Check whether this dataset contains a dataframe, identified by the dataframe object itself.

Parameters

dataframe – the dataframe object to check

Returns

True if the dataframe is contained in this dataset, otherwise False.

copy(dataframe, name)

Add an existing dataframe (from another dataset) to this dataset, writing its group attributes and HDF5 datasets to this dataset.

Parameters
  • dataframe – the dataframe to copy to this dataset.

  • name – optional; a new name for the dataframe.

Returns

None if the operation is successful; otherwise an error is raised.

create_dataframe(name: str, dataframe: Optional[exetera.core.abstract_types.DataFrame] = None)

Create a new DataFrame object as a part of this Dataset.

Parameters
  • name – name of the dataframe

  • dataframe – if set, this is a dataframe object whose contents are duplicated

Returns

a dataframe object

create_group(name: str)

This method is a wrapper around create_dataframe().

delete_dataframe(dataframe: exetera.core.abstract_types.DataFrame)

Remove a dataframe from this dataset, identified by the dataframe object.

Parameters

dataframe – The dataframe instance to delete.

Returns

A boolean indicating whether the dataframe was deleted.

drop(name: str)
get_dataframe(name: str)

Get a dataframe by name, e.g. dataset.get_dataframe(dataframe_name).

Parameters

name – The name of the dataframe.

Returns

The dataframe; an error is raised if the name does not exist in this dataset.

items()

Return the (name, dataframe) tuples in this dataset.

keys()

Return all dataframe names in this dataset.

require_dataframe(name)

Get a dataframe, creating it if it doesn’t exist.

Parameters

name – name of the dataframe

property session

The session property interface.

Returns

The _session instance.

values()

Return all dataframe instances in this dataset.

exetera.core.dataset.copy(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)

Copy a dataframe to another dataset, e.g. HDF5DataFrame.copy(ds1[‘df1’], ds2, ‘df1’).

Parameters
  • dataframe – The dataframe to copy.

  • dataset – The destination dataset.

  • name – The name of dataframe in destination dataset.

exetera.core.dataset.move(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)

Move a dataframe to another dataset, e.g. HDF5DataFrame.move(ds1[‘df1’], ds2, ‘df1’). If the move is within the same dataset, e.g. HDF5DataFrame.move(ds1[‘df1’], ds1, ‘df2’), it functions as a rename of both the dataframe and the HDF5 group. However, to

Parameters
  • dataframe – The dataframe to move.

  • dataset – The destination dataset.

  • name – The name of dataframe in destination dataset.

exetera.core.dataframe module

class exetera.core.dataframe.HDF5DataFrame(dataset: exetera.core.abstract_types.Dataset, name: str, h5group: h5py._hl.group.Group)

Bases: exetera.core.abstract_types.DataFrame

DataFrame is the means by which you interact with the fields of an ExeTera datastore. DataFrames are created and loaded through Dataset.create_dataframe and other methods, rather than being constructed directly.

DataFrames closely resemble Pandas DataFrames, but with a number of key differences:

  1. Instead of Series, DataFrames are composed of Field objects.

  2. DataFrames can store fields of differing lengths, although all fields must be of the same length when performing certain operations such as merges.

  3. ExeTera DataFrames do not (yet) have the ability to create filtered views onto an underlying DataFrame, although this functionality will be added in upcoming releases.

For a detailed explanation of DataFrame along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/DataFrame-API

Parameters
  • name – name of the dataframe.

  • dataset – the dataset object to which this dataframe belongs.

  • h5group – the h5group object in which the fields are stored. If the h5group is not empty, data is acquired from it directly. The structure is h5group<-h5group-dataset, where the latter group has a ‘fieldtype’ attribute and only one dataset named ‘values’, so that the structure maps to Dataframe<-Field-Field.data automatically.

  • dataframe – optional - replicate data from another dictionary of (name:str, field: Field).

add(field: exetera.core.abstract_types.Field)

Add a field to this dataframe as well as the HDF5 Group.

Parameters

field – the field to add to this dataframe; its underlying dataset is copied

apply_filter(filter_to_apply, ddf=None)

Apply the filter to all the fields in this dataframe, return a dataframe with filtered fields.

Parameters
  • filter_to_apply – the filter to be applied to the source fields, an array of booleans

  • ddf – optional; the destination dataframe

Returns

a dataframe containing all the filtered fields; self if ddf is not set
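
To make the filtering semantics concrete, here is a minimal sketch that models a dataframe as a plain dict of equal-length Python lists. It is not ExeTera code: `apply_filter_sketch` is a hypothetical helper, and the only behaviour taken from the description above is that every field is filtered with the same boolean array and that the source itself is returned when no destination is given.

```python
from itertools import compress


def apply_filter_sketch(columns, filter_to_apply, ddf=None):
    """Keep, in every column, only the rows whose filter entry is True."""
    target = columns if ddf is None else ddf
    # Filter every column before writing anything, so updating in place is safe.
    filtered = {name: list(compress(values, filter_to_apply))
                for name, values in columns.items()}
    target.update(filtered)
    return target


src = {"id": [1, 2, 3, 4], "age": [30, 41, 25, 58]}
dst = {}
result = apply_filter_sketch(src, [True, False, True, False], dst)
print(result)  # {'id': [1, 3], 'age': [30, 25]}
```

Because a destination was supplied, `src` keeps all four rows; calling the helper without `ddf` would overwrite `src` instead.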

apply_index(index_to_apply, ddf=None)

Apply the index to all the fields in this dataframe, return a dataframe with indexed fields.

Parameters
  • index_to_apply – the index to be applied to the fields, an ndarray of integers

  • ddf – optional; the destination dataframe

Returns

a dataframe containing all the re-indexed fields; self if ddf is not set
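
Applying an index is a gather operation: row `i` of the output is row `index_to_apply[i]` of the input, so indices may reorder or repeat rows. The sketch below shows this on a dict of Python lists; `apply_index_sketch` is a hypothetical stand-in used for illustration, not the library's implementation.

```python
def apply_index_sketch(columns, index_to_apply, ddf=None):
    """Gather the rows of every column in the order given by index_to_apply."""
    target = columns if ddf is None else ddf
    # Build all gathered columns first, then write them to the target.
    reindexed = {name: [values[i] for i in index_to_apply]
                 for name, values in columns.items()}
    target.update(reindexed)
    return target


src = {"id": [10, 20, 30], "code": ["a", "b", "c"]}
out = apply_index_sketch(src, [2, 0, 0], ddf={})
print(out)  # {'id': [30, 10, 10], 'code': ['c', 'a', 'a']}
```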

property columns

The columns property interface. Columns is a dictionary to store the fields by (field_name, field_object). The field_name is field.name without prefix ‘/’ and HDF5 group name.

contains_field(field)

Check whether this dataframe contains a field, identified by the field object.

Parameters

field – the field object to check

Returns

A tuple (bool, str), where the str is the name under which the field is stored in the dataframe.

create_categorical(name: str, nformat: int, key: dict, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a categorical type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#categoricalfield for a detailed description of categorical fields

create_fixed_string(name: str, length: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a fixed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#fixedstringfield for a detailed description of fixed string fields

create_group(name: str)

Create a group object in HDF5 file for field to use. Please note, this function is for backwards compatibility with older scripts and should not be used in the general case.

Parameters

name – the name of the group and field

Returns

a hdf5 group object

create_indexed_string(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create an indexed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#indexedstringfield for a detailed description of indexed string fields

create_numeric(name: str, nformat: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a numeric type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#numericfield for a detailed description of numeric fields

create_timestamp(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a timestamp type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#timestampfield for a detailed description of timestamp fields

property dataset

The dataset property interface.

delete_field(field)

Remove a field from this dataframe, identified by the field object.

Parameters

field – The field to delete from this dataframe.

describe(include=None, exclude=None, output='terminal')

Show the basic statistics of the data in each field.

Parameters
  • include – The field name or data type or simply ‘all’ to indicate the fields included in the calculation.

  • exclude – The field name or data type to exclude from the calculation.

  • output – Display the result in stdout if set to ‘terminal’, otherwise silent.

Returns

A dataframe containing the statistics.

drop(name: str)

Drop a field from this dataframe as well as the HDF5 Group

drop_duplicates(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, hint_keys_is_sorted=False)

Return the distinct values of a field or a list of fields, as a dataframe.

Parameters
  • by – Name (str) or list of names (str) of the fields whose distinct values are returned.

  • ddf – optional - the destination dataframe

Returns

DataFrame with distinct values.
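
As an illustration of what "distinct values of a list of fields" means, the sketch below treats each row as a tuple of the `by` columns and keeps one copy of every distinct tuple. It works on a dict of Python lists, `drop_duplicates_sketch` is a hypothetical name, and returning the result in sorted order is an assumption of this sketch rather than documented behaviour.

```python
def drop_duplicates_sketch(columns, by):
    """Return the distinct combinations of the `by` columns as a new table."""
    keys = [by] if isinstance(by, str) else list(by)
    n_rows = len(columns[keys[0]])
    # A set removes repeated key tuples; sorting gives a deterministic order.
    distinct = sorted({tuple(columns[k][i] for k in keys) for i in range(n_rows)})
    return {k: [row[j] for row in distinct] for j, k in enumerate(keys)}


cols = {"a": [2, 1, 2, 1], "b": [5, 6, 5, 7]}
print(drop_duplicates_sketch(cols, ["a", "b"]))  # {'a': [1, 1, 2], 'b': [6, 7, 5]}
print(drop_duplicates_sketch(cols, "a"))         # {'a': [1, 2]}
```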

get_field(name)

Get a field stored by the field name.

Parameters

name – The name of field to get.

groupby(by: Union[str, List[str]], hint_keys_is_sorted=False)

Group a DataFrame using a field or a list of fields, returning a groupby object.

Parameters
  • by – Name (str) or list of names (str) to group by.

  • hint_keys_is_sorted – an optional flag that users can set to skip the sortedness check. Note that groupby runs faster and uses less memory when the dataframe is already sorted and hint_keys_is_sorted=True is passed.

Returns

Returns a groupby object that contains information about the groups.
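
Grouping over sorted keys can be expressed with spans: a list of boundaries in which each adjacent pair delimits one run of equal key values. The sketch below computes such boundaries for a sorted Python list. It is an illustrative assumption about the representation (the HDF5DataFrameGroupBy constructor takes a `spans` argument), not a copy of the library's algorithm, and `get_spans_sketch` is a hypothetical name.

```python
def get_spans_sketch(sorted_keys):
    """Return boundaries b such that sorted_keys[b[i]:b[i+1]] is one group."""
    if not sorted_keys:
        return [0]
    spans = [0]
    for i in range(1, len(sorted_keys)):
        # A new group starts wherever the key differs from its predecessor.
        if sorted_keys[i] != sorted_keys[i - 1]:
            spans.append(i)
    spans.append(len(sorted_keys))
    return spans


print(get_spans_sketch(["a", "a", "b", "c", "c", "c"]))  # [0, 2, 3, 6]
```

This also suggests why sorted input is cheaper: the boundaries fall out of a single linear pass, with no need to sort first.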

property h5group

The h5group property interface, used to handle underlying storage.

items()

Return all the field names and their corresponding field values

keys()

Return all the field names

rename(field: Union[str, Mapping[str, str]], field_to: Optional[str] = None) → None

Rename provides you with the means to rename fields within a dataframe. You can specify either a single field to be renamed or you can provide a dictionary with a set of fields to be renamed.

Example:

# rename a single field
df.rename('a', 'b')

# rename multiple fields
df.rename({'a': 'b', 'b': 'c', 'c': 'a'})

Field renaming can fail if the resulting set of renamed fields would have name clashes. If this is the case, none of the rename operations go ahead and the dataframe remains unmodified.

Parameters
  • field – Either a string or a dictionary of name pairs, each of which is the existing field name and the destination field name

  • field_to – Optional parameter containing a string, if field is a string. If ‘field’ is a dictionary, parameter should not be set. Field references remain valid after this operation and reflect their renaming.

Returns

None
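
The all-or-nothing behaviour described above can be sketched by computing the full set of final names before modifying anything. The example below operates on a plain dict rather than a real dataframe; `rename_sketch` is hypothetical, and the exception types it raises are assumptions, since the documentation does not say which error a failed rename produces.

```python
def rename_sketch(columns, field, field_to=None):
    """Rename keys of `columns` atomically; on a clash, raise and change nothing."""
    if isinstance(field, str):
        if field_to is None:
            raise ValueError("field_to must be set when field is a string")
        mapping = {field: field_to}
    else:
        if field_to is not None:
            raise ValueError("field_to must not be set when field is a dict")
        mapping = dict(field)

    missing = [name for name in mapping if name not in columns]
    if missing:
        raise KeyError(f"unknown fields: {missing}")

    # Work out every final name first, so a clash is detected before any change.
    new_names = [mapping.get(name, name) for name in columns]
    if len(set(new_names)) != len(new_names):
        raise ValueError("rename would produce clashing field names")

    renamed = dict(zip(new_names, list(columns.values())))
    columns.clear()
    columns.update(renamed)


cols = {"a": [1], "b": [2], "c": [3]}
rename_sketch(cols, {"a": "b", "b": "c", "c": "a"})  # a cyclic rename is allowed
print(cols)  # {'b': [1], 'c': [2], 'a': [3]}
```

Renaming `'a'` to `'b'` in the resulting dict would raise, because `'b'` already exists and is not itself being renamed; the dict would be left exactly as it was.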

sort_values(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, axis=0, ascending=True, kind='stable')

Sort by the values of a field or a list of fields

Parameters
  • by – Name (str) or list of names (str) to sort by.

  • ddf – optional - the destination data frame

  • axis – Axis to be sorted. Currently only supports 0

  • ascending – Sort ascending vs. descending. Currently only supports ascending=True.

  • kind – Choice of sorting algorithm. Currently only supports “stable”

Returns

DataFrame with sorted values or None if ddf=None.
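
A stable, ascending sort on several keys can be pictured as computing one permutation of row positions and then applying it to every field. The sketch below does this for a dict of Python lists using the built-in `sorted`, which is stable; `sort_index_sketch` is a hypothetical helper and the data is made up for illustration.

```python
def sort_index_sketch(columns, by):
    """Return the row order that sorts `columns` ascending by the `by` keys."""
    keys = [by] if isinstance(by, str) else list(by)
    n_rows = len(columns[keys[0]])
    # sorted() is stable: rows with equal keys keep their original relative order.
    return sorted(range(n_rows), key=lambda i: tuple(columns[k][i] for k in keys))


cols = {"pid": [2, 1, 2, 1], "day": [5, 9, 3, 9], "val": ["a", "b", "c", "d"]}
order = sort_index_sketch(cols, ["pid", "day"])
sorted_cols = {name: [values[i] for i in order] for name, values in cols.items()}
print(order)               # [1, 3, 2, 0]
print(sorted_cols["val"])  # ['b', 'd', 'c', 'a']
```

Rows 1 and 3 share the key (1, 9) and stay in their original order, which is what "stable" guarantees.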

to_csv(filepath: str, row_filter: Union[numpy.ndarray, exetera.core.abstract_types.Field] = None, column_filter: Union[str, List[str]] = None, chunk_row_size: int = 32768)

Write object to a comma-separated values (csv) file.

Parameters
  • filepath – File path.

  • row_filter – A boolean array / field. Only select rows when filter value is True.

  • column_filter – A sequence of string names for the fields.

  • chunk_row_size – Write rows for every chunk which has maximum chunk_row_size rows. The default is 1<<15.
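
The interplay of the row filter, the column filter and the chunk size can be sketched with the standard `csv` module. This is a simplified model on a dict of Python lists writing to any text stream; `to_csv_sketch` is hypothetical, and writing a header row of field names is an assumption of the sketch.

```python
import csv
import io


def to_csv_sketch(columns, out, row_filter=None, column_filter=None,
                  chunk_row_size=1 << 15):
    """Write selected rows and columns of a dict-of-lists table as CSV."""
    if column_filter is None:
        names = list(columns)
    elif isinstance(column_filter, str):
        names = [column_filter]
    else:
        names = list(column_filter)

    n_rows = len(columns[names[0]]) if names else 0
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)

    # Only one chunk of rows is materialised at a time.
    for start in range(0, n_rows, chunk_row_size):
        end = min(start + chunk_row_size, n_rows)
        chunk = [[columns[name][i] for name in names]
                 for i in range(start, end)
                 if row_filter is None or row_filter[i]]
        writer.writerows(chunk)


buf = io.StringIO()
to_csv_sketch({"id": [1, 2, 3], "name": ["x", "y", "z"]}, buf,
              row_filter=[True, False, True], chunk_row_size=2)
print(buf.getvalue())
```

Processing rows in bounded chunks keeps memory use roughly constant regardless of how many rows the table holds.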

to_pandas(row_filter: List[bool] = None, col_filter: Union[str, List[str]] = None)

Convert an ExeTera dataframe to a Pandas DataFrame.

Parameters
  • row_filter – A boolean array indicating which rows to export.

  • col_filter – String or list of strings indicating which columns to export.

Returns

A pandas dataframe.

Example:

pandas_df = df.to_pandas()

values()

Return all the field values

class exetera.core.dataframe.HDF5DataFrameGroupBy(columns, by, sorted_index, spans)

Bases: exetera.core.abstract_types.DataFrameGroupBy

count(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Compute count of group values.

Parameters
  • target – Name (str) or list of names (str) to compute count.

  • ddf – the destination data frame

  • write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with count of group values

distinct(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame
first(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Get first of group values.

Parameters
  • target – Name (str) or list of names (str) to get first value.

  • ddf – the destination data frame

  • write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with first of group values

last(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Get last of group values.

Parameters
  • target – Name (str) or list of names (str) to get last value.

  • ddf – the destination data frame

  • write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with last of group values

max(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Compute max of group values.

Parameters
  • target – Name (str) or list of names (str) to compute max.

  • ddf – the destination data frame

  • write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with max of group values

min(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Compute min of group values.

Parameters
  • target – Name (str) or list of names (str) to compute min.

  • ddf – the destination data frame

  • write_keys – write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with min of group values
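
The count, first, last, max and min aggregations each reduce every group to a single value. Assuming groups are described by a list of boundaries in which adjacent pairs delimit one group of consecutive rows, they can be sketched together as follows. `aggregate_spans_sketch` is a hypothetical helper over Python lists, not the HDF5DataFrameGroupBy implementation.

```python
def aggregate_spans_sketch(spans, values):
    """Reduce each group values[spans[i]:spans[i+1]] to one value per statistic."""
    out = {"first": [], "last": [], "min": [], "max": [], "count": []}
    for start, end in zip(spans, spans[1:]):
        group = values[start:end]
        out["first"].append(group[0])
        out["last"].append(group[-1])
        out["min"].append(min(group))
        out["max"].append(max(group))
        out["count"].append(len(group))
    return out


spans = [0, 2, 3, 6]         # three groups: rows 0-1, row 2, rows 3-5
values = [5, 1, 7, 2, 9, 4]
agg = aggregate_spans_sketch(spans, values)
print(agg["first"], agg["last"])  # [5, 7, 2] [1, 7, 4]
print(agg["min"], agg["max"])     # [1, 7, 2] [5, 7, 9]
print(agg["count"])               # [2, 1, 3]
```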

exetera.core.dataframe.copy(field: exetera.core.abstract_types.Field, dataframe: exetera.core.abstract_types.DataFrame, name: str)

Copy a field to another dataframe, along with its underlying dataset.

Parameters
  • field – The source field to copy.

  • dataframe – The destination dataframe to copy to.

  • name – The name of field under destination dataframe.

exetera.core.dataframe.merge(left: exetera.core.abstract_types.DataFrame, right: exetera.core.abstract_types.DataFrame, dest: exetera.core.abstract_types.DataFrame, left_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], right_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], left_fields: Optional[Sequence[str]] = None, right_fields: Optional[Sequence[str]] = None, left_suffix: str = '_l', right_suffix: str = '_r', how='left', hint_left_keys_ordered: Optional[bool] = None, hint_left_keys_unique: Optional[bool] = None, hint_right_keys_ordered: Optional[bool] = None, hint_right_keys_unique: Optional[bool] = None, chunk_size=1048576)

Merge ‘left’ and ‘right’ DataFrames into a destination dataframe. The merge is a database-style join operation, in any of the following modes (“left”, “right”, “inner”, “outer”). This method closely follows the Pandas ‘merge’ functionality.

The join is performed using the fields specified by ‘left_on’ and ‘right_on’; these can either be strings or fields. If they are strings, they refer to fields that must exist in the corresponding dataframe.

You can optionally set ‘left_fields’ and / or ‘right_fields’ if you want to have only a subset of fields joined from the left and right dataframes. If you don’t want any fields to be joined from a given dataframe, you can pass an empty list.

Fields are written to the destination dataframe. If the field names clash, they will get appended with the strings specified in ‘left_suffix’ and ‘right_suffix’ respectively.

Parameters
  • left – The left dataframe

  • right – The right dataframe

  • left_on – The field corresponding to the left key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys

  • right_on – The field corresponding to the right key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys

  • left_fields – Optional parameter listing which fields are to be joined from the left table. If this is not set, all fields from the left table are joined

  • right_fields – Optional parameter listing which fields are to be joined from the right table. If this is not set, all fields from the right table are joined

  • left_suffix – A string to be appended to fields from the left table if they clash with fields from the right table.

  • right_suffix – A string to be appended to fields from the right table if they clash with fields from the left table.

  • how – Optional parameter specifying the merge mode. It must be one of (‘left’, ‘right’, ‘inner’, ‘outer’ or ‘cross’). If not set, a ‘left’ join is performed.
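
To illustrate the default ‘left’ mode and the suffix handling, the sketch below joins two dict-of-lists tables on a single key. It is a simplified model, not the library's algorithm: `left_merge_sketch` is hypothetical, only single-field keys are handled, and unmatched right-hand values are filled with `None`, which is an assumption of this sketch rather than documented ExeTera behaviour.

```python
def left_merge_sketch(left, right, left_on, right_on,
                      left_suffix="_l", right_suffix="_r"):
    """Left join two dict-of-lists tables, suffixing clashing field names."""
    # Map each right-hand key to the row positions that hold it.
    right_rows = {}
    for j, key in enumerate(right[right_on]):
        right_rows.setdefault(key, []).append(j)

    # Pair every left row with each of its matches, or with None if unmatched.
    pairs = []
    for i, key in enumerate(left[left_on]):
        for j in right_rows.get(key, [None]):
            pairs.append((i, j))

    dest = {}
    for name, values in left.items():
        out_name = name + left_suffix if name in right else name
        dest[out_name] = [values[i] for i, _ in pairs]
    for name, values in right.items():
        out_name = name + right_suffix if name in left else name
        dest[out_name] = [None if j is None else values[j] for _, j in pairs]
    return dest


patients = {"id": [1, 2, 3], "age": [30, 41, 25]}
tests = {"id": [1, 1, 3], "result": ["neg", "pos", "neg"]}
merged = left_merge_sketch(patients, tests, "id", "id")
print(merged["id_l"])    # [1, 1, 2, 3]
print(merged["result"])  # ['neg', 'pos', None, 'neg']
```

Every left row appears at least once, patient 1 appears twice because it matches two test rows, and only the clashing name `id` receives suffixes.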

exetera.core.dataframe.move(field: exetera.core.abstract_types.Field, dest_df: exetera.core.abstract_types.DataFrame, name: str)

Move a field to another dataframe, along with its underlying dataset.

Parameters
  • src_df – The source dataframe where the field is located.

  • field – The field to move.

  • dest_df – The destination dataframe to move to.

  • name – The name of field under destination dataframe.

exetera.core.exporter module

exetera.core.exporter.export_schema(destination, readers)
exetera.core.exporter.export_to_csv(destination, datastore, fields)

Export the selected fields of the selected dataframe to a csv file.

exetera.core.exporter.schema_from_reader_type(reader)
exetera.core.exporter.transform_from_reader_type(reader)

exetera.core.fields module

class exetera.core.fields.CategoricalField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
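
The three possible outcomes (a new field, the target field, or this field) can be sketched with a small list-backed class. `ListFieldSketch` is a hypothetical stand-in for a field, and raising ValueError when both ‘target’ and ‘in_place’ are given is an assumption; the documentation only states that the two options are mutually exclusive.

```python
from itertools import compress


class ListFieldSketch:
    """A toy field that keeps its values in a Python list."""

    def __init__(self, data=()):
        self.data = list(data)

    def apply_filter(self, filter_to_apply, target=None, in_place=False):
        if target is not None and in_place:
            raise ValueError("'target' and 'in_place' cannot both be set")
        filtered = list(compress(self.data, filter_to_apply))
        if in_place:
            self.data = filtered       # destructive: this field is modified
            return self
        if target is not None:
            target.data = filtered     # the result is written to the target
            return target
        return ListFieldSketch(filtered)   # default: a new field


field = ListFieldSketch([10, 20, 30])
mask = [True, False, True]
fresh = field.apply_filter(mask)
dest = ListFieldSketch()
written = field.apply_filter(mask, target=dest)
print(fresh.data, written is dest, field.data)  # [10, 30] True [10, 20, 30]
```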

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
property keys
property nformat
remap(key_map, new_key)

Remap the key names and key values.

Parameters
  • key_map – The mapping rule for converting the old key into the new key.

  • new_key – The new key.

Returns

A CategoricalMemField with the new key.
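
A remap can be pictured as translating each stored category value and replacing the key dictionary in one step. The sketch below assumes that `key_map` is a sequence of (old_value, new_value) pairs and that `new_key` maps category names to the new values; both shapes are assumptions made for illustration, and `remap_sketch` is a hypothetical helper over a Python list.

```python
def remap_sketch(values, key_map, new_key):
    """Translate category values via key_map and pair them with the new key."""
    lookup = dict(key_map)
    remapped = [lookup[v] for v in values]
    # Every translated value should be described by the new key.
    unknown = sorted(set(remapped) - set(new_key.values()))
    if unknown:
        raise ValueError(f"values not described by new_key: {unknown}")
    return remapped, dict(new_key)


values = [0, 1, 2, 1]  # e.g. 0 = "no", 1 = "yes", 2 = "probably"
remapped, keys = remap_sketch(values, [(0, 0), (1, 1), (2, 1)],
                              {"no": 0, "yes": 1})
print(remapped)  # [0, 1, 1, 1]
print(keys)      # {'no': 0, 'yes': 1}
```

Here two old categories are merged into one, which is a typical reason for remapping.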

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of CategoricalField

writeable()

class exetera.core.fields.CategoricalMemField(session, nformat, keys)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
property keys
remap(key_map, new_key)

Remap the key names and key values.

Parameters
  • key_map – The mapping rule for converting the old key into the new key.

  • new_key – The new key.

Returns

A CategoricalMemField with the new key.

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of CategoricalMemField

writeable()

class exetera.core.fields.FieldDataOps

Bases: object

static apply_filter_to_field(source, filter_to_apply, target=None, in_place=False)
static apply_filter_to_indexed_field(source, filter_to_apply, target=None, in_place=False)
static apply_index_to_field(source, index_to_apply, target=None, in_place=False)
static apply_index_to_indexed_field(source, index_to_apply, target=None, in_place=False)
static apply_isin(source: exetera.core.abstract_types.Field, test_elements: Union[list, set, numpy.ndarray])
static apply_spans_first(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field
static apply_spans_last(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field
static apply_spans_max(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field
static apply_spans_min(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field
static apply_unique(src: exetera.core.abstract_types.Field, return_index=False, return_inverse=False, return_counts=False) → numpy.ndarray
static categorical_field_create_like(source, group, name, timestamp)
classmethod equal(session, first, second)
static fixed_string_field_create_like(source, group, name, timestamp)
classmethod greater_than(session, first, second)
classmethod greater_than_equal(session, first, second)
static indexed_string_create_like(source, group, name, timestamp)
classmethod invert(session, first)
classmethod less_than(session, first, second)
classmethod less_than_equal(session, first, second)
classmethod logical_not(session, first)
classmethod not_equal(session, first, second)
classmethod numeric_add(session, first, second)
classmethod numeric_and(session, first, second)
classmethod numeric_divmod(session, first, second)
static numeric_field_create_like(source, group, name, timestamp)
classmethod numeric_floordiv(session, first, second)
classmethod numeric_mod(session, first, second)
classmethod numeric_mul(session, first, second)
classmethod numeric_or(session, first, second)
classmethod numeric_sub(session, first, second)
classmethod numeric_truediv(session, first, second)
classmethod numeric_xor(session, first, second)
static timestamp_field_create_like(source, group, name, timestamp)

class exetera.core.fields.FixedStringField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of FixedStringField

writeable()

class exetera.core.fields.FixedStringMemField(session, length)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of FixedStringMemField

writeable()

class exetera.core.fields.HDF5Field(session, group, dataframe, write_enabled=False)

Bases: exetera.core.abstract_types.Field

apply_filter(filter_to_apply, dstfld=None)
apply_index(index_to_apply, dstfld=None)
property chunksize

The chunksize for the field. This is not generally required for users, and may be ignored depending on the storage medium.

property dataframe

The owning dataframe of this field, or None if the field is not owned by a dataframe

get_spans()
property indexed

Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.
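The sketch below shows one common way to store variable-length strings as an index array plus a value array. The exact layout is an assumption made for illustration (offsets into a UTF-8 byte buffer, with one more index entry than there are strings), and the helper names are hypothetical.

```python
def to_indexed(strings):
    """Pack strings into (indices, values): values holds the concatenated
    UTF-8 bytes and indices[k]:indices[k + 1] delimits string k."""
    indices = [0]
    values = bytearray()
    for s in strings:
        values.extend(s.encode("utf-8"))
        indices.append(len(values))
    return indices, bytes(values)


def from_indexed(indices, values):
    """Rebuild the list of strings from the packed representation."""
    return [values[indices[k]:indices[k + 1]].decode("utf-8")
            for k in range(len(indices) - 1)]


indices, values = to_indexed(["a", "bb", ""])
```

Storing offsets rather than padded fixed-width entries avoids wasting space when string lengths vary widely.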

property name

The name of the field within a dataframe, if the field belongs to a dataframe

property timestamp

The timestamp representing the field creation time. This is the time at which the data for this field was added to the dataset, rather than the point at which the field wrapper was created.

property valid

Returns whether the field is a valid field object. Fields can become invalid as a result of certain operations, such as a field being moved from one dataframe to another. An invalid field will throw exceptions if any other operation is performed on it.

class exetera.core.fields.IndexedStringField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)

Create an empty field of the same type as this field.

property data
get_spans()
property indexed

Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.

property indices
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of IndexedStringField

property values
writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets

class exetera.core.fields.IndexedStringMemField(session, chunksize=None)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
property indexed
property indices
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of IndexedStringMemField

property values
writeable()
class exetera.core.fields.MemoryField(session)

Bases: exetera.core.abstract_types.Field

apply_filter(filter_to_apply, dstfld=None)
apply_index(index_to_apply, dstfld=None)
property chunksize
property dataframe
property indexed
property name
property timestamp
property valid
class exetera.core.fields.MemoryFieldArray(dtype)

Bases: object

clear()
complete()
property dtype
write(part)
write_part(part, move_mem=False)
class exetera.core.fields.NumericField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
astype(dtype: str, casting='unsafe')

Convert the field's data to the data type given by the dtype parameter.

Parameters
  • dtype – The new datatype, given as a str object. The dtype must be a subtype of np.number, e.g. int, float, etc.

  • casting – Similar to the casting parameter in numpy ndarray.astype, can be ‘no’, ‘equiv’, ‘safe’, ‘same_kind’, or ‘unsafe’.

Returns

The field with new datatype.

create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
logical_not()
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of NumericField

writeable()
class exetera.core.fields.NumericMemField(session, nformat)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
logical_not()
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of NumericMemField

writeable()
class exetera.core.fields.ReadOnlyFieldArray(field, dataset_name)

Bases: object

clear()
complete()
property dtype
write(part)
write_part(part)
class exetera.core.fields.ReadOnlyIndexedFieldArray(field, indices, values)

Bases: object

clear()
complete()
property dtype
write(part)
write_part(part)
class exetera.core.fields.TimestampField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of TimestampField

writeable()
class exetera.core.fields.TimestampMemField(session)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)
apply_spans_last(spans_to_apply, target=None, in_place=False)
apply_spans_max(spans_to_apply, target=None, in_place=False)
apply_spans_min(spans_to_apply, target=None, in_place=False)
create_like(group=None, name=None, timestamp=None)
property data
get_spans()
is_sorted()
isin(test_elements: Union[list, set, numpy.ndarray])
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of TimestampMemField

writeable()
class exetera.core.fields.WriteableFieldArray(field, dataset_name)

Bases: object

clear()
complete()
property dtype
write(part)
write_part(part)
class exetera.core.fields.WriteableIndexedFieldArray(chunksize, indices, values)

Bases: object

clear()
complete()
property dtype
write(part)
write_part(part)
exetera.core.fields.argsort(field: exetera.core.abstract_types.Field, dtype: str = None)
exetera.core.fields.as_field(data, key=None)
exetera.core.fields.base_field_contructor(session, group, name, timestamp=None, chunksize=None)

This constructor (1) creates the field (an HDF5 group), (2) adds basic attributes such as chunksize, timestamp and field type, and (3) adds the dataset to the field (HDF5 group) under the name ‘values’.

exetera.core.fields.categorical_field_constructor(session, group, name, nformat, key, timestamp=None, chunksize=None)
exetera.core.fields.dtype_to_str(dtype)
exetera.core.fields.fixed_string_field_constructor(session, group, name, length, timestamp=None, chunksize=None)
exetera.core.fields.indexed_string_field_constructor(session, group, name, timestamp=None, chunksize=None)
exetera.core.fields.isin(field, test_elements)
exetera.core.fields.numeric_field_constructor(session, group, name, nformat, timestamp=None, chunksize=None)
exetera.core.fields.timestamp_field_constructor(session, group, name, timestamp=None, chunksize=None)

exetera.core.filtered_field module

class exetera.core.filtered_field.FilteredField(field, filter)

Bases: object

exetera.core.filtered_field.filtered_field(field, filter)

exetera.core.indexed_array module

class exetera.core.indexed_array.IndexedArray

Bases: object

exetera.core.journal module

exetera.core.journal.journal_table(session, schema, old_src, new_src, src_pk, result)
exetera.core.journal.journal_test_harness(session, schema, old_file, new_file, dest_file)

exetera.core.operations module

exetera.core.operations.apply_filter_to_index_values(index_filter, indices, values)
exetera.core.operations.apply_indices_to_index_values(indices_to_apply, indices, values)
exetera.core.operations.apply_spans_concat(spans, src_index, src_values, dest_index, dest_values, max_index_i, max_value_i, s_start)
exetera.core.operations.apply_spans_count(spans, dest_array=None)
exetera.core.operations.apply_spans_first(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_index_of_first(spans, dest_array=None)
exetera.core.operations.apply_spans_index_of_first_filter(spans, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_last(spans, dest_array=None)
exetera.core.operations.apply_spans_index_of_last_filter(spans, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_max(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_index_of_max_filter(spans, src_array, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_max_indexed(spans, src_indices, src_values, dest_array=None)
exetera.core.operations.apply_spans_index_of_min(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_index_of_min_filter(spans, src_array, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_min_indexed(spans, src_indices, src_values, dest_array=None)
exetera.core.operations.apply_spans_last(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_max(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_min(spans, src_array, dest_array=None)
exetera.core.operations.calculate_chunk_decomposition(s_start, s_end, indices, value_chunk_size, sub_chunks)
exetera.core.operations.categorical_transform(chunk, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)

Transform method for categorical importer in readerwriter.py

exetera.core.operations.check_if_sorted_for_multi_fields(fields_data)

Check if input fields data is sorted. Note that fields_data should be treated as a group key. Each pair of adjacent rows, pre_row and cur_row, is compared field by field:

  • pre_row[j] < cur_row[j] means these two rows are sorted; move to the next row => i + 1

  • pre_row[j] = cur_row[j] means we need to check whether the next element is sorted => j + 1

  • pre_row[j] > cur_row[j] means the input data is not sorted
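The three comparison cases can be sketched as a lexicographic check over adjacent rows. This is a plain-Python model with a hypothetical function name, assuming fields_data is a sequence of equal-length columns; it is not the library's optimised implementation.

```python
def is_sorted_multi(fields_data):
    """Return True if the rows formed by zipping the columns together are in
    non-decreasing lexicographic order."""
    if not fields_data:
        return True
    n_rows = len(fields_data[0])
    for i in range(1, n_rows):
        for column in fields_data:
            pre, cur = column[i - 1], column[i]
            if pre < cur:
                break         # rows are ordered; move on to the next row
            if pre > cur:
                return False  # the input data is not sorted
            # pre == cur: a tie, so compare the next field of the same rows
    return True


sorted_keys = is_sorted_multi([[1, 1, 2], [5, 6, 0]])
unsorted_keys = is_sorted_multi([[1, 1, 2], [6, 5, 0]])
```

In the second call the first field ties on rows 0 and 1, so the decision falls to the second field, where 6 > 5 shows that the data is not sorted.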

exetera.core.operations.chunked_copy(src_field, dest_field, chunksize=1048576)
exetera.core.operations.chunks(length, chunksize=1048576)
exetera.core.operations.compare_arrays(source[s1: s2], target[t1: t2])
exetera.core.operations.compare_indexed_rows_for_journalling(old_map, new_map, old_indices, old_values, new_indices, new_values, to_keep)
exetera.core.operations.compare_rows_for_journalling(old_map, new_map, old_field, new_field, to_keep)
exetera.core.operations.count_back(array)

This is a helper function that provides functionality specific to streaming ordered merges. It takes an array in sorted order and calculates a trimmed length that excludes the final sequence of equal values. Example:

[10, 20, 30, 40, 50] -> 4 ([10, 20, 30, 40])
[10, 20, 30, 40, 40] -> 3 ([10, 20, 30])
[10, 20, 30, 30, 30] -> 2 ([10, 20])
[10, 20, 20, 20, 20] -> 1 ([10])
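A minimal pure-Python sketch that reproduces the four examples above. The library's version may differ, for example by operating on numpy arrays.

```python
def count_back(array):
    """Length of array once the trailing run of equal values is dropped."""
    if len(array) == 0:
        return 0
    last = array[-1]
    n = len(array) - 1
    # Walk backwards while the preceding element still equals the last value.
    while n > 0 and array[n - 1] == last:
        n -= 1
    return n


trimmed_lengths = [
    count_back([10, 20, 30, 40, 50]),
    count_back([10, 20, 30, 40, 40]),
    count_back([10, 20, 30, 30, 30]),
    count_back([10, 20, 20, 20, 20]),
]
```

Dropping the final run matters for streaming merges because the next chunk may begin with more copies of the same key, and a merge step needs to see all of them together.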
exetera.core.operations.data_iterator(data_field, chunksize=1048576)
exetera.core.operations.dtype_to_str(dtype)
exetera.core.operations.element_chunked_copy(src_elem, dest_elem, chunksize)
exetera.core.operations.first_trimmed_chunk(field, chunk_size)
exetera.core.operations.first_untrimmed_chunk(field, chunk_size)
exetera.core.operations.fixed_string_transform(column_inds, column_vals, column_offsets, col_idx, written_row_count, strlen, memory)

Transform method for fixed string importer in field_importer.py

exetera.core.operations.generate_ordered_map_to_inner_both_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_inner_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_inner_left_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_inner_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_inner_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)

This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.

Example:

left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]

i  j op r lres rres
0  0 <  0  0   INV
1  0 =  1  1   0
2  1 =  2  2   1
2  2    3  2   2
3  3    4  3   3
3  4    5  3   4
3  5    6  3   5
4  3    7  4   3
4  4    8  4   4
4  5    9  4   5
5  6   10  5   INV
6  6   11  6   INV


left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 0, 1, 2, 3, 4, 5, 3, 4, 5, INV, INV]

Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…

  • i and i_max are used to track the index of the left source

  • j and j_max are used to track the index of the right source

exetera.core.operations.generate_ordered_map_to_inner_right_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_inner_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_inner_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)

This function performs the most generic type of left to right mapping calculation in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping the left and right collections.

As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.

This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.

Please see the documentation for the _partial function to understand the role that the various finite state machine parameters play.

We have to make some adjustments to the finite state machine between calls to _partial:
  • if the call used all the left_ data, add the size of that data chunk to i_off

  • if the call used all of the right_ data, add the size of that data chunk to j_off

  • write the accumulated result_ data to the result field, and reset r to 0

exetera.core.operations.generate_ordered_map_to_left_both_unique(first, second, result, invalid)
exetera.core.operations.generate_ordered_map_to_left_both_unique_partial(left, right, r_result, invalid, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_left_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_left_left_unique_partial(left, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_left_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_left_partial(left, i_max, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)

This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.

Example:

left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]

i  j op r lres rres
0  0 <  0  0   INV
1  0 =  1  1   0
2  1 =  2  2   1
2  2    3  2   2
3  3    4  3   3
3  4    5  3   4
3  5    6  3   5
4  3    7  4   3
4  4    8  4   4
4  5    9  4   5
5  6   10  5   INV
6  6   11  6   INV


left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 0, 1, 2, 3, 4, 5, 3, 4, 5, INV, INV]

Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…

  • i and i_max are used to track the index of the left source

  • j and j_max are used to track the index of the right source
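The lres and rres columns of the table above can be reproduced with a short, unoptimised sketch. It processes whole lists in one pass rather than chunk by chunk, so it has none of the offset or buffer parameters of the real function. The function name and the INV sentinel value of -1 are assumptions made for illustration.

```python
INV = -1  # hypothetical sentinel standing in for the 'invalid' argument


def ordered_map_to_left(left, right, invalid=INV):
    """Left-join mapping between two sorted key lists that may both contain
    repeated keys. Returns (left_map, right_map) as parallel lists."""
    left_map, right_map = [], []
    j = 0
    for i, key in enumerate(left):
        # Skip right keys that are smaller than the current left key.
        while j < len(right) and right[j] < key:
            j += 1
        # Scan the run of matching right keys without moving j, so that a
        # repeated left key can revisit the same run.
        jj = j
        while jj < len(right) and right[jj] == key:
            left_map.append(i)
            right_map.append(jj)
            jj += 1
        if jj == j:
            # No match on the right for this left entry.
            left_map.append(i)
            right_map.append(invalid)
    return left_map, right_map


left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]
left_map, right_map = ordered_map_to_left(left, right)
```

Each pair (left_map[k], right_map[k]) corresponds to one row of the table, with INV marking left entries that have no partner on the right.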

exetera.core.operations.generate_ordered_map_to_left_remaining(i_max, l_result, r_result, i_off, i, r, invalid)
exetera.core.operations.generate_ordered_map_to_left_right_unique(first, second, result, invalid)
exetera.core.operations.generate_ordered_map_to_left_right_unique_partial(left, i_max, right, r_result, invalid, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_left_right_unique_partial_old(d_j, left, right, left_to_right, invalid)

Returns

A tuple in which [0] is how many positions forward i moved, [1] is how many positions forward j moved, and [2] is how many elements were written.

exetera.core.operations.generate_ordered_map_to_left_right_unique_remaining(i_max, r_result, i, r, invalid)
exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed_old(left, right, left_to_right, invalid=-1, chunksize=1048576)
exetera.core.operations.generate_ordered_map_to_left_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)

This function performs the most generic type of left to right mapping calculation in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping the left and right collections.

As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.

This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.

Please see the documentation for the _partial function to understand the role that the various finite state machine parameters play.

We have to make some adjustments to the finite state machine between calls to _partial:
  • if the call used all the left_ data, add the size of that data chunk to i_off

  • if the call used all of the right_ data, add the size of that data chunk to j_off

  • write the accumulated result_ data to the result field, and reset r to 0
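The streamed/_partial division of labour and the adjustments listed above can be sketched for a deliberately simplified case: sorted keys held in plain lists, with the right keys assumed unique, so only a right-index result is produced. The names map_partial, map_streamed and INVALID are hypothetical, and the real functions track more state than this model does.

```python
INVALID = -1  # hypothetical sentinel for "no match on the right"


def map_partial(left, right, result, j_off, i, j, r):
    """Advance the state (i, j, r) until the left chunk, the right chunk or
    the result buffer is exhausted, then hand control back to the caller."""
    while i < len(left) and j < len(right) and r < len(result):
        if left[i] < right[j]:
            result[r] = INVALID    # no right key can match this left key
            i += 1
            r += 1
        elif left[i] > right[j]:
            j += 1
        else:
            result[r] = j + j_off  # global index into the right keys
            i += 1
            r += 1
    return i, j, r


def map_streamed(left, right, chunksize):
    out = []                       # stands in for the result field
    result = [0] * chunksize       # fixed-size result buffer
    i_off = j_off = 0
    i = j = r = 0
    left_chunk = left[:chunksize]
    right_chunk = right[:chunksize]
    while left_chunk and right_chunk:
        i, j, r = map_partial(left_chunk, right_chunk, result, j_off, i, j, r)
        if i == len(left_chunk):   # left chunk used up: advance i_off
            i_off += len(left_chunk)
            i = 0
            left_chunk = left[i_off:i_off + chunksize]
        if j == len(right_chunk):  # right chunk used up: advance j_off
            j_off += len(right_chunk)
            j = 0
            right_chunk = right[j_off:j_off + chunksize]
        out.extend(result[:r])     # flush the buffer and reset r
        r = 0
    # Any left entries remaining after the right keys run out have no match.
    out.extend([INVALID] * (len(left) - i_off - i))
    return out


mapping = map_streamed([10, 20, 20, 30, 50], [20, 30, 40, 50], chunksize=2)
```

Because all state lives in the few integers passed into and out of map_partial, the inner loop stays simple, and the driver alone decides when to load new chunks and flush results.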

exetera.core.operations.get_byte_map(string_map)

Get byte indices and byte values from categorical key-value pairs

exetera.core.operations.get_map_datatype_based_on_lengths(left_len, right_len)
exetera.core.operations.get_map_datatype_str_based_on_lengths(left_len, right_len)
exetera.core.operations.get_map_subchunks_based_on_index_lengths(map_, invalid, chunksize)
exetera.core.operations.get_next_chunk(start: int, chunk_size: int, field: exetera.core.abstract_types.Field)

This is a helper function that provides functionality specific to streaming ordered merges. It assumes that field is in sorted order.

This function is used to fetch chunks of memory from a field to be consumed by streaming merges. It first fetches the chunk of a given chunk size, or the size of the remaining memory, whichever is smaller. It then ‘trims’ that memory by removing the last sequence of equal values from the valid range.

Parameters
  • start – The start of the chunk to be returned

  • chunk_size – The size of the chunk to be considered. The returned chunk will always be shorter than this unless it is the final chunk of the field data

  • field – The field from which data should be fetched. This field must be in sorted order

Returns

A tuple representing the range (inclusive, exclusive) and a numpy ndarray containing the data. Note that this array is typically longer than the range returned, as we do not trim the data for performance reasons.
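The fetch-then-trim behaviour can be illustrated with a small NumPy sketch. The function name fetch_trimmed_chunk is hypothetical and a plain array stands in for the Field. The sketch also assumes that a non-final chunk contains at least two distinct values; a chunk holding a single repeated value would trim to an empty range, which is not handled here.

```python
import numpy as np


def fetch_trimmed_chunk(start, chunk_size, data):
    """Return ((start, end), chunk) with the trailing run of equal values trimmed."""
    end = min(start + chunk_size, len(data))
    chunk = data[start:end]
    if end == len(data):
        return (start, end), chunk    # final chunk: nothing to trim
    valid = len(chunk) - 1            # drop the last element ...
    while valid > 0 and chunk[valid - 1] == chunk[-1]:
        valid -= 1                    # ... and every equal value before it
    return (start, start + valid), chunk


data = np.array([1, 1, 2, 2, 3, 3, 3, 4])
first_range, first_chunk = fetch_trimmed_chunk(0, 6, data)
last_range, last_chunk = fetch_trimmed_chunk(first_range[1], 6, data)
```

Here the first call fetches six values but reports the range (0, 4), so the run of 3s is left whole for the next call, which starts at position 4 and reaches the end of the data.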

exetera.core.operations.get_spans_for_field(ndarray)
exetera.core.operations.get_valid_value_extents(chunk, start, end, invalid=- 1)
exetera.core.operations.indexed_string_unique(indices, values, unique_result, unique_index, unique_inverse, unique_counts)

Find the unique elements of an indexed string field using an njit function.

exetera.core.operations.is_ordered(field)
exetera.core.operations.isin_for_indexed_string_field(test_elements, indices, values)
exetera.core.operations.isin_indexed_string_speedup(test_elements, indices, values)
exetera.core.operations.leaky_categorical_transform(chunk, freetext_indices, freetext_values, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)

Transform method for categorical importer in readerwriter.py

exetera.core.operations.map_valid(data_field, map_field, result=None, invalid=- 1)
exetera.core.operations.merge_entries_segment(i_start, cur_old_start, old_map, new_map, to_keep, old_src, new_src, dest)
Parameters
  • i_start – the initial value to apply to ‘i’

  • cur_old_start – the initial value to apply to ‘cur_old’

  • old_map – the map (in i-space) for the existing records

  • new_map – the map (in i-space) for the new records

  • to_keep – the flags (in i-space) indicating whether the new record should be kept

  • old_src – the source for the existing records

  • new_src – the source for the new records

  • dest – the sink for the merged sources

Returns

exetera.core.operations.merge_indexed_journalled_entries(old_map, new_map, to_keep, old_src_inds, old_src_vals, new_src_inds, new_src_vals, dest_inds, dest_vals)
exetera.core.operations.merge_indexed_journalled_entries_count(old_map, new_map, to_keep, old_src_inds, new_src_inds)
exetera.core.operations.merge_journalled_entries(old_map, new_map, to_keep, old_src, new_src, dest)
exetera.core.operations.next_chunk(current: int, length: int, desired: int)

This is a helper function that can be used whenever you want to access a large sequence of data in chunks. It simply carries out the calculation that returns the extents of the next chunk, taking into account the length of the sequence. The sequence itself is not required here, only the length.

Parameters
  • current – the starting point of the chunk

  • length – the length of the sequence being chunked

  • desired – the requested length of the chunk

Returns

A tuple of the chunk extents. The first value is inclusive; the second is exclusive

exetera.core.operations.next_map_subchunk(map_, sm, invalid, chunksize)
exetera.core.operations.next_trimmed_chunk(field, chunk, chunk_size)
exetera.core.operations.next_untrimmed_chunk(field, chunk, chunk_size)
exetera.core.operations.numeric_bool_transform(elements, validity, column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, field_name)

Transform method for numeric importer (bool) in readerwriter.py

exetera.core.operations.ordered_generate_journalling_indices(old, new)
exetera.core.operations.ordered_get_last_as_filter(field)
exetera.core.operations.ordered_inner_map(left, right, left_to_inner, right_to_inner)
exetera.core.operations.ordered_inner_map_both_unique(left, right, left_to_inner, right_to_inner)
exetera.core.operations.ordered_inner_map_left_unique(left, right, left_to_inner, right_to_inner)
exetera.core.operations.ordered_inner_map_left_unique_partial(d_i, d_j, left, right, left_to_inner, right_to_inner)

Returns
  • [0] – how many positions forward i moved

  • [1] – how many positions forward j moved

  • [2] – how many elements were written

exetera.core.operations.ordered_inner_map_left_unique_streamed(left, right, left_to_inner, right_to_inner, chunksize=1048576)
exetera.core.operations.ordered_inner_map_result_size(left, right)
exetera.core.operations.ordered_left_map_result_size(left, right)
exetera.core.operations.ordered_map_valid_indexed_partial(sm_values, sm_start, sm_end, indices, i_start, i_max, values, mv_start, result_indices, result_values, invalid, sm, ri, rv, ri_accum)
exetera.core.operations.ordered_map_valid_indexed_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576, value_factor=8)
exetera.core.operations.ordered_map_valid_partial(values, map_values, sm_start, sm_end, d_start, result_data, invalid, invalid_value)
exetera.core.operations.ordered_map_valid_partial_old(d, data_field, map_field, result, invalid)
exetera.core.operations.ordered_map_valid_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)
  • for each map chunk

    • calculate sub chunks based on indices

    • for each sub chunk

      • map indices for sub chunk

exetera.core.operations.ordered_map_valid_stream_old(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)
exetera.core.operations.ordered_outer_map_result_size_both_unique(left, right)
exetera.core.operations.raiseNumericException(exception_message, exception_args)
exetera.core.operations.safe_map_indexed_values(data_indices, data_values, map_field, map_filter, empty_value=None)
exetera.core.operations.safe_map_values(data_field, map_field, map_filter, empty_value=None)
exetera.core.operations.str_to_dtype(str_dtype)
exetera.core.operations.streaming_sort_merge(src_index_f, src_value_f, tgt_index_f, tgt_value_f, segment_length, chunk_length)
exetera.core.operations.streaming_sort_partial(in_chunk_indices, in_chunk_lengths, src_value_chunks, src_index_chunks, dest_value_chunk, dest_index_chunk)
exetera.core.operations.transform_float(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)

Transform float method for numeric importer in field_importer.py

exetera.core.operations.transform_int(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)

Transform int method for numeric importer in field_importer.py

exetera.core.operations.transform_to_values(column_inds, column_vals, column_offsets, col_idx, written_row_count)

Transform method for byte data from np.int to np.bytes_

exetera.core.operations.unique_for_indexed_string(indices, values, return_index, return_inverse, return_counts)

Find the unique elements for indexed string field.

exetera.core.persistence module

class exetera.core.persistence.DataStore(chunksize=1048576, timestamp='2022-04-05 17:12:36.942412+00:00')

Bases: object

aggregate_count(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)
aggregate_custom(predicate, fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)
aggregate_first(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)
aggregate_last(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)
aggregate_max(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)
aggregate_min(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)
apply_filter(filter_to_apply, reader, writer=None)
apply_indices(indices_to_apply, reader, writer=None)
apply_sort(index, reader, writer=None)
apply_spans_concat(spans, reader, writer)
apply_spans_count(spans, _, writer=None)
apply_spans_first(spans, reader, writer)
apply_spans_index_of_first(spans, writer=None)
apply_spans_index_of_last(spans, writer=None)
apply_spans_index_of_max(spans, reader, writer=None)
apply_spans_index_of_min(spans, reader, writer=None)
apply_spans_last(spans, reader, writer)
apply_spans_max(spans, reader, writer)
apply_spans_min(spans, reader, writer)
chunks(length, chunksize=None)
dataset_sort(readers, index=None)
distinct(field=None, fields=None, filter=None)
get_categorical_writer(group, name, categories, timestamp=None, writemode='write')
get_compatible_writer(field, dest_group, dest_name, timestamp=None, writemode='write')
get_existing_writer(field, timestamp=None)
get_fixed_string_writer(group, name, width, timestamp=None, writemode='write')
get_index(target, foreign_key, destination=None)
get_indexed_string_writer(group, name, timestamp=None, writemode='write')
get_numeric_writer(group, name, dtype, timestamp=None, writemode='write')
get_or_create_group(group, name)
get_reader(field)
get_shared_index(keys)
get_spans(field=None, fields=None)
get_timestamp_writer(group, name, timestamp=None, writemode='write')
get_trash_group(group)
index_spans(spans)
join(destination_pkey, fkey_indices, values_to_join, writer=None, fkey_index_spans=None)
predicate_and_join(predicate, destination_pkey, fkey_indices, reader=None, writer=None, fkey_index_spans=None)
process(inputs, outputs, predicate)
set_timestamp(timestamp='2022-04-05 17:12:36.942424+00:00')
sort_on(src_group, dest_group, keys, fields=None, timestamp=None, write_mode='write')
temp_filename()
exetera.core.persistence.dataset_merge_sort(group, index, fields)
exetera.core.persistence.filter_duplicate_fields(field)
exetera.core.persistence.filtered_iterator(values, filter, default=nan)
exetera.core.persistence.foreign_key_is_in_primary_key(primary_key, foreign_key)
exetera.core.persistence.temp_dataset()
exetera.core.persistence.timestamp_to_date(values)
exetera.core.persistence.try_str_to_bool(value, invalid=0)
exetera.core.persistence.try_str_to_float(value, invalid=0)
exetera.core.persistence.try_str_to_float_to_int(value, invalid=0)
exetera.core.persistence.try_str_to_int(value, invalid=0)

exetera.core.readerwriter module

class exetera.core.readerwriter.CategoricalImporter(datastore, group, name, categories, timestamp=None, write_mode='write')

Bases: object

chunk_factory(length)
flush()
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)
write(values)
write_part(values)
write_strings(values)
class exetera.core.readerwriter.CategoricalReader(datastore, field)

Bases: exetera.core.readerwriter.Reader

dtype()
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')
class exetera.core.readerwriter.CategoricalWriter(datastore, group, name, categories, timestamp=None, write_mode='write')

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)
flush()
write(values)
write_part(values)
class exetera.core.readerwriter.DateTimeImporter(datastore, group, name, create_day_field=False, optional=True, timestamp=None, write_mode='write')

Bases: object

chunk_factory(length)
flush()
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)
write(values)
write_part(values)
class exetera.core.readerwriter.DateTimeWriter(datastore, group, name, timestamp=None, write_mode='write')

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)
flush()
write(values)
write_part(values)
class exetera.core.readerwriter.DateWriter(datastore, group, name, timestamp=None, write_mode='write')

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)
flush()
write(values)
write_part(values)
class exetera.core.readerwriter.FixedStringReader(datastore, field)

Bases: exetera.core.readerwriter.Reader

dtype()
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')
class exetera.core.readerwriter.FixedStringWriter(datastore, group, name, strlen, timestamp=None, write_mode='write')

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)
flush()
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)
write(values)
write_part(values)
class exetera.core.readerwriter.IndexedStringReader(datastore, field)

Bases: exetera.core.readerwriter.Reader

dtype()
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')
sort(index, writer)
class exetera.core.readerwriter.IndexedStringWriter(datastore, group, name, timestamp=None, write_mode='write')

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)
flush()
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)
write(values)
write_part(values)

Writes a list of strings in indexed string form to a field.

Parameters

values – a list of utf8 strings

write_part_raw(index, values)
write_raw(index, values)
class exetera.core.readerwriter.LeakyCategoricalImporter(datastore, group, name, categories, out_of_range, timestamp=None, write_mode='write')

Bases: object

chunk_factory(length)
flush()
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)
write(values)
write_part(values)
class exetera.core.readerwriter.NumericImporter(datastore, group, name, nformat, parser, invalid_value=0, validation_mode='allow_empty', create_flag_field=True, flag_field_suffix='_valid', timestamp=None, write_mode='write')

Bases: object

chunk_factory(length)
flush()
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)
write(values)
write_part(values)

Given a list of strings, parse the strings and write the parsed values. Values that cannot be parsed are written out as zero for the values, and zero for the flags to indicate that that entry is not valid.

Parameters

values – a list of strings to be parsed

class exetera.core.readerwriter.NumericReader(datastore, field)

Bases: exetera.core.readerwriter.Reader

dtype()
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')
class exetera.core.readerwriter.NumericWriter(datastore, group, name, nformat, timestamp=None, write_mode='write')

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)
flush()
write(values)
write_part(values)
class exetera.core.readerwriter.OptionalDateImporter(datastore, group, name, create_day_field=False, optional=True, timestamp=None, write_mode='write')

Bases: object

chunk_factory(length)
flush()
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)
write(values)
write_part(values)
class exetera.core.readerwriter.Reader(field)

Bases: object

class exetera.core.readerwriter.TimestampReader(datastore, field)

Bases: exetera.core.readerwriter.Reader

dtype()
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')
class exetera.core.readerwriter.TimestampWriter(datastore, group, name, timestamp=None, write_mode='write')

Bases: exetera.core.readerwriter.Writer

chunk_factory(length)
flush()
write(values)
write_part(values)
class exetera.core.readerwriter.Writer(datastore, group, name, write_mode, attributes)

Bases: object

property chunksize
flush()

exetera.core.regression module

exetera.core.regression.check_row(exp_ds, exp_index, act_ds, act_index, keys, custom_checks)
exetera.core.regression.datetime_compare_to_secs(value1, value2)
exetera.core.regression.na_compare(value1, value2)
exetera.core.regression.na_or_value(value)

exetera.core.session module

class exetera.core.session.Session(chunksize: int = 1048576, timestamp: str = '2022-04-05 17:12:36.959786+00:00')

Bases: exetera.core.abstract_types.AbstractSession

Session is the top-level object that is used to create and open ExeTera Datasets. It also provides operations that can be performed on Fields. For a more detailed explanation of Session and examples of its usage, please refer to https://github.com/KCL-BMEIS/ExeTera/wiki/Session-API

Parameters
  • chunksize – Change the default chunksize that fields created with this dataset use. Note this is a hint parameter and future versions of Session may choose to ignore it if it is no longer required. In general, it should only be changed for testing.

  • timestamp – Set the official timestamp for the Session’s creation rather than taking the current date/time.

aggregate_count(index, dest=None)

Finds the number of entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Result: 3     2   1 1 2   3
Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which count is applied.

  • dest – If set, a Field to which the resulting counts are written

Returns

A numpy array containing the resulting values
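The example above can be reproduced with a short NumPy sketch. This is an illustration of the calculation, not ExeTera's implementation; it finds where each run of equal values starts and takes the difference between consecutive boundaries.

```python
import numpy as np

index = np.array(list("aaabbxaccddd"))

# positions at which a new run of equal values starts
starts = np.flatnonzero(index[1:] != index[:-1]) + 1
boundaries = np.concatenate(([0], starts, [len(index)]))

# the length of each run is the gap between consecutive boundaries
counts = np.diff(boundaries)
```

Note that the second run of 'a' is counted separately from the first, as in the example: sub-groups are contiguous runs, not distinct values.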

aggregate_custom(predicate, index, target=None, dest=None)
aggregate_first(index, target=None, dest=None)

Finds the first entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 1     4   6 7 8   0

Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which first is applied.

  • target – A numpy array to which the index is applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_last(index, target=None, dest=None)

Finds the last entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 3     5   6 7 9   2
Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which last is applied.

  • target – A numpy array to which the index is applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_max(index, target=None, dest=None)

Finds the maximum value within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 3     5   6 7 9   2

Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which max is applied.

  • target – A numpy array to which the index is applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_min(index, target=None, dest=None)

Finds the minimum value within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 1     4   6 7 8   0
Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which min is applied.

  • target – A numpy array to which the index is applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values
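The first, last, max and min aggregations can all be sketched in NumPy from the start positions of the runs in index. This is an illustrative sketch using the Index and Target values from the max and min examples, not ExeTera's implementation.

```python
import numpy as np

index = np.array(list("aaabbxaccddd"))
target = np.array([1, 2, 3, 5, 4, 6, 7, 8, 9, 2, 1, 0])

# start and end (exclusive) position of each run of equal index values
run_starts = np.concatenate(([0], np.flatnonzero(index[1:] != index[:-1]) + 1))
run_ends = np.concatenate((run_starts[1:], [len(index)]))

firsts = target[run_starts]
lasts = target[run_ends - 1]
# reduceat applies the reduction to each slice target[start:next_start]
minima = np.minimum.reduceat(target, run_starts)
maxima = np.maximum.reduceat(target, run_starts)
```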

apply_filter(filter_to_apply, src, dest=None)

Apply a filter to a src field. The filtered field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.

Parameters
  • filter_to_apply – the filter to be applied to the source field, an array of boolean

  • src – the field to be filtered

  • dest – optional - a field to write the filtered data to

Returns

the filtered values
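The sketch below illustrates why an indexed string field yields indices and values separately when filtered. It assumes, for illustration only, that an indexed string field is stored as an array of n+1 byte offsets plus one flat byte array; the function name filter_indexed_strings is hypothetical and this is not ExeTera's code.

```python
import numpy as np


def filter_indexed_strings(indices, values, keep):
    lengths = np.diff(indices)                    # byte length of each entry
    new_indices = np.concatenate(([0], np.cumsum(lengths[keep])))
    byte_mask = np.repeat(keep, lengths)          # expand the filter to bytes
    return new_indices, values[byte_mask]


# "a", "bb", "ccc" stored as offsets plus a flat byte array (assumed layout)
indices = np.array([0, 1, 3, 6])
values = np.frombuffer(b"abbccc", dtype=np.uint8)
keep = np.array([True, False, True])

# a plain numeric field only needs a boolean mask
numeric = np.array([10, 20, 30])[keep]
new_indices, new_values = filter_indexed_strings(indices, values, keep)
```

The filtered offsets have to be rebuilt from the lengths of the kept entries, which is why the two arrays are handled, and returned, as a pair.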

apply_index(index_to_apply, src, dest=None)

Apply an index to a src field. The indexed field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.

Parameters
  • index_to_apply – the index to be applied to the source field, must be one of Group, Field, or ndarray

  • src – the field to be indexed

  • dest – optional - a field to write the indexed data to

Returns

the indexed values

apply_spans_concat(spans, target, dest, src_chunksize=None, dest_chunksize=None, chunksize_mult=None)
apply_spans_count(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the number of entries within each span.

Parameters
  • spans – the numpy array of spans to be applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_first(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the first entry within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_first(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the first entry within each span.

Parameters
  • spans – the numpy array of spans to be applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_last(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the last entry within each span.

Parameters
  • spans – the numpy array of spans to be applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the maximum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the minimum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values
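The four index-of operations can be sketched in NumPy as follows. The spans array follows the layout described for get_spans (span i runs from element i, inclusive, to element i+1, exclusive). How ExeTera breaks ties between equal values is not stated here; this sketch takes the first occurrence, which is what argmin and argmax return.

```python
import numpy as np

spans = np.array([0, 3, 5, 6, 7, 9, 12])
target = np.array([1, 2, 3, 5, 4, 6, 7, 8, 9, 2, 1, 0])

index_of_first = spans[:-1]
index_of_last = spans[1:] - 1

# argmin/argmax are relative to the span, so add the span start back on
pairs = list(zip(spans[:-1], spans[1:]))
index_of_min = np.array([s + np.argmin(target[s:e]) for s, e in pairs])
index_of_max = np.array([s + np.argmax(target[s:e]) for s, e in pairs])
```

Indexing target with any of these arrays, for example target[index_of_max], gives the corresponding per-span values.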

apply_spans_last(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the last entry within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the maximum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the minimum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

chunks(length: int, chunksize: Optional[int] = None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

‘chunks’ is a convenience method that, given an overall length and a chunksize, will yield a set of ranges for the chunks in question, e.g. chunks(1048576, 500000) -> (0, 500000), (500000, 1000000), (1000000, 1048576)

Parameters
  • length – The range to be split into chunks

  • chunksize – Optional parameter detailing the size of each chunk. If not set, the chunksize that the Session was initialized with is used.
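A minimal stand-alone sketch of the same calculation is shown below. The generator name chunk_ranges is hypothetical, and unlike the method it takes chunksize as a required argument rather than falling back to the Session's value.

```python
def chunk_ranges(length, chunksize):
    """Yield (start, end) ranges covering range(length) in steps of chunksize."""
    start = 0
    while start < length:
        end = min(start + chunksize, length)
        yield start, end
        start = end


ranges = list(chunk_ranges(1048576, 500000))
```

Each range is inclusive at the start and exclusive at the end, so it can be used directly as a slice.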

close()

Close all open datasets.

Returns

None

close_dataset(name: str)

Close the dataset with the given name. If there is no dataset with that name, do nothing.

Parameters

name – The name of the dataset to be closed

Returns

None

create_categorical(group, name, nformat, key, timestamp=None, chunksize=None)

Create a categorical field in the given DataFrame with the given name. This function also takes a numerical format for the numeric representation of the categories, and a key that maps numeric values to their string descriptions.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64). It is recommended to use ‘int8’.

  • key – A dictionary that maps numerical values to their string representations

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_fixed_string(group, name, length, timestamp=None, chunksize=None)

Create a fixed string field in the given DataFrame with the given name and the given maximum string length per entry.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • length – The maximum length in bytes that each entry can have.

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_indexed_string(group, name, timestamp=None, chunksize=None)

Create an indexed string field in the given DataFrame with the given name.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_like(field, dest_group, dest_name, timestamp=None, chunksize=None)

Create a field of the same type as an existing field, in the location and with the name provided.

Example:

with Session() as s:
  ...
  a = s.get(table_1['a'])
  b = s.create_like(a, table_2, 'a_times_2')
  b.data.write(a.data[:] * 2)
Parameters
  • field – The Field whose type is to be copied

  • dest_group – The group in which the new field should be created

  • dest_name – The name of the new field

create_numeric(group, name, nformat, timestamp=None, chunksize=None)

Create a numeric field in the given DataFrame with the given name.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64). It is recommended to avoid uint64 as certain operations in numpy cause conversions to floating point values.

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_timestamp(group, name, timestamp=None, chunksize=None)

Create a timestamp field in the given group with the given name.

dataset_sort_index(sort_indices, index=None)

Generate a sorted index based on a set of fields upon which to sort and an optional index to apply to the sort_indices.

Parameters
  • sort_indices – a tuple or list of indices that determine the sorted order

  • index – optional - the index by which the initial field should be permuted

Returns

the resulting index that can be used to permute unsorted fields

distinct(field=None, fields=None, filter=None)
get(field: Union[exetera.core.abstract_types.Field, h5py._hl.group.Group])

Get a Field from a h5py Group.

Example:

# this code for context
with Session() as s:

  # open a dataset about wildlife
  src = s.open_dataset("/my/wildlife/dataset.hdf5", "r", "src")

  # fetch the group containing bird data
  birds = src['birds']

  # get the bird decibel field
  bird_decibels = s.get(birds['decibels'])
Parameters

field – The Field or Group object to retrieve.

get_dataset(name: str)

Get the dataset with the given name. If there is no dataset with that name, raise a KeyError indicating that the dataset with that name is not present.

Parameters

name – Name of the dataset to be fetched. This is the name that was given to it when it was opened through open_dataset().

Returns

Dataset with that name.

get_index(target, foreign_key, destination=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please make use of the DataFrame.merge functionality instead. This method can be emulated by adding an index (via np.arange) to a dataframe, performing a merge and then fetching the mapped index field.

‘get_index’ maps a primary key (‘target’) into the space of a foreign key (‘foreign_key’).

get_or_create_group(group: Union[h5py._hl.group.Group, h5py._hl.files.File], name: str)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

get_shared_index(keys: Tuple[numpy.ndarray])

Create a shared index based on a tuple of numpy arrays containing keys. This function generates the sorted union of a tuple of key fields and then maps the individual arrays to their corresponding indices in the sorted union.

Parameters

keys – a tuple of groups, fields or ndarrays whose contents represent keys

Example:

key_1 = ['a', 'b', 'e', 'g', 'i']
key_2 = ['b', 'b', 'c', 'c', 'e', 'g', 'j']
key_3 = ['a', 'c', 'd', 'e', 'g', 'h', 'h', 'i']

sorted_union = ['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j']

key_1_index = [0, 1, 4, 5, 7]
key_2_index = [1, 1, 2, 2, 4, 5, 8]
key_3_index = [0, 2, 3, 4, 5, 6, 6, 7]
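
The example above can be reproduced with NumPy as a sketch of the calculation (not ExeTera's implementation): np.unique on the concatenated keys gives the sorted union, and its inverse mapping, split back into the original lengths, gives the per-key indices.

```python
import numpy as np

key_1 = np.array(['a', 'b', 'e', 'g', 'i'])
key_2 = np.array(['b', 'b', 'c', 'c', 'e', 'g', 'j'])
key_3 = np.array(['a', 'c', 'd', 'e', 'g', 'h', 'h', 'i'])
keys = (key_1, key_2, key_3)

# sorted union of all keys, plus each key's position within that union
sorted_union, inverse = np.unique(np.concatenate(keys), return_inverse=True)

# cut the inverse mapping back into one array per input key
split_points = np.cumsum([len(k) for k in keys])[:-1]
key_1_index, key_2_index, key_3_index = np.split(inverse, split_points)
```
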
get_spans(field: Union[exetera.core.abstract_types.Field, numpy.ndarray] = None, dest: exetera.core.abstract_types.Field = None, **kwargs)

Calculate a set of spans that indicate contiguous equal values. The entries in the result array correspond to the inclusive start and exclusive end of the span (the ith span is represented by element i and element i+1 of the result array). The last entry of the result array is the length of the source field.

Only one of ‘field’ or ‘fields’ may be set. If ‘fields’ is used and more than one field specified, the fields are effectively zipped and the check for spans is carried out on each corresponding tuple in the zipped field.

Example:

field: [1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2]
result: [0, 1, 3, 6, 7, 10, 15]
Parameters
  • field – A Field or numpy array to be evaluated for spans

  • dest – A destination Field to store the result

  • **kwargs – See below. For parameters set in both argument and kwargs, use kwargs

Keyword Arguments
  • field – Same as the field parameter, for callers who pass field as a keyword

  • fields – A tuple of Fields or tuple of numpy arrays to be evaluated for spans

  • dest – Same as the dest parameter, for callers who pass dest as a keyword

Returns

The resulting set of spans as a numpy array
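The span calculation, including the zipped behaviour for multiple fields, can be sketched in NumPy as follows. The function name spans_sketch is hypothetical, the inputs are assumed to be non-empty and of equal length, and this is not ExeTera's implementation.

```python
import numpy as np


def spans_sketch(*fields):
    """Boundaries of runs in which every field keeps the same value."""
    length = len(fields[0])           # assumes at least one element
    changed = np.zeros(length - 1, dtype=bool)
    for f in fields:
        f = np.asarray(f)
        changed |= f[1:] != f[:-1]    # a change in any field starts a new span
    starts = np.flatnonzero(changed) + 1
    return np.concatenate(([0], starts, [length]))


field = [1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2]
single = spans_sketch(field)
zipped = spans_sketch([1, 1, 1, 2, 2], ['x', 'x', 'y', 'y', 'y'])
```

In the zipped case, a new span begins wherever the tuple of values changes, so the spans are at least as fine as those of either field taken alone.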

join(destination_pkey, fkey_indices, values_to_join, writer=None, fkey_index_spans=None)

This method is due for removal and should not be used. Please use the merge or ordered_merge functions instead.

list_datasets()

List the open datasets for this Session object. This is returned as a tuple of strings rather than the datasets themselves. The individual datasets can be fetched using get_dataset().

Example:

names = s.list_datasets()
datasets = [s.get_dataset(n) for n in names]
Returns

A tuple containing the names of the currently open datasets for this Session object

merge_inner(left_on, right_on, left_fields=None, left_writers=None, right_fields=None, right_writers=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style inner join on left_fields, outputting the result to left_writers, if set.

Parameters
  • left_on – The key on the left hand side to perform the join on

  • right_on – The key on the right hand side to perform the join on

  • left_fields – The fields to be mapped from left to inner

  • left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

  • right_fields – The fields to be mapped from right to inner

  • right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

merge_left(left_on, right_on, right_fields=(), right_writers=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style left join on right_fields, outputting the result to right_writers, if set.

Parameters
  • left_on – The key on the left hand side to perform the join on

  • right_on – The key on the right hand side to perform the join on

  • right_fields – The fields to be mapped from right to left

  • right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

merge_right(left_on, right_on, left_fields=(), left_writers=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style right join on left_fields, outputting the result to left_writers, if set.

Parameters
  • left_on – The key on the left hand side to perform the join on

  • right_on – The key on the right hand side to perform the join on

  • left_fields – The fields to be mapped from left to right

  • left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

open_dataset(dataset_path: Union[str, IO[bytes]], mode: str, name: str)

Open a dataset with the given access mode.

Parameters
  • dataset_path – the path to the dataset

  • mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.

  • name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling get_dataset().

Returns

The top-level dataset object

ordered_merge_inner(left_on, right_on, left_field_sources=(), left_field_sinks=None, right_field_sources=(), right_field_sinks=None, left_unique=False, right_unique=False)

Generate the results of an inner join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters
  • left_on – the group/field/numba array that contains the left key values

  • right_on – the group/field/numba array that contains the right key values

  • right_to_left_map – a group/field/numba array that the map is written to. If it is a numba array, it must be the size of the resulting merge

  • right_field_sources – a tuple of group/fields/numba arrays that contain the fields to be joined

  • right_field_sinks – optional - a tuple of group/fields/numba arrays that the mapped fields should be written to

  • left_unique – a hint to indicate whether the ‘left_on’ field contains unique values

  • right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If right_field_sinks is not set, a tuple of the output fields is returned
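
As a rough illustration of why sorted keys allow a streaming merge, the following sketch walks two key sequences with one cursor each and records matching index pairs. It assumes both sequences are sorted ascending; the function name and the sample data are hypothetical, and this is not ExeTera's implementation.

```python
def ordered_inner_map_sketch(left_keys, right_keys):
    """Return (left_index, right_index) pairs for equal keys.

    Both inputs are assumed to be sorted in ascending order.
    """
    i = j = 0
    pairs = []
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] < right_keys[j]:
            i += 1          # left key has no partner, skip it
        elif left_keys[i] > right_keys[j]:
            j += 1          # right key has no partner, skip it
        else:
            key = left_keys[i]
            # Find the end of the run of equal keys on the right.
            j_end = j
            while j_end < len(right_keys) and right_keys[j_end] == key:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left_keys) and left_keys[i] == key:
                for r in range(j, j_end):
                    pairs.append((i, r))
                i += 1
            j = j_end
    return pairs

left_keys = [1, 2, 2, 4]
right_keys = [2, 3, 4, 4]
right_values = ['a', 'b', 'c', 'd']

pairs = ordered_inner_map_sketch(left_keys, right_keys)
joined = [right_values[r] for _, r in pairs]
print(pairs)   # [(1, 0), (2, 0), (3, 2), (3, 3)]
print(joined)  # ['a', 'a', 'c', 'd']
```

Because each cursor only moves forward, the inputs can be consumed chunk by chunk, which is the property the scalability note above relies on.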

ordered_merge_left(left_on, right_on, right_field_sources=(), left_field_sinks=None, left_to_right_map=None, left_unique=False, right_unique=False)

Generate the results of a left join and apply it to the fields described in the tuple ‘left_field_sources’. If ‘left_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to left_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters
  • left_on – the group/field/numba array that contains the left key values

  • right_on – the group/field/numba array that contains the right key values

  • left_to_right_map – a group/field/numba array that the map is written to. If it is a numba array, it must be the size of the resulting merge

  • left_field_sources – a tuple of group/fields/numba arrays that contain the fields to be joined

  • left_field_sinks – optional - a tuple of group/fields/numba arrays that the mapped fields should be written to

  • left_unique – a hint to indicate whether the ‘left_on’ field contains unique values

  • right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If left_field_sinks is not set, a tuple of the output fields is returned
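
The idea of a left-to-right map can be sketched as an index array with one entry per left row. This is an illustration under stated assumptions rather than ExeTera's implementation: both key arrays are sorted ascending, the right keys are unique, and -1 marks a left row with no match. The helper names, the -1 sentinel and the fill value are all hypothetical.

```python
import numpy as np

def ordered_left_map_sketch(left_keys, right_keys):
    """For each left row, give the index of the matching right row, or -1."""
    left_keys = np.asarray(left_keys)
    right_keys = np.asarray(right_keys)
    mapping = np.full(len(left_keys), -1, dtype=np.int64)
    j = 0
    for i in range(len(left_keys)):
        # Advance the right cursor until it reaches or passes the left key.
        while j < len(right_keys) and right_keys[j] < left_keys[i]:
            j += 1
        if j < len(right_keys) and right_keys[j] == left_keys[i]:
            mapping[i] = j
    return mapping

def apply_map_sketch(mapping, right_values, fill):
    """Carry right-hand values into left row order, using fill where unmatched."""
    right_values = np.asarray(right_values)
    out = np.full(len(mapping), fill, dtype=right_values.dtype)
    found = mapping >= 0
    out[found] = right_values[mapping[found]]
    return out

left_keys = [1, 2, 2, 5]
right_keys = [2, 3, 5]
right_ages = [30, 40, 50]

mapping = ordered_left_map_sketch(left_keys, right_keys)
ages = apply_map_sketch(mapping, right_ages, fill=0)
print(mapping.tolist())  # [-1, 0, 0, 2]
print(ages.tolist())     # [0, 30, 30, 50]
```

Once the map exists it can be applied to any number of right-hand fields, which is why the map and the field sinks are separate parameters in the signature.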

ordered_merge_right(left_on, right_on, left_field_sources=(), right_field_sinks=None, right_to_left_map=None, left_unique=False, right_unique=False)

Generate the results of a right join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters
  • left_on – the group/field/numba array that contains the left key values

  • right_on – the group/field/numba array that contains the right key values

  • right_to_left_map – a group/field/numba array that the map is written to. If it is a numba array, it must be the size of the resulting merge

  • right_field_sources – a tuple of group/fields/numba arrays that contain the fields to be joined

  • right_field_sinks – optional - a tuple of group/fields/numba arrays that the mapped fields should be written to

  • left_unique – a hint to indicate whether the ‘left_on’ field contains unique values

  • right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If right_field_sinks is not set, a tuple of the output fields is returned

predicate_and_join(predicate, destination_pkey, fkey_indices, reader=None, writer=None, fkey_index_spans=None)

This method is due for removal and should not be used. Please use the merge or ordered_merge functions instead.

set_timestamp(timestamp: str = '2022-04-05 17:12:36.959841+00:00')

Set the default timestamp to be used when creating fields without specifying an explicit timestamp.

Parameters

timestamp – a string representing a valid Datetime

Returns

None
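
The default shown in the signature has the form produced by calling str() on a timezone-aware datetime. A minimal sketch of building such a string with the standard library follows; whether other datetime formats are accepted is not stated here, so treat this form as the safe assumption.

```python
from datetime import datetime, timezone

# Build a timestamp string in the same form as the default in the signature.
timestamp = str(datetime.now(timezone.utc))

# Round-trip check: the string parses back into a timezone-aware datetime.
parsed = datetime.fromisoformat(timestamp)

# The string would then be passed on, e.g. session.set_timestamp(timestamp),
# where session is an already constructed Session object.
```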

sort_on(src_group: h5py._hl.group.Group, dest_group: h5py._hl.group.Group, keys: Union[tuple, list], timestamp=datetime.datetime(2022, 4, 5, 17, 12, 36, 959847, tzinfo=datetime.timezone.utc), write_mode='write', verbose=True)

Sort a group (src_group) of fields by the specified set of keys, and write the sorted fields to dest_group.

Parameters
  • src_group – the group of fields that are to be sorted

  • dest_group – the group into which sorted fields are written

  • keys – fields to sort on

  • timestamp – optional - timestamp to write on the sorted fields

  • write_mode – optional - write mode to use if the destination fields already exist

Returns

None
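
The effect of sorting a group of fields on several keys can be sketched with numpy arrays standing in for fields. This is not how sort_on itself is implemented; the function name and field names are hypothetical, and the sketch assumes the first key in 'keys' is the most significant, which this page does not state.

```python
import numpy as np

def sort_fields_sketch(fields, keys):
    """Reorder every array in `fields` by the arrays named in `keys`."""
    # np.lexsort treats the LAST key it is given as the primary key,
    # so the keys are reversed to make the first one most significant.
    order = np.lexsort([np.asarray(fields[k]) for k in reversed(keys)])
    # Apply the same permutation to every field so rows stay aligned.
    return {name: np.asarray(values)[order] for name, values in fields.items()}

src = {
    'patient_id': [2, 1, 2, 1],
    'created_at': [20, 40, 10, 30],
    'value': ['a', 'b', 'c', 'd'],
}
dest = sort_fields_sketch(src, ('patient_id', 'created_at'))
print(dest['patient_id'].tolist())  # [1, 1, 2, 2]
print(dest['created_at'].tolist())  # [30, 40, 10, 20]
print(dest['value'].tolist())       # ['d', 'b', 'c', 'a']
```

Computing one permutation and applying it to each field in turn keeps memory use bounded by a single field at a time.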

temp_filename()

exetera.core.split module

exetera.core.split.assessment_splitter(input_filename, output_filename, assessment_buckets, bucket)
exetera.core.split.patient_splitter(input_filename, output_filenames, sorted_indices, bucket_size)
exetera.core.split.split_data(patient_data, assessment_data, bucket_size=500000, territories=None)

exetera.core.utils module

class exetera.core.utils.Timer(start_msg, new_line=False, end_msg='completed in')

Bases: object

exetera.core.utils.build_histogram(dataset, filtered_records=None, tx=None)
exetera.core.utils.bytearray_to_escaped(srcbytearray, destbytearray, src_start=0, src_end=None, dest_start=0, separator=b',', delimiter=b'"')
exetera.core.utils.check_input_lengths(names, fields)
exetera.core.utils.chunks(length, chunksize)
exetera.core.utils.clear_set_flag(values, to_clear)
exetera.core.utils.concatenate_maybe_strs(sequence, value, separator=',', delimiter='"')
exetera.core.utils.count_flag_empty(flags)
exetera.core.utils.count_flag_not_set(flags, flag_to_test)
exetera.core.utils.count_flag_set(flags, flag_to_test)
exetera.core.utils.datetime_to_seconds(dt)
exetera.core.utils.filter_field(fields, filter_list, f_missing, f_bad, is_type_fn, type_fn, valid_fn)
exetera.core.utils.find_longest_sequence_of(string, char)
exetera.core.utils.from_escaped(string)
exetera.core.utils.get_min_max(value_type)
exetera.core.utils.guess_encoding(filename)

Attempt to determine the encoding of the given text file by reading the byte order mark, defaulting to utf-8 if none is found.

Parameters

filename – path to a text file containing possible UTF-8, UTF-16, or UTF-32 text

Returns

encoding name, one of utf-8, utf-8-sig, utf-16, utf-32
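
Byte order mark detection of this kind can be sketched with the constants in the standard codecs module. The function below is an illustration, not the library's code; its name and the demo helper are hypothetical. Note that the UTF-32 marks have to be tested before the UTF-16 marks, because the little-endian UTF-32 mark begins with the same two bytes as the little-endian UTF-16 mark.

```python
import codecs
import os
import tempfile

def guess_encoding_sketch(filename):
    """Guess a text file's encoding from its byte order mark, if any."""
    with open(filename, 'rb') as f:
        head = f.read(4)  # the longest mark (UTF-32) is four bytes
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return 'utf-8'  # no mark found

def _demo(text, encoding):
    """Write text to a temporary file in the given encoding and guess it back."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(text.encode(encoding))
    try:
        return guess_encoding_sketch(f.name)
    finally:
        os.remove(f.name)

results = [_demo('id,name\n', enc)
           for enc in ('utf-8', 'utf-8-sig', 'utf-16', 'utf-32')]
print(results)  # ['utf-8', 'utf-8-sig', 'utf-16', 'utf-32']
```

The returned name can be passed straight to open(filename, encoding=...), since the utf-8-sig, utf-16 and utf-32 codecs all consume the mark when decoding.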

exetera.core.utils.is_float(value)
exetera.core.utils.is_int(value)
exetera.core.utils.list_to_escaped(strings)
exetera.core.utils.map_between_categories(first_map, second_map)
exetera.core.utils.one_dim_data_to_indexed_for_test(data, field_size)
exetera.core.utils.print_diagnostic_row(preamble, ds, ir, keys, fns=None)
exetera.core.utils.replace_if_invalid(replacement)
exetera.core.utils.sort_mixed_list(values, check_fn, sort_fn)
exetera.core.utils.string_to_datetime(field)
exetera.core.utils.timestamp_to_day(field)
exetera.core.utils.to_categorical(field, transform)
exetera.core.utils.to_escaped(string, separator=',', delimiter='"')
exetera.core.utils.to_float(value)
exetera.core.utils.to_int(value)
exetera.core.utils.valid_range_fac(f_min, f_max, default_value='')
exetera.core.utils.valid_range_fac_inc(f_min, f_max, default_value='')
exetera.core.utils.validate_file_exists(file_name)

exetera.core.validation module

exetera.core.validation.all_same_basic_type(name, fields)
exetera.core.validation.array_from_field_or_lower(name, field)
exetera.core.validation.array_from_parameter(session, name, field)
exetera.core.validation.ensure_valid_field(name, field)
exetera.core.validation.ensure_valid_field_like(name, field)
exetera.core.validation.field_from_parameter(session, name, field)
exetera.core.validation.is_field_parameter(field)
exetera.core.validation.raw_array_from_parameter(datastore, name, field)
exetera.core.validation.validate_all_field_length_in_df(df: exetera.core.abstract_types.DataFrame)
exetera.core.validation.validate_and_get_key_fields(side, df, key)
exetera.core.validation.validate_and_normalize_categorical_key(param_name, key)
exetera.core.validation.validate_boolean_row_filter(name, field)
exetera.core.validation.validate_chunk_size(chunk_size_name, chunk_size)
exetera.core.validation.validate_field_lengths(side, lens, df, names=None)
exetera.core.validation.validate_filter(filter_to_apply)
exetera.core.validation.validate_groupby_target(target, by, all)
exetera.core.validation.validate_key_field_consistency(lname, rname, lkey, rkey)
exetera.core.validation.validate_key_lengths(side, df, key)
exetera.core.validation.validate_require_key(context, key, dictionary)
exetera.core.validation.validate_selected_keys(by, all)

Module contents