exetera.core package

Submodules

exetera.core.data_writer module

class exetera.core.data_writer.DataWriter

Bases: object

static clear_dataset(parent_group, name)
static create_group(parent_group, name, attrs)
static flush(group)
static write(group, name, field, count, dtype=None)
static write_additional(group, name, field, count)
static write_first(group, name, field, count, dtype=None)

exetera.core.dataset module

class exetera.core.dataset.HDF5Dataset(session, dataset_path, mode, name)

Bases: exetera.core.abstract_types.Dataset

Dataset is the means by which you interact with an ExeTera datastore. Datasets are created and loaded through Session.open_dataset, rather than being constructed directly.

Datasets are composed of one or more DataFrame objects and provide the means by which those DataFrames are accessed.

For a detailed explanation of Dataset along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/Dataset-API

Parameters
  • session – The session instance to which this dataset belongs.

  • dataset_path – The path of the HDF5 file.

  • mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.

  • name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling get_dataset().

Returns

An HDF5Dataset instance.

close()

Close the underlying HDF5 file.

contains_dataframe(dataframe: exetera.core.abstract_types.DataFrame)

Check whether a dataframe is contained in this dataset, by dataframe object.

Parameters

dataframe – the dataframe object to check

Returns

True if the dataframe is contained in this dataset; otherwise False.

copy(dataframe, name)

Add an existing dataframe (from another dataset) to this dataset, writing its group attributes and HDF5 datasets to this dataset.

Parameters
  • dataframe – the dataframe to copy to this dataset.

  • name – optional - the new name for the dataframe.

Returns

None if the operation is successful; otherwise an error is raised.

create_dataframe(name: str, dataframe: Optional[exetera.core.abstract_types.DataFrame] = None)

Create a new DataFrame object as a part of this Dataset.

Parameters
  • name – name of the dataframe

  • dataframe – if set, this is a dataframe object whose contents are duplicated

Returns

a dataframe object

create_group(name: str)

This method is a wrapper around create_dataframe().

delete_dataframe(dataframe: exetera.core.abstract_types.DataFrame)

Remove a dataframe from this dataset by dataframe object.

Parameters

dataframe – The dataframe instance to delete.

Returns

Boolean indicating whether the dataframe was deleted.

drop(name: str)
get_dataframe(name: str)

Get a dataframe by name, e.g. dataset.get_dataframe(dataframe_name).

Parameters

name – The name of the dataframe.

Returns

The dataframe; an error is raised if the name does not exist in this dataset.

items()

Return the (name, dataframe) tuples in this dataset.

keys()

Return all dataframe names in this dataset.

require_dataframe(name)

Get a dataframe, creating it if it doesn’t exist.

Parameters

name – name of the dataframe

property session

The session property interface.

Returns

The _session instance.

values()

Return all dataframe instances in this dataset.

exetera.core.dataset.copy(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)

Copy a dataframe to another dataset via HDF5DataFrame.copy(ds1[‘df1’], ds2, ‘df1’).

Parameters
  • dataframe – The dataframe to copy.

  • dataset – The destination dataset.

  • name – The name of dataframe in destination dataset.

exetera.core.dataset.move(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)

Move a dataframe to another dataset via HDF5DataFrame.move(ds1[‘df1’], ds2, ‘df1’). If the move is within the same dataset, e.g. HDF5DataFrame.move(ds1[‘df1’], ds1, ‘df2’), it functions as a rename for both the dataframe and the HDF5Group. However, to

Parameters
  • dataframe – The dataframe to move.

  • dataset – The destination dataset.

  • name – The name of dataframe in destination dataset.

exetera.core.dataframe module

class exetera.core.dataframe.HDF5DataFrame(dataset: exetera.core.abstract_types.Dataset, name: str, h5group: h5py._hl.group.Group)

Bases: exetera.core.abstract_types.DataFrame

DataFrame is the means by which you interact with the data in an ExeTera datastore. DataFrames are created and loaded through Dataset.create_dataframe and other methods, rather than being constructed directly.

DataFrames closely resemble Pandas DataFrames, but with a number of key differences:

  1. Instead of Series, DataFrames are composed of Field objects.

  2. DataFrames can store fields of differing lengths, although all fields must be of the same length when performing certain operations such as merges.

  3. ExeTera DataFrames do not (yet) have the ability to create filtered views onto an underlying DataFrame, although this functionality will be added in upcoming releases.

For a detailed explanation of DataFrame along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/DataFrame-API

Parameters
  • name – name of the dataframe.

  • dataset – the dataset object to which this dataframe belongs.

  • h5group – the h5group object in which the fields are stored. If the h5group is not empty, data is acquired from the h5group object directly. The h5group structure is h5group<-h5group-dataset; the latter group has a ‘fieldtype’ attribute and only one dataset, named ‘values’, so the structure is mapped to Dataframe<-Field-Field.data automatically.

  • dataframe – optional - replicate data from another dictionary of (name:str, field: Field).

add(field: exetera.core.abstract_types.Field)

Add a field to this dataframe as well as the HDF5 Group.

Parameters

field – the field to add to this dataframe; the underlying dataset is copied

apply_filter(filter_to_apply, ddf=None)

Apply a filter to all fields in this dataframe. Returns the filtered dataframe (itself) or a new target (destination) dataframe.

Example:

df = ... # df contains a field ('foo') with data: ["a", "b", "c", "d", "e", "f", "g"]

# apply boolean filter to dataframe in place
bfilter = np.array([0, 1, 0, 1, 0, 1, 1], dtype='bool')
df.apply_filter(bfilter)
print(df['foo'].data[:])     # prints ["b", "d", "f", "g"]

# apply numeric filter to dataframe and store filtered result to designated dataframe
nfilter = np.array([0, 1, 0, 1, 0, 1, 1])  # one entry per row of the original seven-row 'foo'
df.apply_filter(nfilter, ddf = df2)
print(df2['foo'].data[0:10]) # prints ["b", "d", "f", "g"]
Parameters
  • filter_to_apply – the filter to be applied to the source fields, an array of booleans

  • ddf – optional- the destination data frame

Returns

a dataframe containing all the filtered fields; self if ddf is not set
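
The in-place and destination behaviours can be illustrated without ExeTera. The following is a minimal plain-Python sketch of the semantics, not the library's implementation: apply_filter_sketch is a hypothetical helper and a dict of lists stands in for a dataframe. Raising on a length mismatch is this sketch's choice.

```python
def apply_filter_sketch(fields, keep, dest=None):
    """Keep the rows whose flag is truthy, in every field.

    `fields` is a dict of equal-length lists standing in for a dataframe.
    With dest=None the filter is applied in place and `fields` is returned;
    otherwise the filtered lists are written to `dest` and `fields` is untouched.
    """
    for name, values in fields.items():
        if len(values) != len(keep):
            raise ValueError(f"filter length does not match field '{name}'")
    target = fields if dest is None else dest
    for name, values in list(fields.items()):
        target[name] = [v for v, k in zip(values, keep) if k]
    return target


df = {"foo": ["a", "b", "c", "d", "e", "f", "g"]}
df2 = {}
keep = [False, True, False, True, False, True, True]

apply_filter_sketch(df, keep, dest=df2)   # df is left unchanged
print(df2["foo"])                         # ['b', 'd', 'f', 'g']
print(len(df["foo"]))                     # 7

apply_filter_sketch(df, keep)             # filters df itself
print(df["foo"])                          # ['b', 'd', 'f', 'g']
```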

apply_index(index_to_apply, ddf=None)

Apply an index to all fields in this dataframe. Returns the re-indexed dataframe (itself) or a new target (destination) dataframe.

Example:

df = ... # df contains a field ('foo') with data: ["a", "b", "c", "d", "e"]

# apply index inplace
index = np.array([4, 3, 2, 1, 0])
df.apply_index(index)
print(df['foo'].data[:])     # prints ["e", "d", "c", "b", "a"]

# apply index and store new result to designated dataframe
df.apply_index(index, ddf=df2)
print(df2['foo'].data[0:10]) # prints ["e", "d", "c", "b", "a"]
Parameters
  • index_to_apply – the index to be applied to the fields, an ndarray of integers

  • ddf – optional- the destination data frame

Returns

a dataframe containing all the re-indexed fields; self if ddf is not set

property columns

The columns property interface. Columns is a dictionary that stores the fields as (field_name, field_object) pairs. The field_name is field.name without the ‘/’ prefix and the HDF5 group name.

contains_field(field)

Check whether the dataframe contains a field, by field object.

Parameters

field – the field object to check. Returns a tuple(bool, str); the str is the name stored in the dataframe.

Returns

bool value indicating whether this DataFrame contains a Field

create_categorical(name: str, nformat: int, key: dict, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a categorical type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#categoricalfield for a detailed description of categorical fields

Parameters
  • name – name of field to be created

  • nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64). It is recommended to use ‘int8’.

  • timestamp – optional - If set, the timestamp that should be given to the new field.

  • chunksize – optional - If set, the chunksize that should be used to create the new field.

Returns

a newly created categorical type field

create_fixed_string(name: str, length: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a fixed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#fixedstringfield for a detailed description of fixed string fields

Parameters
  • name – name of field to be created

  • timestamp – optional - If set, the timestamp that should be given to the new field.

  • chunksize – optional - If set, the chunksize that should be used to create the new field.

Returns

a newly created fixed string type field

create_group(name: str)

Create a group object in HDF5 file for field to use. Please note, this function is for backwards compatibility with older scripts and should not be used in the general case.

Parameters

name – the name of the group and field

Returns

a hdf5 group object

create_indexed_string(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create an indexed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#indexedstringfield for a detailed description of indexed string fields

Parameters
  • name – name of field to be created

  • timestamp – optional - If set, the timestamp that should be given to the new field.

  • chunksize – optional - If set, the chunksize that should be used to create the new field.

Returns

a newly created indexed string type field

create_numeric(name: str, nformat: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a numeric type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#numericfield for a detailed description of numeric fields

Parameters
  • name – name of field to be created

  • nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64). It is recommended to avoid uint64 as certain operations in numpy cause conversions to floating point values.

  • timestamp – optional - If set, the timestamp that should be given to the new field.

  • chunksize – optional - If set, the chunksize that should be used to create the new field.

Returns

a newly created numeric type field

create_timestamp(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)

Create a timestamp type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#timestampfield for a detailed description of timestamp fields

Parameters
  • name – name of field to be created

  • timestamp – optional - If set, the timestamp that should be given to the new field.

  • chunksize – optional - If set, the chunksize that should be used to create the new field.

Returns

a newly created timestamp type field

property dataset

The dataset property interface.

delete_field(field)

Remove a field from this dataframe by field object.

Parameters

field – The field to delete from this dataframe.

Returns

None.

describe(include=None, exclude=None, output='terminal')

Show the basic statistics of the data in each field.

Example:

df = ... # df contains three fields:
         # field "foo" with data [1, 0, 0, 1, 1]
         # field "bar" with data ["b", "b", "a", "a", "b"]
         # field "baz" with data [3.5, 6.0, 4.2, 7.2, 5.5]

# Display statistics results in stdout by default,
# and return a dataframe that contains the statistics results.
result = df.describe()
# Statistics results displayed
#
# fields    foo          baz
# ---------------------------------
# count      5           5
# mean       0.60        5.28
# std        0.49        1.31
# min        0.00        3.50
# 25%        0.00        3.51
# 50%        0.00        3.51
# 75%        0.00        3.52
# max        1.00        7.20


# Do not display the statistics results
result = df.describe(output='None')


# Include multiple fields
result = df.describe(include=['foo', 'bar', 'baz'])
# Statistics results displayed
#
# fields              foo             bar             baz
# --------------------------------------------------------
# count                 5               5               5
# unique              NaN               2             NaN
# top                 NaN            b'b'             NaN
# freq                NaN               3             NaN
# mean               0.60             NaN            5.28
# std                0.49             NaN            1.31
# min                0.00             NaN            3.50
# 25%                0.00             NaN            3.51
# 50%                0.00             NaN            3.51
# 75%                0.00             NaN            3.52
# max                1.00             NaN            7.20


# Include multiple data types
result = df.describe(include = [np.bytes_, np.float32])
# Statistics results displayed
#
# fields              bar             baz
# -----------------------------------------
# count                 5               5
# unique                2             NaN
# top                b'b'             NaN
# freq                  3             NaN
# mean                NaN            5.28
# std                 NaN            1.31
# min                 NaN            3.50
# 25%                 NaN            3.51
# 50%                 NaN            3.51
# 75%                 NaN            3.52
# max                 NaN            7.20
Parameters
  • include – The field names or data types, or simply ‘all’, indicating the fields to include in the calculation.

  • exclude – The field names or data types to exclude from the calculation.

  • output – Display the result in stdout if set to ‘terminal’; otherwise run silently.

Returns

A dataframe that contains the statistics results.
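
The count, mean, std, min and max rows of the example output can be reproduced with the standard library. This sketch is not ExeTera's implementation, and describe_numeric_sketch is a hypothetical helper. The std values in the example (0.49 and 1.31) match the population standard deviation, which is what the sketch computes; percentiles are omitted.

```python
import statistics


def describe_numeric_sketch(values):
    """Return count, mean, population std, min and max for one numeric field."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),   # divides by N, not N - 1
        "min": min(values),
        "max": max(values),
    }


foo = describe_numeric_sketch([1, 0, 0, 1, 1])
baz = describe_numeric_sketch([3.5, 6.0, 4.2, 7.2, 5.5])
print(round(foo["mean"], 2), round(foo["std"], 2))   # 0.6 0.49
print(round(baz["mean"], 2), round(baz["std"], 2))   # 5.28 1.31
```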

drop(name: str)

Drop a field from this dataframe as well as the HDF5 Group

Parameters

name – name of field to be dropped

drop_duplicates(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, hint_keys_is_sorted=False)

Removes duplicated values in a field or list of fields, returns a dataframe with distinct values.

Example:

df = ... # df contains two fields:
         # field "foo" with data [1, 0, 0, 1]
         # field "bar" with data ["b", "b", "a", "a"]

# return distinct values of a single field
df.drop_duplicates(by = 'foo', ddf = df2)
print(df2["foo"].data[:])  # prints [0, 1]

# return distinct values of multiple fields
df.drop_duplicates(by = ['foo', 'bar'], ddf = df3)
# print dataframe (df3) data:
#
# "foo", "bar"
# -------------
#   0     "a"
#   0     "b"
#   1     "a"
#   1     "b"
Parameters
  • by – Name (str) or list of names (str) of the fields whose values should be distinct.

  • ddf – optional - the destination dataframe

Returns

DataFrame with distinct values.
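
The meaning of "distinct" for one or several fields can be sketched in plain Python. This is an illustration only, not ExeTera code: distinct_rows_sketch is a hypothetical helper, a dict of lists stands in for a dataframe, and returning the keys in sorted order is an assumption taken from the example output above.

```python
def distinct_rows_sketch(fields, by):
    """Return the distinct combinations of the `by` fields, sorted by key."""
    names = [by] if isinstance(by, str) else list(by)
    # Each row becomes a tuple of its key values; a set removes duplicates.
    rows = sorted(set(zip(*(fields[n] for n in names))))
    return {n: [row[i] for row in rows] for i, n in enumerate(names)}


df = {"foo": [1, 0, 0, 1], "bar": ["b", "b", "a", "a"]}
print(distinct_rows_sketch(df, "foo"))
# {'foo': [0, 1]}
print(distinct_rows_sketch(df, ["foo", "bar"]))
# {'foo': [0, 0, 1, 1], 'bar': ['a', 'b', 'a', 'b']}
```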

get_field(name)

Get a field stored by the field name.

Parameters

name – the name of field to get.

Returns

field to get.

groupby(by: Union[str, List[str]], hint_keys_is_sorted=False)

Group the DataFrame using a field or a list of fields, and return a groupby object.

Example:

df = ... # df contains two fields:
         # field "foo" with data [1, 0, 0, 1, 1]
         # field "bar" with data ["b", "b", "a", "a", "b"]

# group by on single field, then compute max
df.groupby(by = 'bar').max(target = 'foo', ddf = ddf)
# print dataframe (ddf) data:
#
# "bar", "foo_max"
# ----------------
#  "a"      1
#  "b"      1

# group by on multiple fields, then compute count
df.groupby(by = ['foo', 'bar']).count(ddf = ddf)
# print dataframe (ddf) data:
#
# "foo", "bar", "count"
# ----------------------
#   0     "a"      1
#   0     "b"      1
#   1     "a"      1
#   1     "b"      2
Parameters
  • by – Name (str) or list of names (str) to group by.

  • hint_keys_is_sorted – an optional flag that users can set to skip the sorted check. Note that grouping runs faster and uses less memory when the dataframe is already sorted, that is, when hint_keys_is_sorted=True.

Returns

A groupby object that contains information about the groups.
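
One way to picture a groupby object is as a sorted row order plus the boundaries of each run of equal keys. The sketch below is plain Python and is not ExeTera's implementation; the helper names are hypothetical, and the representation of groups as boundary offsets ("spans") is an assumption made for illustration.

```python
def group_spans_sketch(fields, by):
    """Return (keys, sorted_index, spans) for the given key field(s)."""
    names = [by] if isinstance(by, str) else list(by)
    keys = list(zip(*(fields[n] for n in names)))
    # Row positions ordered by key; sorted() is stable.
    sorted_index = sorted(range(len(keys)), key=keys.__getitem__)
    # spans[i]:spans[i + 1] delimits group i within sorted_index.
    spans = [pos for pos in range(len(sorted_index))
             if pos == 0 or keys[sorted_index[pos]] != keys[sorted_index[pos - 1]]]
    spans.append(len(sorted_index))
    return keys, sorted_index, spans


def group_count_sketch(keys, sorted_index, spans):
    return {keys[sorted_index[s]]: e - s for s, e in zip(spans, spans[1:])}


def group_max_sketch(values, keys, sorted_index, spans):
    return {keys[sorted_index[s]]: max(values[i] for i in sorted_index[s:e])
            for s, e in zip(spans, spans[1:])}


df = {"foo": [1, 0, 0, 1, 1], "bar": ["b", "b", "a", "a", "b"]}

keys, order, spans = group_spans_sketch(df, ["foo", "bar"])
print(order, spans)     # [2, 1, 3, 0, 4] [0, 1, 2, 3, 5]
print(group_count_sketch(keys, order, spans))
# {(0, 'a'): 1, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 2}

keys, order, spans = group_spans_sketch(df, "bar")
print(group_max_sketch(df["foo"], keys, order, spans))
# {('a',): 1, ('b',): 1}
```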

property h5group

The h5group property interface, used to handle underlying storage.

items()

Return all the field names and their corresponding field values

keys()

Return all the field names

rename(field: Union[str, Mapping[str, str]], field_to: Optional[str] = None) → None

Rename provides you with the means to rename fields within a dataframe. You can specify either a single field to be renamed or you can provide a dictionary with a set of fields to be renamed.

Example:

# rename a single field
df.rename('old_field_name', 'new_field_name')

# rename multiple fields
df.rename({'old_field_name_a': 'new_field_name_a', 'old_field_name_b': 'new_field_name_b'})

Field renaming can fail if the resulting set of renamed fields would have name clashes. If this is the case, none of the rename operations go ahead and the dataframe remains unmodified.

Parameters
  • field – Either a string or a dictionary of name pairs, each of which is the existing field name and the destination field name

  • field_to – Optional parameter containing the new name as a string, used when ‘field’ is a string. If ‘field’ is a dictionary, this parameter should not be set. Field references remain valid after this operation and reflect their renaming.

Returns

None
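
The all-or-nothing behaviour described above can be sketched as "validate first, then apply". This is a plain-Python illustration, not ExeTera's implementation: rename_sketch is a hypothetical helper working on a dict of name to field, and the exception types are this sketch's choice.

```python
def rename_sketch(columns, field, field_to=None):
    """Rename keys of `columns` in place, or raise and leave it unmodified."""
    mapping = {field: field_to} if isinstance(field, str) else dict(field)
    unknown = [old for old in mapping if old not in columns]
    if unknown:
        raise KeyError(f"unknown fields: {unknown}")
    # Work out every resulting name before touching `columns`.
    new_names = [mapping.get(name, name) for name in columns]
    if len(set(new_names)) != len(new_names):
        raise ValueError("rename would produce clashing field names")
    renamed = dict(zip(new_names, columns.values()))
    columns.clear()
    columns.update(renamed)


columns = {"a": [1], "b": [2], "c": [3]}
rename_sketch(columns, {"a": "x", "b": "y"})
print(list(columns))                     # ['x', 'y', 'c']

try:
    rename_sketch(columns, "x", "c")     # would clash with the existing 'c'
except ValueError:
    print(list(columns))                 # ['x', 'y', 'c'] (unchanged)
```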

sort_values(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, axis=0, ascending=True, kind='stable')

Sort the dataframe by one or more fields, either in place or into a new target (destination) dataframe.

Example:

df = ... # df contains a field ('idx') with data: ["a", "c", "e", "g", "f", "b", "d"]

# sort inplace
df.sort_values(by = 'idx')
print(df['idx'].data[:])      # prints ["a", "b", "c", "d", "e", "f", "g"]

# sort and store sorted value in designated dataframe
df.sort_values(by = 'idx', ddf = df2)
print(df2['idx'].data[:10])  # prints ["a", "b", "c", "d", "e", "f", "g"]
Parameters
  • by – Name (str) or list of names (str) to sort by.

  • ddf – optional - the destination data frame

  • axis – Axis to be sorted. Currently only supports 0

  • ascending – Sort ascending vs. descending. Currently only supports ascending=True.

  • kind – Choice of sorting algorithm. Currently only supports “stable”

Returns

DataFrame with sorted values or None if ddf=None.
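
Sorting a dataframe by a key can be viewed as computing a stable permutation from the key field(s) and applying that same permutation to every field. The sketch below is plain Python, not ExeTera code; sort_values_sketch is a hypothetical helper, the 'val' field is invented for illustration, and a dict of lists stands in for a dataframe.

```python
def sort_values_sketch(fields, by, dest=None):
    """Reorder every field by the `by` key(s); return the permutation used."""
    names = [by] if isinstance(by, str) else list(by)
    keys = list(zip(*(fields[n] for n in names)))
    order = sorted(range(len(keys)), key=keys.__getitem__)   # stable, ascending
    target = fields if dest is None else dest
    for name, values in list(fields.items()):
        target[name] = [values[i] for i in order]
    return order


df = {"idx": ["a", "c", "e", "g", "f", "b", "d"], "val": [0, 1, 2, 3, 4, 5, 6]}
df2 = {}
order = sort_values_sketch(df, "idx", dest=df2)
print(order)         # [0, 5, 1, 6, 2, 4, 3]
print(df2["idx"])    # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(df2["val"])    # [0, 5, 1, 6, 2, 4, 3]
```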

to_csv(filepath: str, row_filter: Union[numpy.ndarray, exetera.core.abstract_types.Field] = None, column_filter: Union[str, List[str]] = None, chunk_row_size: int = 32768)

Write object to a comma-separated values (csv) file.

Example:

# write to csv file
df.to_csv(csv_file_name)

# write to csv file with row_filter. Only select rows when filter value is True.
df.to_csv(csv_file_name, row_filter=df['foo'])

# write to csv file with selected columns defined in column_filter.
df.to_csv(csv_file_name, column_filter=['foo', 'bar'])
Parameters
  • filepath – File path.

  • row_filter – A boolean array / field. Only select rows when filter value is True

  • column_filter – A sequence of string names for the fields.

  • chunk_row_size – Rows are written in chunks of at most chunk_row_size rows. The default is 1<<15 (32768).
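
How the three options interact can be sketched with the standard csv module. This is not ExeTera's implementation: to_csv_sketch is a hypothetical helper that writes to any text stream, a dict of lists stands in for a dataframe, and writing a header row of field names is an assumption.

```python
import csv
import io


def to_csv_sketch(out, fields, row_filter=None, column_filter=None,
                  chunk_row_size=1 << 15):
    """Write selected columns and rows to `out`, one chunk of rows at a time."""
    if column_filter is None:
        names = list(fields)
    elif isinstance(column_filter, str):
        names = [column_filter]
    else:
        names = list(column_filter)
    n_rows = len(fields[names[0]]) if names else 0
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    for start in range(0, n_rows, chunk_row_size):
        stop = min(start + chunk_row_size, n_rows)
        for i in range(start, stop):
            # Rows are kept only where the filter value is truthy.
            if row_filter is None or row_filter[i]:
                writer.writerow([fields[n][i] for n in names])


df = {"foo": [1, 0, 1], "bar": ["x", "y", "z"], "baz": [3.5, 6.0, 4.2]}
buf = io.StringIO()
to_csv_sketch(buf, df, row_filter=df["foo"], column_filter=["foo", "bar"],
              chunk_row_size=2)
print(buf.getvalue())
# foo,bar
# 1,x
# 1,z
```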

to_pandas(row_filter: List[bool] = None, col_filter: Union[str, List[str]] = None)

Convert an ExeTera dataframe to a Pandas DataFrame.

Example:

pandas_df = df.to_pandas()
Parameters
  • row_filter – A boolean array indicating which rows to export.

  • col_filter – String or list of strings indicating which columns to export.

Returns

A pandas dataframe.

values()

Return all the field values

class exetera.core.dataframe.HDF5DataFrameGroupBy(columns, by, sorted_index, spans)

Bases: exetera.core.abstract_types.DataFrameGroupBy

count(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Compute count of group values.

Example:

df = ... # df contains a field ("foo") with data: [1, 0, 0, 1, 1]

# group by on single field, then compute count
df.groupby(by = 'foo').count(ddf = ddf)
# print dataframe (ddf) data:
#
# "foo", "count"
# -------------
#   0     2
#   1     3
Parameters
  • ddf – the destination data frame

  • write_keys – optional - write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with count of group values

distinct(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Compute distinct values of a field or a list of fields

Example:

df = ... # df contains two fields:
         # field "foo" with data [1, 0, 0, 1, 1]
         # field "bar" with data ["b", "b", "a", "a", "b"]

# group by on multiple fields, then compute distinct
df.groupby(by = ['foo', 'bar']).distinct(ddf = ddf)
# print dataframe (ddf) data:
#
# "foo", "bar"
# -------------
#   0     "a"
#   0     "b"
#   1     "a"
#   1     "b"
Parameters
  • ddf – the destination data frame

  • write_keys – optional - write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with distinct values of a field or a list of fields

first(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Get first of group values.

Example:

df = ... # df contains three fields:
         # field "foo" with data [1, 0, 0, 1, 1]
         # field "bar" with data ["b", "b", "a", "a", "b"]
         # field "baz" with data [3.5, 6.0, 4.2, 7.2, 5.5]

# group by on multiple fields, then compute first on a single target field
df.groupby(by = ['foo', 'bar']).first(target = 'baz', ddf = ddf)
# print dataframe (ddf) data:
#
# "foo", "bar", "baz_first"
# -------------------------
#   0     "a"       4.2
#   0     "b"       6.0
#   1     "a"       7.2
#   1     "b"       3.5
Parameters
  • target – Name (str) or list of names (str) to get first value.

  • ddf – the destination data frame

  • write_keys – optional - write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with first of group values

last(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Get last of group values.

Example:

df = ... # df contains three fields:
         # field "foo" with data [1, 0, 0, 1, 1]
         # field "bar" with data ["b", "b", "a", "a", "b"]
         # field "baz" with data [3.5, 6.0, 4.2, 7.2, 5.5]

# group by on multiple fields, then compute last on a single target field
df.groupby(by = ['foo', 'bar']).last(target = 'baz', ddf = ddf)
# print dataframe (ddf) data:
#
# "foo", "bar", "baz_last"
# -------------------------
#   0     "a"       4.2
#   0     "b"       6.0
#   1     "a"       7.2
#   1     "b"       5.5
Parameters
  • target – Name (str) or list of names (str) to get last value.

  • ddf – the destination data frame

  • write_keys – optional - write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with last of group values

max(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Compute max of group values.

Example:

df = ... # df contains three fields:
         # field "foo" with data [1, 0, 0, 1, 1]
         # field "bar" with data ["b", "b", "a", "a", "b"]
         # field "baz" with data [3.5, 6.0, 4.2, 7.2, 5.5]

# group by on a single field, then compute max on multiple target fields
df.groupby(by = 'bar').max(target = ['foo','baz'], ddf = ddf)
# print dataframe (ddf) data:
#
# "bar", "foo_max", "baz_max"
# ---------------------------
#  "a"      1         7.2
#  "b"      1         6.0
Parameters
  • target – Name (str) or list of names (str) to compute max.

  • ddf – the destination data frame

  • write_keys – optional - write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with max of group values

min(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame

Compute min of group values.

Example:

df = ... # df contains two fields:
         # field "foo" with data [1, 0, 0, 1, 1]
         # field "bar" with data ["b", "b", "a", "a", "b"]

# group by on a single field, then compute min on a single target field
df.groupby(by = 'bar').min(target = 'foo', ddf = ddf)
# print dataframe (ddf) data:
#
# "bar", "foo_min"
# -------------
#  "a"      0
#  "b"      0
Parameters
  • target – Name (str) or list of names (str) to compute min.

  • ddf – the destination data frame

  • write_keys – optional - write groupby keys to ddf only if write_keys=True. Default is True.

Returns

dataframe with min of group values

exetera.core.dataframe.copy(field: exetera.core.abstract_types.Field, ddf: exetera.core.abstract_types.DataFrame, name: str)

Copy a field to another dataframe, including its underlying dataset.

Example:

# Copy a field ('foobar') of dataframe (df1) to another dataframe (df2) with new field name ('foo')
dataframe.copy(df1['foobar'], df2, 'foo')
Parameters
  • field – The source field to copy.

  • ddf – The destination dataframe to copy to.

  • name – The name of field under destination dataframe.

exetera.core.dataframe.merge(left: exetera.core.abstract_types.DataFrame, right: exetera.core.abstract_types.DataFrame, dest: exetera.core.abstract_types.DataFrame, left_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], right_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], left_fields: Optional[Sequence[str]] = None, right_fields: Optional[Sequence[str]] = None, left_suffix: str = '_l', right_suffix: str = '_r', how='left', hint_left_keys_ordered: Optional[bool] = None, hint_left_keys_unique: Optional[bool] = None, hint_right_keys_ordered: Optional[bool] = None, hint_right_keys_unique: Optional[bool] = None, chunk_size=1048576)

Merge ‘left’ and ‘right’ DataFrames into a destination dataframe. The merge is a database-style join operation, in any of the following modes (“left”, “right”, “inner”, “outer”). This method closely follows the Pandas ‘merge’ functionality.

The join is performed using the fields specified by ‘left_on’ and ‘right_on’; these can either be strings or fields. If they are strings, they refer to fields that must exist in the corresponding dataframe.

You can optionally set ‘left_fields’ and / or ‘right_fields’ if you want to have only a subset of fields joined from the left and right dataframes. If you don’t want any fields to be joined from a given dataframe, you can pass an empty list.

Fields are written to the destination dataframe. If the field names clash, they will get appended with the strings specified in ‘left_suffix’ and ‘right_suffix’ respectively.

Parameters
  • left – The left dataframe

  • right – The right dataframe

  • dest – The destination dataframe

  • left_on – The field corresponding to the left key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys

  • right_on – The field corresponding to the right key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys

  • left_fields – Optional parameter listing which fields are to be joined from the left table. If this is not set, all fields from the left table are joined

  • right_fields – Optional parameter listing which fields are to be joined from the right table. If this is not set, all fields from the right table are joined

  • left_suffix – A string to be appended to fields from the left table if they clash with fields from the right table.

  • right_suffix – A string to be appended to fields from the right table if they clash with fields from the left table.

  • how – Optional parameter specifying the merge mode. It must be one of ‘left’, ‘right’, ‘inner’, ‘outer’ or ‘cross’. If not set, a ‘left’ join is performed.
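
The default ‘left’ mode and the suffix handling can be sketched in plain Python. This is an illustration of a database-style left join, not ExeTera's implementation: left_merge_sketch is a hypothetical helper, dicts of lists stand in for dataframes, and None marks rows with no match on the right (how ExeTera fills such rows is not stated here).

```python
def left_merge_sketch(left, right, left_on, right_on,
                      left_suffix="_l", right_suffix="_r"):
    """Left-join two dict-of-lists tables and return the merged table."""
    # Map each right key to the row positions that carry it.
    right_rows = {}
    for pos, key in enumerate(right[right_on]):
        right_rows.setdefault(key, []).append(pos)

    # Pair every left row with each matching right row, or with None.
    pairs = []
    for lpos, key in enumerate(left[left_on]):
        for rpos in right_rows.get(key, [None]):
            pairs.append((lpos, rpos))

    dest = {}
    for name, values in left.items():
        out = name + left_suffix if name in right else name
        dest[out] = [values[l] for l, _ in pairs]
    for name, values in right.items():
        out = name + right_suffix if name in left else name
        dest[out] = [None if r is None else values[r] for _, r in pairs]
    return dest


left = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
right = {"id": [1, 1, 3], "score": [10, 11, 30]}
merged = left_merge_sketch(left, right, left_on="id", right_on="id")
print(merged)
# {'id_l': [1, 1, 2, 3], 'name': ['a', 'a', 'b', 'c'],
#  'id_r': [1, 1, None, 3], 'score': [10, 11, None, 30]}
```

Only ‘id’ exists on both sides, so only it receives the suffixes; ‘name’ and ‘score’ keep their original names.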

exetera.core.dataframe.move(field: exetera.core.abstract_types.Field, ddf: exetera.core.abstract_types.DataFrame, name: str)

Move a field to another dataframe, including its underlying dataset.

Example:

# Move a field ('foobar') of dataframe (df1) to another dataframe (df2) with new field name ('foo')
dataframe.move(df1['foobar'], df2, 'foo')
Parameters
  • field – The field to move.

  • ddf – The destination dataframe to move to.

  • name – The name of field under destination dataframe.

exetera.core.exporter module

exetera.core.fields module

class exetera.core.fields.CategoricalField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data [1, 2, 3, 4, 0, 5, 6, 7, 8]
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

field.apply_filter(filter_to_apply, in_place=True)
field.data[:]  # prints [2, 4, 5, 7]
Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
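
The three possible return values (a new field, the target, or this field) can be sketched with a small stand-in class. FieldSketch is hypothetical and is not part of ExeTera; it only illustrates the rule that ‘target’ and ‘in_place’ are mutually exclusive and that the field written to must be writable. The exception types are this sketch's choice.

```python
class FieldSketch:
    """A tiny stand-in for a field: a list of values plus a writable flag."""

    def __init__(self, data, writable=True):
        self.data = list(data)
        self.writable = writable

    def _write(self, values):
        if not self.writable:
            raise PermissionError("field is not writable")
        self.data = values
        return self

    def apply_filter(self, keep, target=None, in_place=False):
        if in_place and target is not None:
            raise ValueError("set either 'target' or 'in_place', not both")
        filtered = [v for v, k in zip(self.data, keep) if k]
        if in_place:
            return self._write(filtered)
        if target is not None:
            return target._write(filtered)
        return FieldSketch(filtered)


keep = [0, 1, 0, 1, 0, 1, 0, 1, 0]
field = FieldSketch([1, 2, 3, 4, 0, 5, 6, 7, 8])

fresh = field.apply_filter(keep)                            # new field
print(fresh.data, len(field.data))                          # [2, 4, 5, 7] 9

target = FieldSketch([])
print(field.apply_filter(keep, target=target) is target)    # True

print(field.apply_filter(keep, in_place=True) is field)     # True
print(field.data)                                           # [2, 4, 5, 7]
```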

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data [1, 2, 3, 4, 0, 5, 6, 7, 8]
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # [8, 1, 7, 2, 6, 3, 5, 4, 0]
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, taking the first value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
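The first/last span operations can be sketched in plain Python. This is only an illustration under assumptions: spans are taken to be a list of boundaries in which each consecutive pair marks one half-open range of rows, and the helper names `spans_first` and `spans_last` are hypothetical, not part of ExeTera.

```python
def spans_first(spans, data):
    """Return the first value of each span (assumed semantics)."""
    return [data[start] for start in spans[:-1]]


def spans_last(spans, data):
    """Return the last value of each span (assumed semantics)."""
    return [data[end - 1] for end in spans[1:]]


# Three spans over six rows: [0, 2), [2, 3) and [3, 6).
spans = [0, 2, 3, 6]
data = [10, 11, 20, 30, 31, 32]

first_values = spans_first(spans, data)
last_values = spans_last(spans, data)
print(first_values)  # [10, 20, 30]
print(last_values)   # [11, 20, 32]
```

The result has one entry per span, so it is shorter than the source data whenever any span covers more than one row.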

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the maximum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the minimum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
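A minimal sketch of the min/max span operations, again assuming that spans are a list of boundaries delimiting half-open row ranges. The helpers `spans_min` and `spans_max` are hypothetical stand-ins that operate on plain lists rather than fields.

```python
def spans_min(spans, data):
    """Return the smallest value inside each span (assumed semantics)."""
    return [min(data[start:end]) for start, end in zip(spans[:-1], spans[1:])]


def spans_max(spans, data):
    """Return the largest value inside each span (assumed semantics)."""
    return [max(data[start:end]) for start, end in zip(spans[:-1], spans[1:])]


spans = [0, 2, 3, 6]
data = [4, 1, 7, 3, 9, 2]

min_values = spans_min(spans, data)
max_values = spans_max(spans, data)
print(min_values)  # [1, 7, 2]
print(max_values)  # [4, 7, 9]
```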

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this field.

Parameters
  • group – the h5py group for the new field

  • name – the name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

CategoricalField or CategoricalMemField

property data

Get data.

get_spans()

Get spans of field.
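As a rough illustration of what spans represent, the sketch below derives span boundaries from a list of sorted values. It assumes that a span is a run of equal adjacent values and that the result lists each run's start index followed by the total length. The function `spans_from_sorted` is hypothetical and is not the library's implementation.

```python
def spans_from_sorted(values):
    """Return the boundaries of runs of equal adjacent values."""
    if not values:
        return [0]  # no rows, so no spans
    boundaries = [0]
    for i in range(1, len(values)):
        if values[i] != values[i - 1]:
            boundaries.append(i)  # a new run starts at row i
    boundaries.append(len(values))
    return boundaries


spans = spans_from_sorted([1, 1, 2, 3, 3, 3])
print(spans)  # [0, 2, 3, 6]
```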

is_sorted()

Returns whether the data in this field is sorted.

Returns

bool

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field
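The membership test described above can be sketched with plain lists. The function `isin_sketch` is a hypothetical stand-in, and the lists stand in for field data and the returned boolean array.

```python
def isin_sketch(values, test_elements):
    """Return one boolean per value: True where the value is in test_elements."""
    lookup = set(test_elements)  # set membership keeps each test cheap
    return [value in lookup for value in values]


mask = isin_sketch([1, 2, 1, 3, 2], [1, 3])
print(mask)  # [True, False, True, True, False]
```

A result of this shape can serve as the boolean filter passed to apply_filter.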

property keys

Get keys.

property nformat

Get numeric format.

remap(key_map, new_key)

Remap the key names and key values.

Parameters
  • key_map – the mapping rule that converts the old key values into the new key values.

  • new_key – The new key.

Returns

A CategoricalMemField with the new key.

Example:

cat_field = df.create_categorical('cat', 'int32', {"a": 1, "b": 2})
cat_field.data.write([1,2,1,2])
newfield = cat_field.remap([(1, 4), (2, 5)], {"a": 4, "b": 5})
print(newfield.data[:])  # [4,5,4,5]
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a CategoricalField. Returns the sorted unique elements of the field. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value occurs in the input array.

Parameters
  • return_index – boolean; if True, also return the indices of the input array that give the unique values

  • return_inverse – boolean; if True, also return the indices of the unique array that reconstruct the input array

  • return_counts – boolean; if True, also return the number of times each unique value occurs

Returns

numpy array
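The three optional outputs are easiest to see on a small input. The sketch below recomputes them with plain Python lists. The function `unique_with_extras` is hypothetical, and it assumes that the index output refers to the first occurrence of each unique value.

```python
def unique_with_extras(values):
    """Return sorted uniques plus first-occurrence indices, inverse and counts."""
    uniques = sorted(set(values))
    position = {value: i for i, value in enumerate(uniques)}
    first_index = [values.index(value) for value in uniques]
    inverse = [position[value] for value in values]
    counts = [values.count(value) for value in uniques]
    return uniques, first_index, inverse, counts


values = [2, 1, 2, 3, 1]
uniques, first_index, inverse, counts = unique_with_extras(values)
print(uniques)      # [1, 2, 3]
print(first_index)  # [1, 0, 3]
print(inverse)      # [1, 0, 1, 2, 0]
print(counts)       # [2, 2, 1]

# The inverse output rebuilds the original input from the unique values.
rebuilt = [uniques[i] for i in inverse]
```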

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.CategoricalMemField(session, nformat, keys)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data [1, 2, 3, 4, 0, 5, 6, 7, 8]
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

field.apply_filter(filter_to_apply, in_place=True)
field.data[:]  # prints [2, 4, 5, 7]
Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
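The effect of a boolean filter can be sketched without the library. The helper `filter_values` below is hypothetical and uses plain lists in place of field data: it keeps exactly the rows whose filter entry is true and leaves the source untouched, mirroring the default non-destructive behaviour described above.

```python
def filter_values(data, mask):
    """Return the entries of data whose matching mask entry is true."""
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same length")
    return [value for value, keep in zip(data, mask) if keep]


data = [1, 2, 3, 4, 0, 5, 6, 7, 8]
mask = [False, True, False, True, False, True, False, True, False]

filtered = filter_values(data, mask)
print(filtered)   # [2, 4, 5, 7]
print(len(data))  # 9, the source list is unchanged
```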

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data [1, 2, 3, 4, 0, 5, 6, 7, 8]
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # [8, 1, 7, 2, 6, 3, 5, 4, 0]
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
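Applying an index amounts to gathering rows in the order the index lists them. The sketch below shows this with plain lists and a hypothetical helper `reindex_values`. It also shows one common way such an index might be produced, namely as a sort order.

```python
def reindex_values(data, index):
    """Return data rearranged so that result[i] == data[index[i]]."""
    return [data[i] for i in index]


data = [1, 2, 3, 4, 0, 5, 6, 7, 8]
index = [8, 0, 7, 1, 6, 2, 5, 3, 4]
reindexed = reindex_values(data, index)
print(reindexed)  # [8, 1, 7, 2, 6, 3, 5, 4, 0]

# An index that sorts the data: the row positions ordered by their values.
sort_order = sorted(range(len(data)), key=data.__getitem__)
sorted_data = reindex_values(data, sort_order)
print(sorted_data)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The same index can be applied to several fields of equal length to keep their rows aligned.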

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the first value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the maximum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the minimum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this field.

Parameters
  • group – the h5py group for the new field

  • name – the name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

CategoricalField or CategoricalMemField

property data

Returns a memory field array with the values from this field.

Returns

MemoryFieldArray

get_spans()

Get spans of field.

is_sorted()

Returns whether the data in this field is sorted.

Returns

bool

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

property keys

Get keys.

remap(key_map, new_key)

Remap the key names and key values.

Parameters
  • key_map – the mapping rule that converts the old key values into the new key values.

  • new_key – The new key.

Returns

A CategoricalMemField with the new key.

Example:

cat_field = df.create_categorical('cat', 'int32', {"a": 1, "b": 2})
cat_field.data.write([1,2,1,2])
newfield = cat_field.remap([(1, 4), (2, 5)], {"a": 4, "b": 5})
print(newfield.data[:])  # [4,5,4,5]
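The value remapping in the example above can be sketched independently of the library. The helper `remap_values` is hypothetical. It treats key_map as a list of (old value, new value) pairs, and it raises KeyError for a value with no mapping, which is a choice made for this sketch rather than documented behaviour.

```python
def remap_values(values, key_map):
    """Replace each value using a list of (old_value, new_value) pairs."""
    mapping = dict(key_map)
    return [mapping[value] for value in values]


remapped = remap_values([1, 2, 1, 2], [(1, 4), (2, 5)])
print(remapped)  # [4, 5, 4, 5]
```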
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a CategoricalMemField. Returns the sorted unique elements of the field. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value occurs in the input array.

Parameters
  • return_index – boolean; if True, also return the indices of the input array that give the unique values

  • return_inverse – boolean; if True, also return the indices of the unique array that reconstruct the input array

  • return_counts – boolean; if True, also return the number of times each unique value occurs

Returns

numpy array

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.FieldDataOps

Bases: object

static apply_filter_to_field(source, filter_to_apply, target=None, in_place=False)

Apply filter to field, either in place (itself) or a target (new) field

Parameters
  • source – Field

  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – optional Field; if set, the result is written to this field

  • in_place – optional bool; if True, the data in the source field is changed

Returns

Field with filter applied
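The interplay of ‘target’ and ‘in_place’ can be sketched as follows. This is an assumption-laden illustration, not the library's code: `filter_field` is a hypothetical function, lists stand in for fields, and ValueError is a choice made for this sketch since the exception actually raised is not stated here.

```python
def filter_field(source, mask, target=None, in_place=False):
    """Filter source by mask, honouring the target / in_place rule."""
    if in_place and target is not None:
        raise ValueError("'target' must be None when 'in_place' is True")
    filtered = [value for value, keep in zip(source, mask) if keep]
    if in_place:
        source[:] = filtered  # destructive: overwrite the source
        return source
    if target is not None:
        target[:] = filtered  # write into the caller-supplied target
        return target
    return filtered  # default: a new object, source untouched


mask = [False, True, False, True]
source = [1, 2, 3, 4]

fresh = filter_field(source, mask)
target = []
written = filter_field(source, mask, target=target)
in_place_source = [1, 2, 3, 4]
same = filter_field(in_place_source, mask, in_place=True)

try:
    filter_field(source, mask, target=[], in_place=True)
    rejected = False
except ValueError:
    rejected = True
```

In all three valid cases the returned object holds the filtered data. Only the identity of that object differs: a new one, the target, or the source itself.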

static apply_filter_to_indexed_field(source, filter_to_apply, target=None, in_place=False)
static apply_index_to_field(source, index_to_apply, target=None, in_place=False)

Apply index to field, either in place (itself) or a target (new) field

Parameters
  • source – Field

  • index_to_apply – a Field or numpy array that contains the indices

  • target – optional Field; if set, the result is written to this field

  • in_place – optional bool; if True, the data in the source field is changed

Returns

Field with index

static apply_index_to_indexed_field(source, index_to_apply, target=None, in_place=False)
static apply_isin(source: exetera.core.abstract_types.Field, test_elements: Union[list, set, numpy.ndarray])

Apply isin operation for elements on Field

Parameters
  • source – Field

  • test_elements – list, set or ndarray

Returns

a boolean array of the same length as the source field

static apply_spans_first(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field

Apply spans first, either in place (itself) or a target (new) field

Parameters
  • source – Field

  • spans – Field or ndarray

  • target – optional Field; if set, the result is written to this field

  • in_place – optional bool; if True, the data in the source field is changed

Returns

Field

static apply_spans_last(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field

Apply spans last, either in place (itself) or a target (new) field

Parameters
  • source – Field

  • spans – Field or ndarray

  • target – optional Field; if set, the result is written to this field

  • in_place – optional bool; if True, the data in the source field is changed

Returns

Field

static apply_spans_max(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field

Apply spans max, either in place (itself) or a target (new) field

Parameters
  • source – Field

  • spans – Field or ndarray

  • target – optional Field; if set, the result is written to this field

  • in_place – optional bool; if True, the data in the source field is changed

Returns

Field

static apply_spans_min(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field

Apply spans min, either in place (itself) or a target (new) field

Parameters
  • source – Field

  • spans – Field or ndarray

  • target – optional Field; if set, the result is written to this field

  • in_place – optional bool; if True, the data in the source field is changed

Returns

Field

static apply_unique(src: exetera.core.abstract_types.Field, return_index=False, return_inverse=False, return_counts=False) → numpy.ndarray

Find the unique elements of a field. Returns the sorted unique elements of the field. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value occurs in the input array.

Parameters
  • src – Field

  • return_index – boolean; if True, also return the indices of the input array that give the unique values

  • return_inverse – boolean; if True, also return the indices of the unique array that reconstruct the input array

  • return_counts – boolean; if True, also return the number of times each unique value occurs

Returns

numpy array

static categorical_field_create_like(source, group, name, timestamp)
Parameters
  • group – h5py group

  • name – str

  • timestamp – timestamp

Returns

CategoricalField or CategoricalMemField

classmethod equal(session, first, second)
static fixed_string_field_create_like(source, group, name, timestamp)
Returns

FixedStringField or FixedStringMemField

classmethod greater_than(session, first, second)
classmethod greater_than_equal(session, first, second)
static indexed_string_create_like(source, group, name, timestamp)
Returns

Indexed string field

classmethod invert(session, first)
classmethod less_than(session, first, second)
classmethod less_than_equal(session, first, second)
classmethod logical_not(session, first)
classmethod not_equal(session, first, second)
classmethod numeric_add(session, first, second)
classmethod numeric_and(session, first, second)
classmethod numeric_divmod(session, first, second)
static numeric_field_create_like(source, group, name, timestamp)
Returns

NumericField or NumericMemField

classmethod numeric_floordiv(session, first, second)
classmethod numeric_mod(session, first, second)
classmethod numeric_mul(session, first, second)
classmethod numeric_or(session, first, second)
classmethod numeric_sub(session, first, second)
classmethod numeric_truediv(session, first, second)
classmethod numeric_xor(session, first, second)
static timestamp_field_create_like(source, group, name, timestamp)
Returns

TimestampField, see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#timestampfield

class exetera.core.fields.FixedStringField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data ['a', 'b', 'c', 'd', '', 'e', 'f', 'g', 'h']
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

field.apply_filter(filter_to_apply, target_field)
target_field.data[:]  # prints ['b', 'd', 'e', 'g']
Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data ['a', 'b', 'c', 'd', '', 'e', 'f', 'g', 'h']
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # ['h', 'a', 'g', 'b', 'f', 'c', 'e', 'd', '']
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the first value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the maximum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the minimum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this field.

Parameters
  • group – the h5py group for the new field

  • name – the name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

FixedStringField or FixedStringMemField

property data

Get data.

get_spans()

Get spans of field.

is_sorted()

Returns whether the data in this field is sorted.

Returns

bool

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a FixedStringField. Returns the sorted unique elements of the field. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value occurs in the input array.

Parameters
  • return_index – boolean; if True, also return the indices of the input array that give the unique values

  • return_inverse – boolean; if True, also return the indices of the unique array that reconstruct the input array

  • return_counts – boolean; if True, also return the number of times each unique value occurs

Returns

numpy array

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.FixedStringMemField(session, length)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data ['a', 'b', 'c', 'd', '', 'e', 'f', 'g', 'h']
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

field.apply_filter(filter_to_apply, target_field)
target_field.data[:]  # prints ['b', 'd', 'e', 'g']
Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data ['a', 'b', 'c', 'd', '', 'e', 'f', 'g', 'h']
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # ['h', 'a', 'g', 'b', 'f', 'c', 'e', 'd', '']
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the first value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None.

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the maximum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the minimum value in each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the resulting data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this field.

Parameters
  • group – the h5py group for the new field

  • name – the name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

FixedStringField or FixedStringMemField

property data

Returns a memory field array with the values from this field.

Returns

MemoryFieldArray

get_spans()

Get the spans of this field.

Returns

The spans of this field

is_sorted()

Returns whether the data in this field is sorted.

Returns

bool

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a FixedStringMemField. Returns the sorted unique elements of the field. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value occurs in the input array.

Parameters
  • return_index – boolean; if True, also return the indices of the input array that give the unique values

  • return_inverse – boolean; if True, also return the indices of the unique array that reconstruct the input array

  • return_counts – boolean; if True, also return the number of times each unique value occurs

Returns

numpy array
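
The three optional outputs have the same meaning as those of np.unique. The following sketch runs np.unique on a plain array rather than on a field, purely to show what each output contains; it makes no claim about how the field method is implemented.

```python
import numpy as np

values = np.array([3, 1, 3, 2, 1])
uniques, index, inverse, counts = np.unique(
    values, return_index=True, return_inverse=True, return_counts=True
)
# uniques -> [1, 2, 3]         sorted unique elements
# index   -> [1, 3, 0]         first position of each unique value in `values`
# inverse -> [2, 0, 2, 1, 0]   uniques[inverse] rebuilds `values`
# counts  -> [2, 1, 2]         occurrences of each unique value
```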

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.HDF5Field(session, group, dataframe, write_enabled=False)

Bases: exetera.core.abstract_types.Field

apply_filter(filter_to_apply, dstfld=None)

Apply filter on the field.

apply_index(index_to_apply, dstfld=None)

Apply index on the field.

property chunksize

The chunksize for the field. This is not generally required for users, and may be ignored depending on the storage medium.

property dataframe

The owning dataframe of this field, or None if the field is not owned by a dataframe.

Returns

str or None

get_spans()

Get spans of the field.

property indexed

Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.
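
As a rough illustration of the index and value arrays mentioned above, the following pure-Python sketch packs variable-length strings into one list of offsets and one byte buffer. The exact layout used by ExeTera is an assumption here, and to_indexed and from_indexed are hypothetical helper names.

```python
def to_indexed(strings):
    """Pack strings into an offsets list and a single byte buffer."""
    indices, values = [0], bytearray()
    for s in strings:
        values += s.encode("utf-8")
        indices.append(len(values))   # end offset of this entry
    return indices, bytes(values)


def from_indexed(indices, values):
    """Rebuild the list of strings from offsets and bytes."""
    return [
        values[indices[i]:indices[i + 1]].decode("utf-8")
        for i in range(len(indices) - 1)
    ]


indices, values = to_indexed(["a", "bb", "ccc", ""])
# indices -> [0, 1, 3, 6, 6]; values -> b"abbccc"
round_trip = from_indexed(indices, values)
```

Storing offsets plus one contiguous buffer avoids padding every entry to the length of the longest string, which is the efficiency gain the description refers to.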

property name

The name of the field within a dataframe, if the field belongs to a dataframe.

Returns

str

property timestamp

The timestamp representing the field creation time. This is the time at which the data for this field was added to the dataset, rather than the point at which the field wrapper was created.

Returns

timestamp

property valid

Returns whether the field is a valid field object. Fields can become invalid as a result of certain operations, such as a field being moved from one dataframe to another. An invalid field will throw exceptions if any other operation is performed on it.

Returns

bool

class exetera.core.fields.IndexedStringField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

field.apply_filter(filter_to_apply, target_field)
target_field.data[:]  # ['bb', 'dddd', 'eeee', 'gg']

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # ['h', 'a', 'gg', 'bb', 'fff', 'ccc', 'eeee', 'dddd', '']
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the first value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_first(spans_to_apply, target_field)
target_field.data[:]  # ['a', 'ccc', 'dddd', 'gg']

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_last(spans_to_apply, target_field)
target_field.data[:]  # ['bb', 'ccc', 'fff', 'h']

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the maximum value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_max(spans_to_apply, in_place=True)
field.data[:]  # ['bb', 'ccc', 'fff', 'h']

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the minimum value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_min(spans_to_apply, in_place=True)
field.data[:]  # ['a', 'ccc', '', 'gg'] (the empty string sorts before any other string)

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
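
The four span operations above can be compared side by side with a pure-Python sketch. It assumes that a spans array lists boundaries, so that consecutive entries delimit one group of elements, and it uses the boundaries [0, 2, 3, 7, 9], which cover all nine elements. apply_spans is a hypothetical helper, not part of ExeTera, and string comparison here is ordinary lexicographic ordering.

```python
def apply_spans(values, spans, reducer):
    """Reduce each slice values[spans[i]:spans[i + 1]] to a single value."""
    return [reducer(values[spans[i]:spans[i + 1]]) for i in range(len(spans) - 1)]


values = ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans = [0, 2, 3, 7, 9]  # groups: [a, bb] [ccc] [dddd, '', eeee, fff] [gg, h]

first = apply_spans(values, spans, lambda group: group[0])
last = apply_spans(values, spans, lambda group: group[-1])
smallest = apply_spans(values, spans, min)
largest = apply_spans(values, spans, max)
# first    -> ['a', 'ccc', 'dddd', 'gg']
# last     -> ['bb', 'ccc', 'fff', 'h']
# smallest -> ['a', 'ccc', '', 'gg']
# largest  -> ['bb', 'ccc', 'fff', 'h']
```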

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this one.

Parameters
  • group – h5group

  • name – Name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

Indexed string field

property data

Returns the writable indexed field array holding the values of this field.

Returns

WriteableIndexedFieldArray

get_spans()

Get spans of field

property indexed

Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.

property indices

Get indices.

is_sorted()

Returns whether the data in this field is sorted.

Returns

bool
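
A minimal sketch of such a check, assuming that "sorted" means non-decreasing order (the documentation above does not say whether ties or descending order count). The is_sorted function below is a standalone illustration, not the field method.

```python
def is_sorted(values):
    """True when every element is <= its successor."""
    return all(values[i] <= values[i + 1] for i in range(len(values) - 1))


sorted_result = is_sorted(['a', 'bb', 'bb', 'ccc'])   # True
unsorted_result = is_sorted(['ccc', 'a', 'bb'])       # False
```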

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of an IndexedStringField and return them in sorted order. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value comes up in the input array.

Parameters
  • return_index – boolean, if True, also returns the indices of the input array that give the unique values

  • return_inverse – boolean, if True, also returns the indices of the unique array that reconstruct the input array

  • return_counts – boolean, if True, also returns the number of times each unique value occurs

Returns

numpy array of the sorted unique elements, followed by any optional outputs that were requested

property values

Get values.

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.IndexedStringMemField(session, chunksize=None)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

field.apply_filter(filter_to_apply, target_field)
target_field.data[:]  # ['bb', 'dddd', 'eeee', 'gg']

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
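
For an indexed string field, filtering has to rebuild both the offsets and the byte buffer. The sketch below shows one way this could work in pure Python. It assumes an offsets-plus-bytes layout, and to_indexed and filter_indexed are hypothetical helpers rather than ExeTera functions.

```python
def to_indexed(strings):
    """Pack strings into an offsets list and a single byte buffer."""
    indices, values = [0], bytearray()
    for s in strings:
        values += s.encode("utf-8")
        indices.append(len(values))
    return indices, bytes(values)


def filter_indexed(indices, values, keep):
    """Keep only the entries whose flag in `keep` is truthy."""
    new_indices, new_values = [0], bytearray()
    for i, flag in enumerate(keep):
        if flag:
            new_values += values[indices[i]:indices[i + 1]]
            new_indices.append(len(new_values))
    return new_indices, bytes(new_values)


indices, values = to_indexed(['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h'])
keep = [False, True, False, True, False, True, False, True, False]
new_indices, new_values = filter_indexed(indices, values, keep)
# new_indices -> [0, 2, 6, 10, 12]; new_values -> b"bbddddeeeegg"
```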

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # ['h', 'a', 'gg', 'bb', 'fff', 'ccc', 'eeee', 'dddd', '']
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the first value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_first(spans_to_apply, target_field)
target_field.data[:]  # ['a', 'ccc', 'dddd', 'gg']

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_last(spans_to_apply, target_field)
target_field.data[:]  # ['bb', 'ccc', 'fff', 'h']

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the maximum value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_max(spans_to_apply, in_place=True)
field.data[:]  # ['bb', 'ccc', 'fff', 'h']

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the minimum value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Example:

field = ... # field contains data ['a', 'bb', 'ccc', 'dddd', '', 'eeee', 'fff', 'gg', 'h']
spans_to_apply = np.array([0, 2, 3, 7, 9], dtype=np.int32)

field.apply_spans_min(spans_to_apply, in_place=True)
field.data[:]  # ['a', 'ccc', '', 'gg'] (the empty string sorts before any other string)

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this one.

Parameters
  • group – h5group

  • name – Name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

Indexed string field

property data

Returns the writable indexed field array holding the values of this field.

Returns

WriteableIndexedFieldArray

get_spans()
Returns

Span of indices as List
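
To make the idea of spans concrete, here is a pure-Python sketch that derives them from a list of values. It assumes that spans mark the boundaries of runs of equal adjacent values, which is an interpretation not spelled out above. compute_spans is a hypothetical name, and returning an empty list for empty input is also an assumption.

```python
def compute_spans(values):
    """Return the boundaries of runs of equal adjacent values."""
    if len(values) == 0:
        return []
    boundaries = [0]
    for i in range(1, len(values)):
        if values[i] != values[i - 1]:
            boundaries.append(i)       # a new run starts at position i
    boundaries.append(len(values))     # closing boundary of the last run
    return boundaries


spans = compute_spans(['a', 'a', 'b', 'c', 'c', 'c'])
# spans -> [0, 2, 3, 6]: runs are [0, 2), [2, 3) and [3, 6)
```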

property indexed

Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.

property indices

Get the indices for this field.

Returns

MemoryFieldArray(‘int64’)

is_sorted()

Returns whether the data in this field is sorted.

Returns

bool

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of an IndexedStringMemField and return them in sorted order. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value comes up in the input array.

Parameters
  • return_index – boolean, if True, also returns the indices of the input array that give the unique values

  • return_inverse – boolean, if True, also returns the indices of the unique array that reconstruct the input array

  • return_counts – boolean, if True, also returns the number of times each unique value occurs

Returns

numpy array of the sorted unique elements, followed by any optional outputs that were requested

property values

Get the values for this field.

Returns

MemoryFieldArray(‘8’)

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.MemoryField(session)

Bases: exetera.core.abstract_types.Field

apply_filter(filter_to_apply, dstfld=None)

Apply filter on the field.

apply_index(index_to_apply, dstfld=None)

Apply index on the field.

property chunksize

The chunksize for the field. This is not generally required for users, and may be ignored depending on the storage medium.

property dataframe

The owning dataframe of this field, or None if the field is not owned by a dataframe.

property indexed

Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.

property name

The name of the field within a dataframe, if the field belongs to a dataframe.

Returns

str or None

property timestamp

The timestamp representing the field creation time. This is the time at which the data for this field was added to the dataset, rather than the point at which the field wrapper was created.

property valid

Returns whether the field is a valid field object. Fields can become invalid as a result of certain operations, such as a field being moved from one dataframe to another. An invalid field will throw exceptions if any other operation is performed on it.

Returns

bool

class exetera.core.fields.MemoryFieldArray(dtype)

Bases: object

clear()

Sets the dataset to None.

Returns

None

complete()

Mark writing completed, usually used after calling write_part.

property dtype
Returns

dtype of field

write(part)

Writes data to field and marks it as complete.

Example:

part = np.array([97, 97, 100])
field.write(part)

Parameters

part – numpy array to write to field

Returns

None

write_part(part, move_mem=False)

Writes a part of the data to the field. Call complete() once all parts have been written.

Example:

part = np.array([97, 97, 100])
field.write_part(part)
field.complete()

Parameters
  • part – numpy array to write to the field

  • move_mem – boolean, use part provided directly or make copy before writing.

Returns

None
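
The relationship between write, write_part and complete can be illustrated with a small stand-in class. PartBuffer below is hypothetical and is not ExeTera's MemoryFieldArray. It assumes that parts are accumulated until complete() is called, and that move_mem=True means the provided part is kept without copying.

```python
class PartBuffer:
    """Hypothetical stand-in that accumulates parts until completion."""

    def __init__(self):
        self._parts = []
        self.data = None

    def write_part(self, part, move_mem=False):
        # Assumption: move_mem=True stores the caller's object as-is,
        # otherwise a copy is stored so later changes to `part` have no effect.
        self._parts.append(part if move_mem else list(part))

    def complete(self):
        # Join all accumulated parts into the final data.
        self.data = [item for part in self._parts for item in part]
        self._parts = []

    def write(self, part):
        # Single-shot write: one part, immediately completed.
        self.write_part(part)
        self.complete()


chunked = PartBuffer()
chunked.write_part([97, 97])
chunked.write_part([100])
chunked.complete()          # chunked.data -> [97, 97, 100]

single = PartBuffer()
single.write([97, 97, 100])  # same result in one call
```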

class exetera.core.fields.NumericField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

field.apply_filter(filter_to_apply, in_place=True)
field.data[:]  # [22, 444, 5555, 77]

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
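
On a plain NumPy array, the equivalent of a boolean filter is boolean mask indexing. The sketch below also shows why the dtype of the filter matters in plain NumPy: an integer array of zeros and ones is treated as positions, not as a mask. How ExeTera itself treats integer filters is not covered here.

```python
import numpy as np

data = np.array([1, 22, 333, 444, 0, 5555, 666, 77, 8])

# Boolean mask: keeps the elements where the mask is True.
keep = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)
filtered = data[keep]
# filtered -> [22, 444, 5555, 77]

# Integer array: NumPy reads the values as positions (fancy indexing).
positions = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])
picked = data[positions]
# picked -> [1, 22, 1, 22, 1, 22, 1, 22, 1]
```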

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # [8, 1, 77, 22, 666, 333, 5555, 444, 0]
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
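
Applying an index corresponds to integer (fancy) indexing on a plain NumPy array: element i of the result is the source element at position index[i]. The sketch below reproduces the example data on a plain array and, as one common use, builds a sorting index with np.argsort. It illustrates the semantics only and does not call ExeTera.

```python
import numpy as np

data = np.array([1, 22, 333, 444, 0, 5555, 666, 77, 8])
index = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

# result[i] = data[index[i]]
reindexed = data[index]
# reindexed -> [8, 1, 77, 22, 666, 333, 5555, 444, 0]

# An index that sorts the data; it can be reused to reorder related arrays.
order = np.argsort(data, kind="stable")
sorted_data = data[order]
# sorted_data -> [0, 1, 8, 22, 77, 333, 444, 666, 5555]
```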

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the first value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the maximum value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the minimum value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
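
The numeric span operations have no example above, so here is a sketch of the same reductions on a plain NumPy array. It assumes that the spans array holds strictly increasing boundaries, with consecutive entries delimiting one group. The use of reduceat is an illustration, not a statement about ExeTera's internals.

```python
import numpy as np

data = np.array([1, 22, 333, 444, 0, 5555, 666, 77, 8])
spans = np.array([0, 2, 3, 7, 9])   # groups: [1, 22] [333] [444, 0, 5555, 666] [77, 8]

starts = spans[:-1]                 # first position of each group
ends = spans[1:] - 1                # last position of each group

first = data[starts]                            # [1, 333, 444, 77]
last = data[ends]                               # [22, 333, 666, 8]
minimum = np.minimum.reduceat(data, starts)     # [1, 333, 0, 8]
maximum = np.maximum.reduceat(data, starts)     # [22, 333, 5555, 77]
```

reduceat reduces each slice from one start position to the next, and the final slice runs to the end of the array, so this shortcut only matches the span semantics when the last boundary equals the length of the data.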

astype(dtype: str, casting='unsafe')

Convert the field data type to dtype parameter given.

Parameters
  • dtype – The new datatype, given as a str object. The dtype must be a subtype of np.number, e.g. int, float, etc.

  • casting – Similar to the casting parameter in numpy ndarray.astype, can be ‘no’, ‘equiv’, ‘safe’, ‘same_kind’, or ‘unsafe’.

Returns

The field with new datatype.
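
Since the casting parameter is described as similar to that of numpy ndarray.astype, its effect can be previewed on a plain array. This sketch uses only NumPy and shows that the default ‘unsafe’ rule silently truncates floats, while ‘safe’ refuses a lossy conversion. Whether the field method raises the same exception type is an assumption.

```python
import numpy as np

values = np.array([1.5, 2.0, 3.7])

# 'unsafe' permits any conversion; float to int drops the fractional part.
as_int = values.astype("int32", casting="unsafe")
# as_int -> [1, 2, 3]

# 'safe' only permits conversions that preserve values, so float64 -> int32 fails.
try:
    values.astype("int32", casting="safe")
    safe_allowed = True
except TypeError:
    safe_allowed = False

# Widening int32 -> float64 is allowed under 'safe'.
widened = np.array([1, 2, 3], dtype="int32").astype("float64", casting="safe")
```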

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this one.

Parameters
  • group – h5group

  • name – Name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

The newly created field, of the same type as this field

property data

Get data.

get_spans()

Get spans of field.

is_sorted()

Returns whether the data in this field is sorted.

Returns

bool

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

logical_not()

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a NumericField and return them in sorted order. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values; (2) the indices of the unique array that reconstruct the input array; (3) the number of times each unique value comes up in the input array.

Parameters
  • return_index – boolean, if True, also returns the indices of the input array that give the unique values

  • return_inverse – boolean, if True, also returns the indices of the unique array that reconstruct the input array

  • return_counts – boolean, if True, also returns the number of times each unique value occurs

Returns

numpy array of the sorted unique elements, followed by any optional outputs that were requested

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.NumericMemField(session, nformat)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

field.apply_filter(filter_to_apply, in_place=True)
field.data[:]  # [22, 444, 5555, 77]

Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # [8, 1, 77, 22, 666, 333, 5555, 444, 0]
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the first value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans to this field, keeping the last value of each span. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans (max). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans (min). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
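To make the four span operations concrete, here is a minimal NumPy sketch. It assumes that spans are stored as n + 1 boundary offsets, so that span i covers elements `spans[i]` up to but excluding `spans[i + 1]`; that layout is an assumption and is not stated in this reference. The array names are illustrative.

```python
import numpy as np

# Assumption: spans holds n + 1 boundaries; span i covers values[spans[i]:spans[i + 1]].
values = np.array([5, 3, 9, 2, 2, 7, 1])
spans = np.array([0, 3, 5, 7])

starts, ends = spans[:-1], spans[1:]

first = values[starts]      # first element of each span
last = values[ends - 1]     # last element of each span
# reduceat reduces each slice values[starts[i]:starts[i + 1]] (the final slice runs to the end).
span_min = np.minimum.reduceat(values, starts)
span_max = np.maximum.reduceat(values, starts)
```

Each result has one entry per span, which is why the respanned field is generally shorter than the field it was computed from.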

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this field.

Parameters
  • group – h5group

  • name – Name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

A new, empty field of the same type as this field

property data

Returns a MemoryFieldArray containing the values of this field.

get_spans()

Returns the spans of this field.

is_sorted()

Returns True if the data in this field is sorted, False otherwise.

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field
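The membership test can be illustrated with `numpy.isin`. This sketch assumes the field's `isin` follows the same element-wise semantics as the NumPy function; the data values are made up for the illustration.

```python
import numpy as np

field_data = np.array([1, 22, 333, 444, 0])
test_elements = [22, 0, 999]

# One boolean per element of field_data: True where the element occurs in test_elements.
mask = np.isin(field_data, test_elements)
```

The resulting mask has the same length as the data, so it can be passed on as a boolean filter.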

logical_not()
unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a NumericMemField. Returns the sorted unique elements of a NumericMemField. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values (2) the indices of the unique array that reconstruct the input array (3) the number of times each unique value comes up in the input array

Parameters
  • return_index – boolean, if true returns index of unique elements

  • return_inverse – boolean, if true returns the indices of the unique array that reconstruct the original data

  • return_counts – boolean, if true returns counts of unique elements

Returns

numpy array
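The three optional outputs match the description of `numpy.unique`. Assuming the field method mirrors that function, the following sketch shows what each output contains; the sample data is illustrative.

```python
import numpy as np

data = np.array([30, 10, 30, 20, 10, 30])

values, index, inverse, counts = np.unique(
    data, return_index=True, return_inverse=True, return_counts=True
)

# index: position of the first occurrence of each unique value in data.
# inverse: for each element of data, its position in values.
# counts: how many times each unique value occurs.
reconstructed = values[inverse]
```

Indexing `values` with `inverse` rebuilds the original array, which is what makes the inverse useful for mapping per-key results back to rows.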

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.ReadOnlyFieldArray(field, dataset_name)

Bases: object

clear()

Clear Field Array.

complete()

Mark writing completed, usually used after calling write_part.

property dtype

Return datatype of field.

write(part)

Write data to field.

write_part(part)

Write data part to field.

class exetera.core.fields.ReadOnlyIndexedFieldArray(field, indices, values)

Bases: object

clear()

Clears field array.

complete()

Mark writing completed, usually used after calling write_part.

property dtype

Get datatype of field. Please note constructing a numpy array from IndexedString data can be very memory expensive.

write(part)

Writes data to field.

write_part(part)

Writes data part to field.

class exetera.core.fields.TimestampField(session, group, dataframe, write_enabled=False)

Bases: exetera.core.fields.HDF5Field

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

field.apply_filter(filter_to_apply, in_place=True)
field.data[:]  # prints [22, 444, 5555, 77]
Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
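The filter semantics can be sketched with a NumPy boolean mask. This assumes that applying a filter keeps exactly the elements whose filter entry is True; it uses plain arrays rather than ExeTera fields.

```python
import numpy as np

data = np.array([1, 22, 333, 444, 0, 5555, 666, 77, 8])
# An explicit boolean dtype avoids the values being treated as integer indices.
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

filtered = data[filter_to_apply]
```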

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # [8, 1, 77, 22, 666, 333, 5555, 444, 0]
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans (first). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans (last). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans (max). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans (min). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this field.

Parameters
  • group – h5group

  • name – Name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

A new, empty field of the same type as this field

property data

Get data.

get_spans()

Get spans of field.

is_sorted()

Returns True if the data in this field is sorted, False otherwise.

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a TimestampField. Returns the sorted unique elements of a TimestampField. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values (2) the indices of the unique array that reconstruct the input array (3) the number of times each unique value comes up in the input array

Parameters
  • return_index – boolean, if true returns index of unique elements

  • return_inverse – boolean, if true returns the indices of the unique array that reconstruct the original data

  • return_counts – boolean, if true returns counts of unique elements

Returns

numpy array

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.TimestampMemField(session)

Bases: exetera.core.fields.MemoryField

apply_filter(filter_to_apply, target=None, in_place=False)

Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
filter_to_apply = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

field.apply_filter(filter_to_apply, in_place=True)
field.data[:]  # prints [22, 444, 5555, 77]
Parameters
  • filter_to_apply – a Field or numpy array that contains the boolean filter data

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_index(index_to_apply, target=None, in_place=False)

Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.

Example:

field = ... # field contains data [1, 22, 333, 444, 0, 5555, 666, 77, 8]
index_to_apply = np.array([8, 0, 7, 1, 6, 2, 5, 3, 4], dtype=np.int32)

field.apply_index(index_to_apply, target_field)
target_field.data[:]  # [8, 1, 77, 22, 666, 333, 5555, 444, 0]
Parameters
  • index_to_apply – a Field or numpy array that contains the indices

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_first(spans_to_apply, target=None, in_place=False)

Apply spans (first). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_last(spans_to_apply, target=None, in_place=False)

Apply spans (last). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_max(spans_to_apply, target=None, in_place=False)

Apply spans (max). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

apply_spans_min(spans_to_apply, target=None, in_place=False)

Apply spans (min). This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the respanned data is written to.

Parameters
  • spans_to_apply – a Field or numpy array that contains the spans

  • target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.

  • in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None

Returns

The respanned field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.

create_like(group=None, name=None, timestamp=None)

Creates an empty field of the same type as this field.

Parameters
  • group – h5group

  • name – Name of the new field

  • timestamp – optional - If set, the timestamp that should be given to the new field.

Returns

A new, empty field of the same type as this field

property data

Returns a MemoryFieldArray containing the values of this field.

get_spans()

Get spans of field.

is_sorted()

Returns True if the data in this field is sorted, False otherwise.

isin(test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters

test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

unique(return_index=False, return_inverse=False, return_counts=False)

Find the unique elements of a TimestampMemField. Returns the sorted unique elements of a TimestampMemField. There are three optional outputs in addition to the unique elements: (1) the indices of the input array that give the unique values (2) the indices of the unique array that reconstruct the input array (3) the number of times each unique value comes up in the input array

Parameters
  • return_index – boolean, if true returns index of unique elements

  • return_inverse – boolean, if true returns the indices of the unique array that reconstruct the original data

  • return_counts – boolean, if true returns counts of unique elements

Returns

numpy array

writeable()

Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets.

class exetera.core.fields.WriteableFieldArray(field, dataset_name)

Bases: object

clear()

Replaces the current dataset with an empty dataset.

Returns

None

complete()

Mark writing completed, usually used after calling write_part.

Example:

field.write_part(part)
field.complete()

Returns

None

property dtype

Returns the datatype of the dataset.

write(part)

Writes data to field and marks it as complete.

Example:

part = np.array([97, 97, 100])
field.write(part)

Parameters

part – numpy array to write to field

Returns

None

write_part(part)

Writes a part of the data to the field. Call complete() after the final part has been written.

Example:

part = np.array([97, 97, 100])
field.write_part(part)
field.complete()

Parameters

part – numpy array to write to field

Returns

None
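The relationship between write_part, complete and write can be sketched with a small in-memory stand-in. `InMemoryFieldArray` is a hypothetical class written for this illustration only; it is not part of ExeTera and makes no claim about how WriteableFieldArray stores data internally.

```python
import numpy as np


class InMemoryFieldArray:
    """Hypothetical stand-in that mimics the write_part/complete protocol in memory."""

    def __init__(self):
        self._parts = []
        self.data = None

    def write_part(self, part):
        # Buffer one chunk; nothing is finalised yet.
        self._parts.append(np.asarray(part))

    def complete(self):
        # Finalise: join all buffered chunks into a single array.
        self.data = np.concatenate(self._parts) if self._parts else np.array([])
        self._parts = []

    def write(self, part):
        # Convenience: a single part followed immediately by complete().
        self.write_part(part)
        self.complete()


field = InMemoryFieldArray()
source = np.arange(10)

# Write the source in chunks of four elements, then finalise once.
for start in range(0, len(source), 4):
    field.write_part(source[start:start + 4])
field.complete()
```

The pattern is useful when the data is too large to hold in one array: each chunk is handed over as it is produced, and complete() is called once at the end.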

class exetera.core.fields.WriteableIndexedFieldArray(chunksize, indices, values)

Bases: object

clear()

Resets field, clears all indices and values.

Returns

None

complete()

Mark writing completed, usually used after calling write_part.

Example:

field.write_part(part)
field.complete()

Returns

None

property dtype

Returns the datatype of the field. Please note that constructing a numpy array from IndexedString data can be very memory expensive.

write(part)

Writes data to field and marks it as complete.

Example:

part = np.array([97, 97, 100])
field.write(part)

Parameters

part – List of strings to write to field

Returns

None

write_part(part)

Writes a part of the data to the field. Call complete() after the final part has been written.

Example:

part = np.array([97, 97, 100])
field.write_part(part)
field.complete()

Parameters

part – List of strings to be written

Returns

None

exetera.core.fields.argsort(field: exetera.core.abstract_types.Field, dtype: str = None)
exetera.core.fields.as_field(data, key=None)
exetera.core.fields.base_field_contructor(session, group, name, timestamp=None, chunksize=None)

Constructors are responsible for: 1) creating the field (an HDF5 group), 2) adding basic attributes such as chunksize, timestamp and field type, and 3) adding the dataset to the field (HDF5 group) under the name ‘values’.

exetera.core.fields.categorical_field_constructor(session, group, name, nformat, key, timestamp=None, chunksize=None)
exetera.core.fields.dtype_to_str(dtype)

Returns the string name for the given data type.

Parameters

dtype – the given data type

Returns

str

exetera.core.fields.fixed_string_field_constructor(session, group, name, length, timestamp=None, chunksize=None)
exetera.core.fields.indexed_string_field_constructor(session, group, name, timestamp=None, chunksize=None)
exetera.core.fields.isin(field: exetera.core.abstract_types.Field, test_elements: Union[list, set, numpy.ndarray])

Returns a boolean array of the same length as field that is True where an element of field is in test_elements and False otherwise.

Parameters
  • field – The field to check.

  • test_elements – The values against which to test each value of field.

Returns

a boolean array of the same length as field

exetera.core.fields.numeric_field_constructor(session, group, name, nformat, timestamp=None, chunksize=None)
exetera.core.fields.timestamp_field_constructor(session, group, name, timestamp=None, chunksize=None)

exetera.core.indexed_array module

class exetera.core.indexed_array.IndexedArray

Bases: object

exetera.core.journal module

exetera.core.journal.journal_table(session, schema, old_src, new_src, src_pk, result)
exetera.core.journal.journal_test_harness(session, schema, old_file, new_file, dest_file)

exetera.core.operations module

exetera.core.operations.apply_filter_to_index_values(index_filter, indices, values)
exetera.core.operations.apply_indices_to_index_values(indices_to_apply, indices, values)
exetera.core.operations.apply_spans_count(spans, dest_array=None)
exetera.core.operations.apply_spans_first(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_index_of_first(spans, dest_array=None)
exetera.core.operations.apply_spans_index_of_first_filter(spans, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_last(spans, dest_array=None)
exetera.core.operations.apply_spans_index_of_last_filter(spans, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_max(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_index_of_max_filter(spans, src_array, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_max_indexed(spans, src_indices, src_values, dest_array=None)
exetera.core.operations.apply_spans_index_of_min(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_index_of_min_filter(spans, src_array, dest_array, filter_array)
exetera.core.operations.apply_spans_index_of_min_indexed(spans, src_indices, src_values, dest_array=None)
exetera.core.operations.apply_spans_last(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_max(spans, src_array, dest_array=None)
exetera.core.operations.apply_spans_min(spans, src_array, dest_array=None)
exetera.core.operations.calculate_chunk_decomposition(s_start, s_end, indices, value_chunk_size, sub_chunks)
exetera.core.operations.categorical_transform(chunk, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)

Transform method for categorical importer in readerwriter.py

exetera.core.operations.check_if_sorted_for_multi_fields(fields_data)

Check if input fields data is sorted. Note that fields_data should be treated as a group key.

  • pre_row[j] < cur_row[j] means these two rows are sorted; move to the next row => i + 1

  • pre_row[j] = cur_row[j] means we need to check whether the next element is sorted => j + 1

  • pre_row[j] > cur_row[j] means the input data is not sorted
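The three comparison rules amount to a lexicographic order check across the key fields. The following is a plain-Python sketch of that logic, not the library's implementation; `is_sorted_multi` is a hypothetical name, and `fields_data` is assumed to be a non-empty list of equal-length sequences, one per field.

```python
def is_sorted_multi(fields_data):
    """Return True if the rows formed by the given columns are in
    non-decreasing lexicographic order."""
    n_rows = len(fields_data[0])
    for i in range(1, n_rows):
        for column in fields_data:
            if column[i - 1] < column[i]:
                break            # rows are in order; move on to the next row
            if column[i - 1] > column[i]:
                return False     # the previous row sorts after the current one
            # equal: fall through and compare the next column
    return True


sorted_keys = [[1, 1, 2], [5, 6, 0]]
unsorted_keys = [[1, 1, 2], [6, 5, 0]]
```

In `sorted_keys` the second column only matters where the first column ties, which is why the trailing 0 does not break the ordering.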

exetera.core.operations.chunked_copy(src_field, dest_field, chunksize=1048576)
exetera.core.operations.chunks(length, chunksize=1048576)
exetera.core.operations.compare_arrays(source[s1: s2], target[t1: t2])
exetera.core.operations.compare_indexed_rows_for_journalling(old_map, new_map, old_indices, old_values, new_indices, new_values, to_keep)
exetera.core.operations.compare_rows_for_journalling(old_map, new_map, old_field, new_field, to_keep)
exetera.core.operations.count_back(array)

This is a helper function that provides functionality specific to streaming ordered merges. It takes an array in sorted order and calculates a trimmed length that excludes the final sequence of equal values.

Example:

[10, 20, 30, 40, 50] -> 4 ([10, 20, 30, 40])
[10, 20, 30, 40, 40] -> 3 ([10, 20, 30])
[10, 20, 30, 30, 30] -> 2 ([10, 20])
[10, 20, 20, 20, 20] -> 1 ([10])
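The trimming rule in the examples above can be written as a short loop. This is a plain-Python sketch that reproduces those four cases, not the library's code; the results for an empty array or an array of all-equal values (both 0 here) are assumptions.

```python
def count_back(array):
    """Length of `array` once the final run of equal values is trimmed off."""
    if len(array) == 0:
        return 0
    trimmed = len(array) - 1
    # Walk backwards while the preceding element still equals the last value.
    while trimmed > 0 and array[trimmed - 1] == array[-1]:
        trimmed -= 1
    return trimmed
```

Holding back the final run matters in a streamed merge because the next chunk may continue that same run of keys.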
exetera.core.operations.data_iterator(data_field, chunksize=1048576)
exetera.core.operations.element_chunked_copy(src_elem, dest_elem, chunksize)
exetera.core.operations.filter_duplicate_fields(field)

DEPRECATED

exetera.core.operations.first_trimmed_chunk(field, chunk_size)
exetera.core.operations.first_untrimmed_chunk(field, chunk_size)
exetera.core.operations.fixed_string_transform(column_inds, column_vals, column_offsets, col_idx, written_row_count, strlen, memory)

Transform method for fixed string importer in field_importer.py

exetera.core.operations.foreign_key_is_in_primary_key(primary_key, foreign_key)

DEPRECATED

exetera.core.operations.generate_ordered_map_to_inner_both_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_inner_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_inner_left_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_inner_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_inner_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)

This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.

Example:

left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]

i  j op r lres rres
0  0 <  0  0   INV
1  0 =  1  1   0
2  1 =  2  2   1
2  2    3  2   2
3  3    4  3   3
3  4    5  3   4
3  5    6  3   5
4  3    7  4   3
4  4    8  4   4
4  5    9  4   5
5  6   10  5   INV
6  6   11  6   INV


left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 0, 1, 2, 3, 4, 5, 3, 4, 5, INV, INV]

Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…

i and i_max are used to track the index of the left source; j and j_max are used to track the index of the right source.

exetera.core.operations.generate_ordered_map_to_inner_right_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_inner_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_inner_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)

This function performs the most generic type of left to right mapping calculation in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping left and right collections.

As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.

This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.

Please take a look at the documentation for the partial function to understand the role that the various finite state machine parameters play.

We have to make some adjustments to the finite state machine between calls to _partial:
  • if the call used all the left_ data, add the size of that data chunk to i_off

  • if the call used all of the right_ data, add the size of that data chunk to j_off

  • write the accumulated result_ data to the result field, and reset r to 0
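Once the two result fields are populated, they can be applied as indices to bring rows from both sides into alignment. The sketch below uses plain NumPy arrays and maps worked out by hand for the keys shown; it assumes an inner mapping simply omits keys that have no match on the other side. The value arrays are illustrative.

```python
import numpy as np

left_keys = np.array([10, 20, 30, 40, 40, 50, 50])
right_keys = np.array([20, 30, 30, 40, 40, 40, 60, 70])

# Hand-derived inner maps: one entry per matching (left row, right row) pair.
l_result = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4])
r_result = np.array([0, 1, 2, 3, 4, 5, 3, 4, 5])

left_values = np.array(["a", "b", "c", "d", "e", "f", "g"])
right_values = np.arange(8) * 10

# Applying each map as an index yields row-aligned data from both sides.
aligned_left = left_values[l_result]
aligned_right = right_values[r_result]
keys_agree = bool((left_keys[l_result] == right_keys[r_result]).all())
```

Checking that the keys agree after indexing is a cheap sanity test for any generated mapping.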

exetera.core.operations.generate_ordered_map_to_left_both_unique(first, second, result, invalid)
exetera.core.operations.generate_ordered_map_to_left_both_unique_partial(left, right, r_result, invalid, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_left_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_left_left_unique_partial(left, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_left_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_left_partial(left, i_max, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)

This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.

Example:

left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]

i  j op r lres rres
0  0 <  0  0   INV
1  0 =  1  1   0
2  1 =  2  2   1
2  2    3  2   2
3  3    4  3   3
3  4    5  3   4
3  5    6  3   5
4  3    7  4   3
4  4    8  4   4
4  5    9  4   5
5  6   10  5   INV
6  6   11  6   INV


left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 0, 1, 2, 3, 4, 5, 3, 4, 5, INV, INV]

Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…

i and i_max are used to track the index of the left source; j and j_max are used to track the index of the right source.
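
The following is a minimal pure-Python sketch of the mapping this function computes, without the chunking, offsets, or njit-oriented state of the real implementation. The name ordered_left_map and the use of -1 as the invalid value are assumptions made for illustration. Applied to the left and right keys of the worked example, it reproduces the lres and rres columns of the table.

```python
INVALID = -1  # assumed stand-in for the 'invalid' sentinel (INV in the table)


def ordered_left_map(left, right, invalid=INVALID):
    """Reference left map for two sorted key lists that may contain repeats."""
    left_map, right_map = [], []
    j = 0
    for i, key in enumerate(left):
        # skip right keys that are smaller than the current left key
        while j < len(right) and right[j] < key:
            j += 1
        # emit one output row per right entry equal to the current left key
        jj = j
        while jj < len(right) and right[jj] == key:
            left_map.append(i)
            right_map.append(jj)
            jj += 1
        if jj == j:
            # no match: the left row is kept and mapped to the invalid value
            left_map.append(i)
            right_map.append(invalid)
    return left_map, right_map


left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]
left_map, right_map = ordered_left_map(left, right)
# left_map  == [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
# right_map == [-1, 0, 1, 2, 3, 4, 5, 3, 4, 5, -1, -1]
```

Note that j is not advanced past a matched run, so a repeated left key (40 in the example) is paired with the same run of right entries again.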

exetera.core.operations.generate_ordered_map_to_left_remaining(i_max, l_result, r_result, i_off, i, r, invalid)
exetera.core.operations.generate_ordered_map_to_left_right_unique(first, second, result, invalid)
exetera.core.operations.generate_ordered_map_to_left_right_unique_partial(left, i_max, right, r_result, invalid, j_off, i, j, r)
exetera.core.operations.generate_ordered_map_to_left_right_unique_partial_old(d_j, left, right, left_to_right, invalid)

Returns
  • [0]: how many positions forward i moved

  • [1]: how many positions forward j moved

  • [2]: how many elements were written

exetera.core.operations.generate_ordered_map_to_left_right_unique_remaining(i_max, r_result, i, r, invalid)
exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)
exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed_old(left, right, left_to_right, invalid=- 1, chunksize=1048576)
exetera.core.operations.generate_ordered_map_to_left_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)

This function performs the most generic type of left to right mapping calculation in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping the left and right collections.

As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.

This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.

Please refer to the documentation for the _partial function to understand the role that the various finite state machine parameters play.

We have to make some adjustments to the finite state machine between calls to _partial:
  • if the call used all the left_ data, add the size of that data chunk to i_off

  • if the call used all of the right_ data, add the size of that data chunk to j_off

  • write the accumulated result_ data to the result field, and reset r to 0

exetera.core.operations.get_byte_map(string_map)

Get the byte indices and byte values from categorical key-value pairs.

exetera.core.operations.get_indexed_string_unique(indices, values, unique_result, unique_index, unique_inverse, unique_counts)

Find the unique elements of an indexed string field using an njit function.

exetera.core.operations.get_map_datatype_based_on_lengths(left_len, right_len)
exetera.core.operations.get_map_subchunks_based_on_index_lengths(map_, invalid, chunksize)
exetera.core.operations.get_next_chunk(start: int, chunk_size: int, field: exetera.core.abstract_types.Field)

This is a helper function that provides functionality specific to streaming ordered merges. It assumes that field is in sorted order.

This function is used to fetch chunks of memory from a field to be consumed by streaming merges. It first fetches the chunk of a given chunk size, or the size of the remaining memory, whichever is smaller. It then ‘trims’ that memory by removing the last sequence of equal values from the valid range.

Parameters
  • start – The start of the chunk to be returned

  • chunk_size – The size of the chunk to be considered. The returned chunk will always be shorter than this unless it is the final chunk of the field data

  • field – The field from which data should be fetched. This field must be in sorted order

Returns

A tuple representing the range (inclusive, exclusive) and a numpy ndarray containing the data. Note that the array is typically longer than the range returned, as we do not trim the data for performance reasons.
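
The trimming step can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the library code: the function name trimmed_chunk_range is hypothetical, the data is a plain sorted list rather than a Field, and raising an error when a single run of equal values fills the whole chunk is a choice made for this sketch only.

```python
def trimmed_chunk_range(data, start, chunk_size):
    """Return (start, end) for the next chunk of sorted data, trimmed so that
    a run of equal values is never split across two chunks."""
    end = min(start + chunk_size, len(data))
    if end == len(data):
        return start, end  # final chunk: nothing needs trimming
    # drop the trailing run of equal values; the next chunk picks it up whole
    last = data[end - 1]
    trimmed = end
    while trimmed > start and data[trimmed - 1] == last:
        trimmed -= 1
    if trimmed == start:
        # assumption for this sketch: refuse to return an empty chunk
        raise ValueError("chunk_size is too small for this run of equal values")
    return start, trimmed


data = [1, 1, 2, 2, 2, 3, 3, 4]
first = trimmed_chunk_range(data, 0, 4)          # (0, 2)
second = trimmed_chunk_range(data, first[1], 4)  # (2, 5)
third = trimmed_chunk_range(data, second[1], 4)  # (5, 8)
```

Keeping each run of equal keys inside a single chunk is what lets a streaming merge treat the chunks independently.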

exetera.core.operations.get_spans_for_field(ndarray)
exetera.core.operations.get_valid_value_extents(chunk, start, end, invalid=- 1)
exetera.core.operations.is_ordered(field)
exetera.core.operations.isin_for_indexed_string_field(test_elements, indices, values)
exetera.core.operations.isin_indexed_string_speedup(test_elements, indices, values)
exetera.core.operations.leaky_categorical_transform(chunk, freetext_indices, freetext_values, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)

Transform method for categorical importer in readerwriter.py

exetera.core.operations.map_valid(data_field, map_field, result=None, invalid=- 1)
exetera.core.operations.merge_entries_segment(i_start, cur_old_start, old_map, new_map, to_keep, old_src, new_src, dest)
Parameters
  • i_start – the initial value to apply to ‘i’

  • cur_old_start – the initial value to apply to ‘cur_old’

  • old_map – the map (in i-space) for the existing records

  • new_map – the map (in i-space) for the new records

  • to_keep – the flags (in i-space) indicating whether the new record should be kept

  • old_src – the source for the existing records

  • new_src – the source for the new records

  • dest – the sink for the merged sources

Returns

exetera.core.operations.merge_indexed_journalled_entries(old_map, new_map, to_keep, old_src_inds, old_src_vals, new_src_inds, new_src_vals, dest_inds, dest_vals)
exetera.core.operations.merge_indexed_journalled_entries_count(old_map, new_map, to_keep, old_src_inds, new_src_inds)
exetera.core.operations.merge_journalled_entries(old_map, new_map, to_keep, old_src, new_src, dest)
exetera.core.operations.next_chunk(current: int, length: int, desired: int)

This is a helper function that can be used whenever you want to access a large sequence of data in chunks. It simply carries out the calculation that returns the extents of the next chunk, taking into account the length of the sequence. The sequence itself is not required here, only the length.

Parameters
  • current – the starting point of the chunk

  • length – the length of the sequence being chunked

  • desired – the requested length of the chunk

Returns

A tuple of the chunk extents. The first value is inclusive; the second is exclusive
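
The calculation described above amounts to clamping the end of the chunk to the length of the sequence. A minimal sketch consistent with that description (the body is an assumption, not the library source):

```python
def next_chunk(current, length, desired):
    """Return (inclusive start, exclusive end) of the next chunk."""
    return current, min(current + desired, length)


full = next_chunk(0, 10, 4)   # (0, 4)
short = next_chunk(8, 10, 4)  # (8, 10): clamped to the sequence length
```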

exetera.core.operations.next_map_subchunk(map_, sm, invalid, chunksize)
exetera.core.operations.next_trimmed_chunk(field, chunk, chunk_size)
exetera.core.operations.next_untrimmed_chunk(field, chunk, chunk_size)
exetera.core.operations.numeric_bool_transform(elements, validity, column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, field_name)

Transform method for numeric importer (bool) in readerwriter.py

exetera.core.operations.ordered_generate_journalling_indices(old, new)
exetera.core.operations.ordered_get_last_as_filter(field)
exetera.core.operations.ordered_inner_map(left, right, left_to_inner, right_to_inner)
exetera.core.operations.ordered_inner_map_both_unique(left, right, left_to_inner, right_to_inner)
exetera.core.operations.ordered_inner_map_left_unique(left, right, left_to_inner, right_to_inner)
exetera.core.operations.ordered_inner_map_left_unique_partial(d_i, d_j, left, right, left_to_inner, right_to_inner)

Returns
  • [0]: how many positions forward i moved

  • [1]: how many positions forward j moved

  • [2]: how many elements were written

exetera.core.operations.ordered_inner_map_left_unique_streamed(left, right, left_to_inner, right_to_inner, chunksize=1048576)
exetera.core.operations.ordered_inner_map_result_size(left, right)
exetera.core.operations.ordered_left_map_result_size(left, right)
exetera.core.operations.ordered_map_valid_indexed_partial(sm_values, sm_start, sm_end, indices, i_start, i_max, values, mv_start, result_indices, result_values, invalid, sm, ri, rv, ri_accum)
exetera.core.operations.ordered_map_valid_indexed_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576, value_factor=8)
exetera.core.operations.ordered_map_valid_partial(values, map_values, sm_start, sm_end, d_start, result_data, invalid, invalid_value)
exetera.core.operations.ordered_map_valid_partial_old(d, data_field, map_field, result, invalid)
exetera.core.operations.ordered_map_valid_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)

  • for each map chunk
    • calculate sub chunks based on indices
    • for each sub chunk
      • map indices for sub chunk

exetera.core.operations.ordered_map_valid_stream_old(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)
exetera.core.operations.ordered_outer_map_result_size_both_unique(left, right)
exetera.core.operations.raiseNumericException(exception_message, exception_args)
exetera.core.operations.safe_map_indexed_values(data_indices, data_values, map_field, map_filter, empty_value=None)
exetera.core.operations.safe_map_values(data_field, map_field, map_filter, empty_value=None)
exetera.core.operations.str_to_dtype(str_dtype)
exetera.core.operations.streaming_sort_partial(in_chunk_indices, in_chunk_lengths, src_value_chunks, src_index_chunks, dest_value_chunk, dest_index_chunk)
exetera.core.operations.transform_float(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)

Transform float method for numeric importer in field_importer.py

exetera.core.operations.transform_int(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)

Transform int method for numeric importer in field_importer.py

exetera.core.operations.transform_to_values(column_inds, column_vals, column_offsets, col_idx, written_row_count)

Transform method for byte data from np.int to np.bytes_

exetera.core.operations.unique_for_indexed_string(indices, values, return_index, return_inverse, return_counts)

Find the unique elements of an indexed string field.

exetera.core.regression module

exetera.core.session module

class exetera.core.session.Session(chunksize: int = 1048576, timestamp: str = '2023-01-18 11:14:27.097526+00:00')

Bases: exetera.core.abstract_types.AbstractSession

Session is the top-level object that is used to create and open ExeTera Datasets. It also provides operations that can be performed on Fields. For a more detailed explanation of Session and examples of its usage, please refer to https://github.com/KCL-BMEIS/ExeTera/wiki/Session-API

Parameters
  • chunksize – Change the default chunksize that fields created with this dataset use. Note this is a hint parameter and future versions of Session may choose to ignore it if it is no longer required. In general, it should only be changed for testing.

  • timestamp – Set the official timestamp for the Session’s creation rather than taking the current date/time.

aggregate_count(index, dest=None)

Finds the number of entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Result: 3     2   1 1 2   3
Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which count is applied.

  • dest – If set, a Field to which the resulting counts are written

Returns

A numpy array containing the resulting values
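
The counting behaviour in the example can be reproduced with a few lines of standard-library Python. This is a sketch of the semantics only: the name count_runs is hypothetical, and the real method works on numpy arrays or Fields.

```python
from itertools import groupby


def count_runs(index):
    """Count the entries in each run of contiguous equal index values."""
    return [sum(1 for _ in run) for _, run in groupby(index)]


index = list("aaabbxaccddd")
counts = count_runs(index)  # [3, 2, 1, 1, 2, 3]
```

The value a appears in two separate runs and is counted separately for each, which is why the index should be sorted if one count per distinct value is wanted.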

aggregate_custom(predicate, index, target=None, dest=None)
aggregate_first(index, target=None, dest=None)

Finds the first entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 1     4   6 7 8   0

Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which first is applied.

  • target – A numpy array to which the index and predicate are applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_last(index, target=None, dest=None)

Finds the last entries within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 3     5   6 7 9   2
Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which last is applied.

  • target – A numpy array to which the index and predicate are applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_max(index, target=None, dest=None)

Finds the maximum value within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 3     5   6 7 9   2

Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which max is applied.

  • target – A numpy array to which the index and predicate are applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values

aggregate_min(index, target=None, dest=None)

Finds the minimum value within each sub-group of index.

Example:

Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 1     4   6 7 8   0
Parameters
  • index – A numpy array or Field containing the index that defines the ranges over which min is applied.

  • target – A numpy array to which the index and predicate are applied

  • dest – If set, a Field to which the resulting values are written

Returns

A numpy array containing the resulting values
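
The first, last, max and min aggregations share one pattern: split target into the runs defined by index, then reduce each run. The sketch below shows that pattern in pure Python. The name aggregate_runs and the idea of passing a reducer function are assumptions for illustration, not the Session API.

```python
from itertools import groupby


def aggregate_runs(index, target, reducer):
    """Apply reducer to the slice of target covered by each run in index."""
    results = []
    position = 0
    for _, run in groupby(index):
        run_length = sum(1 for _ in run)
        results.append(reducer(target[position:position + run_length]))
        position += run_length
    return results


index = list("aaabbxaccddd")
target = [1, 2, 3, 5, 4, 6, 7, 8, 9, 2, 1, 0]

minima = aggregate_runs(index, target, min)              # [1, 4, 6, 7, 8, 0]
maxima = aggregate_runs(index, target, max)              # [3, 5, 6, 7, 9, 2]
firsts = aggregate_runs(index, target, lambda v: v[0])   # [1, 5, 6, 7, 8, 2]
lasts = aggregate_runs(index, target, lambda v: v[-1])   # [3, 4, 6, 7, 9, 0]
```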

apply_filter(filter_to_apply, src, dest=None)

Apply a filter to a src field. The filtered field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.

Parameters
  • filter_to_apply – the filter to be applied to the source field, an array of booleans

  • src – the field to be filtered

  • dest – optional - a field to write the filtered data to

Returns

the filtered values

apply_index(index_to_apply, src, dest=None)

Apply an index to a src field. The indexed field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.

Parameters
  • index_to_apply – the index to be applied to the source field, must be one of Group, Field, or ndarray

  • src – the field to be indexed

  • dest – optional - a field to write the indexed data to

Returns

the indexed values
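
The difference between apply_filter and apply_index can be shown on plain lists. The helper names filter_values and index_values are hypothetical. The sketch covers simple value fields only and ignores dest and the separate indices and values of an IndexedStringField.

```python
def filter_values(filter_to_apply, src):
    """Keep the entries of src whose corresponding filter flag is true."""
    return [value for keep, value in zip(filter_to_apply, src) if keep]


def index_values(index_to_apply, src):
    """Gather entries of src in the order given by index_to_apply."""
    return [src[i] for i in index_to_apply]


src = [10, 20, 30, 40]
filtered = filter_values([True, False, True, False], src)  # [10, 30]
reordered = index_values([3, 0, 0, 2], src)                # [40, 10, 10, 30]
```

A filter can only drop entries and preserves their order, whereas an index can reorder and repeat them.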

apply_spans_concat(spans, target, dest, src_chunksize=None, dest_chunksize=None, chunksize_mult=None)
apply_spans_count(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the number of entries within each span.

Parameters
  • spans – the numpy array of spans to be applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_first(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the first entry within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_first(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the first entry within each span.

Parameters
  • spans – the numpy array of spans to be applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_last(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the last entry within each span.

Parameters
  • spans – the numpy array of spans to be applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the maximum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_index_of_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the index of the minimum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_last(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the last entry within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the maximum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values

apply_spans_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)

Finds the minimum value within each span on a target field.

Parameters
  • spans – the numpy array of spans to be applied

  • target – the field to which the spans are applied

  • dest – if set, the field to which the results are written

Returns

A numpy array containing the resulting values
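
The apply_spans_* methods all walk the same span boundaries and differ only in what they compute per span. The pure-Python sketch below assumes the span representation described under get_spans (span i runs from element i, inclusive, to element i+1, exclusive). The helper names are hypothetical, and ties in the index-of helpers resolve to the earliest position, which may not match the library.

```python
def span_ranges(spans):
    """Pair each span start with its exclusive end."""
    return list(zip(spans[:-1], spans[1:]))


def spans_count(spans):
    return [end - start for start, end in span_ranges(spans)]


def spans_reduce(spans, target, reducer):
    return [reducer(target[start:end]) for start, end in span_ranges(spans)]


def spans_index_of(spans, target, chooser):
    # chooser is max or min; it selects a position by comparing target values
    return [chooser(range(start, end), key=lambda i: target[i])
            for start, end in span_ranges(spans)]


spans = [0, 3, 5, 6, 7, 9, 12]
target = [1, 2, 3, 5, 4, 6, 7, 8, 9, 2, 1, 0]

counts = spans_count(spans)                         # [3, 2, 1, 1, 2, 3]
maxima = spans_reduce(spans, target, max)           # [3, 5, 6, 7, 9, 2]
minima = spans_reduce(spans, target, min)           # [1, 4, 6, 7, 8, 0]
index_of_max = spans_index_of(spans, target, max)   # [2, 3, 5, 6, 8, 9]
index_of_min = spans_index_of(spans, target, min)   # [0, 4, 5, 6, 7, 11]
```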

chunks(length: int, chunksize: Optional[int] = None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

‘chunks’ is a convenience method that, given an overall length and a chunksize, will yield a set of ranges for the chunks in question, i.e. chunks(1048576, 500000) -> (0, 500000), (500000, 1000000), (1000000, 1048576)

Parameters
  • length – The range to be split into chunks

  • chunksize – Optional parameter detailing the size of each chunk. If not set, the chunksize that the Session was initialized with is used.
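
A generator with the documented behaviour can be sketched as below. It is a stand-alone function rather than the Session method, so chunksize is required here, and it is assumed to be a positive integer.

```python
def chunks(length, chunksize):
    """Yield (inclusive start, exclusive end) ranges that cover [0, length)."""
    current = 0
    while current < length:
        end = min(current + chunksize, length)
        yield current, end
        current = end


ranges = list(chunks(1048576, 500000))
# [(0, 500000), (500000, 1000000), (1000000, 1048576)]
```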

close()

Close all open datasets.

Returns

None

close_dataset(name: str)

Close the dataset with the given name. If there is no dataset with that name, do nothing.

Parameters

name – The name of the dataset to be closed

Returns

None

create_categorical(group, name, nformat, key, timestamp=None, chunksize=None)

Create a categorical field in the given DataFrame with the given name. This function also takes a numerical format for the numeric representation of the categories, and a key that maps numeric values to their string descriptions.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64). It is recommended to use ‘int8’.

  • key – A dictionary that maps numerical values to their string representations

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_fixed_string(group, name, length, timestamp=None, chunksize=None)

Create a fixed string field in the given DataFrame, given name, and given max string length per entry.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • length – The maximum length in bytes that each entry can have.

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_indexed_string(group, name, timestamp=None, chunksize=None)

Create an indexed string field in the given DataFrame with the given name.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_like(field, dest_group, dest_name, timestamp=None, chunksize=None)

Create a field of the same type as an existing field, in the location and with the name provided.

Example:

with Session() as s:
  ...
  a = s.get(table_1['a'])
  b = s.create_like(a, table_2, 'a_times_2')
  b.data.write(a.data[:] * 2)
Parameters
  • field – The Field whose type is to be copied

  • dest_group – The group in which the new field should be created

  • dest_name – The name of the new field

create_numeric(group, name, nformat, timestamp=None, chunksize=None)

Create a numeric field in the given DataFrame with the given name.

Parameters
  • group – The group in which the new field should be created

  • name – The name of the new field

  • nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64). It is recommended to avoid uint64 as certain operations in numpy cause conversions to floating point values.

  • timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.

  • chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.

create_timestamp(group, name, timestamp=None, chunksize=None)

Create a timestamp field in the given group with the given name.

dataset_sort_index(sort_indices, index=None)

Generate a sorted index based on a set of fields upon which to sort and an optional index to apply to the sort_indices.

Parameters
  • sort_indices – a tuple or list of indices that determine the sorted order

  • index – optional - the index by which the initial field should be permuted

Returns

the resulting index that can be used to permute unsorted fields

distinct(field=None, fields=None, filter=None)

todo: confirm deprecated.

get(field: Union[exetera.core.abstract_types.Field, h5py._hl.group.Group])

Get a Field from a h5py Group.

Example:

# this code for context
with Session() as s:

  # open a dataset about wildlife
  src = s.open_dataset("/my/wildlife/dataset.hdf5", "r", "src")

  # fetch the group containing bird data
  birds = src['birds']

  # get the bird decibel field
  bird_decibels = s.get(birds['decibels'])
Parameters

field – The Field or Group object to retrieve.

get_dataset(name: str)

Get the dataset with the given name. If there is no dataset with that name, raise a KeyError indicating that the dataset with that name is not present.

Parameters

name – Name of the dataset to be fetched. This is the name that was given to it when it was opened through open_dataset().

Returns

Dataset with that name.

get_index(target, foreign_key, destination=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please make use of DataFrame.merge functionality instead. This method can be emulated by adding an index (via np.arange) to a dataframe, performing a merge and then fetching the mapped index field.

‘get_index’ maps a primary key (‘target’) into the space of a foreign key (‘foreign_key’).

get_or_create_group(group: Union[h5py._hl.group.Group, h5py._hl.files.File], name: str)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

get_shared_index(keys: Tuple[numpy.ndarray])

Create a shared index based on a tuple of numpy arrays containing keys. This function generates the sorted union of a tuple of key fields and then maps the individual arrays to their corresponding indices in the sorted union.

Parameters

keys – a tuple of groups, fields or ndarrays whose contents represent keys

Example:

key_1 = ['a', 'b', 'e', 'g', 'i']
key_2 = ['b', 'b', 'c', 'c', 'e', 'g', 'j']
key_3 = ['a', 'd', 'e', 'g', 'h', 'h', 'i']

sorted_union = ['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j']

key_1_index = [0, 1, 4, 5, 7]
key_2_index = [1, 1, 2, 2, 4, 5, 8]
key_3_index = [0, 3, 4, 5, 6, 6, 7]
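
The two steps described above, building the sorted union and then mapping each key array into it, can be sketched with the standard library. The function name shared_index is hypothetical, and the real method operates on numpy arrays, groups, or fields. The third key used here is an illustrative list chosen so that its result matches the key_3_index line of the example.

```python
from bisect import bisect_left


def shared_index(keys):
    """Return the sorted union of all keys and, for each key sequence,
    the position of every entry within that union."""
    sorted_union = sorted(set().union(*keys))
    indices = [[bisect_left(sorted_union, value) for value in key]
               for key in keys]
    return sorted_union, indices


key_1 = ['a', 'b', 'e', 'g', 'i']
key_2 = ['b', 'b', 'c', 'c', 'e', 'g', 'j']
key_3 = ['a', 'd', 'e', 'g', 'h', 'h', 'i']  # illustrative third key

sorted_union, (key_1_index, key_2_index, key_3_index) = shared_index(
    (key_1, key_2, key_3))
# sorted_union == ['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j']
# key_1_index  == [0, 1, 4, 5, 7]
# key_2_index  == [1, 1, 2, 2, 4, 5, 8]
# key_3_index  == [0, 3, 4, 5, 6, 6, 7]
```
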
get_spans(field: Union[exetera.core.abstract_types.Field, numpy.ndarray] = None, dest: exetera.core.abstract_types.Field = None, **kwargs)

Calculate a set of spans that indicate contiguous equal values. The entries in the result array correspond to the inclusive start and exclusive end of the span (the ith span is represented by element i and element i+1 of the result array). The last entry of the result array is the length of the source field.

Only one of ‘field’ or ‘fields’ may be set. If ‘fields’ is used and more than one field specified, the fields are effectively zipped and the check for spans is carried out on each corresponding tuple in the zipped field.

Example:

field: [1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2]
result: [0, 1, 3, 6, 7, 10, 15]
Parameters
  • field – A Field or numpy array to be evaluated for spans

  • dest – A destination Field to store the result

  • **kwargs – See below. If a parameter is set both positionally and in kwargs, the kwargs value is used

Keyword Arguments
  • field – Same as the field parameter, for when the user specifies field as a keyword

  • fields – A tuple of Fields or tuple of numpy arrays to be evaluated for spans

  • dest – Same as the dest parameter, for when the user specifies dest as a keyword

Returns

The resulting set of spans as a numpy array
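
The span calculation can be sketched in pure Python as below. The name spans_of is hypothetical, the input is assumed to be a non-empty sequence, and the multi-field case is modelled by zipping the fields into tuples as the description suggests.

```python
def spans_of(values):
    """Return span boundaries for runs of contiguous equal values.

    Span k runs from result[k] (inclusive) to result[k + 1] (exclusive).
    Assumes values is non-empty.
    """
    spans = [0]
    for i in range(1, len(values)):
        if values[i] != values[i - 1]:
            spans.append(i)  # a new run starts at position i
    spans.append(len(values))
    return spans


field = [1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2]
single = spans_of(field)  # [0, 1, 3, 6, 7, 10, 15]

# Several fields: a span ends wherever any of the zipped values changes.
field_a = [1, 1, 1, 2]
field_b = [5, 5, 6, 6]
zipped = spans_of(list(zip(field_a, field_b)))  # [0, 2, 3, 4]
```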

join(destination_pkey, fkey_indices, values_to_join, writer=None, fkey_index_spans=None)

This method is due for removal and should not be used. Please use the merge or ordered_merge functions instead.

list_datasets()

List the open datasets for this Session object. This is returned as a tuple of strings rather than the datasets themselves. The individual datasets can be fetched using get_dataset().

Example:

names = s.list_datasets()
datasets = [s.get_dataset(n) for n in names]
Returns

A tuple containing the names of the currently open datasets for this Session object

merge_inner(left_on, right_on, left_fields=None, left_writers=None, right_fields=None, right_writers=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style inner join on left_fields, outputting the result to left_writers, if set.

Parameters
  • left_on – The key on the left hand side on which the join is performed

  • right_on – The key on the right hand side on which the join is performed

  • left_fields – The fields to be mapped from left to inner

  • left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

  • right_fields – The fields to be mapped from right to inner

  • right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

merge_left(left_on, right_on, right_fields=(), right_writers=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style left join on right_fields, outputting the result to right_writers, if set.

Parameters
  • left_on – The key on the left hand side on which the join is performed

  • right_on – The key on the right hand side on which the join is performed

  • right_fields – The fields to be mapped from right to left

  • right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

merge_right(left_on, right_on, left_fields=(), left_writers=None)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Perform a database-style right join on left_fields, outputting the result to left_writers, if set.

Parameters
  • left_on – The key on the left hand side on which the join is performed

  • right_on – The key on the right hand side on which the join is performed

  • left_fields – The fields to be mapped from left to right

  • left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.

open_dataset(dataset_path: Union[str, IO[bytes]], mode: str, name: str)

Open a dataset with the given access mode.

Parameters
  • dataset_path – the path to the dataset

  • mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.

  • name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling get_dataset().

Returns

The top-level dataset object

ordered_merge_inner(left_on, right_on, left_field_sources=(), left_field_sinks=None, right_field_sources=(), right_field_sinks=None, left_unique=False, right_unique=False)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Generate the results of an inner join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters
  • left_on – the group/field/numpy array that contains the left key values

  • right_on – the group/field/numpy array that contains the right key values

  • right_to_left_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge

  • right_field_sources – a tuple of group/fields/numpy arrays that contain the fields to be joined

  • right_field_sinks – optional - a tuple of group/fields/numpy arrays that the mapped fields should be written to

  • left_unique – a hint to indicate whether the ‘left_on’ field contains unique values

  • right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If right_field_sinks is not set, a tuple of the output fields is returned

ordered_merge_left(left_on, right_on, right_field_sources=(), left_field_sinks=None, left_to_right_map=None, left_unique=False, right_unique=False)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Generate the results of a left join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘left_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: to achieve the best scalability, use groups / fields rather than numpy arrays, and provide a tuple of groups/fields to left_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters
  • left_on – the group/field/numpy array that contains the left key values

  • right_on – the group/field/numpy array that contains the right key values

  • left_to_right_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge

  • right_field_sources – a tuple of groups/fields/numpy arrays that contain the fields to be joined

  • left_field_sinks – optional - a tuple of groups/fields/numpy arrays that the mapped fields should be written to

  • left_unique – a hint to indicate whether the ‘left_on’ field contains unique values

  • right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If left_field_sinks is not set, a tuple of the output fields is returned
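
The role of a left-to-right map can be sketched in plain Python. This is an illustration under stated assumptions rather than ExeTera's code: the helper names build_left_to_right_map and apply_map are hypothetical, both key lists are assumed sorted in ascending order, the right keys are assumed unique, and -1 is used here as an arbitrary marker for "no match".

```python
def build_left_to_right_map(left_keys, right_keys):
    """For each left row, the index of the matching right row, or -1.

    Hypothetical helper. Assumes both lists are sorted ascending and that
    right_keys contains no duplicates.
    """
    mapping = []
    j = 0
    for key in left_keys:
        # Skip right rows whose key is smaller than the current left key.
        while j < len(right_keys) and right_keys[j] < key:
            j += 1
        if j < len(right_keys) and right_keys[j] == key:
            mapping.append(j)
        else:
            mapping.append(-1)
    return mapping


def apply_map(mapping, source, missing=None):
    """Pull values from a right-hand field into left-hand row order."""
    return [source[m] if m >= 0 else missing for m in mapping]


mapping = build_left_to_right_map([1, 2, 2, 5], [2, 3, 5])
print(mapping)
print(apply_map(mapping, ["x", "y", "z"]))
```

With unique right keys the map has exactly one entry per left row, so the same map can be reused to bring across any number of right-hand fields.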

ordered_merge_right(left_on, right_on, left_field_sources=(), right_field_sinks=None, right_to_left_map=None, left_unique=False, right_unique=False)

Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.

Please use DataFrame.merge instead.

Generate the results of a right join and apply it to the fields described in the tuple ‘left_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.

Note: to achieve the best scalability, use groups / fields rather than numpy arrays, and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.

Parameters
  • left_on – the group/field/numpy array that contains the left key values

  • right_on – the group/field/numpy array that contains the right key values

  • right_to_left_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge

  • left_field_sources – a tuple of groups/fields/numpy arrays that contain the fields to be joined

  • right_field_sinks – optional - a tuple of groups/fields/numpy arrays that the mapped fields should be written to

  • left_unique – a hint to indicate whether the ‘left_on’ field contains unique values

  • right_unique – a hint to indicate whether the ‘right_on’ field contains unique values

Returns

If right_field_sinks is not set, a tuple of the output fields is returned

set_timestamp(timestamp: str = '2023-01-18 11:14:27.097586+00:00')

Set the default timestamp to be used when creating fields without specifying an explicit timestamp.

Parameters

timestamp – a string representing a valid Datetime

Returns

None

sort_on(src_group: h5py._hl.group.Group, dest_group: h5py._hl.group.Group, keys: Union[tuple, list], timestamp=datetime.datetime(2023, 1, 18, 11, 14, 27, 97592, tzinfo=datetime.timezone.utc), write_mode='write', verbose=True)

Sort a group (src_group) of fields by the specified set of keys, and write the sorted fields to dest_group.

Parameters
  • src_group – the group of fields that are to be sorted

  • dest_group – the group into which sorted fields are written

  • keys – fields to sort on

  • timestamp – optional - timestamp to write on the sorted fields

  • write_mode – optional - write mode to use if the destination fields already exist

Returns

None
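
The idea behind sorting a whole group by a set of keys can be sketched as follows. This is not the ExeTera implementation: dictionaries of lists stand in for HDF5 groups, the name sort_group is hypothetical, and treating the first key as the most significant is an assumption.

```python
def sort_group(src_group, dest_group, keys):
    """Sort every field in src_group by the given key fields.

    Hypothetical helper. Groups are modelled as dicts mapping field names to
    equal-length lists. Assumes the first key is the most significant.
    """
    row_count = len(next(iter(src_group.values())))
    # Compute one permutation from the key fields...
    order = sorted(range(row_count),
                   key=lambda row: tuple(src_group[k][row] for k in keys))
    # ...then apply that same permutation to every field in the group.
    for name, values in src_group.items():
        dest_group[name] = [values[row] for row in order]


src = {"id": [3, 1, 2, 1], "ts": [10, 30, 20, 5], "v": ["a", "b", "c", "d"]}
dest = {}
sort_group(src, dest, ("id", "ts"))
print(dest)
```

Computing the permutation once and reusing it keeps all fields row-aligned, which is the property a group-level sort needs to preserve.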

temp_filename()

exetera.core.utils module

class exetera.core.utils.Timer(start_msg, new_line=False, end_msg='completed in')

Bases: object

exetera.core.utils.build_histogram(dataset, filtered_records=None, tx=None)
exetera.core.utils.check_input_lengths(names, fields)
exetera.core.utils.count_flag_empty(flags)
exetera.core.utils.count_flag_not_set(flags, flag_to_test)
exetera.core.utils.count_flag_set(flags, flag_to_test)
exetera.core.utils.datetime_to_seconds(dt)
exetera.core.utils.filter_field(fields, filter_list, f_missing, f_bad, is_type_fn, type_fn, valid_fn)
exetera.core.utils.find_longest_sequence_of(string, char)
exetera.core.utils.get_min_max(value_type)
exetera.core.utils.guess_encoding(filename)

Attempt to determine the encoding of the given text file by reading the byte order mark, defaulting to utf-8 if none is found.

Parameters

filename – path to a text file containing possible UTF-8, UTF-16, or UTF-32 text

Returns

encoding name, one of utf-8, utf-8-sig, utf-16, utf-32
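
Byte-order-mark detection of this kind can be sketched with the standard library's codecs constants. The function below, guess_encoding_sketch, is a hypothetical stand-in and not the library's code. The UTF-32 marks are checked before the UTF-16 ones because the little-endian UTF-32 mark begins with the same two bytes as the little-endian UTF-16 mark.

```python
import codecs


def guess_encoding_sketch(filename):
    """Return an encoding name based on the file's leading byte order mark.

    Hypothetical helper for illustration. Falls back to 'utf-8' when no
    byte order mark is present.
    """
    with open(filename, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    # UTF-32 first: its little-endian BOM starts with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"
```

The returned name can be passed directly as the encoding argument of open(), since the utf-8-sig, utf-16 and utf-32 codecs all consume the byte order mark when decoding.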

exetera.core.utils.map_between_categories(first_map, second_map)
exetera.core.utils.one_dim_data_to_indexed_for_test(data, field_size)
exetera.core.utils.string_to_datetime(field)
exetera.core.utils.timestamp_to_day(field)
exetera.core.utils.validate_file_exists(file_name)

exetera.core.validation module

exetera.core.validation.all_same_basic_type(name, fields)
exetera.core.validation.array_from_field_or_lower(name, field)
exetera.core.validation.array_from_parameter(session, name, field)
exetera.core.validation.ensure_valid_field(name, field)
exetera.core.validation.ensure_valid_field_like(name, field)
exetera.core.validation.field_from_parameter(session, name, field)
exetera.core.validation.is_field_parameter(field)
exetera.core.validation.raw_array_from_parameter(datastore, name, field)
exetera.core.validation.validate_all_field_length_in_df(df: exetera.core.abstract_types.DataFrame)
exetera.core.validation.validate_and_get_key_fields(side, df, key)
exetera.core.validation.validate_and_normalize_categorical_key(param_name, key)
exetera.core.validation.validate_boolean_row_filter(name, field)
exetera.core.validation.validate_chunk_size(chunk_size_name, chunk_size)
exetera.core.validation.validate_field_lengths(side, lens, df, names=None)
exetera.core.validation.validate_filter(filter_to_apply)
exetera.core.validation.validate_groupby_target(target, by, all)
exetera.core.validation.validate_key_field_consistency(lname, rname, lkey, rkey)
exetera.core.validation.validate_key_lengths(side, df, key)
exetera.core.validation.validate_require_key(context, key, dictionary)
exetera.core.validation.validate_selected_keys(by, all)

Module contents