exetera.core package¶
Submodules¶
exetera.core.data_writer module¶
-
class
exetera.core.data_writer.DataWriter¶ Bases:
object
-
static
clear_dataset(parent_group, name)¶
-
static
create_group(parent_group, name, attrs)¶
-
static
flush(group)¶
-
static
write(group, name, field, count, dtype=None)¶
-
static
write_additional(group, name, field, count)¶
-
static
write_first(group, name, field, count, dtype=None)¶
exetera.core.dataset module¶
-
class
exetera.core.dataset.HDF5Dataset(session, dataset_path, mode, name)¶ Bases:
exetera.core.abstract_types.Dataset
Dataset is the means by which you interact with an ExeTera datastore. Datasets are created and loaded through Session.open_dataset, rather than being constructed directly.
Datasets are composed of one or more DataFrame objects and are the means by which DataFrames are accessed.
For a detailed explanation of Dataset along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/Dataset-API
- Parameters
session – The session instance that this dataset belongs to.
dataset_path – The path of the HDF5 file.
mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.
name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling
get_dataset().
- Returns
A HDF5Dataset instance.
-
close()¶ Close the HDF5 file operations.
-
contains_dataframe(dataframe: exetera.core.abstract_types.DataFrame)¶ Check if a dataframe is contained in this dataset by the dataframe object itself.
- Parameters
dataframe – the dataframe object to check
- Returns
True if the dataframe is contained in this dataset, False otherwise.
-
copy(dataframe, name)¶ Add an existing dataframe (from another dataset) to this dataset, writing the existing group attributes and HDF5 datasets to this dataset.
- Parameters
dataframe – the dataframe to copy to this dataset.
name – optional - a new name for the copied dataframe.
- Returns
None if the operation is successful; otherwise an error is raised.
-
create_dataframe(name: str, dataframe: Optional[exetera.core.abstract_types.DataFrame] = None)¶ Create a new DataFrame object as a part of this Dataset.
- Parameters
name – name of the dataframe
dataframe – if set, this is a dataframe object whose contents are duplicated
- Returns
a dataframe object
-
create_group(name: str)¶ This method is a wrapper around create_dataframe().
-
delete_dataframe(dataframe: exetera.core.abstract_types.DataFrame)¶ Remove dataframe from this dataset by the dataframe object.
- Parameters
dataframe – The dataframe instance to delete.
- Returns
A boolean indicating whether the dataframe was deleted.
-
drop(name: str)¶
-
get_dataframe(name: str)¶ Get the dataframe by dataset.get_dataframe(dataframe_name).
- Parameters
name – The name of the dataframe.
- Returns
The dataframe. An error is raised if the name does not exist in this dataset.
-
items()¶ Return the (name, dataframe) tuples in this dataset.
-
keys()¶ Return all dataframe names in this dataset.
-
require_dataframe(name)¶ Get a dataframe, creating it if it doesn’t exist.
- Parameters
name – name of the dataframe
-
property
session¶ The session property interface.
- Returns
The _session instance.
-
values()¶ Return all dataframe instances in this dataset.
-
exetera.core.dataset.copy(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)¶ Copy a dataframe to another dataset via HDF5DataFrame.copy(ds1[‘df1’], ds2, ‘df1’).
- Parameters
dataframe – The dataframe to copy.
dataset – The destination dataset.
name – The name of dataframe in destination dataset.
-
exetera.core.dataset.move(dataframe: exetera.core.abstract_types.DataFrame, dataset: exetera.core.abstract_types.Dataset, name: str)¶ Move a dataframe to another dataset via HDF5DataFrame.move(ds1[‘df1’], ds2, ‘df1’). If the move is within the same dataset, e.g. HDF5DataFrame.move(ds1[‘df1’], ds1, ‘df2’), it functions as a rename for both the dataframe and the HDF5 group. However, to
- Parameters
dataframe – The dataframe to move.
dataset – The destination dataset.
name – The name of dataframe in destination dataset.
exetera.core.dataframe module¶
-
class
exetera.core.dataframe.HDF5DataFrame(dataset: exetera.core.abstract_types.Dataset, name: str, h5group: h5py._hl.group.Group)¶ Bases:
exetera.core.abstract_types.DataFrame
DataFrame is the means by which you interact with an ExeTera datastore. DataFrames are created and loaded through Dataset.create_dataframe and other methods, rather than being constructed directly.
DataFrames closely resemble Pandas DataFrames, but with a number of key differences:
1. Instead of Series, DataFrames are composed of Field objects.
2. DataFrames can store fields of differing lengths, although all fields must be of the same length when performing certain operations such as merges.
3. ExeTera DataFrames do not (yet) have the ability to create filtered views onto an underlying DataFrame, although this functionality will be added in upcoming releases.
For a detailed explanation of DataFrame along with examples of its use, please refer to the wiki documentation at https://github.com/KCL-BMEIS/ExeTera/wiki/DataFrame-API
- Parameters
name – name of the dataframe.
dataset – the dataset object that this dataframe belongs to.
h5group – the h5group object used to store the fields. If the h5group is not empty, data is acquired from the h5group object directly. The structure is h5group<-h5group-dataset: the latter group has a ‘fieldtype’ attribute and only one dataset named ‘values’, so the structure maps to Dataframe<-Field-Field.data automatically.
dataframe – optional - replicate data from another dictionary of (name:str, field: Field).
-
add(field: exetera.core.abstract_types.Field)¶ Add a field to this dataframe as well as the HDF5 Group.
- Parameters
field – the field to add to this dataframe; the underlying dataset is copied
-
apply_filter(filter_to_apply, ddf=None)¶ Apply the filter to all the fields in this dataframe, return a dataframe with filtered fields.
- Parameters
filter_to_apply – the filter to be applied to the source field, an array of boolean
ddf – optional- the destination data frame
- Returns
a dataframe containing all the filtered fields; self if ddf is not set
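The effect of a dataframe-wide boolean filter can be sketched in plain Python. This is an illustration of the semantics only, not ExeTera's implementation: the `apply_filter` helper below is hypothetical, it models a dataframe as a dict of equal-length lists, and it always returns a new dict rather than writing to a destination dataframe.

```python
def apply_filter(fields, filter_to_apply):
    """Return a new dict holding only the rows where the filter is True.

    `fields` is a dict of name -> list standing in for a dataframe (an
    assumption of this sketch); every field must match the filter length.
    """
    for name, values in fields.items():
        if len(values) != len(filter_to_apply):
            raise ValueError(f"field '{name}' and filter differ in length")
    return {
        name: [v for v, keep in zip(values, filter_to_apply) if keep]
        for name, values in fields.items()
    }


source = {"id": [1, 2, 3, 4], "age": [30, 41, 25, 58]}
filtered = apply_filter(source, [True, False, True, True])
```

The same mask is applied to every field, so rows stay aligned across fields after filtering.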
-
apply_index(index_to_apply, ddf=None)¶ Apply the index to all the fields in this dataframe, return a dataframe with indexed fields.
- Parameters
index_to_apply – the index to be applied to the fields, an ndarray of integers
ddf – optional- the destination data frame
- Returns
a dataframe containing all the re-indexed fields; self if ddf is not set
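Applying an index reorders (or repeats, or subsets) rows by position. A minimal sketch of that idea follows, again modelling a dataframe as a dict of lists; the helper name is hypothetical and the code is not taken from ExeTera.

```python
def apply_index(fields, index_to_apply):
    """Return a new dict where row k of each field is the old row index_to_apply[k]."""
    return {
        name: [values[i] for i in index_to_apply]
        for name, values in fields.items()
    }


source = {"id": [10, 20, 30], "name": ["a", "b", "c"]}
reindexed = apply_index(source, [2, 0, 1])
```

Because one index is applied to every field, an index produced by sorting one field can be used to reorder the whole dataframe consistently.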
-
property
columns¶ The columns property interface. Columns is a dictionary to store the fields by (field_name, field_object). The field_name is field.name without prefix ‘/’ and HDF5 group name.
-
contains_field(field)¶ check if dataframe contains a field by the field object
- Parameters
field – the field object to check. Returns a tuple(bool, str), where the str is the name stored in the dataframe.
-
create_categorical(name: str, nformat: int, key: dict, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶ Create a categorical type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#categoricalfield for a detailed description of categorical fields
-
create_fixed_string(name: str, length: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶ Create a fixed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#fixedstringfield for a detailed description of fixed string fields
-
create_group(name: str)¶ Create a group object in HDF5 file for field to use. Please note, this function is for backwards compatibility with older scripts and should not be used in the general case.
- Parameters
name – the name of the group and field
- Returns
a hdf5 group object
-
create_indexed_string(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶ Create an indexed string type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#indexedstringfield for a detailed description of indexed string fields
-
create_numeric(name: str, nformat: int, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶ Create a numeric type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#numericfield for a detailed description of numeric fields
-
create_timestamp(name: str, timestamp: Optional[str] = None, chunksize: Optional[int] = None)¶ Create a timestamp type field. Please see https://github.com/KCL-BMEIS/ExeTera/wiki/Datatypes#timestampfield for a detailed description of timestamp fields
-
property
dataset¶ The dataset property interface.
-
delete_field(field)¶ Remove field from dataframe by field.
- Parameters
field – The field to delete from this dataframe.
-
describe(include=None, exclude=None, output='terminal')¶ Show the basic statistics of the data in each field.
- Parameters
include – The field name or data type or simply ‘all’ to indicate the fields included in the calculation.
exclude – The field name or data type to exclude from the calculation.
output – Display the result in stdout if set to terminal, otherwise silent.
- Returns
A dataframe containing the statistical results.
-
drop(name: str)¶ Drop a field from this dataframe as well as the HDF5 Group
-
drop_duplicates(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, hint_keys_is_sorted=False)¶ Find the distinct values of a field or a list of fields and return a dataframe with the distinct values.
- Parameters
by – Name (str) or list of names (str) of the fields whose distinct values are returned.
ddf – optional - the destination dataframe
- Returns
DataFrame with distinct values.
-
get_field(name)¶ Get a field stored by the field name.
- Parameters
name – The name of field to get.
-
groupby(by: Union[str, List[str]], hint_keys_is_sorted=False)¶ Group the DataFrame using a field or a list of fields and return a groupby object.
- Parameters
by – Name (str) or list of names (str) to group by.
hint_keys_is_sorted – an optional flag that users can set to skip the sorted check. Note that the operation runs faster and uses less memory when the dataframe is sorted, that is, hint_keys_is_sorted=True.
- Returns
Returns a groupby object that contains information about the groups.
-
property
h5group¶ The h5group property interface, used to handle underlying storage.
-
items()¶ Return all the field names and their corresponding field values
-
keys()¶ Return all the field names
-
rename(field: Union[str, Mapping[str, str]], field_to: Optional[str] = None) → None¶ Rename provides you with the means to rename fields within a dataframe. You can specify either a single field to be renamed or you can provide a dictionary with a set of fields to be renamed.
Example:
# rename a single field
df.rename('a', 'b')
# rename multiple fields
df.rename({'a': 'b', 'b': 'c', 'c': 'a'})
Field renaming can fail if the resulting set of renamed fields would have name clashes. If this is the case, none of the rename operations go ahead and the dataframe remains unmodified.
- Parameters
field – Either a string or a dictionary of name pairs, each of which is the existing field name and the destination field name
field_to – Optional parameter containing a string, if field is a string. If ‘field’ is a dictionary, parameter should not be set. Field references remain valid after this operation and reflect their renaming.
- Returns
None
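The all-or-nothing behaviour described for rename can be sketched as follows. This is a hypothetical stand-in that works on a plain dict of name -> field and returns a renamed copy; it is meant to show why a cyclic rename such as a->b, b->c, c->a succeeds while renaming a onto an existing b fails, and it does not reproduce ExeTera's code.

```python
def rename_fields(columns, mapping):
    """Return a renamed copy of `columns`; raise and change nothing on a clash."""
    missing = [old for old in mapping if old not in columns]
    if missing:
        raise KeyError(f"unknown fields: {missing}")
    # Compute every final name first, so clashes are detected before any change.
    final_names = [mapping.get(name, name) for name in columns]
    if len(set(final_names)) != len(final_names):
        raise ValueError("rename would produce clashing field names")
    return {mapping.get(name, name): field for name, field in columns.items()}


columns = {"a": 1, "b": 2, "c": 3}
rotated = rename_fields(columns, {"a": "b", "b": "c", "c": "a"})

try:
    rename_fields(columns, {"a": "b"})  # 'b' already exists and is not renamed
    clash_detected = False
except ValueError:
    clash_detected = True
```

Checking the complete set of final names up front is what lets the operation either apply every rename or none of them.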
-
sort_values(by: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame = None, axis=0, ascending=True, kind='stable')¶ Sort by the values of a field or a list of fields
- Parameters
by – Name (str) or list of names (str) to sort by.
ddf – optional - the destination data frame
axis – Axis to be sorted. Currently only supports 0
ascending – Sort ascending vs. descending. Currently only supports ascending=True.
kind – Choice of sorting algorithm. Currently only supports “stable”
- Returns
DataFrame with sorted values or None if ddf=None.
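A stable sort by one or more fields can be sketched by computing a row permutation and applying it to every field. The helper below is hypothetical and uses a dict of equal-length lists in place of a dataframe; it relies on Python's built-in `sorted`, which is stable, rather than on anything from ExeTera.

```python
def sort_values(fields, by):
    """Return a copy of `fields` with rows ordered by the field(s) named in `by`."""
    keys = [by] if isinstance(by, str) else list(by)
    n_rows = len(fields[keys[0]])
    # sorted() is stable: rows with equal keys keep their original relative order.
    order = sorted(range(n_rows), key=lambda i: tuple(fields[k][i] for k in keys))
    return {name: [values[i] for i in order] for name, values in fields.items()}


fields = {"pid": [2, 1, 2, 1], "visit": ["b", "x", "a", "y"]}
by_pid = sort_values(fields, "pid")
by_both = sort_values(fields, ["pid", "visit"])
```

Sorting by "pid" alone leaves the visits within each pid in their original order, which is the property the "stable" kind guarantees.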
-
to_csv(filepath: str, row_filter: Union[numpy.ndarray, exetera.core.abstract_types.Field] = None, column_filter: Union[str, List[str]] = None, chunk_row_size: int = 32768)¶ Write object to a comma-separated values (csv) file.
- Parameters
filepath – File path.
row_filter – A boolean array or field. Only rows whose filter value is True are written.
column_filter – A sequence of string names for the fields.
chunk_row_size – Rows are written in chunks of at most chunk_row_size rows. The default is 1<<15.
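The interaction of the row filter, column filter and chunk size can be illustrated with the standard `csv` module. This is only a sketch under assumptions: `write_csv` is a hypothetical helper, it writes to any file-like object instead of a path, and the dataframe is modelled as a dict of equal-length lists.

```python
import csv
import io


def write_csv(out, fields, row_filter=None, column_filter=None, chunk_row_size=1 << 15):
    """Write selected rows and columns of `fields` to `out`, one chunk at a time."""
    if column_filter is None:
        names = list(fields)
    elif isinstance(column_filter, str):
        names = [column_filter]
    else:
        names = list(column_filter)
    n_rows = len(fields[names[0]]) if names else 0

    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    for start in range(0, n_rows, chunk_row_size):
        stop = min(start + chunk_row_size, n_rows)
        # Only the rows of the current chunk are materialised at once.
        rows = [
            [fields[name][i] for name in names]
            for i in range(start, stop)
            if row_filter is None or row_filter[i]
        ]
        writer.writerows(rows)


buf = io.StringIO()
data = {"id": [1, 2, 3], "name": ["a", "b", "c"], "x": [0.5, 1.5, 2.5]}
write_csv(buf, data, row_filter=[True, False, True],
          column_filter=["id", "name"], chunk_row_size=2)
csv_text = buf.getvalue()
```

Processing rows in bounded chunks keeps memory use independent of the total number of rows, which matters when fields are much larger than memory.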
-
to_pandas(row_filter: List[bool] = None, col_filter: Union[str, List[str]] = None)¶ Convert an ExeTera dataframe to a Pandas DataFrame.
- Parameters
row_filter – A boolean array indicating which rows to export.
col_filter – String or list of strings indicating which columns to export.
- Returns
A pandas dataframe.
Example:
pandas_df = df.to_pandas()
-
values()¶ Return all the field values
-
class
exetera.core.dataframe.HDF5DataFrameGroupBy(columns, by, sorted_index, spans)¶ Bases:
exetera.core.abstract_types.DataFrameGroupBy
-
count(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶ Compute count of group values.
- Parameters
target – Name (str) or list of names (str) to compute count.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.
- Returns
dataframe with count of group values
-
distinct(ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶
-
first(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶ Get first of group values.
- Parameters
target – Name (str) or list of names (str) to get first value.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.
- Returns
dataframe with first of group values
-
last(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶ Get last of group values.
- Parameters
target – Name (str) or list of names (str) to get last value.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.
- Returns
dataframe with last of group values
-
max(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶ Compute max of group values.
- Parameters
target – Name (str) or list of names (str) to compute max.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.
- Returns
dataframe with max of group values
-
min(target: Union[str, List[str]], ddf: exetera.core.abstract_types.DataFrame, write_keys=True) → exetera.core.abstract_types.DataFrame¶ Compute min of group values.
- Parameters
target – Name (str) or list of names (str) to compute min.
ddf – the destination data frame
write_keys – write groupby keys to ddf only if write_keys=True. Default is True.
- Returns
dataframe with min of group values
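The group-wise count, first, last, max and min operations can be pictured in terms of spans over a sorted key. The sketch below assumes that a span list marks the boundaries of runs of equal key values, so that group g covers rows [spans[g], spans[g + 1]); that layout and both helper names are assumptions made for illustration, not a description of ExeTera's internals.

```python
def get_spans(sorted_keys):
    """Boundaries of runs of equal values in an already sorted key list."""
    spans = [0]
    for i in range(1, len(sorted_keys)):
        if sorted_keys[i] != sorted_keys[i - 1]:
            spans.append(i)
    if sorted_keys:
        spans.append(len(sorted_keys))
    return spans


def aggregate(values, spans, reducer):
    """Apply `reducer` to the slice of `values` belonging to each group."""
    return [reducer(values[start:stop]) for start, stop in zip(spans, spans[1:])]


keys = ["a", "a", "b", "c", "c", "c"]
vals = [5, 3, 7, 2, 9, 4]

spans = get_spans(keys)
counts = aggregate(vals, spans, len)
firsts = aggregate(vals, spans, lambda group: group[0])
lasts = aggregate(vals, spans, lambda group: group[-1])
maxima = aggregate(vals, spans, max)
minima = aggregate(vals, spans, min)
```

Once the spans are known, every aggregation is a single pass over contiguous slices, which is why sorted keys make grouping cheaper.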
-
-
exetera.core.dataframe.copy(field: exetera.core.abstract_types.Field, dataframe: exetera.core.abstract_types.DataFrame, name: str)¶ Copy a field to another dataframe as well as underlying dataset.
- Parameters
field – The source field to copy.
dataframe – The destination dataframe to copy to.
name – The name of field under destination dataframe.
-
exetera.core.dataframe.merge(left: exetera.core.abstract_types.DataFrame, right: exetera.core.abstract_types.DataFrame, dest: exetera.core.abstract_types.DataFrame, left_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], right_on: Union[Tuple[Union[str, exetera.core.abstract_types.Field]], str, exetera.core.abstract_types.Field], left_fields: Optional[Sequence[str]] = None, right_fields: Optional[Sequence[str]] = None, left_suffix: str = '_l', right_suffix: str = '_r', how='left', hint_left_keys_ordered: Optional[bool] = None, hint_left_keys_unique: Optional[bool] = None, hint_right_keys_ordered: Optional[bool] = None, hint_right_keys_unique: Optional[bool] = None, chunk_size=1048576)¶ Merge ‘left’ and ‘right’ DataFrames into a destination dataset. The merge is a database-style join operation, in any of the following modes (“left”, “right”, “inner”, “outer”). This method closely follows the Pandas ‘merge’ functionality.
The join is performed using the fields specified by ‘left_on’ and ‘right_on’; these can either be strings or fields. If they are strings, they refer to fields that must exist in the corresponding dataframe.
You can optionally set ‘left_fields’ and / or ‘right_fields’ if you want to have only a subset of fields joined from the left and right dataframes. If you don’t want any fields to be joined from a given dataframe, you can pass an empty list.
Fields are written to the destination dataframe. If the field names clash, they will get appended with the strings specified in ‘left_suffix’ and ‘right_suffix’ respectively.
- Parameters
left – The left dataframe
right – The right dataframe
left_on – The field corresponding to the left key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys
right_on – The field corresponding to the right key used to perform the join. This is either the name of the field, or a field object. If it is a field object, it can be from another dataframe but it must be the same length as the fields being joined. This can also be a tuple of such values when performing joins on compound keys
left_fields – Optional parameter listing which fields are to be joined from the left table. If this is not set, all fields from the left table are joined
right_fields – Optional parameter listing which fields are to be joined from the right table. If this is not set, all fields from the right table are joined
left_suffix – A string to be appended to fields from the left table if they clash with fields from the right table.
right_suffix – A string to be appended to fields from the right table if they clash with fields from the left table.
how – Optional parameter specifying the merge mode. It must be one of (‘left’, ‘right’, ‘inner’, ‘outer’ or ‘cross’). If not set, the ‘left’ join is performed.
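The left-join mode and the suffix handling for clashing field names can be sketched in plain Python. The `left_merge` helper is hypothetical and covers only how='left' on a single key; dataframes are modelled as dicts of equal-length lists, and unmatched rows are filled with None, which is a choice of this sketch rather than a statement about the values ExeTera writes.

```python
def left_merge(left, right, left_on, right_on, left_suffix="_l", right_suffix="_r"):
    """Left join two dict-of-lists tables, suffixing field names that clash."""
    # Map each right key to the row positions that hold it.
    right_rows = {}
    for j, key in enumerate(right[right_on]):
        right_rows.setdefault(key, []).append(j)

    clashes = set(left) & set(right)
    left_names = {n: n + left_suffix if n in clashes else n for n in left}
    right_names = {n: n + right_suffix if n in clashes else n for n in right}
    dest = {name: [] for name in list(left_names.values()) + list(right_names.values())}

    for i, key in enumerate(left[left_on]):
        # Every left row appears at least once; [None] marks "no match on the right".
        for j in right_rows.get(key, [None]):
            for n in left:
                dest[left_names[n]].append(left[n][i])
            for n in right:
                dest[right_names[n]].append(None if j is None else right[n][j])
    return dest


left = {"id": [1, 2, 3], "v": [10, 20, 30]}
right = {"id": [1, 1, 3], "v": ["a", "b", "c"]}
merged = left_merge(left, right, "id", "id")
```

Both "id" and "v" exist on each side, so all four output fields carry a suffix; a left row with two matches on the right is emitted twice.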
-
exetera.core.dataframe.move(field: exetera.core.abstract_types.Field, dest_df: exetera.core.abstract_types.DataFrame, name: str)¶ Move a field to another dataframe as well as underlying dataset.
- Parameters
src_df – The source dataframe where the field is located.
field – The field to move.
dest_df – The destination dataframe to move to.
name – The name of field under destination dataframe.
exetera.core.exporter module¶
-
exetera.core.exporter.export_schema(destination, readers)¶
-
exetera.core.exporter.export_to_csv(destination, datastore, fields)¶ Export selected fields of selected dataframe to csv file.
-
exetera.core.exporter.schema_from_reader_type(reader)¶
-
exetera.core.exporter.transform_from_reader_type(reader)¶
exetera.core.fields module¶
-
class
exetera.core.fields.CategoricalField(session, group, dataframe, write_enabled=False)¶ Bases:
exetera.core.fields.HDF5Field
-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
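The three return modes described above (new field, target field, or this field) can be sketched with a small in-memory class. `ListField` is a hypothetical stand-in invented for this illustration; it is not an ExeTera class and ignores writability checks.

```python
class ListField:
    """Hypothetical in-memory stand-in for a field, backed by a Python list."""

    def __init__(self, values=()):
        self.values = list(values)

    def apply_filter(self, filter_to_apply, target=None, in_place=False):
        if in_place and target is not None:
            raise ValueError("'target' must be None when 'in_place' is True")
        kept = [v for v, keep in zip(self.values, filter_to_apply) if keep]
        if in_place:
            self.values = kept      # destructive: this field is modified
            return self
        if target is not None:
            target.values = kept    # result is written to the supplied field
            return target
        return ListField(kept)      # default: a new field, source untouched


mask = [True, False, True]

src = ListField([1, 2, 3])
fresh = src.apply_filter(mask)

dest = ListField()
returned_target = src.apply_filter(mask, target=dest)

returned_self = src.apply_filter(mask, in_place=True)
```

Rejecting the combination of 'target' and 'in_place' keeps it unambiguous which object receives the filtered data.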
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
property
keys¶
-
property
nformat¶
-
remap(key_map, new_key)¶ Remap the key names and key values.
- Parameters
key_map – The mapping rule for converting the old key into the new key.
new_key – The new key.
- Returns
A CategoricalMemField with the new key.
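One way to picture a remap is as a value substitution followed by a change of key. The sketch below assumes that key_map is a sequence of (old_value, new_value) pairs and that new_key is a dict of name -> value; both shapes are assumptions made for illustration, and the `remap` function is a hypothetical stand-in operating on plain lists.

```python
def remap(values, key_map, new_key):
    """Return (remapped values, new key); untouched values must exist in new_key."""
    lookup = dict(key_map)
    remapped = [lookup.get(v, v) for v in values]
    allowed = set(new_key.values())
    unknown = sorted(set(remapped) - allowed)
    if unknown:
        raise ValueError(f"values {unknown} are not present in the new key")
    return remapped, dict(new_key)


old_key = {"no": 0, "maybe": 1, "yes": 2}
values = [0, 1, 2, 1]

# Collapse 'maybe' into 'no' and renumber 'yes' from 2 to 1.
new_values, new_key = remap(values, [(1, 0), (2, 1)], {"no": 0, "yes": 1})
```

Validating the result against the new key catches mappings that leave behind a value the new key cannot describe.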
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of CategoricalField
-
writeable()¶
-
-
class
exetera.core.fields.CategoricalMemField(session, nformat, keys)¶ Bases:
exetera.core.fields.MemoryField
-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
property
keys¶
-
remap(key_map, new_key)¶ Remap the key names and key values.
- Parameters
key_map – The mapping rule for converting the old key into the new key.
new_key – The new key.
- Returns
A CategoricalMemField with the new key.
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of CategoricalMemField
-
writeable()¶
-
-
class
exetera.core.fields.FieldDataOps¶ Bases:
object
-
static
apply_filter_to_field(source, filter_to_apply, target=None, in_place=False)¶
-
static
apply_filter_to_indexed_field(source, filter_to_apply, target=None, in_place=False)¶
-
static
apply_index_to_field(source, index_to_apply, target=None, in_place=False)¶
-
static
apply_index_to_indexed_field(source, index_to_apply, target=None, in_place=False)¶
-
static
apply_isin(source: exetera.core.abstract_types.Field, test_elements: Union[list, set, numpy.ndarray])¶
-
static
apply_spans_first(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶
-
static
apply_spans_last(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶
-
static
apply_spans_max(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶
-
static
apply_spans_min(source: exetera.core.abstract_types.Field, spans: Union[exetera.core.abstract_types.Field, numpy.ndarray], target: Optional[exetera.core.abstract_types.Field] = None, in_place: bool = None) → exetera.core.abstract_types.Field¶
-
static
apply_unique(src: exetera.core.abstract_types.Field, return_index=False, return_inverse=False, return_counts=False) → numpy.ndarray¶
-
static
categorical_field_create_like(source, group, name, timestamp)¶
-
classmethod
equal(session, first, second)¶
-
static
fixed_string_field_create_like(source, group, name, timestamp)¶
-
classmethod
greater_than(session, first, second)¶
-
classmethod
greater_than_equal(session, first, second)¶
-
static
indexed_string_create_like(source, group, name, timestamp)¶
-
classmethod
invert(session, first)¶
-
classmethod
less_than(session, first, second)¶
-
classmethod
less_than_equal(session, first, second)¶
-
classmethod
logical_not(session, first)¶
-
classmethod
not_equal(session, first, second)¶
-
classmethod
numeric_add(session, first, second)¶
-
classmethod
numeric_and(session, first, second)¶
-
classmethod
numeric_divmod(session, first, second)¶
-
static
numeric_field_create_like(source, group, name, timestamp)¶
-
classmethod
numeric_floordiv(session, first, second)¶
-
classmethod
numeric_mod(session, first, second)¶
-
classmethod
numeric_mul(session, first, second)¶
-
classmethod
numeric_or(session, first, second)¶
-
classmethod
numeric_sub(session, first, second)¶
-
classmethod
numeric_truediv(session, first, second)¶
-
classmethod
numeric_xor(session, first, second)¶
-
static
timestamp_field_create_like(source, group, name, timestamp)¶
-
class
exetera.core.fields.FixedStringField(session, group, dataframe, write_enabled=False)¶ Bases:
exetera.core.fields.HDF5Field
-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of FixedStringField
-
writeable()¶
-
-
class
exetera.core.fields.FixedStringMemField(session, length)¶ Bases:
exetera.core.fields.MemoryField
-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the filtered data is written to.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field that the reindexed data is written to.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of FixedStringMemField
-
writeable()¶
-
-
class
exetera.core.fields.HDF5Field(session, group, dataframe, write_enabled=False)¶ Bases:
exetera.core.abstract_types.Field
-
apply_filter(filter_to_apply, dstfld=None)¶
-
apply_index(index_to_apply, dstfld=None)¶
-
property
chunksize¶ The chunksize for the field. This is not generally required for users, and may be ignored depending on the storage medium.
-
property
dataframe¶ The owning dataframe of this field, or None if the field is not owned by a dataframe
-
get_spans()¶
-
property
indexed¶ Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.
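As a rough illustration of the index and value arrays mentioned above, the sketch below packs a list of strings into one flat byte array plus an array of offsets. The exact layout ExeTera uses (dtypes, and whether the index array has one more entry than the number of rows) is an assumption here, and encode_indexed and decode_indexed are hypothetical helper names.

```python
import numpy as np


def encode_indexed(strings):
    """Pack strings into (indices, values); the layout is an assumption."""
    encoded = [s.encode("utf-8") for s in strings]
    indices = np.zeros(len(encoded) + 1, dtype=np.int64)
    # indices[i] .. indices[i + 1] delimits the bytes of row i
    indices[1:] = np.cumsum([len(b) for b in encoded])
    values = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return indices, values


def decode_indexed(indices, values):
    """Rebuild the list of strings from the two arrays."""
    return [
        values[indices[i]:indices[i + 1]].tobytes().decode("utf-8")
        for i in range(len(indices) - 1)
    ]


indices, values = encode_indexed(["a", "bb", "", "ccc"])
print(indices)
print(decode_indexed(indices, values))
```

Storing variable-length strings this way avoids padding every row to the length of the longest entry.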
-
property
name¶ The name of the field within a dataframe, if the field belongs to a dataframe
-
property
timestamp¶ The timestamp representing the field creation time. This is the time at which the data for this field was added to the dataset, rather than the point at which the field wrapper was created.
-
property
valid¶ Returns whether the field is a valid field object. Fields can become invalid as a result of certain operations, such as a field being moved from one dataframe to another. A field that is invalid will throw exceptions if any other operation is performed on it.
-
-
class
exetera.core.fields.IndexedStringField(session, group, dataframe, write_enabled=False)¶ Bases:
exetera.core.fields.HDF5Field-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the filtered data is written.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the reindexed data is written.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶ Create an empty field of the same type as this field.
-
property
data¶
-
get_spans()¶
-
property
indexed¶ Whether the field is an indexed field or not. Indexed fields store their data internally as index and value arrays for efficiency, as well as making it accessible through the data property.
-
property
indices¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of IndexedStringField
-
property
values¶
-
writeable()¶ Indicates whether this field permits write operations. By default, dataframe fields are read-only in order to protect against accidental writes to datasets
-
-
class
exetera.core.fields.IndexedStringMemField(session, chunksize=None)¶ Bases:
exetera.core.fields.MemoryField-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the filtered data is written.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the reindexed data is written.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
property
indexed¶
-
property
indices¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of IndexedStringMemField
-
property
values¶
-
writeable()¶
-
-
class
exetera.core.fields.MemoryField(session)¶ Bases:
exetera.core.abstract_types.Field-
apply_filter(filter_to_apply, dstfld=None)¶
-
apply_index(index_to_apply, dstfld=None)¶
-
property
chunksize¶
-
property
dataframe¶
-
property
indexed¶
-
property
name¶
-
property
timestamp¶
-
property
valid¶
-
-
class
exetera.core.fields.MemoryFieldArray(dtype)¶ Bases:
object-
clear()¶
-
complete()¶
-
property
dtype¶
-
write(part)¶
-
write_part(part, move_mem=False)¶
-
-
class
exetera.core.fields.NumericField(session, group, dataframe, write_enabled=False)¶ Bases:
exetera.core.fields.HDF5Field-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the filtered data is written.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the reindexed data is written.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
astype(dtype: str, casting='unsafe')¶ Convert the field’s data to the given dtype.
- Parameters
dtype – The new datatype, given as a str object. The dtype must be a subtype of np.number, e.g. int, float, etc.
casting – Similar to the casting parameter in numpy ndarray.astype, can be ‘no’, ‘equiv’, ‘safe’, ‘same_kind’, or ‘unsafe’.
- Returns
The field with new datatype.
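Since the casting parameter is described as similar to the one in numpy’s ndarray.astype, the difference between the modes can be shown on a plain numpy array. This sketch uses numpy only; whether the field method surfaces exactly the same exception is an assumption.

```python
import numpy as np

values = np.array([1.5, 2.0, -3.7])

# 'unsafe' (the default) allows any conversion; fractional parts are dropped.
truncated = values.astype("int32", casting="unsafe")

# 'safe' only allows casts that preserve values, so float -> int is rejected.
try:
    values.astype("int32", casting="safe")
    safe_allowed = True
except TypeError:
    safe_allowed = False

# Widening an integer type is always value-preserving, so 'safe' permits it.
widened = np.array([1, 2, 3], dtype="int32").astype("int64", casting="safe")

print(truncated, safe_allowed, widened.dtype)
```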
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
logical_not()¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of NumericField
-
writeable()¶
-
-
class
exetera.core.fields.NumericMemField(session, nformat)¶ Bases:
exetera.core.fields.MemoryField-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the filtered data is written.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the reindexed data is written.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
logical_not()¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of NumericMemField
-
writeable()¶
-
-
class
exetera.core.fields.ReadOnlyFieldArray(field, dataset_name)¶ Bases:
object-
clear()¶
-
complete()¶
-
property
dtype¶
-
write(part)¶
-
write_part(part)¶
-
-
class
exetera.core.fields.ReadOnlyIndexedFieldArray(field, indices, values)¶ Bases:
object-
clear()¶
-
complete()¶
-
property
dtype¶
-
write(part)¶
-
write_part(part)¶
-
-
class
exetera.core.fields.TimestampField(session, group, dataframe, write_enabled=False)¶ Bases:
exetera.core.fields.HDF5Field-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the filtered data is written.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the reindexed data is written.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of TimestampField
-
writeable()¶
-
-
class
exetera.core.fields.TimestampMemField(session)¶ Bases:
exetera.core.fields.MemoryField-
apply_filter(filter_to_apply, target=None, in_place=False)¶ Apply a boolean filter to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the filtered data is written.
- Parameters
filter_to_apply – a Field or numpy array that contains the boolean filter data
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The filtered field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_index(index_to_apply, target=None, in_place=False)¶ Apply an index to this field. This operation doesn’t modify the field on which it is called unless ‘in_place’ is set to True. The user can specify a ‘target’ field to which the reindexed data is written.
- Parameters
index_to_apply – a Field or numpy array that contains the indices
target – if set, this is the field that is written to. This field must be writable. If ‘target’ is set, ‘in_place’ must be False.
in_place – if True, perform the operation destructively on this field. This field must be writable. If ‘in_place’ is True, ‘target’ must be None
- Returns
The reindexed field. This is a new field instance unless ‘target’ is set, in which case it is the target field, or unless ‘in_place’ is True, in which case it is this field.
-
apply_spans_first(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_last(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_max(spans_to_apply, target=None, in_place=False)¶
-
apply_spans_min(spans_to_apply, target=None, in_place=False)¶
-
create_like(group=None, name=None, timestamp=None)¶
-
property
data¶
-
get_spans()¶
-
is_sorted()¶
-
isin(test_elements: Union[list, set, numpy.ndarray])¶
-
unique(return_index=False, return_inverse=False, return_counts=False)¶ Find the unique elements of TimestampMemField
-
writeable()¶
-
-
class
exetera.core.fields.WriteableFieldArray(field, dataset_name)¶ Bases:
object-
clear()¶
-
complete()¶
-
property
dtype¶
-
write(part)¶
-
write_part(part)¶
-
-
class
exetera.core.fields.WriteableIndexedFieldArray(chunksize, indices, values)¶ Bases:
object-
clear()¶
-
complete()¶
-
property
dtype¶
-
write(part)¶
-
write_part(part)¶
-
-
exetera.core.fields.argsort(field: exetera.core.abstract_types.Field, dtype: str = None)¶
-
exetera.core.fields.as_field(data, key=None)¶
-
exetera.core.fields.base_field_contructor(session, group, name, timestamp=None, chunksize=None)¶ Field constructors are responsible for 1) creating the field (an HDF5 group), 2) adding basic attributes such as chunksize, timestamp and field type, and 3) adding the dataset to the field (HDF5 group) under the name ‘values’.
-
exetera.core.fields.categorical_field_constructor(session, group, name, nformat, key, timestamp=None, chunksize=None)¶
-
exetera.core.fields.dtype_to_str(dtype)¶
-
exetera.core.fields.fixed_string_field_constructor(session, group, name, length, timestamp=None, chunksize=None)¶
-
exetera.core.fields.indexed_string_field_constructor(session, group, name, timestamp=None, chunksize=None)¶
-
exetera.core.fields.isin(field, test_elements)¶
-
exetera.core.fields.numeric_field_constructor(session, group, name, nformat, timestamp=None, chunksize=None)¶
-
exetera.core.fields.timestamp_field_constructor(session, group, name, timestamp=None, chunksize=None)¶
exetera.core.filtered_field module¶
-
class
exetera.core.filtered_field.FilteredField(field, filter)¶ Bases:
object
-
exetera.core.filtered_field.filtered_field(field, filter)¶
exetera.core.journal module¶
-
exetera.core.journal.journal_table(session, schema, old_src, new_src, src_pk, result)¶
-
exetera.core.journal.journal_test_harness(session, schema, old_file, new_file, dest_file)¶
exetera.core.operations module¶
-
exetera.core.operations.apply_filter_to_index_values(index_filter, indices, values)¶
-
exetera.core.operations.apply_indices_to_index_values(indices_to_apply, indices, values)¶
-
exetera.core.operations.apply_spans_concat(spans, src_index, src_values, dest_index, dest_values, max_index_i, max_value_i, s_start)¶
-
exetera.core.operations.apply_spans_count(spans, dest_array=None)¶
-
exetera.core.operations.apply_spans_first(spans, src_array, dest_array=None)¶
-
exetera.core.operations.apply_spans_index_of_first(spans, dest_array=None)¶
-
exetera.core.operations.apply_spans_index_of_first_filter(spans, dest_array, filter_array)¶
-
exetera.core.operations.apply_spans_index_of_last(spans, dest_array=None)¶
-
exetera.core.operations.apply_spans_index_of_last_filter(spans, dest_array, filter_array)¶
-
exetera.core.operations.apply_spans_index_of_max(spans, src_array, dest_array=None)¶
-
exetera.core.operations.apply_spans_index_of_max_filter(spans, src_array, dest_array, filter_array)¶
-
exetera.core.operations.apply_spans_index_of_max_indexed(spans, src_indices, src_values, dest_array=None)¶
-
exetera.core.operations.apply_spans_index_of_min(spans, src_array, dest_array=None)¶
-
exetera.core.operations.apply_spans_index_of_min_filter(spans, src_array, dest_array, filter_array)¶
-
exetera.core.operations.apply_spans_index_of_min_indexed(spans, src_indices, src_values, dest_array=None)¶
-
exetera.core.operations.apply_spans_last(spans, src_array, dest_array=None)¶
-
exetera.core.operations.apply_spans_max(spans, src_array, dest_array=None)¶
-
exetera.core.operations.apply_spans_min(spans, src_array, dest_array=None)¶
-
exetera.core.operations.calculate_chunk_decomposition(s_start, s_end, indices, value_chunk_size, sub_chunks)¶
-
exetera.core.operations.categorical_transform(chunk, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)¶ Transform method for categorical importer in readerwriter.py
-
exetera.core.operations.check_if_sorted_for_multi_fields(fields_data)¶ Check whether the input fields data is sorted. Note that fields_data should be treated as a group key: consecutive rows are compared element by element, where i indexes rows and j indexes fields.
pre_row[j] < cur_row[j] means these two rows are sorted; move to the next row => i + 1
pre_row[j] = cur_row[j] means we need to check whether the next element is sorted => j + 1
pre_row[j] > cur_row[j] means the input data is not sorted
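The three comparison rules above amount to a lexicographic order check over consecutive rows. A minimal pure-Python sketch follows, assuming fields_data is a sequence of equal-length columns; is_sorted_multi is a hypothetical name and this is not the library’s njit implementation.

```python
def is_sorted_multi(fields_data):
    """Return True if the rows formed by the columns are in lexicographic order."""
    n_rows = len(fields_data[0])
    for i in range(1, n_rows):
        for j in range(len(fields_data)):
            pre, cur = fields_data[j][i - 1], fields_data[j][i]
            if pre < cur:
                break           # rows i-1 and i are ordered: move to the next row
            if pre > cur:
                return False    # a later key is smaller: not sorted
            # pre == cur: tie on this field, compare the next field
    return True


print(is_sorted_multi([[1, 1, 2], [3, 4, 0]]))  # rows (1,3), (1,4), (2,0)
print(is_sorted_multi([[1, 1, 2], [4, 3, 0]]))  # rows (1,4), (1,3), (2,0)
```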
-
exetera.core.operations.chunked_copy(src_field, dest_field, chunksize=1048576)¶
-
exetera.core.operations.chunks(length, chunksize=1048576)¶
-
exetera.core.operations.compare_arrays(source[s1: s2], target[t1: t2])¶
-
exetera.core.operations.compare_indexed_rows_for_journalling(old_map, new_map, old_indices, old_values, new_indices, new_values, to_keep)¶
-
exetera.core.operations.compare_rows_for_journalling(old_map, new_map, old_field, new_field, to_keep)¶
-
exetera.core.operations.count_back(array)¶ This is a helper function that provides functionality specific to streaming ordered merges. It takes an array in sorted order and calculates a trimmed length that excludes the final sequence of equal values. Example:
[10, 20, 30, 40, 50] -> 4 ([10, 20, 30, 40])
[10, 20, 30, 40, 40] -> 3 ([10, 20, 30])
[10, 20, 30, 30, 30] -> 2 ([10, 20])
[10, 20, 20, 20, 20] -> 1 ([10])
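The trimming rule in the example can be reproduced with a short loop. This is a sketch under the description given here, not the library source; the name count_back_sketch is hypothetical, and the behaviour for an empty array or an array of identical values (returning 0) is an assumption.

```python
def count_back_sketch(array):
    """Length of `array` after dropping the final run of equal values."""
    if len(array) == 0:
        return 0
    i = len(array) - 1
    last = array[i]
    # Walk backwards while we are still inside the final run.
    while i >= 0 and array[i] == last:
        i -= 1
    return i + 1


for example in ([10, 20, 30, 40, 50],
                [10, 20, 30, 40, 40],
                [10, 20, 30, 30, 30],
                [10, 20, 20, 20, 20]):
    print(example, "->", count_back_sketch(example))
```

Dropping the final run means a chunk never ends in the middle of a group of equal keys, which is why a streaming merge needs this helper.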
-
exetera.core.operations.data_iterator(data_field, chunksize=1048576)¶
-
exetera.core.operations.dtype_to_str(dtype)¶
-
exetera.core.operations.element_chunked_copy(src_elem, dest_elem, chunksize)¶
-
exetera.core.operations.first_trimmed_chunk(field, chunk_size)¶
-
exetera.core.operations.first_untrimmed_chunk(field, chunk_size)¶
-
exetera.core.operations.fixed_string_transform(column_inds, column_vals, column_offsets, col_idx, written_row_count, strlen, memory)¶ Transform method for fixed string importer in field_importer.py
-
exetera.core.operations.generate_ordered_map_to_inner_both_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)¶
-
exetera.core.operations.generate_ordered_map_to_inner_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶
-
exetera.core.operations.generate_ordered_map_to_inner_left_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)¶
-
exetera.core.operations.generate_ordered_map_to_inner_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶
-
exetera.core.operations.generate_ordered_map_to_inner_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)¶ This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.
Example:
left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]
i   j   op  r   lres  rres
0   0   <   0   0     INV
1   0   =   1   1     0
2   1   =   2   2     1
2   2       3   2     2
3   3       4   3     3
3   4       5   3     4
3   5       6   3     5
4   3       7   4     3
4   4       8   4     4
4   5       9   4     5
5   6       10  5     INV
6   6       11  6     INV
left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 1, 2, 2, 3, 3, 3, 4, 4, 4, INV, INV]
Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…
i and i_max are used to track the index of the left source
j and j_max are used to track the index of the right source
-
exetera.core.operations.generate_ordered_map_to_inner_right_unique_partial(left, i_max, right, j_max, l_result, r_result, i_off, j_off, i, j, r)¶
-
exetera.core.operations.generate_ordered_map_to_inner_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶
-
exetera.core.operations.generate_ordered_map_to_inner_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶ This function performs the most generic type of left to right mapping calculation, in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping the left and right collections.
As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.
This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.
Please take a look at the documentation for the _partial function to understand the role that the various finite state machine parameters play.
- We have to make some adjustments to the finite state machine between calls to _partial:
if the call used all of the left_ data, add the size of that data chunk to i_off
if the call used all of the right_ data, add the size of that data chunk to j_off
write the accumulated result_ data to the result field, and reset r to 0
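The adjustments between calls to _partial can be sketched on a deliberately simplified case: an inner map where both keys are sorted and unique, so no trimming of repeated values is needed. Everything below is illustrative; the function names, the plain-list result and the numpy array inputs are assumptions, and the real functions operate on Fields under njit.

```python
import numpy as np


def inner_map_partial(left_, right_, l_res, r_res, i_off, j_off, i, j, r):
    """Advance the state (i, j, r) until a chunk or the result buffer runs out."""
    while i < len(left_) and j < len(right_) and r < len(l_res):
        if left_[i] < right_[j]:
            i += 1
        elif left_[i] > right_[j]:
            j += 1
        else:
            l_res[r] = i + i_off   # offsets turn chunk indices into global ones
            r_res[r] = j + j_off
            i += 1
            j += 1
            r += 1
    return i, j, r


def inner_map_streamed(left, right, chunksize=4):
    l_out, r_out = [], []
    l_res = np.zeros(chunksize, dtype=np.int64)
    r_res = np.zeros(chunksize, dtype=np.int64)
    i_off = j_off = 0
    i = j = r = 0
    left_ = left[:chunksize]
    right_ = right[:chunksize]
    while len(left_) > 0 and len(right_) > 0:
        i, j, r = inner_map_partial(left_, right_, l_res, r_res,
                                    i_off, j_off, i, j, r)
        if i == len(left_):        # left chunk used up: advance i_off, reload
            i_off += len(left_)
            left_ = left[i_off:i_off + chunksize]
            i = 0
        if j == len(right_):       # right chunk used up: advance j_off, reload
            j_off += len(right_)
            right_ = right[j_off:j_off + chunksize]
            j = 0
        l_out.extend(l_res[:r].tolist())   # flush the buffer and reset r to 0
        r_out.extend(r_res[:r].tolist())
        r = 0
    return l_out, r_out


left = np.array([1, 2, 3, 5, 7, 8, 9, 10, 12])
right = np.array([2, 3, 4, 5, 6, 7, 10, 11, 12, 13])
print(inner_map_streamed(left, right))
```

With repeated keys, each chunk would additionally have to be trimmed so that a run of equal values is never split across two chunks.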
-
exetera.core.operations.generate_ordered_map_to_left_both_unique(first, second, result, invalid)¶
-
exetera.core.operations.generate_ordered_map_to_left_both_unique_partial(left, right, r_result, invalid, j_off, i, j, r)¶
-
exetera.core.operations.generate_ordered_map_to_left_both_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶
-
exetera.core.operations.generate_ordered_map_to_left_left_unique_partial(left, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r)¶
-
exetera.core.operations.generate_ordered_map_to_left_left_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶
-
exetera.core.operations.generate_ordered_map_to_left_partial(left, i_max, right, j_max, l_result, r_result, invalid, i_off, j_off, i, j, r, ii, jj, ii_max, jj_max, inner)¶ This function generates a mapping from a subset of a left key to a subset of a right key, writing the resulting mapping to a buffer, where both keys can contain repeated entries.
Example:
left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]
i   j   op  r   lres  rres
0   0   <   0   0     INV
1   0   =   1   1     0
2   1   =   2   2     1
2   2       3   2     2
3   3       4   3     3
3   4       5   3     4
3   5       6   3     5
4   3       7   4     3
4   4       8   4     4
4   5       9   4     5
5   6       10  5     INV
6   6       11  6     INV
left_map = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
right_map = [INV, 1, 2, 2, 3, 3, 3, 4, 4, 4, INV, INV]
Everything about this function is optimised for performance under njit. It is effectively a finite state machine that iterates through left, right, and result arrays. The various…
i and i_max are used to track the index of the left source
j and j_max are used to track the index of the right source
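For readers who want to check the trace by hand, the following unchunked reference sketch reproduces the lres and rres columns of the example for a left mapping with repeated keys on both sides. It is a plain-Python illustration rather than the njit state machine; ordered_map_to_left_sketch is a hypothetical name, and -1 stands in for INV as an assumption.

```python
INVALID = -1  # stands in for INV in the example (assumed value)


def ordered_map_to_left_sketch(left, right, invalid=INVALID):
    """Map every left row to each matching right row, or to `invalid`."""
    l_map, r_map = [], []
    j = 0
    for i in range(len(left)):
        # Skip right keys that are smaller than the current left key.
        while j < len(right) and right[j] < left[i]:
            j += 1
        if j < len(right) and right[j] == left[i]:
            # Emit one entry per matching right row; j itself is not advanced
            # so that a repeated left key matches the same run again.
            jj = j
            while jj < len(right) and right[jj] == left[i]:
                l_map.append(i)
                r_map.append(jj)
                jj += 1
        else:
            l_map.append(i)
            r_map.append(invalid)
    return l_map, r_map


left = [10, 20, 30, 40, 40, 50, 50]
right = [20, 30, 30, 40, 40, 40, 60, 70]
l_map, r_map = ordered_map_to_left_sketch(left, right)
print(l_map)
print(r_map)
```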
-
exetera.core.operations.generate_ordered_map_to_left_remaining(i_max, l_result, r_result, i_off, i, r, invalid)¶
-
exetera.core.operations.generate_ordered_map_to_left_right_unique(first, second, result, invalid)¶
-
exetera.core.operations.generate_ordered_map_to_left_right_unique_partial(left, i_max, right, r_result, invalid, j_off, i, j, r)¶
-
exetera.core.operations.generate_ordered_map_to_left_right_unique_partial_old(d_j, left, right, left_to_right, invalid)¶ Returns: [0]: how many positions forward i moved [1]: how many positions forward j moved [2]: how many elements were written
-
exetera.core.operations.generate_ordered_map_to_left_right_unique_remaining(i_max, r_result, i, r, invalid)¶
-
exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶
-
exetera.core.operations.generate_ordered_map_to_left_right_unique_streamed_old(left, right, left_to_right, invalid=-1, chunksize=1048576)¶
-
exetera.core.operations.generate_ordered_map_to_left_streamed(left: exetera.core.abstract_types.Field, right: exetera.core.abstract_types.Field, l_result: exetera.core.abstract_types.Field, r_result: exetera.core.abstract_types.Field, invalid: Union[numpy.int32, numpy.int64], chunksize: Optional[int] = 1048576, rdtype=<class 'numpy.int32'>)¶ This function performs the most generic type of left to right mapping calculation, in which both key fields can have repeated key values. At its heart, the function generates a mapping from left to right that can then be used to map data in the right space to data in the left space. Note that this can also be used to generate the inverse mapping by simply flipping the left and right collections.
As the Fields left and right can contain arbitrarily long sequences of data, the data is streamed through the algorithm in a series of chunks. Similarly, the resulting map is written to a buffer that is written to the result field in chunks.
This streamed function makes a sequence of calls to a corresponding _partial function that does the heavy lifting. Inside the _partial function, a finite state machine (FSM) iterates over the data, performing the mapping. The _partial function call exits whenever any of the chunks (left_, right_ or result_) that it is passed becomes exhausted.
Please take a look at the documentation for the _partial function to understand the role that the various finite state machine parameters play.
- We have to make some adjustments to the finite state machine between calls to _partial:
if the call used all of the left_ data, add the size of that data chunk to i_off
if the call used all of the right_ data, add the size of that data chunk to j_off
write the accumulated result_ data to the result field, and reset r to 0
-
exetera.core.operations.get_byte_map(string_map)¶ Getting byte indices and byte values from categorical key-value pair
-
exetera.core.operations.get_map_datatype_based_on_lengths(left_len, right_len)¶
-
exetera.core.operations.get_map_datatype_str_based_on_lengths(left_len, right_len)¶
-
exetera.core.operations.get_map_subchunks_based_on_index_lengths(map_, invalid, chunksize)¶
-
exetera.core.operations.get_next_chunk(start: int, chunk_size: int, field: exetera.core.abstract_types.Field)¶ This is a helper function that provides functionality specific to streaming ordered merges. It assumes that field is in sorted order.
This function is used to fetch chunks of memory from a field to be consumed by streaming merges. It first fetches a chunk of the given chunk size, or the size of the remaining memory, whichever is smaller. It then ‘trims’ that memory by removing the last sequence of equal values from the valid range.
- Parameters
start – The start of the chunk to be returned
chunk_size – The size of the chunk to be considered. The returned chunk will always be shorter than this unless it is the final chunk of the field data
field – The field from which data should be fetched. This field must be in sorted order
- Returns
A tuple representing the range (inclusive, exclusive) and a numpy ndarray containing the data. Note that this array is typically longer than the range returned, as we do not trim the data for performance reasons.
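The trimming step can be sketched in plain NumPy as follows. This is a minimal illustration rather than ExeTera's implementation: the helper name trimmed_chunk is hypothetical, it works on an in-memory array instead of a Field, and it assumes every non-final chunk holds more than one distinct value.

```python
import numpy as np

def trimmed_chunk(data, start, chunk_size):
    """Hypothetical helper: return ((start, end), chunk) for sorted `data`."""
    end = min(start + chunk_size, len(data))
    chunk = data[start:end]
    if end == len(data):
        # final chunk: nothing can follow it, so no trimming is needed
        return (start, end), chunk
    # `data` is sorted, so the trailing run of equal values begins at the
    # first position whose value equals the last value in the chunk
    first_of_last_run = int(np.searchsorted(chunk, chunk[-1], side='left'))
    if first_of_last_run == 0:
        raise ValueError("chunk_size too small: chunk holds one repeated value")
    # the untrimmed chunk is returned alongside the trimmed range
    return (start, start + first_of_last_run), chunk

keys = np.array([1, 1, 2, 2, 3, 3, 3, 4])
(first_start, first_end), _ = trimmed_chunk(keys, 0, 6)
(second_start, second_end), _ = trimmed_chunk(keys, first_end, 6)
```

In this sketch the run of 3s is cut from the first range, so the whole run is seen together in the second range.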
-
exetera.core.operations.get_spans_for_field(ndarray)¶
-
exetera.core.operations.get_valid_value_extents(chunk, start, end, invalid=- 1)¶
-
exetera.core.operations.indexed_string_unique(indices, values, unique_result, unique_index, unique_inverse, unique_counts)¶ Find the unique elements for indexed string field using njit function.
-
exetera.core.operations.is_ordered(field)¶
-
exetera.core.operations.isin_for_indexed_string_field(test_elements, indices, values)¶
-
exetera.core.operations.isin_indexed_string_speedup(test_elements, indices, values)¶
-
exetera.core.operations.leaky_categorical_transform(chunk, freetext_indices, freetext_values, i_c, column_inds, column_vals, column_offsets, cat_keys, cat_index, cat_values)¶ Transform method for categorical importer in readerwriter.py
-
exetera.core.operations.map_valid(data_field, map_field, result=None, invalid=- 1)¶
-
exetera.core.operations.merge_entries_segment(i_start, cur_old_start, old_map, new_map, to_keep, old_src, new_src, dest)¶
- Parameters
i_start – the initial value to apply to ‘i’
cur_old_start – the initial value to apply to ‘cur_old’
old_map – the map (in i-space) for the existing records
new_map – the map (in i-space) for the new records
to_keep – the flags (in i-space) indicating whether the new record should be kept
old_src – the source for the existing records
new_src – the source for the new records
dest – the sink for the merged sources
- Returns
-
exetera.core.operations.merge_indexed_journalled_entries(old_map, new_map, to_keep, old_src_inds, old_src_vals, new_src_inds, new_src_vals, dest_inds, dest_vals)¶
-
exetera.core.operations.merge_indexed_journalled_entries_count(old_map, new_map, to_keep, old_src_inds, new_src_inds)¶
-
exetera.core.operations.merge_journalled_entries(old_map, new_map, to_keep, old_src, new_src, dest)¶
-
exetera.core.operations.next_chunk(current: int, length: int, desired: int)¶ This is a helper function that can be used whenever you want to access a large sequence of data in chunks. It simply carries out the calculation that returns the extents of the next chunk, taking into account the length of the sequence. The sequence itself is not required here, only the length.
- Parameters
current – the starting point of the chunk
length – the length of the sequence being chunked
desired – the requested length of the chunk
- Returns
A tuple of the chunk extents. The first value is inclusive; the second is exclusive
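The calculation described above can be sketched as below. This assumes the extents are simply the requested range clipped to the sequence length; the function body is an illustration and may differ from the library's code.

```python
def next_chunk(current, length, desired):
    """Sketch: extents of the next chunk, clipped to the sequence length."""
    return current, min(current + desired, length)

# walk a sequence of length 10 in chunks of up to 4 elements
extents = []
position = 0
while position < 10:
    start, end = next_chunk(position, 10, 4)
    extents.append((start, end))
    position = end
```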
-
exetera.core.operations.next_map_subchunk(map_, sm, invalid, chunksize)¶
-
exetera.core.operations.next_trimmed_chunk(field, chunk, chunk_size)¶
-
exetera.core.operations.next_untrimmed_chunk(field, chunk, chunk_size)¶
-
exetera.core.operations.numeric_bool_transform(elements, validity, column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, field_name)¶ Transform method for numeric importer (bool) in readerwriter.py
-
exetera.core.operations.ordered_generate_journalling_indices(old, new)¶
-
exetera.core.operations.ordered_get_last_as_filter(field)¶
-
exetera.core.operations.ordered_inner_map(left, right, left_to_inner, right_to_inner)¶
-
exetera.core.operations.ordered_inner_map_both_unique(left, right, left_to_inner, right_to_inner)¶
-
exetera.core.operations.ordered_inner_map_left_unique(left, right, left_to_inner, right_to_inner)¶
-
exetera.core.operations.ordered_inner_map_left_unique_partial(d_i, d_j, left, right, left_to_inner, right_to_inner)¶
- Returns
[0]: how many positions forward i moved
[1]: how many positions forward j moved
[2]: how many elements were written
-
exetera.core.operations.ordered_inner_map_left_unique_streamed(left, right, left_to_inner, right_to_inner, chunksize=1048576)¶
-
exetera.core.operations.ordered_inner_map_result_size(left, right)¶
-
exetera.core.operations.ordered_left_map_result_size(left, right)¶
-
exetera.core.operations.ordered_map_valid_indexed_partial(sm_values, sm_start, sm_end, indices, i_start, i_max, values, mv_start, result_indices, result_values, invalid, sm, ri, rv, ri_accum)¶
-
exetera.core.operations.ordered_map_valid_indexed_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576, value_factor=8)¶
-
exetera.core.operations.ordered_map_valid_partial(values, map_values, sm_start, sm_end, d_start, result_data, invalid, invalid_value)¶
-
exetera.core.operations.ordered_map_valid_partial_old(d, data_field, map_field, result, invalid)¶
-
exetera.core.operations.ordered_map_valid_stream(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)¶
. for each map chunk
    . calculate sub chunks based on indices
    . for each sub chunk
        . map indices for sub chunk
-
exetera.core.operations.ordered_map_valid_stream_old(data_field, map_field, result_field, invalid=- 1, chunksize=1048576)¶
-
exetera.core.operations.ordered_outer_map_result_size_both_unique(left, right)¶
-
exetera.core.operations.raiseNumericException(exception_message, exception_args)¶
-
exetera.core.operations.safe_map_indexed_values(data_indices, data_values, map_field, map_filter, empty_value=None)¶
-
exetera.core.operations.safe_map_values(data_field, map_field, map_filter, empty_value=None)¶
-
exetera.core.operations.str_to_dtype(str_dtype)¶
-
exetera.core.operations.streaming_sort_merge(src_index_f, src_value_f, tgt_index_f, tgt_value_f, segment_length, chunk_length)¶
-
exetera.core.operations.streaming_sort_partial(in_chunk_indices, in_chunk_lengths, src_value_chunks, src_index_chunks, dest_value_chunk, dest_index_chunk)¶
-
exetera.core.operations.transform_float(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)¶ Transform float method for numeric importer in field_importer.py
-
exetera.core.operations.transform_int(column_inds, column_vals, column_offsets, col_idx, written_row_count, invalid_value, validation_mode, data_type, field_name)¶ Transform int method for numeric importer in field_importer.py
-
exetera.core.operations.transform_to_values(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶ Transform method for byte data from np.int to np.bytes_
-
exetera.core.operations.unique_for_indexed_string(indices, values, return_index, return_inverse, return_counts)¶ Find the unique elements for indexed string field.
exetera.core.persistence module¶
-
class
exetera.core.persistence.DataStore(chunksize=1048576, timestamp='2022-04-05 17:12:36.942412+00:00')¶ Bases:
object-
aggregate_count(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶
-
aggregate_custom(predicate, fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶
-
aggregate_first(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶
-
aggregate_last(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶
-
aggregate_max(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶
-
aggregate_min(fkey_indices=None, fkey_index_spans=None, reader=None, writer=None)¶
-
apply_filter(filter_to_apply, reader, writer=None)¶
-
apply_indices(indices_to_apply, reader, writer=None)¶
-
apply_sort(index, reader, writer=None)¶
-
apply_spans_concat(spans, reader, writer)¶
-
apply_spans_count(spans, _, writer=None)¶
-
apply_spans_first(spans, reader, writer)¶
-
apply_spans_index_of_first(spans, writer=None)¶
-
apply_spans_index_of_last(spans, writer=None)¶
-
apply_spans_index_of_max(spans, reader, writer=None)¶
-
apply_spans_index_of_min(spans, reader, writer=None)¶
-
apply_spans_last(spans, reader, writer)¶
-
apply_spans_max(spans, reader, writer)¶
-
apply_spans_min(spans, reader, writer)¶
-
chunks(length, chunksize=None)¶
-
dataset_sort(readers, index=None)¶
-
distinct(field=None, fields=None, filter=None)¶
-
get_categorical_writer(group, name, categories, timestamp=None, writemode='write')¶
-
get_compatible_writer(field, dest_group, dest_name, timestamp=None, writemode='write')¶
-
get_existing_writer(field, timestamp=None)¶
-
get_fixed_string_writer(group, name, width, timestamp=None, writemode='write')¶
-
get_index(target, foreign_key, destination=None)¶
-
get_indexed_string_writer(group, name, timestamp=None, writemode='write')¶
-
get_numeric_writer(group, name, dtype, timestamp=None, writemode='write')¶
-
get_or_create_group(group, name)¶
-
get_reader(field)¶
-
get_spans(field=None, fields=None)¶
-
get_timestamp_writer(group, name, timestamp=None, writemode='write')¶
-
get_trash_group(group)¶
-
index_spans(spans)¶
-
join(destination_pkey, fkey_indices, values_to_join, writer=None, fkey_index_spans=None)¶
-
predicate_and_join(predicate, destination_pkey, fkey_indices, reader=None, writer=None, fkey_index_spans=None)¶
-
process(inputs, outputs, predicate)¶
-
set_timestamp(timestamp='2022-04-05 17:12:36.942424+00:00')¶
-
sort_on(src_group, dest_group, keys, fields=None, timestamp=None, write_mode='write')¶
-
temp_filename()¶
-
-
exetera.core.persistence.dataset_merge_sort(group, index, fields)¶
-
exetera.core.persistence.filter_duplicate_fields(field)¶
-
exetera.core.persistence.filtered_iterator(values, filter, default=nan)¶
-
exetera.core.persistence.foreign_key_is_in_primary_key(primary_key, foreign_key)¶
-
exetera.core.persistence.temp_dataset()¶
-
exetera.core.persistence.timestamp_to_date(values)¶
-
exetera.core.persistence.try_str_to_bool(value, invalid=0)¶
-
exetera.core.persistence.try_str_to_float(value, invalid=0)¶
-
exetera.core.persistence.try_str_to_float_to_int(value, invalid=0)¶
-
exetera.core.persistence.try_str_to_int(value, invalid=0)¶
exetera.core.readerwriter module¶
-
class
exetera.core.readerwriter.CategoricalImporter(datastore, group, name, categories, timestamp=None, write_mode='write')¶ Bases:
object-
chunk_factory(length)¶
-
flush()¶
-
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶
-
write(values)¶
-
write_part(values)¶
-
write_strings(values)¶
-
-
class
exetera.core.readerwriter.CategoricalReader(datastore, field)¶ Bases:
exetera.core.readerwriter.Reader-
dtype()¶
-
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶
-
-
class
exetera.core.readerwriter.CategoricalWriter(datastore, group, name, categories, timestamp=None, write_mode='write')¶ Bases:
exetera.core.readerwriter.Writer-
chunk_factory(length)¶
-
flush()¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.DateTimeImporter(datastore, group, name, create_day_field=False, optional=True, timestamp=None, write_mode='write')¶ Bases:
object-
chunk_factory(length)¶
-
flush()¶
-
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.DateTimeWriter(datastore, group, name, timestamp=None, write_mode='write')¶ Bases:
exetera.core.readerwriter.Writer-
chunk_factory(length)¶
-
flush()¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.DateWriter(datastore, group, name, timestamp=None, write_mode='write')¶ Bases:
exetera.core.readerwriter.Writer-
chunk_factory(length)¶
-
flush()¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.FixedStringReader(datastore, field)¶ Bases:
exetera.core.readerwriter.Reader-
dtype()¶
-
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶
-
-
class
exetera.core.readerwriter.FixedStringWriter(datastore, group, name, strlen, timestamp=None, write_mode='write')¶ Bases:
exetera.core.readerwriter.Writer-
chunk_factory(length)¶
-
flush()¶
-
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.IndexedStringReader(datastore, field)¶ Bases:
exetera.core.readerwriter.Reader-
dtype()¶
-
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶
-
sort(index, writer)¶
-
-
class
exetera.core.readerwriter.IndexedStringWriter(datastore, group, name, timestamp=None, write_mode='write')¶ Bases:
exetera.core.readerwriter.Writer-
chunk_factory(length)¶
-
flush()¶
-
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶
-
write(values)¶
-
write_part(values)¶ Writes a list of strings in indexed string form to a field.
- Parameters
values – a list of utf8 strings
-
write_part_raw(index, values)¶
-
write_raw(index, values)¶
-
-
class
exetera.core.readerwriter.LeakyCategoricalImporter(datastore, group, name, categories, out_of_range, timestamp=None, write_mode='write')¶ Bases:
object-
chunk_factory(length)¶
-
flush()¶
-
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.NumericImporter(datastore, group, name, nformat, parser, invalid_value=0, validation_mode='allow_empty', create_flag_field=True, flag_field_suffix='_valid', timestamp=None, write_mode='write')¶ Bases:
object-
chunk_factory(length)¶
-
flush()¶
-
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶
-
write(values)¶
-
write_part(values)¶ Given a list of strings, parse the strings and write the parsed values. Values that cannot be parsed are written out as zero for the values, and zero for the flags to indicate that that entry is not valid.
- Parameters
values – a list of strings to be parsed
-
-
class
exetera.core.readerwriter.NumericReader(datastore, field)¶ Bases:
exetera.core.readerwriter.Reader-
dtype()¶
-
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶
-
-
class
exetera.core.readerwriter.NumericWriter(datastore, group, name, nformat, timestamp=None, write_mode='write')¶ Bases:
exetera.core.readerwriter.Writer-
chunk_factory(length)¶
-
flush()¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.OptionalDateImporter(datastore, group, name, create_day_field=False, optional=True, timestamp=None, write_mode='write')¶ Bases:
object-
chunk_factory(length)¶
-
flush()¶
-
import_part(column_inds, column_vals, column_offsets, col_idx, written_row_count)¶
-
write(values)¶
-
write_part(values)¶
-
-
class
exetera.core.readerwriter.Reader(field)¶ Bases:
object
-
class
exetera.core.readerwriter.TimestampReader(datastore, field)¶ Bases:
exetera.core.readerwriter.Reader-
dtype()¶
-
get_writer(dest_group, dest_name, timestamp=None, write_mode='write')¶
-
exetera.core.regression module¶
-
exetera.core.regression.check_row(exp_ds, exp_index, act_ds, act_index, keys, custom_checks)¶
-
exetera.core.regression.datetime_compare_to_secs(value1, value2)¶
-
exetera.core.regression.na_compare(value1, value2)¶
-
exetera.core.regression.na_or_value(value)¶
exetera.core.session module¶
-
class
exetera.core.session.Session(chunksize: int = 1048576, timestamp: str = '2022-04-05 17:12:36.959786+00:00')¶ Bases:
exetera.core.abstract_types.AbstractSession
Session is the top-level object that is used to create and open ExeTera Datasets. It also provides operations that can be performed on Fields. For a more detailed explanation of Session and examples of its usage, please refer to https://github.com/KCL-BMEIS/ExeTera/wiki/Session-API
- Parameters
chunksize – Change the default chunksize that fields created with this dataset use. Note this is a hint parameter and future versions of Session may choose to ignore it if it is no longer required. In general, it should only be changed for testing.
timestamp – Set the official timestamp for the Session’s creation rather than taking the current date/time.
-
aggregate_count(index, dest=None)¶ Finds the number of entries within each sub-group of index.
Example:
Index:  a a a b b x a c c d d d
Result: 3 2 1 1 2 3
- Parameters
index – A numpy array or Field containing the index that defines the ranges over which count is applied.
dest – If set, a Field to which the resulting counts are written
- Returns
A numpy array containing the resulting values
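The counting behaviour in the example can be reproduced with plain NumPy. The sketch below is an illustration of the idea only; run_counts is a hypothetical helper, not part of the Session API, and it works on an in-memory array.

```python
import numpy as np

def run_counts(index):
    """Sketch: count the entries in each run of contiguous equal values."""
    index = np.asarray(index)
    if len(index) == 0:
        return np.zeros(0, dtype=np.int64)
    # positions where a new run starts (the first element always starts one)
    starts = np.flatnonzero(np.concatenate(([True], index[1:] != index[:-1])))
    # append the total length so consecutive differences give the run sizes
    bounds = np.concatenate((starts, [len(index)]))
    return np.diff(bounds)

counts = run_counts(np.array(list("aaabbxaccddd")))
```

Note that the second run of 'a' is counted separately from the first, because sub-groups are defined by contiguity rather than by value.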
-
aggregate_custom(predicate, index, target=None, dest=None)¶
-
aggregate_first(index, target=None, dest=None)¶ Finds the first entries within each sub-group of index.
Example:
Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 1 4 6 7 8 0
- Parameters
index – A numpy array or Field containing the index that defines the ranges over which first is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written
- Returns
A numpy array containing the resulting values
-
aggregate_last(index, target=None, dest=None)¶ Finds the last entries within each sub-group of index.
Example:
Index:  a a a b b x a c c d d d
Target: 1 2 3 4 5 6 7 8 9 0 1 2
Result: 3 5 6 7 9 2
- Parameters
index – A numpy array or Field containing the index that defines the ranges over which last is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written
- Returns
A numpy array containing the resulting values
-
aggregate_max(index, target=None, dest=None)¶ Finds the maximum value within each sub-group of index.
Example:
Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 3 5 6 7 9 2
- Parameters
index – A numpy array or Field containing the index that defines the ranges over which max is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written
- Returns
A numpy array containing the resulting values
-
aggregate_min(index, target=None, dest=None)¶ Finds the minimum value within each sub-group of index.
Example:
Index:  a a a b b x a c c d d d
Target: 1 2 3 5 4 6 7 8 9 2 1 0
Result: 1 4 6 7 8 0
- Parameters
index – A numpy array or Field containing the index that defines the ranges over which min is applied.
target – A numpy array to which the index is applied
dest – If set, a Field to which the resulting values are written
- Returns
A numpy array containing the resulting values
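The first, last, max and min aggregations share one structure: find where each contiguous sub-group of the index starts, then reduce the target over those ranges. The following is a NumPy sketch of that idea using the index and target from the max/min examples; it is not the library's implementation.

```python
import numpy as np

index = np.array(list("aaabbxaccddd"))
target = np.array([1, 2, 3, 5, 4, 6, 7, 8, 9, 2, 1, 0])

# start position of every run of contiguous equal index values
starts = np.flatnonzero(np.concatenate(([True], index[1:] != index[:-1])))
# position of the last element of every run
ends = np.concatenate((starts[1:], [len(index)])) - 1

firsts = target[starts]
lasts = target[ends]
# ufunc.reduceat reduces each slice target[starts[i]:starts[i + 1]]
maxima = np.maximum.reduceat(target, starts)
minima = np.minimum.reduceat(target, starts)
```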
-
apply_filter(filter_to_apply, src, dest=None)¶ Apply a filter to a src field. The filtered field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.
- Parameters
filter_to_apply – the filter to be applied to the source field, an array of boolean
src – the field to be filtered
dest – optional - a field to write the filtered data to
- Returns
the filtered values
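To illustrate why indexed strings return indices and values separately, here is a NumPy sketch of filtering both kinds of data. It assumes an indexed string is stored as an offsets array of length n + 1 plus a flat byte array, which is an assumption about the layout made for this example; filter_indexed_string is a hypothetical helper.

```python
import numpy as np

def filter_indexed_string(indices, values, mask):
    """Sketch: filter strings stored as offsets (`indices`) plus bytes (`values`)."""
    lengths = np.diff(indices)
    # rebuild the offsets from the lengths of the strings that are kept
    new_indices = np.zeros(int(mask.sum()) + 1, dtype=indices.dtype)
    np.cumsum(lengths[mask], out=new_indices[1:])
    # expand the per-string mask to a per-byte mask
    new_values = values[np.repeat(mask, lengths)]
    return new_indices, new_values

mask = np.array([True, False, True])

# a numeric field is filtered with an ordinary boolean mask
numeric = np.array([10, 20, 30])
filtered_numeric = numeric[mask]

# the strings "a", "bb", "ccc" in the assumed indexed form
indices = np.array([0, 1, 3, 6], dtype=np.int64)
values = np.frombuffer(b"abbccc", dtype=np.uint8)
new_indices, new_values = filter_indexed_string(indices, values, mask)
```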
-
apply_index(index_to_apply, src, dest=None)¶ Apply an index to a src field. The indexed field is written to dest if it is set, and returned from the function call. If the field is an IndexedStringField, the indices and values are returned separately.
- Parameters
index_to_apply – the index to be applied to the source field, must be one of Group, Field, or ndarray
src – the field to be indexed
dest – optional - a field to write the indexed data to
- Returns
the indexed values
-
apply_spans_concat(spans, target, dest, src_chunksize=None, dest_chunksize=None, chunksize_mult=None)¶
-
apply_spans_count(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the number of entries within each span.
- Parameters
spans – the numpy array of spans to be applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
apply_spans_first(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the first entry within each span on a target field.
- Parameters
spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
apply_spans_index_of_first(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the index of the first entry within each span.
- Parameters
spans – the numpy array of spans to be applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
apply_spans_index_of_last(spans: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the index of the last entry within each span.
- Parameters
spans – the numpy array of spans to be applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
apply_spans_index_of_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the index of the maximum value within each span on a target field.
- Parameters
spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
apply_spans_index_of_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the index of the minimum value within each span on a target field.
- Parameters
spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
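The index-of-max and index-of-min operations can be sketched as a loop over consecutive span boundaries. This example assumes the returned indices are positions in the target rather than offsets within each span; spans_index_of is a hypothetical helper written for illustration.

```python
import numpy as np

def spans_index_of(spans, target, arg_fn):
    """Sketch: position in `target` selected by `arg_fn` within each span."""
    result = np.zeros(len(spans) - 1, dtype=np.int64)
    for i in range(len(spans) - 1):
        start, end = spans[i], spans[i + 1]
        # arg_fn gives an offset within the span; add the span start
        result[i] = start + arg_fn(target[start:end])
    return result

spans = np.array([0, 3, 5, 6])
target = np.array([1, 7, 3, 5, 4, 6])
index_of_max = spans_index_of(spans, target, np.argmax)
index_of_min = spans_index_of(spans, target, np.argmin)
```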
-
apply_spans_last(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the last entry within each span on a target field.
- Parameters
spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
apply_spans_max(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the maximum value within each span on a target field.
- Parameters
spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
apply_spans_min(spans: numpy.ndarray, target: numpy.ndarray, dest: exetera.core.abstract_types.Field = None)¶ Finds the minimum value within each span on a target field.
- Parameters
spans – the numpy array of spans to be applied
target – the field to which the spans are applied
dest – if set, the field to which the results are written
- Returns
A numpy array containing the resulting values
-
chunks(length: int, chunksize: Optional[int] = None)¶ Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.
‘chunks’ is a convenience method that, given an overall length and a chunksize, will yield a set of ranges for the chunks in question. For example, chunks(1048576, 500000) -> (0, 500000), (500000, 1000000), (1000000, 1048576)
- Parameters
length – The range to be split into chunks
chunksize – Optional parameter detailing the size of each chunk. If not set, the chunksize that the Session was initialized with is used.
-
close()¶ Close all open datasets.
- Returns
None
-
close_dataset(name: str)¶ Close the dataset with the given name. If there is no dataset with that name, do nothing.
- Parameters
name – The name of the dataset to be closed
- Returns
None
-
create_categorical(group, name, nformat, key, timestamp=None, chunksize=None)¶ Create a categorical field in the given DataFrame with the given name. This function also takes a numerical format for the numeric representation of the categories, and a key that maps numeric values to their string descriptions.
- Parameters
group – The group in which the new field should be created
name – The name of the new field
nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64). It is recommended to use ‘int8’.
key – A dictionary that maps numerical values to their string representations
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.
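The relationship between the numeric representation and the key can be illustrated without any ExeTera calls. In this sketch the key follows the direction described above (numeric value to string); the category names are invented for the example.

```python
import numpy as np

# key as described above: numeric value -> string description
key = {0: 'no', 1: 'yes', 2: 'unknown'}

# the stored representation: one small integer per entry
codes = np.array([1, 0, 0, 2, 1], dtype=np.int8)

# decode the stored values for display
labels = [key[int(c)] for c in codes]

# encode strings by inverting the key
inverse = {description: value for value, description in key.items()}
encoded = np.array([inverse[s] for s in labels], dtype=np.int8)
```

Storing the codes as int8 keeps each entry to a single byte, which is the motivation for the recommended format when there are few categories.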
-
create_fixed_string(group, name, length, timestamp=None, chunksize=None)¶ Create a fixed string field in the given DataFrame, given name, and given max string length per entry.
- Parameters
group – The group in which the new field should be created
name – The name of the new field
length – The maximum length in bytes that each entry can have.
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.
-
create_indexed_string(group, name, timestamp=None, chunksize=None)¶ Create an indexed string field in the given DataFrame with the given name.
- Parameters
group – The group in which the new field should be created
name – The name of the new field
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.
-
create_like(field, dest_group, dest_name, timestamp=None, chunksize=None)¶ Create a field of the same type as an existing field, in the location and with the name provided.
Example:
with Session() as s:
    ...
    a = s.get(table_1['a'])
    b = s.create_like(a, table_2, 'a_times_2')
    b.data.write(a.data[:] * 2)
- Parameters
field – The Field whose type is to be copied
dest_group – The group in which the new field should be created
dest_name – The name of the new field
-
create_numeric(group, name, nformat, timestamp=None, chunksize=None)¶ Create a numeric field in the given DataFrame with the given name.
- Parameters
group – The group in which the new field should be created
name – The name of the new field
nformat – A numerical type in the set (int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64). It is recommended to avoid uint64 as certain operations in numpy cause conversions to floating point values.
timestamp – If set, the timestamp that should be given to the new field. If not set datetime.now() is used.
chunksize – If set, the chunksize that should be used to create the new field. In general, this should not be set unless you are writing unit tests.
-
create_timestamp(group, name, timestamp=None, chunksize=None)¶ Create a timestamp field in the given group with the given name.
-
dataset_sort_index(sort_indices, index=None)¶ Generate a sorted index based on a set of fields upon which to sort and an optional index to apply to the sort_indices.
- Parameters
sort_indices – a tuple or list of indices that determine the sorted order
index – optional - the index by which the initial field should be permuted
- Returns
the resulting index that can be used to permute unsorted fields
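A multi-key sorted index of this kind can be sketched with numpy.lexsort. The example assumes that the first entry of sort_indices is the most significant key, which the description above does not state explicitly; the field names are invented for illustration.

```python
import numpy as np

patient_id = np.array([2, 1, 2, 1])   # hypothetical primary sort key
day = np.array([5, 9, 3, 1])          # hypothetical secondary sort key

# np.lexsort treats the LAST key as the primary key, so reverse the tuple
sort_indices = (patient_id, day)
sorted_index = np.lexsort(sort_indices[::-1])

# the resulting index can permute any field of the same length
sorted_ids = patient_id[sorted_index]
sorted_days = day[sorted_index]
```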
-
distinct(field=None, fields=None, filter=None)¶
-
get(field: Union[exetera.core.abstract_types.Field, h5py._hl.group.Group])¶ Get a Field from a h5py Group.
Example:
# this code for context
with Session() as s:
    # open a dataset about wildlife
    src = s.open_dataset("/my/wildlife/dataset.hdf5", "r", "src")
    # fetch the group containing bird data
    birds = src['birds']
    # get the bird decibel field
    bird_decibels = s.get(birds['decibels'])
- Parameters
field – The Field or Group object to retrieve.
-
get_dataset(name: str)¶ Get the dataset with the given name. If there is no dataset with that name, raise a KeyError indicating that the dataset with that name is not present.
- Parameters
name – Name of the dataset to be fetched. This is the name that was given to it when it was opened through open_dataset().
- Returns
Dataset with that name.
-
get_index(target, foreign_key, destination=None)¶ Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.
Please make use of Dataframe.merge functionality instead. This method can be emulated by adding an index (via np.arange) to a dataframe, performing a merge and then fetching the mapped index field.
‘get_index’ maps a primary key (‘target’) into the space of a foreign key (‘foreign_key’).
-
get_or_create_group(group: Union[h5py._hl.group.Group, h5py._hl.files.File], name: str)¶ Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.
Create a shared index based on a tuple of numpy arrays containing keys. This function generates the sorted union of a tuple of key fields and then maps the individual arrays to their corresponding indices in the sorted union.
- Parameters
keys – a tuple of groups, fields or ndarrays whose contents represent keys
Example:
key_1 = ['a', 'b', 'e', 'g', 'i']
key_2 = ['b', 'b', 'c', 'c', 'e', 'g', 'j']
key_3 = ['a', 'c', 'd', 'e', 'g', 'h', 'h', 'i']
sorted_union = ['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j']
key_1_index = [0, 1, 4, 5, 7]
key_2_index = [1, 1, 2, 2, 4, 5, 8]
key_3_index = [0, 2, 3, 4, 5, 6, 6, 7]
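The sorted-union mapping described in this entry can be reproduced with NumPy. This is a sketch of the idea under the assumption that all keys are in-memory arrays of the same type; shared_index is a hypothetical helper name.

```python
import numpy as np

def shared_index(keys):
    """Sketch: map each key array to positions in the sorted union of all keys."""
    sorted_union = np.unique(np.concatenate(keys))
    # every key value is present in the union, so searchsorted finds exact matches
    return sorted_union, tuple(np.searchsorted(sorted_union, k) for k in keys)

key_1 = np.array(['a', 'b', 'e', 'g', 'i'])
key_2 = np.array(['b', 'b', 'c', 'c', 'e', 'g', 'j'])
key_3 = np.array(['a', 'c', 'd', 'e', 'g', 'h', 'h', 'i'])

sorted_union, (key_1_index, key_2_index, key_3_index) = shared_index(
    (key_1, key_2, key_3))
```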
-
get_spans(field: Union[exetera.core.abstract_types.Field, numpy.ndarray] = None, dest: exetera.core.abstract_types.Field = None, **kwargs)¶ Calculate a set of spans that indicate contiguous equal values. The entries in the result array correspond to the inclusive start and exclusive end of the span (the ith span is represented by element i and element i+1 of the result array). The last entry of the result array is the length of the source field.
Only one of ‘field’ or ‘fields’ may be set. If ‘fields’ is used and more than one field specified, the fields are effectively zipped and the check for spans is carried out on each corresponding tuple in the zipped field.
Example:
field:  [1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2]
result: [0, 1, 3, 6, 7, 10, 15]
- Parameters
field – A Field or numpy array to be evaluated for spans
dest – A destination Field to store the result
**kwargs – See below. For parameters set in both argument and kwargs, use kwargs
- Keyword Arguments
field – Same as the field parameter, for callers who specify field as a keyword
fields – A tuple of Fields or tuple of numpy arrays to be evaluated for spans
dest – Same as the dest parameter, for callers who specify dest as a keyword
- Returns
The resulting set of spans as a numpy array
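The span calculation, including the zipped behaviour for multiple fields, can be sketched in NumPy as below. The helper names are hypothetical and the code operates on in-memory arrays only.

```python
import numpy as np

def spans_of_fields(fields):
    """Sketch: span boundaries where ANY of the equal-length fields changes."""
    fields = [np.asarray(f) for f in fields]
    length = len(fields[0])
    if length == 0:
        return np.zeros(1, dtype=np.int64)
    changed = np.zeros(length - 1, dtype=bool)
    for f in fields:
        # a new span starts wherever a value differs from its predecessor
        changed |= f[1:] != f[:-1]
    return np.concatenate(([0], np.flatnonzero(changed) + 1, [length]))

def spans_of(field):
    """Sketch: the single-field case."""
    return spans_of_fields((field,))

spans = spans_of([1, 2, 2, 1, 1, 1, 3, 4, 4, 4, 2, 2, 2, 2, 2])
zipped_spans = spans_of_fields(([1, 1, 1, 2], [0, 1, 1, 1]))
```

The ith span covers elements spans[i] up to, but not including, spans[i + 1].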
-
join(destination_pkey, fkey_indices, values_to_join, writer=None, fkey_index_spans=None)¶ This method is due for removal and should not be used. Please use the merge or ordered_merge functions instead.
-
list_datasets()¶ List the open datasets for this Session object. This is returned as a tuple of strings rather than the datasets themselves. The individual datasets can be fetched using get_dataset().
Example:
names = s.list_datasets()
datasets = [s.get_dataset(n) for n in names]
- Returns
A tuple containing the names of the currently open datasets for this Session object
-
merge_inner(left_on, right_on, left_fields=None, left_writers=None, right_fields=None, right_writers=None)¶ Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.
Please use DataFrame.merge instead.
Perform a database-style inner join on left_fields, outputting the result to left_writers, if set.
- Parameters
left_on – The key on the left-hand side to join on
right_on – The key on the right-hand side to join on
left_fields – The fields to be mapped from left to inner
left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.
right_fields – The fields to be mapped from right to inner
right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.
-
merge_left(left_on, right_on, right_fields=(), right_writers=None)¶ Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.
Please use DataFrame.merge instead.
Perform a database-style left join on right_fields, outputting the result to right_writers, if set.
- Parameters
left_on – The key on the left-hand side to join on
right_on – The key on the right-hand side to join on
right_fields – The fields to be mapped from right to left
right_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.
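To make "mapped from right to left" concrete, here is a small numpy sketch of a left join expressed as an index map. It is an illustration under stated assumptions, not the deprecated method's actual code: the names `left_join_map` and `apply_map` are hypothetical, right keys are assumed unique, and the fill value used for unmatched rows is an arbitrary choice.

```python
import numpy as np

def left_join_map(left_keys, right_keys):
    """For each left row, the index of the matching right row, or -1 if none.

    Hypothetical helper; assumes right_keys contains no duplicates.
    """
    index_of = {}
    for i, key in enumerate(right_keys):
        index_of.setdefault(key, i)
    return np.array([index_of.get(k, -1) for k in left_keys], dtype=np.int64)

def apply_map(index_map, values, fill):
    """Pull right-hand values into left-hand row order using the map."""
    values = np.asarray(values)
    out = np.full(len(index_map), fill, dtype=values.dtype)
    valid = index_map >= 0
    out[valid] = values[index_map[valid]]
    return out

left_keys = [10, 20, 30, 20]
right_keys = [20, 10]
right_values = [200, 100]

index_map = left_join_map(left_keys, right_keys)
joined = apply_map(index_map, right_values, fill=0)
print(index_map.tolist())  # [1, 0, -1, 0]
print(joined.tolist())     # [100, 200, 0, 200]
```

The result has one entry per left row, which is what distinguishes a left join from an inner join, where unmatched left rows would be dropped.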
-
merge_right(left_on, right_on, left_fields=(), left_writers=None)¶ Note: this function is deprecated, and provided only for compatibility with existing scripts. It will be removed in a future version.
Please use DataFrame.merge instead.
Perform a database-style right join on left_fields, outputting the result to left_writers, if set.
- Parameters
left_on – The key on the left-hand side to join on
right_on – The key on the right-hand side to join on
left_fields – The fields to be mapped from left to right
left_writers – Optional parameter providing the fields to which the mapped data should be written. If this is not set, the mapped data is returned as numpy arrays and lists instead.
-
open_dataset(dataset_path: Union[str, IO[bytes]], mode: str, name: str)¶ Open a dataset with the given access mode.
- Parameters
dataset_path – the path to the dataset
mode – the mode in which the dataset should be opened. This is one of “r”, “r+” or “w”.
name – the name that is associated with this dataset. This can be used to retrieve the dataset when calling
get_dataset().
- Returns
The top-level dataset object
-
ordered_merge_inner(left_on, right_on, left_field_sources=(), left_field_sinks=None, right_field_sources=(), right_field_sinks=None, left_unique=False, right_unique=False)¶ Generate the results of an inner join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.
Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.
- Parameters
left_on – the group/field/numpy array that contains the left key values
right_on – the group/field/numpy array that contains the right key values
right_to_left_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge
right_field_sources – a tuple of group/fields/numpy arrays that contain the fields to be joined
right_field_sinks – optional - a tuple of group/fields/numpy arrays that the mapped fields should be written to
left_unique – a hint to indicate whether the ‘left_on’ field contains unique values
right_unique – a hint to indicate whether the ‘right_on’ field contains unique values
- Returns
If right_field_sinks is not set, a tuple of the output fields is returned
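An ordered merge relies on both key arrays already being sorted, which lets the join be computed in a single forward pass. The sketch below illustrates that idea for an inner join with a two-cursor walk. It is not ExeTera's implementation; the generator name `ordered_inner_pairs` is hypothetical, and ascending sort order on both sides is assumed.

```python
def ordered_inner_pairs(left_keys, right_keys):
    """Yield (left_index, right_index) pairs for an inner join.

    Hypothetical helper; assumes both inputs are sorted ascending.
    Duplicate keys on either side produce every matching combination.
    """
    i = j = 0
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] < right_keys[j]:
            i += 1
        elif left_keys[i] > right_keys[j]:
            j += 1
        else:
            key = left_keys[i]
            # Find the end of the run of equal keys on the right.
            j_end = j
            while j_end < len(right_keys) and right_keys[j_end] == key:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left_keys) and left_keys[i] == key:
                for jj in range(j, j_end):
                    yield i, jj
                i += 1
            j = j_end

pairs = list(ordered_inner_pairs([1, 2, 2, 4], [2, 2, 3, 4]))
print(pairs)  # [(1, 0), (1, 1), (2, 0), (2, 1), (3, 3)]
```

Because neither cursor ever moves backwards, the keys can be consumed in chunks, which is the property a streaming merge depends on.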
-
ordered_merge_left(left_on, right_on, right_field_sources=(), left_field_sinks=None, left_to_right_map=None, left_unique=False, right_unique=False)¶ Generate the results of a left join and apply it to the fields described in the tuple ‘right_field_sources’. If ‘left_field_sinks’ is set, the mapped values are written to the fields / arrays set there.
Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to left_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.
- Parameters
left_on – the group/field/numpy array that contains the left key values
right_on – the group/field/numpy array that contains the right key values
left_to_right_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge
right_field_sources – a tuple of group/fields/numpy arrays that contain the fields to be joined
left_field_sinks – optional - a tuple of group/fields/numpy arrays that the mapped fields should be written to
left_unique – a hint to indicate whether the ‘left_on’ field contains unique values
right_unique – a hint to indicate whether the ‘right_on’ field contains unique values
- Returns
If left_field_sinks is not set, a tuple of the output fields is returned
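The role of a left-to-right map can be sketched as follows: for each left row, record which right row supplies its values. This is an illustrative sketch only, not ExeTera's code. The name `ordered_left_map` is hypothetical, both key arrays are assumed sorted ascending, the right keys are assumed unique, and -1 is an arbitrary marker for "no match".

```python
import numpy as np

def ordered_left_map(left_keys, right_keys):
    """Map each left row to its matching right row, or -1 if unmatched.

    Hypothetical helper; assumes sorted inputs and unique right keys.
    """
    result = np.full(len(left_keys), -1, dtype=np.int64)
    j = 0
    for i, key in enumerate(left_keys):
        # Advance the right cursor past smaller keys; it never moves back.
        while j < len(right_keys) and right_keys[j] < key:
            j += 1
        if j < len(right_keys) and right_keys[j] == key:
            result[i] = j
    return result

left_to_right = ordered_left_map([1, 1, 2, 4, 5], [1, 3, 4])
print(left_to_right.tolist())  # [0, 0, -1, 2, -1]
```

The map has the same length as the left key array, and it can be reused to pull any number of right-hand fields into left-hand row order.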
-
ordered_merge_right(left_on, right_on, left_field_sources=(), right_field_sinks=None, right_to_left_map=None, left_unique=False, right_unique=False)¶ Generate the results of a right join and apply it to the fields described in the tuple ‘left_field_sources’. If ‘right_field_sinks’ is set, the mapped values are written to the fields / arrays set there.
Note: in order to achieve best scalability, you should use groups / fields rather than numpy arrays and provide a tuple of groups/fields to right_field_sinks, so that the session can compute the merge and apply the mapping in a streaming fashion.
- Parameters
left_on – the group/field/numpy array that contains the left key values
right_on – the group/field/numpy array that contains the right key values
right_to_left_map – a group/field/numpy array that the map is written to. If it is a numpy array, it must be the size of the resulting merge
left_field_sources – a tuple of group/fields/numpy arrays that contain the fields to be joined
right_field_sinks – optional - a tuple of group/fields/numpy arrays that the mapped fields should be written to
left_unique – a hint to indicate whether the ‘left_on’ field contains unique values
right_unique – a hint to indicate whether the ‘right_on’ field contains unique values
- Returns
If right_field_sinks is not set, a tuple of the output fields is returned
-
predicate_and_join(predicate, destination_pkey, fkey_indices, reader=None, writer=None, fkey_index_spans=None)¶ This method is due for removal and should not be used. Please use the merge or ordered_merge functions instead.
-
set_timestamp(timestamp: str = '2022-04-05 17:12:36.959841+00:00')¶ Set the default timestamp to be used when creating fields without specifying an explicit timestamp.
- Parameters
timestamp – a string representing a valid Datetime
- Returns
None
-
sort_on(src_group: h5py._hl.group.Group, dest_group: h5py._hl.group.Group, keys: Union[tuple, list], timestamp=datetime.datetime(2022, 4, 5, 17, 12, 36, 959847, tzinfo=datetime.timezone.utc), write_mode='write', verbose=True)¶ Sort a group (src_group) of fields by the specified set of keys, and write the sorted fields to dest_group.
- Parameters
src_group – the group of fields that are to be sorted
dest_group – the group into which sorted fields are written
keys – fields to sort on
timestamp – optional - timestamp to write on the sorted fields
write_mode – optional - write mode to use if the destination fields already exist
- Returns
None
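Sorting a group of fields by a set of keys amounts to computing one permutation from the key fields and applying it to every field. The sketch below shows that pattern with in-memory numpy arrays rather than h5py groups. It is not the library's implementation: `sort_fields` is a hypothetical name, and treating the first key as the primary sort key is an assumption.

```python
import numpy as np

def sort_fields(fields, keys):
    """Return a new dict with every array reordered by the given keys.

    Hypothetical helper; 'fields' maps names to equal-length arrays.
    """
    # np.lexsort treats the LAST array as the primary key, so reverse.
    order = np.lexsort([np.asarray(fields[k]) for k in reversed(keys)])
    return {name: np.asarray(values)[order] for name, values in fields.items()}

src = {
    "id": [2, 1, 2, 1],
    "day": [5, 9, 3, 1],
    "val": [0.2, 0.9, 0.3, 0.1],
}
dest = sort_fields(src, ("id", "day"))
print(dest["id"].tolist())   # [1, 1, 2, 2]
print(dest["day"].tolist())  # [1, 9, 3, 5]
print(dest["val"].tolist())  # [0.1, 0.9, 0.3, 0.2]
```

Computing the permutation once and reusing it keeps non-key fields aligned with the keys they belong to.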
-
temp_filename()¶
exetera.core.split module¶
-
exetera.core.split.assessment_splitter(input_filename, output_filename, assessment_buckets, bucket)¶
-
exetera.core.split.patient_splitter(input_filename, output_filenames, sorted_indices, bucket_size)¶
-
exetera.core.split.split_data(patient_data, assessment_data, bucket_size=500000, territories=None)¶
exetera.core.utils module¶
-
class
exetera.core.utils.Timer(start_msg, new_line=False, end_msg='completed in')¶ Bases:
object
-
exetera.core.utils.build_histogram(dataset, filtered_records=None, tx=None)¶
-
exetera.core.utils.bytearray_to_escaped(srcbytearray, destbytearray, src_start=0, src_end=None, dest_start=0, separator=b',', delimiter=b'"')¶
-
exetera.core.utils.check_input_lengths(names, fields)¶
-
exetera.core.utils.chunks(length, chunksize)¶
-
exetera.core.utils.clear_set_flag(values, to_clear)¶
-
exetera.core.utils.concatenate_maybe_strs(sequence, value, separator=',', delimiter='"')¶
-
exetera.core.utils.count_flag_empty(flags)¶
-
exetera.core.utils.count_flag_not_set(flags, flag_to_test)¶
-
exetera.core.utils.count_flag_set(flags, flag_to_test)¶
-
exetera.core.utils.datetime_to_seconds(dt)¶
-
exetera.core.utils.filter_field(fields, filter_list, f_missing, f_bad, is_type_fn, type_fn, valid_fn)¶
-
exetera.core.utils.find_longest_sequence_of(string, char)¶
-
exetera.core.utils.from_escaped(string)¶
-
exetera.core.utils.get_min_max(value_type)¶
-
exetera.core.utils.guess_encoding(filename)¶ Attempt to determine the encoding of the given text file by reading the byte order mark, defaulting to utf-8 if none is found.
- Parameters
filename – path to a text file containing possible UTF-8, UTF-16, or UTF-32 text
- Returns
encoding name, one of utf-8, utf-8-sig, utf-16, utf-32
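Byte order mark detection of the kind described above can be sketched with the standard library's `codecs` constants. This is a minimal illustration, not the body of `guess_encoding`; the name `guess_encoding_sketch` is hypothetical, and the order of the checks is a design choice of this sketch.

```python
import codecs
import os
import tempfile

def guess_encoding_sketch(filename):
    """Return an encoding name based on the file's byte order mark.

    Hypothetical helper for illustration; falls back to utf-8.
    """
    with open(filename, "rb") as f:
        head = f.read(4)
    # UTF-32 must be tested before UTF-16, because the UTF-32-LE mark
    # begins with the same two bytes as the UTF-16-LE mark.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"

# Write the same text in each encoding and check what is detected.
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for codec in ("utf-8", "utf-8-sig", "utf-16", "utf-32"):
        path = os.path.join(tmp, codec + ".txt")
        with open(path, "wb") as f:
            f.write("id,name\n".encode(codec))
        results[codec] = guess_encoding_sketch(path)
print(results)
```

Python's "utf-16" and "utf-32" codecs write a byte order mark when encoding, which is why the files created here are detectable; files written without a mark fall through to the utf-8 default.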
-
exetera.core.utils.is_float(value)¶
-
exetera.core.utils.is_int(value)¶
-
exetera.core.utils.list_to_escaped(strings)¶
-
exetera.core.utils.map_between_categories(first_map, second_map)¶
-
exetera.core.utils.one_dim_data_to_indexed_for_test(data, field_size)¶
-
exetera.core.utils.print_diagnostic_row(preamble, ds, ir, keys, fns=None)¶
-
exetera.core.utils.replace_if_invalid(replacement)¶
-
exetera.core.utils.sort_mixed_list(values, check_fn, sort_fn)¶
-
exetera.core.utils.string_to_datetime(field)¶
-
exetera.core.utils.timestamp_to_day(field)¶
-
exetera.core.utils.to_categorical(field, transform)¶
-
exetera.core.utils.to_escaped(string, separator=',', delimiter='"')¶
-
exetera.core.utils.to_float(value)¶
-
exetera.core.utils.to_int(value)¶
-
exetera.core.utils.valid_range_fac(f_min, f_max, default_value='')¶
-
exetera.core.utils.valid_range_fac_inc(f_min, f_max, default_value='')¶
-
exetera.core.utils.validate_file_exists(file_name)¶
exetera.core.validation module¶
-
exetera.core.validation.all_same_basic_type(name, fields)¶
-
exetera.core.validation.array_from_field_or_lower(name, field)¶
-
exetera.core.validation.array_from_parameter(session, name, field)¶
-
exetera.core.validation.ensure_valid_field(name, field)¶
-
exetera.core.validation.ensure_valid_field_like(name, field)¶
-
exetera.core.validation.field_from_parameter(session, name, field)¶
-
exetera.core.validation.is_field_parameter(field)¶
-
exetera.core.validation.raw_array_from_parameter(datastore, name, field)¶
-
exetera.core.validation.validate_all_field_length_in_df(df: exetera.core.abstract_types.DataFrame)¶
-
exetera.core.validation.validate_and_get_key_fields(side, df, key)¶
-
exetera.core.validation.validate_and_normalize_categorical_key(param_name, key)¶
-
exetera.core.validation.validate_boolean_row_filter(name, field)¶
-
exetera.core.validation.validate_chunk_size(chunk_size_name, chunk_size)¶
-
exetera.core.validation.validate_field_lengths(side, lens, df, names=None)¶
-
exetera.core.validation.validate_filter(filter_to_apply)¶
-
exetera.core.validation.validate_groupby_target(target, by, all)¶
-
exetera.core.validation.validate_key_field_consistency(lname, rname, lkey, rkey)¶
-
exetera.core.validation.validate_key_lengths(side, df, key)¶
-
exetera.core.validation.validate_require_key(context, key, dictionary)¶
-
exetera.core.validation.validate_selected_keys(by, all)¶