umami.data_tools package#

Submodules#

umami.data_tools.cuts module#

Module to define sample cuts.

umami.data_tools.cuts.get_category_cuts(label_var: str, label_value: float, cut_op: str = '==') → list#

This function returns the cut object for the categories used in the preprocessing.

Parameters:

label_var (str) – Name of the variable.
label_value (float, int, list) – Value for the cut of the variable.
cut_op (str, optional) – Operator for the cut. Possible values: “==”, “!=”, “>”, “>=”, “<”, “<=” or None. Default is “==”.

Returns:

List with the cut objects inside.

Return type:

list

Raises:

ValueError – If label_value is not a float, int or a list.
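
As an illustration, the sketch below mimics the documented behaviour with a hypothetical stand-in, sketch_category_cuts. The layout of a cut object (a one-entry dict mapping the variable name to an operator/condition pair) and the choice of one cut per list element are assumptions made for this example, not taken from the umami source.

```python
def sketch_category_cuts(label_var, label_value, cut_op="=="):
    """Hypothetical stand-in for get_category_cuts (cut layout is assumed)."""
    # Accept a single number or a list of values, as the documentation states.
    if isinstance(label_value, (float, int)):
        values = [label_value]
    elif isinstance(label_value, list):
        values = label_value
    else:
        raise ValueError("label_value must be a float, an int or a list.")

    # Assumption: each cut is a dict with exactly one entry.
    return [
        {label_var: {"operator": cut_op, "condition": value}}
        for value in values
    ]


cuts = sketch_category_cuts("HadronConeExclTruthLabelID", 5)
```

Here "HadronConeExclTruthLabelID" is only an example variable name; any label variable present in the input files could be used.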

umami.data_tools.cuts.get_cut_list(class_labels: list)#

Returns a dict of cuts used to define the classes.

Parameters:

class_labels (list) – List with the class labels.

Returns:

cut_dict – Dict with the cuts per class label.

Return type:

dict

umami.data_tools.cuts.get_sample_cuts(jets: ndarray, cuts: list) → ndarray#

Given an array of jets and a list of cuts, the function returns the indices of the jets that are removed by applying the cuts. Cuts can be defined either via the variable name and one of the logical operators ==, !=, >, >=, <, <=, or using the dedicated modulo operator.

The modulo operator is written as mod_[N]_[operator], where:

- [N] denotes “modulo N”
- [operator] denotes the operator used for the comparison to the condition

Parameters:

jets (np.ndarray) – Array of jets which need to pass certain cuts.
cuts (list) – List from config file which contains dict objects for individual cuts.

Returns:

indices_to_remove – Numpy array of indices to be removed by the cuts

Return type:

np.ndarray

Raises:

KeyError – If the cut object in the list is not a dict with one entry.
RuntimeError – If the modulo is incorrectly used.
RuntimeError – If the modulo is incorrectly used. Operation is not supported.
KeyError – If unsupported operator is provided.
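
The cut logic described above can be sketched as follows. This is a simplified, hypothetical re-implementation (sketch_sample_cuts), not the umami code: it assumes each cut is a one-entry dict of the form {variable: {"operator": ..., "condition": ...}} and that jets is a NumPy structured array. The variable names eventNumber and pt_btagJes are examples only.

```python
import operator

import numpy as np

# Supported comparison operators, as listed in the documentation.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def sketch_sample_cuts(jets, cuts):
    """Return the indices of jets failing at least one cut (illustrative)."""
    keep = np.ones(len(jets), dtype=bool)
    for cut in cuts:
        if not isinstance(cut, dict) or len(cut) != 1:
            raise KeyError("Each cut must be a dict with exactly one entry.")
        ((variable, spec),) = cut.items()
        op_name, condition = spec["operator"], spec["condition"]
        values = jets[variable]

        # Modulo syntax: mod_[N]_[operator], e.g. "mod_2_==".
        if op_name.startswith("mod_"):
            parts = op_name.split("_")
            if len(parts) != 3 or not parts[1].isdigit():
                raise RuntimeError(f"Incorrect use of modulo: {op_name}")
            values = values % int(parts[1])
            op_name = parts[2]

        if op_name not in OPERATORS:
            raise KeyError(f"Unsupported operator: {op_name}")
        keep &= OPERATORS[op_name](values, condition)

    # Jets that do not pass every cut are the ones to remove.
    return np.where(~keep)[0]


jets = np.array(
    [(0, 30.0), (1, 10.0), (2, 50.0), (3, 40.0)],
    dtype=[("eventNumber", "i8"), ("pt_btagJes", "f8")],
)
cuts = [
    {"pt_btagJes": {"operator": ">", "condition": 20.0}},
    {"eventNumber": {"operator": "mod_2_==", "condition": 0}},
]
indices_to_remove = sketch_sample_cuts(jets, cuts)
```

In this example only jets with a transverse momentum above 20 and an even event number pass, so the jets at positions 1 and 3 are flagged for removal.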

umami.data_tools.cuts.retrieve_cut_string(class_labels: list) → tuple#

Retrieve the cut string for a list of class labels

Parameters:

class_labels (list) – List with the classes to retrieve, e.g. [“bjets”, “cjets”, “ujets”].

Returns:

Dict with the cut string for each flavour.

Return type:

dict

umami.data_tools.loaders module#

Provides functions for loading datasets from files.

umami.data_tools.loaders.load_jets_from_file(filepath: str, class_labels: list, n_jets: int | None = None, variables: list | None = None, cut_vars_dict: dict | None = None, print_logger: bool = True, chunk_size: int = 1000000.0, indices_to_load: tuple | None = None)#

Load jets from file. Only jets from classes in class_labels are returned.

Parameters:

filepath (str) – Path to the .h5 file with the jets.
class_labels (list) – List of class labels which are used.
n_jets (int) – Number of jets to load.
variables (list) – Variables which are loaded.
cut_vars_dict (dict) – Variable cuts that are applied when loading the jets.
print_logger (bool) – Decide if the number of jets loaded from the file is printed.
chunk_size (int) – Number of jets that are loaded in one go.
indices_to_load (tuple, optional) – Load the given indices, by default None.

Returns:

all_jets (pandas.DataFrame) – The loaded jets.
all_labels (numpy.ndarray) – The internal class label for each jet. Corresponds with the index of the class label in class_labels.

Raises:

ValueError – If neither n_jets nor indices_to_load is given
ValueError – If both n_jets and indices_to_load are given
KeyError – If filepath is not a list or a string
RuntimeError – If no files could be found in filepath
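
The interplay of n_jets, indices_to_load and chunk_size can be sketched with a small hypothetical helper. It only reproduces the documented argument checks and splits the requested range into chunks; it does not open any file and ignores the class and variable cuts. That indices_to_load is a (start, stop) pair is an assumption of this sketch.

```python
def sketch_chunk_ranges(n_jets=None, indices_to_load=None, chunk_size=1_000_000):
    """Split the requested jet range into (start, stop) chunks (illustrative)."""
    # Exactly one of the two selection arguments must be given.
    if n_jets is None and indices_to_load is None:
        raise ValueError("Either n_jets or indices_to_load must be given.")
    if n_jets is not None and indices_to_load is not None:
        raise ValueError("Only one of n_jets and indices_to_load may be given.")

    # Assumption: indices_to_load holds the first and last-plus-one index.
    start, stop = (0, n_jets) if n_jets is not None else indices_to_load

    # The documented default is the float 1000000.0, so cast before use.
    chunk_size = int(chunk_size)
    return [
        (low, min(low + chunk_size, stop))
        for low in range(start, stop, chunk_size)
    ]


ranges = sketch_chunk_ranges(n_jets=25, chunk_size=10)
```

With these inputs the 25 requested jets are split into two full chunks of 10 and one final chunk of 5.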

umami.data_tools.loaders.load_trks_from_file(filepath: str, class_labels: list, n_jets: int | None = None, tracks_name: str = 'tracks', cut_vars_dict: dict | None = None, print_logger: bool = True, chunk_size: int = 1000000.0, indices_to_load: tuple | None = None)#

Load tracks from file. Only tracks of jets from classes in class_labels are returned.

Parameters:

filepath (str) – Path to the .h5 file with the jets.
class_labels (list) – List of class labels which are used.
n_jets (int) – Number of jets to load.
tracks_name (str) – Name of the tracks collection to load
cut_vars_dict (dict) – Variable cuts that are applied when loading the jets.
print_logger (bool) – Decide if the number of jets loaded from the file is printed.
chunk_size (int) – Number of jets that are loaded in one go.
indices_to_load (tuple, optional) – Load the given indices, by default None.

Returns:

all_trks (numpy.ndarray) – The tracks of the jets as numpy ndarray
all_labels (numpy.ndarray) – The internal class label for each jet. Corresponds with the index of the class label in class_labels.

Raises:

ValueError – If neither n_jets nor indices_to_load is given
ValueError – If both n_jets and indices_to_load are given
KeyError – If filepath is not a list or a string
RuntimeError – If no files could be found in filepath

umami.data_tools.tools module#

Helper tools for helper_tools.

umami.data_tools.tools.compare_h5_files_variables(*h5_files, key)#

Compare the variable content of several hdf5 files.

Parameters:

*h5_files (str) – name of hdf5 files to be compared
key (str) – hdf5 dataset key

Returns:

list – list of variables common to all provided files
list – list of variables not common in provided files

Raises:

ValueError – if no input files provided.
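
The comparison itself amounts to set operations on the variable names. The sketch below uses a hypothetical helper, sketch_compare_variables, that takes plain lists of variable names instead of files, so it runs without any hdf5 input. Sorting the results is a choice made for this example and not necessarily what umami does.

```python
def sketch_compare_variables(*variable_lists):
    """Return (common, not_common) variable names (illustrative only)."""
    if not variable_lists:
        raise ValueError("No input provided.")

    sets = [set(variables) for variables in variable_lists]
    # Variables present in every input.
    common = set.intersection(*sets)
    # Variables present in at least one input, but not in all of them.
    not_common = set.union(*sets) - common
    return sorted(common), sorted(not_common)


common, not_common = sketch_compare_variables(
    ["pt", "eta", "phi"],
    ["pt", "eta", "mass"],
)
```

In this example pt and eta are shared by both inputs, while phi and mass each appear in only one of them.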
