umami.data_tools package#
Submodules#
umami.data_tools.cuts module#
Module to define sample cuts.
- umami.data_tools.cuts.get_category_cuts(label_var: str, label_value: float, cut_op: str = '==') → list #
This function returns the cut object for the categories used in the preprocessing.
- Parameters:
label_var (str) – Name of the variable.
label_value (float, int, list) – Value for the cut of the variable.
cut_op (str, optional) – Operator for the cut. Possible values: "==", "!=", ">", ">=", "<", "<=" or None. Default is "==".
- Returns:
List with the cut objects inside.
- Return type:
list
- Raises:
ValueError – If label_value is not a float, int or a list.
- umami.data_tools.cuts.get_cut_list(class_labels: list)#
Returns a dict of cuts used to define the classes.
- Parameters:
class_labels (list) – List with the class labels.
- Returns:
cut_dict – dict with cuts per class label
- Return type:
dict
- umami.data_tools.cuts.get_sample_cuts(jets: ndarray, cuts: list) → ndarray #
Given an array of jets and a list of cuts, returns the indices of the jets that are removed by applying the cuts. Cuts can be defined either via the variable name and one of the logical operators ==, !=, >, >=, <, <=, or using the dedicated modulo operator.
The latter is defined as mod_[N]_[operator], where:
- [N] denotes "modulo N"
- [operator] denotes the operator used for comparison to the condition
- Parameters:
jets (np.ndarray) – Array of jets which need to pass certain cuts.
cuts (list) – List from config file which contains dict objects for individual cuts.
- Returns:
indices_to_remove – Numpy array of indices to be removed by the cuts
- Return type:
np.ndarray
- Raises:
KeyError – If the cut object in the list is not a dict with one entry.
RuntimeError – If the modulo is incorrectly used.
RuntimeError – If the modulo is incorrectly used. Operation is not supported.
KeyError – If unsupported operator is provided.
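The cut format described above can be illustrated with a minimal sketch. It is not the umami implementation: the helper name `sketch_indices_to_remove` and the `operator` / `condition` keys inside each cut dict are assumptions made for illustration, and only the comparison operators and the `mod_[N]_[operator]` form named above are handled.

```python
import operator

import numpy as np

# Comparison operators named in the documentation above.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def sketch_indices_to_remove(jets: np.ndarray, cuts: list) -> np.ndarray:
    """Return indices of jets failing at least one cut (illustrative only)."""
    remove = np.zeros(len(jets), dtype=bool)
    for cut in cuts:
        if not isinstance(cut, dict) or len(cut) != 1:
            raise KeyError("Each cut must be a dict with exactly one entry.")
        ((variable, spec),) = cut.items()
        # Assumed layout of a cut: {"operator": <str>, "condition": <value>}
        op_name, condition = spec["operator"], spec["condition"]
        values = jets[variable]
        if op_name.startswith("mod_"):
            parts = op_name.split("_")
            if len(parts) != 3:
                raise RuntimeError("Expected the form mod_[N]_[operator].")
            _, modulo, op_name = parts
            values = values % int(modulo)
        if op_name not in OPS:
            raise KeyError(f"Unsupported operator: {op_name}")
        passed = OPS[op_name](values, condition)
        remove |= ~passed
    return np.flatnonzero(remove)


# Structured array standing in for jets read from a file.
jets = np.array(
    [(0, 10.0), (1, 30.0), (2, 50.0), (3, 70.0)],
    dtype=[("eventNumber", "i8"), ("pt", "f8")],
)
cuts = [
    {"eventNumber": {"operator": "mod_2_==", "condition": 0}},  # keep even events
    {"pt": {"operator": ">=", "condition": 20.0}},  # keep pt >= 20
]
removed = sketch_indices_to_remove(jets, cuts)
print(removed.tolist())  # [0, 1, 3]
```

Only jet 2 has an even event number and pt of at least 20, so the other three indices are reported for removal.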
- umami.data_tools.cuts.retrieve_cut_string(class_labels: list) → tuple #
Retrieve the cut string for a list of class labels.
- Parameters:
class_labels (list) – List with the classes to retrieve, e.g. ["bjets", "cjets", "ujets"].
- Returns:
Dict with the cut string for each flavour
- Return type:
dict
umami.data_tools.loaders module#
Provides functions for loading datasets from files.
- umami.data_tools.loaders.load_jets_from_file(filepath: str, class_labels: list, n_jets: int | None = None, variables: list | None = None, cut_vars_dict: dict | None = None, print_logger: bool = True, chunk_size: int = 1000000.0, indices_to_load: tuple | None = None)#
Load jets from file. Only jets from classes in class_labels are returned.
- Parameters:
filepath (str) – Path to the .h5 file with the jets.
class_labels (list) – List of class labels which are used.
n_jets (int) – Number of jets to load.
variables (list) – Variables which are loaded.
cut_vars_dict (dict) – Variable cuts that are applied when loading the jets.
print_logger (bool) – Decide if the number of jets loaded from the file is printed.
chunk_size (int) – Number of jets that are loaded in one go.
indices_to_load (tuple, optional) – Load the given indices. By default None.
- Returns:
all_jets (pandas.DataFrame) – The jets as a pandas DataFrame.
all_labels (numpy.ndarray) – The internal class label for each jet. Corresponds with the index of the class label in class_labels.
- Raises:
ValueError – If neither n_jets nor indices_to_load is given
ValueError – If both n_jets and indices_to_load are given
KeyError – If filepath is not a list or a string
RuntimeError – If no files could be found in filepath
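The interplay of n_jets, indices_to_load and chunk_size can be sketched without touching a file. The helper `sketch_chunk_ranges` below is hypothetical and not part of umami. It only mirrors the documented argument check and shows one way a range of jets could be split into chunks. Treating indices_to_load as a (start, stop) pair is an assumption, and the sketch ignores the class and cut filtering that the real loader applies.

```python
def sketch_chunk_ranges(n_total, n_jets=None, indices_to_load=None, chunk_size=1_000_000):
    """Return (start, stop) index pairs, one per chunk (illustrative only)."""
    if n_jets is None and indices_to_load is None:
        raise ValueError("Give either n_jets or indices_to_load.")
    if n_jets is not None and indices_to_load is not None:
        raise ValueError("Give only one of n_jets and indices_to_load.")

    if indices_to_load is not None:
        # Assumption: indices_to_load is a (start, stop) pair.
        start, stop = indices_to_load
    else:
        start, stop = 0, int(n_jets)

    stop = min(stop, n_total)  # never read past the end of the dataset
    chunk_size = int(chunk_size)  # the documented default is written as 1000000.0
    return [(lo, min(lo + chunk_size, stop)) for lo in range(start, stop, chunk_size)]


ranges = sketch_chunk_ranges(n_total=25, n_jets=20, chunk_size=8)
print(ranges)  # [(0, 8), (8, 16), (16, 20)]
```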
- umami.data_tools.loaders.load_trks_from_file(filepath: str, class_labels: list, n_jets: int | None = None, tracks_name: str = 'tracks', cut_vars_dict: dict | None = None, print_logger: bool = True, chunk_size: int = 1000000.0, indices_to_load: tuple | None = None)#
Load tracks from file. Only jets from classes in class_labels are returned.
- Parameters:
filepath (str) – Path to the .h5 file with the jets.
class_labels (list) – List of class labels which are used.
n_jets (int) – Number of jets to load.
tracks_name (str) – Name of the tracks collection to load
cut_vars_dict (dict) – Variable cuts that are applied when loading the jets.
print_logger (bool) – Decide if the number of jets loaded from the file is printed.
chunk_size (int) – Number of jets that are loaded in one go.
indices_to_load (tuple, optional) – Load the given indices. By default None.
- Returns:
all_trks (numpy.ndarray) – The tracks of the jets as numpy ndarray
all_labels (numpy.ndarray) – The internal class label for each jet. Corresponds with the index of the class label in class_labels.
- Raises:
ValueError – If neither n_jets nor indices_to_load is given
ValueError – If both n_jets and indices_to_load are given
KeyError – If filepath is not a list or a string
RuntimeError – If no files could be found in filepath
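Both loaders describe all_labels as the index of each jet's class in class_labels. A minimal illustration of that convention follows. The array `jet_classes` is made up for the example and is not something the loaders expose.

```python
import numpy as np

class_labels = ["bjets", "cjets", "ujets"]

# Hypothetical per-jet class names, for illustration only.
jet_classes = np.array(["ujets", "bjets", "bjets", "cjets"])

# Internal label = position of the class in class_labels.
label_index = {name: idx for idx, name in enumerate(class_labels)}
all_labels = np.array([label_index[name] for name in jet_classes])

print(all_labels.tolist())  # [2, 0, 0, 1]
```

With this convention, reordering class_labels changes the integer assigned to each class, so the same list should be used wherever the labels are interpreted.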
umami.data_tools.tools module#
Helper tools for helper_tools.
- umami.data_tools.tools.compare_h5_files_variables(*h5_files, key)#
Compare variable content of several hdf5 files.
- Parameters:
*h5_files (str) – name of hdf5 files to be compared
key (str) – hdf5 dataset key
- Returns:
list – list of variables common to all provided files
list – list of variables not common in provided files
- Raises:
ValueError – If no input files are provided.
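The comparison itself amounts to set operations on the variable names of each file. The sketch below is an assumption about the general approach, not the umami code: the helper `sketch_compare_variables` is hypothetical and works on plain name lists. NumPy dtypes stand in for the files here. With h5py, the names of a compound dataset could be read from `dataset.dtype.names`.

```python
import numpy as np


def sketch_compare_variables(*variable_lists):
    """Return (common, not_common) variable names (illustrative only)."""
    if not variable_lists:
        raise ValueError("No input provided.")
    name_sets = [set(names) for names in variable_lists]
    common = set.intersection(*name_sets)
    everything = set.union(*name_sets)
    # Sorted so that the result does not depend on set ordering.
    return sorted(common), sorted(everything - common)


# Two compound dtypes standing in for the same dataset key in two files.
file_a = np.dtype([("pt", "f4"), ("eta", "f4"), ("mass", "f4")])
file_b = np.dtype([("pt", "f4"), ("eta", "f4"), ("charge", "i4")])

common, not_common = sketch_compare_variables(file_a.names, file_b.names)
print(common)      # ['eta', 'pt']
print(not_common)  # ['charge', 'mass']
```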