umami.data_tools package#

Submodules#

umami.data_tools.cuts module#

Module to define sample cuts.

umami.data_tools.cuts.get_category_cuts(label_var: str, label_value: float, cut_op: str = '==') list#

This function returns the cut object for the categories used in the preprocessing.

Parameters:
  • label_var (str) – Name of the variable.

  • label_value (float, int, list) – Value for the cut of the variable.

  • cut_op (str, optional) – Operator for the cut. Possible values: “==”, “!=”, “>”, “>=”, “<”, “<=” or None. Default is “==”.

Returns:

List with the cut objects inside.

Return type:

list

Raises:

ValueError – If label_value is not a float, int or a list.

umami.data_tools.cuts.get_cut_list(class_labels: list)#

Returns a dict of cuts used to define the classes.

Parameters:

class_labels (list) – List with the class labels.

Returns:

cut_dict – dict with cuts per class label

Return type:

dict

umami.data_tools.cuts.get_sample_cuts(jets: ndarray, cuts: list) ndarray#

Given an array of jets and a list of cuts, the function returns the indices of the jets that are removed by applying the cuts. A cut can be defined either with a variable name and one of the logical operators ==, !=, >, >=, <, <=, or with the dedicated modulo operator.

The modulo operator is written as mod_[N]_[operator], where:
  • [N] is the modulus (“modulo N”)

  • [operator] is the operator used to compare the result with the condition

Parameters:
  • jets (np.ndarray) – Array of jets which need to pass certain cuts.

  • cuts (list) – List from config file which contains dict objects for individual cuts.

Returns:

indices_to_remove – Numpy array of indices to be removed by the cuts

Return type:

np.ndarray

Raises:
  • KeyError – If the cut object in the list is not a dict with one entry.

  • RuntimeError – If the modulo is incorrectly used.

  • RuntimeError – If the modulo is incorrectly used. Operation is not supported.

  • KeyError – If unsupported operator is provided.
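The cut handling described above can be illustrated with a minimal, self-contained sketch. This is not the umami implementation: the function name sketch_sample_cuts, the dict keys "operator" and "condition", and the variable names eventNumber and pt are assumptions chosen for illustration. It only shows how single-entry cut dicts and a mod_[N]_[operator] string could be turned into a list of indices to remove.

```python
import operator

import numpy as np

# Comparison operators named in the description above.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def sketch_sample_cuts(jets: np.ndarray, cuts: list) -> np.ndarray:
    """Return the indices of jets failing at least one cut (illustrative only)."""
    failed = np.zeros(len(jets), dtype=bool)
    for cut in cuts:
        # Each cut is assumed to be a dict with exactly one entry.
        if not isinstance(cut, dict) or len(cut) != 1:
            raise KeyError("Each cut must be a dict with exactly one entry.")
        ((variable, spec),) = cut.items()
        op_name, condition = spec["operator"], spec["condition"]
        values = jets[variable]
        # Modulo cuts: mod_[N]_[operator], e.g. "mod_2_==".
        if op_name.startswith("mod_"):
            parts = op_name.split("_")
            if len(parts) != 3:
                raise RuntimeError("Modulo cut must look like mod_[N]_[operator].")
            _, modulus, op_name = parts
            values = values % int(modulus)
        if op_name not in OPERATORS:
            raise KeyError(f"Unsupported operator: {op_name}")
        # A jet is removed as soon as it fails any single cut.
        failed |= ~OPERATORS[op_name](values, condition)
    return np.where(failed)[0]


# Hypothetical jets stored in a structured array.
jets = np.array(
    [(0, 100.0), (1, 300.0), (2, 150.0), (3, 50.0)],
    dtype=[("eventNumber", "i8"), ("pt", "f8")],
)
cuts = [
    {"eventNumber": {"operator": "mod_2_==", "condition": 0}},
    {"pt": {"operator": "<=", "condition": 200.0}},
]
indices_to_remove = sketch_sample_cuts(jets, cuts)
print(indices_to_remove)  # [1 3]
```

With the returned indices, the surviving jets could then be obtained with np.delete(jets, indices_to_remove).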

umami.data_tools.cuts.retrieve_cut_string(class_labels: list) tuple#

Retrieve the cut string for a list of class labels.

Parameters:

class_labels (list) – List with the classes to retrieve, for example [“bjets”, “cjets”, “ujets”].

Returns:

Dict with the cut string for each flavour

Return type:

dict

umami.data_tools.loaders module#

Provides functions for loading datasets from files.

umami.data_tools.loaders.load_jets_from_file(filepath: str, class_labels: list, n_jets: int | None = None, variables: list | None = None, cut_vars_dict: dict | None = None, print_logger: bool = True, chunk_size: int = 1000000.0, indices_to_load: tuple | None = None)#

Load jets from file. Only jets from classes in class_labels are returned.

Parameters:
  • filepath (str) – Path to the .h5 file with the jets.

  • class_labels (list) – List of class labels which are used.

  • n_jets (int) – Number of jets to load.

  • variables (list) – Variables which are loaded.

  • cut_vars_dict (dict) – Variable cuts that are applied when loading the jets.

  • print_logger (bool) – Decide if the number of jets loaded from the file is printed.

  • chunk_size (int) – Chunk size, i.e. the number of jets loaded in one go.

  • indices_to_load (tuple, optional) – Load the given indices, by default None.

Returns:

  • all_jets (pandas.DataFrame) – The jets as a pandas DataFrame

  • all_labels (numpy.ndarray) – The internal class label for each jet. Corresponds with the index of the class label in class_labels.

Raises:
  • ValueError – If neither n_jets nor indices_to_load is given

  • ValueError – If both n_jets and indices_to_load are given

  • KeyError – If filepath is not a list or a string

  • RuntimeError – If no files could be found in filepath
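The interplay of n_jets, indices_to_load and chunk_size can be sketched without any file access. The following is an illustration under assumptions, not the umami code: the helper name sketch_load_in_chunks is hypothetical, an in-memory array stands in for the .h5 file, and indices_to_load is assumed to be a (start, stop) tuple.

```python
import numpy as np


def sketch_load_in_chunks(data, n_jets=None, indices_to_load=None, chunk_size=1_000_000):
    """Read entries from data chunk by chunk (illustrative only)."""
    # Exactly one of n_jets and indices_to_load must be given.
    if n_jets is None and indices_to_load is None:
        raise ValueError("Give either n_jets or indices_to_load.")
    if n_jets is not None and indices_to_load is not None:
        raise ValueError("Give only one of n_jets and indices_to_load.")

    # Assumption: indices_to_load is a (start, stop) tuple.
    if indices_to_load is not None:
        start, stop = indices_to_load
        return data[start:stop]

    chunks, n_loaded = [], 0
    for start in range(0, len(data), int(chunk_size)):
        chunk = data[start : start + int(chunk_size)]
        # Keep only as many entries as are still missing.
        chunks.append(chunk[: n_jets - n_loaded])
        n_loaded += len(chunks[-1])
        if n_loaded >= n_jets:
            break
    if not chunks:
        return data[:0]
    return np.concatenate(chunks)


data = np.arange(10)
first_seven = sketch_load_in_chunks(data, n_jets=7, chunk_size=3)
window = sketch_load_in_chunks(data, indices_to_load=(2, 5))
print(first_seven)  # [0 1 2 3 4 5 6]
print(window)  # [2 3 4]
```

Reading in chunks keeps the memory footprint bounded by chunk_size, and the loop stops as soon as the requested number of jets has been collected.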

umami.data_tools.loaders.load_trks_from_file(filepath: str, class_labels: list, n_jets: int | None = None, tracks_name: str = 'tracks', cut_vars_dict: dict | None = None, print_logger: bool = True, chunk_size: int = 1000000.0, indices_to_load: tuple | None = None)#

Load tracks from file. Only jets from classes in class_labels are returned.

Parameters:
  • filepath (str) – Path to the .h5 file with the jets.

  • class_labels (list) – List of class labels which are used.

  • n_jets (int) – Number of jets to load.

  • tracks_name (str) – Name of the tracks collection to load

  • cut_vars_dict (dict) – Variable cuts that are applied when loading the jets.

  • print_logger (bool) – Decide if the number of jets loaded from the file is printed.

  • chunk_size (int) – Chunk size, i.e. the number of jets loaded in one go.

  • indices_to_load (tuple, optional) – Load the given indices, by default None.

Returns:

  • all_trks (numpy.ndarray) – The tracks of the jets as numpy ndarray

  • all_labels (numpy.ndarray) – The internal class label for each jet. Corresponds with the index of the class label in class_labels.

Raises:
  • ValueError – If neither n_jets nor indices_to_load is given

  • ValueError – If both n_jets and indices_to_load are given

  • KeyError – If filepath is not a list or a string

  • RuntimeError – If no files could be found in filepath
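The meaning of the returned labels, where each internal label is the index of the jet's class in class_labels, can be shown with a small sketch. The mapping from class label to truth-label value below is hypothetical and stands in for the real class definitions; the helper name sketch_internal_labels is likewise an assumption.

```python
import numpy as np

# Hypothetical mapping from class label to a truth-label value (illustration only).
HYPOTHETICAL_LABEL_VALUES = {"bjets": 5, "cjets": 4, "ujets": 0}


def sketch_internal_labels(truth_labels: np.ndarray, class_labels: list):
    """Return a keep-mask and the internal label of each kept jet (illustrative only)."""
    internal = np.full(len(truth_labels), -1, dtype=int)
    for index, class_label in enumerate(class_labels):
        # The internal label is the position of the class in class_labels.
        internal[truth_labels == HYPOTHETICAL_LABEL_VALUES[class_label]] = index
    # Jets that belong to none of the requested classes are dropped.
    keep = internal >= 0
    return keep, internal[keep]


truth = np.array([5, 0, 15, 4, 5])
keep, labels = sketch_internal_labels(truth, ["bjets", "cjets", "ujets"])
print(keep)  # [ True  True False  True  True]
print(labels)  # [0 2 1 0]
```

The same mask would be applied to the jets or tracks array so that it stays aligned with the labels.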

umami.data_tools.tools module#

Helper tools for helper_tools.

umami.data_tools.tools.compare_h5_files_variables(*h5_files, key)#

Compare variable content of several hdf5 files.

Parameters:
  • *h5_files (str) – name of hdf5 files to be compared

  • key (str) – hdf5 dataset key

Returns:

  • list – list of variables common to all provided files

  • list – list of variables not common in provided files

Raises:

ValueError – if no input files provided.
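The comparison logic can be sketched with plain sets. To stay self-contained, this hypothetical helper takes lists of variable names instead of hdf5 file names, so it only illustrates how the two returned lists relate to each other and is not the umami implementation.

```python
def sketch_compare_variables(*variable_lists):
    """Return (common, not_common) variable names (illustrative only)."""
    if not variable_lists:
        raise ValueError("No input provided.")
    variable_sets = [set(variables) for variables in variable_lists]
    # Variables present in every input.
    common = set.intersection(*variable_sets)
    # Variables present in at least one input, but not in all of them.
    not_common = set.union(*variable_sets) - common
    return sorted(common), sorted(not_common)


common, not_common = sketch_compare_variables(
    ["pt", "eta", "phi"],
    ["pt", "eta", "mass"],
)
print(common)  # ['eta', 'pt']
print(not_common)  # ['mass', 'phi']
```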

Module contents#