umami.preprocessing_tools.resampling package#

Submodules#

umami.preprocessing_tools.resampling.count_sampling module#

Count sampling module handling data preprocessing.

class umami.preprocessing_tools.resampling.count_sampling.SimpleSamplingBase(config)#

Bases: ResamplingTools, ABC

A base class for simple sampling methods like UnderSamplingNoReplace and UnderSampling.

Run()#

Run function executing full chain.

abstract get_indices()#

Applies the sampling to the given arrays. Returns the indices for the jets to be used separately for each category and sample.

class umami.preprocessing_tools.resampling.count_sampling.UnderSampling(config)#

Bases: SimpleSamplingBase

Undersampling class.

Run()#

Run function executing full chain.

get_indices()#

Applies the UnderSampling to the given arrays.

Returns:

Indices for the jets to be used separately for each category and sample.
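The count-based undersampling idea can be sketched as follows; `undersample_indices` and its `binnumbers` argument are hypothetical names for illustration, not part of the class:

```python
import numpy as np

def undersample_indices(binnumbers: dict, rng=None):
    """Sketch of count-based undersampling: for each (pt, eta) bin, keep
    only as many jets per flavour as the rarest flavour has in that bin.
    `binnumbers` maps flavour name -> array of bin numbers, one per jet."""
    rng = np.random.default_rng(42) if rng is None else rng
    all_bins = np.unique(np.concatenate(list(binnumbers.values())))
    indices = {flav: [] for flav in binnumbers}
    for b in all_bins:
        # per-flavour positions of the jets falling into this bin
        per_flav = {f: np.flatnonzero(arr == b) for f, arr in binnumbers.items()}
        n_keep = min(len(idx) for idx in per_flav.values())
        for f, idx in per_flav.items():
            indices[f].append(rng.choice(idx, size=n_keep, replace=False))
    return {f: np.sort(np.concatenate(chunks)) for f, chunks in indices.items()}
```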

umami.preprocessing_tools.resampling.importance_sampling_no_replace module#

Importance sampling without replacement module handling data preprocessing.

class umami.preprocessing_tools.resampling.importance_sampling_no_replace.UnderSamplingNoReplace(config)#

Bases: SimpleSamplingBase

The UnderSamplingNoReplace is used to prepare the training dataset. It makes sure that all flavour fractions are equal and the flavour distributions have the same shape as the target distribution. This is an alternative to the UnderSampling class, with the difference that it ensures that the predefined target distribution is always the final target distribution, regardless of pre-sampling flavour fractions and low statistics. This method also ensures that the final fractions are equal. Does not work well with taus as of now.

get_indices() dict#

Applies the undersampling to the given arrays.

Returns:

Indices for the jets to be used separately for each category and sample.

Return type:

dict

Raises:

ValueError – If no target is given.

get_sampling_probabilities(target_distribution: str = 'bjets', stats: dict | None = None) dict#

Computes probability ratios against the target distribution for each flavour. The probability sampling ensures that the resulting flavour fractions and the distribution shapes are the same.

Parameters:
  • target_distribution (str, optional) – Target distribution, i.e. bjets, to compute probability ratios against, by default “bjets”

  • stats (dict, optional) – Dictionary of stats such as bin count for different jet flavours, by default None

Returns:

A dictionary of the sampling probabilities for each flavour.

Return type:

dict

Raises:

ValueError – If target distribution class does not exist in your sample classes.

get_sampling_probability(target_stat: ndarray, original_stat: ndarray) dict#

Computes probability ratios against the target distribution.

Parameters:
  • target_stat (np.ndarray) – Target distribution or histogram, i.e. bjets histo, to compute probability ratios against.

  • original_stat (np.ndarray) – Original distribution or histogram, i.e. cjets histo, to scale using target_stat.

Returns:

A dictionary of the sampling probabilities for each flavour.

Return type:

dict
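The per-bin probability ratio described above can be sketched as a minimal stand-alone function (the actual method works on the class's stats dicts and returns a dict per flavour):

```python
import numpy as np

def get_sampling_probability(target_stat, original_stat):
    """Sketch: per-bin probability ratios of a target histogram against an
    original one, scaled so the largest ratio is 1 (maximal acceptance)."""
    ratios = np.divide(
        target_stat, original_stat,
        out=np.zeros_like(target_stat, dtype=float),
        where=original_stat > 0,  # empty original bins get probability 0
    )
    max_ratio = ratios.max()
    return ratios / max_ratio if max_ratio > 0 else ratios
```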

umami.preprocessing_tools.resampling.pdf_sampling module#

PDF sampling module handling data preprocessing.

class umami.preprocessing_tools.resampling.pdf_sampling.PDFSampling(config: object, flavour: int | None = None)#

Bases: ResamplingTools

An importance sampling approach that uses the ratios between the distributions to sample from and a target distribution as importance weights.

Run()#

Run function for PDF sampling class.

calculate_pdf(store_key: str, x_y_original: tuple | None = None, x_y_target: tuple | None = None, target_hist: ndarray | None = None, original_hist: ndarray | None = None, target_bins: tuple | None = None, bins: list | None = None, limits: list | None = None) None#

Calculates the histograms of the input data and uses them to calculate the PDF ratio; calculate_pdf_ratio is invoked here. Works either on dataframes or on pre-made histograms. Provides the PDF interpolation function which is used for sampling (entry in a dict). It is a property of the class.

Parameters:
  • store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)

  • x_y_original (tuple, optional) – A 2D tuple of the x and y datapoints to resample, by default None.

  • x_y_target (tuple, optional) – A 2D tuple of the target datapoints of x and y, by default None.

  • target_hist (np.ndarray, optional) – Histogram for the target, by default None

  • original_hist (np.ndarray, optional) – Histogram for the original flavour, by default None

  • target_bins (tuple, optional) – If using target_hist, need to define target_bins, a tuple with (binx, biny), by default None.

  • bins (list, optional) – Any binning input accepted by numpy histogram2d. Not used if histograms are passed instead of arrays, by default None.

  • limits (list, optional) – Limits for the binning. Not used if histograms are passed instead of arrays, by default None.

Raises:
  • ValueError – If feeding a histogram but not the bins in PDF calculation.

  • ValueError – If improper target input for PDF calculation of the store_key.

  • ValueError – If improper original flavour input for PDF calculation of the store_key.

calculate_pdf_ratio(store_key: str, h_target: ndarray, h_original: ndarray, x_bin_edges: ndarray, y_bin_edges: ndarray) None#

Receives the histograms of the target and the original data, the bin edges, and an optional maximum ratio value. Provides the PDF interpolation function which is used for sampling. It can be retrieved with inter_func and is a property of the class.

Parameters:
  • store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)

  • h_target (np.ndarray) – Output of numpy histogram2D for the target datapoints.

  • h_original (np.ndarray) – Output of numpy histogram2D for the original datapoints.

  • x_bin_edges (np.ndarray) – Array with the x axis bin edges.

  • y_bin_edges (np.ndarray) – Array with the y axis bin edges.
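The ratio-plus-interpolation step can be sketched as follows, assuming bilinear interpolation over the bin centres (`build_pdf_ratio_func` is a hypothetical helper, not the class method):

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

def build_pdf_ratio_func(h_target, h_original, x_bin_edges, y_bin_edges):
    """Sketch: ratio of target over original 2D histograms, interpolated
    with a RectBivariateSpline over the bin centres. Empty original bins
    get a ratio of zero so nothing is sampled from them."""
    ratio = np.divide(h_target, h_original,
                      out=np.zeros_like(h_target, dtype=float),
                      where=h_original > 0)
    x_centres = (x_bin_edges[:-1] + x_bin_edges[1:]) / 2
    y_centres = (y_bin_edges[:-1] + y_bin_edges[1:]) / 2
    # kx=ky=1 gives bilinear interpolation between bin centres
    return RectBivariateSpline(x_centres, y_centres, ratio, kx=1, ky=1)
```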

check_sample_consistency(samples: dict) None#

Helper function to check that each sample category contains the same set of samples (e.g. Z’ and ttbar both have b, c & light).

Parameters:

samples (dict) – Dict with the categories (ttbar, zpext) and their corresponding sample names.

Raises:
  • KeyError – If the sample which is requested is not in the preparation stage.

  • RuntimeError – Your specified samples in the sampling/samples block need to have the same category in each sample category.

combine_flavours(chunk_size: int = 1000000.0)#

This method loads the stored resampled flavour samples and combines them iteratively into a single file.

Parameters:

chunk_size (int, optional) – Number of jets that are loaded in one chunk, by default 1e6

file_to_histogram(sample_category: str, category_ind: int, sample_id: int, iterator: bool = True, chunk_size: int = 10000.0, bins: list | None = None, hist_range: list | None = None) dict#

Convert the provided sample into a 2d histogram which is used to calculate the PDF functions.

Parameters:
  • sample_category (str) – Sample category that is loaded.

  • category_ind (int) – Index of the category which is used.

  • sample_id (int) – Index of the sample that is used.

  • iterator (bool, optional) – Decide, if the iterative approach is used (True) or the in memory approach (False), by default True.

  • chunk_size (int, optional) – Chunk size for loading the jets in the iterative approach, by default 1e4.

  • bins (list, optional) – List with the bins to use for the 2d histogram, by default None.

  • hist_range (list, optional) – List with histogram ranges for the 2d histogram function, by default None.

Returns:

Results_dict – Dict with the 2d histogram info.

Return type:

dict

generate_flavour_pdf(sample_category: str, category_id: int, sample_id: int, iterator: bool = True) dict#
This method:
  • creates the flavour distribution (also separated),

  • produces the PDF between the flavour and the target.

Parameters:
  • sample_category (str) – The name of the category study.

  • category_id (int) – The location of the category in the list.

  • sample_id (int) – The location of the sample flavour in the category dict.

  • iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True.

Returns:

reading_dict – Adds a dictionary pointing to the interpolation functions to the class (and saves them). Returns None or the dataframe of the flavour (depending on whether the iterator approach is used) and the histogram of the flavour.

Return type:

dict

generate_number_sample(sample_id: int) None#

For a given sample, sets the target numbers, respecting flavour ratio and upsampling max ratio (if given).

Parameters:

sample_id (int) – Position of the flavour in the sample list.

generate_target_pdf(iterator: bool = True) None#

This method creates the target distribution (separated) and stores the associated histogram (used for sampling), the binning info, and the target numbers in memory.

Parameters:

iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True.

in_memory_resample(x_values: ndarray, y_values: ndarray, size: int, store_key: str, replacement: bool = True) ndarray#

Resample all of the datapoints at once. This requires that all datapoints fit into RAM.

Parameters:
  • x_values (np.ndarray) – x values of the datapoints which are to be resampled from (e.g. pT)

  • y_values (np.ndarray) – y values of the datapoints which are to be resampled from (e.g. eta)

  • size (int) – Number of jets which are resampled.

  • store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)

  • replacement (bool, optional) – Decide, if replacement is used in the resampling, by default True.

Returns:

sampled_indices – The indices of the sampled jets.

Return type:

np.ndarray

Raises:

ValueError – If x_values and y_values have different shapes.
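The in-memory resampling can be sketched as weighted drawing with numpy; `weight_func` here is a hypothetical stand-in for the stored interpolation function, and the function itself is a simplified illustration, not the class method:

```python
import numpy as np

def in_memory_resample(x_values, y_values, size, weight_func,
                       replacement=True, seed=42):
    """Sketch: evaluate PDF-ratio weights for each (x, y) datapoint and
    draw `size` indices with probability proportional to the weights."""
    if x_values.shape != y_values.shape:
        raise ValueError("x_values and y_values have different shapes")
    weights = np.asarray(
        [weight_func(x, y) for x, y in zip(x_values, y_values)], dtype=float
    )
    weights = np.clip(weights, 0, None)  # no negative probabilities
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(x_values), size=size, replace=replacement, p=probs)
```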

initialise_flavour_samples() None#

Initialises the input files: this just creates the map (based on the UnderSampling one). At this point the arrays of the two variables used for the sampling are loaded and saved into class variables.

property inter_func#

Return the dict with the interpolation functions inside.

Returns:

Dict with the interpolation functions inside.

Return type:

dict

load(file_name: str) RectBivariateSpline#

Load the interpolation function from file.

Parameters:

file_name (str) – Path where the pickle file is saved.

Returns:

The loaded interpolation function.

Return type:

RectBivariateSpline

load_index_generator(in_file: str, chunk_size: int)#

Generator that yields the indices of the jets that are to be loaded.

Parameters:
  • in_file (str) – Filepath of the input file.

  • chunk_size (int) – Chunk size of the jets that are loaded and yielded.

Yields:
  • indices (np.ndarray) – Indices of the jets which are to be loaded.

  • index_tuple (int) – End index of the chunk.
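The chunking logic of such a generator can be sketched as follows (driving it with a plain jet count rather than a file is an assumption of this sketch):

```python
import numpy as np

def load_index_generator(n_jets, chunk_size):
    """Sketch of chunked index loading: yield the indices for each chunk
    together with the chunk's end index."""
    start = 0
    while start < n_jets:
        end = min(start + chunk_size, n_jets)
        yield np.arange(start, end), end
        start = end
```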

load_samples(sample_category: str, sample_id: int)#

Load the input file of the specified category and id.

Parameters:
  • sample_category (str) – Sample category that is loaded.

  • sample_id (int) – Index of the sample.

Returns:

  • sample (str) – Name of the sample which is loaded.

  • samples (dict) – Dict with the info retrieved for resampling.

load_samples_generator(sample_category: str, sample_id: int, chunk_size: int)#

Generator for the loading of the samples.

Parameters:
  • sample_category (str) – Sample category that is loaded.

  • sample_id (int) – Index of the sample.

  • chunk_size (int) – Chunk size of the jets that are loaded and yielded.

Yields:
  • sample (str) – Name of the sample. “training_ttbar_bjets” for example.

  • samples (dict) – Dict with the loaded jet info needed for resampling.

  • Next_chunk (bool) – True if more chunks can be loaded, False if this was the last chunk.

  • n_jets_initial (int) – Number of jets available.

  • start_index (int) – Start index of the chunk.

property ratio#

Return the dict with the ratios inside.

Returns:

Dict with the ratios inside.

Return type:

dict

resample_chunk(r_resamp: ndarray, size: int, replacement: bool = True) ndarray#

Get the sampled indices from the PDF weights.

Parameters:
  • r_resamp (np.ndarray) – PDF weights.

  • size (int) – Number of jets to sample

  • replacement (bool, optional) – Decide, if replacement is used, by default True.

Returns:

sampled_indices – Indices of the resampled jets which are to be used.

Return type:

np.ndarray

resample_iterator(sample_category: str, sample_id: int, save_name: str, sample_name: str, chunk_size: int = 1000000.0) None#

Resample with the data not completely stored in memory. Will load the jets in chunks, computing first the sum of PDF weights and then sampling with replacement based on the normalised weights.

Parameters:
  • sample_category (str) – Sample category to resample.

  • sample_id (int) – Index of the sample which to be resampled.

  • save_name (str) – Filepath + Filename and ending of the file where to save the resampled jets to.

  • sample_name (str) – Name of the sample to use.

  • chunk_size (int, optional) – Chunk size which is loaded per step, by default 1e6.

return_unnormalised_pdf_weights(x_values: ndarray, y_values: ndarray, store_key: str) ndarray#

Calculate the unnormalised PDF weights and return them.

Parameters:
  • x_values (np.ndarray) – x values of the datapoints which are to be resampled from (e.g. pT)

  • y_values (np.ndarray) – y values of the datapoints which are to be resampled from (e.g. eta)

  • store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)

Returns:

r_resamp – Array with the PDF weights.

Return type:

np.ndarray

sample_flavour(sample_category: str, sample_id: int, iterator: bool = True, flavour_distribution: ndarray | None = None) ndarray#
This method:
  • samples the required amount based on the PDF and fractions,

  • stores the selected indices to memory.

Parameters:
  • sample_category (str) – The name of the category study.

  • sample_id (int) – The location of the sample flavour in the category dict.

  • iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True.

  • flavour_distribution (np.ndarray, optional) – None or numpy array, the loaded data (for the flavour). If it is None, an iterator method is used, by default None.

Returns:

selected_indices – Returns (and stores to memory, if iterator is False) the selected indices for the studied flavour. If iterator is True, None is returned.

Return type:

np.ndarray

save(inter_func: RectBivariateSpline, file_name: str, overwrite: bool = True) None#

Save the interpolation function to file.

Parameters:
  • inter_func (RectBivariateSpline) – Interpolation function.

  • file_name (str) – Path where the pickle file is saved.

  • overwrite (bool, optional) – Decide if the file is overwritten if it exists already, by default True

Raises:

ValueError – If no interpolation function is given.

save_complete_iterator(sample_category: str, sample_id: int, chunk_size: int = 100000.0) None#

Save the selected data to an output file with an iterative approach (generator) for both writing and reading, in chunks of size chunk_size.

Parameters:
  • sample_category (str) – Sample category to save

  • sample_id (int) – Sample index which is to be saved

  • chunk_size (int, optional) – Chunk size which is loaded per step, by default 1e5.

save_flavour(sample_category: str, sample_id: int, selected_indices: dict | None = None, chunk_size: int = 100000.0, iterator: bool = True)#

This method stores the selected data to memory (based on the given indices).

Parameters:
  • sample_category (str) – The name of the category study.

  • sample_id (int) – The location of the category in the list.

  • selected_indices (dict, optional) – Dict with the selected indices to be stored, by default None

  • chunk_size (int, optional) – The size of the chunks (the last chunk may be at most 2 * chunk_size), by default 1e5

  • iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True

save_partial_iterator(sample_category: str, sample_id: int, selected_indices: ndarray, chunk_size: int = 1000000.0) None#

Save the selected data to an output file with an iterative approach (generator) for writing only, writing in chunks of size chunk_size. The file is read in one go.

Parameters:
  • sample_category (str) – Sample category to save

  • sample_id (int) – Sample index which is to be saved

  • selected_indices (np.ndarray) – Array with the selected indices

  • chunk_size (int, optional) – Chunk size which is loaded per step, by default 1e6.

umami.preprocessing_tools.resampling.resampling_base module#

Resampling base module handling data preprocessing.

class umami.preprocessing_tools.resampling.resampling_base.JsonNumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)#

Bases: JSONEncoder

This class converts numpy types to a JSON-compatible format.

Parameters:

JSONEncoder (class) – Base class from the json package.

default(o)#

Overwrites the default method of the JSONEncoder class.

Parameters:

o (numpy integer, float or ndarray) – objects from json loader

Returns:

A JSON-serialisable version of the numpy object.

Return type:

object
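A minimal sketch of such an encoder; this mirrors the documented behaviour but is not the exact implementation:

```python
import json
import numpy as np

class JsonNumpyEncoder(json.JSONEncoder):
    """Sketch of a JSONEncoder that converts numpy scalars and arrays
    into native Python types before serialisation."""
    def default(self, o):
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        if isinstance(o, np.ndarray):
            return o.tolist()
        # fall back to the base class for anything else
        return super().default(o)
```

Usage: `json.dumps({"n": np.int64(3)}, cls=JsonNumpyEncoder)`.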

class umami.preprocessing_tools.resampling.resampling_base.Resampling(config: object)#

Bases: object

Base class for all resampling methods in umami.

get_bins(sample_vector: ndarray)#

Calculates the bin statistics for a DD histogram. This post might be helpful to understand the flattened bin numbering: https://stackoverflow.com/questions/63275441/can-i-get-binned-statistic-2d-to-return-bin-numbers-for-only-bins-in-range

Parameters:

sample_vector (np.ndarray) – The sample_vector from which we derive the bins

Returns:

  • binnumber (np.ndarray) – Array with bin number of each jet with same length as x and y

  • bins_indices_flat (np.ndarray) – Array with flat bin numbers mapped from DD with length nBins

  • statistic (np.ndarray) – Array with counts per bin, length nBins

retrieve_common_variables(sample_paths: list)#

Check if all samples have the same variables. If not, warnings are printed. It also returns the common variables which are available in all samples.

Parameters:

sample_paths (list) – Paths of all inputs files which are to be prepared.

Returns:

Dict with the common vars for the different datasets (jets, tracks). Key is the dataset name and the value is a list with all common vars.

Return type:

dict

static sampling_generator(file: str, indices: ndarray, label: int, label_classes: list, variables: dict | None = None, save_tracks: bool = False, tracks_names: list | None = None, chunk_size: int = 10000, seed: int = 42, duplicate: bool = False)#

Generator to iterate over datasets based on given indices.

This method also implements fancy indexing for H5 files by separating a list of indices that may contain duplicates into lists of unique indices. The splitting works as follows:

  • 1st list: all indices repeated at least 1 time (i.e. all of them).

  • 2nd list: all indices repeated at least 2 times.

  • 3rd list: all indices repeated at least 3 times, and so on.
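The splitting described above can be sketched with `np.unique` (`split_duplicate_indices` is a hypothetical name for illustration):

```python
import numpy as np

def split_duplicate_indices(indices):
    """Sketch of the duplicate-splitting trick: h5py fancy indexing needs
    strictly increasing unique indices, so a list with repeats is split
    into passes, the k-th pass holding every index repeated at least k times."""
    uniques, counts = np.unique(indices, return_counts=True)
    passes = []
    for k in range(1, counts.max() + 1):
        passes.append(uniques[counts >= k])
    return passes
```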

Parameters:
  • file (str) – the path to the h5 file to read

  • indices (list or numpy.array) – the indices of entries to read

  • label (int) – the label of the jets being read

  • label_classes (list) – the combined labelling scheme

  • variables (dict) – variables per dataset which should be used; if None, all variables are used, by default None

  • save_tracks (bool) – whether to store tracks, by default False

  • tracks_names (list) – list of tracks collection names to use, by default None

  • chunk_size (int) – the size of each chunk (last chunk might at most be twice this size), by default 10000

  • seed (int) – random seed to use, by default 42

  • duplicate (bool) – whether the reading should assume duplicates are present. DO NOT USE IF NO DUPLICATES ARE EXPECTED!, by default False

Yields:
  • numpy.ndarray – jets

  • numpy.ndarray – tracks, if save_tracks is True

  • numpy.ndarray – labels

write_file(indices: dict, chunk_size: int = 10000)#

Takes the indices calculated by the get_indices function, reads the corresponding jets in, and writes the selected jets from the samples to disk.

Parameters:
  • indices (dict) – Dict of indices as returned by the get_indices function.

  • chunk_size (int, optional) – Size of loaded chunks, by default 10_000

Raises:
  • TypeError – If concatenated samples have different shape.

  • TypeError – If the used samples don’t have the same content of variables.

class umami.preprocessing_tools.resampling.resampling_base.ResamplingTools(config: object)#

Bases: Resampling

Helper class for resampling.

concatenate_samples()#

Takes the initialised object from initialise_samples() and concatenates the samples with the same category into a dict which contains the sample vector: array(sample_size x 5) with pt, eta, jet_count, sample_id (ttbar: 0, zprime: 1) and sample_class.

Returns:

self.concat_samples = {“bjets”: {“jets”: array(sample_size x 5)}, “cjets”: {“jets”: array(sample_size x 5)}, …}

get_pt_eta_bin_statistics()#

Retrieve pt and eta bin statistics.

get_valid_class_categories(samples: dict)#

Helper function to check that the sample categories requested in the resampling were also defined in the sample preparation step. Returns the sample classes.

Parameters:

samples (dict) – Dict with the samples

Returns:

check_consistency – Dict with the consistency check results.

Return type:

dict

Raises:

RuntimeError – If your specified samples in the sampling block don’t have the same samples in each category.

initialise_samples(n_jets: int | None = None) None#

At this point the arrays of the 2 variables are loaded which are used for the sampling and saved into class variables.

Parameters:

n_jets (int, optional) – If custom_n_jets_initial is not set, use this value to decide how many jets are loaded from each sample, by default None.

Raises:

KeyError – If the samples are not correctly specified.

umami.preprocessing_tools.resampling.resampling_base.calculate_binning(bins: list) ndarray#

Calculate and return the bin edges for the provided bins.

Parameters:

bins (list) – Is either a list containing the np.linspace arguments, or a list of them

Returns:

bin_edges – Array with the bin edges

Return type:

np.ndarray
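A sketch of the binning calculation under the stated interface; the handling of overlapping edges between sub-lists (dropping duplicates via `np.unique`) is an assumption of this sketch:

```python
import numpy as np

def calculate_binning(bins):
    """Sketch: turn np.linspace arguments into bin edges. A flat list like
    [0, 10, 11] is passed straight to np.linspace; a list of such lists is
    concatenated into one edge array with duplicate joins dropped."""
    if any(isinstance(item, list) for item in bins):
        edges = np.concatenate([np.linspace(*args) for args in bins])
        return np.unique(edges)
    return np.linspace(*bins)
```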

umami.preprocessing_tools.resampling.resampling_base.correct_fractions(n_jets: list, target_fractions: list, class_names: list | None = None, verbose: bool = True) ndarray#

Corrects the fractions of N classes

Parameters:
  • n_jets (list) – List of the actual number of available jets per class

  • target_fractions (list) – List of the target fraction per class

  • class_names (list, optional) – List with the class names, by default None

  • verbose (bool, optional) – Decide, if more detailed output is logged, by default True

Returns:

n_jets_to_keep – Array of N_jets to keep per class

Return type:

np.ndarray

Raises:
  • ValueError – If not all N_jets entries are bigger than 0.

  • ValueError – If the ‘target_fractions’ don’t add up to one.
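The fraction correction can be sketched as follows, assuming the limiting class caps the total (a simplified stand-alone version, not the exact implementation):

```python
import numpy as np

def correct_fractions(n_jets, target_fractions):
    """Sketch: scale down to the limiting class so the kept jets match the
    target fractions exactly. The class with the smallest n/fraction ratio
    determines the achievable total."""
    n_jets = np.asarray(n_jets, dtype=float)
    fractions = np.asarray(target_fractions, dtype=float)
    if not np.all(n_jets > 0):
        raise ValueError("All n_jets entries must be bigger than 0.")
    if not np.isclose(fractions.sum(), 1):
        raise ValueError("target_fractions don't add up to one.")
    total = np.min(n_jets / fractions)  # limiting class sets the total
    return (fractions * total).astype(int)
```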

umami.preprocessing_tools.resampling.resampling_base.quick_check_duplicates(arr: list) bool#

This performs a quick duplicate check in the list arr. If a duplicate is found, it returns True immediately.

Parameters:

arr (list) – List with entries.

Returns:

duplicate_is_there – True if an element appears more than once in the list, False if not.

Return type:

bool
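A minimal sketch of such an early-exit duplicate check:

```python
def quick_check_duplicates(arr):
    """Sketch: grow a set of seen elements and return True as soon as
    an element is encountered twice; otherwise return False."""
    seen = set()
    for element in arr:
        if element in seen:
            return True
        seen.add(element)
    return False
```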

umami.preprocessing_tools.resampling.resampling_base.read_dataframe_repetition(file_df: ndarray, loading_indices: list, duplicate: bool = False, save_tracks: bool = False, tracks_names: list | None = None)#

Implements a fancier reading of an H5 dataframe (allowing repeated indices). Designed to read an h5 file with jets (and tracks if save_tracks is True).

Parameters:
  • file_df (file) – file containing datasets

  • loading_indices (list) – indices to load

  • duplicate (bool) – whether the reading should assume duplicates are present. DO NOT USE IF NO DUPLICATES ARE EXPECTED!, by default False

  • save_tracks (bool) – whether to store tracks, by default False

  • tracks_names (list) – list of tracks collection names to use, by default None

Returns:

  • numpy.ndarray – jets

  • numpy.ndarray – tracks if save_tracks is True

umami.preprocessing_tools.resampling.weighting module#

Weighting module handling data preprocessing.

class umami.preprocessing_tools.resampling.weighting.Weighting(config: object)#

Bases: ResamplingTools

Weighting class.

Run()#

Run function for Weighting class.

get_flavour_weights()#

Calculate ratios (weights) from bins in 2d (pt,eta) histogram between different flavours.

get_indices()#

Applies the UnderSampling to the given arrays.

Returns:

Indices for the jets to be used separately for each category and sample.

Module contents#