umami.preprocessing_tools.resampling package#
Submodules#
umami.preprocessing_tools.resampling.count_sampling module#
Count sampling module handling data preprocessing.
- class umami.preprocessing_tools.resampling.count_sampling.SimpleSamplingBase(config)#
Bases:
ResamplingTools
,ABC
A base class for simple sampling methods like UnderSamplingNoReplace and UnderSampling.
- Run()#
Run function executing full chain.
- abstract get_indices()#
Applies the sampling to the given arrays. Returns the indices for the jets to be used separately for each category and sample.
- class umami.preprocessing_tools.resampling.count_sampling.UnderSampling(config)#
Bases:
SimpleSamplingBase
Undersampling class.
- Run()#
Run function executing full chain.
- get_indices()#
Applies the UnderSampling to the given arrays.
- Returns:
Returns the indices for the jets to be used separately for each
category and sample.
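The counting logic behind get_indices can be sketched as follows. This is an illustrative reconstruction, not umami's exact code: the helper name undersample_indices and the flavour-to-binnumber input layout are assumptions. For each (pt, eta) bin, every flavour keeps only as many jets as the rarest flavour has in that bin:

```python
import numpy as np

rng = np.random.default_rng(42)

def undersample_indices(binnumbers: dict) -> dict:
    """For each (pt, eta) bin, keep only as many jets per flavour
    as the rarest flavour has in that bin (sketch of the counting
    approach; the input maps flavour name -> bin number per jet)."""
    all_bins = np.unique(np.concatenate(list(binnumbers.values())))
    keep = {flavour: [] for flavour in binnumbers}
    for bin_id in all_bins:
        # jets of each flavour falling into this bin
        per_flavour = {f: np.nonzero(arr == bin_id)[0] for f, arr in binnumbers.items()}
        n_keep = min(len(idx) for idx in per_flavour.values())
        for f, idx in per_flavour.items():
            keep[f].append(rng.choice(idx, size=n_keep, replace=False))
    return {f: np.sort(np.concatenate(parts)) for f, parts in keep.items()}

# toy example: two (pt, eta) bins, cjets outnumber bjets in bin 1
indices = undersample_indices({
    "bjets": np.array([0, 0, 1]),
    "cjets": np.array([0, 1, 1, 1]),
})
```

Each flavour ends up with the same number of jets per bin, which is what makes the resulting distributions match in shape.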
umami.preprocessing_tools.resampling.importance_sampling_no_replace module#
Importance sampling without replacement module handling data preprocessing.
- class umami.preprocessing_tools.resampling.importance_sampling_no_replace.UnderSamplingNoReplace(config)#
Bases:
SimpleSamplingBase
The UnderSamplingNoReplace class is used to prepare the training dataset. It makes sure that all flavour fractions are equal and that the flavour distributions have the same shape as the target distribution. It is an alternative to the UnderSampling class, with the difference that it ensures the predefined target distribution is always the final distribution, regardless of pre-sampling flavour fractions and low statistics. Does not work well with taus as of now.
- get_indices() dict #
Applies the undersampling to the given arrays.
- Returns:
Indices for the jets to be used separately for each category and sample.
- Return type:
dict
- Raises:
ValueError – If no target is given.
- get_sampling_probabilities(target_distribution: str = 'bjets', stats: dict | None = None) dict #
Computes probability ratios against the target distribution for each flavour. Sampling with these probabilities ensures that the resulting flavour fractions and distribution shapes are the same.
- Parameters:
target_distribution (str, optional) – Target distribution, i.e. bjets, to compute probability ratios against, by default “bjets”
stats (dict, optional) – Dictionary of stats such as bin count for different jet flavours, by default None
- Returns:
A dictionary of the sampling probabilities for each flavour.
- Return type:
dict
- Raises:
ValueError – If target distribution class does not exist in your sample classes.
- get_sampling_probability(target_stat: ndarray, original_stat: ndarray) dict #
Computes probability ratios against the target distribution.
- Parameters:
target_stat (np.ndarray) – Target distribution or histogram, i.e. bjets histo, to compute probability ratios against.
original_stat (np.ndarray) – Original distribution or histogram, i.e. cjets histo, to scale using target_stat.
- Returns:
A dictionary of the sampling probabilities for each flavour.
- Return type:
dict
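The per-bin ratio computation can be sketched like this. The scaling so that the maximum probability is one is a common convention and an assumption here; umami's exact normalisation may differ:

```python
import numpy as np

def get_sampling_probability(target_stat, original_stat):
    """Per-bin probability ratio target / original, scaled so the
    maximum probability is one (an illustrative convention)."""
    ratio = np.divide(
        target_stat.astype(float), original_stat,
        out=np.zeros_like(target_stat, dtype=float),
        where=original_stat > 0,  # empty original bins get probability 0
    )
    return ratio / ratio.max()

target = np.array([10, 20, 5])     # e.g. bjets histogram
original = np.array([40, 20, 10])  # e.g. cjets histogram
probabilities = get_sampling_probability(target, original)
```

Bins where the original flavour is over-represented relative to the target get a proportionally lower sampling probability.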
umami.preprocessing_tools.resampling.pdf_sampling module#
PDF sampling module handling data preprocessing.
- class umami.preprocessing_tools.resampling.pdf_sampling.PDFSampling(config: object, flavour: int | None = None)#
Bases:
ResamplingTools
An importance sampling approach using the ratios between the distributions to sample from and a target distribution as importance weights.
- Run()#
Run function for PDF sampling class.
- calculate_pdf(store_key: str, x_y_original: tuple | None = None, x_y_target: tuple | None = None, target_hist: ndarray | None = None, original_hist: ndarray | None = None, target_bins: tuple | None = None, bins: list | None = None, limits: list | None = None) None #
Calculates the histograms of the input data and uses them to calculate the PDF ratio; calculate_pdf_ratio is invoked here. Works either on dataframes or on pre-made histograms. Provides the PDF interpolation function which is used for sampling (entry in a dict). It is a property of the class.
- Parameters:
store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)
x_y_original (tuple, optional) – A 2D tuple of the to resample datapoints of x and y, by default None.
x_y_target (tuple, optional) – A 2D tuple of the target datapoints of x and y, by default None.
target_hist (np.ndarray, optional) – Histogram for the target, by default None
original_hist (np.ndarray, optional) – Histogram for the original flavour, by default None
target_bins (tuple, optional) – If using target_hist, need to define target_bins, a tuple with (binx, biny), by default None.
bins (list, optional) – This can be all possible binning inputs as for numpy histogram2d. Not used if hist are passed instead of arrays, by default None.
limits (list, optional) – Limits for the binning. Not used if hist are passed instead of arrays, by default None.
- Raises:
ValueError – If feeding a histogram but not the bins in PDF calculation.
ValueError – If improper target input for PDF calculation of the store_key.
ValueError – If improper original flavour input for PDF calculation of store_key
- calculate_pdf_ratio(store_key: str, h_target: ndarray, h_original: ndarray, x_bin_edges: ndarray, y_bin_edges: ndarray) None #
Receives the histograms of the target and original data, the bins, and an optional maximum ratio value. Provides the PDF interpolation function which is used for sampling; it can be retrieved via inter_func and is a property of the class.
- Parameters:
store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)
h_target (np.ndarray) – Output of numpy histogram2D for the target datapoints.
h_original (np.ndarray) – Output of numpy histogram2D for the original datapoints.
x_bin_edges (np.ndarray) – Array with the x axis bin edges.
y_bin_edges (np.ndarray) – Array with the y axis bin edges.
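Since the class stores its interpolation functions as scipy RectBivariateSpline objects (see the load and save methods below), the core of this step can be sketched as follows. The bin-centre anchoring, the linear spline order, and the zero fill for empty bins are illustrative assumptions:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

def pdf_ratio_spline(h_target, h_original, x_bin_edges, y_bin_edges):
    """Build an interpolation function of the per-bin ratio
    target / original, anchored at the bin centres."""
    ratio = np.divide(
        h_target, h_original,
        out=np.zeros_like(h_target, dtype=float),
        where=h_original > 0,  # empty original bins get ratio 0
    )
    x_centres = (x_bin_edges[:-1] + x_bin_edges[1:]) / 2
    y_centres = (y_bin_edges[:-1] + y_bin_edges[1:]) / 2
    return RectBivariateSpline(x_centres, y_centres, ratio, kx=1, ky=1)

edges = np.linspace(0.0, 4.0, 5)
h_original = np.ones((4, 4))
spline = pdf_ratio_spline(2 * h_original, h_original, edges, edges)
```

The returned spline can then be evaluated at arbitrary (pt, eta) points during sampling.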
- check_sample_consistency(samples: dict) None #
Helper function to check that each sample category contains the same set of flavours (e.g. Z’ and ttbar both have b, c & light).
- Parameters:
samples (dict) – Dict with the categories (ttbar, zpext) and their corresponding sample names.
- Raises:
KeyError – If the sample which is requested is not in the preparation stage.
RuntimeError – Your specified samples in the sampling/samples block need to have the same category in each sample category.
- combine_flavours(chunk_size: int = 1000000.0)#
This method loads the stored resampled flavour samples and combines them iteratively into a single file.
- Parameters:
chunk_size (int, optional) – Number of jets that are loaded in one chunk, by default 1e6
- file_to_histogram(sample_category: str, category_ind: int, sample_id: int, iterator: bool = True, chunk_size: int = 10000.0, bins: list | None = None, hist_range: list | None = None) dict #
Convert the provided sample into a 2d histogram which is used to calculate the PDF functions.
- Parameters:
sample_category (str) – Sample category that is loaded.
category_ind (int) – Index of the category which is used.
sample_id (int) – Index of the sample that is used.
iterator (bool, optional) – Decide, if the iterative approach is used (True) or the in memory approach (False), by default True.
chunk_size (int, optional) – Chunk size for loading the jets in the iterative approach, by default 1e4.
bins (list, optional) – List with the bins to use for the 2d histogram, by default None.
hist_range (list, optional) – List with histogram ranges for the 2d histogram function, by default None.
- Returns:
Results_dict – Dict with the 2d histogram info.
- Return type:
dict
- generate_flavour_pdf(sample_category: str, category_id: int, sample_id: int, iterator: bool = True) dict #
- This method:
creates the flavour distribution (also separated),
produces the PDF between the flavour and the target.
- Parameters:
sample_category (str) – The name of the category study.
category_id (int) – The location of the category in the list.
sample_id (int) – The location of the sample flavour in the category dict.
iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True.
- Returns:
reading_dict – Adds a dictionary entry to the class pointing to the interpolation functions (and saves them). Returns None or the dataframe of the flavour (depending on whether the iterator approach is used) and the histogram of the flavour.
- Return type:
dict
- generate_number_sample(sample_id: int) None #
For a given sample, sets the target numbers, respecting flavour ratio and upsampling max ratio (if given).
- Parameters:
sample_id (int) – Position of the flavour in the sample list.
- generate_target_pdf(iterator: bool = True) None #
This method creates the target distribution (separated) and stores the associated histogram (used for sampling), the binning info, and the target numbers in memory.
- Parameters:
iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True.
- in_memory_resample(x_values: ndarray, y_values: ndarray, size: int, store_key: str, replacement: bool = True) ndarray #
Resample all of the datapoints at once. This requires that all datapoints fit into RAM.
- Parameters:
x_values (np.ndarray) – x values of the datapoints which are to be resampled from (e.g. pT)
y_values (np.ndarray) – y values of the datapoints which are to be resampled from (e.g. eta)
size (int) – Number of jets which are resampled.
store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)
replacement (bool, optional) – Decide, if replacement is used in the resampling, by default True.
- Returns:
sampled_indices – The indices of the sampled jets.
- Return type:
np.ndarray
- Raises:
ValueError – If x_values and y_values have different shapes.
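The in-memory draw can be sketched with numpy's weighted sampling. The explicit `weights` argument here is an assumption standing in for the PDF weights the real method derives from the interpolation function stored under `store_key`:

```python
import numpy as np

def resample_in_memory(x_values, y_values, weights, size, replacement=True, seed=42):
    """Draw `size` jet indices with probability proportional to the
    PDF weights (illustrative sketch of the in-memory resampling)."""
    if x_values.shape != y_values.shape:
        raise ValueError("x_values and y_values have different shapes")
    rng = np.random.default_rng(seed)
    return rng.choice(len(x_values), size=size, replace=replacement,
                      p=weights / weights.sum())

x = np.array([20.0, 50.0, 100.0])    # e.g. pT
y = np.array([0.1, 1.2, 2.0])        # e.g. eta
weights = np.array([0.0, 1.0, 0.0])  # only the middle jet has weight
sampled_indices = resample_in_memory(x, y, weights, size=5)
```

With replacement enabled (the default), the same jet index can appear several times in the output.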
- initialise_flavour_samples() None #
Initialise the input files: this creates the sample map (based on the UnderSampling approach). At this point the arrays of the two variables which are used for the sampling are loaded and saved into class variables.
- property inter_func#
Return the dict with the interpolation functions inside.
- Returns:
Dict with the interpolation functions inside.
- Return type:
inter_func_dict
- load(file_name: str) RectBivariateSpline #
Load the interpolation function from file.
- Parameters:
file_name (str) – Path where the pickle file is saved.
- Returns:
The loaded interpolation function.
- Return type:
RectBivariateSpline
- load_index_generator(in_file: str, chunk_size: int)#
Generator that yields the indices of the jets that are to be loaded.
- Parameters:
in_file (str) – Filepath of the input file.
chunk_size (int) – Chunk size of the jets that are loaded and yielded.
- Yields:
indices (np.ndarray) – Indices of the jets which are to be loaded.
index_tuple (int) – End index of the chunk.
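The chunked yielding can be sketched as follows. This is a hypothetical helper: the real generator reads the jet count from the input file rather than taking it as an argument:

```python
import numpy as np

def index_generator(n_jets, chunk_size):
    """Yield (indices, end_index) per chunk, mirroring the documented
    (indices, index_tuple) output of load_index_generator."""
    for start in range(0, n_jets, chunk_size):
        end = min(start + chunk_size, n_jets)
        yield np.arange(start, end), end

chunks = list(index_generator(10, 4))
```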
- load_samples(sample_category: str, sample_id: int)#
Load the input file of the specified category and id.
- Parameters:
sample_category (str) – Sample category that is loaded.
sample_id (int) – Index of the sample.
- Returns:
sample (str) – Name of the sample which is loaded.
samples (dict) – Dict with the info retrieved for resampling.
- load_samples_generator(sample_category: str, sample_id: int, chunk_size: int)#
Generator for the loading of the samples.
- Parameters:
sample_category (str) – Sample category that is loaded.
sample_id (int) – Index of the sample.
chunk_size (int) – Chunk size of the jets that are loaded and yielded.
- Yields:
sample (str) – Name of the sample. “training_ttbar_bjets” for example.
samples (dict) – Dict with the loaded jet info needed for resampling.
Next_chunk (bool) – True if more chunks can be loaded, False if this was the last chunk.
n_jets_initial (int) – Number of jets available.
start_index (int) – Start index of the chunk.
- property ratio#
Return the dict with the ratios inside.
- Returns:
Dict with the ratios inside.
- Return type:
Ratio_dict
- resample_chunk(r_resamp: ndarray, size: int, replacement: bool = True) ndarray #
Get sampled indices from the PDF weights.
- Parameters:
r_resamp (np.ndarray) – PDF weights.
size (int) – Number of jets to sample
replacement (bool, optional) – Decide, if replacement is used, by default True.
- Returns:
sampled_indices – Indices of the resampled jets which are to be used.
- Return type:
np.ndarray
- resample_iterator(sample_category: str, sample_id: int, save_name: str, sample_name: str, chunk_size: int = 1000000.0) None #
Resample with the data not completely stored in memory. Loads the jets in chunks, first computing the sum of the PDF weights and then sampling with replacement based on the normalised weights.
- Parameters:
sample_category (str) – Sample category to resample.
sample_id (int) – Index of the sample which to be resampled.
save_name (str) – Filepath + Filename and ending of the file where to save the resampled jets to.
sample_name (str) – Name of the sample to use.
chunk_size (int, optional) – Chunk size which is loaded per step, by default 1e6.
- return_unnormalised_pdf_weights(x_values: ndarray, y_values: ndarray, store_key: str) ndarray #
Calculate the unnormalised PDF weights and return them.
- Parameters:
x_values (np.ndarray) – x values of the datapoints which are to be resampled from (e.g. pT)
y_values (np.ndarray) – y values of the datapoints which are to be resampled from (e.g. eta)
store_key (str) – Key of the interpolation function to be added to self.inter_func_dict (and self._ratio_dict)
- Returns:
r_resamp – Array with the PDF weights.
- Return type:
np.ndarray
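Evaluating the stored interpolation function point-wise yields the unnormalised weights. The sketch below assumes a RectBivariateSpline (as used elsewhere in this class); clipping negative spline artefacts to zero is an illustrative safeguard, not necessarily umami's behaviour:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# hypothetical interpolation function standing in for
# self.inter_func_dict[store_key]; it encodes the ratio x + y
grid = np.linspace(0.0, 3.0, 4)
inter_func = RectBivariateSpline(grid, grid, np.add.outer(grid, grid), kx=1, ky=1)

def unnormalised_pdf_weights(x_values, y_values, inter_func):
    """Evaluate the interpolation function at the (x, y) datapoints."""
    return np.clip(inter_func.ev(x_values, y_values), 0.0, None)

weights = unnormalised_pdf_weights(np.array([1.0, 2.0]), np.array([1.0, 0.5]), inter_func)
```

Normalising these weights to unit sum then gives the sampling probabilities used by resample_chunk.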
- sample_flavour(sample_category: str, sample_id: int, iterator: bool = True, flavour_distribution: ndarray | None = None) ndarray #
- This method:
samples the required amount based on the PDF and the fractions,
stores the selected indices to memory.
- Parameters:
sample_category (str) – The name of the category study.
sample_id (int) – The location of the sample flavour in the category dict.
iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True.
flavour_distribution (np.ndarray, optional) – None or numpy array, the loaded data (for the flavour). If it is None, an iterator method is used, by default None.
- Returns:
selected_indices – Returns (and stores to memory, if iterator is false) the selected indices for the flavour studied. If iterator is True, a None will be returned.
- Return type:
np.ndarray
- save(inter_func: RectBivariateSpline, file_name: str, overwrite: bool = True) None #
Save the interpolation function to file.
- Parameters:
inter_func (RectBivariateSpline) – Interpolation function.
file_name (str) – Path where the pickle file is saved.
overwrite (bool, optional) – Decide if the file is overwritten if it exists already, by default True
- Raises:
ValueError – If no interpolation function is given.
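Since the docstrings mention pickle files, save and load can be sketched as below. The helper names save_func and load_func are hypothetical; only the pickle round-trip of the RectBivariateSpline is assumed:

```python
import os
import pickle
import tempfile
import numpy as np
from scipy.interpolate import RectBivariateSpline

def save_func(inter_func, file_name, overwrite=True):
    """Pickle the interpolation function to file."""
    if inter_func is None:
        raise ValueError("No interpolation function given")
    if os.path.exists(file_name) and not overwrite:
        return
    with open(file_name, "wb") as pickle_file:
        pickle.dump(inter_func, pickle_file)

def load_func(file_name):
    """Load a pickled interpolation function from file."""
    with open(file_name, "rb") as pickle_file:
        return pickle.load(pickle_file)

grid = np.linspace(0.0, 1.0, 3)
inter_func = RectBivariateSpline(grid, grid, np.ones((3, 3)), kx=1, ky=1)
path = os.path.join(tempfile.gettempdir(), "inter_func_demo.pkl")
save_func(inter_func, path)
restored = load_func(path)
```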
- save_complete_iterator(sample_category: str, sample_id: int, chunk_size: int = 100000.0) None #
Save the selected data to an output file with an iterative approach (generator) for both writing and reading, in chunks of size chunk_size.
- Parameters:
sample_category (str) – Sample category to save
sample_id (int) – Sample index which is to be saved
chunk_size (int, optional) – Chunk size which is loaded per step, by default 1e5.
- save_flavour(sample_category: str, sample_id: int, selected_indices: dict | None = None, chunk_size: int = 100000.0, iterator: bool = True)#
This method stores the selected data to memory (based on the given indices).
- Parameters:
sample_category (str) – The name of the category study.
sample_id (int) – The location of the category in the list.
selected_indices (dict, optional) – The location of the sample flavour in the category dict, by default None
chunk_size (int, optional) – The size of the chunks (the last chunk may be at most 2 * chunk_size), by default 1e5
iterator (bool, optional) – Whether to use the iterator approach or load the whole sample in memory, by default True
- save_partial_iterator(sample_category: str, sample_id: int, selected_indices: ndarray, chunk_size: int = 1000000.0) None #
Save the selected data to an output file with an iterative approach (generator) for writing only, writing in chunks of size chunk_size. The file is read in one go.
- Parameters:
sample_category (str) – Sample category to save
sample_id (int) – Sample index which is to be saved
selected_indices (np.ndarray) – Array with the selected indices
chunk_size (int, optional) – Chunk size which is loaded per step, by default 1e6.
umami.preprocessing_tools.resampling.resampling_base module#
Resampling base module handling data preprocessing.
- class umami.preprocessing_tools.resampling.resampling_base.JsonNumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)#
Bases:
JSONEncoder
This class converts numpy types to a JSON compatible format.
- Parameters:
JSONEncoder (class) – base class from json package
- default(o)#
Overwrites the default function of the JSONEncoder class.
- Parameters:
o (numpy integer, float or ndarray) – objects from json loader
- Returns:
modified JSONEncoder class
- Return type:
class
- class umami.preprocessing_tools.resampling.resampling_base.Resampling(config: object)#
Bases:
object
Base class for all resampling methods in umami.
- get_bins(sample_vector: ndarray)#
Calculates the bin statistics for a DD histogram. This post might be helpful to understand the flattened bin numbering: https://stackoverflow.com/questions/63275441/can-i-get-binned-statistic-2d-to-return-bin-numbers-for-only-bins-in-range
- Parameters:
sample_vector (np.ndarray) – The sample_vector from which we derive the bins
- Returns:
binnumber (np.ndarray) – Array with bin number of each jet with same length as x and y
bins_indices_flat (np.ndarray) – Array with flat bin numbers mapped from DD with length nBins
statistic (np.ndarray) – Array with counts per bin, length nBins
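A sketch using scipy.stats.binned_statistic_2d, which is what the linked Stack Overflow post discusses. The helper name get_bin_stats is hypothetical, and note that scipy's flat bin numbering includes under/overflow bins:

```python
import numpy as np
from scipy.stats import binned_statistic_2d

def get_bin_stats(pt, eta, pt_edges, eta_edges):
    """Flattened bin number per jet plus counts per occupied bin."""
    _, _, _, binnumber = binned_statistic_2d(
        pt, eta, values=pt, statistic="count",
        bins=[pt_edges, eta_edges],
    )
    # occupied flat bin ids and the jet count in each of them
    bins_indices_flat, statistic = np.unique(binnumber, return_counts=True)
    return binnumber, bins_indices_flat, statistic

pt = np.array([10.0, 10.0, 30.0])
eta = np.array([0.5, 0.5, 1.5])
binnumber, flat_bins, statistic = get_bin_stats(pt, eta, [0, 20, 40], [0, 1, 2])
```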
- retrieve_common_variables(sample_paths: list)#
Check if all samples have the same variables. If not, warnings are printed. It also returns the common variables which are available in all samples.
- Parameters:
sample_paths (list) – Paths of all inputs files which are to be prepared.
- Returns:
Dict with the common vars for the different datasets (jets, tracks). Key is the dataset name and the value is a list with all common vars.
- Return type:
dict
- static sampling_generator(file: str, indices: ndarray, label: int, label_classes: list, variables: dict | None = None, save_tracks: bool = False, tracks_names: list | None = None, chunk_size: int = 10000, seed: int = 42, duplicate: bool = False)#
Generator to iterate over datasets based on given indices.
This method also implements fancy indexing for H5 files by separating a list of indices that may contain duplicates into lists of unique indices. The splitting works as follows:
1st list: all indices repeated at least 1 time (i.e. all of them).
2nd list: all indices repeated at least 2 times.
3rd list: all indices repeated at least 3 times, and so on.
- Parameters:
file (str) – the path to the h5 file to read
indices (list or numpy.array) – the indices of entries to read
label (int) – the label of the jets being read
label_classes (list) – the combined labelling scheme
variables (dict) – variables per dataset which should be used; if None, all variables are used, by default None
save_tracks (bool) – whether to store tracks, by default False
tracks_names (list) – list of tracks collection names to use, by default None
chunk_size (int) – the size of each chunk (last chunk might at most be twice this size), by default 10000
seed (int) – random seed to use, by default 42
duplicate (bool) – whether the reading should assume duplicates are present. DO NOT USE IF NO DUPLICATES ARE EXPECTED!, by default False
- Yields:
numpy.ndarray – jets
numpy.ndarray – tracks, if save_tracks is True
numpy.ndarray – labels
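The duplicate-splitting scheme described above can be sketched as follows; each resulting level is a strictly increasing list of unique indices and is therefore a valid h5py fancy-indexing selection:

```python
import numpy as np

def split_duplicate_indices(indices):
    """Split a possibly-repeated index list into lists of unique,
    sorted indices: level k holds every index occurring at least
    k times."""
    unique, counts = np.unique(np.asarray(indices), return_counts=True)
    return [np.sort(unique[counts >= k]) for k in range(1, int(counts.max()) + 1)]

levels = split_duplicate_indices([3, 1, 3, 3, 7, 1])
```

Reading each level once and concatenating the results reproduces the originally requested multiset of entries.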
- write_file(indices: dict, chunk_size: int = 10000)#
Takes the indices calculated by the get_indices function as input, reads the corresponding jets in, and writes the selected jets from the samples to disk.
- Parameters:
indices (dict) – Dict of indices as returned by the get_indices function.
chunk_size (int, optional) – Size of loaded chunks, by default 10_000
- Raises:
TypeError – If concatenated samples have different shape.
TypeError – If the used samples don’t have the same content of variables.
- class umami.preprocessing_tools.resampling.resampling_base.ResamplingTools(config: object)#
Bases:
Resampling
Helper class for resampling.
- concatenate_samples()#
Takes the initialised object from initialise_samples() and concatenates samples of the same category into a dict containing the sample vector: array(sample_size x 5) with pt, eta, jet_count, sample_id (ttbar: 0, zprime: 1) and sample_class.
- Returns:
self.concat_samples = {“bjets”: {“jets”: array(sample_size x 5)}, “cjets”: {“jets”: array(sample_size x 5)}, …}
- get_pt_eta_bin_statistics()#
Retrieve pt and eta bin statistics.
- get_valid_class_categories(samples: dict)#
Helper function to check that the sample categories requested in the resampling were also defined in the sample preparation step. Returns the sample classes.
- Parameters:
samples (dict) – Dict with the samples
- Returns:
check_consistency – Dict with the consistency check results.
- Return type:
dict
- Raises:
RuntimeError – If your specified samples in the sampling block don’t have the same samples in each category.
- initialise_samples(n_jets: int | None = None) None #
Loads the arrays of the two variables which are used for the sampling and saves them into class variables.
- Parameters:
n_jets (int, optional) – If custom_n_jets_initial is not set, use this value to decide how many jets are loaded from each sample. By default None
- Raises:
KeyError – If the samples are not correctly specified.
- umami.preprocessing_tools.resampling.resampling_base.calculate_binning(bins: list) ndarray #
Calculate and return the bin edges for the provided bins.
- Parameters:
bins (list) – Is either a list containing the np.linspace arguments, or a list of them
- Returns:
bin_edges – Array with the bin edges
- Return type:
np.ndarray
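The described behaviour can be sketched as below, assuming the nested lists are np.linspace argument triples [start, stop, n]; the duplicate-joint removal via np.unique is an illustrative choice, not necessarily the real implementation:

```python
import numpy as np

def calculate_binning(bins):
    """Turn one [start, stop, n] triple, or a list of such triples,
    into a single array of bin edges."""
    if any(isinstance(entry, list) for entry in bins):
        # chain several linspace segments into one edge array
        return np.unique(np.concatenate([np.linspace(*entry) for entry in bins]))
    return np.linspace(*bins)

edges = calculate_binning([[0, 1, 3], [1, 2, 3]])
```

This allows a coarse binning in one region and a fine binning in another, e.g. fine pT bins at low pT.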
- umami.preprocessing_tools.resampling.resampling_base.correct_fractions(n_jets: list, target_fractions: list, class_names: list | None = None, verbose: bool = True) ndarray #
Corrects the fractions of N classes
- Parameters:
n_jets (list) – List of the actual number of available jets per class
target_fractions (list) – List of the target fraction per class
class_names (list, optional) – List with the class names, by default None
verbose (bool, optional) – Decide, if more detailed output is logged, by default True
- Returns:
n_jets_to_keep – Array of N_jets to keep per class
- Return type:
np.ndarray
- Raises:
ValueError – If not all N_jets entries are bigger than 0.
ValueError – If the ‘target_fractions’ don’t add up to one.
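The correction can be sketched as follows: the class that is scarcest relative to its target fraction caps the achievable total, and every class is then scaled to match. The floor-based rounding is an illustrative assumption:

```python
import numpy as np

def correct_fractions(n_jets, target_fractions):
    """Number of jets to keep per class so the kept totals follow
    the target fractions."""
    n_jets = np.asarray(n_jets, dtype=float)
    fractions = np.asarray(target_fractions, dtype=float)
    if not np.all(n_jets > 0):
        raise ValueError("All n_jets entries must be bigger than 0")
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("target_fractions don't add up to one")
    total = np.min(n_jets / fractions)  # limiting class sets the total
    return np.floor(total * fractions).astype(int)

keep = correct_fractions([1000, 400, 300], [0.5, 0.25, 0.25])
```

Here the second class limits the total, so the first class is cut from 1000 to 600 jets.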
- umami.preprocessing_tools.resampling.resampling_base.quick_check_duplicates(arr: list) bool #
This performs a quick duplicate check on the list arr. If a duplicate is found, it directly returns True.
- Parameters:
arr (list) – List with entries.
- Returns:
duplicate_is_there – True if an element appears more than once in the list, False if not.
- Return type:
bool
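A minimal early-return implementation matching this description:

```python
def quick_check_duplicates(arr: list) -> bool:
    """Return True as soon as a repeated element is found."""
    seen = set()
    for element in arr:
        if element in seen:
            return True
        seen.add(element)
    return False
```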
- umami.preprocessing_tools.resampling.resampling_base.read_dataframe_repetition(file_df: ndarray, loading_indices: list, duplicate: bool = False, save_tracks: bool = False, tracks_names: list | None = None)#
Implements a fancier reading of an H5 dataframe (allowing repeated indices). Designed to read an h5 file with jets (and tracks if save_tracks is true).
- Parameters:
file_df (file) – file containing datasets
loading_indices (list) – indices to load
duplicate (bool) – whether the reading should assume duplicates are present. DO NOT USE IF NO DUPLICATES ARE EXPECTED!, by default False
save_tracks (bool) – whether to store tracks, by default False
tracks_names (list) – list of tracks collection names to use, by default None
- Returns:
numpy.ndarray – jets
numpy.ndarray – tracks if save_tracks is True
umami.preprocessing_tools.resampling.weighting module#
Weighting module handling data preprocessing.
- class umami.preprocessing_tools.resampling.weighting.Weighting(config: object)#
Bases:
ResamplingTools
Weighting class.
- Run()#
Run function for Weighting class.
- get_flavour_weights()#
Calculate ratios (weights) from bins in 2d (pt,eta) histogram between different flavours.
- get_indices()#
Applies the weighting to the given arrays.
- Returns:
Returns the indices for the jets to be used separately for each
category and sample.