# Write train sample

## Writing Train Sample

In the final step of the preprocessing, the resampled training jets are scaled/shifted and then written to disk in a format that can be used for training. Each type of object is stored within its own group in the output file, and each group can contain multiple datasets, for example for the inputs, weights, and labels. You can recursively list the contents of all groups using `h5ls -r`. During writing, the collections of the training sample are given new names and data types: the structured collections are replaced by datasets of unstructured `numpy.ndarray`s. The names/shapes of these new datasets in the final training file are listed in the table below:

| Before Writing | After Writing | Shape | Comment |
| --- | --- | --- | --- |
| `jets` | `jets/inputs` | `(n_jets, n_jet_variables)` | |
| `labels` | `jets/labels_one_hot` | `(n_jets, n_jet_classes)` | Old format: one-hot encoded truth labels. The `n_jet_classes` are the `class_labels` defined in the preprocessing config. The value at index 0 here corresponds to the jet origin at index 0 in the `class_labels` list. |
| `labels` | `jets/labels` | `(n_jets,)` | Sparse encoded jet labels. |
| `<tracks_name>` | `<tracks_name>/inputs` | `(n_jets, n_tracks, n_track_variables)` | `<tracks_name>` is the name of the track collection in the `.h5` files coming from the training dataset dumper. |
| `<tracks_name>_labels` | `<tracks_name>/labels` | `(n_jets, n_tracks, n_track_truth_variables)` | Sparse representation of the `track_truth_variables`. |
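
The sparse and one-hot label datasets encode the same truth information. A minimal numpy sketch of how the two representations relate (the class count and label values are placeholders):

```python
import numpy as np

# Illustration only: n_jet_classes is the length of the class_labels
# list in the preprocessing config.
n_jet_classes = 3
labels = np.array([0, 2, 1, 0])  # sparse labels, shape (n_jets,)

# One-hot encoding: row i has a 1 at column labels[i]
labels_one_hot = np.eye(n_jet_classes, dtype=np.uint8)[labels]
# labels_one_hot has shape (n_jets, n_jet_classes), e.g. label 2 -> [0, 0, 1]
```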

In the final training file, the column information (and therefore which column corresponds to which variable) is no longer available. You can run `h5ls -v` on the file to get some information about the variables for each of the datasets. The variables for the specific jet and track datasets are shown as an attribute of the dataset. The order of these variables is also the order of the variable columns in the dataset.
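
If you prefer to inspect the file from Python, a small h5py sketch like the following works as well. The file path is a placeholder, and the exact attribute key holding the variable names is an assumption, so list the available keys first:

```python
import h5py

# Open the final training file and list its groups/datasets,
# mirroring what `h5ls -r` shows on the command line.
with h5py.File("train_sample.h5", "r") as f:  # path is a placeholder
    f.visit(print)  # recursively print all group/dataset names

    inputs = f["jets/inputs"]
    print(inputs.shape)  # (n_jets, n_jet_variables)

    # The variable names are stored as a dataset attribute; inspect the
    # available keys to find the one holding the column names.
    print(list(inputs.attrs.keys()))
```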

### Config File

For the final writing step, only a few options are needed. They are shown and explained below:

```yaml
compression: lzf

# save final output files with specified precision
precision: float16

# concatenate jet inputs with each track's inputs in the final output file
concat_jet_tracks: False

# Options for the conversion to tfrecords
convert_to_tfrecord:
  chunk_size: 5000
```

| Setting | Explanation |
| --- | --- |
| `compression` | Compression algorithm used for the final training sample. Because compression slows down loading, this should be `null`. Possible options are, for example, `lzf` and `gzip`. |
| `precision` | The precision of the final output file. The values are saved with the given precision to save space. |
| `concat_jet_tracks` | If `True`, all jet features are concatenated to the features of each track. You can also provide a list of jet variable names to concatenate only a subset of them (see the sketch below). |
| `convert_to_tfrecord` | Options for the conversion to TFRecords. Possible options are `chunk_size`, which gives the number of samples saved per file, and `N_Add_Vars`, the number of additional variables to be saved in the TFRecords. |
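
A minimal sketch of the subset form of `concat_jet_tracks`; the variable names below are illustrative placeholders, use names from your own jet variable list:

```yaml
# concatenate only selected jet variables to each track's inputs
# (variable names are placeholders)
concat_jet_tracks:
  - jet_variable_1
  - jet_variable_2
```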

When you want to train with conditional information, i.e. jet $p_T$ and $\eta$, the corresponding model (CADS) will load the jet information directly from the train file when using `.h5`. When you want to use TFRecords instead, you need to define the number of variables that are additionally written to the files with `N_Add_Vars`. Currently, a value of 2 will use the first two available jet variables, which are by default jet $p_T$ and $\eta$.
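
For example, the TFRecord conversion block for such a conditional training could look like this (a sketch built from the options described above):

```yaml
convert_to_tfrecord:
  chunk_size: 5000
  # write the first two jet variables (by default pT and eta) in addition
  N_Add_Vars: 2
```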

There is support for storing additional per-jet labels, such as those used as regression targets. To include them, simply add the following to the variable config:

```yaml
additional_labels:
  - jet_label_1
  - jet_label_2
  - jet_label_3
```

NaN values will be replaced by 0, and no scaling is applied.
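
A minimal numpy sketch of this NaN handling (illustrative only, not the tool's internal code):

```python
import numpy as np

# Hypothetical additional label column containing missing values
additional_label = np.array([1.5, np.nan, 0.3])

# NaN values are replaced by 0; no scaling/shifting is applied
cleaned = np.nan_to_num(additional_label, nan=0.0)
# cleaned -> array([1.5, 0. , 0.3])
```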

### Running the Writing Step

The writing of the final training sample can be started via the command

```bash
preprocessing.py --config <path to config file> --write
```


The writing step will take some time, depending on the number of jets you want to use for training and on whether you are using track collections.

#### TFRecords writing

If you are saving the tracks, it might be useful to save your samples as a directory of TFRecord files. This can be done by using `--to_records` instead of `--write`. Important: you need to have run `--write` beforehand.

```bash
preprocessing.py --config <path to config file> --to_records
```

**TF records**

TFRecords are TensorFlow's own file format for storing datasets. This format can be especially useful when working with large datasets. In TFRecords, the data is saved as a sequence of binary strings, which has the advantage that reading the data is significantly faster than from a `.h5` file. In addition, the data can be saved in multiple files instead of one big file containing all the data. This way the reading procedure can be parallelised, which speeds up the whole training. Besides this, since TFRecords are TensorFlow's own file format, it is optimised for use with TensorFlow. For example, the dataset is not stored completely in memory but automatically loaded in batches as needed.
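
As an illustration of the parallel reading, here is a generic `tf.data` sketch; the file pattern and batch size are placeholders, and the parsing of the stored features is omitted since it depends on what was written per sample:

```python
import tensorflow as tf

# List the TFRecord shards in the output directory
# (the file pattern is a placeholder).
files = tf.data.Dataset.list_files("train_tfrecords/*.tfrecord")

# Interleave reads across shards so several files are consumed in
# parallel, which is where the training speed-up comes from.
dataset = files.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.AUTOTUNE,
)

# Batches are deserialised lazily as the training loop consumes them,
# so the dataset is never held completely in memory.
dataset = dataset.batch(256).prefetch(tf.data.AUTOTUNE)
```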

#### Writing validation samples

In some cases, a hybrid-resampled validation sample that has been scaled, shifted, and written to file in the same structure as the training files is required. To produce one, simply add the `--hybrid_validation` flag when running the `--write` step:

```bash
preprocessing.py --config <path to config file> --write --hybrid_validation
```