The Processing Config¶
The processing config file tells fast-carpenter what to do with your data and is written in YAML.
An example config file looks like:
```yaml
stages:
  - jet_cleaning: fast_carpenter.Define
  - event_selection: fast_carpenter.CutFlow
  - histogram: fast_carpenter.BinnedDataframe

jet_cleaning:
  variables:
    - BtaggedJets: Jet_bScore > 0.9
    - nBJets: {reduce: count, formula: BtaggedJets}

event_selection:
  selection:
    All:
      - nElectron == 0
      - nJet > 1
      - {reduce: 0, formula: Jet_pt > 100}
      - Any:
          - HT >= 200
          - MHT >= 200

histogram:
  binning:
    - {in: nJet}
    - {in: nBJets}
    - {in: MET, out: met, bins: {edges: [0, 200, 400, 700, 1000]}}
  weights: weight_nominal
```
Other, more complete examples are listed in Example repositories.
Tip
Since this is a YAML file, things like anchors and block syntax are totally valid, which can be helpful to define “aliases” or reuse certain parts of a config. For more guidance on YAML, this is a good overview of the concepts and syntax: https://kapeli.com/cheat_sheets/YAML.docset/Contents/Resources/Documents/index.
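As a brief illustration of how anchors can help (the stage and variable names here are invented for the example, not taken from a real config), a block defined once with an anchor can be reused elsewhere with an alias:

```yaml
# Define a binning list once with an anchor (&common_bins) ...
histogram_nominal:
  binning: &common_bins
    - {in: nJet}
    - {in: MET, bins: {edges: [0, 200, 400]}}
  weights: weight_nominal

# ... and reuse it in another stage with an alias (*common_bins)
histogram_unweighted:
  binning: *common_bins
```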
Anatomy of the config¶
- The stages section
  This is the most important section of the config because it defines which steps to take with the data. It is a list of single-entry dictionaries, whose key is the name of the stage (e.g. histogram) and whose value is the Python-importable class that implements it (e.g. fast_carpenter.BinnedDataframe). The following sections discuss what counts as a valid stage class. Lines 1 to 4 of the config above show an example of this section; others can be found in the linked Example repositories.
- Stage configuration sections
  Each stage must be given a complete description by adding a top-level section to the YAML file with the same name used in the stages section. This section should contain a dictionary, which will be passed as keyword arguments to the underlying class's __init__ method. Lines 22 to 26 of the example config above show how the stage called histogram is configured. See below for more help on configuring specific stages.
- Importing other config files
  Sometimes it is helpful to re-use one config in another, for example to define a list of common variables and event selections but change the BinnedDataframes that are produced. The processing config supports this through the reserved word IMPORT, used as the key for a stage and followed by the path of the config file to import. If the path starts with {this_dir}, the imported file is located relative to the directory of the importing config file. For example:

  - IMPORT: "{this_dir}/another_processing_config.yml"
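For instance, a config that imports a shared base and then adds its own histogram stage might look like the following (the file name common_selection.yml is hypothetical):

```yaml
stages:
  - IMPORT: "{this_dir}/common_selection.yml"   # shared variables and selection
  - histogram: fast_carpenter.BinnedDataframe   # added by this config

histogram:
  binning:
    - {in: nJet}
```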
See also
The interpretation of the processing config is handled by the fast-flow package, so its documentation can also help in understanding the basic anatomy and handling.
Built-in Stages¶
The list of stages already known to fast_carpenter can be found using the built-in --help-stages option.
```
$ fast_carpenter --help-stages
fast_carpenter.Define
        config: variables
        purpose: Creates new variables using a string-based expression.
fast_carpenter.SystematicWeights
        config: weights, out_format=weight_{}, extra_variations=[]
        purpose: Combines multiple weights and variations to produce a single event weight
fast_carpenter.CutFlow
        config: selection_file=None, keep_unique_id=False, selection=None, counter=True, weights=None
        purpose: Prevents subsequent stages seeing certain events.
fast_carpenter.SelectPhaseSpace
        config: region_name, **kwargs
        purpose: Creates an event-mask and adds it to the data-space.
fast_carpenter.BinnedDataframe
        config: binning, weights=None, dataset_col=True, pad_missing=False, file_format=None
        purpose: Produces a binned dataframe (a multi-dimensional histogram).
fast_carpenter.BuildAghast
        config: binning, weights=None, dataset_col=True
        purpose: Builds an aghast histogram.
fast_carpenter.EventByEventDataframe
        config: collections, mask=None, flatten=True
        purpose: Write out a pandas dataframe with event-level values
```
Further guidance on the built-in stages can be found using --help-stages-full
and giving the name of the stage.
All the built-in stages of fast_carpenter are available directly from the fast_carpenter module, e.g. fast_carpenter.Define.
See also
In-depth discussion of the built-in stages and their configuration can be found on the fast_carpenter
module page: fast_carpenter
, or directly at:
Todo
Build that list programmatically, so it's always up to date and uses the built-in docstrings for the descriptions.
User-defined Stages¶
fast-carpenter is still evolving, so it is natural that many analysis tasks cannot be implemented using the existing stages.
In this case it is possible to implement your own stage, making sure it can be imported by Python (e.g. by setting the PYTHONPATH environment variable to point to the directory containing its code).
The class implementing a custom stage should provide the following methods:
__init__(name, out_dir, ...)¶
This is the method that receives the configuration from the config file and creates the stage itself.
Parameters:
- name (str) – the name of the stage as given in the config file.
- out_dir (path) – the path to the output directory that should be used if the stage produces output.
Additional arguments can be added, which will be configurable from the processing config file.

event(chunk)¶
Called once for each chunk of data.
Parameters:
- chunk – provides access to the dataset configuration (chunk.config) and the current data-space (chunk.tree). Typically one wants an array, or a set of arrays, representing the data for each event, in which case these can be obtained using:

```
jet_pt = chunk.tree.array("jet_pt")
jet_pt, jet_eta = chunk.tree.arrays(["jet_pt", "jet_eta"], outputtype=tuple)
```

If your stage produces a new variable which you want other stages to be able to see, use the new_variable method:

```
chunk.tree.new_variable("number_good_jets", number_good_jets)
```

For more details on working with chunk.tree, see fast_carpenter.masked_tree.MaskedUprootTree.
Returns: True or False, indicating whether to continue processing the chunk through subsequent stages.
Return type: bool
See also
An example of such a user stage can be seen in the cms_public_tutorial demo repository: https://gitlab.cern.ch/fast-hep/public/fast_cms_public_tutorial/blob/master/cms_hep_tutorial/__init__.py
Warning
Make sure that your stage can be imported by python, most likely by setting the PYTHONPATH
variable to point to the containing directory.
Then, to check that a stage called AddMyFancyVar defined in a module called my_custom_module can be imported, make sure no errors are raised by running:

python -c "from my_custom_module import AddMyFancyVar"
Todo
Describe the collector and merge methods to allow a user stage to save results to disk.