Configuration files

This section describes the main sections of the configuration file and the parameters that can be set in each of them.

Dataset creator settings

datasetCreator: selects how the dataset is created. There are three possible values:

regions: use the parameters configured in regions and measuredSignalsStations below to create a list of signals and download them

customJSON: use a list of signals configured in customJSONSignals below and download them

CSVreader: load a list of CSV files defined in loadCsvFolder below

saveDataset: if true, save the downloaded or loaded dataset in the outputCsvFolder
loadSignalsFolder: path to the folder where the features JSON files are stored
customJSONSignals: list of dictionaries containing:
- a JSON file containing the features
- a list of one or more columns of the dataset, containing the O3 values of the previous day. If more than one column is provided, the maximum of the daily values will constitute the response vector Y
```
{
  "filename": "CHI_MOR.json",
  "targetColumn": ["CHI__YO3__d1", "BIO__YO3__d1"]
}
```
loadCsvFolder: path to the folder containing the CSV dataset files
csvFiles: a list of dictionaries containing:
- a CSV file containing the whole dataset
- a list of one or more columns of the dataset, containing the O3 values of the previous day. If more than one column is provided, the maximum of the daily values will constitute the response vector Y
```
{
  "filename": "BIO_MOR.csv",
  "targetColumn": ["CHI__YO3__d1", "BIO__YO3__d1"]
}
```
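As an illustration of how several target columns could be reduced to a single response vector Y, the sketch below takes the row-wise maximum over the listed columns. The helper name `build_response` and the numeric values are hypothetical and only serve the example; the tool's own implementation may differ.

```python
# Made-up rows keyed by column name; the numbers are illustrative only.
rows = [
    {"CHI__YO3__d1": 120.0, "BIO__YO3__d1": 135.5},
    {"CHI__YO3__d1": 98.2, "BIO__YO3__d1": 91.0},
    {"CHI__YO3__d1": 150.1, "BIO__YO3__d1": 150.1},
]


def build_response(rows, target_columns):
    """Return one value per row: the maximum over the listed target columns."""
    return [max(row[col] for col in target_columns) for row in rows]


# Same shape as the "targetColumn" entry of a csvFiles item.
target_columns = ["CHI__YO3__d1", "BIO__YO3__d1"]
y = build_response(rows, target_columns)
```

With a single target column the maximum is simply that column, so the same helper covers both cases.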
outputCsvFolder: path where the downloaded datasets and selected features are saved
outputSignalFolder: path where the JSON files of regional signals are saved
startDay: “mm-dd” format date, the start of the period to download each year. Ex: “05-15”
endDay: “mm-dd” format date, the end of the period to download each year. Ex: “09-30”
years: list of years for which the data are downloaded, between startDay and endDay. Ex: [2015, 2016, 2017, 2018, 2019, 2020, 2021]
sleepTimeBetweenQueries: Sleeping time in seconds between each query to InfluxDB
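To make the interplay of startDay, endDay and years concrete, the following sketch expands them into one download period per year. The function name `download_periods` is an assumption made for this example and is not part of the tool.

```python
from datetime import date


def download_periods(years, start_day, end_day):
    """Return one (start, end) date pair per year from "mm-dd" strings."""
    start_month, start_dom = (int(part) for part in start_day.split("-"))
    end_month, end_dom = (int(part) for part in end_day.split("-"))
    return [
        (date(year, start_month, start_dom), date(year, end_month, end_dom))
        for year in years
    ]


# Values taken from the examples given for startDay, endDay and years.
periods = download_periods([2015, 2016], "05-15", "09-30")
```

Each pair could then drive one query, with a pause of sleepTimeBetweenQueries seconds between consecutive queries.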

Regions settings

regions: a dictionary of regions. Each region is itself a dictionary, composed of measurement stations, forecast stations and a list of one or more target columns. For instance:
```
"Bioggio":
{
  "MeasureStations": ["BIO", "LUG", "MS-LUG"],
  "targetColumn": ["BIO__YO3__d1"],
  "ForecastStations": ["P_BIO"]
}
```

measuredSignalsStations: list of signals that are measured at each measurement station. For instance:

"measuredSignalsStations":
{
  "CHI": ["CN", "Gl", "NO", "NO2", "NOx", "O3", "P", "Prec", "RH", "T", "WD", "WS"],
  "MEN": ["CN", "Gl", "NO", "NO2", "NOx", "O3", "P", "Prec", "RH", "T", "WDvect", "WSvect"],
  "LUG": ["NO", "NO2", "NOx", "O3"]
}

forecastedSignalsStations: list of signals that are forecasted at each forecast station. For instance:

"forecastedSignalsStations":
{
  "TICIA": ["GLOB", "PS", "TOT_PREC", "RELHUM_2M", "T_2M", "TD_2M", "DD_10M", "FF_10M", "CLCT"],
  "P_BIO": ["GLOB", "PS", "TOT_PREC", "RELHUM_2M", "T_2M", "TD_2M", "DD_10M", "FF_10M", "CLCT"]
}
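The sketch below shows one way a region could be combined with the two station dictionaries to obtain the list of signals to download. It is only an assumption about the mechanism: the helper `region_signals`, the (station, signal) pair representation, and the choice to skip stations missing from the dictionaries are all hypothetical.

```python
def region_signals(region, measured, forecasted):
    """List (station, signal) pairs for every station named in a region.

    Stations that are absent from the dictionaries are skipped; that is an
    assumption of this sketch, not documented behaviour.
    """
    pairs = []
    for station in region["MeasureStations"]:
        for signal in measured.get(station, []):
            pairs.append((station, signal))
    for station in region["ForecastStations"]:
        for signal in forecasted.get(station, []):
            pairs.append((station, signal))
    return pairs


# Excerpts of the configuration examples shown in this section.
bioggio = {
    "MeasureStations": ["BIO", "LUG", "MS-LUG"],
    "targetColumn": ["BIO__YO3__d1"],
    "ForecastStations": ["P_BIO"],
}
measured = {"LUG": ["NO", "NO2", "NOx", "O3"]}
forecasted = {
    "P_BIO": ["GLOB", "PS", "TOT_PREC", "RELHUM_2M", "T_2M",
              "TD_2M", "DD_10M", "FF_10M", "CLCT"],
}

pairs = region_signals(bioggio, measured, forecasted)
```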

allMeasuredSignals: list of all measured signals. Do not modify
allForecastedSignals: list of all forecasted signals. Do not modify

VOC settings

Configuration parameters for the calculation of wood VOC signal

useCorrection: if true, apply a linear correction to the values calculated with the forecasted data with the following parameters:
- correction: use the following two parameters to apply a correction through a regression line fit
  - slope: slope of the linear regression fitting the forecasted data
  - intercept: intercept of the linear regression fitting the forecasted data
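A linear correction of this kind can be sketched as follows. The form `slope * value + intercept` is assumed here; the source does not state whether the tool applies the regression line in this direction or inverts it, and the function name `apply_correction` is hypothetical.

```python
def apply_correction(values, slope, intercept, use_correction=True):
    """Apply value -> slope * value + intercept when the correction is enabled."""
    if not use_correction:
        return list(values)
    # Assumed direction of the correction; the tool may invert the fit instead.
    return [slope * value + intercept for value in values]


# Illustrative numbers only.
corrected = apply_correction([1.0, 2.0], slope=2.0, intercept=0.5)
unchanged = apply_correction([1.0, 2.0], slope=2.0, intercept=0.5,
                             use_correction=False)
```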

emissionType: if "forecasted", use forecasted data wherever possible and fall back to measured data otherwise. If "measured", use measured data everywhere to calculate the wood VOC signal

Feature analyzer settings

Configuration parameters to perform the Feature Selection using SHAP and NGBoost

numberEstimatorsNGB: number of boosting iterations in NGB (see the online documentation)
learningRate: the learning rate eta
numberSelectedFeatures: number of most important features to be saved
w1: weight of O3 observations above threshold1 \(\mu g/m^3\)
w2: weight of O3 observations between threshold2 and threshold1 \(\mu g/m^3\)
w3: weight of O3 observations between threshold3 and threshold2 \(\mu g/m^3\)
threshold1: highest threshold limit, currently 240 \(\mu g/m^3\)
threshold2: intermediate threshold limit, currently 180 \(\mu g/m^3\)
threshold3: lowest threshold limit, currently 135 \(\mu g/m^3\)
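The mapping from an O3 observation to its weight can be sketched as below. Two points are assumptions of this example, since the source does not specify them: observations at or below threshold3 receive a base weight of 1.0, and each threshold is treated as an exclusive lower bound. The function name `observation_weight` is hypothetical.

```python
def observation_weight(o3, w1, w2, w3,
                       threshold1=240.0, threshold2=180.0, threshold3=135.0,
                       base_weight=1.0):
    """Return the weight of a single O3 observation (values in ug/m^3)."""
    if o3 > threshold1:
        return w1
    if o3 > threshold2:
        return w2
    if o3 > threshold3:
        return w3
    # Assumed default for observations not covered by w1, w2 or w3.
    return base_weight


# Illustrative weights and observations.
weights = [observation_weight(o3, w1=4.0, w2=3.0, w3=2.0)
           for o3 in (250.0, 200.0, 150.0, 100.0)]
```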

Grid search settings

w1_start: starting value for the weight \(w_1\) in the grid search
w1_end: last value for the weight \(w_1\) in the grid search
w1_step: step value for the weight \(w_1\) in the grid search
w2_start: starting value for the weight \(w_2\) in the grid search
w2_end: last value for the weight \(w_2\) in the grid search
w2_step: step value for the weight \(w_2\) in the grid search
w3_start: starting value for the weight \(w_3\) in the grid search
w3_end: last value for the weight \(w_3\) in the grid search
w3_step: step value for the weight \(w_3\) in the grid search
typeGridSearch: there are three possible ways to perform a grid search, as explained in Grid search:
- multiple: perform a feature selection inside each fold
- single: perform one feature selection on the whole dataset and use the selected features on each fold
- test: used mainly for testing; perform a feature selection on the whole dataset, then split the dataset into 80% training and 20% testing
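The start, end and step parameters above define the candidate values for each weight. The sketch below expands them into the full grid of \((w_1, w_2, w_3)\) combinations. It assumes that the end value is included in the range, and the helper `weight_values` and the numbers in `grid_cfg` are hypothetical.

```python
from itertools import product


def weight_values(start, end, step):
    """Return start, start + step, ... up to and including end (assumed inclusive)."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Small epsilon guards against floating-point error in the division.
    count = int((end - start) / step + 1e-9) + 1
    return [round(start + i * step, 10) for i in range(count)]


# Illustrative grid search settings.
grid_cfg = {
    "w1_start": 1.0, "w1_end": 2.0, "w1_step": 0.5,
    "w2_start": 1.0, "w2_end": 2.0, "w2_step": 1.0,
    "w3_start": 1.0, "w3_end": 1.0, "w3_step": 1.0,
}

grid = list(product(*(
    weight_values(grid_cfg[f"{w}_start"], grid_cfg[f"{w}_end"], grid_cfg[f"{w}_step"])
    for w in ("w1", "w2", "w3")
)))
```

Each tuple in `grid` is one weight combination to evaluate, so the grid size is the product of the three range lengths.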