Configuration files
This section describes the main sections of the configuration file and the configurable parameters in each of them.
Dataset creator settings
datasetCreator: how the dataset is created. There are three possible values:
  regions: use the parameters configured in regions and measuredSignalsStations below to create a list of signals and download them
  customJSON: use the list of signals configured in customJSONSignals below and download them
  CSVreader: load the list of CSV files defined in loadCsvFolder below
saveDataset: if true, save the downloaded or loaded dataset in the outputCsvFolder
loadSignalsFolder: path to the folder where the features JSON files are stored
customJSONSignals: list of dictionaries, each containing:
  a JSON file containing the features
  a list of one or more columns of the dataset, containing the O3 values of the previous day. If more than one column is provided, the maximum of the daily values is used as the response vector Y
  For instance:
  { "filename": "CHI_MOR.json", "targetColumn": ["CHI__YO3__d1", "BIO__YO3__d1"] }
loadCsvFolder: path to the folder containing the CSV dataset files
csvFiles: a list of dictionaries, each containing:
  a CSV file containing the whole dataset
  a list of one or more columns of the dataset, containing the O3 values of the previous day. If more than one column is provided, the maximum of the daily values is used as the response vector Y
  For instance:
  { "filename": "BIO_MOR.csv", "targetColumn": ["CHI__YO3__d1", "BIO__YO3__d1"] }
outputCsvFolder: path where the downloaded datasets and selected features are saved
outputSignalFolder: path where the JSON files of regional signals are saved
startDay: start of the period to download each year, in "mm-dd" format. Ex: "05-15"
endDay: end of the period to download each year, in "mm-dd" format. Ex: "09-30"
years: list of years for which the data are downloaded, between startDay and endDay. Ex: [2015, 2016, 2017, 2018, 2019, 2020, 2021]
sleepTimeBetweenQueries: sleeping time in seconds between consecutive queries to InfluxDB
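Two of the settings above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual code: the helper names download_periods and response_vector are hypothetical, the rows are made-up values, and the row-wise maximum is one reading of "the maximum of the daily values".

```python
from datetime import date


def download_periods(start_day, end_day, years):
    """Return one (start, end) date pair per year from "mm-dd" strings."""
    start_month, start_dom = (int(part) for part in start_day.split("-"))
    end_month, end_dom = (int(part) for part in end_day.split("-"))
    return [
        (date(year, start_month, start_dom), date(year, end_month, end_dom))
        for year in years
    ]


def response_vector(rows, target_columns):
    """Row-wise maximum over the target columns (assumed interpretation)."""
    return [max(row[col] for col in target_columns) for row in rows]


periods = download_periods("05-15", "09-30", [2015, 2016])

# Hypothetical O3 values, for illustration only.
rows = [
    {"CHI__YO3__d1": 110.0, "BIO__YO3__d1": 125.0},
    {"CHI__YO3__d1": 140.0, "BIO__YO3__d1": 90.0},
]
y = response_vector(rows, ["CHI__YO3__d1", "BIO__YO3__d1"])
```

With a single entry in targetColumn, the maximum reduces to that column itself, so the same helper covers both cases.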
Regions settings
regions: a dictionary of regions. Each region is itself a dictionary, composed of measure stations, forecast stations and a list of one or more target columns. For instance:
  "Bioggio": {
    "MeasureStations": ["BIO", "LUG", "MS-LUG"],
    "targetColumn": ["BIO__YO3__d1"],
    "ForecastStations": ["P_BIO"]
  }
measuredSignalsStations: a dictionary listing the signals that are measured at each measurement station. For instance:
  "measuredSignalsStations": {
    "CHI": ["CN", "Gl", "NO", "NO2", "NOx", "O3", "P", "Prec", "RH", "T", "WD", "WS"],
    "MEN": ["CN", "Gl", "NO", "NO2", "NOx", "O3", "P", "Prec", "RH", "T", "WDvect", "WSvect"],
    "LUG": ["NO", "NO2", "NOx", "O3"]
  }
forecastedSignalsStations: a dictionary listing the signals that are forecasted at each forecast station. For instance:
  "forecastedSignalsStations": {
    "TICIA": ["GLOB", "PS", "TOT_PREC", "RELHUM_2M", "T_2M", "TD_2M", "DD_10M", "FF_10M", "CLCT"],
    "P_BIO": ["GLOB", "PS", "TOT_PREC", "RELHUM_2M", "T_2M", "TD_2M", "DD_10M", "FF_10M", "CLCT"]
  }
allMeasuredSignals: list of all measured signals. Do not modify
allForecastedSignals: list of all forecasted signals. Do not modify
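As a rough sketch of how a region could be expanded into the list of signals to download, the snippet below pairs each station of a region with the signals configured for it. The function region_signals is hypothetical, the forecast signal list is shortened, and silently skipping stations that have no configured signals is an assumption; the real tool may behave differently.

```python
def region_signals(region, measured_by_station, forecasted_by_station):
    """List (station, signal) pairs for one region.

    Stations missing from the signal dictionaries are skipped here;
    this is an assumption made for the example.
    """
    pairs = []
    for station in region["MeasureStations"]:
        for signal in measured_by_station.get(station, []):
            pairs.append((station, signal))
    for station in region["ForecastStations"]:
        for signal in forecasted_by_station.get(station, []):
            pairs.append((station, signal))
    return pairs


bioggio = {
    "MeasureStations": ["BIO", "LUG", "MS-LUG"],
    "targetColumn": ["BIO__YO3__d1"],
    "ForecastStations": ["P_BIO"],
}
measured = {"LUG": ["NO", "NO2", "NOx", "O3"]}
forecasted = {"P_BIO": ["GLOB", "PS"]}  # shortened for the example

signals = region_signals(bioggio, measured, forecasted)
```

The example returns plain (station, signal) tuples because the naming scheme of the resulting dataset columns is not described in this section.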
VOC settings
Configuration parameters for the calculation of the woods VOC signal
useCorrection: if true, apply a linear correction to the values calculated with the forecasted data, using the parameters in correction
correction: the two parameters of the regression line used for the correction:
  slope: slope of the linear regression fitting the forecasted data
  intercept: intercept of the linear regression fitting the forecasted data
emissionType: if "forecasted", try to use forecasted data wherever possible, otherwise use measured data. If "measured", use measured data everywhere to calculate the woods VOC
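The sketch below illustrates one plausible reading of these settings. It assumes that the correction has the form corrected = slope * value + intercept, which this section does not state explicitly; the helper names and the numeric values are hypothetical.

```python
def apply_correction(values, slope, intercept, use_correction=True):
    """Apply corrected = slope * value + intercept (assumed form)."""
    if not use_correction:
        return list(values)
    return [slope * value + intercept for value in values]


def pick_emission_source(emission_type, forecasted, measured):
    """Choose the data source according to emissionType.

    With "forecasted", fall back to measured data when no forecast
    is available (passed here as None).
    """
    if emission_type == "measured":
        return measured
    if emission_type == "forecasted":
        return forecasted if forecasted is not None else measured
    raise ValueError(f"unknown emissionType: {emission_type!r}")


# Hypothetical numbers, for illustration only.
corrected = apply_correction([10.0, 20.0], slope=0.5, intercept=2.0)
source = pick_emission_source("forecasted", None, [1.0, 2.0])
```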
Feature analyzer settings
Configuration parameters to perform the Feature Selection using SHAP and NGBoost
numberEstimatorsNGB: number of boosting iterations in NGB (see the online documentation)
learningRate: the learning rate eta
numberSelectedFeatures: number of most important features to be saved
w1: weight of O3 observations above threshold1 \(\mu g/m^3\)
w2: weight of O3 observations between threshold2 and threshold1 \(\mu g/m^3\)
w3: weight of O3 observations between threshold3 and threshold2 \(\mu g/m^3\)
threshold1: highest threshold limit, currently 240 \(\mu g/m^3\)
threshold2: intermediate threshold limit, currently 180 \(\mu g/m^3\)
threshold3: lowest threshold limit, currently 135 \(\mu g/m^3\)
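The weighting scheme defined by w1, w2, w3 and the three thresholds can be sketched as follows. Two points are assumptions that this section does not specify: observations at or below threshold3 receive a weight of 1, and each threshold is treated as an exclusive lower bound. The function name sample_weights and the weight values are hypothetical.

```python
def sample_weights(o3_values, w1, w2, w3,
                   threshold1=240.0, threshold2=180.0, threshold3=135.0):
    """Assign one weight per O3 observation according to its band."""
    weights = []
    for value in o3_values:
        if value > threshold1:
            weights.append(w1)
        elif value > threshold2:
            weights.append(w2)
        elif value > threshold3:
            weights.append(w3)
        else:
            weights.append(1.0)  # assumed default below the lowest threshold
    return weights


# Hypothetical weights and observations, for illustration only.
weights = sample_weights([100.0, 150.0, 200.0, 250.0], w1=4.0, w2=3.0, w3=2.0)
```

Weighting high-ozone observations more heavily makes the rare exceedance days count more during training, which is the purpose the three bands appear to serve.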
Grid search settings
w1_start: starting value for the weight \(w_1\) in the grid search
w1_end: last value for the weight \(w_1\) in the grid search
w1_step: step value for the weight \(w_1\) in the grid search
w2_start: starting value for the weight \(w_2\) in the grid search
w2_end: last value for the weight \(w_2\) in the grid search
w2_step: step value for the weight \(w_2\) in the grid search
w3_start: starting value for the weight \(w_3\) in the grid search
w3_end: last value for the weight \(w_3\) in the grid search
w3_step: step value for the weight \(w_3\) in the grid search
typeGridSearch: there are three possible ways to perform a grid search, as explained in Grid search:
  multiple: perform a feature selection inside each fold
  single: perform one feature selection on the whole dataset and use the selected features on each fold
  test: mostly used for testing; perform a feature selection on the whole dataset and then split the dataset into 80% training and 20% testing
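To make the start/end/step parameters concrete, the following sketch enumerates the weight combinations that such a grid would contain. It assumes that the end value is included (it is described as the "last value") and that the three weights are combined as a full Cartesian product; the helper names and the numbers in cfg are hypothetical.

```python
from itertools import product


def weight_values(start, end, step):
    """Values from start to end (inclusive), spaced by step."""
    # Count the steps as an integer to avoid floating-point drift.
    count = int(round((end - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(count)]


def weight_grid(cfg):
    """All (w1, w2, w3) combinations defined by the grid search settings."""
    axes = [
        weight_values(cfg[f"w{i}_start"], cfg[f"w{i}_end"], cfg[f"w{i}_step"])
        for i in (1, 2, 3)
    ]
    return list(product(*axes))


# Hypothetical settings, for illustration only.
cfg = {
    "w1_start": 1.0, "w1_end": 2.0, "w1_step": 0.5,
    "w2_start": 1.0, "w2_end": 1.5, "w2_step": 0.5,
    "w3_start": 1.0, "w3_end": 1.0, "w3_step": 0.5,
}
grid = weight_grid(cfg)
```

The size of the grid is the product of the number of values per weight, so small steps on all three weights quickly multiply the number of runs.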