The anomaly detection engine is the core component of ADBox. For every available anomaly detection method, it orchestrates the interaction between the main functions of each algorithm, data ingestion, data storage, user output, and so on. In other words, the Engine determines the sequence of actions to be performed to go through the detection pipeline successfully.
For every available anomaly detection method, it includes:
Detector Manager: This is essentially a parser and transformer that converts use-case input into data consumable by the various detection pipelines. It manages the list of available detectors.
Engine: The ADEngine class maintains the status of the current detector (the detector used when nothing else is specified) and the list of available detectors. It also exposes the actions available to the end user, triggering the corresponding pipelines, and defines the default behavior of ADBox when no information is provided.
The list of available actions includes:
classDiagram
class ADEngine {
current_detector_id : NoneType, str
detectors
get_detectors()
predict(run_mode: str, index_date: str, detector_id: str, start_time: str, end_time: str, batch_size: int, predict_input_config, use_case_no)
select(detector_id: str)
set_current_detector_id() str
set_detectors() List[str]
set_detectors_and_id()
train(index_date: str, detector_name: str, default_config: bool, custom_config_file, use_case_no)
}
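The state-keeping role described above can be illustrated with a minimal sketch. This is not the ADBox implementation: the class name `MiniEngine`, the `resolve` helper, and the fallback rule (defaulting to the last detector id in the list) are assumptions made for illustration. Only the names `detectors`, `current_detector_id`, `get_detectors`, and `select` mirror the diagram.

```python
from typing import List, Optional


class MiniEngine:
    """Toy stand-in for an engine that tracks available detectors (hypothetical)."""

    def __init__(self, detectors: Optional[List[str]] = None) -> None:
        self.detectors: List[str] = list(detectors or [])
        # Assumption: default to the last listed detector, or None if there are none.
        self.current_detector_id: Optional[str] = (
            self.detectors[-1] if self.detectors else None
        )

    def get_detectors(self) -> List[str]:
        """Return a copy of the known detector ids."""
        return list(self.detectors)

    def select(self, detector_id: str) -> str:
        """Make the given detector the current one, if it is known."""
        if detector_id not in self.detectors:
            raise ValueError(f"Unknown detector id: {detector_id}")
        self.current_detector_id = detector_id
        return detector_id

    def resolve(self, detector_id: Optional[str] = None) -> str:
        """Return the detector to use: the explicit one, else the current one."""
        chosen = detector_id or self.current_detector_id
        if chosen is None:
            raise RuntimeError("No detector available; train one first.")
        return chosen
```

In this sketch, an action such as prediction would call `resolve(detector_id)` first, so that omitting the id falls back to the current detector.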
Currently, the ADBox engine supports only the MTAD-GAT method for anomaly detection.
Ideally, a detector is the object used to perform detection. In the specific case of anomaly detection via the MTAD-GAT algorithm, it must include, for example, a trained ML model, along with all the configuration used, and the POT object.
In the ADBox implementation, a detector is realized as a collection of files stored under a unique id, which is also the name of the subfolder of siem_mtad_gat/assets/detector_models containing those files. All outcomes generated by a pipeline associated with a given detector are stored in the corresponding folder. For details, see Detector data structure.
├── a77c773c-9e6f-4700-92f2-53c0e682f290
│   ├── input
│   │   ├── detector_input_parameters.json
│   │   └── training_config.json
│   ├── prediction
│   │   ├── uc-16_predicted_anomalies_data-1_2024-08-12_13-48-42.json
│   │   └── uc-16_predicted_data-1_2024-08-12_13-48-42.json
│   └── training
│       ├── losses_train_data.json
│       ├── model.pt
│       ├── scaler.pkl
│       ├── spot
│       │   ├── spot_feature-0.pkl
│       │   ├── spot_feature-1.pkl
│       │   ├── spot_feature-2.pkl
│       │   ├── spot_feature-3.pkl
│       │   ├── spot_feature-4.pkl
│       │   └── spot_feature-global.pkl
│       ├── test_output.pkl
│       ├── train_losses.png
│       ├── train_output.pkl
│       └── validation_losses.png
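Since a detector is just a folder named after its id, the available detectors can be discovered by listing subfolders. The following is a minimal sketch under that assumption; the helper names `list_detectors` and `describe_detector` are hypothetical and not part of ADBox, and the subfolder names are taken from the example tree above.

```python
from pathlib import Path
from typing import Dict, List

# Subfolders taken from the example tree above.
EXPECTED_SUBFOLDERS = ("input", "training", "prediction")


def list_detectors(models_dir: Path) -> List[str]:
    """Return the ids (subfolder names) found under the detector models folder."""
    if not models_dir.is_dir():
        return []
    return sorted(p.name for p in models_dir.iterdir() if p.is_dir())


def describe_detector(models_dir: Path, detector_id: str) -> Dict[str, bool]:
    """Report which of the expected subfolders exist for one detector."""
    root = models_dir / detector_id
    return {name: (root / name).is_dir() for name in EXPECTED_SUBFOLDERS}
```

For example, `describe_detector(Path("siem_mtad_gat/assets/detector_models"), some_id)` would show `prediction` as `False` for a detector that has been trained but never used for prediction.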
Below we describe the train and predict pipelines.
The main goal of the train pipeline is to create a new detector.
See parameters in the Use Case Guide.
The train pipeline:
Running the training pipeline should produce the input and training folders under the id of the model.
flowchart LR
Parse --> Init
Init--> Ingest
Ingest --> Transform
Transform --> Train
Train --> Respond
sequenceDiagram %%% some intermediate checks are not included
participant p0 as Main
participant p1 as :ADEngine
participant p2 as :Detector
participant p3 as :DataStorageManager
participant p10 as :DataRetrievalManager
participant p8 as :SPOTManager
participant p4 as :WazuhDataIngestor
participant p5 as :data_transformer.DataTypeTransformer
participant p6 as :data_transformer.DataPreprocessor
participant p7 as train_mtad_gat
p0->>+ p1: train(index_date, detector_name)
p1 ->>+ p2: get_detector_training_input(index_date, detector_name,default_config, custom_config_file)
p2 -->>- p1: return training_request
p1 ->> +p3: init DataStorageManager
p1 ->> +p3: save_detector_input_parameters(training_request)
deactivate p3
p1 ->> +p10: init DataRetrievalManager with data_storage_manager.uuid
p1 ->>+ p3: save_yaml_train(training_request)
p3 -->>- p1: return use_case_no
p1 ->> p1: check if required keys exist in training_request else raise exception
%% ingestion
p1 ->> +p4: get_training_data(training_request)
p4 -->>- p1: return input_data, datasource_name
%% transformation
%%%% Convert data types into proper format
p1 ->>+ p5: transform_data_types(input_data,...)
p5 -->>- p1: return value
%%%% preprocessing
p1 ->>+ p6: preprocess(input_data, training_request,...)
p6 -->>- p1: return train_data, test_data, train_stamps, test_stamps, column_names
%% SPOT
p1 ->> +p8: init SPOTManager
%%%% train model
p1 ->>+ p7: train_MTAD_GAT(train_data, test_data,<br>train_stamps,test_stamps,training_request.get("train_config"))
Note left of p7: Internal calls to Managers and MTAD_GAT package
p7 -->>- p1: return train_response
p1 ->> p1: define updated_detector_input_parameters using train_response
p1 ->>+ p3: update_detector_input_parameters_after_training(updated_detector_input_parameters)
deactivate p3
p1 ->> p1: Init response dict
p1 ->>+ p8: save_all_train()
deactivate p8
p1 ->> p8: destroy_instance()
deactivate p8
p1 ->> p10: destroy_instance()
deactivate p10
p1 ->> p3: destroy_instance()
deactivate p3
p1 -->> -p0: response
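The stage order shown in the flowchart (Parse, Init, Ingest, Transform, Train, Respond) can be sketched as a chain of functions passing a shared context along. This is only an illustration of the sequencing idea: `run_pipeline`, `TRAIN_STAGES`, and the placeholder stages are hypothetical and do not call any ADBox component.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def run_pipeline(stages: List[Tuple[str, Stage]], request: Context) -> Dict[str, Any]:
    """Run named stages in order, stopping at the first one that raises."""
    ctx: Context = dict(request)
    completed: List[str] = []
    error: Optional[str] = None
    for name, stage in stages:
        try:
            ctx = stage(ctx)
        except Exception as exc:  # placeholder: real code would catch narrower errors
            error = f"{name} failed: {exc}"
            break
        completed.append(name)
    return {"context": ctx, "completed": completed, "error": error}


# Placeholder stages named after the flowchart nodes; each just tags the context.
TRAIN_STAGES: List[Tuple[str, Stage]] = [
    (name, lambda ctx, _n=name: {**ctx, _n: "done"})
    for name in ("parse", "init", "ingest", "transform", "train", "respond")
]
```

Calling `run_pipeline(TRAIN_STAGES, {"detector_name": "demo"})` runs all six placeholders in order. If a stage raises, the result lists the stages completed so far together with an error message.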
train: The train method trains a detector to be used for detection using the default or given arguments. It takes the following inputs:
default_config: whether to use the default configuration from the /assets/default_configs/default_detector_input_config.json file. It is True by default; if set to False, a custom input config must be provided for training.
The main goal of the predict pipeline is to use a detector to find anomalies in a selected time frame.
Note that at prediction time, detector parameters such as window size, granularity, and feature aggregation methods cannot be selected, as they are inherent properties of the detector.
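One way to picture this constraint: the detector's stored parameters take precedence, and a prediction request that tries to change them is rejected. The sketch below assumes this behavior for illustration only. The key names (`window_size`, `granularity`, `aggregation`) and the function `build_prediction_config` are hypothetical and do not reflect the actual ADBox schema.

```python
from typing import Any, Dict

# Assumed names for parameters fixed at training time (illustrative only).
FIXED_KEYS = frozenset({"window_size", "granularity", "aggregation"})


def build_prediction_config(
    detector_params: Dict[str, Any], request: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge a prediction request with the detector's fixed parameters."""
    conflicting = [
        key
        for key in sorted(FIXED_KEYS)
        if key in request
        and key in detector_params
        and request[key] != detector_params[key]
    ]
    if conflicting:
        raise ValueError(f"Cannot override detector parameters: {conflicting}")
    config = dict(request)
    # Fixed parameters always come from the detector, never from the request.
    config.update({k: v for k, v in detector_params.items() if k in FIXED_KEYS})
    return config
```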
See parameters in the use case guide.
The prediction pipeline:
Running the prediction pipeline should produce files in the prediction folder under the id of the model.
flowchart LR
Parse --> Init
Init--> Ingest
Ingest --> Transform
Transform --> Predict
Predict --> Respond
Respond --> Ingest
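The arrow from Respond back to Ingest suggests that prediction can repeat over successive chunks of data. A minimal sketch of such a loop follows, assuming the selected time frame is split into fixed-length windows. The function name `iter_time_windows` and the windowing rule are assumptions; the source does not specify how ADBox batches its input.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def iter_time_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        # The last window is truncated so that it never extends past `end`.
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper
```

Each yielded pair would then drive one Ingest, Transform, Predict, Respond cycle before the loop moves on to the next window.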
predict: The predict method performs anomaly detection using the trained detectors with default or specified parameters. The predict function of ADEngine takes the following inputs: