idps-escape

Anomaly detection engine

The anomaly detection engine is the core component of ADBox. For every available anomaly detection method, it orchestrates the interaction between the core functions of the algorithm, data ingestion, data storage, user output, and so on. In other words, the engine determines the sequence of actions to be performed to go through the detection pipeline successfully.

For every available anomaly detection method, it includes:

The list of available actions includes:

classDiagram
  class ADEngine {
    current_detector_id : NoneType, str
    detectors
    get_detectors()
    predict(run_mode: str, index_date: str,  detector_id: str, start_time: str, end_time: str, batch_size: int, predict_input_config, use_case_no)
    select(detector_id: str)
    set_current_detector_id() str
    set_detectors() List[str]
    set_detectors_and_id()
    train(index_date: str, detector_name: str, default_config: bool, custom_config_file, use_case_no)
  }

Currently, the ADBox engine supports only the MTAD-GAT method for anomaly detection.

MTAD-GAT pipelines

Ideally, a detector is the object used to perform detection. In the specific case of anomaly detection via the MTAD-GAT algorithm, it must include, for example, a trained ML model, along with all the configuration used, and the POT object. In the ADBox implementation, a detector is realized as a collection of files stored under a unique id, which is also the name of the subfolder of siem_mtad_gat/assets/detector_models containing those files. All the outcomes generated by a pipeline associated with a certain detector are stored in the corresponding folder. For details, see Detector data structure.

├── a77c773c-9e6f-4700-92f2-53c0e682f290
│   ├── input
│   │   ├── detector_input_parameters.json
│   │   └── training_config.json
│   ├── prediction
│   │   ├── uc-16_predicted_anomalies_data-1_2024-08-12_13-48-42.json
│   │   └── uc-16_predicted_data-1_2024-08-12_13-48-42.json
│   └── training
│       ├── losses_train_data.json
│       ├── model.pt
│       ├── scaler.pkl
│       ├── spot
│       │   ├── spot_feature-0.pkl
│       │   ├── spot_feature-1.pkl
│       │   ├── spot_feature-2.pkl
│       │   ├── spot_feature-3.pkl
│       │   ├── spot_feature-4.pkl
│       │   └── spot_feature-global.pkl
│       ├── test_output.pkl
│       ├── train_losses.png
│       ├── train_output.pkl
│       └── validation_losses.png
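
To inspect such a folder programmatically, something like the following could be used. This is only a sketch: `list_detector_files` is a hypothetical helper, not part of ADBox, and it assumes the layout shown above (one subfolder per detector id under siem_mtad_gat/assets/detector_models).

```python
from pathlib import Path

# Assumption: detectors live under this folder, one subfolder per detector id.
DETECTOR_ROOT = Path("siem_mtad_gat/assets/detector_models")


def list_detector_files(detector_id: str, root: Path = DETECTOR_ROOT) -> dict[str, list[str]]:
    """Group the files of one detector by top-level subfolder.

    Hypothetical helper for illustration; keys are expected to be
    'input', 'training' and 'prediction' for a detector that has been
    trained and used at least once.
    """
    detector_dir = root / detector_id
    if not detector_dir.is_dir():
        raise FileNotFoundError(f"No detector folder found at {detector_dir}")

    grouped: dict[str, list[str]] = {}
    for path in sorted(detector_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(detector_dir)
            # The first path component is the subfolder (input, training, ...).
            grouped.setdefault(relative.parts[0], []).append(relative.as_posix())
    return grouped
```

For the detector above, `list_detector_files("a77c773c-9e6f-4700-92f2-53c0e682f290")` would group the model, scaler and SPOT pickles under the `training` key.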

Below we describe the train and predict pipelines.

Training Pipeline

The main goal of the train pipeline is to create a new detector.

See the parameters in the Use Case Guide.

The train pipeline:

  1. Parse the use case and use-case input to get the training request.
  2. Initialize the managers, persistent objects during the pipeline to which specific tasks are delegated:
    • data storage,
    • data retrieval,
    • management of the POT object for dynamic threshold control.
  3. Ingest Wazuh data according to the use-case input (if the connection to Wazuh is not available, stored data are used).
  4. Transform the ingested data, including preprocessing.
  5. Train the MTAD-GAT model. This includes testing on a subset of the data (30%).
  6. Produce the train response.

Running the training pipeline should produce input and training folders under the id of the model.
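
The 30% test subset mentioned in step 5 can be illustrated with a small split helper. This is a sketch under assumptions: `chronological_split` is a hypothetical name, and it assumes the split is chronological (the last 30% of the rows, without shuffling), which is common for time series but is not confirmed by this page.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    rows: Sequence[T], test_fraction: float = 0.3
) -> Tuple[Sequence[T], Sequence[T]]:
    """Split time-ordered rows into (train, test) without shuffling.

    Assumption: the test subset is the most recent `test_fraction`
    of the data; ADBox's actual preprocessing may differ.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    test_size = round(len(rows) * test_fraction)
    split_at = len(rows) - test_size
    return rows[:split_at], rows[split_at:]


train_rows, test_rows = chronological_split(list(range(10)))
print(len(train_rows), len(test_rows))
```

Keeping the order intact means the model is tested on data that is strictly later than anything it was trained on.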

flowchart LR
    Parse --> Init 
    Init--> Ingest
    Ingest --> Transform 
    Transform --> Train
    Train --> Respond

sequenceDiagram %%% some intermediate checks are not included

    participant p0 as Main
    participant p1 as :ADEngine
    participant p2 as :Detector

    participant p3 as :DataStorageManager
    participant p10 as :DataRetrievalManager
    participant p8 as :SPOTManager

    participant p4 as :WazuhDataIngestor

    participant p5 as :data_transformer.DataTypeTransformer
    participant p6 as :data_transformer.DataPreprocessor
    
    participant p7 as train_mtad_gat


    
    p0->>+ p1: train(index_date, detector_name)

    p1 ->>+ p2:  get_detector_training_input(index_date, detector_name,default_config, custom_config_file)
    p2 -->>- p1: return training_request


    p1 ->> +p3: init DataStorageManager
    p1 ->> +p3:  save_detector_input_parameters(training_request)
    deactivate p3
    p1 ->> +p10: init DataRetrievalManager with data_storage_manager.uuidd


    p1 ->>+ p3:  save_yaml_train(training_request)
	  p3 -->>- p1: : return use_case_no

    p1 ->> p1: check if required keys exist in training_request else raise exception

    %% ingestion
    p1 ->> +p4:  get_training_data(training_request)
    p4 -->>- p1: : return input_data, datasource_name

    %% transformation
    %%%% Convert data types into proper format 
    p1 ->>+ p5:  transform_data_types(input_data,...)
    p5 -->>- p1: : return value
    %%%% preprocessing 
    p1 ->>+ p6:  preprocess(input_data, training_request,...)
    p6 -->>- p1: : return train_data, test_data, train_stamps, test_stamps, column_names

    %% SPOT
    p1 ->> +p8: init SPOTManager

    %%%$ train model
    p1 ->>+ p7:  train_MTAD_GAT(train_data, test_data,<br>train_stamp,test_stamps,training_request.get("train_config")) 
    Note left of p7: Internal calls to Managers and MTAD_GAT package
    p7 -->>- p1: : return train_response

    p1 ->> p1: define updated_detector_input_parameters using train_response

    p1 ->>+ p3:  update_detector_input_parameters_after_training(updated_detector_input_parameters)
    deactivate p3

    p1 ->> p1: Init response dict

    p1 ->>+ p8:  save_all_train()
    deactivate p8

    p1 ->> p8:  destroy_instance()
    deactivate p8
    
	  p1 ->> p10:  destroy_instance()
    deactivate p10

  	p1 ->> p3:  destroy_instance()
    deactivate p3

    p1 -->> -p0: response
 

Train function input explanation

train: The train method trains a detector to be used for detection using the default or given arguments. It takes the following inputs:

  1. index_date: The date string that the detector uses to fetch data from the corresponding Wazuh index. The input format is the same as the date format of Wazuh indices, i.e., ‘YYYY-MM-DD’, and may contain an asterisk (*) at any position. For example, to fetch the data for the month of July 2024, the input can be “2024-07-*”. If no date is provided, or “default” is given, the default index_date is used, which is the index date for the current month. So by default, the detector trains on all the data from the current month.
  2. detector_name: The display name for the detector. If no name is provided, or “default” is given, the detector is named using the default naming format, i.e., ‘detector_'.
  3. default_config: A boolean input value specifying whether the detector should use the default training inputs from the /assets/default_configs/default_detector_input_config.json file. It is True by default; if set to False, a custom input config must be provided for training.
  4. custom_config: If default_config is set to False, the custom input config must be provided in a YAML file. ADBox reads the configs from this file and trains a detector using those values. Keep in mind that the custom input config must use the same key names as the default input config; otherwise, the default values are still used.
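
The default rule for index_date described above can be sketched as follows. The function name `resolve_train_index_date` and its `today` parameter are hypothetical and only illustrate the documented behavior; they are not taken from the ADBox code.

```python
from datetime import date
from typing import Optional


def resolve_train_index_date(
    index_date: Optional[str] = None, today: Optional[date] = None
) -> str:
    """Return the index date pattern to train on.

    Illustrative only: a missing value or the string "default" falls
    back to every day of the current month, e.g. "2024-07-*".
    """
    if index_date is None or index_date == "default":
        today = today or date.today()
        return f"{today:%Y-%m}-*"
    # Any explicit value, with or without wildcards, is passed through.
    return index_date


print(resolve_train_index_date("default", today=date(2024, 7, 15)))
```

Passing `today` explicitly keeps the helper deterministic, which is convenient when testing around month boundaries.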

Prediction Pipeline

The main goal of the prediction pipeline is to use a detector to find anomalies in a selected time frame.

Notice that, at prediction time, the parameters of the detector, such as window size, granularity, and feature aggregation methods, cannot be selected, as they are inherent properties of the detector.

See parameters in the use case guide.

The prediction pipeline:

  1. Parse the use case and use-case input to get the prediction request, including the detector id and run mode.
  2. Initialize the managers, persistent objects during the pipeline to which specific tasks are delegated:
    • data storage,
    • data retrieval,
    • management of the POT object for dynamic threshold control.
  3. Depending on the run mode and the use-case prediction parameters, run the following actions one or more times:
    1. Wazuh data ingestion.
    2. Transformation of ingested data, including preprocessing.
    3. Apply MTAD-GAT model for prediction.
    4. Produce the predict response.

Running the prediction pipeline should produce files in the prediction folder under the id of the detector.

flowchart LR
    Parse --> Init 
    Init--> Ingest
    Ingest --> Transform 
    Transform --> Predict
    Predict --> Respond
    Respond --> Ingest
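
The loop from Respond back to Ingest can be sketched as a generator that repeats one ingest-transform-predict cycle. This is a simplified illustration, not the ADBox implementation: `run_cycles` and `max_cycles` are hypothetical, and the sketch assumes that HISTORICAL runs a single cycle while the other run modes repeat it.

```python
from typing import Callable, Dict, Iterator


def run_cycles(
    run_mode: str, cycle: Callable[[], Dict], max_cycles: int = 3
) -> Iterator[Dict]:
    """Yield one response per ingest-transform-predict cycle.

    Assumptions: HISTORICAL performs exactly one cycle; BATCH and
    REALTIME repeat the cycle. `max_cycles` is a hypothetical bound
    so that this example terminates.
    """
    if run_mode == "HISTORICAL":
        yield cycle()
        return
    for _ in range(max_cycles):
        yield cycle()


def demo_cycle() -> Dict:
    # Stand-in for ingestion, transformation and prediction.
    return {"anomalies": []}


for response in run_cycles("BATCH", demo_cycle, max_cycles=2):
    print(response)
```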

Predict function input explanation

predict: The predict method performs anomaly detection using the trained detectors with default or specified parameters. The predict function of ADEngine takes the following inputs:

  1. run_mode: Specifies the detection run mode. It can take three values:
    • HISTORICAL: performs detection on historical data.
    • BATCH: performs detection on data in batches (in a real-time loop).
    • REALTIME: performs detection on (almost) real-time data.
    If no value is provided, or “default” is specified, the default run mode, HISTORICAL, is used.
  2. index_date: The date string that the detector uses to fetch data from the corresponding Wazuh index. The input format is the same as the date format of Wazuh indices, i.e., ‘YYYY-MM-DD’, and may contain an asterisk (*) at any position. For example, to fetch the data for the month of July 2024, the input can be “2024-07-*”. If no date is provided, or “default” is given, the default index_date is used, which is the index date for the current day. So by default, the detector performs detection on all the data from the current day.
  3. detector_id: The id of the detector selected for detection. If no id is given, or “default” is specified, the most recently trained detector is used.
  4. start_time: The start time for detection. It should be a timestamp string in ‘YYYY-MM-DDTHH:MM:SSZ’ format. If not provided, or “default” is specified, it is set to the starting timestamp of the current date.
  5. end_time: The end time for detection. It should be a timestamp string in ‘YYYY-MM-DDTHH:MM:SSZ’ format. If not provided, or “default” is specified, it is set to the current timestamp.
    Note that for the BATCH and REALTIME modes, the start and end times are not required.
  6. batch_size: The batch size for the BATCH run mode. It should be given as an integer. If not provided, the default batch size of 10 is used. Note that the batch size is not required for the other two run modes (HISTORICAL and REALTIME).
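
The default rules above can be summarized in a short sketch. The names `PredictInputs` and `resolve_predict_inputs` are hypothetical, and the use of UTC is an assumption based on the ‘Z’ suffix of the timestamp format. The detector_id default (most recently trained detector) is left out because it depends on the stored detectors.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
RUN_MODES = ("HISTORICAL", "BATCH", "REALTIME")
DEFAULT_BATCH_SIZE = 10


@dataclass
class PredictInputs:  # hypothetical container, not an ADBox class
    run_mode: str
    index_date: str
    start_time: Optional[str]
    end_time: Optional[str]
    batch_size: Optional[int]


def is_default(value: object) -> bool:
    """A missing value and the string "default" both select the default."""
    return value is None or value == "default"


def resolve_predict_inputs(run_mode=None, index_date=None, start_time=None,
                           end_time=None, batch_size=None,
                           now: Optional[datetime] = None) -> PredictInputs:
    # Assumption: timestamps are UTC, as the trailing 'Z' suggests.
    now = now or datetime.now(timezone.utc)
    mode = "HISTORICAL" if is_default(run_mode) else run_mode
    if mode not in RUN_MODES:
        raise ValueError(f"run_mode must be one of {RUN_MODES}, got {mode!r}")

    if is_default(index_date):
        index_date = now.strftime("%Y-%m-%d")  # index date of the current day

    if mode == "HISTORICAL":
        if is_default(start_time):
            midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
            start_time = midnight.strftime(TIME_FORMAT)
        if is_default(end_time):
            end_time = now.strftime(TIME_FORMAT)
    else:
        # Start and end times are not required for BATCH and REALTIME.
        start_time = end_time = None

    if mode == "BATCH":
        size = DEFAULT_BATCH_SIZE if is_default(batch_size) else int(batch_size)
    else:
        size = None  # batch size only applies to BATCH mode
    return PredictInputs(mode, index_date, start_time, end_time, size)


print(resolve_predict_inputs(run_mode="BATCH"))
```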