# ADBox Result Visualizer 
This Jupyter notebook provides a detailed visualization of the results of the ADBox, a custom anomaly-based intrusion detection component used in the IDPS-ESCAPE system. The ADBox integrates with Wazuh to detect anomalies in time-series data using the MTAD-GAT algorithm. 
This notebook is designed to present the results of training the anomaly detection model and its subsequent results during prediction.


## Prerequisites and Guidelines 
This notebook is designed for visualizing the results of anomaly detection using ADBox, not for executing the ADBox itself. To successfully visualize the results:

1. Ensure ADBox Training and Prediction Outputs: Complete the training and prediction processes with ADBox separately.

2. Provide File Paths: Supply the notebook with the file paths where the ADBox output files are stored. These file paths are necessary for the notebook to access and display the results accurately.

Without the correct file paths, the notebook will be unable to print and plot the desired results.
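
A quick pre-flight check can make missing outputs easier to spot before any cell fails. The sketch below is not part of ADBox: `missing_paths` is a hypothetical helper, and the example paths are placeholders that only mirror the folder layout described later in this notebook.

```python
import os

def missing_paths(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Hypothetical placeholders: replace them with your own ADBox output paths.
expected = [
    os.path.join("assets", "detector_models", "<detector_id>", "training", "train_output.pkl"),
    os.path.join("assets", "detector_models", "<detector_id>", "prediction"),
]
for path in missing_paths(expected):
    print(f"Missing ADBox output: {path}")
```

An empty result means every listed path exists, so the later cells should at least be able to open their files.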

## ADBox 
ADBox is the anomaly detection (AD) subsystem of IDPS-ESCAPE. It handles the ingestion, preprocessing, and evaluation of data collected by the SIEM. ADBox takes data from the events indexer, applies an anomaly detection model, and returns the results. Using MTAD-GAT (Multivariate Time-series Anomaly Detection via Graph Attention Network), a self-supervised learning framework, ADBox detects anomalies in multivariate time-series data through machine learning. 

It operates in two stages, training and prediction, which correspond to the two fundamental stages of a machine learning workflow: 

##### Training Stage
1. **Data Ingestion:**
During training, ADBox gathers time-series data from Wazuh. This data is used to teach the model how to identify normal patterns and distinguish them from anomalous behavior.
2. **Preprocessing:**
The collected data is preprocessed to ensure it is in a format suitable for training. This might include normalization, handling missing values, or reshaping the data.
3. **Model Training:**
The MTAD-GAT algorithm is applied to the preprocessed time-series data. 
4. **Visualization:**
The notebook includes visualizations such as loss curves to illustrate the training performance and validate the model's effectiveness.
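
ADBox's actual preprocessing code is not shown in this notebook. As a rough illustration of two of the operations named in the preprocessing step (handling missing values and normalization), here is a minimal sketch on a plain list; the function names and the sample values are assumptions made for this example.

```python
def fill_missing(values, fill_value=0.0):
    """Replace None entries with a fixed padding value."""
    return [fill_value if v is None else v for v in values]

def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw feature values with one missing entry.
raw = [2.0, None, 6.0, 4.0]
clean = min_max_normalize(fill_missing(raw, fill_value=2.0))
print(clean)  # [0.0, 0.0, 1.0, 0.5]
```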

##### Prediction/Detection Stage
1. **Data Ingestion:**
For prediction, the ADBox ingests new time-series data from Wazuh. 
2. **Preprocessing:**
The prediction input data is also preprocessed. 
3. **Anomaly Detection:**
The trained MTAD-GAT model is applied to the data to detect anomalies. It uses the learned patterns from the training phase to identify deviations from expected behavior. The model provides predictions indicating whether each time window in the data is normal or anomalous.
4. **Visualization:**
Visualizations in this notebook include time-series plots with detected anomalies highlighted and anomaly scores over time. 

### Imports 

In [None]:
# Imports 
import pandas as pd
from datetime import datetime, timezone
import plotly.express as px
import plotly.graph_objects as go
import re
from plotly.subplots import make_subplots
import json 
import os 
import yaml
from IPython.display import Image, display


### Variables and paths to be modified 
The following variables should be updated before using the notebook. 
1. `base_path`: the path where the assets directory is located. 
2. `use_case`: the number of the use case that was used for training or prediction. 
3. `detector_id`: the id of the detector that was trained or used for prediction. The notebook looks for the output files in the folder named after this detector id, so it must be correct. 

In [None]:
# File paths to be modified 
base_path = "/home/alab/siem-mtad-gat/siem_mtad_gat"
use_case = "9"
detector_id = "2d36a80a-c47a-4eb4-bb3e-5b2bfb90dc95"

### Variables and paths not to be modified 


In [None]:
# File paths not to be modified 
yaml_file_path = os.path.join(base_path, f'assets/drivers/uc_{use_case}.yaml') 
detector_root_directory = os.path.join(base_path, f'assets/detector_models/{detector_id}') 
detector_input_parameters_path = os.path.join(detector_root_directory, 'input/detector_input_parameters.json') 
training_config_path = os.path.join(detector_root_directory, 'input/training_config.json')
train_losses_image_path = os.path.join(detector_root_directory, 'training/train_losses.png') 
validation_losses_image_path = os.path.join(detector_root_directory, 'training/validation_losses.png') 
train_output_path = os.path.join(detector_root_directory, 'training/train_output.pkl') 
prediction_directory = os.path.join(detector_root_directory, 'prediction')
files = os.listdir(prediction_directory) 
predicted_anomalies_data_pattern = re.compile(rf'uc-{use_case}_predicted_anomalies_data-(\d+)_.*\.json')
predicted_data_pattern = re.compile(rf'uc-{use_case}_predicted_data-(\d+)_.*\.json')
max_number = -1
predicted_data_pattern_max_file = None
for file in files:
    match = predicted_data_pattern.match(file)
    if match:
        number = int(match.group(1))
        if number > max_number:
            max_number = number
            predicted_data_pattern_max_file = file
    
if predicted_data_pattern_max_file is not None: 
    prediction_output_path = os.path.join(prediction_directory, predicted_data_pattern_max_file) 
    print(f"The prediction file used for visualization will be {predicted_data_pattern_max_file}.")
else: 
    # Keep the name defined so that later cells do not fail with a NameError
    prediction_output_path = None
    print("No prediction file found.")

# Running the ADBox with a use case 
In the context of ADBox, a use case is a single file that consolidates the training or prediction inputs given to ADBox. This provides a simple way of interacting with ADBox, since all the required values are supplied in one file. 
These files are user-defined and should contain the right keys; the values of those keys are then used by ADBox instead of the default values. 

If training or prediction is performed using a use case file, the contents of that file are the inputs provided to training and prediction. 
A training and detection use case can be run by passing the `-uc` flag along with a number to the script that runs the ADBox Docker container. 
The yaml file should be present in the `/siem_mtad_gat/assets/drivers/` folder. 
```sh
./adbox.sh -uc {use_case} 
``` 
With this input, the ADBox will take the inputs specified in the `uc_{use_case}.yaml` file. 
When the script runs, it performs a training and prediction cycle if both keys are present in the yaml file. 
The contents of the yaml file can be read to see the inputs. 

In [None]:
# Load the YAML file 
with open(yaml_file_path, 'r') as file:
    yaml_content = yaml.safe_load(file)
    
yaml_content

### Contents of the yaml file 
The contents of the yaml file include the input parameters for training and prediction/anomaly detection. 
#### Training input parameters 
The training input parameters include: 
1. **index_date**: The data source index from which the training data is fetched. If set to `default`, the default index_date is used, which is the index date for the current month. 
2. **categorical_features**: Specifies whether the given input features include categorical features. 
3. **columns**: The list of columns used as features to train the detector. 
4. **aggregation**: Specifies whether the column values should be aggregated. 
5. **aggregation_config**: Required when aggregation is set to `True`. It specifies how ADBox performs the aggregation and contains: 
    1. **fill_na_method**: The fill method used to handle null values. 
    2. **padding_value**: Only required when the fill_na_method is `Fixed`. 
    3. **granularity**: The granularity at which the input data is aggregated. 
    4. **features**: Key-value pairs mapping each feature to the method used to aggregate it. 
6. **train_config**: The training configuration. It contains the `window_size` and `epochs`. 
7. **display_name**: Specifies a name for the detector. 

#### Prediction/detection input parameters 
The prediction input parameters include: 
1. **run_mode**: The detection run mode. 
2. **index_date**: The date string that the detector uses to fetch data from the corresponding Wazuh index. If set to `default`, the default index_date is used, which is the index date for the current day. 
3. **detector_id**: The ID of the detector selected for detection. If set to `default`, the most recently trained detector is used. 
4. **start_time**: The start time for detection. If set to `default`, it is the starting timestamp of the current date.
5. **end_time**: The end time for detection. If set to `default`, it is the current timestamp of the current date. 
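
To make the shape of these inputs concrete, the sketch below writes a use case as the dictionary that `yaml.safe_load` would return and reports which documented keys are absent. The top-level key names (`training`, `prediction`), every value, and the helper `missing_keys` are assumptions for illustration; only the parameter names come from the lists above.

```python
# Parameter names taken from the lists above.
TRAINING_KEYS = {
    "index_date", "categorical_features", "columns", "aggregation",
    "aggregation_config", "train_config", "display_name",
}
PREDICTION_KEYS = {"run_mode", "index_date", "detector_id", "start_time", "end_time"}

# Hypothetical use case content; all values are placeholders.
example_use_case = {
    "training": {
        "index_date": "default",
        "categorical_features": False,
        "columns": ["feature_a"],
        "aggregation": True,
        "aggregation_config": {
            "fill_na_method": "Fixed",
            "padding_value": 0,
            "granularity": "<granularity>",
            "features": {"feature_a": "<aggregation_method>"},
        },
        "train_config": {"window_size": 10, "epochs": 5},
        "display_name": "example-detector",
    },
    "prediction": {"run_mode": "<run_mode>", "detector_id": "default"},
}

def missing_keys(section, expected):
    """Return the expected keys that are absent from a config section, sorted."""
    return sorted(expected - set(section))

print(missing_keys(example_use_case["training"], TRAINING_KEYS))
print(missing_keys(example_use_case["prediction"], PREDICTION_KEYS))
```

Keys reported as missing are not necessarily errors: as described earlier, ADBox uses default values for inputs that the use case file does not provide.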

## Training 
In the training process, ADBox trains a model using the provided specifications. Each trained detector is identified by a unique id, and all of its generated artifacts are stored in a folder named after the detector id. 
Listing the contents of that folder shows the following subfolders and files. 

In [None]:
for root, dirs, files in os.walk(detector_root_directory):
    # Print the directory name relative to root_dir
    print(f"Directory: {os.path.relpath(root, os.path.dirname(detector_root_directory))}")
    for file in files:
        # Print the file name relative to root_dir
        print(f"  File: {os.path.relpath(os.path.join(root, file), detector_root_directory)}")
    for dir in dirs:
        # Print the subdirectory name relative to root_dir
        print(f"  Subdirectory: {os.path.relpath(os.path.join(root, dir), detector_root_directory)}")

The folder generated for the trained detector after running ADBox contains three subfolders: input, training and prediction. 
#### 1. input
The input folder contains the following two files. 
##### a. detector_input_parameters.json: 
This file is generated from the training input parameters provided in the yaml file. Reading it shows that it contains the same fields as defined above for the yaml file, plus some fields that were added after the detector was trained. 


In [None]:
# Read and print the contents of the JSON file
with open(detector_input_parameters_path, 'r') as file:
    data = json.load(file)
    print(json.dumps(data, indent=4))  # Pretty-print the JSON data

### Contents of the detector_input_parameters.json file 
The fields of this file have already been described in the yaml file section. Of the additional fields, `created_time` is the time at which the detector was created, and `model_info` provides some details about the training of the model. 

##### b. training_config.json: 
This file contains the machine-learning-level training parameters. 
It is critical for configuring how the MTAD-GAT model is trained, and includes parameters that define the model architecture, the training process, and other hyperparameters. 

This file should only be changed by users who have sufficient knowledge of machine learning algorithms and of how these parameters work. 

In [None]:
# Read and print the contents of the JSON file
with open(training_config_path, 'r') as file:
    data = json.load(file)
    print(json.dumps(data, indent=4))  # Pretty-print the JSON data

### Contents of the training_config.json file 

1. **window_size**: This parameter likely determines the size of the time window used for time-series data input.
2. **spec_res**: Indicates whether to use spectral residuals, often used for anomaly detection in time-series data.
3. **kernel_size**: The size of the kernel (filter) in convolutional layers, which impacts feature extraction.
4. **use_gatv2**: Specifies whether to use a version 2 Graph Attention Network (GATv2), a type of neural network architecture for graph-based data.
5. **feat_gat_embed_dim**: The embedding dimension for the feature-based Graph Attention Network, if applicable.
6. **time_gat_embed_dim**: The embedding dimension for the time-based Graph Attention Network, if applicable.
7. **gru_n_layers**: The number of layers in the Gated Recurrent Unit (GRU) network, a type of recurrent neural network.
8. **gru_hid_dim**: The hidden dimension size of the GRU layers.
9. **fc_n_layers**: The number of fully connected (dense) layers in the neural network.
10. **fc_hid_dim**: The hidden dimension size of the fully connected layers.
11. **recon_n_layers**: The number of layers in the reconstruction part of the network, possibly for autoencoders.
12. **recon_hid_dim**: The hidden dimension size of the reconstruction layers.
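13. **alpha**: A model coefficient. In the open-source PyTorch implementation of MTAD-GAT it is the negative slope of the LeakyReLU activation used in the graph attention layers, so it is most likely not a learning-rate setting.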
14. **epochs**: The number of epochs, or full passes through the training dataset.
15. **val_split**: The fraction of data to be used for validation.
16. **bs**: Batch size, or the number of samples per gradient update.
17. **init_lr**: Initial learning rate for the optimizer.
18. **shuffle_dataset**: Whether to shuffle the dataset before each epoch.
19. **dropout**: The dropout rate, a regularization technique to prevent overfitting by randomly setting a fraction of input units to zero during training.
20. **use_cuda**: Indicates whether to use GPU acceleration (CUDA).
21. **print_every**: Frequency of printing training progress.
22. **log_tensorboard**: Whether to log metrics to TensorBoard, a visualization tool.
23. **scale_scores**: Indicates if the scores should be scaled.
24. **use_mov_av**: Whether to use moving average smoothing.
25. **gamma**: In MTAD-GAT, the weight that balances the forecasting-based error and the reconstruction-based error when they are combined into the anomaly score.
26. **level**: A parameter of the Peaks-Over-Threshold (POT) thresholding method, most likely the quantile used for the initial threshold.
27. **q**: A parameter of the POT thresholding method, most likely the risk (detection level) used to derive the final threshold.
28. **dynamic_pot**: Whether to use the dynamic variant of POT thresholding. Here "POT" stands for Peaks-Over-Threshold. 
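
To make `window_size` concrete, here is a minimal sketch of a sliding window over a time series, where each window of `window_size` points is paired with the value that follows it. This is a common framing for one-step-ahead forecasting; whether ADBox builds its windows in exactly this way is an assumption, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(series, window_size):
    """Split a series into (window, next_value) pairs for one-step forecasting."""
    return [
        (series[i:i + window_size], series[i + window_size])
        for i in range(len(series) - window_size)
    ]

series = [1, 2, 3, 4, 5]
pairs = sliding_windows(series, window_size=3)
print(pairs)  # [([1, 2, 3], 4), ([2, 3, 4], 5)]
```

With this framing, a series of `n` points yields `n - window_size` training samples, so a larger window gives the model more context per sample but fewer samples overall.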

#### 2. training
The training folder contains the following files. 
##### a. train_output.pkl
This file contains the saved forecasts, reconstructions, actual values, thresholds, etc. for the training dataset in pickle format. 
##### b. test_output.pkl:  
This file contains the saved forecasts, reconstructions, actual values, thresholds, etc. for the testing dataset in pickle format. 
##### c. model.pt
This file contains the parameters of the trained model. A .pt file is a PyTorch file used to save and load model parameters, entire models, or tensor data. 
##### d. losses_train_data.json 
This file contains the training losses for each epoch. 
##### e. train_losses.png  
A plot of train loss during training. 

In [None]:
display(Image(filename=train_losses_image_path))

##### f. validation_losses.png 
A plot of validation loss during training. 

In [None]:
display(Image(filename=validation_losses_image_path))

In [None]:
train_df = pd.read_pickle(train_output_path)  
train_df

In [None]:
# Set to True to shade anomalous points in red
shaded=False

# Extracting timestamps and scores
timestamps = train_df.index
scores = train_df.get('A_Score_Global') 
# Global anomaly prediction (0 or 1) for each point
is_anomaly = train_df.get("A_Pred_Global")

# Global threshold
threshold = train_df.get("Thresh_Global")

# Create a DataFrame
df = pd.DataFrame({'Timestamp': timestamps, 'Score': scores, 'Is_Anomaly': is_anomaly, 'Threshold': threshold})


# Create the figure
fig = go.Figure()

# Add the score line with markers
fig.add_trace(go.Scatter(
    x=df['Timestamp'],
    y=df['Score'],
    mode='lines+markers',
    text=df['Score'].apply(lambda x: f'Anomaly Score: {x:.2f}'),  # Hover text
    hoverinfo='text',  # Show only text on hover
    line=dict(color='#d53e1f', width=1),  # Line color
    marker=dict(
        size=4,  # Marker size
        color='rgba(0,0,0,0)',  # Marker fill color (transparent)
        line=dict(
            width=1,  # Border width
            color='#d53e1f'  # Border color (red)
        ),
    ), 
    name='Anomaly Score'
))


# Add the threshold line
fig.add_trace(go.Scatter(
    x=df['Timestamp'],
    y=df['Threshold'],
    mode='lines+markers',
    line=dict(color='black', width=1),  # Black line
    name='Threshold', 
    marker=dict(
        size=4,  # Marker size
        color='rgba(0,0,0,0)',  # Marker fill color (transparent)
        line=dict(
            width=1,  # Border width
            color='black'  # Border color (black)
        ),
    ), 
    #showlegend=False  # Do not show legend for the anomaly line
))



if shaded:
    # Add shaded areas for anomalies (.iloc gives positional access regardless of the index type)
    for i in range(len(df)):
        if df['Is_Anomaly'].iloc[i] == 1:
            start = df['Timestamp'].iloc[i]
            end = df['Timestamp'].iloc[i + 1] if i + 1 < len(df) else df['Timestamp'].iloc[i]
            fig.add_shape(
                type="rect",
                x0=start,
                x1=end,
                y0=0,
                y1=1,
                xref='x',
                yref='paper',
                fillcolor='red',
                opacity=0.2,
                line_width=0,
            )

fig.update_layout(
    title='Anomaly Scores and Prediction Over Time',
    xaxis_title='Timestamp',
    yaxis_title='Anomaly Score',
    showlegend=True
)

fig.show() 

## Prediction  
After training is finished, the use case runs a detection cycle. 
#### 3. prediction 

The prediction folder contains the output JSON files of each prediction run. Each file name specifies the number of the use case and the timestamp at which the run took place. Each run generates two output files: one with the predicted data for all data points, and one with the predicted data only for the points that were predicted as anomalies. 

##### a. uc-{use_case}_predicted_data-{n}_{timestamp}.json 
This file contains predicted data for all the points that were used for prediction, and flags them as anomalies or not. 

##### b. uc-{use_case}_predicted_anomalies_data-{n}_{timestamp}.json 
This file contains the anomalies detected during the prediction run. 

The contents of both files are in the same format; they differ only in the value of the anomaly flag. The `uc-{use_case}_predicted_anomalies_data-{n}_{timestamp}.json` file contains only the points whose `is_anomaly` flag is `true`, whereas `uc-{use_case}_predicted_data-{n}_{timestamp}.json` contains points with both `true` and `false` flags. 

For plotting, `uc-{use_case}_predicted_data-{n}_{timestamp}.json` is used. For the same reason, only the fields of that file are explained; the other file has the same format.  
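
The relationship between the two files can be sketched as a simple filter on the `is_anomaly` flag. The helper name `only_anomalies` and the sample entries are hypothetical; real result entries carry more fields, which are described in the next section.

```python
def only_anomalies(results):
    """Keep the result entries whose is_anomaly flag is true."""
    return [r for r in results if r.get("is_anomaly")]

# Hypothetical, shortened result entries for illustration only.
sample_results = [
    {"timestamp": "2024-08-30T05:00:00Z", "is_anomaly": False, "score": 0.12},
    {"timestamp": "2024-08-30T05:01:00Z", "is_anomaly": True, "score": 0.97},
]
print(only_anomalies(sample_results))
```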

In [None]:
# Read and print the contents of the JSON file
with open(prediction_output_path, 'r') as file:
    prediction_output_data = json.load(file)
    print(json.dumps(prediction_output_data, indent=4))  

### Explaining the result fields 
Before plotting these values, here is a description of each field in the result object and its significance. Only the result object is discussed, because the fields above it are the same ones that were given as input for the prediction. 
 
- **timestamp:** The date and time of the data point, given as the start time of the window in which the data point was recorded. 
The output contains one data point per granularity interval. For example, if the granularity was 1 minute, the output contains a data point for every minute, because the data was aggregated using the input granularity. 

- **is_anomaly:** A boolean value indicating whether the data point is classified as an anomaly: `true` means that it is, `false` means that it is not. 
- **score:** The global anomaly score for the data point, which quantifies how anomalous the data is.  
- **prediction_values:** A dictionary containing various metrics related to the forecast, reconstruction, and anomaly scores: 
    For each feature, the dictionary contains the following values: 
    - **Forecast:** The forecast refers to the predicted values for the next timestamp in a time-series for the feature. 
    - **Recon:** The reconstructed value of the feature. 
    - **True:** The actual value of feature from the data.
    - **A_Score:** The anomaly score for the feature. 
    - **Thresh:** The threshold value used to determine if the feature is considered an anomaly. 
    - **A_Pred:** Anomaly prediction for the feature, where 0 indicates no anomaly and 1 indicates an anomaly.  It is 1 if its corresponding anomaly score is larger than or equal to the threshold. 

    And the following global values: 
    - **A_Score_Global:** The global anomaly score that aggregates the anomaly scores across all features. 
    - **Thresh_Global:** The global threshold used to determine if the aggregated anomaly score is considered an anomaly. 
    - **A_Pred_Global:** Anomaly prediction based on the global anomaly score. Here, 0 indicates no anomaly, and 1 indicates an anomaly. It is 1 if its corresponding anomaly score is larger than or equal to the threshold. 
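
The thresholding rule described for `A_Pred` and `A_Pred_Global` can be sketched as follows. The feature name `feature_a`, the sample numbers, and both helper functions are hypothetical; the per-feature key pattern `Thresh_{feature}` is assumed by analogy with `Thresh_Global`.

```python
def anomaly_prediction(score, threshold):
    """Return 1 when the score is larger than or equal to the threshold, else 0."""
    return 1 if score >= threshold else 0

def recompute_prediction(prediction_values, name):
    """Recompute A_Pred for a feature name (or 'Global') from its score and threshold."""
    return anomaly_prediction(
        prediction_values[f"A_Score_{name}"],
        prediction_values[f"Thresh_{name}"],
    )

# Hypothetical values for one data point.
prediction_values = {
    "A_Score_feature_a": 0.8, "Thresh_feature_a": 0.5,
    "A_Score_Global": 0.3, "Thresh_Global": 0.4,
}
print(recompute_prediction(prediction_values, "feature_a"))  # 1
print(recompute_prediction(prediction_values, "Global"))     # 0
```

Note that a single feature can exceed its own threshold while the global score stays below the global threshold, as in this example.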


# Visualizing the detected anomalies 
An anomaly refers to a data point or pattern in a dataset that significantly deviates from the expected or normal behavior. 
Here, the anomalies are data points with anomalous values in the features that were used to train the detector. 

The following cell fetches the results section from the output file; it is used for all the plots below. 
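
The cells in this section parse timestamps by replacing a trailing `Z` with `+00:00` before calling `datetime.fromisoformat`, because that function only accepts the `Z` suffix from Python 3.11 onwards. The sketch below isolates that step; `parse_timestamp` is a hypothetical helper and the sample timestamp is made up.

```python
from datetime import datetime, timezone

def parse_timestamp(value):
    """Parse an ISO 8601 string, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

ts = parse_timestamp("2024-08-30T05:00:00Z")
window_start = datetime(2024, 8, 30, 5, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 8, 30, 23, 0, tzinfo=timezone.utc)
print(window_start <= ts <= window_end)  # True
```

The filter bounds must be timezone-aware as well, since Python raises a `TypeError` when an aware datetime is compared with a naive one.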

In [None]:
predict_output = []
for entry in prediction_output_data: 
    predict_output.extend(entry.get('results', [])) 
    
    
# Specify start and end time for filtering (example values)
start_time = datetime(2024, 8, 30, 5, 0, tzinfo=timezone.utc)  # YYYY, MM, DD, HH, MM
end_time = datetime(2024, 8, 30, 23, 0, tzinfo=timezone.utc)    # YYYY, MM, DD, HH, MM

# Filter the data before converting it into a DataFrame
predict_output = [
    result for result in predict_output 
    if start_time <= datetime.fromisoformat(result['timestamp'].replace('Z', '+00:00')) <= end_time
] 



### Anomaly scores and Prediction over time 
The following chart plots the global anomaly score of each data point against its timestamp, together with the global threshold. The red areas mark the points that were predicted as anomalies. The optional `Is Anomaly` trace, which shows whether each individual point was flagged as an anomaly, is disabled in the cell below. 
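
The plotting cells draw one rectangle per anomalous point. As an optional alternative, consecutive anomalous points can first be merged into intervals so that fewer shapes are needed. This is a sketch only: `anomaly_intervals` is a hypothetical helper that is not used by the cells in this notebook, and the integer timestamps are placeholders.

```python
def anomaly_intervals(timestamps, flags):
    """Merge consecutive flagged points into (start, end) intervals.

    Each flagged point i covers [timestamps[i], timestamps[i + 1]]; a flagged
    last point covers a zero-width interval, as in the per-point rectangles.
    """
    intervals = []
    for i, flagged in enumerate(flags):
        if not flagged:
            continue
        start = timestamps[i]
        end = timestamps[i + 1] if i + 1 < len(timestamps) else timestamps[i]
        if intervals and intervals[-1][1] == start:
            # The previous interval ends where this one starts: extend it.
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals

print(anomaly_intervals([0, 1, 2, 3, 4], [0, 1, 1, 0, 1]))  # [(1, 3), (4, 4)]
```

Each resulting `(start, end)` pair could then be passed as `x0` and `x1` to a single `fig.add_shape` call with the same styling arguments used in the cells.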

In [None]:
# Use the filtered prediction results
results = predict_output
    

# Extracting timestamps and scores
timestamps = [datetime.fromisoformat(result['timestamp'].replace('Z', '+00:00')) for result in results]
scores = [result['score'] for result in results] 
# Convert is_anomaly to 0 or 1
is_anomaly = [1 if result['is_anomaly'] else 0 for result in results]

#threshold
threshold = [result.get("prediction_values").get("Thresh_Global") for result in results]

# Create a DataFrame
df = pd.DataFrame({'Timestamp': timestamps, 'Score': scores, 'Is_Anomaly': is_anomaly, 'Threshold': threshold})


# Create the figure
fig = go.Figure()

# Add the score line with markers
fig.add_trace(go.Scatter(
    x=df['Timestamp'],
    y=df['Score'],
    mode='lines+markers',
    text=df['Score'].apply(lambda x: f'Anomaly Score: {x:.2f}'),  # Hover text
    hoverinfo='text',  # Show only text on hover
    line=dict(color='#d53e1f', width=1),  # Line color
    marker=dict(
        size=4,  # Marker size
        color='rgba(0,0,0,0)',  # Marker fill color (transparent)
        line=dict(
            width=1,  # Border width
            color='#d53e1f'  # Border color (red)
        ),
    ), 
    name='Anomaly Score'
))
"""
# Add the is_anomaly line
fig.add_trace(go.Scatter(
    x=df['Timestamp'],
    y=df['Is_Anomaly'],
    mode='lines+markers',
    line=dict(color='green', width=1),  # Green line
    name='Is Anomaly', 
    marker=dict(
        size=4,  # Marker size
        color='rgba(0,0,0,0)',  # Marker fill color (transparent)
        line=dict(
            width=1,  # Border width
            color='green'  # Border color (green)
        ),
    ), 
    #showlegend=False  # Do not show legend for the anomaly line
))
"""
# Add the threshold line
fig.add_trace(go.Scatter(
    x=df['Timestamp'],
    y=df['Threshold'],
    mode='lines+markers',
    line=dict(color='black', width=1),  # Black line
    name='Threshold', 
    marker=dict(
        size=4,  # Marker size
        color='rgba(0,0,0,0)',  # Marker fill color (transparent)
        line=dict(
            width=1,  # Border width
            color='black'  # Border color (black)
        ),
    ), 
    #showlegend=False  # Do not show legend for the anomaly line
))


# Add shaded areas for anomalies
for i in range(len(df)):
    if df['Is_Anomaly'][i] == 1:
        start = df['Timestamp'][i]
        end = df['Timestamp'][i + 1] if i + 1 < len(df) else df['Timestamp'][i]
        fig.add_shape(
            type="rect",
            x0=start,
            x1=end,
            y0=0,
            y1=1,
            xref='x',
            yref='paper',
            fillcolor='red',
            opacity=0.2,
            line_width=0,
        )

fig.update_layout(
    title='Anomaly Scores and Prediction Over Time',
    xaxis_title='Timestamp',
    yaxis_title='Anomaly Score',
    showlegend=True
)

fig.show() 


#### Tabular View 
The same values can be seen in the following table. 

In [None]:
# Table
fig_table = go.Figure(data=[go.Table(
        header=dict(values=["Timestamp", "Score", "is Anomaly"]),
        cells=dict(values=[df['Timestamp'].astype(str), df['Score'].round(2), df['Is_Anomaly'].round(2)])
    )])

fig_table.update_layout(
        title='Anomaly Scores Table'
    ) 
    
fig_table.show()

### Anomaly scores and Prediction for each feature 
The following chart shows the individual anomaly scores for each feature which are plotted against the timestamp they were predicted at. The red area signifies the points which were predicted as anomalies overall. And the `Prediction` value shows if the individual point was flagged as an anomaly or not using that feature. 


In [None]:
results = predict_output

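# Extracting timestamps, prediction values and global anomaly flags
timestamps = [datetime.fromisoformat(result['timestamp'].replace('Z', '+00:00')) for result in results]
prediction_values_list = [result['prediction_values'] for result in results]
is_anomaly = [1 if result['is_anomaly'] else 0 for result in results]  # Used for the shaded areas below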

# Find all unique feature names
feature_names = set()
for prediction_values in prediction_values_list:
    for key in prediction_values.keys():
        match = re.match(r'(Forecast|Recon|True)_(.+)', key)
        if match:
            feature_names.add(match.group(2))


# Add traces for each feature
for feature in feature_names:  
    fig = go.Figure()        
    pred_data = [] 
    score_data = []
        
    # Collect per-feature anomaly predictions and anomaly scores
    for prediction_values in prediction_values_list:
        pred_data.append(prediction_values.get(f'A_Pred_{feature}'))
        score_data.append(prediction_values.get(f'A_Score_{feature}'))
            
    # Add Anomaly scores line
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=score_data, 
            mode='lines+markers',
            text=[f'Anomaly Score: {s:.2f}' if s is not None else '' for s in score_data],  # Hover text
            hoverinfo='text',  # Show only text on hover
            line=dict(color='#d53e1f', width=1),  # Line color
            marker=dict(
                size=4,  # Marker size
                color='rgba(0,0,0,0)',  # Marker fill color (transparent)
                line=dict(
                    width=1,  # Border width
                    color='#d53e1f'  # Border color (red)
                ),
            ), 
            name='Anomaly Score'
        ))
 
 
        
    # Add per-feature prediction line
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=pred_data,
            name="Prediction",
            mode='lines+markers',
            line=dict(color='green', width=1),  # Green line
            marker=dict(
                size=4,  # Marker size
                color='rgba(0,0,0,0)',  # Marker fill color (transparent)
                line=dict(
                    width=1,  # Border width
                    color='green'  # Border color (green)
                ),
            ), 
            #showlegend=False  # Do not show legend for the anomaly line
        )) 
    
  
    # Add shaded areas for anomalies
    for i in range(len(is_anomaly)):
        if is_anomaly[i] == 1:
            start = timestamps[i]
            end = timestamps[i + 1] if i + 1 < len(timestamps) else timestamps[i]
            fig.add_shape(
                type="rect",
                x0=start,
                x1=end,
                y0=0,
                y1=1,
                xref='x',
                yref='paper',
                fillcolor='red',
                opacity=0.2,
                line_width=0,
            )

    # Update layout
    fig.update_layout(
        title_text=f"Anomaly scores and Prediction for {feature}",
        height=400,  # Fixed height for each feature plot
        width=1100,
        showlegend=True
    )

    # Show plot
    fig.show() 
    

### Forecast and reconstruction vs true values for each feature 
The following chart shows the forecast and reconstruction vs true values which are plotted against the timestamp they were predicted at for each feature.  The red area signifies the points which were predicted as anomalies overall. 

In [None]:
results = predict_output

# Extracting timestamps and prediction values
timestamps = [datetime.fromisoformat(result['timestamp'].replace('Z', '+00:00')) for result in results]
prediction_values_list = [result['prediction_values'] for result in results]
is_anomaly = [1 if result['is_anomaly'] else 0 for result in results]


# Find all unique feature names
feature_names = set()
for prediction_values in prediction_values_list:
    for key in prediction_values.keys():
        match = re.match(r'(Forecast|Recon|True)_(.+)', key)
        if match:
            feature_names.add(match.group(2))


# Plot each feature
for feature in feature_names:
    fig = go.Figure()        
    forecast_data = []
    recon_data = [] 
    true_data = []
        
    
    for prediction_values in prediction_values_list:
        forecast_data.append(prediction_values.get(f'Forecast_{feature}'))
        recon_data.append(prediction_values.get(f'Recon_{feature}'))
        true_data.append(prediction_values.get(f'True_{feature}'))

    
    # True values (rgba() is used because the colors carry an alpha channel)
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=true_data,
            name="True",
            line=dict(color="rgba(0, 204, 150, 0.5)", width=2),
        ),
    )

    # Forecast values
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=forecast_data,
            name="Forecast",
            line=dict(color="rgba(255, 127, 14, 1)", width=2),
        ),
    )

    # Reconstructed values
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=recon_data,
            name="Recon",
            line=dict(color="rgba(31, 119, 180, 1)", width=2),
        ),
    )

    # Add shaded areas for anomalies
    for i in range(len(is_anomaly)):
        if is_anomaly[i] == 1:
            start = timestamps[i]
            end = timestamps[i + 1] if i + 1 < len(timestamps) else timestamps[i]
            fig.add_shape(
                type="rect",
                x0=start,
                x1=end,
                y0=0,
                y1=1,
                xref='x',
                yref='paper',
                fillcolor='red',
                opacity=0.2,
                line_width=0,
            )

    # Update layout
    fig.update_layout(
        title_text=f"Forecast & reconstruction vs true values for {feature}",
        height=400,  # Fixed height for each per-feature figure
        width=1100,
        showlegend=True
    )

    # Show plot
    fig.show() 
    
    

### Anomaly scores and Thresholds over time 
The following chart plots the global anomaly score of each data point against the timestamp at which it was predicted. Red shaded regions mark the points that were flagged as anomalies overall, and the `Threshold` line shows the global threshold that each anomaly score was compared against.
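
The cells in this notebook draw one rectangle per anomalous point. As an optional alternative, the minimal sketch below merges consecutive anomalous points into intervals, so that a long anomalous run needs only one shaded shape. The helper name `anomaly_intervals` and the sample data are hypothetical and are not part of ADBox. The end of each interval follows the same convention as the plotting cells: the timestamp of the next point, or the last timestamp if the run reaches the end of the series.

```python
from datetime import datetime, timedelta


def anomaly_intervals(timestamps, flags):
    """Merge runs of consecutive flagged points into (start, end) pairs."""
    intervals = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            # A new anomalous run begins at this point
            start = timestamps[i]
        elif not flagged and start is not None:
            # The run ends at the first non-anomalous point after it
            intervals.append((start, timestamps[i]))
            start = None
    if start is not None:
        # The run reaches the end of the series
        intervals.append((start, timestamps[-1]))
    return intervals


# Made-up sample data: six points, one minute apart
t0 = datetime(2024, 1, 1)
sample_timestamps = [t0 + timedelta(minutes=m) for m in range(6)]
sample_flags = [0, 1, 1, 0, 0, 1]

intervals = anomaly_intervals(sample_timestamps, sample_flags)
for start, end in intervals:
    print(start.isoformat(), "->", end.isoformat())
```

Each `(start, end)` pair could then be passed as `x0` and `x1` to `fig.add_shape`, in the same way as the per-point rectangles.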


In [None]:
# Prediction results to plot
results = predict_output
    

# Extracting timestamps and scores
timestamps = [datetime.fromisoformat(result['timestamp'].replace('Z', '+00:00')) for result in results]
scores = [result['score'] for result in results] 
thresholds = [result['prediction_values']['Thresh_Global'] for result in results] 
# Convert is_anomaly to 0 or 1
is_anomaly = [1 if result['is_anomaly'] else 0 for result in results]


# Create a DataFrame
df = pd.DataFrame({'Timestamp': timestamps, 'Score': scores, 'Is_Anomaly': is_anomaly, "Threshold": thresholds})


# Create the figure
fig = go.Figure()

# Add the score line with markers
fig.add_trace(go.Scatter(
    x=df['Timestamp'],
    y=df['Score'],
    text=df['Score'].apply(lambda x: f'Anomaly Score: {x:.2f}'),  # Hover text
    mode='lines+markers',
    hoverinfo='text',  # Show only text on hover
    line=dict(color='#d53e1f', width=1),  # Line color
    marker=dict(
        size=4,  # Marker size
        color='rgba(0,0,0,0)',  # Marker fill color (transparent)
        line=dict(
            width=1,  # Border width
            color='#d53e1f'  # Border color (red)
        ),
    ),
    name='Anomaly Score'
))



# Add the threshold line
fig.add_trace(go.Scatter(
    x=df['Timestamp'],
    y=df['Threshold'],
    name="Threshold",
    line=dict(color="black", width=1, dash="dash"),
))

# Add shaded areas for anomalies
for i in range(len(df)):
    if df['Is_Anomaly'][i] == 1:
        start = df['Timestamp'][i]
        end = df['Timestamp'][i + 1] if i + 1 < len(df) else df['Timestamp'][i]
        fig.add_shape(
            type="rect",
            x0=start,
            x1=end,
            y0=0,
            y1=1,
            xref='x',
            yref='paper',
            fillcolor='red',
            opacity=0.2,
            line_width=0,
        )

fig.update_layout(
    title='Anomaly Scores and Thresholds Over Time',
    xaxis_title='Timestamp',
    yaxis_title='Anomaly Score',
    showlegend=True
)

fig.show() 


### Anomaly scores and Thresholds for each feature
The following charts plot, for each feature, its individual anomaly score against the timestamp at which it was predicted. Red shaded regions mark the points that were flagged as anomalies overall (not per feature), and the `Threshold` line shows the feature-specific threshold that the feature's anomaly score was compared against.
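
To see which features contribute at a given point, the per-feature scores can be compared with their thresholds directly. The sketch below is a hypothetical helper, not part of ADBox. It assumes that each `prediction_values` dictionary holds `A_Score_<feature>` and `Thresh_<feature>` keys, as read by the plotting cells in this notebook, and that a score strictly above its threshold counts as exceeding it. The feature names and values in the sample are made up for illustration.

```python
def features_over_threshold(prediction_values):
    """Return the sorted names of features whose score exceeds their threshold."""
    prefix = "A_Score_"
    exceeded = []
    for key, score in prediction_values.items():
        if not key.startswith(prefix):
            continue
        feature = key[len(prefix):]
        thresh = prediction_values.get(f"Thresh_{feature}")
        # Skip features with a missing score or threshold
        if score is None or thresh is None:
            continue
        if score > thresh:  # Assumption: strictly greater means "exceeds"
            exceeded.append(feature)
    return sorted(exceeded)


# Made-up sample entry with two illustrative features
sample_prediction_values = {
    "A_Score_cpu": 0.9,
    "Thresh_cpu": 0.5,
    "A_Score_mem": 0.1,
    "Thresh_mem": 0.4,
    "Thresh_Global": 0.7,
}

over = features_over_threshold(sample_prediction_values)
print(over)
```

Applying this helper to every entry of `prediction_values_list` would give, for each timestamp, the features whose own score is above their own threshold.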

In [None]:
results = predict_output

# Extracting timestamps and prediction values
timestamps = [datetime.fromisoformat(result['timestamp'].replace('Z', '+00:00')) for result in results]
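prediction_values_list = [result['prediction_values'] for result in results]
# Overall anomaly flags (0 or 1), used for the shaded regions below
is_anomaly = [1 if result['is_anomaly'] else 0 for result in results]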

# Find all unique feature names
feature_names = set()
for prediction_values in prediction_values_list:
    for key in prediction_values.keys():
        match = re.match(r'(Forecast|Recon|True)_(.+)', key)
        if match:
            feature_names.add(match.group(2))


# Add traces for each feature
for feature in sorted(feature_names):  # Sorted so the plots appear in a stable order
    fig = go.Figure()        
    thresh_data = [] 
    score_data = []
        
    # Collect data for Threshold and Anomaly scores
    for prediction_values in prediction_values_list:
        thresh_data.append(prediction_values.get(f'Thresh_{feature}'))
        score_data.append(prediction_values.get(f'A_Score_{feature}'))
            
    # Add Anomaly scores line
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=score_data,
            mode='lines+markers',
            text=[f'Anomaly Score: {s:.2f}' if s is not None else '' for s in score_data],  # Hover text
            hoverinfo='text',  # Show only text on hover
            line=dict(color='#d53e1f', width=1),  # Line color
            marker=dict(
                size=4,  # Marker size
                color='rgba(0,0,0,0)',  # Marker fill color (transparent)
                line=dict(
                    width=1,  # Border width
                    color='#d53e1f'  # Border color (red)
                ),
            ), 
            name='Anomaly Score'
        ))
    

    # Add Threshold line
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=thresh_data,
            name="Threshold",
            line=dict(color="black", width=1, dash="dash"), 
        )
    )

    # Add shaded areas for anomalies
    for i in range(len(is_anomaly)):
        if is_anomaly[i] == 1:
            start = timestamps[i]
            end = timestamps[i + 1] if i + 1 < len(timestamps) else timestamps[i]
            fig.add_shape(
                type="rect",
                x0=start,
                x1=end,
                y0=0,
                y1=1,
                xref='x',
                yref='paper',
                fillcolor='red',
                opacity=0.2,
                line_width=0,
            )

    # Update layout
    fig.update_layout(
        title_text=f"Anomaly scores and Thresholds for {feature}",
        height=400,  # Fixed height for each per-feature figure
        width=1100,
        showlegend=True
    )

    # Show plot
    fig.show() 
    

This notebook gives an overview of the outputs and artifacts that ADBox generates when run with a specific use case file. The explanations and plots above illustrate several ways of analyzing the algorithm's outputs and show how the values ADBox uses to flag anomalies (scores, thresholds, forecasts, and reconstructions) relate to one another at both the global and the per-feature level. Together they help users interpret the results and judge the effectiveness of the anomaly detection process.