Enhanced Attack Detection in In-Vehicle Networks: A Federated Learning Framework Approach

This section details the simulation platform, dataset characteristics, and evaluation metrics employed to assess the proposed scheme. Furthermore, it includes a thorough analysis of the findings derived from our proposed methodology.

Simulation Platform

The proposed algorithm was implemented and its performance evaluated on a high-performance machine equipped with an 11th Gen Intel® Core™ i9-11900H processor @ 2.50 GHz and 32 GB of RAM. To ensure efficient training of our Federated Learning (FL) scheme, an NVIDIA GeForce RTX 3080 Ti GPU with 16 GB of graphics memory was utilized. The entire scheme was developed and simulated in a Python-based environment, leveraging the Keras and TensorFlow libraries for the deep learning operations.

Dataset

Our framework’s effectiveness was rigorously tested using a recent dataset introduced by Kang et al. in the “Car Hacking: Attack & Defense Challenge” of 2020 [34]. This dataset expands upon the earlier “Car Hacking” dataset [35] and was designed to foster advancements in both attack and detection methodologies for Controller Area Networks (CAN). CAN is a widely adopted in-vehicle network standard. The competition centered around the Hyundai Avante CN7 vehicle. Consequently, the dataset comprises CAN network traffic data from the Avante CN7, encompassing both normal operational messages and malicious attack messages. The dataset is structured into: (1) an initial train/test dataset from the competition’s preliminary round, and (2) a final round dataset originating from the host’s attack session. It is substantial, containing 1,270,310 samples, of which 1,090,312 are normal traffic instances and 179,998 are identified as anomalies. The dataset categorizes these samples into five distinct classes: normal, flooding, spoofing, replay, and fuzzing attacks, providing a comprehensive range of scenarios for attack detection framework evaluation.

Hyperparameters

Hyperparameters are crucial settings that dictate the architecture of a neural network and govern its learning process. In our study, we maintained a consistent core structure based on Gated Recurrent Units (GRUs). To optimize the performance of our proposed scheme, we conducted extensive experiments on the “Car Hacking” dataset, exploring a broad spectrum of hyperparameter values. This iterative trial-and-error process allowed us to pinpoint effective ranges for these hyperparameters, and it remains a robust and frequently used approach in recent machine learning research, especially where exhaustive hyperparameter optimization is computationally expensive [36,37,38]. Table 4 summarizes the specific hyperparameters used for each GRU model in our experiments. The following subsections provide a detailed explanation of these hyperparameters.
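
For concreteness, the following sketch shows how a single GRU classifier of the kind summarized in Table 4 could be assembled in Keras; the layer width, feature count, and dropout rate are illustrative placeholders rather than the exact Table 4 values.

```python
import tensorflow as tf

def build_gru_model(window_size: int, n_features: int, n_classes: int = 5,
                    units: int = 64, dropout: float = 0.05,
                    learning_rate: float = 0.001) -> tf.keras.Model:
    """Illustrative GRU classifier; widths and rates are placeholders, not Table 4 values."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window_size, n_features)),
        tf.keras.layers.GRU(units),                                # recurrent feature extractor
        tf.keras.layers.Dropout(dropout),                          # regularization (see Dropout below)
        tf.keras.layers.Dense(n_classes, activation="softmax"),    # 5 classes: normal + 4 attack types
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: a classifier for windows of 20 CAN messages with an assumed 10 features each.
model = build_gru_model(window_size=20, n_features=10)
model.summary()
```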

Table 4 Hyperparameters of the GRU Models


Learning Rate

The learning rate is a critical hyperparameter that controls the pace at which the machine learning (ML) model learns. Selecting an appropriate learning rate is essential for effective training. A learning rate that is too low can lead to very slow training, and the model may become stuck in suboptimal solutions [39]. Conversely, a learning rate that is too high can accelerate training but may result in unstable learning and increased errors. In our experiments, we tested five different learning rate values: 0.001, 0.005, 0.01, 0.05, and 0.10 to find the optimal setting.
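
The sketch below shows how such a sweep could be organized; the build_model callable and the data arrays are assumed inputs standing in for the GRU architecture and the preprocessed dataset.

```python
import tensorflow as tf

# Learning rates explored in the experiments; each value defines one training run.
LEARNING_RATES = [0.001, 0.005, 0.01, 0.05, 0.10]

def learning_rate_sweep(build_model, x_train, y_train, x_val, y_val,
                        epochs=100, batch_size=256):
    """Retrain the same architecture once per candidate rate and keep the best one."""
    best_lr, best_acc = None, 0.0
    for lr in LEARNING_RATES:
        model = build_model()
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                            epochs=epochs, batch_size=batch_size, verbose=0)
        acc = max(history.history["val_accuracy"])
        if acc > best_acc:
            best_lr, best_acc = lr, acc
    return best_lr, best_acc
```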

Optimizer

The optimizer is the algorithm that drives training by minimizing the loss function. Optimizers are mathematical procedures that adjust model parameters, such as weights and biases, to reduce prediction errors, iteratively refining the model on the training data. We utilized “Adam,” a widely recognized and effective optimizer, in our experiments. Adam serves as an alternative to classical stochastic gradient descent and iteratively updates network weights based on the training dataset. Adam offers several key advantages [40]:

  • Simple to implement and use.
  • Computationally efficient, reducing training time.
  • Minimal memory requirements, making it suitable for various hardware.
  • Invariance to diagonal rescaling of gradients, enhancing stability.
  • Well-suited for problems involving large datasets and complex models.
  • Effective in scenarios with noisy or sparse gradients.
  • Hyperparameters are intuitive and require minimal fine-tuning, simplifying the optimization process.
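
The snippet below illustrates how Adam is instantiated in tf.keras with its commonly used defaults; in practice only the learning rate typically needs tuning, while the remaining coefficients can usually be left as shown.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # step size (see the learning rates explored above)
    beta_1=0.9,           # decay rate of the first-moment (mean) estimate
    beta_2=0.999,         # decay rate of the second-moment (variance) estimate
    epsilon=1e-7,         # small constant for numerical stability
)
```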

Epochs

An epoch represents one complete pass of the entire training dataset through the machine learning algorithm. The number of epochs is a critical hyperparameter; it dictates how many times the model learns from the entire dataset. With each epoch completion, the model parameters are refined. In our experiments, we set the number of epochs to 100 for each GRU model, ensuring sufficient training iterations.

Batch Size

Batch size is the number of training samples processed in each mini-batch during training. Choosing an appropriate batch size is important: a very small batch size introduces high variance in the gradient estimates and can make training unstable, whereas an excessively large batch size tends to hurt generalization, with the model performing well on training data but poorly on unseen data. We explored three batch sizes in our experiments, 128, 256, and 512, to find a suitable balance.
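
The call below sketches how epochs and batch size enter Keras training; model, x_train, and y_train are assumed to come from the earlier sketches, and the values shown are one of the explored configurations rather than the final choice.

```python
# Illustrative training call (assumed model and data from the earlier sketches).
history = model.fit(
    x_train, y_train,
    epochs=100,           # one epoch = one full pass over the training data
    batch_size=256,       # candidate batch sizes explored: 128, 256, 512
    validation_split=0.1, # hold out a slice of the training data for monitoring
)
```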

Momentum

Momentum is a hyperparameter that helps accelerate gradient descent algorithms in the relevant direction and dampens oscillations. It influences the update of model parameters by considering the updates from previous steps. This can improve the speed and stability of learning, helping the model to overcome local minima. In our experiments, we used a momentum range from 0.5 to 0.9.
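
Note that in tf.keras the classical momentum term is a parameter of the SGD optimizer, whereas Adam exposes the analogous beta_1 coefficient; the snippet below shows both forms with illustrative values.

```python
import tensorflow as tf

sgd_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)    # classical momentum term
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9)   # analogous first-moment decay
```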

Dropout

Dropout is a regularization technique used to prevent overfitting in neural networks. It works by randomly “dropping out,” or ignoring, a fraction of neurons during the training phase. This forces the network to learn more robust features that are not dependent on specific neurons, improving generalization. In our experiments, we considered dropout values of 0.0, 0.01, and 0.05.
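
In Keras, dropout is applied as a layer whose rate is the fraction of activations randomly dropped during training (and left untouched at inference), as the minimal example below shows.

```python
import tensorflow as tf

# Rates considered in our experiments: 0.0, 0.01, 0.05.
dropout_layer = tf.keras.layers.Dropout(rate=0.05)
```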

Fig. 4 Performance results of GRU-1 across different window sizes

Fig. 5 Performance results of GRU-2 across different window sizes

Performance Evaluation Parameters

To rigorously evaluate the effectiveness of our proposed model, we utilized several key performance metrics. First, the predicted outputs of the trained algorithms were compared against the ground-truth labels. This comparison yields the counts of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). TP and TN are the correctly predicted attack and normal instances, respectively, while FP counts normal instances misclassified as attacks and FN counts attacks misclassified as normal. These fundamental quantities are then used to compute accuracy, precision, recall, and the F1 score, providing a comprehensive assessment of the model’s detection capabilities.

Accuracy

Accuracy is a fundamental metric that measures the overall correctness of the model’s predictions. It indicates the percentage of predictions that are correct, whether they are attacks or normal events. Accuracy is calculated by dividing the total number of correct predictions (TP + TN) by the total number of all predictions (TP + TN + FP + FN), as shown in Equation 7:

$$\begin{aligned} \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \end{aligned}$$

(7)

Precision

Precision focuses on the accuracy of positive predictions. It quantifies the proportion of correctly identified attack instances out of all instances that the model classified as attacks. High precision indicates that when the model predicts an attack, it is very likely to be a true attack. Precision is calculated as shown in Equation 8:

$$\begin{aligned} \text{Precision} = \frac{TP}{TP + FP}. \end{aligned}$$

(8)

Recall

Recall, also known as sensitivity, measures the model’s ability to detect all actual attack instances. It calculates the proportion of correctly identified attacks out of all actual attacks present in the dataset. High recall is crucial in security applications to minimize the number of missed attacks. Recall is calculated as shown in Equation 9:

$$\begin{aligned} \text{Recall} = \frac{TP}{TP + FN}. \end{aligned}$$

(9)

F1 Score

The F1 score provides a balanced measure of both precision and recall. It is the harmonic mean of precision and recall, offering a single metric that summarizes the overall performance, especially when dealing with imbalanced datasets. A high F1 score indicates a good balance between precision and recall. The F1 Score is calculated as shown in Equation 10:

$$\begin{aligned} \text{F1 Score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}. \end{aligned}$$

(10)
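
As a concrete illustration of Equations 7–10, the toy example below derives all four metrics from stand-in predictions (the labels are hypothetical; 1 = attack, 0 = normal).

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])  # hypothetical ground truth
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])  # hypothetical model output

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # attacks correctly detected
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # normal traffic correctly passed
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # normal traffic flagged as attack
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # attacks that were missed

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # Equation 7
precision = tp / (tp + fp)                                 # Equation 8
recall    = tp / (tp + fn)                                 # Equation 9
f1        = 2 * precision * recall / (precision + recall)  # Equation 10
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```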

Results and Discussion

To thoroughly assess the performance of our proposed algorithm, extensive experiments were conducted using the “Car Hacking: Attack & Defense Challenge 2020 Dataset.” We divided the dataset into training and testing sets with a 75%/25% split. As previously mentioned, our proposed scheme incorporates five GRU models, each configured with the hyperparameters detailed in Table 4. The performance of these GRU models was evaluated across different window sizes: 1, 5, 10, 20, and 30.
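
A minimal sketch of the windowing and the 75%/25% split is given below; make_windows is a hypothetical helper, and features and labels stand in for the preprocessed CAN traffic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_windows(features: np.ndarray, labels: np.ndarray, window: int):
    """Group consecutive CAN messages into fixed-length windows; each window is
    labelled with the class of its last message (one plausible convention)."""
    xs, ys = [], []
    for i in range(len(features) - window + 1):
        xs.append(features[i:i + window])
        ys.append(labels[i + window - 1])
    return np.asarray(xs), np.asarray(ys)

# Window sizes evaluated: 1, 5, 10, 20, 30; split ratio 75%/25%.
# x, y = make_windows(features, labels, window=20)
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
```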

GRU-1 exhibited its best performance at a window size of 20 (W20). At this window size, it achieved a peak detection accuracy of 99.28%. Furthermore, all other performance metrics (precision, recall, and F1-score) also exceeded 99% for W20. For smaller window sizes (W1, W5, and W10), GRU-1’s performance remained high, consistently between 97% and 99%. Detailed performance results for GRU-1 across various window sizes are illustrated in Fig. 4.

Fig. 6 Performance results of GRU-3 across different window sizes

Fig. 7 Performance results of GRU-4 across different window sizes

The performance results for GRU-2 across different window sizes are depicted in Fig. 5. Similar to GRU-1, GRU-2 achieved its highest performance at W20, reaching a detection accuracy of 99.12%, with all other performance metrics also above 99% at this window size. For window sizes W1, W5, W10, and W30, GRU-2’s performance ranged between 97% and 98%. Notably, GRU-3 demonstrated the best attack detection performance among all GRU models, attaining a peak detection accuracy of 99.52% at W20, with the other performance scores also exceeding 99.50% at this window size. For window sizes W1, W5, W10, and W30, GRU-3’s results were between 98% and 99.25%. The performance results for GRU-3 across different window sizes are presented in Fig. 6.

In contrast, GRU-4 and GRU-5 exhibited slightly lower performance compared to the other GRU models. These models achieved peak detection accuracies of 98.23% and 98.02% at W20, respectively. For other window sizes, the performance of both GRU-4 and GRU-5 ranged between 96% and 97%. Performance results for GRU-4 and GRU-5 across different window sizes are shown in Figs. 7 and 8, respectively.

Fig. 8 Performance results of GRU-5 across different window sizes

Fig. 9 Average attack detection performance of the proposed FL scheme

Fig. 10 Training time comparison across different window sizes

Table 5 Performance Comparison of the Proposed Scheme with Related Studies


The simulation outcomes demonstrated the robust performance of the GRU models within our proposed Federated Learning framework. The average attack detection performance of the FL scheme is visualized in Fig. 9. The framework achieved an impressive average attack detection accuracy of 98.83%. The average precision, recall, and F1 scores were also high, reaching 98.93%, 98.91%, and 98.92%, respectively. All simulations were conducted for 100 epochs. As expected, the training time for our algorithms increased with larger window sizes, as shown in the training time comparison in Fig. 10.

To validate the efficacy of our proposed scheme, we benchmarked our results against recent, relevant studies that investigated similar problems and used the same “Car Hacking” dataset. The comparative performance analysis, presented in Table 5, indicates that our model achieves superior accuracy on the extended version of the dataset compared to existing approaches that rely on traditional Intrusion Detection System (IDS) methods or centralized Deep Learning (DL) models. This advantage can be attributed to the Federated Learning approach we adopted, in which models are trained collaboratively yet independently across distributed participants. Specifically, local epochs are defined within the learning parameters, allowing each participant to train on its local data for a set number of epochs. After local training, the resulting updates are communicated to a central cloud server, which aggregates the updates from all participants, averages them, and generates a new global model. Participants then use this updated global model for the subsequent round of training. This iterative process continues until convergence is achieved or a predefined number of communication rounds is completed. The Federated Learning paradigm has proven its effectiveness and efficiency in numerous applications, and our case study on cyberattack detection in Vehicle Sensor Networks (VSNs) further highlights its benefits in reducing training time and improving detection accuracy.
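
A minimal, unweighted FedAvg-style aggregation sketch is shown below; the communication round is indicated in comments, and clone_and_load, clients, and local_epochs are hypothetical placeholders rather than the exact implementation of our scheme.

```python
import numpy as np

def federated_average(client_weights):
    """client_weights holds one entry per participant, each being the list of
    per-layer arrays from model.get_weights(); return their element-wise mean."""
    return [np.mean(np.stack(layer_stack), axis=0)
            for layer_stack in zip(*client_weights)]

# One communication round (sketch):
# for rnd in range(num_rounds):
#     client_weights = []
#     for client_data in clients:
#         local_model = clone_and_load(global_model)   # hypothetical helper
#         local_model.fit(*client_data, epochs=local_epochs, verbose=0)
#         client_weights.append(local_model.get_weights())
#     global_model.set_weights(federated_average(client_weights))
```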
