Unlocking Network Insights: Capturing, Analyzing, and Training with Packet Data

Introduction

Packet capture, often referred to as network sniffing or packet analysis, is the process of intercepting and logging data traffic passing through a computer network. This technique captures packets of data as they travel between devices on a network, allowing for detailed examination and analysis of network activity. Packet capture is commonly used for troubleshooting network issues, monitoring network performance, detecting security breaches, and analyzing network protocols. It provides valuable insights into the flow of data within a network, helping network administrators and security professionals to understand, diagnose, and resolve various network-related issues.

Steps to Perform the Lab

1. Install tcpdump: If you don’t have tcpdump installed on your Linux machine, you can install it using your package manager. For example, on Ubuntu or Debian-based systems, you can use:

sudo apt-get update 
sudo apt-get install tcpdump

2. Start capturing network data: Open a terminal and run the following command to start capturing network data:

sudo tcpdump -i any -w captured_data.pcap

This command will capture all incoming and outgoing network traffic on all interfaces (-i any) and write it to a file named captured_data.pcap.

3. Let it run for 10 minutes: Leave the tcpdump command running for 10 minutes to capture a sufficient amount of network data.

4. Stop capturing network data: After 10 minutes, stop the tcpdump command by pressing Ctrl+C.

View the captured .pcap file in Wireshark. To get the data into the tabular form used in the steps below, export the packet list to CSV (File → Export Packet Dissections → As CSV).
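If you prefer to skip the GUI, here is a minimal sketch of the same export using scapy (an extra dependency, installable with pip install scapy, and not part of the original workflow). Note that the Protocol column here is only an approximation of Wireshark's, and no 'Info' column is produced:

import csv
from scapy.all import rdpcap
from scapy.layers.inet import IP

packets = rdpcap('captured_data.pcap')
with open('captured_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['No.', 'Time', 'Source', 'Destination', 'Protocol', 'Length'])
    for i, pkt in enumerate(packets, start=1):
        if IP in pkt:  # skip non-IP frames
            writer.writerow([
                i,                     # packet number
                float(pkt.time),       # capture timestamp
                pkt[IP].src,           # source address
                pkt[IP].dst,           # destination address
                pkt[IP].payload.name,  # layer above IP (TCP, UDP, ICMP, ...)
                len(pkt),              # frame length in bytes
            ])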

5. Label the captured data: To label the captured data, you can use various methods depending on your specific use case. One common approach is to manually label the data based on known activities during the capture period. For example, you can label traffic related to web browsing, email communication, file downloads, etc.

In our case, we have labeled the data as ingress or egress traffic.

6. Preprocess the data: Before using the data with a machine learning model, you may need to preprocess it. This can include tasks such as feature extraction, normalization, and data cleaning.

In our case, the data is cleaned by removing the free-text 'Info' column.
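In pandas, this cleaning step is a one-liner; a sketch, assuming the CSV export described above:

import pandas as pd

df = pd.read_csv('captured_data.csv')
# 'Info' is Wireshark's free-text dissector summary, not a useful feature;
# errors='ignore' makes this a no-op if the export lacks that column
df = df.drop(columns=['Info'], errors='ignore')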

The column ‘TrafficType’ is added to label the .pcap data by importing the data in a MySQL Database.

USE `database_name`;

ALTER TABLE `table_name`
    ADD COLUMN TrafficType INT;

UPDATE `table_name`
SET TrafficType =
    CASE
        WHEN Source = 'ip' THEN 1       -- egress: sent from the capture host
        WHEN Destination = 'ip' THEN 0  -- ingress: sent to the capture host
        ELSE NULL
    END;
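If you would rather skip the database round trip, the same labeling can be done directly on the DataFrame; a minimal sketch, where HOST_IP stands in for the 'ip' placeholder above:

HOST_IP = '192.168.1.10'  # hypothetical: replace with the capture host's address

df['TrafficType'] = None
# Destination first, then Source, so Source takes precedence on a tie,
# mirroring the order of the WHEN clauses in the CASE expression above
df.loc[df['Destination'] == HOST_IP, 'TrafficType'] = 0  # ingress
df.loc[df['Source'] == HOST_IP, 'TrafficType'] = 1       # egress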

7. Train a machine learning model: Once the data is labeled and preprocessed, you can train a machine learning model using a framework like scikit-learn or TensorFlow. Choose an appropriate model for your classification task, such as a decision tree, random forest, or neural network.

8. Split the data into training and testing sets: Split the labeled data into training and testing sets to evaluate the performance of your model.

9. Train the model: Train the machine learning model using the training set.

10. Evaluate the model: Evaluate the performance of the trained model using the testing set. You can use metrics such as accuracy, precision, recall, and F1-score to assess the model’s performance.

11. Predict output: Once the model is trained and evaluated, you can use it to predict the output (e.g., classify network traffic) on new, unseen data (see the sketch after this list).

12. Refine and iterate: Depending on the performance of your model, you may need to refine your feature selection, preprocessing steps, or choose a different model architecture. Iterate on these steps until you are satisfied with the model’s performance.
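As referenced in step 11, here is a minimal sketch of scoring new, unseen traffic. It assumes the clf and feature frame X from the script that follows, plus a hypothetical new_capture.csv exported the same way as the training data. The key detail is that the new data must be one-hot encoded into the same columns the model was trained on:

new_df = pd.read_csv('new_capture.csv')  # hypothetical new capture
new_df = pd.get_dummies(new_df, columns=['Source', 'Destination', 'Protocol'])

# Align the new frame with the training columns: dummy columns missing from
# the new capture are added (filled with 0), extra columns are dropped
new_X = new_df.reindex(columns=X.columns, fill_value=0)

predictions = clf.predict(new_X)  # 1 = egress, 0 = ingress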

Python script:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the labeled capture (the CSV now includes the TrafficType column)
df = pd.read_csv('captured_data.csv')

# Drop rows that were never labeled (TrafficType is NULL)
df = df.dropna(subset=['TrafficType'])

# One-hot encode the categorical columns. Note: Source and Destination were
# used to construct TrafficType, so these features can leak the label
# (see the discussion of the results below).
df = pd.get_dummies(df, columns=['Source', 'Destination', 'Protocol'])

X = df.drop(columns=['TrafficType'])  # features
y = df['TrafficType']                 # target

# 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

print('Classification Report:')
print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

This output is typical of a binary classification task, where the goal is to predict between two classes (here, class 0 for ingress traffic and class 1 for egress traffic). Let's break down each section:

Accuracy:

The fraction of test instances classified correctly. In our run the reported accuracy was 1.0 (100%).

Classification Report:

For both classes (0 and 1): - Precision (the share of predicted instances of a class that truly belong to it), recall (the share of true instances of a class that were found), and F1-score (their harmonic mean) are all 1.0, indicating perfect performance. - Support indicates the number of instances of each class in the testing set.

Confusion Matrix:

All counts fall on the diagonal, meaning no test instance was assigned to the wrong class.

In summary, the output indicates that the model performed perfectly, achieving 100% accuracy in classifying both classes, with no misclassifications.

Achieving 100% accuracy with a machine learning model is rare and usually warrants scrutiny, since it often points to a very simple problem, overfitting, or leaked information. Here are some reasons why a model might achieve 100% accuracy:

  1. Perfectly separable data: In some cases, the data might be perfectly separable into distinct classes, meaning there’s a clear boundary between the classes that the model can easily learn.

  2. Data leakage: If information that determines the label is available to the model, either because test data bleeds into the training set or because a feature was itself used to construct the label, the model can predict the labels trivially rather than learning real patterns (a quick check for our setup is sketched after this list).

  3. Overfitting: Overfitting occurs when the model learns the training data too well, including noise and irrelevant details, to the point that it doesn’t generalize well to unseen data. In such cases, the model might perform exceptionally well on the training set but poorly on new data.

  4. Too simple problem: If the problem you’re solving is extremely simple, such as a basic rule-based problem, it’s possible to achieve 100% accuracy.

  5. Insufficient data: If your dataset is very small and doesn’t represent the full complexity of the problem, a model might achieve perfect accuracy by simply memorizing the training examples.
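In our setup, reason 2 deserves particular attention: TrafficType was derived directly from the Source and Destination columns, and those same columns were one-hot encoded as features. A quick sanity check (a sketch, reusing X, y, and the imports from the script above) is to retrain without them and see whether accuracy drops:

# Drop the one-hot features derived from the columns used to build the label
leak_cols = [c for c in X.columns
             if c.startswith('Source_') or c.startswith('Destination_')]
X_noleak = X.drop(columns=leak_cols)

X_tr, X_te, y_tr, y_te = train_test_split(X_noleak, y, test_size=0.3, random_state=42)
clf_check = RandomForestClassifier(random_state=42)
clf_check.fit(X_tr, y_tr)
print('Accuracy without Source/Destination features:',
      accuracy_score(y_te, clf_check.predict(X_te)))

A large drop would confirm that the perfect score came from leaked label information rather than from genuinely learnable structure in the traffic.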

Conclusion

While achieving 100% accuracy might be desirable, it’s essential to critically evaluate the reasons behind it to ensure that the model is truly capturing the underlying patterns in the data and not just memorizing the training examples or relying on simplistic features. Regular validation on unseen data and careful consideration of model complexity are crucial to building reliable machine learning models.