This repository provides a complete workflow for the CICIDS2017 and CICIDS2018 datasets: raw data loading and cleaning, exploratory data analysis (EDA), and building and evaluating dynamic ensemble learning models for Intrusion Detection Systems (IDS). It has three parts:
- Part 1: 📌 CICIDS2017 Preprocessing & EDA
- Part 2: 📌 CICIDS2018 Preprocessing & EDA
- Part 3: 🚀 Dynamic Ensemble Performance Evaluation
This project helps you:
- ✅ Load large network traffic CSVs efficiently
- ✅ Clean, optimize, and engineer features
- ✅ Visualize and understand attack distributions
- ✅ Train, load, and test multiple ML models
- ✅ Combine base models using static and adaptive ensembling
- ✅ Evaluate performance across datasets and techniques
Install once for all modules:
pip install pandas numpy scikit-learn matplotlib seaborn missingno joblib
Or inside a Colab cell:
!pip install pandas numpy scikit-learn matplotlib seaborn missingno joblib
- Mount Drive
    from google.colab import drive
    drive.mount('/content/drive')
- Load and Clean
    data = load_cicids_data('/content/drive/MyDrive/Capstone/CICIDS2017')
    data = optimize_dtypes(data)
    data.drop_duplicates(inplace=True)
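- Handle Missing Values
    import numpy as np
    # Treat infinities as missing, then impute numeric columns with their median.
    data.replace([np.inf, -np.inf], np.nan, inplace=True)
    # numeric_only=True skips string columns such as Label, which have no median.
    data.fillna(data.median(numeric_only=True), inplace=True)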
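- Label Engineering
    from sklearn.preprocessing import LabelEncoder
    # attack_map maps each raw Label value to a coarser attack category.
    data['Attack Type'] = data['Label'].map(attack_map)
    le = LabelEncoder()
    data['Attack Number'] = le.fit_transform(data['Attack Type'])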
- EDA
    import missingno as msno
    import seaborn as sns
    msno.bar(data)  # per-column count of non-missing values
    sns.heatmap(data.corr(numeric_only=True))  # feature correlation heatmap
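The cleaning and label-engineering steps above can be tried on a small synthetic frame before running them on the full dataset. The sketch below is illustrative only: the toy values, the `example_attack_map` grouping, and the helper name `clean_and_encode` are assumptions and are not taken from this repository.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical grouping of raw labels into coarser attack types (assumption).
example_attack_map = {
    "BENIGN": "Normal",
    "DoS Hulk": "DoS",
    "DoS GoldenEye": "DoS",
    "PortScan": "Port Scan",
}


def clean_and_encode(df: pd.DataFrame, attack_map: dict) -> pd.DataFrame:
    """Apply the cleaning steps to a copy of df (hypothetical helper)."""
    out = df.drop_duplicates().copy()
    # Treat +/-inf as missing so they are imputed together with NaN.
    out = out.replace([np.inf, -np.inf], np.nan)
    # Impute numeric columns only; the string Label column has no median.
    out = out.fillna(out.median(numeric_only=True))
    # Labels missing from the map become NaN; drop them rather than guess.
    out["Attack Type"] = out["Label"].map(attack_map)
    out = out.dropna(subset=["Attack Type"]).reset_index(drop=True)
    out["Attack Number"] = LabelEncoder().fit_transform(out["Attack Type"])
    return out


toy = pd.DataFrame({
    "Flow Duration": [10.0, 10.0, np.inf, 30.0, np.nan],
    "Flow Bytes/s": [1.0, 1.0, 2.0, np.inf, 4.0],
    "Label": ["BENIGN", "BENIGN", "DoS Hulk", "PortScan", "DoS GoldenEye"],
})
cleaned = clean_and_encode(toy, example_attack_map)
print(cleaned)
```

Working on a copy instead of using `inplace=True` keeps the raw frame available for comparison, which may be convenient while checking how many rows each step removes.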
- Mount Drive
    from google.colab import drive
    drive.mount('/content/drive')
- Combine Multiple CSVs
    import pandas as pd
    df1 = pd.read_csv('/path/to/file1.csv')
    df2 = pd.read_csv('/path/to/file2.csv')
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    data = pd.concat([df1, df2], ignore_index=True)
- Fix Data Types
    data = fixDataType(data)
    data = optimize_dtypes(data)
- Label Encoding
    attack_map = {...}
    data['Attack Type'] = data['Label'].map(attack_map)
    le = LabelEncoder()
    data['Attack Number'] = le.fit_transform(data['Attack Type'])
- Visual EDA
    msno.bar(data)
    sns.boxplot(x='Attack Type', y='Flow Duration', data=data)
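When the dataset is split across many CSV files, the combine and dtype steps above can be wrapped in two small helpers. This is a minimal sketch under stated assumptions: `load_csv_folder` and `downcast_numeric` are hypothetical names, and the downcasting shown is only one plausible way to reduce memory, not necessarily what this repository's `optimize_dtypes` does.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd


def load_csv_folder(folder: str) -> pd.DataFrame:
    """Read every *.csv in folder and stack the rows (hypothetical helper)."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"no CSV files found in {folder}")
    frames = [pd.read_csv(p, low_memory=False) for p in paths]
    return pd.concat(frames, ignore_index=True)


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # Downcasting floats goes to float32 at most and can lose precision.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Demo on two tiny temporary files standing in for the real traffic CSVs.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Flow Duration": [1, 2], "Label": ["Benign", "Bot"]}).to_csv(
        os.path.join(tmp, "day1.csv"), index=False)
    pd.DataFrame({"Flow Duration": [3], "Label": ["Benign"]}).to_csv(
        os.path.join(tmp, "day2.csv"), index=False)
    combined = load_csv_folder(tmp)

small = downcast_numeric(combined)
print(combined.dtypes, small.dtypes, sep="\n\n")
```

Comparing `combined.memory_usage(deep=True).sum()` with the same call on `small` gives a quick check of how much memory the downcast saves on real data.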
- Load trained models for 2017 & 2018
- Combine models: average, weighted, max-voting
- Adaptive ensembling with:
  - Confidence metrics
  - Meta-learner (RandomForestRegressor)
- Evaluate all pairwise model combinations
- Generate comparison tables & plots
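The three static combination rules listed above can be sketched with plain NumPy on class-probability arrays. The function names below are hypothetical and this is not the repository's `EnsemblePredictor`; in particular, "max-voting" is interpreted here as hard majority voting, which is an assumption.

```python
import numpy as np


def average_ensemble(probas):
    """Soft voting: mean of the class-probability arrays."""
    return np.mean(probas, axis=0)


def weighted_ensemble(probas, weights):
    """Weighted soft voting; weights are normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Contract the model axis: result has shape (n_samples, n_classes).
    return np.tensordot(w, np.asarray(probas), axes=1)


def max_vote_ensemble(probas):
    """Hard voting: each model votes for its argmax class.

    Ties go to the lowest class index, an arbitrary choice for this sketch.
    """
    probas = np.asarray(probas)
    n_classes = probas.shape[2]
    votes = probas.argmax(axis=2)  # shape (n_models, n_samples)
    counts = np.stack(
        [(votes == c).sum(axis=0) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)


# Three hypothetical models, two samples, two classes.
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.7, 0.3]])
p3 = np.array([[0.2, 0.8], [0.2, 0.8]])
stack = [p1, p2, p3]

avg_labels = average_ensemble(stack).argmax(axis=1)
wtd_labels = weighted_ensemble(stack, [1, 1, 2]).argmax(axis=1)
vote_labels = max_vote_ensemble(stack)
print(avg_labels, wtd_labels, vote_labels)
```

With only two models, hard voting ties whenever the models disagree, so soft (probability-based) rules are usually the more informative choice for pairwise combinations.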
1️⃣ Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
2️⃣ Run the Pipeline
if __name__ == "__main__":
    runner, standard_summary, adaptive_summary = main()
| Class | Role |
|---|---|
| ModelLoader | Loads models and test splits |
| EnsemblePredictor | Static ensembling |
| AdaptiveEnsemblePredictor | Confidence and meta-learning |
| EvaluationMetrics | Accuracy, F1, recall, precision |
| VisualizationTools | Confusion matrices, bar plots, heatmaps |
| EnsembleExperimentRunner | Runs all experiments and reporting |
adaptive = AdaptiveEnsemblePredictor()
preds, confs, weights = adaptive.predict_ensemble(
    X_input, model1, model2,
    method='meta_learner',
    X_train=X_train_subset, y_train=y_train_subset
)
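To give a feel for what confidence-based adaptive weighting can look like, here is a minimal sketch that blends two models per sample according to how confident each one is. It is not the implementation behind `AdaptiveEnsemblePredictor`: the function name `confidence_weighted_predict` is hypothetical, using the top class probability as the confidence score is an assumption, and the `meta_learner` method is not reproduced here.

```python
import numpy as np


def confidence_weighted_predict(proba_a, proba_b):
    """Blend two models per sample, weighting each by its own confidence.

    Confidence here is the model's top class probability (an assumption).
    Returns (predictions, blended confidence, per-sample weights).
    """
    proba_a = np.asarray(proba_a, dtype=float)
    proba_b = np.asarray(proba_b, dtype=float)
    conf = np.stack([proba_a.max(axis=1), proba_b.max(axis=1)], axis=1)
    # Normalise so the two weights for each sample sum to 1.
    weights = conf / conf.sum(axis=1, keepdims=True)
    blended = weights[:, [0]] * proba_a + weights[:, [1]] * proba_b
    return blended.argmax(axis=1), blended.max(axis=1), weights


# Hypothetical probability outputs for two samples and two classes.
proba_a = np.array([[0.95, 0.05], [0.55, 0.45]])
proba_b = np.array([[0.40, 0.60], [0.10, 0.90]])
preds, confs, weights = confidence_weighted_predict(proba_a, proba_b)
print(preds, confs.round(3), weights.round(3))
```

In this toy case the more confident model dominates each sample: the first model decides the first sample and the second model decides the second.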
- Individual model metrics (accuracy, F1, precision, recall)
- Confusion matrix comparisons
- Top-k model combination heatmaps
- CSV-style DataFrame of results
- Summary reports comparing standard vs adaptive ensembles
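A results table like the one described above can be assembled with scikit-learn metrics and `itertools.combinations`. The sketch below is an illustration, not the repository's `EvaluationMetrics` or `EnsembleExperimentRunner`: the helper names, the model names, the toy probabilities, and the choice of macro averaging are all assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def score_predictions(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}


def pairwise_average_report(y_true, probas_by_model):
    """Score every pair of models combined by probability averaging."""
    rows = []
    for name_a, name_b in combinations(sorted(probas_by_model), 2):
        mean_proba = (probas_by_model[name_a] + probas_by_model[name_b]) / 2
        row = {"pair": f"{name_a}+{name_b}"}
        row.update(score_predictions(y_true, mean_proba.argmax(axis=1)))
        rows.append(row)
    return pd.DataFrame(rows).sort_values(
        "f1", ascending=False, ignore_index=True)


# Toy labels and probability outputs from three hypothetical models.
y_true = np.array([0, 1, 1, 0])
probas_by_model = {
    "model_a": np.array([[0.6, 0.4], [0.65, 0.35], [0.2, 0.8], [0.7, 0.3]]),
    "model_b": np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.9, 0.1]]),
    "model_c": np.array([[0.7, 0.3], [0.4, 0.6], [0.3, 0.7], [0.4, 0.6]]),
}
report = pairwise_average_report(y_true, probas_by_model)
print(report)
```

Macro averaging weights every class equally, which tends to matter on intrusion datasets where benign traffic far outnumbers individual attack classes; `average="weighted"` is the alternative if class frequency should count.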
Academic & research use only. Please cite CICIDS2017 and CICIDS2018.
This project was built for security researchers working on real-time Intrusion Detection using ensemble learning techniques.