This project demonstrates how to use a genetic algorithm to fine-tune the hyperparameters of an XGBoost model. This approach, a form of evolutionary machine learning, automates the tedious process of hyperparameter optimization, potentially leading to better model performance.
The core idea is to treat sets of hyperparameters as "individuals" in a population. The "fittest" individuals—those that produce the best-performing models—are more likely to "reproduce" and pass on their traits (hyperparameter values) to the next generation.
The process unfolds as follows:
- Initialization: Define a range of possible values for each XGBoost hyperparameter you want to tune.
  - `n_estimators`: The number of boosting rounds.
  - `max_depth`: The maximum depth of a tree.
  - `learning_rate`: The step size shrinkage.
  - `subsample`: The fraction of samples used to fit the individual base learners.
  - `colsample_bytree`: The fraction of columns used to fit the individual base learners.
  - `gamma`: The minimum loss reduction required to make a further partition on a leaf node of the tree.
- Population Generation: The algorithm creates an initial population of random hyperparameter sets (dictionaries). Each set is a unique combination of values chosen from the predefined ranges.
- Fitness Evaluation: For each set of hyperparameters in the population, a new XGBoost model is trained on the training data. The model's performance is then evaluated on a validation set using a fitness function (e.g., accuracy, F1-score, or mean squared error). This score represents the "fitness" of that individual.
- Selection: The fittest individuals (i.e., the hyperparameter sets that resulted in the best-performing models) are selected to create the next generation.
- Crossover and Mutation:
  - Crossover: The selected individuals are "bred" to create new offspring by combining the hyperparameter values of two parents into a new set of hyperparameters.
  - Mutation: To maintain genetic diversity and avoid getting stuck in local optima, some of the hyperparameters in the new generation are randomly mutated (slightly changed).
- Repeat: The process of evaluation, selection, crossover, and mutation is repeated for a specified number of generations. The best set of hyperparameters found throughout the entire process is then used to train the final model.
```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Define the hyperparameter space
param_space = {
    'n_estimators': (10, 1000),
    'max_depth': (3, 10),
    'learning_rate': (0.01, 0.3),
    'subsample': (0.5, 1.0),
    'colsample_bytree': (0.5, 1.0),
    'gamma': (0, 5)
}

# Generate a random population
def generate_population(size):
    population = []
    for _ in range(size):
        individual = {
            'n_estimators': np.random.randint(*param_space['n_estimators']),
            'max_depth': np.random.randint(*param_space['max_depth']),
            'learning_rate': np.random.uniform(*param_space['learning_rate']),
            'subsample': np.random.uniform(*param_space['subsample']),
            'colsample_bytree': np.random.uniform(*param_space['colsample_bytree']),
            'gamma': np.random.uniform(*param_space['gamma'])
        }
        population.append(individual)
    return population

# Fitness function
def fitness(individual, X_train, y_train, X_val, y_val):
    model = XGBClassifier(**individual)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    return accuracy_score(y_val, y_pred)

# --- Genetic Algorithm Logic (Selection, Crossover, Mutation) ---
# This part of the code would be implemented here.
# ...
```
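The selection, crossover, and mutation logic lives in the repository's own modules; purely as an illustration of what that missing piece can look like, here is a minimal sketch that plugs into the functions above. The helper names (`select_parents`, `crossover`, `mutate`, `evolve`) and the truncation-style selection are assumptions, not necessarily the project's actual implementation:

```python
import random

# Selection: keep the individuals with the highest fitness as parents
def select_parents(population, scores, num_parents):
    ranked = sorted(zip(scores, population), key=lambda pair: pair[0], reverse=True)
    return [individual for _, individual in ranked[:num_parents]]

# Crossover: each hyperparameter is inherited from one of the two parents at random
def crossover(parent_a, parent_b):
    return {key: random.choice([parent_a[key], parent_b[key]]) for key in parent_a}

# Mutation: occasionally resample a hyperparameter from its original range
def mutate(individual, mutation_rate):
    mutated = dict(individual)
    for key, (low, high) in param_space.items():
        if random.random() < mutation_rate:
            if key in ('n_estimators', 'max_depth'):  # integer-valued parameters
                mutated[key] = np.random.randint(low, high)
            else:
                mutated[key] = np.random.uniform(low, high)
    return mutated

# Evolution loop: evaluate, select, breed, and mutate for a fixed number of generations
def evolve(X_train, y_train, X_val, y_val,
           generations=20, population_size=50, num_parents=10, mutation_rate=0.1):
    population = generate_population(population_size)
    best_individual, best_score = None, -np.inf
    for _ in range(generations):
        scores = [fitness(ind, X_train, y_train, X_val, y_val) for ind in population]
        if max(scores) > best_score:
            best_score = max(scores)
            best_individual = population[int(np.argmax(scores))]
        parents = select_parents(population, scores, num_parents)
        # Breed a new population from randomly paired parents, then mutate it
        children = []
        while len(children) < population_size:
            parent_a, parent_b = random.sample(parents, 2)
            children.append(mutate(crossover(parent_a, parent_b), mutation_rate))
        population = children
    return best_individual, best_score
```

Keeping only the top `num_parents` individuals is the simplest possible selection scheme; the "More Sophisticated Selection" improvement listed below would swap `select_parents` for tournament or rank-based selection.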
- Create a virtual environment:

  ```bash
  python3 -m venv venv
  ```

- Activate the virtual environment:
  - On macOS and Linux:

    ```bash
    source venv/bin/activate
    ```

  - On Windows:

    ```bash
    .\venv\Scripts\activate
    ```

  Your terminal prompt should change to indicate that the virtual environment is active.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create the necessary directories:

  ```bash
  mkdir -p data models
  ```

- Generate dummy data (optional, for a quick start; a sketch of what this script might produce follows this list):

  ```bash
  python generate_data.py
  ```

- Run the tests:

  ```bash
  python test_initialization.py
  python test_fitness.py
  python test_selection.py
  python test_evolution.py
  ```
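The dummy data comes from the repository's `generate_data.py`. As a rough sketch of what such a script might do, under the assumption that it writes `data/train.csv` and `data/validation.csv` with a `target` column (the use of `make_classification`, the 80/20 split, and the column name are assumptions, not necessarily the script's actual behavior):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Build a small synthetic binary-classification dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y  # assumed name of the target column

# Split into training and validation sets and write them where main.py expects them
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv('data/train.csv', index=False)
val_df.to_csv('data/validation.csv', index=False)
```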
- Prepare Your Data: Ensure your training and validation data are in CSV format (e.g., `data/train.csv`, `data/validation.csv`) with a target column.

- Run the Genetic Algorithm: Execute the `main.py` script with the desired parameters. The script will output the best hyperparameters found and save the best model.

  ```bash
  python main.py --generations 20 --population_size 50 --num_parents 10 --mutation_rate 0.1
  ```

  You can customize the data paths and target column:

  ```bash
  python main.py --train_csv data/my_train.csv --val_csv data/my_val.csv --target_column my_target_col
  ```

  The best model will be saved to `models/best_xgboost_model.joblib` by default, or you can specify a path:

  ```bash
  python main.py --model_output_path models/my_custom_model.joblib
  ```
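After a run, the saved model can be loaded back for inference. A minimal sketch, assuming the model was serialized with joblib as described above and that the new data has the same feature columns used for training (the input path is a placeholder):

```python
import joblib
import pandas as pd

# Load the model produced by main.py (default output path shown above)
model = joblib.load('models/best_xgboost_model.joblib')

# Score new data that has the same feature columns used during training
new_data = pd.read_csv('data/new_samples.csv')  # placeholder path
predictions = model.predict(new_data)
print(predictions[:10])
```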
- More Sophisticated Selection: Implement more advanced selection methods like tournament selection or rank-based selection.
- Adaptive Mutation: Allow the mutation rate to change over time.
- Parallelization: Train models in parallel to speed up the fitness evaluation process (see the sketch after this list).
- Early Stopping: Stop the training of models that are not showing promise to save time.
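For the Parallelization idea, one common approach is to score every individual in a generation concurrently. A minimal sketch using joblib, assuming the `fitness` function defined earlier (the `n_jobs` value is arbitrary):

```python
from joblib import Parallel, delayed

def evaluate_population_parallel(population, X_train, y_train, X_val, y_val, n_jobs=4):
    # Train and score one model per individual, spread across n_jobs worker processes
    return Parallel(n_jobs=n_jobs)(
        delayed(fitness)(individual, X_train, y_train, X_val, y_val)
        for individual in population
    )

# Example usage inside the evolution loop:
# scores = evaluate_population_parallel(population, X_train, y_train, X_val, y_val)
```

Because XGBoost already parallelizes tree construction across threads, it can help to keep `n_jobs` modest here, or to set `n_jobs=1` on `XGBClassifier`, so the worker processes do not oversubscribe the CPU.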