raeslab/ColdSnap

ColdSnap: Freeze ML models and their training/testing data

The ColdSnap framework lets you "freeze" training/testing data and machine learning models, that is, serialize them to disk.

Machine learning projects need to track and store not only model architectures and parameters but also the datasets the models were trained on. A robust mechanism for storing models together with snapshots of their data is essential for reproducibility, version control, and long-term evaluation of model performance. ColdSnap addresses these needs with a unified framework in which models and their corresponding datasets can be stored, serialized, and evaluated. By preserving the model and data as a single unit, ColdSnap enables consistent evaluation across iterations, aids model comparison, and keeps every aspect of a model's creation (data transformations, training splits, and performance metrics) easily retrievable.
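To make the idea of a frozen snapshot concrete, here is a minimal sketch that uses only the Python standard library. It is not ColdSnap's implementation: the freeze and thaw helpers and the layout of the snapshot dictionary are hypothetical, chosen only to show a model and its data travelling together in one gzip-compressed pickle file.

```python
import gzip
import hashlib
import pickle
import tempfile
from pathlib import Path


def freeze(obj, path):
    """Pickle obj, write it gzip-compressed, and return a SHA-256 digest of the pickled bytes."""
    payload = pickle.dumps(obj)
    with gzip.open(path, "wb") as handle:
        handle.write(payload)
    return hashlib.sha256(payload).hexdigest()


def thaw(path):
    """Read a gzip-compressed pickle back into a Python object."""
    with gzip.open(path, "rb") as handle:
        return pickle.loads(handle.read())


# Hypothetical snapshot: a toy "model" bundled with the data it was fitted on.
snapshot = {
    "description": "toy threshold model",
    "model": {"threshold": 2.5},
    "X_train": [[1.0], [2.0], [3.0], [4.0]],
    "y_train": [0, 0, 1, 1],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    snapshot_path = Path(tmp_dir) / "snapshot.pkl.gz"
    digest = freeze(snapshot, snapshot_path)
    restored = thaw(snapshot_path)

print(restored == snapshot)
print(digest)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.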

Installation

How to use ColdSnap

The code below can be found in ./docs/example. In a nutshell, you create a Data object that contains your training and testing data. That Data object is passed to a Model along with the classifier to use, and the resulting Model can be serialized to disk. The model, along with its data, can then be loaded from another script or notebook. The create_overview function summarizes a list of models.

Creating Snapshot of Data and Models

The code below shows how to create and store Data and Models.

from coldsnap import Data, Model

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import os

iris = datasets.load_iris(as_frame=True)
iris_df = pd.merge(
    iris.data, iris.target, how="inner", left_index=True, right_index=True
)

if __name__ == "__main__":
    # Create the output directory if it does not exist yet
    os.makedirs("./tmp/", exist_ok=True)

    cs_data = Data.from_df(
        iris_df, "target", random_state=1910, description="Iris Dataset"
    )

    cs_data.to_pickle("./tmp/iris_data.pkl.gz")

    # Create random forest classifier
    clf = RandomForestClassifier(random_state=1910)

    cs_model = Model(
        data=cs_data,
        clf=clf,
        description="RandomForestClassifier, default params on Iris dataset",
    )
    cs_model.fit()

    cs_model.to_pickle("./tmp/iris_model.pkl.gz")

Using Transformers with ColdSnap

ColdSnap also supports sklearn transformers like StandardScaler, PCA, etc. This is useful for saving fitted preprocessing pipelines along with your data. The example below shows how to fit a StandardScaler and save it as a snapshot.

from coldsnap import Data, Model

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import pandas as pd
import os

iris = datasets.load_iris(as_frame=True)
iris_df = pd.merge(
    iris.data, iris.target, how="inner", left_index=True, right_index=True
)

if __name__ == "__main__":
    # Create the output directory if it does not exist yet
    os.makedirs("./tmp/", exist_ok=True)

    # Create data object
    cs_data = Data.from_df(
        iris_df, "target", random_state=1910, description="Iris Dataset"
    )

    # Create a StandardScaler transformer
    scaler = StandardScaler()

    # Create a Model with the transformer using the 'estimator' parameter
    cs_scaler_model = Model(
        data=cs_data,
        estimator=scaler,
        description="StandardScaler for Iris dataset",
    )

    # Fit the scaler
    cs_scaler_model.fit()

    # Transform the training data
    X_train_scaled = cs_scaler_model.transform(cs_data.X_train)

    print("Original data (first 3 samples):")
    print(cs_data.X_train.head(3))
    print("\nScaled data (first 3 samples):")
    print(X_train_scaled.head(3))  # DataFrame structure is preserved automatically

    # Save the fitted transformer
    cs_scaler_model.to_pickle("./tmp/iris_scaler.pkl.gz")

    # Later, you can load and use it on new data
    loaded_scaler = Model.from_pickle("./tmp/iris_scaler.pkl.gz")
    X_test_scaled = loaded_scaler.transform(cs_data.X_test)

    print("\nTransformer successfully saved and loaded!")
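For readers unfamiliar with standard scaling, the sketch below shows the arithmetic in plain Python: a mean and standard deviation are learned per column from the training rows and then reused unchanged on new rows. It mirrors the fit-then-transform pattern of the example above but is not ColdSnap or scikit-learn code. The fit_scaler and apply_scaler names and the toy data are invented for illustration. The population standard deviation is used, which is what scikit-learn's StandardScaler computes.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Learn one (mean, scale) pair per column from the training rows."""
    params = []
    for column in zip(*rows):
        std = pstdev(column)  # population standard deviation (ddof=0)
        # Guard against constant columns to avoid dividing by zero
        params.append((fmean(column), std if std > 0 else 1.0))
    return params


def apply_scaler(rows, params):
    """Standardize rows with previously learned parameters."""
    return [
        [(value - mean) / scale for value, (mean, scale) in zip(row, params)]
        for row in rows
    ]


# Toy data, invented for illustration
X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[2.0, 40.0]]

params = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, params)
# The test rows reuse the training parameters; nothing is refitted
X_test_scaled = apply_scaler(X_test, params)

print(X_train_scaled)
print(X_test_scaled)
```

Reusing the training parameters on new data is the reason a fitted transformer is worth saving: refitting on the test set would scale it differently from the data the model saw.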

Using Regressors with ColdSnap

ColdSnap supports sklearn regression models like LinearRegression, Ridge, RandomForestRegressor, etc. Regressors work similarly to classifiers but return regression-specific evaluation metrics (RMSE, MAE, R2, MSE).

from coldsnap import Data, Model

from sklearn import datasets
from sklearn.linear_model import LinearRegression
import pandas as pd
import os

# Load the iris dataset for regression
iris = datasets.load_iris(as_frame=True)
iris_df = pd.merge(
    iris.data, iris.target, how="inner", left_index=True, right_index=True
)
# Drop the target column as we'll predict petal width from other features
iris_df = iris_df.drop(columns=["target"])

if __name__ == "__main__":
    # Create the output directory if it does not exist yet
    os.makedirs("./tmp/", exist_ok=True)

    # Create data object for regression - predict petal width from other measurements
    cs_data = Data.from_df(
        iris_df, "petal width (cm)", random_state=1910, description="Iris Petal Width Regression"
    )

    # Create a LinearRegression model
    regressor = LinearRegression()

    cs_model = Model(
        data=cs_data,
        estimator=regressor,
        description="LinearRegression predicting petal width on Iris dataset",
    )

    # Fit the model
    cs_model.fit()

    # Evaluate with regression metrics
    metrics = cs_model.evaluate()
    print("Regression Metrics:")
    print(f"  RMSE: {metrics['rmse']:.4f}")
    print(f"  MAE: {metrics['mae']:.4f}")
    print(f"  R2 Score: {metrics['r2']:.4f}")
    print(f"  MSE: {metrics['mse']:.4f}")

    # Make predictions
    predictions = cs_model.predict(cs_data.X_test)

    # Save the model
    cs_model.to_pickle("./tmp/iris_regressor.pkl.gz")

    print("\nRegressor successfully trained, evaluated, and saved!")
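The four regression metrics named above have standard textbook definitions, sketched here in plain Python on invented toy values. The dictionary keys match those used in the example, but whether ColdSnap computes the metrics in exactly this way is an assumption. The regression_metrics helper is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R2 and MSE from paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
    }


# Toy values, invented for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE and MAE are expressed in the units of the target (centimetres for petal width), while R2 is unitless and equals 1 for a perfect fit.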

Loading a Model

Once a model has been stored, it can be loaded using Model.from_pickle(path). Once loaded, details on the model and its performance can be retrieved by calling .summary().

from coldsnap import Model

if __name__ == "__main__":
    try:
        cs_model = Model.from_pickle("./tmp/iris_model.pkl.gz")
    except OSError:
        # FileNotFoundError is a subclass of OSError, so a missing file ends up here
        raise SystemExit("Model not found, run the script to create models first!")

    print(cs_model.summary())

Creating an Overview of Your Models

To quickly compare a number of models, use the create_overview function as shown below.

from coldsnap.utils import create_overview

if __name__ == "__main__":
    paths = [
        "./tmp/iris_model.pkl.gz",
        "./tmp/iris_model_svc.pkl.gz",
        "./tmp/iris_model_dt.pkl.gz",
    ]

    overview_df = create_overview(paths)

    # DataFrame.to_markdown() requires the optional "tabulate" package
    print(overview_df.to_markdown())

The table below shows the output: for each model in the input list, you get its summary and evaluation metrics.

|    | path | model_code | model_description | model_hash | data_code | data_description | data_hash | num_features | features | num_classes | classes | accuracy | precision | recall | f1 | roc_auc |
|---:|:-----|:-----------|:------------------|:-----------|:----------|:-----------------|:----------|-------------:|:---------|------------:|:--------|---------:|----------:|-------:|---:|--------:|
| 0 | ./tmp/iris_model.pkl.gz | RF01 | RandomForestClassifier, default params on Iris dataset | b3f8665bce0ee979b51c9729019ae76d7ed3b83522024b9fb3375e1b96a3dc11 | IrD | Iris Dataset | 975cdbb5f836a810ad019751a998b18683437093f372f4545fd00be5335d5e4b | 4 | sepal length (cm), sepal width (cm), petal length (cm), petal width (cm) | 3 | 0, 1, 2 | 0.973684 | 0.975564 | 0.973684 | 0.973545 | 0.997973 |
| 1 | ./tmp/iris_model_svc.pkl.gz | SVC01 | SVC (with probabilities) on Iris dataset | 280f5c4ca76b77144bbe7e9768bfc663b45fdafe61be3bbdc793458597f75e07 | IrD | Iris Dataset | 975cdbb5f836a810ad019751a998b18683437093f372f4545fd00be5335d5e4b | 4 | sepal length (cm), sepal width (cm), petal length (cm), petal width (cm) | 3 | 0, 1, 2 | 0.973684 | 0.975564 | 0.973684 | 0.973545 | 0.997973 |
| 2 | ./tmp/iris_model_dt.pkl.gz | DT01 | DecisionTreeClassifier (max_depth=2) on Iris dataset | 3814de3d290288de03f1b2388897c964b967c7f8ffa44303c61c41575da5d856 | IrD | Iris Dataset | 975cdbb5f836a810ad019751a998b18683437093f372f4545fd00be5335d5e4b | 4 | sepal length (cm), sepal width (cm), petal length (cm), petal width (cm) | 3 | 0, 1, 2 | 0.947368 | 0.947368 | 0.947368 | 0.947368 | 0.975673 |
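One way to use such an overview is to check that all models share the same data_hash before comparing their scores, since metrics are only directly comparable when the underlying data snapshot is identical. The sketch below does this in plain Python on a few columns copied from the table above. The list of dictionaries stands in for overview_df.to_dict("records"), and the ranking rule (accuracy first, then roc_auc) is an illustrative choice, not something ColdSnap prescribes.

```python
# Data hash shared by all three rows in the table above
IRIS_DATA_HASH = "975cdbb5f836a810ad019751a998b18683437093f372f4545fd00be5335d5e4b"

# Stand-in for overview_df.to_dict("records"), reduced to a few columns
rows = [
    {"model_code": "RF01", "data_hash": IRIS_DATA_HASH, "accuracy": 0.973684, "roc_auc": 0.997973},
    {"model_code": "SVC01", "data_hash": IRIS_DATA_HASH, "accuracy": 0.973684, "roc_auc": 0.997973},
    {"model_code": "DT01", "data_hash": IRIS_DATA_HASH, "accuracy": 0.947368, "roc_auc": 0.975673},
]

# Scores are only comparable if every model saw the same data snapshot
data_hashes = {row["data_hash"] for row in rows}
same_data = len(data_hashes) == 1

# Rank by accuracy, breaking ties with ROC AUC (highest first)
ranked = sorted(rows, key=lambda row: (row["accuracy"], row["roc_auc"]), reverse=True)
ranked_codes = [row["model_code"] for row in ranked]

print(f"Same data snapshot for all models: {same_data}")
print("Ranking:", ", ".join(ranked_codes))
```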

Evaluating Model Performance

There are a few common metrics built into ColdSnap. See the example below (which assumes a model is loaded in cs_model).

import matplotlib.pyplot as plt

# Assumes cs_model was loaded with Model.from_pickle(...)
print(cs_model.evaluate())

# Confusion matrix
print(cs_model.confusion_matrix())

fig, ax = plt.subplots()
disp = cs_model.display_confusion_matrix(ax=ax, cmap="Blues")
plt.show()

# ROC curve
fig, ax = plt.subplots()

roc_disp = cs_model.display_roc_curve(ax=ax)

plt.show()

# SHAP beeswarm
cs_model.display_shap_beeswarm()
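As a reading aid for the confusion matrix above, the sketch below builds one by hand in plain Python and derives accuracy from it. It follows the scikit-learn convention of true labels in rows and predicted labels in columns. Whether ColdSnap returns its matrix in the same orientation is an assumption, and the helper names and toy labels are invented for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count samples per (true label, predicted label) pair; rows are true labels."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels, invented for illustration
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
acc = accuracy(matrix)

for row in matrix:
    print(row)
print(f"accuracy: {acc:.4f}")
```

Off-diagonal cells show which classes are confused with each other: here one sample of class 1 was predicted as class 2.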

Contributing

Any contributions you make are greatly appreciated.

  • Found a bug or have some suggestions? Open an issue.
  • Pull requests are welcome! Please open an issue first to discuss the features or changes you wish to implement.

Contact

ColdSnap was developed by Sebastian Proost at the RaesLab. ColdSnap is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

For commercial access inquiries, please contact Jeroen Raes.
