This project demonstrates how to use DeepSpeed and PyTorch to train and evaluate a deep learning model on the FashionMNIST dataset. DeepSpeed is an open-source deep learning optimization library designed to improve training efficiency, particularly through better memory management and distributed training.
- `ds_config.json`: DeepSpeed configuration file that defines hyperparameters for the training process (e.g., learning rate, batch size). This file is essential for configuring DeepSpeed's performance optimizations during training; a sample is shown after this list.
- `ds_train.py`: Training script responsible for loading the FashionMNIST dataset, initializing the model, and performing training. DeepSpeed accelerates the training process and optimizes resource utilization.
- `ds_eval.py`: Evaluation script that loads the trained model and runs inference to evaluate its performance on the test set.
- `model.py`: Contains the model definition. `FashionModel` is a simple fully connected neural network for classifying the FashionMNIST dataset.
- DeepSpeed is a deep learning training acceleration library that optimizes memory usage and computation efficiency during training, especially for large-scale models.
- In `ds_train.py`, the model is initialized with DeepSpeed using `deepspeed.initialize()`, which optimizes the training process, reduces memory usage, and accelerates training (see the sketch after this list).
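A minimal sketch of how this initialization and a single training step might look; `deepspeed.initialize()` and the engine methods are standard DeepSpeed API, but the surrounding training-step code is an assumption about the script's structure:

```python
import deepspeed
import torch.nn.functional as F

from model import FashionModel  # the project's model definition

model = FashionModel()

# deepspeed.initialize wraps the model in an engine that manages the optimizer,
# distributed setup, and memory optimizations defined in ds_config.json
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

def train_step(images, labels):
    # the engine replaces the usual loss.backward() / optimizer.step() calls
    images = images.to(model_engine.device)
    labels = labels.to(model_engine.device)
    outputs = model_engine(images)
    loss = F.cross_entropy(outputs, labels)
    model_engine.backward(loss)
    model_engine.step()
    return loss.item()
```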
- In `model.py`, a simple neural network model, `FashionModel`, is built using PyTorch.
- The model architecture includes fully connected layers (`Linear`), activation functions (`ReLU`), etc., suitable for classification tasks like FashionMNIST; a possible layout is sketched after this list.
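The exact layer sizes in `model.py` are not given here, so the following is only a plausible sketch of such a network (28x28 grayscale inputs flattened to 784 features, 10 output classes):

```python
import torch.nn as nn

class FashionModel(nn.Module):
    """Simple fully connected classifier for 28x28 FashionMNIST images."""

    def __init__(self, hidden_size: int = 128, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                          # 1x28x28 -> 784
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),   # class logits
        )

    def forward(self, x):
        return self.net(x)
```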
- In `ds_train.py` and `ds_eval.py`, PyTorch and DeepSpeed are used for distributed training and evaluation.
- Using `torch.distributed.get_rank()`, the rank of the current process is obtained, and only the main process (`rank == 0`) performs data loading and model inference, preventing redundant operations (see the sketch after this list).
- `ds_eval.py` provides the code for model evaluation, including loading the trained model and performing inference. It demonstrates how to evaluate the model in both single-node and distributed mode.
- In `ds_eval.py`, the model parameters are saved and loaded using `torch.save()` and `torch.load()`.
- The model is stored in a `.pt` file, which can later be used for inference or further training (see the sketch after this list).
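A minimal sketch of this save/load pattern; the checkpoint filename here is an illustrative placeholder:

```python
import torch
from model import FashionModel

# Saving: store only the state_dict (parameter tensors) in a .pt file
model = FashionModel()
torch.save(model.state_dict(), "fashion_model.pt")

# Loading: rebuild the architecture, then restore the saved parameters
restored = FashionModel()
restored.load_state_dict(torch.load("fashion_model.pt", map_location="cpu"))
restored.eval()  # switch to inference mode before evaluation
```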
- Using DeepSpeed: Learned how to optimize the training process with DeepSpeed, especially in distributed training and memory management.
- PyTorch Neural Network Definition and Training: Understood how to define a simple neural network model using PyTorch and train it using a data loader.
- Distributed Training and Inference: Learned how to perform distributed training across multiple nodes and ensure tasks are not duplicated.
- Model Storage and Recovery: Learned how to save a trained model and load it for later use in inference or continued training.
This project demonstrates how to combine DeepSpeed and PyTorch to complete a classification task on the FashionMNIST dataset. By leveraging DeepSpeed, you can efficiently accelerate the training process and reduce memory consumption, making it especially suitable for large-scale model training. This workflow can be easily applied in real-world projects using PyTorch and DeepSpeed for deep learning training, evaluation, and model management.