Train TensorFlow models for image/video/features classification or other tasks. By default, the repository is set up to train an image classification model.
Install tensorflow and the related cudnn libraries from the tensorflow-official-documentation if the cudnn libraries are not set up.
Create a .env file with cp .env.example .env, then make sure it has the following contents with the correct paths, in particular the correct CUDA install path:
XLA_FLAGS="--xla_gpu_cuda_data_dir=/usr/local/cuda"
TF_XLA_FLAGS="--tf_xla_enable_xla_devices --tf_xla_auto_jit=2 --tf_xla_cpu_global_jit"
TF_CPP_MIN_LOG_LEVEL='3'
TF_FORCE_GPU_ALLOW_GROWTH="true"
OMP_NUM_THREADS="15"
KMP_BLOCKTIME="0"
KMP_SETTINGS="1"
KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
CUDA_DEVICE_ORDER="PCI_BUS_ID"
CUDA_VISIBLE_DEVICES="0"
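As a quick sanity check, the values above can be read back from Python before starting a run. This is a minimal sketch using only the standard library; `parse_env` is a hypothetical helper that is not part of this repository, and it only handles simple KEY=VALUE lines with optional surrounding quotes.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines (hypothetical helper, not part of the repo)."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env


# A few lines copied from the .env contents above
sample = """
XLA_FLAGS="--xla_gpu_cuda_data_dir=/usr/local/cuda"
TF_CPP_MIN_LOG_LEVEL='3'
CUDA_VISIBLE_DEVICES="0"
"""
env = parse_env(sample)
# CUDA_VISIBLE_DEVICES is a comma-separated list of GPU ids
gpu_ids = [g for g in env["CUDA_VISIBLE_DEVICES"].split(",") if g]
print(env["XLA_FLAGS"], len(gpu_ids))
```

Splitting on the first "=" only keeps values such as the XLA_FLAGS entry intact, since they contain "=" themselves.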
Set up Docker to run with the NVIDIA Container Toolkit first.
Create a checkpoints dir in the current project directory.
bash scripts/build_docker.sh
bash scripts/run_docker.sh -p TF_BOARD_PORT
poetry install --all-groups
# export pyproject.toml requirements to requirements.txt
python scripts/poetry_to_pip_requirements.py
python -m venv venv; source venv/bin/activate
pip install -r requirements.txt
conda create --name tf_gpu tensorflow-gpu python=3.12 -y
conda activate tf_gpu
while read -r requirement; do conda install --yes "$requirement"; done < requirements.txt
Note: Conda sets up cuda, cudnn and cudatoolkit automatically, downloading non-python dependencies as well.
The data directory must be organized according to the following structure, with sub-directories named after the classes and containing the images of that class. The CIFAR-10 dataset in JPG format can be acquired from https://github.com/YoongiKim/CIFAR-10-images for a sample train and test run.
i.e.
data
|_ src_dataset
   |_ class_1
      |_ img1
      |_ img2
      |_ ....
   |_ class_2
      |_ img1
      |_ img2
      |_ ....
   ...
Note: ImageNet-style ordering of data is also supported, where images are placed in subdirectories inside the class directories.
i.e.
data
|_ src_dataset
   |_ class_1
      |_ 00d
         |_ img1
         |_ img2
      |_ 01
         |_ img1
         |_ img2
      |_ ...
   |_ ...
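To check that a dataset follows one of the two layouts, it can help to count the files under each class directory. The sketch below is an illustration using only the standard library; `count_images_per_class` is a hypothetical helper, not a script from this repository, and it counts every file regardless of extension.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_images_per_class(root: Path) -> Counter:
    """Count files under each top-level class directory, at any depth."""
    counts = Counter()
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # rglob also descends into ImageNet-style subdirectories
        counts[class_dir.name] = sum(1 for f in class_dir.rglob("*") if f.is_file())
    return counts


# Build a tiny throwaway tree mixing the flat and the nested layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "src_dataset"
    for rel in ["class_1/img1.jpg", "class_1/img2.jpg", "class_2/01/img1.jpg"]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    counts = count_images_per_class(root)

print(dict(counts))
```

Because the class name is taken from the first directory level only, both layouts yield the same per-class totals.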
If the classes do not all have an equal number of training samples, the data can be duplicated to balance them.
python data_preparation/duplicate_data.py --sd data/src_dataset --td data/duplicated_dataset -n NUM_TO_DUPLICATE
# find corrupt images (i.e. that cannot be opened with tf.io.decode_image)
python data_preparation/find_corrupt_imgs.py --rd data/src_dataset
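The idea of balancing classes by duplication, mentioned above, can be sketched as follows. This is not the logic of duplicate_data.py, whose exact behavior and the meaning of NUM_TO_DUPLICATE are not described here; `plan_duplicates` and its `target` parameter are hypothetical, and the sketch only decides which files would be copied.

```python
from itertools import cycle, islice


def plan_duplicates(files_by_class: dict[str, list[str]], target: int) -> dict[str, list[str]]:
    """Return, per class, the files to copy again so each class reaches `target`."""
    plan = {}
    for name, files in files_by_class.items():
        missing = max(0, target - len(files))
        # Cycle through the existing files so the copies are spread evenly
        plan[name] = list(islice(cycle(files), missing)) if files else []
    return plan


# Hypothetical file names for two unbalanced classes
files_by_class = {
    "class_1": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "class_2": ["x.jpg"],
}
plan = plan_duplicates(files_by_class, target=4)
print(plan)
```

Classes that already have at least `target` samples get an empty plan, so nothing is removed or truncated.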
Set the validation and test splits as fractions (e.g. 0.1). Both splits are optional.
python data_preparation/create_train_val_test_split.py --sd data/duplicated_dataset --td data/split_dataset [--vs VAL_SPLIT] [--ts TEST_SPLIT]
# to check the number of images in train, val and test dirs
bash scripts/count_files_per_subdir.sh data/split_dataset
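How fractional splits translate into file counts can be illustrated with a short sketch. It assumes a single flat list of files and a seeded shuffle; whether create_train_val_test_split.py splits per class or shuffles in this way is not stated here, and `split_files` is a hypothetical helper.

```python
import random


def split_files(files, val_split=0.0, test_split=0.0, seed=0):
    """Shuffle and split a file list into train/val/test by fraction."""
    if not 0 <= val_split + test_split < 1:
        raise ValueError("val_split + test_split must be in [0, 1)")
    shuffled = sorted(files)
    # A seeded shuffle keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_split)
    n_test = int(len(shuffled) * test_split)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


files = [f"img{i}.jpg" for i in range(20)]
train, val, test = split_files(files, val_split=0.1, test_split=0.1)
print(len(train), len(val), len(test))
```

Applying the same function to each class directory separately would keep the class ratios roughly equal across the three splits.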
Note: The test split should not be converted into tfrecords and the original data->class_sub_directory format should be used.
# convert train files into train tfrecord, select NUM_SAMPLES_PER_SHARDS so that each shard has a size of 100 MB+
python data_preparation/convert_imgs_to_tfrecord.py --sd data/split_dataset/train --td data/tfrecord_dataset/train [--cp CLASS_MAP_TXT_SAVEPATH] [--ns NUM_SAMPLES_PER_SHARDS]
# convert val files into val tfrecord, select NUM_SAMPLES_PER_SHARDS so that each shard has a size of 100 MB+
python data_preparation/convert_imgs_to_tfrecord.py --sd data/split_dataset/val --td data/tfrecord_dataset/val [--cp CLASS_MAP_TXT_SAVEPATH] [--ns NUM_SAMPLES_PER_SHARDS]
# to use multiprocessing use the --mt flag
Note: The test dataset is not converted to tfrecord because fast loading is not a priority: the test data is only run through once.
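A value for the samples-per-shard option can be estimated from the dataset size on disk. The sketch below is a rough calculation under the assumption that the serialized records are about as large as the source image files, which may not hold for how the conversion script encodes them; `samples_per_shard` and the dataset figures are hypothetical.

```python
import math


def samples_per_shard(total_bytes: int, num_samples: int, target_shard_mb: float = 100.0) -> int:
    """Estimate how many samples make up one shard of at least `target_shard_mb`."""
    if num_samples <= 0 or total_bytes <= 0:
        raise ValueError("dataset must be non-empty")
    avg_bytes = total_bytes / num_samples
    # Round up so a full shard does not fall below the target size
    return max(1, math.ceil(target_shard_mb * 1024 * 1024 / avg_bytes))


# Hypothetical dataset: 50,000 images totalling 1 GiB on disk
n = samples_per_shard(total_bytes=1024**3, num_samples=50_000)
num_shards = math.ceil(50_000 / n)
print(n, num_shards)
```

The last shard is usually smaller than the rest, since the sample count rarely divides evenly.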
To extract frames from videos into npy.npz files, install opencv and pyav, then run:
python data_preparation/extract_frames_from_video_dataset.py --sd SOURCE_DATA_DIR
# use -h for help
Configure all values in the YAML files inside the config dir. A sample config file for training on the src_dataset directory is provided in config/train_image_clsf.yaml.
The model information repository is located at tensorflow_training/model/models_info.py. New models can be added or model parameters can be modified through this file.
Set the number of GPUs to use, along with the TensorFlow and other system environment variables, in .env.
python train.py --cfg CONFIG_YAML_PATH [-r RESUME_CHECKPOINT_PATH]
Notes:
- Using the -r option while training will override the resume_checkpoint param in the config yaml if this param is not null.
- To add tensorflow logs to train/test logs, set the "disable_existing_loggers" parameter to true in tensorflow_training/logging/logger_config.json.
- Out of Memory errors during training could be caused by large batch sizes, model size or the dataset.cache() call in train preprocessing in tensorflow_training/pipelines/data_pipeline.py.
- When using mixed_float16 precision, the data types of the final dense and activation layers must be set to float32.
- An error like ValueError: Unexpected result of train_function (Empty logs) could be caused by incorrect paths to train and validation directories in the config.yaml files.
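The precedence described in the first note, where -r wins over the resume_checkpoint value from the config, can be sketched as below. This is not the code of train.py; `resolve_resume_checkpoint` is a hypothetical helper, and placing resume_checkpoint at the top level of the config is an assumption made for illustration.

```python
import argparse


def resolve_resume_checkpoint(cli_value, config):
    """Prefer the -r value; otherwise fall back to the config entry."""
    if cli_value is not None:
        return cli_value
    return config.get("resume_checkpoint")


parser = argparse.ArgumentParser()
parser.add_argument("--cfg", required=True)
parser.add_argument("-r", dest="resume", default=None)

# Hypothetical values standing in for the parsed YAML and the command line
config = {"resume_checkpoint": "checkpoints/from_config"}
args = parser.parse_args(
    ["--cfg", "config/train_image_clsf.yaml", "-r", "checkpoints/from_cli"]
)
resume_path = resolve_resume_checkpoint(args.resume, config)
print(resume_path)
```

When -r is omitted, the helper returns whatever the config holds, including None if the param is null.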
tensorboard --logdir=checkpoints/tf_logs/ --port=PORT_NUM
Make sure to set the correct test_data_dir under data and the class_map_txt_path under tester in the yaml config file.
The class_map_txt_path file is generated by the convert_imgs_to_tfrecord.py script when converting images to tfrecord format.
python test.py --cfg CONFIG_YAML_PATH -r TEST_CHECKPOINT_PATH
We can use a dockerized uvicorn and fastapi webserver with triton-server to serve the model through an HTTPS API endpoint. Instructions are at tensorflow_training/server/README.md.