Conversation

Contributor

@LinasKo LinasKo commented Jul 4, 2024

Description

Images are now passed to DetectionDataset as List[str], lazy-loading them as needed.

  • Updated DetectionDataset
  • Updated ClassificationDataset
  • Updated yolo, pascal voc, coco loaders
  • Fixed a couple of bugs in the loaders
  • Updated deprecations list
  • Benchmarked runtime

That said, I still have doubts about the constructor signature. We pass List[str] for images and Dict[str, Detections] for annotations, so the same paths appear both as list entries and as dict keys.
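The duplication is easier to see in a toy sketch of the new signature. Everything below is a hypothetical stand-in (a minimal class and a fake loader), not supervision's actual DetectionDataset implementation:

```python
from typing import Callable, Dict, List

class LazyImageDataset:
    """Toy model of the new signature: image paths as a list, annotations
    keyed by those same paths -- the duplication noted above."""

    def __init__(self, image_paths: List[str], annotations: Dict[str, str],
                 loader: Callable[[str], str]):
        self.image_paths = image_paths
        self.annotations = annotations
        self.loader = loader  # cv2.imread in the real dataset

    def __len__(self) -> int:
        return len(self.image_paths)

    def __iter__(self):
        for path in self.image_paths:
            # The image is only loaded when the dataset is iterated.
            yield path, self.loader(path), self.annotations[path]

paths = ["a.jpg", "b.jpg"]
ds = LazyImageDataset(
    image_paths=paths,
    annotations={p: f"dets-{p}" for p in paths},  # same paths again, as keys
    loader=lambda p: f"pixels-{p}",
)
triples = list(ds)  # each item: (path, lazily loaded image, annotation)
```

An alternative that avoids sending the paths twice would be a single `Dict[str, Detections]` whose keys are the image paths, at the cost of losing an explicit image ordering.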

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested, please provide a testcase or example of how you tested the change?

Any specific deployment considerations

Docs

  • Docs updated? What were the changes:

* Unfinished
* Untested
* Committing to show Piotr a bit earlier

LinasKo commented Jul 9, 2024

Left to do:

  • Test with datasets.
  • I moved some functions out of utils for better type checking, but I think I found a way to move them back.
  • Think whether passing both images list and annotations is clean enough. Paths are sent twice.

Known imperfections:

  • Saving a lazily loaded image first decodes it into memory, then writes it back to disk.

@LinasKo LinasKo marked this pull request as ready for review July 9, 2024 09:16

LinasKo commented Jul 9, 2024

Tests show positive results. On my machine:

  • Loading 50,000 images into memory takes 41 seconds on average.
  • Iterating through 50,000 lazy-loaded images takes 40 seconds.
  • Throughput is around 1250 images per second.
  • No noticeable spike in memory usage when lazy loading.

Test code:

Code: Create dataset
import os
import shutil
import argparse

def copy_image_n_times(image_path, output_folder, n):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    image_size = os.path.getsize(image_path)
    total_size = image_size * n

    print(f"Image size: {image_size / (1024 * 1024):.2f} MB")
    print(f"Total dataset size: {total_size / (1024 * 1024):.2f} MB")

    confirm = input(f"Are you sure you want to create a dataset of size {total_size / (1024 * 1024):.2f} MB? (yes/no): ")

    if confirm.lower() != 'yes':
        print("Operation cancelled.")
        return

    for i in range(n):
        new_image_path = os.path.join(output_folder, f"{i+1}_{os.path.basename(image_path)}")
        shutil.copy(image_path, new_image_path)
        print(f"Copied {image_path} to {new_image_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Copy an image into a new folder N times.")
    parser.add_argument("image_path", type=str, help="Path to the source image.")
    parser.add_argument("output_folder", type=str, help="Path to the output folder.")
    parser.add_argument("n", type=int, help="Number of times to copy the image.")

    args = parser.parse_args()
    
    copy_image_n_times(args.image_path, args.output_folder, args.n)
Code: Benchmark
from pathlib import Path
import time
import argparse
from typing import Dict, List
import cv2
import numpy as np
from inference import get_model

from supervision.dataset.core import DetectionDataset
from supervision.detection.core import Detections


def detect(image: np.ndarray) -> Detections:
    model = get_model("yolov8s-640")
    result = model.infer(image)[0]
    detections = Detections.from_inference(result)
    return detections

def make_annotations(image_paths: List[str]) -> Dict[str, Detections]:
    # Run detection once and reuse the result for every image;
    # annotation content does not affect the loading benchmark.
    first_image = cv2.imread(image_paths[0])
    annotations = detect(first_image)
    return {path: annotations for path in image_paths}

def load_into_memory(image_paths: List[str]) -> Dict[str, np.ndarray]:
    images = {}
    for i, path in enumerate(image_paths):
        if i % 10_000 == 0:
            print(f"Loaded {i} images...")
        image = cv2.imread(path)
        images[path] = image
    return images

def load_paths(folder_path: str, max_count: int = 1_000_000) -> List[str]:
    return [str(path) for path in Path(folder_path).glob("*.jpg")][:max_count]

def measure_throughput(dataset: DetectionDataset, num_iterations: int = 3) -> float:
    times = []
    for i in range(num_iterations):
        t0 = time.time()
        for _, _, _ in dataset:
            pass
        t1 = time.time()
        print(f"Completed iteration {i + 1}/{num_iterations} in {t1 - t0:.2f} seconds")
        times.append(t1 - t0)
    throughput = len(dataset) * num_iterations / sum(times)
    return throughput

def main(folder_path, max_count):
    paths = load_paths(folder_path, max_count)
    annotations = make_annotations(paths)

    print("Testing in-memory loading...")
    t0 = time.time()
    images = load_into_memory(paths)
    t1 = time.time()
    print(f"Loaded {len(images)} images in {t1 - t0:.2f} seconds")

    t0 = time.time()
    in_memory_dataset = DetectionDataset(
        classes=["cat", "dog", "raccoon"],
        images=images,
        annotations=annotations
    )
    t1 = time.time()
    print(f"Created in-memory dataset in {t1 - t0:.2f} seconds")

    in_memory_throughput = measure_throughput(in_memory_dataset)
    print(f"In-memory throughput: {in_memory_throughput:.2f} images/second")
    del images
    del in_memory_dataset

    ####

    print("Testing lazy loading...")

    t0 = time.time()
    lazy_dataset = DetectionDataset(
        classes=["cat", "dog", "raccoon"],
        images=paths,
        annotations=annotations
    )
    t1 = time.time()
    print(f"Created lazy dataset in {t1 - t0:.2f} seconds")

    lazy_throughput = measure_throughput(lazy_dataset)
    print(f"Lazy loading throughput: {lazy_throughput:.2f} images/second")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Compare in-memory vs lazy loading throughput for DetectionDataset.")
    parser.add_argument("folder_path", type=str, help="Path to the folder containing images.")
    parser.add_argument("--max", type=int, default=1_000_000, help="Maximum number of images to load.")

    args = parser.parse_args()

    main(args.folder_path, args.max)

@LinasKo LinasKo force-pushed the feat/dataset-lazy-loading branch from 6a659a5 to e690b92 Compare July 9, 2024 11:17
@LinasKo LinasKo force-pushed the feat/dataset-lazy-loading branch from c318159 to 172877a Compare July 9, 2024 11:56
@LinasKo LinasKo force-pushed the feat/dataset-lazy-loading branch from a76761c to 8101485 Compare July 9, 2024 12:24

LinasKo commented Jul 9, 2024

Colab for testing dataset loading / saving in different formats: https://colab.research.google.com/drive/1LH4rUUr_Iecqj4n43Xsi3vIakjm4iYp2?usp=sharing


LinasKo commented Jul 9, 2024

@SkalskiP, ready for review.

@SkalskiP SkalskiP merged commit de89618 into develop Jul 10, 2024
@onuralpszr onuralpszr deleted the feat/dataset-lazy-loading branch September 23, 2024 15:18