Conversation

Contributor

@LinasKo LinasKo commented Jul 4, 2024

Description

Images are now passed to DetectionDataset as List[str], lazy-loading them as needed.

  • Updated DetectionDataset
  • Updated ClassificationDataset
  • Updated yolo, pascal voc, coco loaders
  • Fixed a couple of bugs in the loaders
  • Updated deprecations list
  • Benchmarked runtime

That said, I still have doubts about the constructor signature. We pass List[str] for images and Dict[str, Detections] for annotations, so the same paths appear both as list entries and as dict keys.
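The duplication is easier to see in a toy sketch of the new signature. Everything below is a hypothetical stand-in (a minimal class and a fake loader), not supervision's actual DetectionDataset implementation:

```python
from typing import Callable, Dict, List

class LazyImageDataset:
    """Toy model of the new signature: image paths as a list, annotations
    keyed by those same paths -- the duplication noted above."""

    def __init__(self, image_paths: List[str], annotations: Dict[str, str],
                 loader: Callable[[str], str]):
        self.image_paths = image_paths
        self.annotations = annotations
        self.loader = loader  # cv2.imread in the real dataset

    def __len__(self) -> int:
        return len(self.image_paths)

    def __iter__(self):
        for path in self.image_paths:
            # The image is only loaded when the dataset is iterated.
            yield path, self.loader(path), self.annotations[path]

paths = ["a.jpg", "b.jpg"]
ds = LazyImageDataset(
    image_paths=paths,
    annotations={p: f"dets-{p}" for p in paths},  # same paths again, as keys
    loader=lambda p: f"pixels-{p}",
)
triples = list(ds)  # each item: (path, lazily loaded image, annotation)
```

An alternative that avoids sending the paths twice would be a single `Dict[str, Detections]` whose keys are the image paths, at the cost of losing an explicit image ordering.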

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested, please provide a testcase or example of how you tested the change?

Any specific deployment considerations

Docs

  • Docs updated? What were the changes:

* Unfinished
* Untested
* Committing to show Piotr a bit earlier

LinasKo commented Jul 9, 2024

Left to do:

  • Test with datasets.
  • I moved some functions out of utils for better type checking, but I think I found a way to move them back.
  • Think whether passing both images list and annotations is clean enough. Paths are sent twice.

Known imperfections:

  • Saving a lazily loaded image first decodes it into memory, then writes it back to disk.

@LinasKo LinasKo marked this pull request as ready for review July 9, 2024 09:16

LinasKo commented Jul 9, 2024

Tests show positive results. On my machine:

  • Loading 50,000 images into memory takes 41 seconds on average.
  • Iterating through 50,000 lazy-loaded images takes 40 seconds.
  • Throughput is around 1250 images per second.
  • No noticeable spike in memory usage when lazy loading.

Test code:

Code: Create dataset
import os
import shutil
import argparse

def copy_image_n_times(image_path, output_folder, n):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    image_size = os.path.getsize(image_path)
    total_size = image_size * n

    print(f"Image size: {image_size / (1024 * 1024):.2f} MB")
    print(f"Total dataset size: {total_size / (1024 * 1024):.2f} MB")

    confirm = input(f"Are you sure you want to create a dataset of size {total_size / (1024 * 1024):.2f} MB? (yes/no): ")

    if confirm.lower() != 'yes':
        print("Operation cancelled.")
        return

    for i in range(n):
        new_image_path = os.path.join(output_folder, f"{i+1}_{os.path.basename(image_path)}")
        shutil.copy(image_path, new_image_path)
        print(f"Copied {image_path} to {new_image_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Copy an image into a new folder N times.")
    parser.add_argument("image_path", type=str, help="Path to the source image.")
    parser.add_argument("output_folder", type=str, help="Path to the output folder.")
    parser.add_argument("n", type=int, help="Number of times to copy the image.")

    args = parser.parse_args()
    
    copy_image_n_times(args.image_path, args.output_folder, args.n)
Code: Benchmark
from pathlib import Path
import time
import argparse
from typing import Dict, List
import cv2
import numpy as np
from inference import get_model

from supervision.dataset.core import DetectionDataset
from supervision.detection.core import Detections


def detect(image: np.ndarray) -> Detections:
    model = get_model("yolov8s-640")
    result = model.infer(image)[0]
    detections = Detections.from_inference(result)
    return detections

def make_annotations(image_paths: List[str]) -> Dict[str, Detections]:
    # Run detection once and reuse the result for every image;
    # annotation content does not affect the loading benchmark.
    first_image = cv2.imread(image_paths[0])
    annotations = detect(first_image)
    return {path: annotations for path in image_paths}

def load_into_memory(image_paths: List[str]) -> Dict[str, np.ndarray]:
    images = {}
    for i, path in enumerate(image_paths):
        if i % 10_000 == 0:
            print(f"Loaded {i} images...")
        image = cv2.imread(path)
        images[path] = image
    return images

def load_paths(folder_path: str, max_count: int = 1_000_000) -> List[str]:
    return [str(path) for path in Path(folder_path).glob("*.jpg")][:max_count]

def measure_throughput(dataset: DetectionDataset, num_iterations: int = 3) -> float:
    times = []
    for i in range(num_iterations):
        t0 = time.time()
        for _, _, _ in dataset:
            pass
        t1 = time.time()
        print(f"Completed iteration {i + 1}/{num_iterations} in {t1 - t0:.2f} seconds")
        times.append(t1 - t0)
    throughput = len(dataset) * num_iterations / sum(times)
    return throughput

def main(folder_path, max_count):
    paths = load_paths(folder_path, max_count)
    annotations = make_annotations(paths)

    print("Testing in-memory loading...")
    t0 = time.time()
    images = load_into_memory(paths)
    t1 = time.time()
    print(f"Loaded {len(images)} images in {t1 - t0:.2f} seconds")

    t0 = time.time()
    in_memory_dataset = DetectionDataset(
        classes=["cat", "dog", "raccoon"],
        images=images,
        annotations=annotations
    )
    t1 = time.time()
    print(f"Created in-memory dataset in {t1 - t0:.2f} seconds")

    in_memory_throughput = measure_throughput(in_memory_dataset)
    print(f"In-memory throughput: {in_memory_throughput:.2f} images/second")
    del images
    del in_memory_dataset

    ####

    print("Testing lazy loading...")

    t0 = time.time()
    lazy_dataset = DetectionDataset(
        classes=["cat", "dog", "raccoon"],
        images=paths,
        annotations=annotations
    )
    t1 = time.time()
    print(f"Created lazy dataset in {t1 - t0:.2f} seconds")

    lazy_throughput = measure_throughput(lazy_dataset)
    print(f"Lazy loading throughput: {lazy_throughput:.2f} images/second")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Compare in-memory vs lazy loading throughput for DetectionDataset.")
    parser.add_argument("folder_path", type=str, help="Path to the folder containing images.")
    parser.add_argument("--max", type=int, default=1_000_000, help="Maximum number of images to load.")

    args = parser.parse_args()

    main(args.folder_path, args.max)

@LinasKo LinasKo force-pushed the feat/dataset-lazy-loading branch from 6a659a5 to e690b92 Compare July 9, 2024 11:17
@LinasKo LinasKo force-pushed the feat/dataset-lazy-loading branch from c318159 to 172877a Compare July 9, 2024 11:56
@LinasKo LinasKo force-pushed the feat/dataset-lazy-loading branch from a76761c to 8101485 Compare July 9, 2024 12:24

LinasKo commented Jul 9, 2024

Colab for testing dataset loading / saving in different formats: https://colab.research.google.com/drive/1LH4rUUr_Iecqj4n43Xsi3vIakjm4iYp2?usp=sharing


LinasKo commented Jul 9, 2024

@SkalskiP, ready for review.

@SkalskiP SkalskiP merged commit de89618 into develop Jul 10, 2024
@onuralpszr onuralpszr deleted the feat/dataset-lazy-loading branch September 23, 2024 15:18