feat: enhance PowerPoint reader with comprehensive content extraction and image handling #19478
Conversation
- Added `ImageExtractor` and `SlideContentExtractor` for improved image captioning and structured content extraction from slides.
- Updated `PptxReader` to support multithreaded processing and enhanced validation.
- Introduced new dependencies: `python-pptx` and `tenacity` for image processing and retry logic.
- Created a comprehensive test PowerPoint presentation and corresponding tests to validate extraction capabilities.
Additional Testing Evidence

Sample PPT Under Test: comprehensive_test_presentation.pptx

Script to Run Basic Parsing (Default Case)

```python
#!/usr/bin/env python3
"""
Basic PPT Reader Test - Uses comprehensive_test_presentation.pptx and outputs to JSON.
"""
import json
from pathlib import Path
from datetime import datetime

from llama_index.readers.file.slides import PptxReader


def document_to_dict(doc):
    """Convert a Document object to a dictionary for JSON serialization."""
    return {
        "text": doc.text,
        "metadata": doc.metadata,
        "doc_id": str(doc.doc_id) if hasattr(doc, "doc_id") else None,
    }


def main():
    # Use the existing test presentation
    ppt_file = "comprehensive_test_presentation.pptx"
    ppt_path = Path(ppt_file)
    if not ppt_path.exists():
        print(f"Error: {ppt_file} not found in current directory")
        return

    print(f"Processing: {ppt_file}")

    # Initialize reader
    reader = PptxReader(
        extract_images=False,
        context_consolidation_with_llm=False,
        num_workers=4,
        batch_size=10,
    )

    # Process the file
    start_time = datetime.now()
    documents = reader.load_data(ppt_file)
    processing_time = (datetime.now() - start_time).total_seconds()
    print(f"Processed {len(documents)} slides in {processing_time:.2f} seconds")

    # Create output data
    output_data = {
        "processing_info": {
            "source_file": ppt_file,
            "timestamp": datetime.now().isoformat(),
            "processing_time_seconds": processing_time,
            "total_slides": len(documents),
        },
        "documents": [document_to_dict(doc) for doc in documents],
    }

    # Save to JSON
    output_file = f"ppt_reader_output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(output_data, f, indent=2, ensure_ascii=False)
    print(f"Output saved to: {output_file}")


if __name__ == "__main__":
    main()
```

Final Output: (screenshot omitted)

Output if you set extract_images=True: (screenshot omitted)

Script to Test Consolidation with LLM

```python
#!/usr/bin/env python3
"""
PPT Reader Test with LiteLLM and Databricks - Tests LLM content consolidation.
"""
import json
import argparse
import os
from pathlib import Path
from datetime import datetime
from llama_index.readers.file.slides import PptxReader
from llama_index.core import Settings
def get_os_env(key: str) -> str:
"""Get environment variable or raise error if not found."""
value = os.getenv(key)
if not value:
raise ValueError(f"Environment variable {key} not found")
return value
def document_to_dict(doc):
"""Convert a Document object to a dictionary for JSON serialization."""
return {
"text": doc.text,
"metadata": doc.metadata,
"doc_id": str(doc.doc_id) if hasattr(doc, 'doc_id') else None,
}
def main():
parser = argparse.ArgumentParser(description="Test PPT Reader with LiteLLM and Databricks")
parser.add_argument("--api-base", required=True, help="LLM API base URL")
parser.add_argument("--api-key", required=True, help="LLM API key")
parser.add_argument("--model", default="openai/gpt-41", help="Completion model (default: databricks-dbrx-instruct)")
args = parser.parse_args()
# Use the existing test presentation
ppt_file = "comprehensive_test_presentation.pptx"
ppt_path = Path(ppt_file)
if not ppt_path.exists():
print(f"Error: {ppt_file} not found in current directory")
return
print(f"Processing: {ppt_file}")
print(f"API Base: {args.api_base}")
print(f"Model: {args.model}")
# Set up LiteLLM
try:
from llama_index.llms.litellm import LiteLLM
Settings.llm = LiteLLM(
model=args.model,
api_base=args.api_base,
api_key=args.api_key,
)
print("✅ LiteLLM configured successfully")
# Test LLM with a simple completion
print("🧪 Testing LLM connection...")
try:
test_response = Settings.llm.complete("Hello, this is a test.")
print(f"✅ LLM test successful: {test_response.text[:100]}...")
except Exception as e:
print(f"❌ LLM test failed: {e}")
return
except ImportError:
print("Error: LiteLLM not available. Install with: pip install llama-index-llms-litellm")
return
except Exception as e:
print(f"Error configuring LiteLLM: {e}")
return
# Initialize reader with LLM consolidation enabled
reader = PptxReader(
extract_images=False,
context_consolidation_with_llm=True, # Enable LLM consolidation
llm=None, # Will use Settings.llm
num_workers=4,
batch_size=10
)
# Process the file
print("🔄 Processing with LLM content consolidation...")
start_time = datetime.now()
try:
documents = reader.load_data(ppt_file)
processing_time = (datetime.now() - start_time).total_seconds()
print(f"✅ Processed {len(documents)} slides in {processing_time:.2f} seconds")
# Debug: Check if content is actually being extracted
print("\n🔍 Debug Info:")
print(f"Reader LLM: {reader.llm}")
print(f"LLM consolidation enabled: {reader.context_consolidation_with_llm}")
if documents:
first_doc = documents[0]
print(f"First document text length: {len(first_doc.text)}")
print(f"First document has extraction errors: {first_doc.metadata.get('extraction_errors', [])}")
if len(first_doc.text) < 100:
print(f"First document text: '{first_doc.text}'")
else:
print("❌ No documents returned!")
# Create output data
output_data = {
"processing_info": {
"source_file": ppt_file,
"timestamp": datetime.now().isoformat(),
"processing_time_seconds": processing_time,
"total_slides": len(documents),
"llm_consolidation_enabled": True,
"model": args.model,
"api_base": args.api_base
},
"documents": [document_to_dict(doc) for doc in documents]
}
# Save to JSON
output_file = f"ppt_reader_llm_output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(output_data, f, indent=2, ensure_ascii=False)
print(f"💾 Output saved to: {output_file}")
# Show some sample consolidated content
print("\n📋 Sample consolidated content from first slide:")
if documents:
sample_text = documents[0].text[:500] + "..." if len(documents[0].text) > 500 else documents[0].text
print(f" {sample_text}")
except Exception as e:
print(f"❌ Error during processing: {e}")
return
if __name__ == "__main__":
main() Final Output: |
- Added basic and advanced usage examples for the PptxReader, demonstrating text, table, chart, and speaker notes extraction.
- Included options for image extraction, LLM content synthesis, parallel processing, and batch size configuration.
This is a great refactoring of the PptxReader - it also includes some changes that I would describe as breaking.
If we decide to merge this, it will need a major version bump :)
```python
# Advanced usage - control parsing behavior
parser = PptxReader(
    extract_images=True,  # Enable image captioning
    context_consolidation_with_llm=True,  # Use LLM for content synthesis
```
context or content?
Context. Since we extract the content anyway (with or without the LLM), the place where the LLM comes into play is making the extracted content contextually rich, which helps improve ranking for RAG/contextual retrieval processes.
That said, there is no hard preference; we can rename it.
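For intuition, consolidation along these lines could look like the sketch below; the prompt wording and the helper name `consolidate_slide` are invented for illustration, not taken from the PR (the `llm.complete(...).text` call matches the LlamaIndex LLM interface used elsewhere in this thread):

```python
def consolidate_slide(llm, title: str, sections: dict) -> str:
    """Hypothetical helper: turn raw extracted sections into one
    contextually rich paragraph for RAG/contextual retrieval."""
    prompt = (
        f"Slide title: {title}\n"
        f"Extracted sections: {sections}\n"
        "Rewrite these into a single self-contained paragraph that "
        "preserves every fact, so the slide ranks well for retrieval."
    )
    return llm.complete(prompt).text
```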
```python
def _find_likely_title(self, slide) -> str:
    """Find the most likely title from slide shapes using scoring."""
    try:
        candidates = []

        for shape in slide.shapes:
            if (
                hasattr(shape, "text_frame")
                and shape.text_frame
                and shape.text_frame.text.strip()
            ):
                # Score based on position and formatting
                score = 0

                # Position score (higher = closer to top)
                if hasattr(shape, "top"):
                    score += (10000 - shape.top) / 1000

                # Size score
                if hasattr(shape, "height"):
                    score += shape.height / 100

                # Text length (prefer shorter for titles)
                text = shape.text_frame.text.strip()
                if len(text) < 100:
                    score += 50 - len(text) / 2

                candidates.append((text, score))

        if candidates:
            # Return highest scoring candidate
            candidates.sort(key=lambda x: x[1], reverse=True)
            return candidates[0][0]

        return ""
    except Exception:
        return ""
```
The scoring seems a little arbitrary - did you consider any more solid alternatives?
This scoring algorithm is based on three fundamental characteristics that consistently define titles in presentation design.
First, titles are positioned at the top of slides - this is a universal design convention, so we score shapes higher when they're closer to the top of the slide. Second, titles use larger fonts than body text to create visual hierarchy, so we give higher scores to text shapes with greater height (indicating larger font size). Third, titles are concise while body text tends to be verbose - titles rarely exceed a few words, so we favor shorter text content under 100 characters.
The algorithm deliberately prioritizes position over font size, and font size over text length. This hierarchical weighting ensures that a large-font text block at the bottom of a slide won't incorrectly score higher than a properly-positioned title at the top. By giving position the highest weight, we maintain the fundamental principle that titles belong at the top, while still considering typography and content characteristics as secondary factors.
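To make the weighting concrete, here is a small self-contained sketch of the scoring formula from the diff above, run on two invented shapes (the numbers are illustrative only; real python-pptx shapes report `top` and `height` in EMUs, so real magnitudes would differ):

```python
# Hypothetical shapes as (text, top, height); values invented for illustration.
shapes = [
    ("Quarterly Results", 500, 4000),                    # short text near the top
    ("Detailed commentary on revenue...", 6000, 2000),   # longer body text lower down
]

for text, top, height in shapes:
    score = 0.0
    score += (10000 - top) / 1000    # position: closer to the top scores higher
    score += height / 100            # size: taller shapes score higher
    if len(text) < 100:
        score += 50 - len(text) / 2  # brevity: shorter text scores higher
    print(f"{text!r}: {score:.1f}")

# 'Quarterly Results' scores 91.0 vs 57.5 for the body text, so the short,
# high-on-the-slide shape wins the title pick.
```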
Hi @AstraBert, did you get a chance to review my comments?
| "feature_extractor": feature_extractor, | ||
| "model": model, | ||
| "tokenizer": tokenizer, | ||
| if not result["success"]: |
Should we be raising an error here instead? Or?
Yeah, we can raise an error here.
I was more of the opinion not to have it fail: if people are reading files in batches, I wouldn't want to block parsing for all the files just because extraction fails for one of them.
Let me know what you think?
yea fair enough 🤔 Maybe a constructor option in that case to satisfy both cases?
I’ll make the update
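A minimal sketch of the compromise being agreed on, assuming a `raise_on_error` constructor flag (the class and parsing logic below are illustrative stand-ins, not the PR's actual code; only the flag name matches the later commit):

```python
from typing import List


class BatchReaderSketch:
    """Illustrative only: fail fast or skip-and-continue, chosen at construction."""

    def __init__(self, raise_on_error: bool = False):
        self.raise_on_error = raise_on_error

    def _parse_file(self, path: str) -> List[str]:
        # Stand-in for real extraction; pretend one file is corrupt.
        if "corrupt" in path:
            raise ValueError(f"cannot parse {path}")
        return [f"document from {path}"]

    def load_data(self, file_paths: List[str]) -> List[str]:
        documents = []
        for path in file_paths:
            try:
                documents.extend(self._parse_file(path))
            except Exception as exc:
                if self.raise_on_error:
                    raise  # fail fast: first error aborts the whole batch
                print(f"Skipping {path}: {exc}")  # keep parsing the rest
        return documents


docs = BatchReaderSketch(raise_on_error=False).load_data(
    ["a.pptx", "corrupt.pptx", "b.pptx"]
)  # -> two documents, one skip notice
```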
```python
Document(
    text=slide["content"],
    metadata=metadata,
    excluded_embed_metadata_keys=list(
```
probably we should set excluded_llm_metadata_keys as well?
Yeah, honestly I'm in two minds about it.
If we add a param for setting excluded_llm_metadata_keys, users would need to know exactly what the added metadata keys are called - do we really want to expose the internal implementation so they know which keys get added to the metadata?
Also, I just feel these keys don't hold much significance for embedding; the doc content itself should be enough.
Let me know what you think, and I can make that change accordingly.
I'm not really sure why adding excluded_llm_metadata_keys would be exposing internal implementation to users?
We are already excluding the metadata from the embed model; let's also exclude it from the LLM (this metadata will not help an LLM answer a question during RAG, but it will still be there for users to inspect).
I misunderstood it as adding excluded_llm_metadata_keys as a constructor param that lets people define which metadata keys to exclude.
Yeah, I'll set excluded_llm_metadata_keys as well.
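The agreed change roughly amounts to mirroring the embed exclusions for the LLM. A sketch, assuming the structural section keys discussed below (the payload values are invented, not the reader's literal output):

```python
from llama_index.core import Document

# Illustrative slide payload; the real reader builds this from python-pptx shapes.
metadata = {
    "file_path": "deck.pptx",
    "page_label": 1,
    "tables": [],
    "charts": [],
    "notes": "",
    "images": [],
    "text_sections": [],
}
structural_keys = ["tables", "charts", "notes", "images", "text_sections"]

doc = Document(
    text="Q3 revenue grew 12% year over year.",
    metadata=metadata,
    # Keep structural metadata out of both the embedding input and the LLM
    # prompt; it stays on the Document for users to inspect.
    excluded_embed_metadata_keys=structural_keys,
    excluded_llm_metadata_keys=structural_keys,
)
```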
| "charts": slide.get("charts", []), | ||
| "notes": slide.get("notes", ""), | ||
| "images": slide.get("images", []), | ||
| "text_sections": slide.get("text_sections", []), |
is this different from the slide content?
Yes - slide content is basically a summary of all these sections (tables, charts, notes, images, text_sections) condensed into a paragraph.
In case someone wants to inspect what was parsed and the overall structure of the slide, I thought to include these sections individually in the metadata.
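To illustrate the distinction with invented values (not the reader's literal schema): the condensed paragraph becomes the document text, while the raw sections stay inspectable in metadata.

```python
# Invented example of one parsed slide: consolidated summary vs. raw sections.
slide = {
    # condensed paragraph used as the Document text for embedding/RAG
    "content": "Q3 results slide with a revenue table and an upward trend chart.",
    # raw sections kept in metadata so users can inspect what was parsed
    "tables": [[["Region", "Revenue"], ["EMEA", "4.2M"]]],
    "charts": [{"title": "Revenue trend", "type": "line"}],
    "notes": "Call out the EMEA growth when presenting.",
    "images": [],
    "text_sections": ["Q3 Results", "Revenue grew 12% YoY"],
}
```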
```python
metadata = {
    "file_path": str(file),
    "page_label": i,
    "slide_number": i,
```
why have page_label and slide_number with the same value?
page_label makes it consistent with the other readers, like the PDF reader, since they use page_label as the key for storing the page number.
slide_number keeps the key more relevant for PPTs.
If you feel we should keep only one, I would prefer page_label - what do you think?
Let's keep it as page_label for now.
Yup, updated to keep it as page_label and removed the slide_number key.
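One practical upshot, sketched with dicts standing in for Document objects (file names invented): pagination logic downstream can stay reader-agnostic.

```python
# The shared page_label key works the same for PDFs and PPTs.
docs = [
    {"text": "...", "metadata": {"file_path": "report.pdf", "page_label": 3}},
    {"text": "...", "metadata": {"file_path": "deck.pptx", "page_label": 3}},
]
for doc in docs:
    meta = doc["metadata"]
    print(f"{meta['file_path']} p.{meta['page_label']}")
```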
| "has_tables": len(slide.get("tables", [])) > 0, | ||
| "has_charts": len(slide.get("charts", [])) > 0, | ||
| "has_notes": bool(slide.get("notes", "")), | ||
| "has_images": len(slide.get("images", [])) > 0, |
I would argue these "has_*" fields aren't needed? Users can already look at the metadata to see if it's present?
I added these has_* fields since they make the metadata easier to read: people can just check the boolean fields to see what's present and what's not.
Note that by default, even if tables are not present, the tables key in the metadata is still an empty array - it's not as if the key itself would be missing.
I'm okay dropping these has_* fields from the metadata; let me know what you think.
Let's drop them; I think it's just duplicating information and making the metadata a larger payload than it needs to be.
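With the has_* flags dropped, presence checks read directly off the metadata, since empty containers are falsy in Python (illustrative dict standing in for Document.metadata):

```python
metadata = {"tables": [], "charts": [{"title": "Revenue trend"}], "notes": ""}

has_tables = bool(metadata["tables"])   # False: empty list is falsy
has_charts = bool(metadata["charts"])   # True
has_notes = bool(metadata["notes"])     # False: empty string is falsy
print(has_tables, has_charts, has_notes)
```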
- Removed unnecessary metadata fields from slide extraction, including slide_number, has_tables, has_charts, has_notes, has_images, and content.
- Enhanced the _validate_file method to return the presentation object for reuse, improving efficiency.
- Updated tests to reflect changes in metadata structure, ensuring accurate validation of slide content and properties.
- Improved error messaging for missing dependencies in ImageExtractor.
…ovements
- Added `raise_on_error` parameter to `PptxReader` to control error handling during file parsing.
- Updated title detection logic in `SlideContentExtractor` to improve accuracy by considering text position and size.
- Enhanced tests to validate the new `raise_on_error` functionality and title detection edge cases.
Hi @logan-markewich
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Refactored and enhanced the PPTX reader to parse charts, tables, slide notes, and text from PPTs, with options to extract image captions and consolidate slide data with an LLM for better context.
Fixes #18847
New Package?
Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package?

Version Bump?
Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package)

Type of Change
Please delete options that are not relevant.

How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

Suggested Checklist:
- I ran `uv run make format; uv run make lint` to appease the lint gods