
Conversation

@Jash271 Jash271 commented Jul 20, 2025

Description


Refactored and enhanced the PptxReader to parse charts, tables, slide notes, and text from PPTs, with options to extract image captions and to consolidate slide data with an LLM for better context.

Fixes #18847

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

… and image handling

- Added `ImageExtractor` and `SlideContentExtractor` for improved image captioning and structured content extraction from slides.
- Updated `PptxReader` to support multithreaded processing and enhanced validation.
- Introduced new dependencies: `python-pptx` and `tenacity` for image processing and retry logic.
- Created a comprehensive test PowerPoint presentation and corresponding tests to validate extraction capabilities.

Jash271 commented Jul 20, 2025

Additional Testing Evidence

Sample PPT Under Test

comprehensive_test_presentation.pptx

Script to Run Basic Parsing (Default Case)

#!/usr/bin/env python3
"""
Basic PPT Reader Test - Uses comprehensive_test_presentation.pptx and outputs to JSON.
"""
import json
from pathlib import Path
from datetime import datetime

from llama_index.readers.file.slides import PptxReader


def document_to_dict(doc):
    """Convert a Document object to a dictionary for JSON serialization."""
    return {
        "text": doc.text,
        "metadata": doc.metadata,
        "doc_id": str(doc.doc_id) if hasattr(doc, 'doc_id') else None,
    }


def main():
    # Use the existing test presentation
    ppt_file = "comprehensive_test_presentation.pptx"
    ppt_path = Path(ppt_file)
    
    if not ppt_path.exists():
        print(f"Error: {ppt_file} not found in current directory")
        return
    
    print(f"Processing: {ppt_file}")
    
    # Initialize reader
    reader = PptxReader(
        extract_images=False,
        context_consolidation_with_llm=False,
        num_workers=4,
        batch_size=10
    )
    
    # Process the file
    start_time = datetime.now()
    documents = reader.load_data(ppt_file)
    processing_time = (datetime.now() - start_time).total_seconds()
    
    print(f"Processed {len(documents)} slides in {processing_time:.2f} seconds")
    
    # Create output data
    output_data = {
        "processing_info": {
            "source_file": ppt_file,
            "timestamp": datetime.now().isoformat(),
            "processing_time_seconds": processing_time,
            "total_slides": len(documents)
        },
        "documents": [document_to_dict(doc) for doc in documents]
    }
    
    # Save to JSON
    output_file = f"ppt_reader_output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(output_data, f, indent=2, ensure_ascii=False)
    
    print(f"Output saved to: {output_file}")


if __name__ == "__main__":
    main() 

Final Output:
ppt_reader_output_20250720_012521.json

Output if you set `extract_images=True`:
ppt_reader_output_20250720_162546.json


Script to Test Consolidation with LLM

#!/usr/bin/env python3
"""
PPT Reader Test with LiteLLM and Databricks - Tests LLM content consolidation.
"""

import json
import argparse
import os
from pathlib import Path
from datetime import datetime

from llama_index.readers.file.slides import PptxReader
from llama_index.core import Settings


def get_os_env(key: str) -> str:
    """Get environment variable or raise error if not found."""
    value = os.getenv(key)
    if not value:
        raise ValueError(f"Environment variable {key} not found")
    return value


def document_to_dict(doc):
    """Convert a Document object to a dictionary for JSON serialization."""
    return {
        "text": doc.text,
        "metadata": doc.metadata,
        "doc_id": str(doc.doc_id) if hasattr(doc, 'doc_id') else None,
    }


def main():
    parser = argparse.ArgumentParser(description="Test PPT Reader with LiteLLM and Databricks")
    parser.add_argument("--api-base", required=True, help="LLM API base URL")
    parser.add_argument("--api-key", required=True, help="LLM API key") 
    parser.add_argument("--model", default="openai/gpt-41", help="Completion model (default: openai/gpt-41)")
    
    args = parser.parse_args()
    
    # Use the existing test presentation
    ppt_file = "comprehensive_test_presentation.pptx"
    ppt_path = Path(ppt_file)
    
    if not ppt_path.exists():
        print(f"Error: {ppt_file} not found in current directory")
        return
    
    print(f"Processing: {ppt_file}")
    print(f"API Base: {args.api_base}")
    print(f"Model: {args.model}")
    
    # Set up LiteLLM
    try:
        from llama_index.llms.litellm import LiteLLM
        
        Settings.llm = LiteLLM(
            model=args.model,
            api_base=args.api_base,
            api_key=args.api_key,
        )
        
        print("✅ LiteLLM configured successfully")
        
        # Test LLM with a simple completion
        print("🧪 Testing LLM connection...")
        try:
            test_response = Settings.llm.complete("Hello, this is a test.")
            print(f"✅ LLM test successful: {test_response.text[:100]}...")
        except Exception as e:
            print(f"❌ LLM test failed: {e}")
            return
        
    except ImportError:
        print("Error: LiteLLM not available. Install with: pip install llama-index-llms-litellm")
        return
    except Exception as e:
        print(f"Error configuring LiteLLM: {e}")
        return
    
    # Initialize reader with LLM consolidation enabled
    reader = PptxReader(
        extract_images=False,
        context_consolidation_with_llm=True,  # Enable LLM consolidation
        llm=None,  # Will use Settings.llm
        num_workers=4,
        batch_size=10
    )
    
    # Process the file
    print("🔄 Processing with LLM content consolidation...")
    start_time = datetime.now()
    
    try:
        documents = reader.load_data(ppt_file)
        processing_time = (datetime.now() - start_time).total_seconds()
        
        print(f"✅ Processed {len(documents)} slides in {processing_time:.2f} seconds")
        
        # Debug: Check if content is actually being extracted
        print("\n🔍 Debug Info:")
        print(f"Reader LLM: {reader.llm}")
        print(f"LLM consolidation enabled: {reader.context_consolidation_with_llm}")
        
        if documents:
            first_doc = documents[0]
            print(f"First document text length: {len(first_doc.text)}")
            print(f"First document has extraction errors: {first_doc.metadata.get('extraction_errors', [])}")
            if len(first_doc.text) < 100:
                print(f"First document text: '{first_doc.text}'")
        else:
            print("❌ No documents returned!")
        
        # Create output data
        output_data = {
            "processing_info": {
                "source_file": ppt_file,
                "timestamp": datetime.now().isoformat(),
                "processing_time_seconds": processing_time,
                "total_slides": len(documents),
                "llm_consolidation_enabled": True,
                "model": args.model,
                "api_base": args.api_base
            },
            "documents": [document_to_dict(doc) for doc in documents]
        }
        
        # Save to JSON
        output_file = f"ppt_reader_llm_output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(output_data, f, indent=2, ensure_ascii=False)
        
        print(f"💾 Output saved to: {output_file}")
        
        # Show some sample consolidated content
        print("\n📋 Sample consolidated content from first slide:")
        if documents:
            sample_text = documents[0].text[:500] + "..." if len(documents[0].text) > 500 else documents[0].text
            print(f"   {sample_text}")
        
    except Exception as e:
        print(f"❌ Error during processing: {e}")
        return


if __name__ == "__main__":
    main() 

Final Output:
ppt_reader_llm_output_20250720_012855.json

@Jash271 Jash271 marked this pull request as ready for review July 20, 2025 05:14
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jul 20, 2025
- Added basic and advanced usage examples for the PptxReader, demonstrating text, table, chart, and speaker notes extraction.
- Included options for image extraction, LLM content synthesis, parallel processing, and batch size configuration.
Member

@AstraBert AstraBert left a comment


This is a great refactoring of the PptxReader, including some changes that I would describe as breaking.
If we decide to merge this, it will need a major bump :)

# Advanced usage - control parsing behavior
parser = PptxReader(
    extract_images=True,  # Enable image captioning
    context_consolidation_with_llm=True,  # Use LLM for content synthesis
Member


context or content?

Contributor Author

@Jash271 Jash271 Jul 21, 2025


Context. We extract the content anyway (with or without the LLM); where the LLM comes into play is making the extracted content contextually rich, which helps improve ranking for RAG/contextual retrieval processes.

That said, there is no hard preference, we can rename it

Comment on lines 144 to 180
def _find_likely_title(self, slide) -> str:
    """Find the most likely title from slide shapes using scoring."""
    try:
        candidates = []

        for shape in slide.shapes:
            if (
                hasattr(shape, "text_frame")
                and shape.text_frame
                and shape.text_frame.text.strip()
            ):
                # Score based on position and formatting
                score = 0

                # Position score (higher = closer to top)
                if hasattr(shape, "top"):
                    score += (10000 - shape.top) / 1000

                # Size score
                if hasattr(shape, "height"):
                    score += shape.height / 100

                # Text length (prefer shorter for titles)
                text = shape.text_frame.text.strip()
                if len(text) < 100:
                    score += 50 - len(text) / 2

                candidates.append((text, score))

        if candidates:
            # Return highest scoring candidate
            candidates.sort(key=lambda x: x[1], reverse=True)
            return candidates[0][0]

        return ""
    except Exception:
        return ""
Member


The scoring seems a little bit arbitrary; did you consider any more solid alternatives?

Contributor Author


This scoring algorithm is based on three fundamental characteristics that consistently define titles in presentation design.
First, titles are positioned at the top of slides - this is a universal design convention, so we score shapes higher when they're closer to the top of the slide. Second, titles use larger fonts than body text to create visual hierarchy, so we give higher scores to text shapes with greater height (indicating larger font size). Third, titles are concise while body text tends to be verbose - titles rarely exceed a few words, so we favor shorter text content under 100 characters.
The algorithm deliberately prioritizes position over font size, and font size over text length. This hierarchical weighting ensures that a large-font text block at the bottom of a slide won't incorrectly score higher than a properly-positioned title at the top. By giving position the highest weight, we maintain the fundamental principle that titles belong at the top, while still considering typography and content characteristics as secondary factors.

Contributor Author


Hi @AstraBert
Did you get a chance to review my comments?

"feature_extractor": feature_extractor,
"model": model,
"tokenizer": tokenizer,
if not result["success"]:
Collaborator


should we be raising an error here instead? or?

Contributor Author


Yeah, we can raise an error here.
I was more of the opinion not to have it fail: if people are reading files in batches, I wouldn't want to block parsing for all the files when extraction for one of them fails.
Let me know what you think?

Collaborator


yea fair enough 🤔 Maybe a constructor option in that case to satisfy both cases?

Contributor Author


I’ll make the update
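The constructor-option idea agreed above can be sketched like this; the flag name `raise_on_error` matches the later commit message, but the surrounding class and its `_parse_file` helper are simplified stand-ins, not the merged implementation:

```python
class PptxReaderSketch:
    """Simplified stand-in showing per-file error handling via a flag."""

    def __init__(self, raise_on_error: bool = False):
        # False (default): skip files that fail to parse, so one bad deck
        # does not block a whole batch. True: fail fast on the first error.
        self.raise_on_error = raise_on_error

    def load_data(self, files):
        documents = []
        for file in files:
            try:
                documents.extend(self._parse_file(file))
            except Exception:
                if self.raise_on_error:
                    raise
                # Otherwise skip this file and continue with the batch.
        return documents

    def _parse_file(self, file):
        # Hypothetical parser: treat ".bad" files as unparseable.
        if file.endswith(".bad"):
            raise ValueError(f"cannot parse {file}")
        return [f"doc-from-{file}"]
```

With `raise_on_error=False`, a batch containing one broken file still yields documents from the rest; with `raise_on_error=True`, the same call surfaces the error immediately, satisfying both use cases from this thread.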

Document(
    text=slide["content"],
    metadata=metadata,
    excluded_embed_metadata_keys=list(
Collaborator


probably we should set excluded_llm_metadata_keys as well?

Contributor Author


Yeah, I am in two minds about it honestly.
If we add a param for setting excluded_llm_metadata_keys, users need to know exactly what the keys added to the metadata are called; do we really want users to depend on the internal implementation to know which keys get added to the metadata?

Also, I just feel these keys don't hold much significance for embedding; the doc content itself should be enough.

Let me know what you think, I can make that change accordingly.

Collaborator


I'm not really sure why adding excluded_llm_metadata_keys would be exposing internal implementation to users?

We are already excluding the metadata from the embed model, so let's also exclude it from the LLM (this metadata will not help an LLM answer a question during RAG, but it will still be there for users to inspect).

Contributor Author

@Jash271 Jash271 Jul 25, 2025


I misunderstood it as adding an excluded_llm_metadata_keys param to the constructor, letting people define which metadata keys to exclude.
Yeah, I'll set excluded_llm_metadata_keys as well.
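A minimal sketch of the agreed change, mirroring the structural keys into both exclusion lists. The `Document` below is a self-contained stand-in for `llama_index.core.schema.Document`: the two `excluded_*` field names are the real ones, but the class itself is simplified for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    # Stand-in for llama_index.core.schema.Document with just the
    # fields relevant to this discussion.
    text: str
    metadata: dict
    excluded_embed_metadata_keys: list = field(default_factory=list)
    excluded_llm_metadata_keys: list = field(default_factory=list)


# Structural metadata stays available for users to inspect, but is
# hidden from both the embedding model and the LLM during RAG.
STRUCTURAL_KEYS = ["tables", "charts", "notes", "images", "text_sections"]

doc = Document(
    text="consolidated slide content",
    metadata={"page_label": 1, "tables": [], "charts": [], "notes": "",
              "images": [], "text_sections": []},
    excluded_embed_metadata_keys=list(STRUCTURAL_KEYS),
    excluded_llm_metadata_keys=list(STRUCTURAL_KEYS),
)
```

Note that `page_label` is deliberately left out of both lists, so positional context remains visible to the embed model and the LLM.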

    "charts": slide.get("charts", []),
    "notes": slide.get("notes", ""),
    "images": slide.get("images", []),
    "text_sections": slide.get("text_sections", []),
Collaborator


is this different from the slide content?

Contributor Author


Yes, slide content is basically a summary of all these sections (tables, charts, notes, images, text_sections) condensed into a paragraph.
In case someone wants to inspect just what was parsed and the overall structure of the slide, I thought to include these sections individually in the metadata.

metadata = {
    "file_path": str(file),
    "page_label": i,
    "slide_number": i,
Collaborator


why have page_label and slide_number with the same value?

Contributor Author


page_label makes it consistent with the other readers, like the PDF reader, since they use page_label as the key for storing the page number; slide_number keeps the key more relevant for PPTs.

If you feel we should keep only one, I would prefer page_label in that case. What do you think?

Collaborator


Let's keep it page_label for now.

Contributor Author


Yup, updated to keep it as page_label and removed the slide_number key.

    "has_tables": len(slide.get("tables", [])) > 0,
    "has_charts": len(slide.get("charts", [])) > 0,
    "has_notes": bool(slide.get("notes", "")),
    "has_images": len(slide.get("images", [])) > 0,
Collaborator

@logan-markewich logan-markewich Jul 23, 2025


I would argue these "has_*" fields aren't needed? Users can already look at the metadata to see if it's present.

Contributor Author


I had these has_* fields since they make it easier for people to read the metadata: they can just check these boolean fields to know what's present and what's not.
By default, even if tables are not present, tables would still be an empty array in the metadata; it's not as if the key itself would be missing.

I am okay dropping these has_* fields from the metadata, let me know what you think.

Collaborator


Let's drop it; I think it's just duplicating information and making the metadata a larger payload than it needs to be.
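Dropping the has_* fields loses nothing, since the same booleans can be derived from the remaining metadata whenever needed. A quick illustration with a hypothetical slide's metadata (the "chart_type" value is made up for the example):

```python
# Hypothetical metadata for one slide after the has_* fields are dropped.
metadata = {
    "page_label": 3,
    "tables": [],
    "charts": [{"chart_type": "bar"}],
    "notes": "",
    "images": [],
}

# Each former has_* flag is one truthiness check away.
has_tables = bool(metadata["tables"])  # empty list -> False
has_charts = bool(metadata["charts"])  # one chart present -> True
has_notes = bool(metadata["notes"])    # empty string -> False
```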

Jash271 and others added 6 commits July 23, 2025 13:09
- Removed unnecessary metadata fields from slide extraction, including slide_number, has_tables, has_charts, has_notes, has_images, and content.
- Enhanced the _validate_file method to return the presentation object for reuse, improving efficiency.
- Updated tests to reflect changes in metadata structure, ensuring accurate validation of slide content and properties.
- Improved error messaging for missing dependencies in ImageExtractor.
…ovements

- Added `raise_on_error` parameter to `PptxReader` to control error handling during file parsing.
- Updated title detection logic in `SlideContentExtractor` to improve accuracy by considering text position and size.
- Enhanced tests to validate the new `raise_on_error` functionality and title detection edge cases.

Jash271 commented Jul 25, 2025

Hi @logan-markewich
Made all the changes we discussed, kindly have a look

@logan-markewich logan-markewich merged commit 8cc2925 into run-llama:main Aug 12, 2025
11 checks passed
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Aug 12, 2025


Development

Successfully merging this pull request may close these issues.

[Bug]: Issue parsing PPTs with Images in Background

3 participants