feat: enhance PowerPoint reader with comprehensive content extraction and image handling #19478
Conversation
- Added `ImageExtractor` and `SlideContentExtractor` for improved image captioning and structured content extraction from slides.
- Updated `PptxReader` to support multithreaded processing and enhanced validation.
- Introduced new dependencies: `python-pptx` and `tenacity` for image processing and retry logic.
- Created a comprehensive test PowerPoint presentation and corresponding tests to validate extraction capabilities.
Additional Testing Evidence

Sample PPT Under Test: comprehensive_test_presentation.pptx

Script to Run Basic Parsing (Default Case)

```python
#!/usr/bin/env python3
"""
Basic PPT Reader Test - Uses comprehensive_test_presentation.pptx and outputs to JSON.
"""
import json
from pathlib import Path
from datetime import datetime

from llama_index.readers.file.slides import PptxReader


def document_to_dict(doc):
    """Convert a Document object to a dictionary for JSON serialization."""
    return {
        "text": doc.text,
        "metadata": doc.metadata,
        "doc_id": str(doc.doc_id) if hasattr(doc, "doc_id") else None,
    }


def main():
    # Use the existing test presentation
    ppt_file = "comprehensive_test_presentation.pptx"
    ppt_path = Path(ppt_file)
    if not ppt_path.exists():
        print(f"Error: {ppt_file} not found in current directory")
        return

    print(f"Processing: {ppt_file}")

    # Initialize reader
    reader = PptxReader(
        extract_images=False,
        context_consolidation_with_llm=False,
        num_workers=4,
        batch_size=10,
    )

    # Process the file
    start_time = datetime.now()
    documents = reader.load_data(ppt_file)
    processing_time = (datetime.now() - start_time).total_seconds()
    print(f"Processed {len(documents)} slides in {processing_time:.2f} seconds")

    # Create output data
    output_data = {
        "processing_info": {
            "source_file": ppt_file,
            "timestamp": datetime.now().isoformat(),
            "processing_time_seconds": processing_time,
            "total_slides": len(documents),
        },
        "documents": [document_to_dict(doc) for doc in documents],
    }

    # Save to JSON
    output_file = f"ppt_reader_output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(output_data, f, indent=2, ensure_ascii=False)
    print(f"Output saved to: {output_file}")


if __name__ == "__main__":
    main()
```

Final Output: (screenshot omitted)

Output if you set extract_images=True: (screenshot omitted)

Script to Test Consolidation with LLM

```python
#!/usr/bin/env python3
"""
PPT Reader Test with LiteLLM and Databricks - Tests LLM content consolidation.
"""
import json
import argparse
import os
from pathlib import Path
from datetime import datetime
from llama_index.readers.file.slides import PptxReader
from llama_index.core import Settings
def get_os_env(key: str) -> str:
"""Get environment variable or raise error if not found."""
value = os.getenv(key)
if not value:
raise ValueError(f"Environment variable {key} not found")
return value
def document_to_dict(doc):
"""Convert a Document object to a dictionary for JSON serialization."""
return {
"text": doc.text,
"metadata": doc.metadata,
"doc_id": str(doc.doc_id) if hasattr(doc, 'doc_id') else None,
}
def main():
parser = argparse.ArgumentParser(description="Test PPT Reader with LiteLLM and Databricks")
parser.add_argument("--api-base", required=True, help="LLM API base URL")
parser.add_argument("--api-key", required=True, help="LLM API key")
parser.add_argument("--model", default="openai/gpt-41", help="Completion model (default: databricks-dbrx-instruct)")
args = parser.parse_args()
# Use the existing test presentation
ppt_file = "comprehensive_test_presentation.pptx"
ppt_path = Path(ppt_file)
if not ppt_path.exists():
print(f"Error: {ppt_file} not found in current directory")
return
print(f"Processing: {ppt_file}")
print(f"API Base: {args.api_base}")
print(f"Model: {args.model}")
# Set up LiteLLM
try:
from llama_index.llms.litellm import LiteLLM
Settings.llm = LiteLLM(
model=args.model,
api_base=args.api_base,
api_key=args.api_key,
)
print("✅ LiteLLM configured successfully")
# Test LLM with a simple completion
print("🧪 Testing LLM connection...")
try:
test_response = Settings.llm.complete("Hello, this is a test.")
print(f"✅ LLM test successful: {test_response.text[:100]}...")
except Exception as e:
print(f"❌ LLM test failed: {e}")
return
except ImportError:
print("Error: LiteLLM not available. Install with: pip install llama-index-llms-litellm")
return
except Exception as e:
print(f"Error configuring LiteLLM: {e}")
return
# Initialize reader with LLM consolidation enabled
reader = PptxReader(
extract_images=False,
context_consolidation_with_llm=True, # Enable LLM consolidation
llm=None, # Will use Settings.llm
num_workers=4,
batch_size=10
)
# Process the file
print("🔄 Processing with LLM content consolidation...")
start_time = datetime.now()
try:
documents = reader.load_data(ppt_file)
processing_time = (datetime.now() - start_time).total_seconds()
print(f"✅ Processed {len(documents)} slides in {processing_time:.2f} seconds")
# Debug: Check if content is actually being extracted
print("\n🔍 Debug Info:")
print(f"Reader LLM: {reader.llm}")
print(f"LLM consolidation enabled: {reader.context_consolidation_with_llm}")
if documents:
first_doc = documents[0]
print(f"First document text length: {len(first_doc.text)}")
print(f"First document has extraction errors: {first_doc.metadata.get('extraction_errors', [])}")
if len(first_doc.text) < 100:
print(f"First document text: '{first_doc.text}'")
else:
print("❌ No documents returned!")
# Create output data
output_data = {
"processing_info": {
"source_file": ppt_file,
"timestamp": datetime.now().isoformat(),
"processing_time_seconds": processing_time,
"total_slides": len(documents),
"llm_consolidation_enabled": True,
"model": args.model,
"api_base": args.api_base
},
"documents": [document_to_dict(doc) for doc in documents]
}
# Save to JSON
output_file = f"ppt_reader_llm_output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(output_data, f, indent=2, ensure_ascii=False)
print(f"💾 Output saved to: {output_file}")
# Show some sample consolidated content
print("\n📋 Sample consolidated content from first slide:")
if documents:
sample_text = documents[0].text[:500] + "..." if len(documents[0].text) > 500 else documents[0].text
print(f" {sample_text}")
except Exception as e:
print(f"❌ Error during processing: {e}")
return
if __name__ == "__main__":
main() Final Output: |
- Added basic and advanced usage examples for the PptxReader, demonstrating text, table, chart, and speaker notes extraction.
- Included options for image extraction, LLM content synthesis, parallel processing, and batch size configuration.
This is a great refactoring of the PptxReader - it also includes some changes that I would describe as breaking.
If we decide to merge this, it will need a major version bump :)
```python
# Advanced usage - control parsing behavior
parser = PptxReader(
    extract_images=True,  # Enable image captioning
    context_consolidation_with_llm=True,  # Use LLM for content synthesis
```
context or content?
Context. Since we extract the content anyway (with or without the LLM), the place where the LLM comes into play is making the extracted content contextually rich, which helps improve ranking for RAG/contextual retrieval processes.
That said, there is no hard preference; we can rename it.
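For intuition, consolidation along these lines could look like the sketch below; the prompt wording and the helper name `consolidate_slide` are invented for illustration, not taken from the PR (the `llm.complete(...).text` call matches the LlamaIndex LLM interface used elsewhere in this thread):

```python
def consolidate_slide(llm, title: str, sections: dict) -> str:
    """Hypothetical helper: turn raw extracted sections into one
    contextually rich paragraph for RAG/contextual retrieval."""
    prompt = (
        f"Slide title: {title}\n"
        f"Extracted sections: {sections}\n"
        "Rewrite these into a single self-contained paragraph that "
        "preserves every fact, so the slide ranks well for retrieval."
    )
    return llm.complete(prompt).text
```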
```python
def _find_likely_title(self, slide) -> str:
    """Find the most likely title from slide shapes using scoring."""
    try:
        candidates = []

        for shape in slide.shapes:
            if (
                hasattr(shape, "text_frame")
                and shape.text_frame
                and shape.text_frame.text.strip()
            ):
                # Score based on position and formatting
                score = 0

                # Position score (higher = closer to top)
                if hasattr(shape, "top"):
                    score += (10000 - shape.top) / 1000

                # Size score
                if hasattr(shape, "height"):
                    score += shape.height / 100

                # Text length (prefer shorter for titles)
                text = shape.text_frame.text.strip()
                if len(text) < 100:
                    score += 50 - len(text) / 2

                candidates.append((text, score))

        if candidates:
            # Return highest scoring candidate
            candidates.sort(key=lambda x: x[1], reverse=True)
            return candidates[0][0]

        return ""
    except Exception:
        return ""
```
The scoring seems a little arbitrary - did you consider any more solid alternatives?
This scoring algorithm is based on three fundamental characteristics that consistently define titles in presentation design.
First, titles are positioned at the top of slides - this is a universal design convention, so we score shapes higher when they're closer to the top of the slide. Second, titles use larger fonts than body text to create visual hierarchy, so we give higher scores to text shapes with greater height (indicating larger font size). Third, titles are concise while body text tends to be verbose - titles rarely exceed a few words, so we favor shorter text content under 100 characters.
The algorithm deliberately prioritizes position over font size, and font size over text length. This hierarchical weighting ensures that a large-font text block at the bottom of a slide won't incorrectly score higher than a properly-positioned title at the top. By giving position the highest weight, we maintain the fundamental principle that titles belong at the top, while still considering typography and content characteristics as secondary factors.
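To make the weighting concrete, here is a small self-contained sketch of the scoring formula from the diff above, run on two invented shapes (the numbers are illustrative only; real python-pptx shapes report `top` and `height` in EMUs, so real magnitudes would differ):

```python
# Hypothetical shapes as (text, top, height); values invented for illustration.
shapes = [
    ("Quarterly Results", 500, 4000),                    # short text near the top
    ("Detailed commentary on revenue...", 6000, 2000),   # longer body text lower down
]

for text, top, height in shapes:
    score = 0.0
    score += (10000 - top) / 1000    # position: closer to the top scores higher
    score += height / 100            # size: taller shapes score higher
    if len(text) < 100:
        score += 50 - len(text) / 2  # brevity: shorter text scores higher
    print(f"{text!r}: {score:.1f}")

# 'Quarterly Results' scores 91.0 vs 57.5 for the body text, so the short,
# high-on-the-slide shape wins the title pick.
```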
Hi @AstraBert, did you get a chance to review my comments?
| "feature_extractor": feature_extractor, | ||
| "model": model, | ||
| "tokenizer": tokenizer, | ||
| if not result["success"]: |
Should we be raising an error here instead? Or?
Yeah, we can raise an error here.
I was more of the opinion not to have it fail: if people are reading files in batches, I wouldn't want to block parsing for all the files just because extraction fails for one of them.
Let me know what you think?
yea fair enough 🤔 Maybe a constructor option in that case to satisfy both cases?
I’ll make the update
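A minimal sketch of the compromise being agreed on, assuming a `raise_on_error` constructor flag (the class and parsing logic below are illustrative stand-ins, not the PR's actual code; only the flag name matches the later commit):

```python
from typing import List


class BatchReaderSketch:
    """Illustrative only: fail fast or skip-and-continue, chosen at construction."""

    def __init__(self, raise_on_error: bool = False):
        self.raise_on_error = raise_on_error

    def _parse_file(self, path: str) -> List[str]:
        # Stand-in for real extraction; pretend one file is corrupt.
        if "corrupt" in path:
            raise ValueError(f"cannot parse {path}")
        return [f"document from {path}"]

    def load_data(self, file_paths: List[str]) -> List[str]:
        documents = []
        for path in file_paths:
            try:
                documents.extend(self._parse_file(path))
            except Exception as exc:
                if self.raise_on_error:
                    raise  # fail fast: first error aborts the whole batch
                print(f"Skipping {path}: {exc}")  # keep parsing the rest
        return documents


docs = BatchReaderSketch(raise_on_error=False).load_data(
    ["a.pptx", "corrupt.pptx", "b.pptx"]
)  # -> two documents, one skip notice
```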
```python
Document(
    text=slide["content"],
    metadata=metadata,
    excluded_embed_metadata_keys=list(
```
probably we should set excluded_llm_metadata_keys as well?
Yeah, honestly I'm in two minds about it.
If we add a param for setting excluded_llm_metadata_keys, users would need to know exactly what the added metadata keys are called - do we really want to expose the internal implementation so they know which keys get added to the metadata?
Also, I just feel these keys don't hold much significance for embedding; the doc content itself should be enough.
Let me know what you think, and I can make that change accordingly.
I'm not really sure why adding excluded_llm_metadata_keys would be exposing internal implementation to users?
We are already excluding the metadata from the embed model; let's also exclude it from the LLM (this metadata will not help an LLM answer a question during RAG, but it will still be there for users to inspect).
I misunderstood it as adding excluded_llm_metadata_keys as a constructor param that lets people define which metadata keys to exclude.
Yeah, I'll set excluded_llm_metadata_keys as well.
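The agreed change roughly amounts to mirroring the embed exclusions for the LLM. A sketch, assuming the structural section keys discussed below (the payload values are invented, not the reader's literal output):

```python
from llama_index.core import Document

# Illustrative slide payload; the real reader builds this from python-pptx shapes.
metadata = {
    "file_path": "deck.pptx",
    "page_label": 1,
    "tables": [],
    "charts": [],
    "notes": "",
    "images": [],
    "text_sections": [],
}
structural_keys = ["tables", "charts", "notes", "images", "text_sections"]

doc = Document(
    text="Q3 revenue grew 12% year over year.",
    metadata=metadata,
    # Keep structural metadata out of both the embedding input and the LLM
    # prompt; it stays on the Document for users to inspect.
    excluded_embed_metadata_keys=structural_keys,
    excluded_llm_metadata_keys=structural_keys,
)
```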
| "charts": slide.get("charts", []), | ||
| "notes": slide.get("notes", ""), | ||
| "images": slide.get("images", []), | ||
| "text_sections": slide.get("text_sections", []), |
is this different from the slide content?
Yes - slide content is basically a summary of all these sections (tables, charts, notes, images, text_sections) condensed into a paragraph.
In case someone wants to inspect what was parsed and the overall structure of the slide, I thought to include these sections individually in the metadata.
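To illustrate the distinction with invented values (not the reader's literal schema): the condensed paragraph becomes the document text, while the raw sections stay inspectable in metadata.

```python
# Invented example of one parsed slide: consolidated summary vs. raw sections.
slide = {
    # condensed paragraph used as the Document text for embedding/RAG
    "content": "Q3 results slide with a revenue table and an upward trend chart.",
    # raw sections kept in metadata so users can inspect what was parsed
    "tables": [[["Region", "Revenue"], ["EMEA", "4.2M"]]],
    "charts": [{"title": "Revenue trend", "type": "line"}],
    "notes": "Call out the EMEA growth when presenting.",
    "images": [],
    "text_sections": ["Q3 Results", "Revenue grew 12% YoY"],
}
```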
```python
metadata = {
    "file_path": str(file),
    "page_label": i,
    "slide_number": i,
```
why have page_label and slide_number with the same value?
page_label makes it consistent with the other readers, like the PDF reader, since they use page_label as the key for storing the page number.
slide_number keeps the key more relevant for PPTs.
If you feel we should keep only one, I would prefer page_label - what do you think?
Let's keep it as page_label for now.
Yup, updated to keep it as page_label and removed the slide_number key.
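One practical upshot, sketched with dicts standing in for Document objects (file names invented): pagination logic downstream can stay reader-agnostic.

```python
# The shared page_label key works the same for PDFs and PPTs.
docs = [
    {"text": "...", "metadata": {"file_path": "report.pdf", "page_label": 3}},
    {"text": "...", "metadata": {"file_path": "deck.pptx", "page_label": 3}},
]
for doc in docs:
    meta = doc["metadata"]
    print(f"{meta['file_path']} p.{meta['page_label']}")
```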
| "has_tables": len(slide.get("tables", [])) > 0, | ||
| "has_charts": len(slide.get("charts", [])) > 0, | ||
| "has_notes": bool(slide.get("notes", "")), | ||
| "has_images": len(slide.get("images", [])) > 0, |
I would argue these "has_*" fields aren't needed? Users can already look at the metadata to see if it's present?
I added these has_* fields since they make the metadata easier to read: people can just check the boolean fields to see what's present and what's not.
Note that by default, even if tables are not present, the tables key in the metadata is still an empty array - it's not as if the key itself would be missing.
I'm okay dropping these has_* fields from the metadata; let me know what you think.
Let's drop them; I think it's just duplicating information and making the metadata a larger payload than it needs to be.
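With the has_* flags dropped, presence checks read directly off the metadata, since empty containers are falsy in Python (illustrative dict standing in for Document.metadata):

```python
metadata = {"tables": [], "charts": [{"title": "Revenue trend"}], "notes": ""}

has_tables = bool(metadata["tables"])   # False: empty list is falsy
has_charts = bool(metadata["charts"])   # True
has_notes = bool(metadata["notes"])     # False: empty string is falsy
print(has_tables, has_charts, has_notes)
```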
- Removed unnecessary metadata fields from slide extraction, including slide_number, has_tables, has_charts, has_notes, has_images, and content.
- Enhanced the _validate_file method to return the presentation object for reuse, improving efficiency.
- Updated tests to reflect changes in metadata structure, ensuring accurate validation of slide content and properties.
- Improved error messaging for missing dependencies in ImageExtractor.
…ovements
- Added `raise_on_error` parameter to `PptxReader` to control error handling during file parsing.
- Updated title detection logic in `SlideContentExtractor` to improve accuracy by considering text position and size.
- Enhanced tests to validate the new `raise_on_error` functionality and title detection edge cases.
Hi @logan-markewich
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Refactored and enhanced the PPTX reader to parse charts, tables, slide notes, and text from PPTs, with options to extract image captions and consolidate slide data with an LLM for better context.
Fixes #18847
New Package?
Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package?

Version Bump?
Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package)

Type of Change
Please delete options that are not relevant.

How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

Suggested Checklist:
- I ran `uv run make format; uv run make lint` to appease the lint gods