Commit daad35b

Update Document segmentation
1 parent 90e2116 commit daad35b

14 files changed: +2974 -264 lines changed

README.md

Lines changed: 24 additions & 0 deletions
@@ -315,6 +315,7 @@ DeepCode leverages the **Model Context Protocol (MCP)** standard to seamlessly i
 | **⚡ command-executor** | System Commands | Execute bash/shell commands for environment management |
 | **🧬 code-implementation** | Code Generation Hub | Comprehensive code reproduction with execution and testing |
 | **📚 code-reference-indexer** | Smart Code Search | Intelligent indexing and search of code repositories |
+| **📄 document-segmentation** | Smart Document Analysis | Intelligent document segmentation for large papers and technical documents |

 ##### 🔧 **Legacy Tool Functions** *(for reference)*

@@ -465,6 +466,11 @@ curl -O https://raw.githubusercontent.com/HKUDS/DeepCode/main/mcp_agent.secrets.
 # Edit mcp_agent.config.yaml to set your API keys:
 # - For Brave Search: Set BRAVE_API_KEY: "your_key_here" in brave.env section (line ~28)
 # - For Bocha-MCP: Set BOCHA_API_KEY: "your_key_here" in bocha-mcp.env section (line ~74)
+
+# 📄 Configure document segmentation (optional)
+# Edit mcp_agent.config.yaml to control document processing:
+# - enabled: true/false (whether to use intelligent document segmentation)
+# - size_threshold_chars: 50000 (document size threshold to trigger segmentation)
 ```

 #### 🔧 **Development Installation (From Source)**
@@ -496,6 +502,11 @@ uv pip install -r requirements.txt
 # Edit mcp_agent.config.yaml to set your API keys:
 # - For Brave Search: Set BRAVE_API_KEY: "your_key_here" in brave.env section (line ~28)
 # - For Bocha-MCP: Set BOCHA_API_KEY: "your_key_here" in bocha-mcp.env section (line ~74)
+
+# 📄 Configure document segmentation (optional)
+# Edit mcp_agent.config.yaml to control document processing:
+# - enabled: true/false (whether to use intelligent document segmentation)
+# - size_threshold_chars: 50000 (document size threshold to trigger segmentation)
 ```

 ##### 🐍 **Using Traditional pip**
@@ -517,6 +528,11 @@ pip install -r requirements.txt
 # Edit mcp_agent.config.yaml to set your API keys:
 # - For Brave Search: Set BRAVE_API_KEY: "your_key_here" in brave.env section (line ~28)
 # - For Bocha-MCP: Set BOCHA_API_KEY: "your_key_here" in bocha-mcp.env section (line ~74)
+
+# 📄 Configure document segmentation (optional)
+# Edit mcp_agent.config.yaml to control document processing:
+# - enabled: true/false (whether to use intelligent document segmentation)
+# - size_threshold_chars: 50000 (document size threshold to trigger segmentation)
 ```

 </details>
@@ -699,6 +715,14 @@ python cli/main_cli.py



+### 🆕 **Recent Updates**
+
+#### 📄 **Smart Document Segmentation (v1.2.0)**
+- **Intelligent Processing**: Automatically handles large research papers and technical documents that exceed LLM token limits
+- **Configurable Control**: Toggle segmentation via configuration with size-based thresholds
+- **Semantic Analysis**: Advanced content understanding with algorithm, concept, and formula preservation
+- **Backward Compatibility**: Seamlessly falls back to traditional processing for smaller documents
+
 ### 🚀 **Coming Soon**

 We're continuously enhancing DeepCode with exciting new features:
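
The Recent Updates entry describes splitting large documents so that they fit within LLM token limits, but the hunks shown on this page do not include the segmentation code itself. As a rough illustration only, the sketch below packs blank-line-separated paragraphs into segments under a character budget. The function name `split_into_segments` and the `max_chars` parameter are hypothetical, and this greedy paragraph packing is a stand-in technique, not the semantic analysis that `tools/document_segmentation_server.py` may actually perform.

```python
def split_into_segments(text: str, max_chars: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into segments.

    Hypothetical helper for illustration; not taken from the DeepCode codebase.
    Each segment holds at most max_chars characters, except that a single
    paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    segments: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty chunks produced by repeated blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either the paragraph still fits, or it is the first one in a
            # fresh segment and must be accepted even if oversized.
            current = candidate
        else:
            segments.append(current)
            current = paragraph
    if current:
        segments.append(current)
    return segments
```

Splitting on paragraph boundaries keeps each formula or algorithm listing inside one segment as long as it sits in a single paragraph, which is the simplest way to approximate the "preservation" goal named in the list above.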

mcp_agent.config.yaml

Lines changed: 14 additions & 1 deletion
@@ -2,14 +2,21 @@ $schema: ./schema/mcp-agent.config.schema.json

 # Default search server configuration
 # Options: "brave" or "bocha-mcp"
-default_search_server: "bocha-mcp"
+default_search_server: "brave"

 # Planning mode configuration
 # Options: "segmented" or "traditional"
 # segmented: Breaks down large tasks to avoid token truncation (recommended)
 # traditional: Uses parallel agents but may hit token limits
 planning_mode: "traditional"

+# Document segmentation configuration
+document_segmentation:
+  enabled: false # Whether to use intelligent document segmentation
+  size_threshold_chars: 50000 # Document size threshold (in characters) to trigger segmentation
+  # If document size > threshold and enabled=true, use segmentation workflow
+  # If document size <= threshold or enabled=false, use traditional full-document reading
+
 execution_engine: asyncio
 logger:
   transports: [console, file]
@@ -79,6 +86,12 @@ mcp:
       env:
         PYTHONPATH: "."
         BOCHA_API_KEY: ""
+    document-segmentation:
+      command: "python"
+      args: ["tools/document_segmentation_server.py"]
+      env:
+        PYTHONPATH: "."
+      description: "Document segmentation server - Provides intelligent document analysis and segmented reading to optimize token usage"

 openai:
   # Secrets (API keys, etc.) are stored in an mcp_agent.secrets.yaml file which can be gitignored
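
The comments added under `document_segmentation` state a simple rule: segmentation is used only when `enabled` is true and the document is strictly larger than `size_threshold_chars`; otherwise the traditional full-document read applies. A minimal sketch of that rule follows, assuming the YAML has already been parsed into a plain dict. The names `SegmentationConfig`, `config_from_mapping`, and `should_segment` are hypothetical and are not taken from the DeepCode source.

```python
from dataclasses import dataclass


@dataclass
class SegmentationConfig:
    """Mirrors the two document_segmentation keys in mcp_agent.config.yaml."""
    enabled: bool = False              # default shown in the diff
    size_threshold_chars: int = 50000  # default shown in the diff


def config_from_mapping(raw: dict) -> SegmentationConfig:
    """Build a config from an already-parsed YAML mapping (hypothetical helper).

    Missing keys fall back to the defaults that appear in the diff.
    """
    section = raw.get("document_segmentation") or {}
    return SegmentationConfig(
        enabled=bool(section.get("enabled", False)),
        size_threshold_chars=int(section.get("size_threshold_chars", 50000)),
    )


def should_segment(document_text: str, config: SegmentationConfig) -> bool:
    """Return True only if segmentation is enabled AND size > threshold.

    A document whose length equals the threshold is read in full, matching
    the "size <= threshold" comment in the config file.
    """
    return config.enabled and len(document_text) > config.size_threshold_chars
```

With the committed defaults (`enabled: false`), `should_segment` returns `False` for every document, so the feature stays off until it is switched on in the config.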
