This project is a robust, configurable pipeline for extracting, cleaning, embedding, and clustering web links based on their topical content. It aims to help you automatically organize large collections of URLs (such as bookmarks or research links) into semantically meaningful groups.
```mermaid
flowchart TD
    A[Input List of Weblinks] --> B[Extract Textual Semantic Representation]
    B --> C[Compute Embedding per Weblink]
    C --> D[Cluster Weblinks]
```
```mermaid
graph TB
    User[User] --> UI[User Interface]
    UI --> API[API Layer]
    API --> OE[Organization Engines]
    API --> CE[Content Extraction]
    API --> DC[Data Collection]
    API --> ES[Evaluation System]
    API --> SL[Storage Layer]
    OE --> SL
    CE --> SL
    DC --> SL
    ES --> SL
    subgraph "Organization Engines"
        NLP[NLP-Based]
        LLM[LLM-Based]
        Graph[Graph-Based]
        Multi[Multi-Modal]
        RL[Reinforcement Learning]
    end
```
```mermaid
graph TD
    LP[Link Provider] --> |URLs| PC[Processing Controller]
    PC --> |URLs| FE[Fetch Engine]
    FE --> |Raw Content| PC
    PC --> |Processed Content| DM[Data Manager]
    subgraph "Link Providers"
        TF[Text File]
        BM[Browser Bookmarks]
        HS[History Import]
        Manual[Manual Entry]
    end
    subgraph "Fetch Engines"
        HTTP[HTTP Fetcher]
        Sel[Selenium Renderer]
        PDF[PDF Processor]
        YT[YouTube Extractor]
    end
    TF --> LP
    BM --> LP
    HS --> LP
    Manual --> LP
    HTTP --> FE
    Sel --> FE
    PDF --> FE
    YT --> FE
    DM --> |Store| SL[Storage Layer]
```
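The Processing Controller's job is to route each URL to an appropriate fetch engine. The sketch below is a guess at that dispatch logic, read off the diagram; the routing rules and engine names are assumptions, not the project's actual code.

```python
# Illustrative fetch-engine dispatch; the routing rules are assumptions.
from urllib.parse import urlparse


def pick_fetch_engine(url: str) -> str:
    parsed = urlparse(url)
    host, path = parsed.netloc.lower(), parsed.path.lower()
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube_extractor"
    if path.endswith(".pdf"):
        return "pdf_processor"
    # A Selenium renderer could be chosen here for JavaScript-heavy pages.
    return "http_fetcher"


print(pick_fetch_engine("https://youtu.be/abc123"))        # youtube_extractor
print(pick_fetch_engine("https://example.com/paper.pdf"))  # pdf_processor
```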
```mermaid
graph TD
    CM[Content Manager] --> TEP[Text Extraction Pipeline]
    CM --> MEP[Metadata Extraction Pipeline]
    CM --> VEP[Visual Extraction Pipeline]
    TEP --> |Text Content| SL[Storage Layer]
    MEP --> |Metadata| SL
    VEP --> |Visual Features| SL
    subgraph "Text Extraction"
        HTML[HTML Extractor]
        PDF[PDF Extractor]
        YT[Transcript Extractor]
    end
    subgraph "Metadata Extraction"
        Title[Title Extractor]
        Date[Date Extractor]
        Author[Author Extractor]
        LLM[LLM Metadata]
    end
    subgraph "Visual Extraction"
        SS[Screenshot]
        CV[Computer Vision]
        Layout[Layout Analysis]
    end
    HTML --> TEP
    PDF --> TEP
    YT --> TEP
    Title --> MEP
    Date --> MEP
    Author --> MEP
    LLM --> MEP
    SS --> VEP
    CV --> VEP
    Layout --> VEP
```
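As one concrete slice of the metadata pipeline, a title and author can often be read straight out of an HTML head. This is a minimal BeautifulSoup sketch; the real extractors (and the LLM-based metadata fallback) are more involved.

```python
# Minimal title/author extraction from an HTML head; the selectors are
# illustrative, not the project's extractors.
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Example Page</title>"
    "<meta name='author' content='A. N. Author'></head><body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
title = soup.title.string if soup.title else None
author_tag = soup.find("meta", attrs={"name": "author"})
author = author_tag["content"] if author_tag else None
print(title, author)  # Example Page A. N. Author
```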
```mermaid
graph TD
    OC[Organization Controller] --> NLP[NLP Engine]
    OC --> LLM[LLM Engine]
    OC --> GB[Graph-Based Engine]
    OC --> MM[Multi-Modal Engine]
    OC --> RL[Reinforcement Learning Engine]
    NLP --> |Organization| SL[Storage Layer]
    LLM --> |Organization| SL
    GB --> |Organization| SL
    MM --> |Organization| SL
    RL --> |Organization| SL
    subgraph "NLP Engine"
        Embedding[Embedding Generator]
        Clustering[Clustering]
        TopicModel[Topic Modeling]
    end
    subgraph "LLM Engine"
        ZeroShot[Zero-Shot Classification]
        Taxonomy[Taxonomy Generation]
        Relationship[Relationship Extraction]
    end
    subgraph "Graph Engine"
        LinkAnalysis[Link Analysis]
        Community[Community Detection]
        Centrality[Centrality Measures]
    end
    Embedding --> NLP
    Clustering --> NLP
    TopicModel --> NLP
    ZeroShot --> LLM
    Taxonomy --> LLM
    Relationship --> LLM
    LinkAnalysis --> GB
    Community --> GB
    Centrality --> GB
```
```mermaid
graph TD
    SL[Storage Layer] --> RM[Repository Manager]
    RM --> CR[Content Repository]
    RM --> OR[Organization Repository]
    RM --> MR[Metadata Repository]
    subgraph "Storage Backends"
        SQLite[SQLite]
        JSON[JSON Files]
        Vector[Vector Database]
    end
    SQLite --> CR
    SQLite --> OR
    SQLite --> MR
    JSON --> CR
    JSON --> OR
    JSON --> MR
    Vector --> CR
```
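A minimal sketch of the JSON-file backend feeding the Content Repository; the record schema and method names are assumptions, not the repository manager's actual interface.

```python
# Illustrative JSON-file backend; schema and method names are assumptions.
import json
from pathlib import Path
from typing import Optional


class JsonContentRepository:
    def __init__(self, path: str = "content_repo.json"):
        self.path = Path(path)
        # Load existing records if the file is already on disk.
        self.items = json.loads(self.path.read_text()) if self.path.exists() else {}

    def save(self, item_id: str, record: dict) -> None:
        self.items[item_id] = record
        self.path.write_text(json.dumps(self.items, indent=2))

    def get(self, item_id: str) -> Optional[dict]:
        return self.items.get(item_id)


repo = JsonContentRepository()
repo.save("abc", {"url": "https://example.com", "title": "Example"})
print(repo.get("abc"))
```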
```mermaid
graph LR
    Client[Client] --> |REST/GraphQL| API[API Gateway]
    API --> |Requests| RC[Request Controller]
    RC --> DC[Data Collection API]
    RC --> CE[Content Extraction API]
    RC --> OE[Organization API]
    RC --> EA[Evaluation API]
    subgraph "API Endpoints"
        Ingest[/ingest/]
        Extract[/extract/]
        Organize[/organize/]
        Compare[/compare/]
        Export[/export/]
    end
    Ingest --> DC
    Extract --> CE
    Organize --> OE
    Compare --> EA
    Export --> RC
```
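A hypothetical client call against the /ingest/ endpoint shown above; the host, port, and payload shape are assumptions for illustration, since the diagram only names the endpoints.

```python
# Hypothetical ingest request; the URL and payload shape are assumed.
import requests

resp = requests.post(
    "http://localhost:8000/ingest/",
    json={"urls": ["https://example.com/article"]},
    timeout=30,
)
print(resp.status_code)
```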
```mermaid
classDiagram
    class OrganizationEngine {
        <<interface>>
        +organize(content_items[]) Organization
        +getName() string
        +getDescription() string
        +getParameters() Dict
        +setParameters(params) void
    }
    class ContentExtractor {
        <<interface>>
        +extract(url) ContentItem
        +extractBatch(urls[]) ContentItem[]
        +getSupportedTypes() string[]
    }
    class Organization {
        +id string
        +name string
        +engine string
        +parameters Dict
        +hierarchy Dict
        +getRootCategories() Category[]
        +findItemById(id) ContentItem
        +export(format) string
    }
    class ContentItem {
        +id string
        +url string
        +title string
        +content string
        +metadata Dict
        +embeddings Dict
        +visualFeatures Dict
        +getTextContent() string
        +getSummary() string
    }
    OrganizationEngine ..> Organization
    ContentExtractor ..> ContentItem
    Organization o-- ContentItem
```
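Translated to Python, the diagram's interfaces might look roughly like this; the names follow the diagram, but the concrete signatures are assumptions, not code from the repository.

```python
# Sketch derived from the class diagram above; signatures are assumptions.
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ContentItem:
    """One weblink plus everything extracted from it."""

    def __init__(self, id: str, url: str, title: str = "", content: str = ""):
        self.id, self.url, self.title, self.content = id, url, title, content
        self.metadata: Dict[str, Any] = {}
        self.embeddings: Dict[str, Any] = {}

    def get_text_content(self) -> str:
        return self.content


class Organization:
    """An engine's output: a named hierarchy over content items."""

    def __init__(self, id: str, name: str, engine: str):
        self.id, self.name, self.engine = id, name, engine
        self.hierarchy: Dict[str, Any] = {}


class OrganizationEngine(ABC):
    """Interface implemented by every organization engine."""

    @abstractmethod
    def organize(self, content_items: List[ContentItem]) -> Organization: ...


class ContentExtractor(ABC):
    """Interface implemented by every content extractor."""

    @abstractmethod
    def extract(self, url: str) -> ContentItem: ...
```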
- **Configurable Input:** Reads URLs from a file specified in config.json, and supports domain-based rate limiting and ignore lists (e.g., skip YouTube or problematic PDFs).
- **Robust Content Extraction:** Uses trafilatura to extract the main content from HTML pages (with a fallback to BeautifulSoup) and pdfminer.six to extract text from PDFs (limited to the first 10 pages). Advanced cleaning routines remove common boilerplate and extraneous text; a sketch of this step follows the list.
- **Parallel Processing & Logging:** Extracts text from URLs in parallel using Python's ThreadPoolExecutor, with progress indication. Detailed logs are kept in separate files for general extraction, failures, and summary statistics.
- **Embedding & Clustering:** Computes text embeddings with SentenceTransformers and clusters the links using HDBSCAN and hierarchical clustering methods.
- **Semantic Keyword Extraction:** Generates cluster reports using KeyBERT, extracting meaningful keywords that summarize each cluster's content.
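To make the extraction and parallelism steps concrete, here is a minimal sketch combining trafilatura, the BeautifulSoup fallback, and ThreadPoolExecutor. The helper name and worker count are illustrative, not the repo's actual API.

```python
# Minimal sketch of parallel main-content extraction; the helper name and
# worker count are illustrative. Assumes trafilatura and beautifulsoup4.
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional

import trafilatura
from bs4 import BeautifulSoup


def extract_main_text(url: str) -> Optional[str]:
    """Fetch a page and extract its main content, falling back to BeautifulSoup."""
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    text = trafilatura.extract(downloaded)
    if not text:  # fallback: strip all tags and keep the raw text
        text = BeautifulSoup(downloaded, "html.parser").get_text(" ", strip=True)
    return text


urls = ["https://example.com/article"]  # normally read from the links file
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(extract_main_text, u): u for u in urls}
    for fut in as_completed(futures):
        print(futures[fut], "->", (fut.result() or "")[:80])
```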
- **Read and Clean URLs:** URLs are read from the links file specified in config.json, then cleaned and filtered using get_links.py.
- **Parallel Extraction:** The pipeline fetches each URL, extracts the main content using the advanced extraction routines (trafilatura plus the cleaning functions), and logs successes and failures.
- **Compute Embeddings:** Extracted texts are converted into semantic embeddings using SentenceTransformers.
- **Clustering & Reporting:** Embeddings are clustered using HDBSCAN (and optionally hierarchical clustering), and a detailed cluster report with semantic keywords is generated; see the sketch after this list.
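A condensed sketch of the embedding, clustering, and reporting steps, assuming the sentence-transformers, hdbscan, and keybert packages; the model name and HDBSCAN parameters are illustrative defaults, not the project's configured values.

```python
# Illustrative pipeline tail: embed, cluster, then label each cluster.
from sentence_transformers import SentenceTransformer
import hdbscan
from keybert import KeyBERT

# Stand-in for the cleaned page texts produced by the extraction step.
texts = [f"sample document about topic {i % 3}" for i in range(30)]

# 1. Embed each document.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = model.encode(texts)

# 2. Cluster the embeddings; label -1 marks noise points.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(embeddings)

# 3. Summarize each cluster with KeyBERT keywords.
kw_model = KeyBERT(model)
for label in sorted(set(labels) - {-1}):
    cluster_text = " ".join(t for t, lab in zip(texts, labels) if lab == label)
    keywords = kw_model.extract_keywords(cluster_text, top_n=5)
    print(label, [kw for kw, _ in keywords])
```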
Make sure you have Python 3.8+ installed, then install the dependencies:

```bash
pip install -r requirements.txt
```

Next, edit config.json to set your links file (a text file listing weblinks), the rate-limited domains, and the ignored domains.
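The exact schema is defined by the project's config loader; as a rough illustration, the fields described above might look like the following (every key name here is an assumption):

```python
import json

# Illustrative config.json contents; the key names mirror the features
# described above but are assumptions, not the loader's documented schema.
example_config = {
    "links_file": "links.txt",                   # one URL per line
    "rate_limit_domains": {"example.com": 2.0},  # seconds between requests per domain
    "ignore_domains": ["youtube.com"],           # domains to skip entirely
}
print(json.dumps(example_config, indent=2))
```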
Run the pipeline:

```bash
python main.py
```

- Cluster report: see cluster_report.txt for the grouped links and extracted keywords.
- Logs: review extraction.log, extraction_failures.log, and extraction_stats.log for detailed processing information.