URL Content Extraction
Last Updated: 2026-05-27
When a knowledge item includes a URL, the backend fetches the page and appends readable text before persistence and Notion sync.
Components
| Component | Location | Role |
|---|---|---|
| ContentExtractionService | Service/Content/ContentExtractionService.php | Orchestrates HTTP fetch, content-type detection, SSRF checks |
| HtmlContentExtractor | Service/Content/Extractor/HtmlContentExtractor.php | Extracts main article text from HTML |
| ExtractedContent / PreparedContentData | ValueObject/ | Immutable extraction results |
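As a rough illustration of what "immutable extraction results" can look like in PHP 8.1+, here is a minimal sketch of a value object. The property names (`title`, `text`, `contentType`) and the `withText()` helper are assumptions made for this example; the actual fields of `ExtractedContent` are not shown in this document.

```php
<?php
declare(strict_types=1);

// Hypothetical shape of an extraction result; the real class may differ.
final class ExtractedContent
{
    public function __construct(
        public readonly string $title,
        public readonly string $text,
        public readonly string $contentType,
    ) {
    }

    // Length of the extracted text in bytes.
    public function length(): int
    {
        return strlen($this->text);
    }

    // "Modification" returns a new instance; the original is left unchanged.
    public function withText(string $text): self
    {
        return new self($this->title, $text, $this->contentType);
    }
}
```

Because the properties are `readonly`, assigning to them after construction throws an `Error`, so a result can be passed between services without defensive copying.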
Flow
User submits knowledge with URL
↓
KnowledgeOrchestrator::createKnowledge()
↓
ContentExtractionService::extractContent(url)
↓
HtmlContentExtractor::extract(html)
↓
Content appended to knowledge item
↓
Persist + NotionSyncService::syncKnowledgeToNotion()
Integration
When a URL is present, KnowledgeOrchestrator calls the extraction service to:
- Fetch page HTML with timeouts and size limits
- Extract readable text via the HTML extractor
- Append extracted text to the user's original content
- Store metadata (title, length, content type) in Knowledge.urlMetadata
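The append and metadata steps above could be sketched roughly as follows. The function name `mergeExtractedContent()`, the blank-line separator, and the array keys are assumptions for illustration only; the document does not specify how the orchestrator joins the text or how `Knowledge.urlMetadata` is structured.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: merges extracted page text into the user's content
 * and builds an array of the kind that might be stored in Knowledge.urlMetadata.
 */
function mergeExtractedContent(
    string $original,
    string $title,
    string $text,
    string $contentType
): array {
    $text = trim($text);

    // If nothing readable was extracted, keep the user's content untouched.
    $content = $text === ''
        ? $original
        : rtrim($original) . "\n\n" . $text;

    return [
        'content' => $content,
        'urlMetadata' => [
            'title' => $title,
            'length' => strlen($text), // bytes of extracted text
            'contentType' => $contentType,
        ],
    ];
}
```

The user's original content always comes first, so a failed or empty extraction degrades to saving exactly what the user submitted.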
Security
- SSRF protection validates URLs before fetching
- Request timeouts prevent hanging on slow responses
- Content size limits prevent memory issues
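One common way to implement the SSRF check is to allow only `http`/`https` URLs and reject hosts that resolve to private or reserved IP ranges. The sketch below is an assumption about how such a check might look, not the actual logic in `ContentExtractionService`; the function name `isUrlSafeToFetch()` and the injectable `$resolver` parameter are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical SSRF pre-check. Returns true only for http(s) URLs whose
 * host maps exclusively to public IP addresses.
 *
 * $resolver is an assumption added for testability: it takes a hostname
 * and returns a list of IP strings. It defaults to a DNS lookup.
 */
function isUrlSafeToFetch(string $url, ?callable $resolver = null): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // parse_url() keeps the brackets around IPv6 literals; strip them.
    $host = trim($parts['host'], '[]');

    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        $ips = [$host];
    } else {
        $resolver ??= static fn (string $h): array => gethostbynamel($h) ?: [];
        $ips = $resolver($host);
    }

    if ($ips === []) {
        return false; // unresolvable hosts are rejected
    }

    foreach ($ips as $ip) {
        $public = filter_var(
            $ip,
            FILTER_VALIDATE_IP,
            FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
        );
        if ($public === false) {
            return false; // private, loopback, or link-local address
        }
    }

    return true;
}
```

A check like this is only a first line of defense: the HTTP client resolves the hostname again at fetch time, so a production setup would typically also pin the validated IP or re-validate after redirects.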