URL Content Extraction

Last Updated: 2026-05-27

When a knowledge item includes a URL, the backend fetches the page, extracts its readable text, and appends that text to the item before the item is persisted and synced to Notion.

Components

| Component | Location | Role |
| --- | --- | --- |
| `ContentExtractionService` | `Service/Content/ContentExtractionService.php` | Orchestrates HTTP fetch, content-type detection, SSRF checks |
| `HtmlContentExtractor` | `Service/Content/Extractor/HtmlContentExtractor.php` | Extracts main article text from HTML |
| `ExtractedContent` / `PreparedContentData` | `ValueObject/` | Immutable extraction results |
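
As an illustration of what "immutable extraction results" can mean in PHP 8.1+, here is a minimal sketch of a value object with readonly properties. The class name `ExtractedContentSketch` and its fields are assumptions chosen to match the metadata listed under Integration (title, length, content type). The real `ExtractedContent` may be shaped differently.

```php
<?php
declare(strict_types=1);

// Hypothetical shape for illustration only; not the project's actual class.
final class ExtractedContentSketch
{
    public function __construct(
        public readonly string $text,        // readable text pulled from the page
        public readonly ?string $title,      // page title, if one was found
        public readonly string $contentType, // e.g. "text/html"
    ) {
    }

    // Length of the extracted text in bytes.
    public function length(): int
    {
        return strlen($this->text);
    }
}
```

Because the properties are readonly, any attempt to reassign them after construction throws an `Error`, so a result cannot change once it has been handed to the orchestrator.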

Flow

User submits knowledge with URL
  ↓
KnowledgeOrchestrator::createKnowledge()
  ↓
ContentExtractionService::extractContent(url)
  ↓
HtmlContentExtractor::extract(html)
  ↓
Content appended to knowledge item
  ↓
Persist + NotionSyncService::syncKnowledgeToNotion()
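
The flow above can be sketched as a single function. This is a simplified illustration and does not show the actual `KnowledgeOrchestrator` code. The function name and the three callables are hypothetical stand-ins for `ContentExtractionService`, `HtmlContentExtractor`, and the persist-and-sync step. Leaving the content unchanged when a fetch fails is also an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the flow: fetch, extract, append, then persist and sync.
 * $fetchHtml returns the page HTML or null on failure (assumed behaviour).
 * $extractText turns HTML into readable text.
 * $persistAndSync receives the final content.
 */
function createKnowledgeSketch(
    string $content,
    ?string $url,
    callable $fetchHtml,
    callable $extractText,
    callable $persistAndSync
): string {
    if ($url !== null && $url !== '') {
        $html = $fetchHtml($url);
        if ($html !== null) {
            $text = trim($extractText($html));
            if ($text !== '') {
                // Append extracted text below the user's original content.
                $content = rtrim($content) . "\n\n" . $text;
            }
        }
    }

    // Persistence and Notion sync happen only after extraction has finished.
    $persistAndSync($content);

    return $content;
}
```

Passing the steps in as callables keeps the ordering visible and lets each stage be replaced with a stub in tests.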

Integration

When a URL is present, KnowledgeOrchestrator calls the extraction service, which performs these steps:

Fetch page HTML with timeouts and size limits
Extract readable text via the HTML extractor
Append extracted text to the user's original content
Store metadata (title, length, content type) in Knowledge.urlMetadata
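
The last two steps might look like the sketch below, which appends the extracted text and builds a metadata array alongside it. The function name, the blank-line separator, and the array keys are illustrative assumptions. The source states only that title, length, and content type are stored in `Knowledge.urlMetadata`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: combines user content with extracted text and
 * returns the metadata that would be stored in Knowledge.urlMetadata.
 */
function appendExtractedText(
    string $userContent,
    string $extractedText,
    ?string $title,
    string $contentType
): array {
    $text = trim($extractedText);

    return [
        // The user's original content stays first; extracted text follows it.
        'content' => rtrim($userContent) . "\n\n" . $text,
        'urlMetadata' => [
            'title' => $title,
            'length' => strlen($text), // length in bytes (assumed unit)
            'contentType' => $contentType,
        ],
    ];
}
```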

Security

SSRF protection validates URLs before fetching
Request timeouts prevent hanging on slow responses
Content size limits prevent memory issues
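
To make the SSRF check concrete, here is a minimal sketch of URL validation before fetching. It is an assumption about how such a check could work and does not reproduce `ContentExtractionService`. The function name and the optional `$resolve` hook are hypothetical. The hook exists so the check can be exercised without real DNS lookups.

```php
<?php
declare(strict_types=1);

/**
 * Returns true only if the URL uses http(s) and every resolved address is public.
 * $resolve maps a host name to a list of IP strings. By default it uses
 * gethostbynamel(), which resolves IPv4 addresses only.
 */
function isUrlSafeToFetch(string $url, ?callable $resolve = null): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false; // rejects file://, gopher://, and similar schemes
    }

    $host = trim($parts['host'], '[]'); // parse_url keeps brackets on IPv6 literals
    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        $ips = [$host];
    } else {
        $resolve ??= static fn (string $h): array => gethostbynamel($h) ?: [];
        $ips = $resolve($host);
    }
    if ($ips === []) {
        return false; // host did not resolve
    }

    // Reject private (10/8, 172.16/12, 192.168/16) and reserved
    // (127/8, 169.254/16, ...) ranges.
    $flags = FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE;
    foreach ($ips as $ip) {
        if (filter_var($ip, FILTER_VALIDATE_IP, $flags) === false) {
            return false;
        }
    }

    return true;
}
```

A check like this covers only the initial URL. Redirect targets need the same validation, and the request should connect to the address that was validated so that a second DNS lookup cannot return a different one.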
