URL Content Extraction

Last Updated: 2026-05-27

When a knowledge item includes a URL, the backend fetches the page and appends its readable text to the item before the item is persisted and synced to Notion.

Components

| Component | Location | Role |
| --- | --- | --- |
| ContentExtractionService | Service/Content/ContentExtractionService.php | Orchestrates HTTP fetch, content-type detection, SSRF checks |
| HtmlContentExtractor | Service/Content/Extractor/HtmlContentExtractor.php | Extracts main article text from HTML |
| ExtractedContent / PreparedContentData | ValueObject/ | Immutable extraction results |
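
To illustrate what an immutable extraction result can look like, here is a minimal sketch assuming PHP 8.1 or later. The class name matches the table, but the constructor fields, `length()`, and `withText()` are assumptions made for illustration and may not match the project's actual value object.

```php
<?php
declare(strict_types=1);

// Hypothetical shape: the field names below are assumptions, not the project's API.
final class ExtractedContent
{
    public function __construct(
        public readonly string $url,
        public readonly ?string $title,
        public readonly string $text,
        public readonly string $contentType,
    ) {
    }

    // Length (in bytes) is derived from the text so it cannot drift out of sync.
    public function length(): int
    {
        return strlen($this->text);
    }

    // "Changing" an immutable object means returning a modified copy.
    public function withText(string $text): self
    {
        return new self($this->url, $this->title, $text, $this->contentType);
    }
}
```

Because every property is `readonly`, assigning to one after construction throws an `Error`, so a result cannot be altered between extraction and persistence.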

Flow

User submits knowledge with URL
    ↓
KnowledgeOrchestrator::createKnowledge()
    ↓
ContentExtractionService::extractContent(url)
    ↓
HtmlContentExtractor::extract(html)
    ↓
Content appended to knowledge item
    ↓
Persist + NotionSyncService::syncKnowledgeToNotion()
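
The HTML extraction step in the flow above could be sketched roughly as follows. This is not the project's `HtmlContentExtractor`: the function name `extractReadableText` and its heuristics (drop scripts and navigation, prefer `<article>` or `<main>`, keep only headings and paragraphs) are assumptions chosen for illustration. It relies on PHP's bundled DOM extension and PHP 8.0 or later.

```php
<?php
declare(strict_types=1);

// Hypothetical helper illustrating one way to pull readable text out of HTML.
function extractReadableText(string $html): string
{
    if (trim($html) === '') {
        return ''; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html); // note: assumes ISO-8859-1 unless the page declares a charset
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove elements that rarely carry article text.
    $noise = $xpath->query('//script | //style | //noscript | //nav | //footer | //aside');
    foreach (iterator_to_array($noise) as $node) {
        $node->parentNode?->removeChild($node);
    }

    // Prefer a dedicated content container, falling back to the whole body.
    $root = $xpath->query('//article')->item(0)
        ?? $xpath->query('//main')->item(0)
        ?? $xpath->query('//body')->item(0);
    if ($root === null) {
        return '';
    }

    // Collect headings and paragraphs in document order, normalizing whitespace.
    $blocks = [];
    foreach ($xpath->query('.//h1 | .//h2 | .//h3 | .//p', $root) as $node) {
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent) ?? '');
        if ($text !== '') {
            $blocks[] = $text;
        }
    }

    return implode("\n", $blocks);
}
```

A production extractor would likely need more than this, for example list and table handling, charset detection, and scoring of candidate containers.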

Integration

When a URL is present, KnowledgeOrchestrator calls the extraction service to:

  1. Fetch page HTML with timeouts and size limits
  2. Extract readable text via the HTML extractor
  3. Append extracted text to the user's original content
  4. Store metadata (title, length, content type) in Knowledge.urlMetadata
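
Steps 3 and 4 above can be sketched as a small pure function. The function name `appendExtractedContent`, the `---` separator, and the array keys are assumptions made for illustration; only the metadata fields (title, length, content type) come from the list above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: merges extracted text into the user's content and
 * builds the metadata that would be stored in Knowledge.urlMetadata.
 *
 * @return array{content: string, urlMetadata: array{title: ?string, length: int, contentType: string}}
 */
function appendExtractedContent(
    string $userContent,
    ?string $title,
    string $extractedText,
    string $contentType
): array {
    $extractedText = trim($extractedText);

    // If nothing was extracted, leave the user's content untouched.
    $content = $extractedText === ''
        ? $userContent
        : rtrim($userContent) . "\n\n---\n\n" . $extractedText; // separator is an assumption

    return [
        'content' => $content,
        'urlMetadata' => [
            'title' => $title,
            'length' => strlen($extractedText), // length in bytes
            'contentType' => $contentType,
        ],
    ];
}
```

Keeping the user's original text first and the extracted text after a visible separator makes it easy to tell the two apart later, for instance when the item is rendered in Notion.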

Security

  • SSRF protection validates URLs before fetching
  • Request timeouts prevent hanging on slow responses
  • Content size limits prevent memory issues
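
As a rough illustration of the first and third points, the sketch below shows one possible URL check and one possible size cap. Both function names (`isUrlSafeToFetch`, `readWithLimit`) and the injectable resolver are hypothetical and are not taken from `ContentExtractionService`. The check relies on PHP's `filter_var` flags for private and reserved IP ranges.

```php
<?php
declare(strict_types=1);

// Hypothetical SSRF check: allow only http(s) URLs whose host resolves to public IPs.
function isUrlSafeToFetch(string $url, ?callable $resolver = null): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    $host = trim($parts['host'], '[]'); // strip brackets around IPv6 literals
    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        $ips = [$host];
    } else {
        // Default resolver returns IPv4 addresses only; injectable so it can be tested offline.
        $resolver ??= static fn (string $h): array => gethostbynamel($h) ?: [];
        $ips = $resolver($host);
    }
    if ($ips === []) {
        return false; // unresolvable hosts are rejected
    }

    foreach ($ips as $ip) {
        $public = filter_var(
            $ip,
            FILTER_VALIDATE_IP,
            FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
        );
        if ($public === false) {
            return false; // private, loopback, link-local, or otherwise reserved
        }
    }

    return true;
}

// Hypothetical size cap: read a stream in chunks and abort once the limit is exceeded.
function readWithLimit($stream, int $maxBytes): string
{
    $body = '';
    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if ($chunk === false) {
            throw new RuntimeException('Read failed');
        }
        $body .= $chunk;
        if (strlen($body) > $maxBytes) {
            throw new RuntimeException('Response exceeds size limit');
        }
    }

    return $body;
}
```

A check like this leaves a gap between validation and fetch: the host could resolve differently on the second lookup, and redirects could point somewhere else. A real implementation would probably also validate each redirect target and connect to the already-validated IP. Timeouts can be set with the `timeout` option of PHP's HTTP stream context or with `CURLOPT_TIMEOUT` when using cURL.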