Step 2 — Process: Backend Content Extraction

After Step 1, knowledge items with a URL store only the link — the page content is not captured. Step 2 changes this: the backend automatically fetches, parses, and appends the URL's content to the knowledge item during the save operation.

User Story

Story 2.1 — Automatic URL Content Extraction

As a user, when I save a knowledge item that includes a URL, the system should automatically extract the page content and append it to my knowledge item — so the full context is captured without me having to copy-paste.

Flow:

User saves knowledge with URL "https://web.dev/accessibility" (via Modal or Action)
  → Backend detects URL on incoming request
  → Backend fetches page and strips non-article HTML
  → Extracted text appended to user's original content
  → Knowledge saved with both the user's notes and the page content

Side-by-Side Comparison

Here is exactly what changes for the user's data between Step 1 and Step 2:

| Field       | Step 1 only                                    | After Step 2                                                                 |
|-------------|------------------------------------------------|------------------------------------------------------------------------------|
| url         | "https://web.dev/accessibility"                | "https://web.dev/accessibility" (unchanged)                                  |
| content     | "Check out this resource, relevant for design" | "Check out this resource, relevant for design\n\n---\n\nAccessibility means making your website usable by as many people as possible..." |
| urlMetadata | null                                           | { "extractedTitle": "Learn Accessibility", "extractedLength": 4200, "truncated": true, "contentType": "text/html" } |

Architecture Flow

The extraction runs entirely inside the existing KnowledgeOrchestrator::create() method from Step 1.

1. High-Level Sequence Diagram

2. The Extraction Decision Pipeline

3. Content Merging Rules

The user's original text is never replaced. Extracted content is always appended with a visual separator:

┌─────────────────────────────────┐
│  User's original content        │  ← always preserved
│  (from modal or message action) │
├─────────────────────────────────┤
│          --- (separator)        │  ← only added if both parts exist
├─────────────────────────────────┤
│  Extracted page content         │  ← appended by Step 2
│  (stripped HTML → plain text)   │
│  (max 5000 characters)          │
└─────────────────────────────────┘

Edge Case Handling:

| Scenario                     | Resolved content field                        |
|------------------------------|-----------------------------------------------|
| User text + valid URL        | userText + separator + extractedContent       |
| User text + no URL           | userText (no extraction)                      |
| No user text + valid URL     | extractedContent (no separator needed)        |
| User text + unreachable URL  | userText (extraction gracefully skipped)      |
| User text + PDF URL          | userText (content-type check fails, skipped)  |

Backend Files (3 new, 1 modified)

Domain Services (Layer 4)

| File | Responsibilities | ~LOC |
|------|------------------|------|
| Service/Content/ContentExtractionService.php | Validates the URL (SSRF protection against localhost/private IPs), performs the HTTP GET with a 10 s timeout, verifies the text/html content type, delegates the HTML to the extractor, and returns a result VO. Extends BaseDomainService. | 150 |
| Service/Content/Extractor/HtmlContentExtractor.php | Parses the HTML string, removes <nav>, <footer>, <script>, and <style>, extracts text from <article> (falling back to <main>, then <body>), normalizes whitespace, and truncates to a maximum of 5000 characters. | 80 |
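The core of HtmlContentExtractor.php — strip boilerplate tags, flatten to text, normalize whitespace, truncate at 5000 characters — can be sketched in a few lines. This is a simplified Python illustration (it omits the <article>/<main>/<body> preference order and parses the whole document instead):

```python
from html.parser import HTMLParser
import re

STRIP_TAGS = {"nav", "footer", "script", "style"}
MAX_CHARS = 5000

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything nested inside STRIP_TAGS."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def extract_text(html: str) -> tuple[str, bool]:
    """Return (plain text, truncated?) with boilerplate removed."""
    parser = _TextExtractor()
    parser.feed(html)
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return text[:MAX_CHARS], len(text) > MAX_CHARS
```

The returned truncated flag is what ends up in urlMetadata, while len(text before truncation) would feed extractedLength.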

Value Objects

| File | Properties & Methods | ~LOC |
|------|----------------------|------|
| ValueObject/ContentExtractionResult.php | Immutable result. Properties: success, content, title, type, length, truncated, error. Factory methods: success(), failure(). | 50 |
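A Python sketch of this value object, using a frozen dataclass for immutability. The PHP factory is named success(); here it is renamed ok() only because a method named success would clash with the field of the same name in Python:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # immutable, mirroring the PHP value object
class ContentExtractionResult:
    success: bool
    content: Optional[str] = None
    title: Optional[str] = None
    type: Optional[str] = None
    length: int = 0
    truncated: bool = False
    error: Optional[str] = None

    @classmethod
    def ok(cls, content: str, title: str, type: str, truncated: bool):
        return cls(success=True, content=content, title=title,
                   type=type, length=len(content), truncated=truncated)

    @classmethod
    def failure(cls, error: str):
        return cls(success=False, error=error)
```

The two factories keep construction honest: a successful result always carries content and length, a failed one only carries the error reason.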

Orchestrators (Layer 2) - Modified

| File | Changes Made | ~LOC Added |
|------|--------------|------------|
| Service/Knowledge/KnowledgeOrchestrator.php | Gains a ContentExtractionService dependency. In create(), if a URL is present, fetches the content. On success: appends it to $knowledge->content and populates $knowledge->urlMetadata. On failure: gracefully skips. | +40 |
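The ~40 added lines in create() follow a simple shape: attempt extraction, enrich on success, fall through on failure. A hedged Python sketch (the dict-based knowledge record and the extraction_service interface are stand-ins for the PHP entity and injected service):

```python
SEPARATOR = "\n\n---\n\n"

def create(knowledge: dict, extraction_service) -> dict:
    """Sketch of the Step 2 additions to KnowledgeOrchestrator::create()."""
    if knowledge.get("url"):
        result = extraction_service.extract(knowledge["url"])
        if result.success:
            parts = [p for p in (knowledge.get("content"), result.content) if p]
            knowledge["content"] = SEPARATOR.join(parts)
            knowledge["urlMetadata"] = {
                "extractedTitle": result.title,
                "extractedLength": result.length,
                "truncated": result.truncated,
                "contentType": result.type,
            }
        # on failure: log and fall through -- the save proceeds unchanged
    return knowledge  # persisted by the existing Step 1 repository code
```

Note that the failure branch does nothing at all to the record, which is exactly what makes the graceful-skip guarantee in the warning below hold.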

WARNING

Extraction failure never blocks the save. If the URL is unreachable, returns non-HTML content, or times out, the knowledge item is still persisted with the user's original content unchanged. The failure is logged but invisible to the user.

Full Sequence Diagram

Slack Files (0)

Step 2 introduces zero new Slack files and zero new UI elements. It is purely a backend enrichment pipeline attached to the existing POST /api/knowledge endpoint.

Verification Checklist

Content Extraction

  • [ ] Save knowledge with a valid URL → content contains user text + separator + extracted page text
  • [ ] Save knowledge with a valid URL but no user text → content = extracted text only (no separator)
  • [ ] Check urlMetadata via GET /api/knowledge → contains extractedTitle, extractedLength, truncated, contentType

Graceful Failures

  • [ ] Save knowledge without a URL → content unchanged, no extraction attempted
  • [ ] Save knowledge with an unreachable URL → item saved with user content only, no error shown
  • [ ] Save knowledge with a non-HTML URL (e.g., PDF) → item saved with user content only
  • [ ] Save knowledge with a private IP URL (e.g., http://localhost) → SSRF blocked, item saved with user content only
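The SSRF check exercised by the last item can be sketched as follows — a minimal Python illustration of the validation described for ContentExtractionService.php (resolve the hostname, reject anything loopback, private, link-local, or reserved; a production guard should additionally pin the resolved IP for the actual request to defend against DNS rebinding):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Reject URLs that resolve to loopback/private/link-local ranges."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable hostname: treat as unsafe
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Checking every resolved address (not just the first) matters: a hostname with mixed public and private A records must still be rejected.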

Step 2 LOC Summary

| Component                  | Files        | ~LOC |
|----------------------------|--------------|------|
| Domain Services            | 2            | 230  |
| Value Objects              | 1            | 50   |
| Orchestrator Modifications | (1 existing) | 40   |
| Slack changes              | 0            | 0    |
| Total                      | 3 new        | ~320 |