Step 2 — Process: Backend Content Extraction

After Step 1, knowledge items with a URL store only the link — the page content is not captured. Step 2 changes this: the backend automatically fetches, parses, and appends the URL's content to the knowledge item during the save operation.

User Story

Story 2.1 — Automatic URL Content Extraction

As a user, when I save a knowledge item that includes a URL, the system should automatically extract the page content and append it to my knowledge item — so the full context is captured without me having to copy-paste.

Flow:

User saves knowledge with URL "https://web.dev/accessibility" (via Modal or Action)
  → Backend detects URL on incoming request
  → Backend fetches page and strips non-article HTML
  → Extracted text appended to user's original content
  → Knowledge saved with both the user's notes and the page content

Side-by-Side Comparison

Here is exactly what changes for the user's data between Step 1 and Step 2:

| Field       | Step 1 only                                    | After Step 2                                                                 |
|-------------|------------------------------------------------|------------------------------------------------------------------------------|
| url         | "https://web.dev/accessibility"                | "https://web.dev/accessibility" (unchanged)                                  |
| content     | "Check out this resource, relevant for design" | "Check out this resource, relevant for design\n\n---\n\nAccessibility means making your website usable by as many people as possible..." |
| urlMetadata | null                                           | { "extractedTitle": "Learn Accessibility", "extractedLength": 4200, "truncated": true, "contentType": "text/html" } |

Architecture Flow

The extraction runs entirely inside the existing KnowledgeOrchestrator::create() method from Step 1.

1. High-Level Sequence Diagram

2. The Extraction Decision Pipeline

3. Content Merging Rules

The user's original text is never replaced. Extracted content is always appended with a visual separator:

┌─────────────────────────────────┐
│  User's original content        │  ← always preserved
│  (from modal or message action) │
├─────────────────────────────────┤
│          --- (separator)        │  ← only added if both parts exist
├─────────────────────────────────┤
│  Extracted page content         │  ← appended by Step 2
│  (stripped HTML → plain text)   │
│  (max 5000 characters)          │
└─────────────────────────────────┘

Edge Case Handling:

| Scenario                     | Resolved content field                        |
|------------------------------|-----------------------------------------------|
| User text + valid URL        | userText + separator + extractedContent       |
| User text + no URL           | userText (no extraction)                      |
| No user text + valid URL     | extractedContent (no separator needed)        |
| User text + unreachable URL  | userText (extraction gracefully skipped)      |
| User text + PDF URL          | userText (content-type check fails, skipped)  |

Backend Files (3 new, 1 modified)

Domain Services (Layer 4)

| File | Responsibilities | ~LOC |
|------|------------------|------|
| Service/Content/ContentExtractionService.php | Validates the URL (SSRF protection against localhost/private IPs), performs the HTTP GET with a 10 s timeout, verifies the text/html content type, delegates the HTML to the extractor, and returns a result VO. Extends BaseDomainService. | 150 |
| Service/Content/Extractor/HtmlContentExtractor.php | Parses the HTML string, removes <nav>, <footer>, <script>, and <style>, extracts text from <article> (falling back to <main>, then <body>), normalizes whitespace, and truncates to a maximum of 5000 characters. | 80 |
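The core of HtmlContentExtractor.php — strip boilerplate tags, flatten to text, normalize whitespace, truncate at 5000 characters — can be sketched in a few lines. This is a simplified Python illustration (it omits the <article>/<main>/<body> preference order and parses the whole document instead):

```python
from html.parser import HTMLParser
import re

STRIP_TAGS = {"nav", "footer", "script", "style"}
MAX_CHARS = 5000

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything nested inside STRIP_TAGS."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def extract_text(html: str) -> tuple[str, bool]:
    """Return (plain text, truncated?) with boilerplate removed."""
    parser = _TextExtractor()
    parser.feed(html)
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return text[:MAX_CHARS], len(text) > MAX_CHARS
```

The returned truncated flag is what ends up in urlMetadata, while len(text before truncation) would feed extractedLength.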

Value Objects

| File | Properties & Methods | ~LOC |
|------|----------------------|------|
| ValueObject/ContentExtractionResult.php | Immutable result. Properties: success, content, title, type, length, truncated, error. Factory methods: success(), failure(). | 50 |
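A Python sketch of this value object, using a frozen dataclass for immutability. The PHP factory is named success(); here it is renamed ok() only because a method named success would clash with the field of the same name in Python:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # immutable, mirroring the PHP value object
class ContentExtractionResult:
    success: bool
    content: Optional[str] = None
    title: Optional[str] = None
    type: Optional[str] = None
    length: int = 0
    truncated: bool = False
    error: Optional[str] = None

    @classmethod
    def ok(cls, content: str, title: str, type: str, truncated: bool):
        return cls(success=True, content=content, title=title,
                   type=type, length=len(content), truncated=truncated)

    @classmethod
    def failure(cls, error: str):
        return cls(success=False, error=error)
```

The two factories keep construction honest: a successful result always carries content and length, a failed one only carries the error reason.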

Orchestrators (Layer 2) - Modified

| File | Changes Made | ~LOC Added |
|------|--------------|------------|
| Service/Knowledge/KnowledgeOrchestrator.php | Gains a ContentExtractionService dependency. In create(), if a URL is present, fetches the content. On success: appends it to $knowledge->content and populates $knowledge->urlMetadata. On failure: gracefully skips. | +40 |
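The ~40 added lines in create() follow a simple shape: attempt extraction, enrich on success, fall through on failure. A hedged Python sketch (the dict-based knowledge record and the extraction_service interface are stand-ins for the PHP entity and injected service):

```python
SEPARATOR = "\n\n---\n\n"

def create(knowledge: dict, extraction_service) -> dict:
    """Sketch of the Step 2 additions to KnowledgeOrchestrator::create()."""
    if knowledge.get("url"):
        result = extraction_service.extract(knowledge["url"])
        if result.success:
            parts = [p for p in (knowledge.get("content"), result.content) if p]
            knowledge["content"] = SEPARATOR.join(parts)
            knowledge["urlMetadata"] = {
                "extractedTitle": result.title,
                "extractedLength": result.length,
                "truncated": result.truncated,
                "contentType": result.type,
            }
        # on failure: log and fall through -- the save proceeds unchanged
    return knowledge  # persisted by the existing Step 1 repository code
```

Note that the failure branch does nothing at all to the record, which is exactly what makes the graceful-skip guarantee in the warning below hold.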

WARNING

Extraction failure never blocks the save. If the URL is unreachable, returns non-HTML content, or times out, the knowledge item is still persisted with the user's original content unchanged. The failure is logged but invisible to the user.

Full Sequence Diagram

Slack Files (0)

Step 2 introduces zero new Slack files and zero new UI elements. It is purely a backend enrichment pipeline attached to the existing POST /api/knowledge endpoint.

Verification Checklist

Content Extraction

  • [ ] Save knowledge with a valid URL → content contains user text + separator + extracted page text
  • [ ] Save knowledge with a valid URL but no user text → content = extracted text only (no separator)
  • [ ] Check urlMetadata via GET /api/knowledge → contains extractedTitle, extractedLength, truncated, contentType

Graceful Failures

  • [ ] Save knowledge without a URL → content unchanged, no extraction attempted
  • [ ] Save knowledge with an unreachable URL → item saved with user content only, no error shown
  • [ ] Save knowledge with a non-HTML URL (e.g., PDF) → item saved with user content only
  • [ ] Save knowledge with a private IP URL (e.g., http://localhost) → SSRF blocked, item saved with user content only
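The SSRF check exercised by the last item can be sketched as follows — a minimal Python illustration of the validation described for ContentExtractionService.php (resolve the hostname, reject anything loopback, private, link-local, or reserved; a production guard should additionally pin the resolved IP for the actual request to defend against DNS rebinding):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Reject URLs that resolve to loopback/private/link-local ranges."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable hostname: treat as unsafe
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Checking every resolved address (not just the first) matters: a hostname with mixed public and private A records must still be rejected.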

Step 2 LOC Summary

| Component                  | Files        | ~LOC |
|----------------------------|--------------|------|
| Domain Services            | 2            | 230  |
| Value Objects              | 1            | 50   |
| Orchestrator Modifications | (1 existing) | 40   |
| Slack changes              | 0            | 0    |
| Total                      | 3 new        | ~320 |