Step 2 — Process: Backend Content Extraction
After Step 1, knowledge items with a URL store only the link — the page content is not captured. Step 2 changes this: the backend automatically fetches, parses, and appends the URL's content to the knowledge item during the save operation.
User Story
Story 2.1 — Automatic URL Content Extraction
As a user, when I save a knowledge item that includes a URL, the system should automatically extract the page content and append it to my knowledge item — so the full context is captured without me having to copy-paste.
Flow:
User saves knowledge with URL "https://web.dev/accessibility" (via Modal or Action)
→ Backend detects URL on incoming request
→ Backend fetches page and strips non-article HTML
→ Extracted text appended to user's original content
→ Knowledge saved with both the user's notes and the page content
Side-by-Side Comparison
Here is exactly what changes for the user's data between Step 1 and Step 2:
| Field | Step 1 only | After Step 2 |
|---|---|---|
| url | "https://web.dev/accessibility" | "https://web.dev/accessibility" (unchanged) |
| content | "Check out this resource, relevant for design" | "Check out this resource, relevant for design\n\n---\n\nAccessibility means making your website usable by as many people as possible..." |
| urlMetadata | null | { "extractedTitle": "Learn Accessibility", "extractedLength": 4200, "truncated": true, "contentType": "text/html" } |
Architecture Flow
The extraction runs entirely inside the existing KnowledgeOrchestrator::create() method from Step 1.
1. High-Level Sequence Diagram
2. The Extraction Decision Pipeline
3. Content Merging Rules
The user's original text is never replaced. Extracted content is always appended with a visual separator:
┌─────────────────────────────────┐
│ User's original content │ ← always preserved
│ (from modal or message action) │
├─────────────────────────────────┤
│ --- (separator) │ ← only added if both parts exist
├─────────────────────────────────┤
│ Extracted page content │ ← appended by Step 2
│ (stripped HTML → plain text) │
│ (max 5000 characters) │
└─────────────────────────────────┘
Edge Case Handling:
| Scenario | Resolved Content field |
|---|---|
| User text + valid URL | userText + separator + extractedContent |
| User text + no URL | userText (no extraction) |
| No user text + valid URL | extractedContent (no separator needed) |
| User text + unreachable URL | userText (extraction gracefully skipped) |
| User text + PDF URL | userText (content-type check fails, skipped) |
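The merging rules and edge cases above can be sketched as a small pure function. This is an illustrative Python sketch, not the project's PHP code: `merge_content` and `SEPARATOR` are hypothetical names, and the separator string is assumed from the comparison table.

```python
from typing import Optional

# Assumed from the comparison table: blank line, horizontal rule, blank line.
SEPARATOR = "\n\n---\n\n"


def merge_content(user_text: Optional[str], extracted: Optional[str]) -> str:
    """Append extracted page text to the user's text without altering it."""
    has_user = bool(user_text and user_text.strip())
    has_extracted = bool(extracted and extracted.strip())

    if has_user and has_extracted:
        # Both parts exist, so the separator is inserted between them.
        return user_text + SEPARATOR + extracted
    if has_user:
        # No URL, or extraction was skipped: the user's text is stored as-is.
        return user_text
    if has_extracted:
        # No user text: the page content stands alone, no separator.
        return extracted
    return user_text or ""
```

Keeping this logic free of I/O means every row of the edge-case table can be checked with a plain unit test, without touching the HTTP layer.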
Backend Files (3 new, 1 modified)
Domain Services (Layer 4)
| File | Responsibilities | ~LOC |
|---|---|---|
| Service/Content/ContentExtractionService.php | Validates URL (SSRF protection against localhost/private IPs). Performs HTTP GET with 10s timeout. Verifies text/html content type. Delegates HTML to extractor. Returns result VO. Extends BaseDomainService. | 150 |
| Service/Content/Extractor/HtmlContentExtractor.php | Parses HTML string. Removes `<nav>`, `<footer>`, `<script>`, `<style>`. Extracts from `<article>` → `<main>` → `<body>`. Normalizes whitespace and truncates to max 5000 chars. | 80 |
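The extractor's responsibilities listed above can be illustrated with a minimal sketch. It uses Python's standard `html.parser` rather than whatever PHP DOM library the real `HtmlContentExtractor` is built on. The names `extract_text` and `_TextCollector` are hypothetical, and returning an empty string when none of the three containers is present is an assumption.

```python
import re
from html.parser import HTMLParser

MAX_CHARS = 5000                              # limit stated in the spec
SKIPPED = {"nav", "footer", "script", "style"}
CONTAINERS = ("article", "main", "body")      # preference order


class _TextCollector(HTMLParser):
    """Collects visible text per container while ignoring skipped tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0
        self.open_count = {name: 0 for name in CONTAINERS}
        self.chunks = {name: [] for name in CONTAINERS}

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in self.open_count:
            self.open_count[tag] += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.open_count and self.open_count[tag] > 0:
            self.open_count[tag] -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # inside <nav>, <footer>, <script>, or <style>
        for name in CONTAINERS:
            if self.open_count[name]:
                self.chunks[name].append(data)


def extract_text(html, max_chars=MAX_CHARS):
    """Return (text, truncated) from the first non-empty container."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    for name in CONTAINERS:
        # Join with spaces so adjacent block elements do not fuse together,
        # then collapse all whitespace runs into single spaces.
        text = re.sub(r"\s+", " ", " ".join(collector.chunks[name])).strip()
        if text:
            return text[:max_chars], len(text) > max_chars
    return "", False
```

Joining chunks with a space is a trade-off: it keeps paragraphs apart but can split a word that contains inline markup (for example `Hel<b>lo</b>`). A DOM-based implementation would not have this limitation.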
Value Objects
| File | Properties & Methods | ~LOC |
|---|---|---|
| ValueObject/ContentExtractionResult.php | Immutable result. Properties: success, content, title, type, length, truncated, error. Factory methods: success(), failure(). | 50 |
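As a rough illustration of the value object's shape, here is a Python sketch using a frozen dataclass. The property list comes from the table above. The factory methods are named `ok()` and `fail()` here only because Python cannot have a field and a method both called `success`, so this naming is an artifact of the sketch, not of the PHP class. The parameter lists of the factories are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen=True makes instances immutable
class ContentExtractionResult:
    success: bool
    content: Optional[str] = None
    title: Optional[str] = None
    type: Optional[str] = None      # content type, e.g. "text/html"
    length: int = 0
    truncated: bool = False
    error: Optional[str] = None

    @classmethod
    def ok(cls, content, title, type, truncated):
        """Stands in for the success() factory described in the spec."""
        return cls(success=True, content=content, title=title, type=type,
                   length=len(content), truncated=truncated)

    @classmethod
    def fail(cls, error):
        """Stands in for the failure() factory described in the spec."""
        return cls(success=False, error=error)
```

Routing all construction through the two factories keeps invalid combinations, such as a failed result that carries content, out of the rest of the code.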
Orchestrators (Layer 2) - Modified
| File | Changes Made | ~LOC Add |
|---|---|---|
| Service/Knowledge/KnowledgeOrchestrator.php | Gains ContentExtractionService dependency. In create(), if URL exists, fetches content. On success: appends to $knowledge->content and populates $knowledge->urlMetadata. On failure: gracefully skips. | +40 |
WARNING
Extraction failure never blocks the save. If the URL is unreachable, returns non-HTML, or times out, the knowledge item is persisted with the user's original content unchanged. The failure is logged but invisible to the user.
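The "never blocks the save" rule can be sketched as follows. This is an illustrative Python outline, not the actual `KnowledgeOrchestrator::create()` implementation. The `Knowledge` class, the `create_knowledge` function, and the shape of the `extract` callable (a dict with `text`, `title`, `truncated`, and `contentType` keys) are all assumptions made for the example. The metadata keys match the comparison table.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("knowledge")


@dataclass
class Knowledge:
    content: str
    url: Optional[str] = None
    url_metadata: Optional[dict] = None


def create_knowledge(content: str, url: Optional[str],
                     extract: Callable[[str], dict]) -> Knowledge:
    knowledge = Knowledge(content=content, url=url)
    if url:
        try:
            # Hypothetical extractor: may raise on timeout, non-HTML
            # responses, unreachable hosts, or SSRF-blocked addresses.
            page = extract(url)
        except Exception as exc:
            # Logged for operators, never surfaced to the user.
            logger.warning("Content extraction skipped for %s: %s", url, exc)
        else:
            text = page["text"]
            if content.strip():
                knowledge.content = content + "\n\n---\n\n" + text
            else:
                knowledge.content = text
            knowledge.url_metadata = {
                "extractedTitle": page["title"],
                "extractedLength": len(text),
                "truncated": page["truncated"],
                "contentType": page["contentType"],
            }
    # Persistence would happen here, on both the success and failure paths.
    return knowledge
```

The `try`/`else` structure keeps the merge code out of the guarded block, so only extraction errors are swallowed. A bug in the merge step would still surface normally.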
Full Sequence Diagram
Slack Files (0)
Step 2 introduces zero new Slack files and zero new UI elements. It is purely a backend enrichment pipeline attached to the existing POST /api/knowledge endpoint.
Verification Checklist
Content Extraction
- [ ] Save knowledge with a valid URL → `content` contains user text + separator + extracted page text
- [ ] Save knowledge with a valid URL but no user text → `content` = extracted text only (no separator)
- [ ] Check `urlMetadata` via `GET /api/knowledge` → contains `extractedTitle`, `extractedLength`, `truncated`, `contentType`
Graceful Failures
- [ ] Save knowledge without a URL → content unchanged, no extraction attempted
- [ ] Save knowledge with an unreachable URL → item saved with user content only, no error shown
- [ ] Save knowledge with a non-HTML URL (e.g., PDF) → item saved with user content only
- [ ] Save knowledge with a private IP URL (e.g., `http://localhost`) → SSRF blocked, item saved with user content only
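The SSRF check exercised by the last item could look roughly like the sketch below. It is a Python illustration under stated assumptions, not the validation code in `ContentExtractionService`. `is_url_allowed` is a hypothetical name, the injectable `resolve` parameter exists only to make the function testable without network access, and the exact set of blocked address classes is an assumption beyond the "localhost/private IPs" mentioned in the spec.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_url_allowed(url, resolve=socket.getaddrinfo):
    """Return True only for http(s) URLs whose host resolves to public IPs."""
    try:
        parsed = urlparse(url)
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
    except ValueError:
        return False  # malformed host or port
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, port)
    except OSError:
        return False  # DNS failure: treat as not allowed
    if not infos:
        return False
    for info in infos:
        # sockaddr is the fifth element; its first item is the IP string.
        raw = info[4][0].split("%", 1)[0]  # drop any IPv6 scope id
        address = ipaddress.ip_address(raw)
        mapped = getattr(address, "ipv4_mapped", None)
        if mapped is not None:
            address = mapped  # unwrap ::ffff:a.b.c.d before checking
        if (address.is_private or address.is_loopback
                or address.is_link_local or address.is_reserved
                or address.is_multicast or address.is_unspecified):
            return False
    return True
```

A check like this only covers the address resolved at validation time. If the HTTP client resolves the hostname again or follows redirects, those requests need the same check, otherwise DNS rebinding or a redirect to an internal address can bypass it.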
Step 2 LOC Summary
| Component | Files | ~LOC |
|---|---|---|
| Domain Services | 2 | 230 |
| Value Objects | 1 | 50 |
| Orchestrator Modifications | (1 existing) | 40 |
| Slack changes | 0 | 0 |
| Total | 3 new | ~320 |