URL Content Extraction Feature

Overview

The URL Content Extraction feature automatically extracts and processes content from URLs added to the knowledge database. This enriched content is then used by the AI summary generation system to create more accurate and comprehensive summaries.

Architecture

Components

  1. ContentExtractionService (App\Service\Content\ContentExtractionService)

    • Main orchestrator for content extraction
    • Handles HTTP requests, content type detection, and routing
    • Implements SSRF protection and security measures
  2. HtmlContentExtractor (App\Service\Content\Extractor\HtmlContentExtractor)

    • Extracts readable content from HTML pages
    • Removes navigation, ads, footers, and other non-content elements
    • Uses readability algorithms to find main article content
  3. PdfContentExtractor (App\Service\Content\Extractor\PdfContentExtractor)

    • Extracts text from PDF documents
    • Handles multi-page documents (up to 50 pages)
    • Preserves document structure and metadata
  4. ContentType Enum (App\Enum\ContentType)

    • Defines supported content types (HTML, PDF, JSON, Markdown, Plain Text)
    • Provides content type detection from URLs and headers
  5. ContentExtractionResult (App\ValueObject\ContentExtractionResult)

    • Immutable value object representing extraction results
    • Contains content, metadata, errors, and truncation status
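
As an illustration of the value object described above, the following is a minimal sketch of what ContentExtractionResult might look like. The named constructors (`success`, `failure`) and exact property set beyond `content`, `error`, `length`, and `truncated` are assumptions for this sketch, not the production class.

```php
// Sketch of an immutable extraction result; constructor shape is assumed.
final class ContentExtractionResult
{
    private function __construct(
        public readonly ?string $content,
        public readonly ?string $error,
        public readonly bool $truncated,
        public readonly int $length,
    ) {}

    public static function success(string $content, bool $truncated = false): self
    {
        return new self($content, null, $truncated, mb_strlen($content));
    }

    public static function failure(string $error): self
    {
        return new self(null, $error, false, 0);
    }

    public function isSuccess(): bool
    {
        return $this->error === null;
    }

    public function isFailure(): bool
    {
        return !$this->isSuccess();
    }
}
```

Because the object is immutable, a result can be cached or passed between services without defensive copying.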

Integration Points

  • PromptBuilderService: Uses extracted content as the {{url_content}} variable in AI prompts
  • Knowledge Entity: Stores URL and can cache extracted content
  • AI Summary Generation: Enriches prompts with full article/document content

Features

Content Type Support

| Type | Status | Description |
|------|--------|-------------|
| HTML | ✅ Full | Extracts main article content, removes navigation/ads |
| PDF | ✅ Full | Extracts text from PDFs up to 20MB, 50 pages |
| Plain Text | ✅ Full | Simple truncation and formatting |
| JSON | ⚠️ Partial | Basic support, can be extended |
| Markdown | ⚠️ Partial | Treated as plain text |
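
A hedged sketch of how the ContentType enum might map a Content-Type header to one of the supported types. The case names and the `fromHeader` helper are assumptions; the real enum may detect types from URLs as well.

```php
// Illustrative enum; case names and fromHeader() are assumed for this sketch.
enum ContentType: string
{
    case Html = 'html';
    case Pdf = 'pdf';
    case Json = 'json';
    case Markdown = 'markdown';
    case PlainText = 'plain_text';

    public static function fromHeader(string $contentType): self
    {
        // Strip parameters such as "; charset=utf-8" and normalize case.
        $mime = strtolower(trim(explode(';', $contentType)[0]));

        return match ($mime) {
            'text/html', 'application/xhtml+xml' => self::Html,
            'application/pdf' => self::Pdf,
            'application/json' => self::Json,
            'text/markdown' => self::Markdown,
            default => self::PlainText,
        };
    }
}
```

Unknown MIME types fall back to plain text, matching the "Simple truncation and formatting" behavior in the table above.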

Security Features

  • SSRF Protection: Blocks private IP ranges (localhost, 192.168.x.x, 10.x.x.x, etc.)
  • Size Limits: Maximum 20MB per URL to prevent memory issues
  • Timeout Protection: 30-second timeout for HTTP requests
  • URL Validation: Strict URL format validation
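
The SSRF check can be sketched with PHP's built-in IP filters, which reject private (10.x, 172.16-31.x, 192.168.x) and reserved ranges. This is a simplified stand-in for the service's actual protection, and the function name is hypothetical.

```php
// Hypothetical helper sketching the SSRF block list described above.
function isBlockedHost(string $host): bool
{
    // Resolve hostnames to an IP; literal IPs pass through unchanged.
    $ip = filter_var($host, FILTER_VALIDATE_IP) ? $host : gethostbyname($host);

    // FILTER_FLAG_NO_PRIV_RANGE rejects private ranges (10/8, 172.16/12,
    // 192.168/16); FILTER_FLAG_NO_RES_RANGE rejects reserved ranges.
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) === false;
}
```

Note that a robust implementation must also re-check the resolved IP on every redirect, since an attacker can bounce a public URL to an internal address.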

Content Processing

  1. HTML Extraction:

    • Identifies main content area using semantic selectors
    • Removes <script>, <style>, <nav>, <footer>, <aside> tags
    • Removes ads and social sharing widgets
    • Cleans whitespace and normalizes line breaks
    • Truncates to specified length (default 5000 chars)
  2. PDF Extraction:

    • Parses PDF structure using smalot/pdfparser
    • Extracts text from all pages (up to 50)
    • Removes page numbers and headers/footers
    • Preserves paragraph structure
    • Extracts metadata (title, author, subject)
  3. Truncation:

    • Breaks on sentence boundaries when possible
    • Falls back to word boundaries
    • Adds "..." indicator when truncated
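
The truncation steps above can be sketched as follows. This is a minimal, assumed helper illustrating the sentence-then-word fallback, not the production implementation.

```php
// Sketch of sentence-aware truncation (assumed helper name).
function truncateContent(string $text, int $maxLength): string
{
    if (mb_strlen($text) <= $maxLength) {
        return $text;
    }

    $cut = mb_substr($text, 0, $maxLength);

    // Prefer the last sentence boundary, if it isn't too early in the text.
    $sentence = max(mb_strrpos($cut, '. '), mb_strrpos($cut, '! '), mb_strrpos($cut, '? '));
    if ($sentence !== false && $sentence > $maxLength * 0.5) {
        return mb_substr($cut, 0, $sentence + 1) . '...';
    }

    // Otherwise fall back to the last word boundary.
    $word = mb_strrpos($cut, ' ');

    return rtrim($word !== false ? mb_substr($cut, 0, $word) : $cut) . '...';
}
```

The half-length threshold prevents a lone early period (e.g. in an abbreviation) from discarding most of the allowed content.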

Usage

From PHP Code

php
use App\Service\Content\ContentExtractionService;

// Inject via dependency injection
public function __construct(
    private readonly ContentExtractionService $contentExtractor,
) {}

// Extract content
$result = $this->contentExtractor->extractContent(
    url: 'https://example.com/article',
    maxLength: 5000
);

if ($result->isSuccess()) {
    echo "Content: " . $result->content;
    echo "Type: " . $result->type->value;
    echo "Length: " . $result->length;
    echo "Truncated: " . ($result->truncated ? 'Yes' : 'No');
} else {
    echo "Error: " . $result->error;
}

In AI Prompts

The extracted content is automatically available in prompt templates:

handlebars
You are summarizing the following knowledge item:

Title: {{title}}
Category: {{category}}
URL: {{url}}

User-provided content:
{{content}}

{{#if url_content}}
Full article/document content:
{{url_content}}
{{/if}}

Please create a summary for {{role_name}}.

From Slack

When a user adds a URL to the knowledge database via Slack:

  1. URL is detected in the modal
  2. Metadata is extracted (title, description, image)
  3. When AI summaries are generated, full content is extracted
  4. Content is appended to the prompt for better context

Testing

Unit Tests

bash
# Run all content extraction tests
./vendor/bin/phpunit tests/Unit/Service/Content/

# Run specific test
./vendor/bin/phpunit tests/Unit/Service/Content/HtmlContentExtractorTest.php

Manual Testing

bash
# Test with real URLs
php backend/scripts/test_url_extraction.php

Test Results

Tested with 14 different URLs:

  • ✅ 5/5 Anthropic PDF system cards
  • ✅ 1/1 ArXiv HTML papers
  • ✅ 1/1 Google DeepMind blog posts
  • ✅ 1/1 Kimi blog posts
  • ⚠️ 0/5 OpenAI URLs (403 bot protection)
  • ⚠️ 0/1 ResearchGate (403 bot protection)

Success Rate: 57% (8/14)

Note: 403 errors are expected for sites with aggressive bot protection. These can be handled with headless browsers if needed.

Configuration

Environment Variables

env
# Optional: Override default timeout (seconds)
URL_EXTRACTION_TIMEOUT=30

# Optional: Override max content length (characters)
URL_EXTRACTION_MAX_LENGTH=5000

# Optional: Override max file size (bytes)
URL_EXTRACTION_MAX_SIZE=20971520  # 20MB
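
In plain PHP, these variables could be read with their documented defaults roughly as below; in practice the Symfony container likely injects them instead, so treat this as an assumption-laden sketch.

```php
// Read overrides with the documented defaults (sketch only; the app
// presumably binds these via Symfony parameters rather than getenv()).
$timeout   = (int) (getenv('URL_EXTRACTION_TIMEOUT') ?: 30);        // seconds
$maxLength = (int) (getenv('URL_EXTRACTION_MAX_LENGTH') ?: 5000);   // characters
$maxSize   = (int) (getenv('URL_EXTRACTION_MAX_SIZE') ?: 20971520); // bytes (20MB)
```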

Service Configuration

Services are auto-configured via config/services/content.yaml:

yaml
services:
    App\Service\Content\ContentExtractionService:
        arguments:
            $logger: '@logger'
            $httpClient: '@http_client'
            $htmlExtractor: '@App\Service\Content\Extractor\HtmlContentExtractor'
            $pdfExtractor: '@App\Service\Content\Extractor\PdfContentExtractor'

Performance

Benchmarks

| Content Type | Avg Duration | Max Size | Notes |
|--------------|--------------|----------|-------|
| HTML (small) | 100-500ms | 1MB | Fast, minimal processing |
| HTML (large) | 500-2000ms | 5MB | Depends on page complexity |
| PDF (small) | 1000-3000ms | 5MB | 10-20 pages |
| PDF (large) | 3000-7000ms | 20MB | 50+ pages |

Optimization Tips

  1. Caching: Consider caching extracted content in Knowledge entity
  2. Async Processing: Extract content in background job for large files
  3. CDN: Use CDN for frequently accessed URLs
  4. Rate Limiting: Implement rate limiting for external requests
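
Tip 1 can be sketched as a small in-memory caching decorator. The class name and callable-based design are assumptions for illustration; in the application this would wrap ContentExtractionService and ideally persist results on the Knowledge entity.

```php
// Hypothetical caching decorator for tip 1 (names are assumptions).
final class CachingExtractor
{
    /** @var array<string, string> */
    private array $cache = [];

    public function __construct(private readonly \Closure $inner) {}

    public function extract(string $url): string
    {
        // Reuse a prior result for the same URL instead of re-fetching.
        return $this->cache[$url] ??= ($this->inner)($url);
    }
}
```

A persistent cache would also need an expiry policy, since article content can change after the first extraction.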

Error Handling

All errors are handled gracefully and logged:

php
// Extraction never throws exceptions
$result = $contentExtractor->extractContent($url);

if ($result->isFailure()) {
    // Log error but continue with AI summary using user-provided content only
    $logger->warning('URL extraction failed', [
        'url' => $url,
        'error' => $result->error,
    ]);
}

Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| HTTP 403 | Bot protection | Use headless browser or accept limitation |
| HTTP 404 | URL not found | Validate URL before extraction |
| Timeout | Slow server | Increase timeout or skip extraction |
| Content too large | File > 20MB | Increase limit or skip extraction |
| No readable content | JavaScript-rendered page | Use headless browser |
| Invalid URL | Malformed URL | Validate URL format |

Future Enhancements

Planned Features

  1. JavaScript Rendering: Use Puppeteer/Playwright for JS-heavy sites
  2. Content Caching: Cache extracted content in database
  3. Async Processing: Background job queue for large files
  4. More Content Types: Support for Word docs, Excel, etc.
  5. Language Detection: Detect and translate non-Dutch content
  6. Image Extraction: Extract and analyze images from articles
  7. Video Transcription: Extract transcripts from YouTube/Vimeo

Extension Points

php
// Add custom extractor
class CustomExtractor implements ContentExtractorInterface
{
    public function extract(string $content, string $url, int $maxLength): ContentExtractionResult
    {
        // Custom extraction logic
    }
}

Register the extractor in services.yaml:

yaml
services:
    App\Service\Content\Extractor\CustomExtractor:
        tags: ['content.extractor']

Troubleshooting

Issue: Extraction fails with 403 error

Cause: Website blocks automated requests

Solutions:

  1. Use more realistic User-Agent header (already implemented)
  2. Add cookies/session handling
  3. Use headless browser (Puppeteer)
  4. Accept limitation and use user-provided content only

Issue: No content extracted from HTML

Cause: JavaScript-rendered content or unusual HTML structure

Solutions:

  1. Check if site uses JavaScript rendering
  2. Add custom content selectors for specific sites
  3. Use headless browser for JavaScript execution

Issue: PDF extraction is slow

Cause: Large PDF files with many pages

Solutions:

  1. Reduce MAX_PAGES limit
  2. Process PDFs asynchronously
  3. Cache extracted content

Related Documentation

  • AI Summary Architecture
  • Template System
  • Implementation Plan