# URL Content Extraction Feature

## Overview

The URL Content Extraction feature automatically extracts and processes content from URLs added to the knowledge database. This enriched content is then used by the AI summary generation system to create more accurate and comprehensive summaries.
## Architecture

### Components
**ContentExtractionService** (`App\Service\Content\ContentExtractionService`)
- Main orchestrator for content extraction
- Handles HTTP requests, content type detection, and routing
- Implements SSRF protection and security measures
**HtmlContentExtractor** (`App\Service\Content\Extractor\HtmlContentExtractor`)
- Extracts readable content from HTML pages
- Removes navigation, ads, footers, and other non-content elements
- Uses readability algorithms to find the main article content
**PdfContentExtractor** (`App\Service\Content\Extractor\PdfContentExtractor`)
- Extracts text from PDF documents
- Handles multi-page documents (up to 50 pages)
- Preserves document structure and metadata
**ContentType Enum** (`App\Enum\ContentType`)
- Defines supported content types (HTML, PDF, JSON, Markdown, Plain Text)
- Provides content type detection from URLs and headers
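As a rough sketch, the detection logic might look like the following. The case names and mappings here are assumptions for illustration, not the actual `App\Enum\ContentType` implementation:

```php
<?php

// Hypothetical sketch of the ContentType enum; the real cases and
// detection rules in App\Enum\ContentType may differ.
enum ContentType: string
{
    case Html = 'html';
    case Pdf = 'pdf';
    case Json = 'json';
    case Markdown = 'markdown';
    case PlainText = 'text';

    /** Guess a content type from a Content-Type header value. */
    public static function fromHeader(string $header): self
    {
        // Strip parameters such as "; charset=utf-8"
        $mime = strtolower(trim(explode(';', $header)[0]));

        return match ($mime) {
            'text/html', 'application/xhtml+xml' => self::Html,
            'application/pdf' => self::Pdf,
            'application/json' => self::Json,
            'text/markdown' => self::Markdown,
            default => self::PlainText,
        };
    }

    /** Guess a content type from the URL's file extension. */
    public static function fromUrl(string $url): self
    {
        $path = parse_url($url, PHP_URL_PATH) ?? '';
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

        return match ($ext) {
            'html', 'htm' => self::Html,
            'pdf' => self::Pdf,
            'json' => self::Json,
            'md', 'markdown' => self::Markdown,
            default => self::PlainText,
        };
    }
}
```

Header-based detection should generally win over URL-based detection, since many article URLs carry no file extension at all.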
**ContentExtractionResult** (`App\ValueObject\ContentExtractionResult`)
- Immutable value object representing extraction results
- Contains content, metadata, errors, and truncation status
### Integration Points

- **PromptBuilderService**: Uses extracted content as the `{{url_content}}` variable in AI prompts
- **Knowledge Entity**: Stores the URL and can cache extracted content
- **AI Summary Generation**: Enriches prompts with full article/document content
## Features

### Content Type Support

| Type | Status | Description |
|---|---|---|
| HTML | ✅ Full | Extracts main article content, removes navigation/ads |
| PDF | ✅ Full | Extracts text from PDFs up to 20MB, 50 pages |
| Plain Text | ✅ Full | Simple truncation and formatting |
| JSON | ⚠️ Partial | Basic support, can be extended |
| Markdown | ⚠️ Partial | Treated as plain text |
### Security Features

- **SSRF Protection**: Blocks private IP ranges (localhost, 192.168.x.x, 10.x.x.x, etc.)
- **Size Limits**: Maximum 20MB per URL to prevent memory issues
- **Timeout Protection**: 30-second timeout for HTTP requests
- **URL Validation**: Strict URL format validation
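The SSRF guard can be sketched with PHP's built-in IP filters. This is an illustrative approach under stated assumptions, not the actual `ContentExtractionService` code; the function name `isSafeUrl` is hypothetical:

```php
<?php

// Hedged sketch of an SSRF guard: reject non-HTTP schemes, localhost,
// and private/reserved IP ranges. Illustrative only.
function isSafeUrl(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !in_array($parts['scheme'] ?? '', ['http', 'https'], true)) {
        return false; // malformed URL or non-HTTP scheme (file://, gopher://, ...)
    }

    $host = $parts['host'] ?? '';
    if ($host === '' || strtolower($host) === 'localhost') {
        return false;
    }

    // Resolve the hostname: an attacker-controlled DNS name may point
    // at an internal address. gethostbyname() returns the input
    // unchanged on failure, which then fails the IP filter below.
    $ip = filter_var($host, FILTER_VALIDATE_IP) ? $host : gethostbyname($host);

    // Reject private (10.x, 172.16-31.x, 192.168.x) and reserved
    // (127.x, 169.254.x, ...) ranges in one call.
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) !== false;
}
```

Note that a check like this is still vulnerable to DNS rebinding (the name can resolve differently between the check and the request); pinning the resolved IP for the actual HTTP call closes that gap.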
### Content Processing

**HTML Extraction:**
- Identifies the main content area using semantic selectors
- Removes `<script>`, `<style>`, `<nav>`, `<footer>`, and `<aside>` tags
- Removes ads and social sharing widgets
- Cleans whitespace and normalizes line breaks
- Truncates to the specified length (default 5000 characters)
**PDF Extraction:**
- Parses the PDF structure using `smalot/pdfparser`
- Extracts text from all pages (up to 50)
- Removes page numbers and headers/footers
- Preserves paragraph structure
- Extracts metadata (title, author, subject)
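The core of the `smalot/pdfparser` flow can be sketched as follows (requires `composer require smalot/pdfparser`). This is a bare-bones illustration; the real extractor layers header/footer stripping and metadata handling on top:

```php
<?php

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

// Page cap as described above.
const MAX_PAGES = 50;

function extractPdfText(string $path): string
{
    $parser = new Parser();
    $pdf = $parser->parseFile($path);

    // Collect text page by page so the cap can be enforced.
    $texts = [];
    foreach (array_slice($pdf->getPages(), 0, MAX_PAGES) as $page) {
        $texts[] = $page->getText();
    }

    return trim(implode("\n\n", $texts));
}

// Document metadata (title, author, subject) is available via:
// $parser->parseFile($path)->getDetails();
```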
**Truncation:**
- Breaks on sentence boundaries when possible
- Falls back to word boundaries
- Adds a "..." indicator when truncated
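The truncation rules above can be sketched like this (assumed logic for illustration; the real implementation may differ in details):

```php
<?php

// Sentence-aware truncation sketch: prefer a sentence boundary,
// fall back to a word boundary, and append a "..." marker.
function truncateContent(string $text, int $maxLength): string
{
    if (mb_strlen($text) <= $maxLength) {
        return $text; // nothing to do
    }

    $cut = mb_substr($text, 0, $maxLength);

    // Prefer to break after the last complete sentence in the window.
    if (preg_match('/^.*[.!?](?=\s)/su', $cut, $m)) {
        return rtrim($m[0]) . ' ...';
    }

    // Otherwise fall back to the last word boundary.
    $lastSpace = mb_strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace);
    }

    return rtrim($cut) . '...';
}
```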
## Usage

### From PHP Code

```php
use App\Service\Content\ContentExtractionService;

// Inject via dependency injection
public function __construct(
    private readonly ContentExtractionService $contentExtractor,
) {}

// Extract content
$result = $this->contentExtractor->extractContent(
    url: 'https://example.com/article',
    maxLength: 5000
);

if ($result->isSuccess()) {
    echo "Content: " . $result->content;
    echo "Type: " . $result->type->value;
    echo "Length: " . $result->length;
    echo "Truncated: " . ($result->truncated ? 'Yes' : 'No');
} else {
    echo "Error: " . $result->error;
}
```

### In AI Prompts
The extracted content is automatically available in prompt templates:
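For example, a summary prompt template might reference the variable like this. The `{{url_content}}` variable comes from the integration described above; the surrounding template text and any other variables shown are purely illustrative:

```
Summarize the following article for the knowledge base.

Full article content:
{{url_content}}
```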
### From Slack

When a user adds a URL to the knowledge database via Slack:
- URL is detected in the modal
- Metadata is extracted (title, description, image)
- When AI summaries are generated, full content is extracted
- Content is appended to the prompt for better context
## Testing

### Unit Tests

```bash
# Run all content extraction tests
./vendor/bin/phpunit tests/Unit/Service/Content/

# Run a specific test
./vendor/bin/phpunit tests/Unit/Service/Content/HtmlContentExtractorTest.php
```

### Manual Testing
```bash
# Test with real URLs
php backend/scripts/test_url_extraction.php
```

### Test Results
Tested with 14 different URLs:
- ✅ 5/5 Anthropic PDF system cards
- ✅ 1/1 ArXiv HTML papers
- ✅ 1/1 Google DeepMind blog posts
- ✅ 1/1 Kimi blog posts
- ⚠️ 0/5 OpenAI URLs (403 bot protection)
- ⚠️ 0/1 ResearchGate (403 bot protection)
**Success Rate: 57% (8/14)**
Note: 403 errors are expected for sites with aggressive bot protection. These can be handled with headless browsers if needed.
## Configuration

### Environment Variables

```bash
# Optional: Override default timeout (seconds)
URL_EXTRACTION_TIMEOUT=30

# Optional: Override max content length (characters)
URL_EXTRACTION_MAX_LENGTH=5000

# Optional: Override max file size (bytes)
URL_EXTRACTION_MAX_SIZE=20971520  # 20MB
```

### Service Configuration
Services are auto-configured via `config/services/content.yaml`:
```yaml
services:
    App\Service\Content\ContentExtractionService:
        arguments:
            $logger: '@logger'
            $httpClient: '@http_client'
            $htmlExtractor: '@App\Service\Content\Extractor\HtmlContentExtractor'
            $pdfExtractor: '@App\Service\Content\Extractor\PdfContentExtractor'
```

## Performance
### Benchmarks

| Content Type | Avg Duration | Max Size | Notes |
|---|---|---|---|
| HTML (small) | 100-500ms | 1MB | Fast, minimal processing |
| HTML (large) | 500-2000ms | 5MB | Depends on page complexity |
| PDF (small) | 1000-3000ms | 5MB | 10-20 pages |
| PDF (large) | 3000-7000ms | 20MB | 50+ pages |
### Optimization Tips

- **Caching**: Consider caching extracted content in the Knowledge entity
- **Async Processing**: Extract content in a background job for large files
- **CDN**: Use a CDN for frequently accessed URLs
- **Rate Limiting**: Implement rate limiting for external requests
## Error Handling

All errors are handled gracefully and logged:

```php
// Extraction never throws exceptions
$result = $contentExtractor->extractContent($url);

if ($result->isFailure()) {
    // Log the error, but continue with the AI summary using user-provided content only
    $logger->warning('URL extraction failed', [
        'url' => $url,
        'error' => $result->error,
    ]);
}
```

### Common Errors
| Error | Cause | Solution |
|---|---|---|
| HTTP 403 | Bot protection | Use headless browser or accept limitation |
| HTTP 404 | URL not found | Validate URL before extraction |
| Timeout | Slow server | Increase timeout or skip extraction |
| Content too large | File > 20MB | Increase limit or skip extraction |
| No readable content | JavaScript-rendered | Use headless browser |
| Invalid URL | Malformed URL | Validate URL format |
## Future Enhancements

### Planned Features

- **JavaScript Rendering**: Use Puppeteer/Playwright for JS-heavy sites
- **Content Caching**: Cache extracted content in the database
- **Async Processing**: Background job queue for large files
- **More Content Types**: Support for Word docs, Excel, etc.
- **Language Detection**: Detect and translate non-Dutch content
- **Image Extraction**: Extract and analyze images from articles
- **Video Transcription**: Extract transcripts from YouTube/Vimeo
### Extension Points

```php
// Add a custom extractor
class CustomExtractor implements ContentExtractorInterface
{
    public function extract(string $content, string $url, int $maxLength): ContentExtractionResult
    {
        // Custom extraction logic; must return a ContentExtractionResult
    }
}
```

```yaml
# Register in services.yaml
services:
    App\Service\Content\Extractor\CustomExtractor:
        tags: ['content.extractor']
```

## Troubleshooting
### Issue: Extraction fails with a 403 error

**Cause:** The website blocks automated requests

**Solutions:**
- Use a more realistic User-Agent header (already implemented)
- Add cookie/session handling
- Use a headless browser (Puppeteer)
- Accept the limitation and use user-provided content only
### Issue: No content extracted from HTML

**Cause:** JavaScript-rendered content or an unusual HTML structure

**Solutions:**
- Check whether the site uses JavaScript rendering
- Add custom content selectors for specific sites
- Use a headless browser for JavaScript execution
### Issue: PDF extraction is slow

**Cause:** Large PDF files with many pages

**Solutions:**
- Reduce the `MAX_PAGES` limit
- Process PDFs asynchronously
- Cache extracted content
## Related Documentation

- AI Summary Architecture
- Template System
- Implementation Plan