URL Content Extraction Feature

Overview

The URL Content Extraction feature automatically extracts and processes content from URLs added to the knowledge database. This enriched content is then used by the AI summary generation system to create more accurate and comprehensive summaries.

Architecture

Components

  1. ContentExtractionService (App\Service\Content\ContentExtractionService)

    • Main orchestrator for content extraction
    • Handles HTTP requests, content type detection, and routing
    • Implements SSRF protection and security measures
  2. HtmlContentExtractor (App\Service\Content\Extractor\HtmlContentExtractor)

    • Extracts readable content from HTML pages
    • Removes navigation, ads, footers, and other non-content elements
    • Uses readability algorithms to find main article content
  3. PdfContentExtractor (App\Service\Content\Extractor\PdfContentExtractor)

    • Extracts text from PDF documents
    • Handles multi-page documents (up to 50 pages)
    • Preserves document structure and metadata
  4. ContentType Enum (App\Enum\ContentType)

    • Defines supported content types (HTML, PDF, JSON, Markdown, Plain Text)
    • Provides content type detection from URLs and headers
  5. ContentExtractionResult (App\ValueObject\ContentExtractionResult)

    • Immutable value object representing extraction results
    • Contains content, metadata, errors, and truncation status
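
As an illustration of the value object described above, the following is a minimal sketch of what ContentExtractionResult might look like. The named constructors (`success`, `failure`) and exact property set beyond `content`, `error`, `length`, and `truncated` are assumptions for this sketch, not the production class.

```php
// Sketch of an immutable extraction result; constructor shape is assumed.
final class ContentExtractionResult
{
    private function __construct(
        public readonly ?string $content,
        public readonly ?string $error,
        public readonly bool $truncated,
        public readonly int $length,
    ) {}

    public static function success(string $content, bool $truncated = false): self
    {
        return new self($content, null, $truncated, mb_strlen($content));
    }

    public static function failure(string $error): self
    {
        return new self(null, $error, false, 0);
    }

    public function isSuccess(): bool
    {
        return $this->error === null;
    }

    public function isFailure(): bool
    {
        return !$this->isSuccess();
    }
}
```

Because the object is immutable, a result can be cached or passed between services without defensive copying.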

Integration Points

  • PromptBuilderService: Uses extracted content as the {{url_content}} variable in AI prompts
  • Knowledge Entity: Stores URL and can cache extracted content
  • AI Summary Generation: Enriches prompts with full article/document content

Features

Content Type Support

| Type | Status | Description |
|------|--------|-------------|
| HTML | ✅ Full | Extracts main article content, removes navigation/ads |
| PDF | ✅ Full | Extracts text from PDFs up to 20MB, 50 pages |
| Plain Text | ✅ Full | Simple truncation and formatting |
| JSON | ⚠️ Partial | Basic support, can be extended |
| Markdown | ⚠️ Partial | Treated as plain text |
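
A hedged sketch of how the ContentType enum might map a Content-Type header to one of the supported types. The case names and the `fromHeader` helper are assumptions; the real enum may detect types from URLs as well.

```php
// Illustrative enum; case names and fromHeader() are assumed for this sketch.
enum ContentType: string
{
    case Html = 'html';
    case Pdf = 'pdf';
    case Json = 'json';
    case Markdown = 'markdown';
    case PlainText = 'plain_text';

    public static function fromHeader(string $contentType): self
    {
        // Strip parameters such as "; charset=utf-8" and normalize case.
        $mime = strtolower(trim(explode(';', $contentType)[0]));

        return match ($mime) {
            'text/html', 'application/xhtml+xml' => self::Html,
            'application/pdf' => self::Pdf,
            'application/json' => self::Json,
            'text/markdown' => self::Markdown,
            default => self::PlainText,
        };
    }
}
```

Unknown MIME types fall back to plain text, matching the "Simple truncation and formatting" behavior in the table above.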

Security Features

  • SSRF Protection: Blocks private IP ranges (localhost, 192.168.x.x, 10.x.x.x, etc.)
  • Size Limits: Maximum 20MB per URL to prevent memory issues
  • Timeout Protection: 30-second timeout for HTTP requests
  • URL Validation: Strict URL format validation
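
The SSRF check can be sketched with PHP's built-in IP filters, which reject private (10.x, 172.16-31.x, 192.168.x) and reserved ranges. This is a simplified stand-in for the service's actual protection, and the function name is hypothetical.

```php
// Hypothetical helper sketching the SSRF block list described above.
function isBlockedHost(string $host): bool
{
    // Resolve hostnames to an IP; literal IPs pass through unchanged.
    $ip = filter_var($host, FILTER_VALIDATE_IP) ? $host : gethostbyname($host);

    // FILTER_FLAG_NO_PRIV_RANGE rejects private ranges (10/8, 172.16/12,
    // 192.168/16); FILTER_FLAG_NO_RES_RANGE rejects reserved ranges.
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) === false;
}
```

Note that a robust implementation must also re-check the resolved IP on every redirect, since an attacker can bounce a public URL to an internal address.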

Content Processing

  1. HTML Extraction:

    • Identifies main content area using semantic selectors
    • Removes <script>, <style>, <nav>, <footer>, <aside> tags
    • Removes ads and social sharing widgets
    • Cleans whitespace and normalizes line breaks
    • Truncates to specified length (default 5000 chars)
  2. PDF Extraction:

    • Parses PDF structure using smalot/pdfparser
    • Extracts text from all pages (up to 50)
    • Removes page numbers and headers/footers
    • Preserves paragraph structure
    • Extracts metadata (title, author, subject)
  3. Truncation:

    • Breaks on sentence boundaries when possible
    • Falls back to word boundaries
    • Adds "..." indicator when truncated
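
The truncation steps above can be sketched as follows. This is a minimal, assumed helper illustrating the sentence-then-word fallback, not the production implementation.

```php
// Sketch of sentence-aware truncation (assumed helper name).
function truncateContent(string $text, int $maxLength): string
{
    if (mb_strlen($text) <= $maxLength) {
        return $text;
    }

    $cut = mb_substr($text, 0, $maxLength);

    // Prefer the last sentence boundary, if it isn't too early in the text.
    $sentence = max(mb_strrpos($cut, '. '), mb_strrpos($cut, '! '), mb_strrpos($cut, '? '));
    if ($sentence !== false && $sentence > $maxLength * 0.5) {
        return mb_substr($cut, 0, $sentence + 1) . '...';
    }

    // Otherwise fall back to the last word boundary.
    $word = mb_strrpos($cut, ' ');

    return rtrim($word !== false ? mb_substr($cut, 0, $word) : $cut) . '...';
}
```

The half-length threshold prevents a lone early period (e.g. in an abbreviation) from discarding most of the allowed content.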

Usage

From PHP Code

php
use App\Service\Content\ContentExtractionService;

// Inject via dependency injection
public function __construct(
    private readonly ContentExtractionService $contentExtractor,
) {}

// Extract content
$result = $this->contentExtractor->extractContent(
    url: 'https://example.com/article',
    maxLength: 5000
);

if ($result->isSuccess()) {
    echo "Content: " . $result->content;
    echo "Type: " . $result->type->value;
    echo "Length: " . $result->length;
    echo "Truncated: " . ($result->truncated ? 'Yes' : 'No');
} else {
    echo "Error: " . $result->error;
}

In AI Prompts

The extracted content is automatically available in prompt templates:

handlebars
You are summarizing the following knowledge item:

Title: {{title}}
Category: {{category}}
URL: {{url}}

User-provided content:
{{content}}

{{#if url_content}}
Full article/document content:
{{url_content}}
{{/if}}

Please create a summary for {{role_name}}.

From Slack

When a user adds a URL to the knowledge database via Slack:

  1. URL is detected in the modal
  2. Metadata is extracted (title, description, image)
  3. When AI summaries are generated, full content is extracted
  4. Content is appended to the prompt for better context

Testing

Unit Tests

bash
# Run all content extraction tests
./vendor/bin/phpunit tests/Unit/Service/Content/

# Run specific test
./vendor/bin/phpunit tests/Unit/Service/Content/HtmlContentExtractorTest.php

Manual Testing

bash
# Test with real URLs
php backend/scripts/test_url_extraction.php

Test Results

Tested with 14 different URLs:

  • ✅ 5/5 Anthropic PDF system cards
  • ✅ 1/1 ArXiv HTML papers
  • ✅ 1/1 Google DeepMind blog posts
  • ✅ 1/1 Kimi blog posts
  • ⚠️ 0/5 OpenAI URLs (403 bot protection)
  • ⚠️ 0/1 ResearchGate (403 bot protection)

Success Rate: 57% (8/14)

Note: 403 errors are expected for sites with aggressive bot protection. These can be handled with headless browsers if needed.

Configuration

Environment Variables

env
# Optional: Override default timeout (seconds)
URL_EXTRACTION_TIMEOUT=30

# Optional: Override max content length (characters)
URL_EXTRACTION_MAX_LENGTH=5000

# Optional: Override max file size (bytes)
URL_EXTRACTION_MAX_SIZE=20971520  # 20MB
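
In plain PHP, these variables could be read with their documented defaults roughly as below; in practice the Symfony container likely injects them instead, so treat this as an assumption-laden sketch.

```php
// Read overrides with the documented defaults (sketch only; the app
// presumably binds these via Symfony parameters rather than getenv()).
$timeout   = (int) (getenv('URL_EXTRACTION_TIMEOUT') ?: 30);        // seconds
$maxLength = (int) (getenv('URL_EXTRACTION_MAX_LENGTH') ?: 5000);   // characters
$maxSize   = (int) (getenv('URL_EXTRACTION_MAX_SIZE') ?: 20971520); // bytes (20MB)
```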

Service Configuration

Services are auto-configured via config/services/content.yaml:

yaml
services:
    App\Service\Content\ContentExtractionService:
        arguments:
            $logger: '@logger'
            $httpClient: '@http_client'
            $htmlExtractor: '@App\Service\Content\Extractor\HtmlContentExtractor'
            $pdfExtractor: '@App\Service\Content\Extractor\PdfContentExtractor'

Performance

Benchmarks

| Content Type | Avg Duration | Max Size | Notes |
|--------------|--------------|----------|-------|
| HTML (small) | 100-500ms | 1MB | Fast, minimal processing |
| HTML (large) | 500-2000ms | 5MB | Depends on page complexity |
| PDF (small) | 1000-3000ms | 5MB | 10-20 pages |
| PDF (large) | 3000-7000ms | 20MB | 50+ pages |

Optimization Tips

  1. Caching: Consider caching extracted content in Knowledge entity
  2. Async Processing: Extract content in background job for large files
  3. CDN: Use CDN for frequently accessed URLs
  4. Rate Limiting: Implement rate limiting for external requests
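
Tip 1 can be sketched as a small in-memory caching decorator. The class name and callable-based design are assumptions for illustration; in the application this would wrap ContentExtractionService and ideally persist results on the Knowledge entity.

```php
// Hypothetical caching decorator for tip 1 (names are assumptions).
final class CachingExtractor
{
    /** @var array<string, string> */
    private array $cache = [];

    public function __construct(private readonly \Closure $inner) {}

    public function extract(string $url): string
    {
        // Reuse a prior result for the same URL instead of re-fetching.
        return $this->cache[$url] ??= ($this->inner)($url);
    }
}
```

A persistent cache would also need an expiry policy, since article content can change after the first extraction.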

Error Handling

All errors are handled gracefully and logged:

php
// Extraction never throws exceptions
$result = $contentExtractor->extractContent($url);

if ($result->isFailure()) {
    // Log error but continue with AI summary using user-provided content only
    $logger->warning('URL extraction failed', [
        'url' => $url,
        'error' => $result->error,
    ]);
}

Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| HTTP 403 | Bot protection | Use headless browser or accept limitation |
| HTTP 404 | URL not found | Validate URL before extraction |
| Timeout | Slow server | Increase timeout or skip extraction |
| Content too large | File > 20MB | Increase limit or skip extraction |
| No readable content | JavaScript-rendered page | Use headless browser |
| Invalid URL | Malformed URL | Validate URL format |

Future Enhancements

Planned Features

  1. JavaScript Rendering: Use Puppeteer/Playwright for JS-heavy sites
  2. Content Caching: Cache extracted content in database
  3. Async Processing: Background job queue for large files
  4. More Content Types: Support for Word docs, Excel, etc.
  5. Language Detection: Detect and translate non-Dutch content
  6. Image Extraction: Extract and analyze images from articles
  7. Video Transcription: Extract transcripts from YouTube/Vimeo

Extension Points

php
// Add custom extractor
class CustomExtractor implements ContentExtractorInterface
{
    public function extract(string $content, string $url, int $maxLength): ContentExtractionResult
    {
        // Custom extraction logic
    }
}

Register the extractor in services.yaml:

yaml
services:
    App\Service\Content\Extractor\CustomExtractor:
        tags: ['content.extractor']

Troubleshooting

Issue: Extraction fails with 403 error

Cause: Website blocks automated requests

Solutions:

  1. Use more realistic User-Agent header (already implemented)
  2. Add cookies/session handling
  3. Use headless browser (Puppeteer)
  4. Accept limitation and use user-provided content only

Issue: No content extracted from HTML

Cause: JavaScript-rendered content or unusual HTML structure

Solutions:

  1. Check if site uses JavaScript rendering
  2. Add custom content selectors for specific sites
  3. Use headless browser for JavaScript execution

Issue: PDF extraction is slow

Cause: Large PDF files with many pages

Solutions:

  1. Reduce MAX_PAGES limit
  2. Process PDFs asynchronously
  3. Cache extracted content

Related Documentation

  • AI Summary Architecture
  • Template System
  • Implementation Plan