AI Content Cleaning System

Overview

The AI Content Cleaning System ensures that AI-generated summaries are clean, professional, and free from internal artifacts like tool calls, thinking tags, and other implementation details that should not be visible to end users.

Problem Statement

When using advanced AI models (especially Claude or extended OpenAI models), the AI may include:

Tool call XML tags (<webSearch>, <webFetch>, etc.)
Thinking process tags (<thinking>...</thinking>)
Function call artifacts (<function_calls>, <invoke>, etc.)
Internal reasoning that should not be in the final output

These artifacts must be removed before presenting the content to users.

Architecture

Components

1. AiConstants (`App\Constants\AiConstants`)

Purpose: Centralize all AI-related magic strings and configuration values.

Key Constants:

php

// Provider names
PROVIDER_OPENAI = 'openai'
PROVIDER_ANTHROPIC = 'anthropic'

// Tool call patterns for detection
TOOL_CALL_PATTERNS = [
    '/<webSearch>.*?<\/webSearch>/s',
    '/<webFetch>.*?<\/webFetch>/s',
    '/<function_calls>.*?<\/function_calls>/s',
    // ... more patterns
]

// Configuration
MAX_RETRIES = 3
RETRY_DELAY_MS = 1000
DEFAULT_COMPLETION_TOKENS = 500
TEMPERATURE_BALANCED = 0.7

Benefits:

✅ No magic strings in code
✅ Single source of truth for constants
✅ Easy to maintain and update
✅ Type-safe constant access

2. AiContentCleanerService (`App\Service\Ai\AiContentCleanerService`)

Purpose: Clean and sanitize AI-generated content.

Methods:

php

// Main cleaning method
public function cleanContent(string $content): string

// Detection methods
public function containsArtifacts(string $content): bool
public function extractToolCalls(string $content): array

// Private cleaning methods
private function removeToolCalls(string $content): string
private function removeThinkingTags(string $content): string
private function removeStrayXmlTags(string $content): string
private function normalizeWhitespace(string $content): string

Cleaning Process:

Remove tool call tags (webSearch, webFetch, etc.)
Remove thinking tags
Remove stray XML artifacts
Normalize whitespace (max 2 consecutive newlines)
Trim and return

Example:

php

$rawContent = <<<CONTENT
<thinking>
I need to search for information about this topic.
</thinking>

<webSearch>
<query>AI tooling</query>
<maxResults>5</maxResults>
</webSearch>

Here is the actual summary content that should be shown to users.
CONTENT;

$cleaned = $contentCleaner->cleanContent($rawContent);
// Result: "Here is the actual summary content that should be shown to users."

3. Updated OpenAiProvider

Changes:

Injects AiContentCleanerService via constructor
Uses AiConstants instead of magic strings
Cleans all AI responses before returning
Logs when artifacts are detected and removed

Flow:

AI API Response
    ↓
Extract raw content
    ↓
Check for artifacts (log if found)
    ↓
Clean content
    ↓
Validate not empty
    ↓
Return AiResponse DTO

Coding Standards Compliance

✅ No Magic Strings

Before:

php

$provider = 'openai';
$maxRetries = 3;
$temperature = 0.7;

After:

php

$provider = AiConstants::PROVIDER_OPENAI;
$maxRetries = AiConstants::MAX_RETRIES;
$temperature = AiConstants::TEMPERATURE_BALANCED;

✅ No Array Objects - Use DTOs

Before:

php

return [
    'content' => $content,
    'tokens' => $tokens,
    'cost' => $cost,
];

After:

php

return new AiResponse(
    content: $content,
    promptTokens: $promptTokens,
    completionTokens: $completionTokens,
    cost: $cost,
    model: $this->model,
    metadata: $metadata,
);

✅ Dedicated Constants Classes

All AI-related constants are in AiConstants:

Provider names
Model identifiers
Temperature presets
Token limits
Retry configuration
Tool call patterns

✅ Proper Service Architecture

Single Responsibility: Each service has one clear purpose
Dependency Injection: All dependencies injected via constructor
Interface Segregation: Services implement focused interfaces
Logging: All operations logged with context

✅ Immutable DTOs

php

final class AiResponse
{
    public function __construct(
        public readonly string $content,
        public readonly int $promptTokens,
        public readonly int $completionTokens,
        public readonly float $cost,
        public readonly string $model,
        public readonly array $metadata = [],
    ) {}
}

Usage

Automatic Cleaning (Default)

All AI responses are automatically cleaned by the OpenAiProvider:

php

// In your service
$aiResponse = $this->aiProvider->generateCompletion($prompt);
// $aiResponse->content is already cleaned!

Manual Cleaning

If you need to clean content manually:

php

use App\Service\Ai\AiContentCleanerService;

public function __construct(
    private readonly AiContentCleanerService $contentCleaner,
) {}

public function processContent(string $rawContent): string
{
    return $this->contentCleaner->cleanContent($rawContent);
}

Detection Only

Check if content contains artifacts without cleaning:

php

if ($this->contentCleaner->containsArtifacts($content)) {
    $toolCalls = $this->contentCleaner->extractToolCalls($content);
    $this->logger->warning('Content contains tool calls', [
        'count' => count($toolCalls),
    ]);
}

Configuration

Service Configuration

File: backend/config/services.yaml

yaml

services:
    App\Service\Ai\AiContentCleanerService:
        arguments:
            $logger: '@logger'

    App\Service\Ai\OpenAiProvider:
        arguments:
            $contentCleaner: '@App\Service\Ai\AiContentCleanerService'
            $apiKey: '%env(OPENAI_API_KEY)%'
            $apiUrl: '%env(OPENAI_API_URL)%'
            $model: '%env(OPENAI_MODEL)%'
            $mockMode: '%env(bool:OPENAI_MOCK)%'

Adding New Tool Call Patterns

To detect and remove new types of tool calls:

File: backend/src/Constants/AiConstants.php

php

public const TOOL_CALL_PATTERNS = [
    '/<webSearch>.*?<\/webSearch>/s',
    '/<webFetch>.*?<\/webFetch>/s',
    '/<newToolName>.*?<\/newToolName>/s',  // Add new pattern here
];

Testing

Unit Tests

File: backend/tests/Unit/Service/Ai/AiContentCleanerServiceTest.php

bash

# Run cleaner tests
./vendor/bin/phpunit tests/Unit/Service/Ai/AiContentCleanerServiceTest.php

# Results: 9 tests, 24 assertions, 100% pass rate

Test Coverage:

✅ Cleans web search tool calls
✅ Cleans thinking tags
✅ Cleans multiple artifacts
✅ Normalizes whitespace
✅ Detects artifacts
✅ Extracts tool calls
✅ Handles empty content
✅ Handles content with only artifacts
✅ Preserves valid content

Integration Testing

Test with real AI responses:

php

// Generate a summary
$summary = $aiSummaryService->generateSummary($knowledge, $role);

// Content should be clean
$this->assertStringNotContainsString('<webSearch>', $summary->getContent());
$this->assertStringNotContainsString('<thinking>', $summary->getContent());

Logging

The system logs all cleaning operations:

php

// When artifacts are detected
$this->logger->warning('AI response contains tool calls/artifacts, cleaning', [
    'tool_calls_found' => count($toolCalls),
    'raw_length' => strlen($rawContent),
]);

// After cleaning
$this->logger->info('AI content cleaned', [
    'original_length' => $originalLength,
    'cleaned_length' => $cleanedLength,
    'removed_bytes' => $originalLength - $cleanedLength,
]);

// If content is empty after cleaning
$this->logger->error('Content is empty after cleaning', [
    'raw_content_preview' => substr($rawContent, 0, 200),
]);

Error Handling

Empty Content After Cleaning

If all content is removed during cleaning (i.e., response was only tool calls):

php

if (empty($content)) {
    throw new \RuntimeException('AI response is empty after cleaning tool calls');
}

This prevents saving empty summaries and alerts you to potential issues.

Graceful Degradation

If cleaning fails, the original content is preserved:

php

try {
    $cleaned = $this->contentCleaner->cleanContent($rawContent);
} catch (\Throwable $e) {
    $this->logger->error('Content cleaning failed', ['error' => $e->getMessage()]);
    $cleaned = $rawContent; // Fallback to raw content
}

Performance

Benchmarks

Small content (< 1KB): < 1ms
Medium content (1-10KB): 1-5ms
Large content (> 10KB): 5-20ms

Optimization

Uses compiled regex patterns
Single-pass cleaning where possible
Minimal string allocations

Future Enhancements

Planned Features

Configurable Patterns: Allow runtime configuration of tool call patterns
Content Validation: Validate cleaned content meets quality standards
Metrics Collection: Track cleaning frequency and patterns
Custom Cleaners: Plugin system for domain-specific cleaning rules

Extension Points

php

interface ContentCleanerInterface
{
    public function cleanContent(string $content): string;
    public function containsArtifacts(string $content): bool;
}

// Register custom cleaner
class CustomContentCleaner implements ContentCleanerInterface
{
    // Custom cleaning logic
}

Troubleshooting

Issue: Content still contains artifacts

Cause: New tool call pattern not in AiConstants::TOOL_CALL_PATTERNS

Solution: Add the pattern to the constants file

Issue: Valid content being removed

Cause: Overly aggressive regex pattern

Solution: Make pattern more specific or add exclusions

Issue: Performance degradation

Cause: Too many regex patterns or large content

Solution: Optimize patterns or implement caching

AI Summary Architecture
OpenAI Provider
Coding Standards

Status: ✅ Production Ready Version: 1.0.0 Last Updated: 2026-03-10

AI Content Cleaning System ​

Overview ​

Problem Statement ​

Architecture ​

Components ​

1. AiConstants (App\Constants\AiConstants) ​

2. AiContentCleanerService (App\Service\Ai\AiContentCleanerService) ​

3. Updated OpenAiProvider ​

Coding Standards Compliance ​

✅ No Magic Strings ​

✅ No Array Objects - Use DTOs ​

✅ Dedicated Constants Classes ​

✅ Proper Service Architecture ​

✅ Immutable DTOs ​

Usage ​

Automatic Cleaning (Default) ​

Manual Cleaning ​

Detection Only ​

Configuration ​

Service Configuration ​

Adding New Tool Call Patterns ​

Testing ​

Unit Tests ​

Integration Testing ​

Logging ​

Error Handling ​

Empty Content After Cleaning ​

Graceful Degradation ​

Performance ​

Benchmarks ​

Optimization ​

Future Enhancements ​

Planned Features ​

Extension Points ​

Troubleshooting ​

Issue: Content still contains artifacts ​

Issue: Valid content being removed ​

Issue: Performance degradation ​

Related Documentation ​