# SourceFinder

## Context

Microservice for intelligent news sourcing and scoring. It provides scored, filtered news content via an API to content generation clients such as PublicationAutomator.

**Goal**: Reusable news service with anti-prompt-injection protection, intelligent scoring, and stock management.

## Architecture

```
[API Request] → [Stock Search] → [Live Scraping if needed] → [Scoring] → [Anti-injection] → [Filtered Results]
```

**Role**: Independent news sourcing service for multiple content generation clients.

**Stack**: Node.js + Express, highly modular architecture, JSON storage (swappable for MongoDB/PostgreSQL), Redis cache, modular News Provider (LLM by default, scraping/hybrid available).

## Reference documents

- `CDC.md` - Complete technical specifications and algorithms
- `config/sources.json` - Sources configuration and scraping rules
- `docs/api.md` - API documentation and examples

## Key technical elements

### Intelligent scoring system

```
Score = (Race_specificity × 0.4) + (Freshness × 0.3) + (Source_quality × 0.2) + (Anti_duplication × 0.1)
```

### Multi-layer anti-prompt injection

- Content preprocessing with pattern detection
- Semantic validation
- Source scoring with security penalties
- Quarantine of suspicious content

### Smart stock management

Three-tier system: Premium (studies, official sources), Standard (specialized news), Fallback (general content). Reuse logic with rotation periods and usage tracking.

### API design

Primary endpoint: `GET /api/v1/news/search`, with parameters for `race_code`, `product_context`, and scoring filters. Returns scored results with metadata and source attribution.

### Cascading source strategy

1. **Specialized sources** (breed clubs, specialized sites)
2. **Animal media** (pet magazines, vet sites)
3. **General fallback** (adapted mainstream content)

## Important constraints

- API-first design for multiple clients
- Zero prompt injection tolerance
- Stock coverage: 50+ sources per popular breed
- Numeric race codes only ("352-1" format)
- Source diversity and quality balance
- Highly modular architecture: strict interfaces, interchangeable components
- News Provider: LLM by default, scraping/hybrid via configuration
- Storage: JSON by default, MongoDB/PostgreSQL via a Repository interface

## Attention points

- Specialized sources = highest injection risk + highest value
- Stock management is crucial for performance and cost
- Scoring algorithm must adapt to different client needs
- Background processing for stock refresh and cleanup

## Integrations

- **PublicationAutomator**: Primary client for daily article generation
- **Future clients**: Newsletter systems, social media content, competitive intelligence
- **External APIs**: Google News, RSS feeds, specialized pet industry sources
- **Monitoring**: Health checks, usage tracking, source reliability metrics

## ⚠️ IMPORTANT - TODO MANAGEMENT

**CRITICAL**: This project is complex, with 25+ interdependent components. Rigorous task management via a todo list is MANDATORY in order to:

- Avoid forgetting critical elements (security, performance, integrations)
- Maintain consistency across development phases
- Ensure complete coverage of the CDC specifications
- Allow precise progress tracking

**Absolute rule**: Use TodoWrite for ALL non-trivial development on this project. The 447 lines of the CDC represent a considerable scope that requires a methodical approach.
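
For illustration, the weighted formula under "Intelligent scoring system" could be sketched in Node.js as follows. This is a minimal sketch, not the implementation specified in `CDC.md`: the helper names (`computeScore`, `rankItems`), the item shape, and the assumption that each component is already normalized to the range 0 to 1 are all hypothetical. Only the four weights come from this document.

```javascript
// Weights taken from the scoring formula in this document.
const WEIGHTS = {
  raceSpecificity: 0.4,
  freshness: 0.3,
  sourceQuality: 0.2,
  antiDuplication: 0.1,
};

// Assumption: every component is expected in [0, 1]; out-of-range values are clamped.
function clamp01(value) {
  return Math.min(1, Math.max(0, value));
}

// Hypothetical helper: combines the four components into a single weighted score.
function computeScore(components) {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    const value = components[name];
    if (typeof value !== "number" || Number.isNaN(value)) {
      throw new TypeError(`Missing or invalid component: ${name}`);
    }
    score += weight * clamp01(value);
  }
  return score;
}

// Hypothetical helper: scores items, drops those below minScore, sorts highest first.
function rankItems(items, minScore = 0) {
  return items
    .map((item) => ({ ...item, score: computeScore(item.components) }))
    .filter((item) => item.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// Usage with made-up items (the ids and component values are illustrative only).
const ranked = rankItems(
  [
    {
      id: "a",
      components: { raceSpecificity: 1, freshness: 0.5, sourceQuality: 0.5, antiDuplication: 1 },
    },
    {
      id: "b",
      components: { raceSpecificity: 0.2, freshness: 0.2, sourceQuality: 0.5, antiDuplication: 0.5 },
    },
  ],
  0.5
);
console.log(ranked.map((item) => `${item.id}: ${item.score.toFixed(2)}`));
```

Keeping the weights in a single object is one way to let different clients supply their own weighting later, in line with the note that the scoring algorithm must adapt to different client needs.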