SourceFinder

Context

Microservice for intelligent news sourcing and scoring. Provides scored, filtered news content via API for content generation clients like PublicationAutomator.

Goal: Reusable news service with anti-prompt injection protection, intelligent scoring, and stock management.

Architecture

[API Request] → [Stock Search] → [Live Scraping if needed] → [Scoring] → [Anti-injection] → [Filtered Results]
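
A minimal sketch of this flow in JavaScript, with every stage stubbed out so the chain reads end to end; all names below are placeholders, not the actual modules.

```javascript
// Illustrative flow only: each stage is a trivial stub standing in for the
// real module (stock repository, live scraper, scorer, anti-injection engine).
const stock = { search: async (req) => [] };            // Stock Search
const scrapeLive = async (req) => [];                   // Live Scraping
const scoreItem = (item) => ({ ...item, score: 0.5 });  // Scoring
const isSafe = (item) => true;                          // Anti-injection

async function handleSearch(request) {
  let items = await stock.search(request);
  if (items.length < (request.minResults ?? 5)) {
    // Only fall back to live scraping when stock cannot cover the request.
    items = items.concat(await scrapeLive(request));
  }
  return items
    .map(scoreItem)
    .filter(isSafe)
    .filter((item) => item.score >= (request.minScore ?? 0)); // Filtered Results
}
```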

Role: Independent news sourcing service for multiple content generation clients.

Stack: Node.js + Express, ultra-modular architecture, JSON storage (interchangeable with MongoDB/PostgreSQL), Redis cache, modular News Provider (LLM by default, scraping/hybrid available).

Reference documents

  • CDC.md - Complete technical specifications and algorithms
  • config/sources.json - Sources configuration and scraping rules
  • docs/api.md - API documentation and examples

Key technical elements

Intelligent scoring system

Score = (Race_specificity × 0.4) + (Freshness × 0.3) + (Source_quality × 0.2) + (Anti_duplication × 0.1)
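
A worked sketch of this formula in JavaScript; the weights come from the line above, while the function name and the assumption that every factor is normalized to [0, 1] are illustrative.

```javascript
// Weights taken from the scoring formula above.
const WEIGHTS = {
  raceSpecificity: 0.4,
  freshness: 0.3,
  sourceQuality: 0.2,
  antiDuplication: 0.1,
};

// Each factor is assumed to be a value in [0, 1].
function computeScore({ raceSpecificity, freshness, sourceQuality, antiDuplication }) {
  return (
    raceSpecificity * WEIGHTS.raceSpecificity +
    freshness * WEIGHTS.freshness +
    sourceQuality * WEIGHTS.sourceQuality +
    antiDuplication * WEIGHTS.antiDuplication
  );
}

// Example: a fresh, highly specific article from a good, unused source.
// computeScore({ raceSpecificity: 0.9, freshness: 0.8, sourceQuality: 0.7, antiDuplication: 1.0 })
// => 0.36 + 0.24 + 0.14 + 0.10 ≈ 0.84
```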

Multi-layer anti-prompt injection

  • Content preprocessing with pattern detection
  • Semantic validation
  • Source scoring with security penalties
  • Quarantine of suspicious content
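
A compressed sketch of how these four layers might chain together; the patterns, penalty weights, and helper names are assumptions for illustration, not the project's actual rules.

```javascript
// Hypothetical screening pass combining the four layers; the real patterns,
// semantic checks and penalty weights live in the anti-injection engine.
const INJECTION_PATTERNS = [
  /ignore (all )?(previous|prior) instructions/i,
  /you are now .*(assistant|system)/i,
  /system prompt/i,
];

function screenContent(item) {
  // Layer 1: preprocessing + pattern detection
  const text = item.content.normalize('NFKC').trim();
  const patternHits = INJECTION_PATTERNS.filter((p) => p.test(text)).length;

  // Layer 2: semantic validation (stubbed here, e.g. a classifier or LLM check)
  const semanticallySuspicious = false;

  // Layer 3: security penalty applied to the source score
  const penalty = patternHits * 0.2 + (semanticallySuspicious ? 0.3 : 0);

  // Layer 4: quarantine suspicious content instead of returning it
  const quarantined = patternHits > 0 || semanticallySuspicious;

  return { ...item, score: Math.max(0, item.score - penalty), quarantined };
}
```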

Smart stock management

Three-tier system: Premium (studies, official sources), Standard (specialized news), Fallback (general content). Reuse logic with rotation periods and usage tracking.
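
One possible shape for that configuration and its reuse check is sketched below; the rotation periods, reuse caps and field names are assumptions, not the actual values.

```javascript
// Hypothetical tier configuration; tier names follow the section above.
const STOCK_TIERS = {
  premium: { sources: ['studies', 'official'], rotationDays: 30, maxReuse: 3 },
  standard: { sources: ['specialized-news'], rotationDays: 14, maxReuse: 2 },
  fallback: { sources: ['general'], rotationDays: 7, maxReuse: 1 },
};

// Reuse logic: an item becomes eligible again once its rotation period has
// elapsed and it has not exceeded the reuse cap for its tier.
function isReusable(item, now = Date.now()) {
  const tier = STOCK_TIERS[item.tier];
  const daysSinceUse = (now - item.lastUsedAt) / (1000 * 60 * 60 * 24);
  return item.usageCount < tier.maxReuse && daysSinceUse >= tier.rotationDays;
}
```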

API design

Primary endpoint: GET /api/v1/news/search with parameters for race_code, product_context, scoring filters. Returns scored results with metadata and source attribution.
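
A hypothetical call against that endpoint (Node 18+, global fetch): only race_code and product_context come from the description above; the other parameter names and the response shape are assumptions, with the real contract in docs/api.md.

```javascript
// Illustrative request only; parameter and field names beyond race_code and
// product_context are guesses, not the documented contract.
const params = new URLSearchParams({
  race_code: '352-1',           // numeric race code (see constraints below)
  product_context: 'premium-food',
  min_score: '0.6',             // assumed scoring filter
  limit: '10',
});

const res = await fetch(`http://localhost:3000/api/v1/news/search?${params}`);
const { results, meta } = await res.json();
// Expected: scored results with metadata and source attribution, e.g.
// results[0] => { title, url, score, source: { name, tier }, published_at }
```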

Cascading source strategy

  1. Specialized sources (breed clubs, specialized sites)
  2. Animal media (pet magazines, vet sites)
  3. General fallback (adapted mainstream content)
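
A minimal sketch of the cascade, assuming each tier exposes an async search(query) method; the stopping threshold is illustrative.

```javascript
// Walk the tiers in the order listed above and stop as soon as coverage is
// sufficient, so general fallback sources are only hit when needed.
async function cascadingSearch(query, tiers, minResults = 5) {
  const results = [];
  for (const tier of tiers) {
    results.push(...(await tier.search(query)));
    if (results.length >= minResults) break;
  }
  return results;
}

// Usage (tier objects are placeholders for the real providers):
// const results = await cascadingSearch({ raceCode: '352-1' },
//   [specializedSources, animalMedia, generalFallback]);
```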

Important constraints

  • API-first design for multiple clients
  • Zero prompt injection tolerance
  • Stock coverage: 50+ sources per popular breed
  • Numeric race codes only ("352-1" format; see the validation sketch after this list)
  • Source diversity and quality balance
  • Ultra-modular architecture: strict interfaces, interchangeable components
  • News Provider: LLM by default, scraping/hybrid via configuration
  • Storage: JSON by default, MongoDB/PostgreSQL via the Repository interface

Attention points

  • Specialized sources = highest injection risk + highest value
  • Stock management crucial for performance and cost
  • Scoring algorithm must adapt to different client needs
  • Background processing for stock refresh and cleanup

Integrations

  • PublicationAutomator: Primary client for daily article generation
  • Future clients: Newsletter systems, social media content, competitive intelligence
  • External APIs: Google News, RSS feeds, specialized pet industry sources
  • Monitoring: Health checks, usage tracking, source reliability metrics

⚠️ IMPORTANT - TODO MANAGEMENT

CRITICAL: This project is complex, with 25+ interdependent components. Rigorous task management via the todo list is MANDATORY in order to:

  • Avoid overlooking critical elements (security, performance, integrations)
  • Maintain consistency across development phases
  • Ensure complete coverage of the CDC specifications
  • Allow precise tracking of progress

Absolute rule: Use TodoWrite for ALL non-trivial development in this project. The 447 lines of the CDC represent a considerable scope that requires a methodical approach.