sourcefinder/CLAUDE.md

# SourceFinder

## Context
Microservice for intelligent news sourcing and scoring. Provides scored, filtered news content via API for content generation clients like PublicationAutomator.

**Goal**: Reusable news service with prompt-injection protection, intelligent scoring, and stock management.

## Architecture
```
[API Request] → [Stock Search] → [Live Scraping if needed] → [Scoring] → [Anti-injection] → [Filtered Results]
```
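
The flow above can be sketched as a single async function. This is only an illustration of the stage order: `searchNews`, the `deps` object, its four functions, and the `limit` field are hypothetical names, not the service's actual code.

```javascript
// Hypothetical sketch of the request flow; all names and data shapes are assumptions.
async function searchNews(request, deps) {
  // 1. Look in the existing stock first.
  let articles = await deps.searchStock(request);

  // 2. Scrape live only when the stock does not cover the request.
  if (articles.length < request.limit) {
    const fresh = await deps.scrapeLive(request);
    articles = articles.concat(fresh);
  }

  // 3. Score every candidate.
  const scored = articles.map((a) => ({ ...a, score: deps.score(a, request) }));

  // 4. Drop anything the anti-injection layer flags.
  const safe = scored.filter((a) => !deps.isSuspicious(a));

  // 5. Return the best results first, trimmed to the requested size.
  return safe.sort((a, b) => b.score - a.score).slice(0, request.limit);
}
```

Passing the stages in through `deps` keeps each one replaceable, which fits the modular design described below.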

**Role**: Independent news sourcing service for multiple content generation clients.

**Stack**: Node.js + Express, highly modular architecture, JSON storage (swappable for MongoDB/PostgreSQL), Redis cache, modular News Provider (LLM by default, scraping/hybrid modes available).

## Reference documents
- `CDC.md` - Complete technical specifications and algorithms
- `config/sources.json` - Sources configuration and scraping rules
- `docs/api.md` - API documentation and examples

## Key technical elements

### Intelligent scoring system
```
Score = (Race_specificity × 0.4) + (Freshness × 0.3) + (Source_quality × 0.2) + (Anti_duplication × 0.1)
```
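
A minimal sketch of the weighted sum, assuming each component is already normalized to the range 0 to 1 (the document does not state the input range). The weights come from the formula above; the function and property names are hypothetical.

```javascript
// Weights taken from the formula above; the 0..1 input range is an assumption.
const WEIGHTS = {
  raceSpecificity: 0.4,
  freshness: 0.3,
  sourceQuality: 0.2,
  antiDuplication: 0.1,
};

function computeScore(components) {
  let total = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    const value = components[name];
    // Reject missing, non-numeric, NaN, or out-of-range inputs.
    if (typeof value !== "number" || !(value >= 0 && value <= 1)) {
      throw new RangeError(`${name} must be a number between 0 and 1`);
    }
    total += value * weight;
  }
  return total;
}
```

For example, components of 1.0, 0.5, 0.8, and 1.0 (in the order of the formula) give 0.4 + 0.15 + 0.16 + 0.1, or roughly 0.81. Because the weights sum to 1, normalized inputs always produce a score between 0 and 1.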

### Multi-layer anti-prompt injection
- Content preprocessing with pattern detection
- Semantic validation
- Source scoring with security penalties
- Quarantine of suspicious content
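
The first and last layers (pattern detection and quarantine) could look roughly like the sketch below. The patterns are illustrative assumptions, not the project's actual rules, and the function names are hypothetical. Semantic validation and source penalties are not covered here.

```javascript
// Illustrative patterns only; a real deployment would need a broader, maintained list.
const INJECTION_PATTERNS = [
  /ignore (all |any )?(previous|prior|above) instructions/i,
  /you are now\b/i,
  /\bsystem prompt\b/i,
  /<\/?\s*(system|assistant)\s*>/i,
];

// Returns which patterns, if any, match the given text.
function inspectContent(text) {
  const matches = INJECTION_PATTERNS
    .filter((pattern) => pattern.test(text))
    .map((pattern) => pattern.source);
  return { suspicious: matches.length > 0, matches };
}

// Splits articles into accepted ones and quarantined ones (kept with the reason).
function quarantineSuspicious(articles) {
  const accepted = [];
  const quarantined = [];
  for (const article of articles) {
    const report = inspectContent(article.content);
    if (report.suspicious) {
      quarantined.push({ article, matches: report.matches });
    } else {
      accepted.push(article);
    }
  }
  return { accepted, quarantined };
}
```

Keeping the matched patterns alongside each quarantined article makes later review easier and could feed the source-level security penalties.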

### Smart stock management
Three-tier system: Premium (studies, official sources), Standard (specialized news), Fallback (general content).
Reuse logic with rotation periods and usage tracking.
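
One possible reading of the reuse logic, as a sketch: an item becomes reusable once its tier's rotation period has elapsed since its last use, and higher tiers are preferred. The rotation values, field names (`tier`, `usageCount`, `lastUsedAt`), and function names are all assumptions, since the document does not specify them.

```javascript
// Hypothetical per-tier settings; the source does not give rotation periods.
const TIER_POLICY = {
  premium: { rotationDays: 30, priority: 0 },
  standard: { rotationDays: 14, priority: 1 },
  fallback: { rotationDays: 7, priority: 2 },
};
const DAY_MS = 24 * 60 * 60 * 1000;

// An item can be reused if it was never used or its rotation period has passed.
function isReusable(item, now) {
  if (item.lastUsedAt === null) return true;
  const policy = TIER_POLICY[item.tier];
  return now - item.lastUsedAt >= policy.rotationDays * DAY_MS;
}

// Picks the best reusable item (highest tier, then least used) and records the usage.
function pickFromStock(stock, now) {
  const candidates = stock.filter((item) => isReusable(item, now));
  candidates.sort(
    (a, b) =>
      TIER_POLICY[a.tier].priority - TIER_POLICY[b.tier].priority ||
      a.usageCount - b.usageCount
  );
  const chosen = candidates.length > 0 ? candidates[0] : null;
  if (chosen) {
    chosen.usageCount += 1;
    chosen.lastUsedAt = now;
  }
  return chosen;
}
```

Timestamps are plain millisecond values here, so `now` can be passed in explicitly, which keeps the logic easy to test.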

### API design
Primary endpoint: `GET /api/v1/news/search` with parameters for race_code, product_context, scoring filters.
Returns scored results with metadata and source attribution.
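
As a hedged illustration, a client could build a request URL like this. Only the path, `race_code`, and `product_context` come from this document; the host, the `min_score` parameter name, and the helper name `buildSearchUrl` are assumptions. The race code check assumes the "352-1" format means digits, a hyphen, then digits.

```javascript
// Builds a query URL for the search endpoint.
// Assumptions: `min_score` is a hypothetical scoring filter, and the
// race code format is read as <digits>-<digits> (e.g. "352-1").
function buildSearchUrl(baseUrl, { raceCode, productContext, minScore }) {
  if (typeof raceCode !== "string" || !/^\d+-\d+$/.test(raceCode)) {
    throw new Error(`Invalid race code: ${raceCode}`);
  }
  const url = new URL("/api/v1/news/search", baseUrl);
  url.searchParams.set("race_code", raceCode);
  if (productContext !== undefined) {
    url.searchParams.set("product_context", productContext);
  }
  if (minScore !== undefined) {
    url.searchParams.set("min_score", String(minScore));
  }
  return url.toString();
}
```

Calling `buildSearchUrl("http://localhost:3000", { raceCode: "352-1", productContext: "dog food", minScore: 0.6 })` yields a URL whose query string carries the three parameters, with values percent-encoded by `URLSearchParams`.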

### Cascading source strategy
1. **Specialized sources** (breed clubs, specialized sites)
2. **Animal media** (pet magazines, vet sites)
3. **General fallback** (adapted mainstream content)
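
The cascade above can be sketched as a loop that only descends to the next tier when the previous ones did not return enough content. The `fetchItems` functions, the `minResults` threshold, and the tier labels are hypothetical.

```javascript
// Tries each source tier in order and stops once enough items are collected.
// `sources` is an ordered array of { tier, fetchItems } objects (assumed shape).
async function fetchWithCascade(sources, raceCode, minResults) {
  const collected = [];
  for (const { tier, fetchItems } of sources) {
    const items = await fetchItems(raceCode);
    for (const item of items) {
      collected.push({ ...item, tier }); // remember which tier produced the item
    }
    if (collected.length >= minResults) break; // enough content, stop descending
  }
  return collected;
}
```

Tagging each item with its tier keeps source attribution available for scoring later.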

## Important constraints
- API-first design for multiple clients
- Zero prompt injection tolerance
- Stock coverage: 50+ sources per popular breed
- Numeric race (breed) codes only ("352-1" format)
- Source diversity and quality balance
- Highly modular architecture: strict interfaces, interchangeable components
- News Provider: LLM by default, scraping/hybrid via configuration
- Storage: JSON by default, MongoDB/PostgreSQL via a Repository interface
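
To illustrate the strict-interface idea, here is a minimal sketch of a Repository contract with a configuration-driven factory. The method names (`save`, `findByRaceCode`), the `storage` config key, and the in-memory backend are assumptions; the in-memory class only stands in for the JSON default so the example stays self-contained.

```javascript
// Hypothetical contract every storage backend must satisfy.
const REPOSITORY_METHODS = ["save", "findByRaceCode"];

// Fails fast if a backend does not implement the full contract.
function assertRepository(candidate) {
  for (const method of REPOSITORY_METHODS) {
    if (typeof candidate[method] !== "function") {
      throw new TypeError(`Repository is missing method: ${method}`);
    }
  }
  return candidate;
}

// Stand-in backend; a JSON, MongoDB, or PostgreSQL backend would expose the same methods.
class InMemoryRepository {
  constructor() {
    this.items = new Map();
  }
  async save(article) {
    this.items.set(article.id, article);
    return article;
  }
  async findByRaceCode(raceCode) {
    return [...this.items.values()].filter((a) => a.raceCode === raceCode);
  }
}

// Backends are registered by name so configuration alone selects one.
const BACKENDS = {
  memory: () => new InMemoryRepository(),
};

function createRepository(config) {
  const factory = BACKENDS[config.storage];
  if (!factory) throw new Error(`Unknown storage backend: ${config.storage}`);
  return assertRepository(factory());
}
```

Callers depend only on the contract, so swapping the backend means registering another factory rather than changing client code.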

## Attention points
- Specialized sources = highest injection risk + highest value
- Stock management crucial for performance and cost
- Scoring algorithm must adapt to different client needs
- Background processing for stock refresh and cleanup

## Integrations
- **PublicationAutomator**: Primary client for daily article generation
- **Future clients**: Newsletter systems, social media content, competitive intelligence
- **External APIs**: Google News, RSS feeds, specialized pet industry sources
- **Monitoring**: Health checks, usage tracking, source reliability metrics

## ⚠️ IMPORTANT - TODO MANAGEMENT
**CRITICAL**: This project is complex, with 25+ interdependent components. Rigorous task management via a todo list is MANDATORY in order to:
- Avoid forgetting critical elements (security, performance, integrations)
- Maintain consistency across development phases
- Ensure complete coverage of the CDC specifications
- Allow precise progress tracking

**Absolute rule**: Use TodoWrite for ALL non-trivial development work on this project. The 447 lines of the CDC represent a considerable scope that requires a methodical approach.