sourcefinder/CLAUDE.md
Alexis Trouvé a7bd6115b7
feat: Complete implementation of the SourceFinder system with tests
- Modular architecture with dependency injection
- Intelligent multi-factor scoring system (specificity, freshness, quality, reuse)
- Four-layer anti-injection engine (preprocessing, patterns, semantics, penalties)
- Complete REST API with validation and rate limiting
- JSON repository with in-memory index and automatic backup
- Modular LLM provider for content generation
- Complete test suite (Jest):
  * Unit tests for security and scoring
  * End-to-end API integration tests
  * Security tests with attack simulation
  * Performance and load tests
- CI/CD pipeline with GitHub Actions
- Structured logging and monitoring
- ESLint configuration and test environment

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-15 23:06:10 +08:00

# SourceFinder
## Context
Microservice for intelligent news sourcing and scoring. Provides scored, filtered news content via API for content generation clients like PublicationAutomator.
**Goal**: Reusable news service with anti-prompt injection protection, intelligent scoring, and stock management.
## Architecture
```
[API Request] → [Stock Search] → [Live Scraping if needed] → [Scoring] → [Anti-injection] → [Filtered Results]
```
**Role**: Independent news sourcing service for multiple content generation clients.
**Stack**: Node.js + Express, highly modular architecture, JSON storage (swappable for MongoDB/PostgreSQL), Redis cache, modular News Provider (LLM by default, scraping/hybrid modes available).
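The request flow in the diagram above could be wired roughly as follows. This is a minimal sketch, not the project's actual code: `handleSearch`, the `stages` object, and the `minResults` field are hypothetical names introduced for illustration, and each stage is injected so it can be swapped or mocked.

```javascript
// Sketch of the request pipeline. Every stage is injected through `stages`
// (hypothetical shape): searchStock, scrapeLive, score, isSuspicious.
async function handleSearch(query, stages) {
  // 1. Stock search
  let items = await stages.searchStock(query);

  // 2. Live scraping only if the stock does not cover the request
  if (items.length < query.minResults) {
    items = items.concat(await stages.scrapeLive(query));
  }

  // 3. Scoring
  const scored = items.map((item) => ({ ...item, score: stages.score(item, query) }));

  // 4. Anti-injection: drop anything flagged as suspicious
  const safe = scored.filter((item) => !stages.isSuspicious(item));

  // 5. Filtered results, best score first
  return safe.sort((a, b) => b.score - a.score);
}
```

Passing the stages in as arguments keeps the pipeline itself free of storage or provider details, which fits the interchangeable-components constraint.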
## Reference documents
- `CDC.md` - Complete technical specifications and algorithms
- `config/sources.json` - Sources configuration and scraping rules
- `docs/api.md` - API documentation and examples
## Key technical elements
### Intelligent scoring system
```
Score = (Race_specificity × 0.4) + (Freshness × 0.3) + (Source_quality × 0.2) + (Anti_duplication × 0.1)
```
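As a worked example, the weighted sum can be sketched as below. The weights come from the formula; the camelCase factor names and the assumption that each factor is normalized to 0..100 are illustrative choices, not taken from `CDC.md`.

```javascript
// Weights from the formula above; they sum to 1.0.
const WEIGHTS = {
  raceSpecificity: 0.4,
  freshness: 0.3,
  sourceQuality: 0.2,
  antiDuplication: 0.1,
};

// Assumption: every factor is already normalized to the 0..100 range.
function computeScore(factors) {
  let total = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    const value = factors[name];
    if (!Number.isFinite(value) || value < 0 || value > 100) {
      throw new RangeError(`Factor "${name}" must be a number between 0 and 100`);
    }
    total += value * weight;
  }
  // Round to two decimals to hide floating-point noise.
  return Math.round(total * 100) / 100;
}

// 100 × 0.4 + 80 × 0.3 + 50 × 0.2 + 100 × 0.1 = 84
const exampleScore = computeScore({
  raceSpecificity: 100,
  freshness: 80,
  sourceQuality: 50,
  antiDuplication: 100,
});
console.log(exampleScore);
```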
### Multi-layer anti-prompt injection
- Content preprocessing with pattern detection
- Semantic validation
- Source scoring with security penalties
- Quarantine of suspicious content
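A minimal sketch of the preprocessing, pattern, penalty, and quarantine layers listed above. The patterns, `PENALTY_PER_MATCH`, and `QUARANTINE_THRESHOLD` are hypothetical placeholders, and the semantic validation layer is deliberately left out because it depends on the chosen model or classifier.

```javascript
// Hypothetical patterns for illustration; a real list would be larger and reviewed.
const INJECTION_PATTERNS = [
  /ignore\s+(all\s+)?(previous|prior)\s+instructions/i,
  /you\s+are\s+now\s+/i,
  /system\s*prompt/i,
];

const PENALTY_PER_MATCH = 30; // assumed score penalty per matched pattern
const QUARANTINE_THRESHOLD = 2; // assumed number of matches that triggers quarantine

// Preprocessing: normalize the text so obfuscated payloads are easier to match.
function preprocess(text) {
  return text
    .normalize("NFKC")
    .replace(/[\u200B-\u200D\uFEFF]/g, "") // strip zero-width characters
    .replace(/<[^>]*>/g, " ") // strip HTML tags
    .replace(/\s+/g, " ")
    .trim();
}

// Pattern detection + security penalty + quarantine decision.
function analyzeContent(rawText, baseScore) {
  const cleaned = preprocess(rawText);
  const matches = INJECTION_PATTERNS.filter((pattern) => pattern.test(cleaned));
  return {
    cleaned,
    matchCount: matches.length,
    score: Math.max(0, baseScore - matches.length * PENALTY_PER_MATCH),
    quarantined: matches.length >= QUARANTINE_THRESHOLD,
  };
}
```

Pattern matching alone only catches known phrasings, so it is best seen as a cheap first filter in front of the semantic layer.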
### Smart stock management
Three-tier system: Premium (studies, official sources), Standard (specialized news), Fallback (general content).
Reuse logic with rotation periods and usage tracking.
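The tier and reuse rules could be sketched as follows. The 30-day rotation period, the item fields (`tier`, `lastUsedAt`, `usageCount`), and the "least used first" tie-break are assumptions made for illustration; the real values belong in `CDC.md`.

```javascript
const TIER_PRIORITY = { premium: 0, standard: 1, fallback: 2 };
const ROTATION_DAYS = 30; // assumed rotation period
const DAY_MS = 24 * 60 * 60 * 1000;

// An item is reusable if it was never used or its rotation period has elapsed.
function isReusable(item, now) {
  return item.lastUsedAt === null || now - item.lastUsedAt >= ROTATION_DAYS * DAY_MS;
}

// Pick the best available item: highest tier first, then least used.
// Records the usage on the chosen item; returns null when nothing is available.
function pickFromStock(stock, now = Date.now()) {
  const candidates = stock
    .filter((item) => isReusable(item, now))
    .sort(
      (a, b) =>
        TIER_PRIORITY[a.tier] - TIER_PRIORITY[b.tier] || a.usageCount - b.usageCount
    );
  if (candidates.length === 0) return null;

  const chosen = candidates[0];
  chosen.lastUsedAt = now;
  chosen.usageCount += 1;
  return chosen;
}
```

A `null` result is the signal that the stock is exhausted for this request and live sourcing is needed.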
### API design
Primary endpoint: `GET /api/v1/news/search` with parameters for `race_code`, `product_context`, and scoring filters.
Returns scored results with metadata and source attribution.
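A client could build the request URL as in the sketch below. The endpoint path, `race_code`, and `product_context` come from this document; the `min_score` filter and the `http://localhost:3000` base URL are hypothetical, so check `docs/api.md` for the actual parameter names.

```javascript
// Build a search URL with the standard URL API available in Node.js.
function buildSearchUrl(baseUrl, { raceCode, productContext, minScore }) {
  const url = new URL("/api/v1/news/search", baseUrl);
  url.searchParams.set("race_code", raceCode);
  if (productContext !== undefined) {
    url.searchParams.set("product_context", productContext);
  }
  if (minScore !== undefined) {
    // `min_score` is a hypothetical scoring filter used for illustration.
    url.searchParams.set("min_score", String(minScore));
  }
  return url.toString();
}

const searchUrl = buildSearchUrl("http://localhost:3000", {
  raceCode: "352-1",
  productContext: "dry food",
  minScore: 70,
});
console.log(searchUrl);
// http://localhost:3000/api/v1/news/search?race_code=352-1&product_context=dry+food&min_score=70
```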
### Cascading source strategy
1. **Specialized sources** (breed clubs, specialized sites)
2. **Animal media** (pet magazines, vet sites)
3. **General fallback** (adapted mainstream content)
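The cascade above can be sketched as a loop that queries each tier in order and stops as soon as enough results are collected. The tier objects (`name`, `fetch`) and the `minResults` threshold are hypothetical names used for illustration.

```javascript
// Query source tiers in priority order; stop once enough results are collected.
// `tiers` is an ordered array of { name, fetch } where fetch(raceCode) resolves to an array.
async function cascadeSearch(tiers, raceCode, minResults) {
  const results = [];
  for (const tier of tiers) {
    const found = await tier.fetch(raceCode);
    for (const item of found) {
      // Tag each item with its tier so scoring can take the origin into account.
      results.push({ ...item, tier: tier.name });
    }
    if (results.length >= minResults) break;
  }
  return results;
}
```

Lower tiers are only reached when the specialized sources fall short, which limits how much generic content is scraped and adapted.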
## Important constraints
- API-first design for multiple clients
- Zero prompt injection tolerance
- Stock coverage: 50+ sources per popular breed
- Numeric race codes only ("352-1" format)
- Source diversity and quality balance
- Highly modular architecture: strict interfaces, interchangeable components
- News Provider: LLM by default, scraping/hybrid via configuration
- Storage: JSON by default, MongoDB/PostgreSQL via the Repository interface
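To illustrate the Repository and race code constraints, here is a minimal sketch of an in-memory implementation. The method names (`save`, `findByRaceCode`), the item shape, and the regular expression inferred from the "352-1" example are assumptions, not the project's actual interface.

```javascript
// Assumed format, inferred from the "352-1" example: digits, a hyphen, digits.
const RACE_CODE_PATTERN = /^\d+-\d+$/;

function assertRaceCode(code) {
  if (typeof code !== "string" || !RACE_CODE_PATTERN.test(code)) {
    throw new TypeError(`Invalid race code: ${JSON.stringify(code)}`);
  }
}

// In-memory stand-in for the JSON store. Methods are async so that a
// MongoDB or PostgreSQL implementation could expose the same shape.
class InMemoryNewsRepository {
  constructor() {
    this.items = new Map();
  }

  async save(item) {
    assertRaceCode(item.raceCode);
    this.items.set(item.id, { ...item });
  }

  async findByRaceCode(raceCode) {
    assertRaceCode(raceCode);
    return [...this.items.values()].filter((item) => item.raceCode === raceCode);
  }
}

// Callers depend only on the method names, not on the storage backend.
async function countForRace(repository, raceCode) {
  return (await repository.findByRaceCode(raceCode)).length;
}
```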
## Attention points
- Specialized sources = highest injection risk + highest value
- Stock management crucial for performance and cost
- Scoring algorithm must adapt to different client needs
- Background processing for stock refresh and cleanup
## Integrations
- **PublicationAutomator**: Primary client for daily article generation
- **Future clients**: Newsletter systems, social media content, competitive intelligence
- **External APIs**: Google News, RSS feeds, specialized pet industry sources
- **Monitoring**: Health checks, usage tracking, source reliability metrics
## ⚠️ IMPORTANT - TODO MANAGEMENT
**CRITICAL**: This project is complex, with 25+ interdependent components. Rigorous task management via a todo list is MANDATORY in order to:
- Avoid forgetting critical elements (security, performance, integrations)
- Maintain consistency across development phases
- Ensure complete coverage of the CDC specifications
- Allow precise progress tracking
**Absolute rule**: Use TodoWrite for ALL non-trivial development on this project. The 447 lines of the CDC represent a considerable scope that requires a methodical approach.