# SourceFinder

## Context
Microservice for intelligent news sourcing and scoring. Provides scored, filtered news content via API for content generation clients like PublicationAutomator.

**Goal**: Reusable news service with anti-prompt injection protection, intelligent scoring, and stock management.

## Architecture
```
[API Request] → [Stock Search] → [Live Scraping if needed] → [Scoring] → [Anti-injection] → [Filtered Results]
```

**Role**: Independent news sourcing service for multiple content generation clients.

**Stack**: Node.js + Express, highly modular architecture, JSON storage (swappable for MongoDB/PostgreSQL), Redis cache, modular News Provider (LLM by default; scraping and hybrid modes available).

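The request flow in the diagram above could be orchestrated roughly as follows. This is a minimal sketch, not the project's actual code: `searchNews`, the four injected stage functions (`searchStock`, `scrapeLive`, `score`, `isSafe`), and the `minResults` threshold are all hypothetical names chosen for illustration.

```javascript
// Hypothetical orchestration of the request flow. Every stage is injected,
// so each one can be swapped out or stubbed in tests.
async function searchNews(query, stages, minResults = 3) {
  const { searchStock, scrapeLive, score, isSafe } = stages;

  // 1. Look in the existing stock first.
  let items = await searchStock(query);

  // 2. Fall back to live scraping only when the stock is insufficient.
  if (items.length < minResults) {
    items = items.concat(await scrapeLive(query));
  }

  // 3. Score each item, 4. drop anything the anti-injection check rejects,
  // 5. return the best results first.
  return items
    .map((item) => ({ ...item, score: score(item, query) }))
    .filter((item) => isSafe(item))
    .sort((a, b) => b.score - a.score);
}
```

Passing the stages in as arguments keeps the orchestration independent of any particular storage backend or News Provider, which fits the modular stack described above.
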
## Reference documents
- `CDC.md` - Complete technical specifications and algorithms
- `config/sources.json` - Sources configuration and scraping rules
- `docs/api.md` - API documentation and examples

## Key technical elements

### Intelligent scoring system
```
Score = (Race_specificity × 0.4) + (Freshness × 0.3) + (Source_quality × 0.2) + (Anti_duplication × 0.1)
```

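The weighted sum above could be computed along these lines. This is a minimal sketch: the weights come from the formula, but the camelCase factor names and the assumption that every factor is already normalized to the 0 to 100 range are illustrative choices, not something stated here.

```javascript
// Weights taken from the scoring formula.
const WEIGHTS = {
  raceSpecificity: 0.4,
  freshness: 0.3,
  sourceQuality: 0.2,
  antiDuplication: 0.1,
};

// Assumption: every factor is already normalized to the 0..100 range.
function computeScore(factors) {
  let total = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    const value = factors[name];
    if (typeof value !== "number" || value < 0 || value > 100) {
      throw new RangeError(`${name} must be a number between 0 and 100`);
    }
    total += value * weight;
  }
  return total;
}
```

Because the weights sum to 1, the final score stays in the same range as the individual factors, which makes thresholds easier to reason about.
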
### Multi-layer anti-prompt injection
- Content preprocessing with pattern detection
- Semantic validation
- Source scoring with security penalties
- Quarantine suspicious content

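As a rough illustration of how the preprocessing, pattern detection, penalty, and quarantine layers might fit together, consider the sketch below. The pattern list, the `PENALTY_PER_MATCH` and `QUARANTINE_THRESHOLD` values, and the `inspectContent` function are all hypothetical; the semantic validation layer is deliberately left out because it cannot be reduced to a few lines.

```javascript
// Illustrative patterns only. A real deployment would need a much broader,
// maintained list, plus the semantic validation layer.
const SUSPICIOUS_PATTERNS = [
  /ignore\s+(all\s+)?(previous|prior)\s+instructions/i,
  /you\s+are\s+now\b/i,
  /system\s+prompt/i,
];

const PENALTY_PER_MATCH = 30; // hypothetical penalty, in score points
const QUARANTINE_THRESHOLD = 2; // hypothetical: quarantine at 2+ matches

function inspectContent(text) {
  // Preprocessing: strip zero-width characters and collapse whitespace so
  // that trivially obfuscated phrases still match.
  const normalized = text
    .normalize("NFKC")
    .replace(/[\u200B-\u200D\uFEFF]/g, "")
    .replace(/\s+/g, " ");

  // Pattern detection: count how many distinct patterns match.
  const matches = SUSPICIOUS_PATTERNS.filter((p) => p.test(normalized));

  return {
    matchCount: matches.length,
    penalty: matches.length * PENALTY_PER_MATCH,
    quarantined: matches.length >= QUARANTINE_THRESHOLD,
  };
}
```

Pattern matching of this kind only catches known phrasings, which is why the list above treats it as one layer among several rather than a complete defense.
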
### Smart stock management
Three-tier system: Premium (studies, official sources), Standard (specialized news), Fallback (general content).
Reuse logic with rotation periods and usage tracking.

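The tier ordering and reuse logic might look something like the following sketch. The tier names come from the text; the `ROTATION_DAYS` and `MAX_USES` values, the item fields (`tier`, `usageCount`, `lastUsedAt`), and both functions are assumptions made for illustration.

```javascript
// Lower number = preferred tier.
const TIER_PRIORITY = { premium: 0, standard: 1, fallback: 2 };

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical policy values; the text does not specify them.
const ROTATION_DAYS = 30;
const MAX_USES = 3;

// An item can be served again if it has uses left and its rotation
// period has elapsed since it was last served.
function isReusable(item, now) {
  if (item.usageCount >= MAX_USES) return false;
  if (item.lastUsedAt === null) return true; // never served yet
  return now - item.lastUsedAt >= ROTATION_DAYS * DAY_MS;
}

// Returns reusable items, best tier first, least-used first within a tier.
function pickFromStock(stock, now = Date.now()) {
  return stock
    .filter((item) => isReusable(item, now))
    .sort(
      (a, b) =>
        TIER_PRIORITY[a.tier] - TIER_PRIORITY[b.tier] ||
        a.usageCount - b.usageCount
    );
}
```

Taking `now` as a parameter keeps the rotation check deterministic in tests.
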
### API design
Primary endpoint: `GET /api/v1/news/search` with parameters for race_code, product_context, scoring filters.
Returns scored results with metadata and source attribution.

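A client could assemble a request to this endpoint as sketched below. Only the path and the `race_code` and `product_context` parameter names come from the text; the host in `BASE_URL`, the `min_score` filter name, and the `buildSearchUrl` helper are assumptions. See `docs/api.md` for the actual parameter list.

```javascript
// Hypothetical host; adjust to wherever the service is deployed.
const BASE_URL = "http://localhost:3000";

function buildSearchUrl({ raceCode, productContext, minScore }) {
  const url = new URL("/api/v1/news/search", BASE_URL);
  url.searchParams.set("race_code", raceCode);
  if (productContext) {
    url.searchParams.set("product_context", productContext);
  }
  // `min_score` is an assumed name for one of the scoring filters.
  if (minScore !== undefined) {
    url.searchParams.set("min_score", String(minScore));
  }
  return url.toString();
}
```

Building the query with `URLSearchParams` rather than string concatenation ensures that values such as a multi-word product context are encoded correctly.
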
### Cascading source strategy
1. **Specialized sources** (breed clubs, specialized sites)
2. **Animal media** (pet magazines, vet sites)
3. **General fallback** (adapted mainstream content)

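One way to express the cascade above is to query each tier in order and stop as soon as enough articles have been collected. In this sketch, `cascadeFetch`, the `fetchArticles` functions, and the `wanted` count are hypothetical; the real stopping criterion may also depend on scores.

```javascript
// `tiers` is an ordered array of { name, fetchArticles }, where each
// hypothetical `fetchArticles` resolves to an array of articles.
async function cascadeFetch(raceCode, tiers, wanted = 5) {
  const results = [];
  for (const { name, fetchArticles } of tiers) {
    if (results.length >= wanted) break; // earlier tiers were enough
    const articles = await fetchArticles(raceCode);
    for (const article of articles) {
      // Record which tier the article came from.
      results.push({ ...article, tier: name });
    }
  }
  return results.slice(0, wanted);
}
```

Lower tiers are only contacted when the higher ones fall short, so general fallback content never displaces specialized sources.
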
## Important constraints
- API-first design for multiple clients
- Zero prompt injection tolerance
- Stock coverage: 50+ sources per popular breed
- Numeric race codes only ("352-1" format)
- Source diversity and quality balance
- Ultra-modular architecture: strict interfaces, interchangeable components
- News Provider: LLM by default, scraping/hybrid via configuration
- Storage: JSON by default, MongoDB/PostgreSQL via the Repository interface

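The numeric race code constraint in the list above lends itself to a simple input check. The pattern below is an assumption inferred from the single "352-1" example (digits, a hyphen, then digits); the exact digit counts are not specified here, and `isValidRaceCode` is a hypothetical helper.

```javascript
// Assumed shape, inferred from the "352-1" example: one or more digits,
// a hyphen, then one or more digits.
const RACE_CODE_PATTERN = /^\d+-\d+$/;

function isValidRaceCode(value) {
  return typeof value === "string" && RACE_CODE_PATTERN.test(value);
}
```

Rejecting anything else at the API boundary means breed names or free text never reach the search and scoring layers.
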
## Attention points
- Specialized sources = highest injection risk + highest value
- Stock management crucial for performance and cost
- Scoring algorithm must adapt to different client needs
- Background processing for stock refresh and cleanup

## Integrations
- **PublicationAutomator**: Primary client for daily article generation
- **Future clients**: Newsletter systems, social media content, competitive intelligence
- **External APIs**: Google News, RSS feeds, specialized pet industry sources
- **Monitoring**: Health checks, usage tracking, source reliability metrics

## ⚠️ IMPORTANT - TODO MANAGEMENT
**CRITICAL**: This project is complex, with 25+ interdependent components. Rigorous task management through a todo list is MANDATORY in order to:
- Avoid overlooking critical elements (security, performance, integrations)
- Maintain consistency between development phases
- Ensure complete coverage of the CDC specifications
- Allow precise progress tracking

**Absolute rule**: Use TodoWrite for ALL non-trivial development work on this project. The 447 lines of the CDC represent a considerable scope that requires a methodical approach.