# 🏗️ ARCHITECTURE DECISIONS - SourceFinder

*Complete summary of the technical decisions made during the analysis*

---

## 🎯 1. WHY EXPRESS.JS?

### Alternatives considered

| Framework | Advantages | Drawbacks |
|-----------|------------|-----------|
| **Express.js** | Mature, large ecosystem, flexible | More verbose, manual configuration |
| **Fastify** | Higher performance, built-in TypeScript support | Smaller ecosystem |
| **Koa.js** | Modern (async/await), lightweight | Fewer ready-made middleware |
| **NestJS** | Enterprise-ready, TypeScript, dependency injection | Complexity, learning curve |

### Decision: Express.js ✅

**Key justifications:**

1. **Mature ecosystem for our specific needs**

   ```javascript
   // Critical middleware available off the shelf
   const helmet = require('helmet');
   const rateLimit = require('express-rate-limit');
   const cors = require('cors');

   app.use(helmet());    // Security headers
   app.use(rateLimit()); // Rate limiting (in-memory by default; a Redis store can be plugged in)
   app.use(cors());      // CORS for multiple clients
   ```

2. **Flexibility for a service-oriented architecture**

   ```javascript
   // Service-oriented pattern, a good fit for our specification (CDC)
   const scoringService = require('./services/scoringService');
   const securityService = require('./services/securityService');

   // Assumes express.json() is registered so that req.body is parsed
   app.post('/api/v1/news/search', async (req, res, next) => {
     try {
       // Validation → Scoring → Security → Response
       const results = await scoringService.searchAndScore(req.body);
       const sanitized = await securityService.validateContent(results);
       res.json(sanitized);
     } catch (err) {
       next(err); // Express 4 does not catch rejected promises on its own
     }
   });
   ```

3. **Performance that matches our constraints**

   ```
   Spec (CDC) requirement: "Responses < 5 seconds"
   Express throughput: ~15,000 req/sec (more than enough)
   Our bottleneck: web scraping & DB queries, not the framework
   ```

4. **Essential security middleware**

   ```javascript
   // Anti-prompt-injection pipeline
   app.use('/api/v1/news', [
     authMiddleware,           // API key validation
     rateLimitingMiddleware,   // Prevent abuse
     contentValidation,        // Input sanitization
     promptInjectionDetection  // Our custom middleware
   ]);
   ```

**Express overhead = 0.3%** of total request time → negligible.
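
To make the last step of the pipeline above more concrete, here is a minimal sketch of what a `promptInjectionDetection` middleware could look like. Only the middleware name comes from this document. The pattern list, the `collectStrings` helper and the 400 response shape are assumptions made for illustration, and a real detector would need a much richer rule set.

```javascript
// Hypothetical patterns (assumption): a real deployment would tune and extend this list.
const INJECTION_PATTERNS = [
  /ignore\s+(all\s+)?(previous|prior|above)\s+instructions/i,
  /disregard\s+(the\s+)?(system|previous)\s+prompt/i,
  /you\s+are\s+now\s+(a|an)\s+/i,
  /<\s*\/?\s*system\s*>/i,
];

// Hypothetical helper: recursively collect every string value of a JSON-like body.
function collectStrings(value, out = []) {
  if (typeof value === 'string') {
    out.push(value);
  } else if (Array.isArray(value)) {
    value.forEach((v) => collectStrings(v, out));
  } else if (value && typeof value === 'object') {
    Object.values(value).forEach((v) => collectStrings(v, out));
  }
  return out;
}

// Express-style middleware signature: (req, res, next).
function promptInjectionDetection(req, res, next) {
  const suspicious = collectStrings(req.body).some((text) =>
    INJECTION_PATTERNS.some((pattern) => pattern.test(text))
  );
  if (suspicious) {
    // Reject early so the payload never reaches scoring or the LLM
    return res.status(400).json({ error: 'Suspicious input rejected' });
  }
  return next();
}
```

Because it only relies on `req.body`, `res.status().json()` and `next()`, the function can be exercised with plain mock objects before being mounted on `/api/v1/news`. Pattern matching of this kind is a first filter, not a complete defence.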

---

## 🗄️ 2. STORAGE: MODULAR JSON vs TRADITIONAL DATABASES

### Initial problem

The specification (CDC) called for MongoDB/PostgreSQL, but the project needs simplicity and modularity.

### Decision: JSON by default, behind a modular interface ✅

**Chosen architecture:**

```javascript
// Record shape handled by NewsStockRepository (adaptable to JSON/MongoDB/PostgreSQL)
const newsItemSchema = {
  id: String,
  url: String,              // unique
  title: String,
  content: String,
  content_hash: String,

  // Classification
  race_tags: [String],      // breed tags: ["352-1", "bergers", "grands_chiens"]
  angle_tags: [String],     // ["legislation", "sante", "comportement"]
  universal_tags: [String], // ["conseils_proprietaires", "securite"]

  // Scoring
  freshness_score: Number,
  quality_score: Number,
  specificity_score: Number,
  reusability_score: Number,
  final_score: Number,

  // Usage tracking
  usage_count: Number,
  last_used: Date,
  created_at: Date,
  expires_at: Date,

  // Metadata
  source_domain: String,
  source_type: String,      // "premium", "standard", "fallback"
  language: String,
  status: String            // "active", "expired", "blocked"
};

// Default implementation: JSON files with an in-memory index
// Migration to MongoDB/PostgreSQL is possible without changing business code
```

**Advantages of the modular approach:**

1. **Simplicity**: no MongoDB/PostgreSQL setup needed to get started
2. **Performance**: in-memory index for fast lookups
3. **Flexibility**: the database can change without touching business logic
4. **Scalability**: transparent migration when it becomes necessary
5. **Development**: focus on the scoring/scraping logic first

**Repository pattern with adapters:**

```javascript
// Abstract interface
class NewsStockRepository {
  async findByRaceCode(raceCode) { throw new Error('Not implemented'); }
  async findByScore(minScore) { throw new Error('Not implemented'); }
  async save(newsItem) { throw new Error('Not implemented'); }
}

// JSON implementation
class JSONStockRepository extends NewsStockRepository {
  constructor(dataPath) {
    super(); // required before `this` can be used in a derived class
    this.dataPath = dataPath;
    this.memoryIndex = new Map(); // in-memory index for performance
  }
}

// Future implementations
class MongoStockRepository extends NewsStockRepository { /* ... */ }
class PostgreSQLStockRepository extends NewsStockRepository { /* ... */ }
```

---

## 🕷️ 3. SCRAPING STRATEGY: HOW THE APPROACH EVOLVED

### 3.1 Initial approach: traditional scraping

**The analysis showed that the complexity had been underestimated:**

#### The "easy" part (20% of the work)

```javascript
// Basic scraping - working in 30 minutes
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://30millionsdamis.fr');
  const html = await page.content();
  await browser.close();

  const $ = cheerio.load(html);
  // One entry per title; .text() on the whole selection would concatenate them
  const articles = $('.article-title').map((i, el) => $(el).text().trim()).get();
})();
```

#### Medium challenges (30% of the work)

- Sites with dynamic JavaScript
- Intelligent rate limiting
- Parsing variable structures

#### **High complexity (50% of the work)**

- Sophisticated anti-bot systems (Cloudflare, reCAPTCHA)
- Specialized sites are also the most protected
- Fragile parsing (one structure change breaks everything)
- Complex error handling

#### **The real nightmares (recurring problems)**

```
Week 1: 50 sources work
Week 3: 30 Millions d'Amis changes its structure → broken
Week 5: Wamiz adds reCAPTCHA → broken
Week 8: Centrale Canine blocks our IP → broken
```

**Realistic time: 4-6 weeks** (vs the 2 weeks budgeted in the spec)

**Aggravating factor:** the **most valuable** sources (breed clubs, veterinary sites) are often the **most protected**.

### 3.2 LLM provider approach

**Concept analysed:**

```javascript
// Instead of scraping + parsing...
const rawHtml = await fetchHtml(url); // hypothetical helper wrapping Puppeteer
const $ = cheerio.load(rawHtml);

// ...we would get the news directly
// (`llmProvider.searchNews` is a hypothetical provider API)
const news = await llmProvider.searchNews({
  query: "Berger Allemand actualités 2025",
  sources: ["specialized", "veterinary", "official"],
  language: "fr"
});
```

**Advantages:**

- Technical simplicity
- Pre-processed content
- Avoids the legal issues around scraping
- No scraping maintenance

**Critical unresolved questions:**

- Which providers can target specialized sources?
- Data freshness (the requirement is < 7 days)?
- Control over prompt injection?
- How does cost scale with volume?

### 3.3 Hybrid approach: LLM + intelligent scraping

**Concept retained:**

```javascript
// The LLM generates the selectors automatically
const scrapingPrompt = `
Analyze this HTML structure and extract news articles:
${htmlContent}

Return JSON with selectors for:
- Article titles
- Article content
- Publication dates
- Article URLs
`;

// `llm` is a hypothetical client; the prompt built above is what gets sent
const selectors = await llm.generateSelectors(scrapingPrompt);
// → { title: '.article-h2', content: '.post-content', date: '.publish-date' }
```

**Advantages of the hybrid approach:**

1. **Automatic adaptation to changes** - the LLM adapts to new page structures
2. **Fast onboarding of new sources** - no need to configure selectors by hand
3. **Intelligent content cleaning** - the LLM cleans up the content

**Hybrid architecture:**

```javascript
class IntelligentScrapingService {
  async scrapeWithLLM(url) {
    // 1. Classic technical scraping
    //    (fetchHtml is a hypothetical method wrapping Puppeteer launch/goto/content)
    const html = await this.fetchHtml(url);

    // 2. The LLM analyses the page structure
    const analysis = await llm.analyzePageStructure(html);

    // 3. Extraction guided by the LLM analysis
    const content = await this.extractWithLLMGuidance(html, analysis);

    // 4. Validation/cleaning by the LLM
    return await llm.validateAndClean(content);
  }
}
```

**Estimated cost:**

```
HTML page = ~50 KB
LLM analysis = ~1,000 input tokens + 200 output tokens
Cost per page ≈ $0.01-0.02 (GPT-4)
50 sources × 5 pages/day = 250 scrapes/day
250 × $0.015 = $3.75/day = ~$110/month
```

This estimate assumes the HTML is reduced to about 1,000 tokens before it is sent to the LLM. At the common rule of thumb of roughly 4 characters per token, a raw 50 KB page would be more than ten times larger, and the cost would scale accordingly.

---

## 🥷 4. FREE ANTI-DETECTION TECHNIQUES

### Budget constraint

- ✅ Paid LLM providers are acceptable
- ❌ Paid proxies (~50-100€/month)
- ❌ External APIs
- ❌ Third-party services

### Free arsenal developed

#### **1. Stealth Browser Framework**

```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Plugin that masks the most common Puppeteer automation signals
puppeteer.use(StealthPlugin());

// (inside an async function)
const browser = await puppeteer.launch({
  headless: 'new', // New headless mode, harder to detect
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-blink-features=AutomationControlled',
    '--disable-features=VizDisplayCompositor'
  ]
});
```

#### **2. Behavioural randomization**

```javascript
const humanLikeBehavior = {
  async randomDelay() {
    const delay = Math.random() * 2000 + 500; // 0.5-2.5 s
    await new Promise((resolve) => setTimeout(resolve, delay));
  },

  async humanScroll(page) {
    // Irregular scrolling, like a human
    for (let i = 0; i < 3; i++) {
      await page.evaluate(() => {
        window.scrollBy(0, Math.random() * 300 + 200);
      });
      await this.randomDelay();
    }
  }
};
```

#### **3. Free TOR rotation**

```javascript
// Controversial but legal technique: TOR rotation
// Requires a local Tor instance with its ControlPort configured
const tor = require('tor-request');

const torRotation = {
  getNewTorSession() {
    // Resetting the TOR circuit yields a new exit IP.
    // tor-request exposes a callback API, so it is wrapped in a Promise here.
    return new Promise((resolve, reject) => {
      tor.renewTorSession((err) => (err ? reject(err) : resolve(tor)));
    });
  }
};
```

#### **4. Browser fingerprint randomization**

```javascript
const freeFingerprinting = {
  async randomizeEverything(page) {
    // Random timezone
    await page.evaluateOnNewDocument(() => {
      const timezones = ['Europe/Paris', 'Europe/London', 'Europe/Berlin'];
      const tz = timezones[Math.floor(Math.random() * timezones.length)];
      const original = Intl.DateTimeFormat.prototype.resolvedOptions;
      // Keep the other resolved options (locale, calendar, ...) and override only the timezone
      Intl.DateTimeFormat.prototype.resolvedOptions = function () {
        return { ...original.call(this), timeZone: tz };
      };
    });

    // Canvas fingerprint randomization
    await page.evaluateOnNewDocument(() => {
      const getContext = HTMLCanvasElement.prototype.getContext;
      HTMLCanvasElement.prototype.getContext = function (type, ...args) {
        const context = getContext.call(this, type, ...args);
        if (type === '2d' && context) {
          const originalFillText = context.fillText;
          context.fillText = function (text, x, ...rest) {
            // Add an invisible micro-variation to the x coordinate
            return originalFillText.call(this, text, x + Math.random() * 0.1, ...rest);
          };
        }
        return context;
      };
    });
  }
};
```

#### **5. Free distributed scraping**

```javascript
// Use several free VPS instances
const distributedScraping = {
  freeVPSProviders: [
    'Oracle Cloud Always Free (ARM)',
    'Google Cloud 3 months free',
    'AWS Free Tier 12 months',
    'Heroku free dynos', // discontinued in November 2022
    'Railway.app free tier'
  ],

  async distributeLoad() {
    // Each VPS scrapes a few sites
    // Coordination through a shared store (our JSON store)
    // splitScrapeTargets and deployToFreeVPS are still to be implemented
    const tasks = this.splitScrapeTargets();
    return this.deployToFreeVPS(tasks);
  }
};
```

### Complete free stack selected

```javascript
const freeStack = {
  browser: 'puppeteer-extra + stealth (free)',
  proxies: 'TOR rotation + free proxy scrapers',
  userAgents: 'Scraped from free UA databases',
  timing: 'Free pattern analysis',
  fingerprinting: 'Manual randomization',
  distribution: 'VPS free tiers',
  storage: 'Local JSON (already planned)',
  cache: 'Local Redis (free)',
  llm: 'OpenAI/Claude, paid (accepted)'
};
```

### Expected performance

| Technique | Success rate | Maintenance |
|-----------|--------------|-------------|
| **TOR + stealth** | 70-80% | Medium |
| **Free proxies** | 40-60% | High |
| **Fingerprint randomization** | +15% | Low |
| **LLM evasion** | +20% | Low |
| **Distributed VPS** | +25% | High |

**Combined result: ~80-85% success** (vs 95% with paid proxies)

---

## 🎯 FINAL ARCHITECTURE DECISIONS

### 1. **Framework: Express.js**

- Mature ecosystem for security
- Anti-prompt-injection middleware
- Performance sufficient for our needs

### 2. **Storage: modular JSON**

- Abstract Repository interface
- JSON by default, with a migration path to MongoDB/PostgreSQL
- In-memory index for performance

### 3. **Scraping: hybrid LLM + free techniques**

- LLM for intelligence and adaptation
- Puppeteer-extra + stealth for the technical layer
- TOR + fingerprinting for anti-detection
- Budget: 0€ infrastructure + the cost of LLM tokens

### 4. **Overall architecture**

```
[API Request] → [Auth/Rate Limiting] → [Stock Search JSON]
    → [LLM-Guided Scraping if needed] → [Intelligent Scoring]
    → [Anti-injection Validation] → [Filtered Results]
```

**Total infrastructure cost: 0€/month** (excluding LLM tokens, estimated at ~$110/month in section 3.3)
**Expected scraping success rate: 80-85%**
**Development time: within the 155 h budget**

This architecture makes it possible to **start quickly** on a **minimal budget** while keeping the **flexibility to evolve** towards more robust solutions if the project scales.

---

*Summary of the technical decisions made during the discussions of 15 September 2025*
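
As a closing illustration of decision 2 (modular JSON storage with an in-memory index), the sketch below shows one way the JSON adapter could be filled in. The method names `findByRaceCode`, `findByScore` and `save` come from the repository interface in section 2. The internal `items` and `raceIndex` maps, the `index` method and the single-file layout are assumptions made for this example, and the class is written without `extends NewsStockRepository` only so that the block stays self-contained.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sketch of a JSON-backed repository (assumption: one JSON array per file).
class JSONStockRepository {
  constructor(dataPath) {
    this.dataPath = dataPath;
    this.items = new Map();     // id -> news item
    this.raceIndex = new Map(); // race tag -> Set of ids
    if (fs.existsSync(dataPath)) {
      JSON.parse(fs.readFileSync(dataPath, 'utf8')).forEach((item) => this.index(item));
    }
  }

  // Insert or update an item in the in-memory index.
  index(item) {
    const previous = this.items.get(item.id);
    if (previous) {
      // Drop stale tag entries before re-indexing an updated item
      (previous.race_tags || []).forEach((tag) => this.raceIndex.get(tag)?.delete(item.id));
    }
    this.items.set(item.id, item);
    for (const tag of item.race_tags || []) {
      if (!this.raceIndex.has(tag)) this.raceIndex.set(tag, new Set());
      this.raceIndex.get(tag).add(item.id);
    }
  }

  async save(newsItem) {
    this.index(newsItem);
    const payload = JSON.stringify([...this.items.values()], null, 2);
    await fs.promises.writeFile(this.dataPath, payload);
    return newsItem;
  }

  async findByRaceCode(raceCode) {
    const ids = this.raceIndex.get(raceCode) || new Set();
    return [...ids].map((id) => this.items.get(id));
  }

  async findByScore(minScore) {
    return [...this.items.values()].filter((item) => item.final_score >= minScore);
  }
}
```

A possible usage, with a temporary file standing in for the real data path: `const repo = new JSONStockRepository(path.join(os.tmpdir(), 'stock.json'));` followed by `await repo.save({ id: 'a', race_tags: ['352-1'], final_score: 80 });`. Rewriting the whole file on every save keeps the sketch simple, which is acceptable for a small stock but is one of the points a later MongoDB/PostgreSQL adapter would improve.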