GroveEngine/planTI/scenario_02_chaos_monkey.md
StillHammer d8c5f93429 feat: Add comprehensive hot-reload test suite with 3 integration scenarios
This commit implements a complete test infrastructure for validating
hot-reload stability and robustness across multiple scenarios.

## New Test Infrastructure

### Test Helpers (tests/helpers/)
- TestMetrics: FPS, memory, reload time tracking with statistics
- TestReporter: Assertion tracking and formatted test reports
- SystemUtils: Memory usage monitoring via /proc/self/status
- TestAssertions: Macro-based assertion framework

### Test Modules
- TankModule: Realistic module with 50 tanks for production testing
- ChaosModule: Crash-injection module for robustness validation
- StressModule: Lightweight module for long-duration stability tests

## Integration Test Scenarios

### Scenario 1: Production Hot-Reload (test_01_production_hotreload.cpp)
✅ PASSED - End-to-end hot-reload validation
- 30 seconds simulation (1800 frames @ 60 FPS)
- TankModule with 50 tanks, realistic state
- Source modification (v1.0 → v2.0), recompilation, reload
- State preservation: positions, velocities, frameCount
- Metrics: ~163ms reload time, 0.88MB memory growth

### Scenario 2: Chaos Monkey (test_02_chaos_monkey.cpp)
✅ PASSED - Extreme robustness testing
- 150+ random crashes per run (5% crash probability per frame)
- 5 crash types: runtime_error, logic_error, out_of_range, domain_error, state corruption
- 100% recovery rate via automatic hot-reload
- Corrupted state detection and rejection
- Random seed for unpredictable crash patterns
- Proof of real reload: temporary files in /tmp/grove_module_*.so

### Scenario 3: Stress Test (test_03_stress_test.cpp)
✅ PASSED - Long-duration stability validation
- 10 minutes simulation (36000 frames @ 60 FPS)
- 120 hot-reloads (every 5 seconds)
- 100% reload success rate (120/120)
- Memory growth: 2 MB (threshold: 50 MB)
- Avg reload time: 160ms (threshold: 500ms)
- No memory leaks, no file descriptor leaks

## Core Engine Enhancements

### ModuleLoader (src/ModuleLoader.cpp)
- Temporary file copy to /tmp/ for Linux dlopen cache bypass
- Robust reload() method: getState() → unload() → load() → setState()
- Automatic cleanup of temporary files
- Comprehensive error handling and logging

### DebugEngine (src/DebugEngine.cpp)
- Automatic recovery in processModuleSystems()
- Exception catching → logging → module reload → continue
- Module state dump utilities for debugging

### SequentialModuleSystem (src/SequentialModuleSystem.cpp)
- extractModule() for safe module extraction
- registerModule() for module re-registration
- Enhanced processModules() with error handling

## Build System
- CMake configuration for test infrastructure
- Shared library compilation for test modules (.so)
- CTest integration for all scenarios
- PIC flag management for spdlog compatibility

## Documentation (planTI/)
- Complete test architecture documentation
- Detailed scenario specifications with success criteria
- Global test plan and validation thresholds

## Validation Results
All 3 integration scenarios pass successfully:
- Production hot-reload: State preservation validated
- Chaos Monkey: 100% recovery from 150+ crashes
- Stress Test: Stable over 120 reloads, minimal memory growth

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 22:13:07 +08:00


Scenario 2: Chaos Monkey

Priority: CRITICAL · Phase: 1 (MUST HAVE) · Estimated duration: ~5 minutes · Implementation effort: ~6-8 hours


🎯 Objective

Validate the system's robustness against random failures and its ability to:

  • Detect crashes
  • Recover automatically
  • Maintain memory stability
  • Avoid deadlocks
  • Log errors correctly

Inspired by: Netflix Chaos Monkey (testing by breaking things at random)


📋 Description

Principle

Run the engine for 5 minutes with a module that generates random failure events:

  • 30% chance of a hot-reload per second
  • 10% chance of a crash in process()
  • 10% chance of corrupted state
  • 5% chance of an invalid config
  • 45% chance of normal operation

Over a 5-minute run (300 one-second rolls) this averages out to roughly 90 hot-reloads and 30 intentional crashes.

Expected Behavior

The engine must:

  1. Detect each crash
  2. Log the error with a stack trace
  3. Recover by reloading the module
  4. Continue execution
  5. Never deadlock
  6. Keep memory usage stable (< 10MB growth)

🏗️ Implementation

ChaosModule Structure

// ChaosModule.h
#pragma once

#include <memory>
#include <random>
#include "IModule.h"   // engine module interface (exact path assumed)

class ChaosModule : public IModule {
public:
    void initialize(std::shared_ptr<IDataNode> config) override;
    void process(float deltaTime) override;
    std::shared_ptr<IDataNode> getState() const override;
    void setState(std::shared_ptr<IDataNode> state) override;
    bool isIdle() const override { return !isProcessing; }

private:
    std::mt19937 rng;
    int frameCount = 0;
    int crashCount = 0;
    int corruptionCount = 0;
    bool isProcessing = false;

    // Chaos configuration (overridable via initialize())
    float hotReloadProbability = 0.30f;
    float crashProbability = 0.10f;
    float corruptionProbability = 0.10f;
    float invalidConfigProbability = 0.05f;

    // Failure simulations
    void maybeHotReload();
    void maybeCrash();
    void maybeCorruptState();
    void maybeInvalidConfig();
};
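
ChaosModule is built as a shared library (build/modules/libChaosModule.so) and loaded at runtime, so the .so also needs a C-linkage entry point that ModuleLoader can resolve through dlopen/dlsym. A minimal sketch follows, assuming hypothetical createModule/destroyModule symbol names; the actual loader contract is not shown in this document.

// ChaosModule.cpp (sketch) — C-linkage factory exported for dlopen/dlsym.
// The createModule/destroyModule symbol names are assumptions, not the
// documented ModuleLoader contract.
#include "ChaosModule.h"

extern "C" {
    IModule* createModule() {
        return new ChaosModule();
    }

    void destroyModule(IModule* module) {
        delete module;
    }
}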

Process Logic

void ChaosModule::process(float deltaTime) {
    isProcessing = true;
    frameCount++;

    // Generate a random event once per second (every 60 frames)
    if (frameCount % 60 == 0) {
        float roll = (rng() % 100) / 100.0f;

        if (roll < hotReloadProbability) {
            // Signal a hot-reload (via a flag in the state)
            logger->info("🎲 Chaos: Triggering HOT-RELOAD");
            // Note: the external test harness performs the actual reload
        }
        else if (roll < hotReloadProbability + crashProbability) {
            // INTENTIONAL CRASH
            logger->warn("🎲 Chaos: Triggering CRASH");
            crashCount++;
            throw std::runtime_error("Intentional crash for chaos testing");
        }
        else if (roll < hotReloadProbability + crashProbability + corruptionProbability) {
            // STATE CORRUPTION
            logger->warn("🎲 Chaos: Corrupting STATE");
            corruptionCount++;
            // Invalidate the state here (detected and rejected at getState())
        }
        else if (roll < hotReloadProbability + crashProbability + corruptionProbability + invalidConfigProbability) {
            // INVALID CONFIG (simulated)
            logger->warn("🎲 Chaos: Invalid CONFIG request");
            // The external test will attempt setState() with an invalid config
        }
        // Otherwise: normal operation (45%)
    }

    isProcessing = false;
}

State Format

{
    "frameCount": 18000,
    "crashCount": 12,
    "corruptionCount": 8,
    "hotReloadCount": 90,
    "seed": 42,
    "isCorrupted": false
}
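
As a reference, here is a minimal sketch of how getState() could produce this payload. It assumes JsonDataNode wraps a JSON object exposed through getJsonData() (as the test below uses it), and that hotReloadCount, seed, and isCorrupted are additional members not shown in the header sketch above; the JSON type and the JsonDataNode constructor are assumptions as well.

// Sketch only — the json type, the JsonDataNode constructor, and the extra
// members are assumptions based on the state format above.
std::shared_ptr<IDataNode> ChaosModule::getState() const {
    nlohmann::json state;
    state["frameCount"]      = frameCount;
    state["crashCount"]      = crashCount;
    state["corruptionCount"] = corruptionCount;
    state["hotReloadCount"]  = hotReloadCount;   // assumed member
    state["seed"]            = seed;             // assumed member (RNG seed from config)
    state["isCorrupted"]     = isCorrupted;      // set by maybeCorruptState(); lets the caller reject corrupted state
    return std::make_shared<JsonDataNode>(state);
}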

Main Test

// test_02_chaos_monkey.cpp
#include "helpers/TestMetrics.h"
#include "helpers/TestReporter.h"
#include "helpers/TestAssertions.h"   // ASSERT_* macros (path assumed)
#include "DebugEngine.h"              // engine headers (exact paths assumed)
#include "JsonDataNode.h"
#include <chrono>
#include <csignal>
#include <cstdlib>
#include <iostream>

// Global flag set from the signal handler to catch segfaults/aborts
static volatile std::sig_atomic_t engineCrashed = 0;
void signalHandler(int signal) {
    if (signal == SIGSEGV || signal == SIGABRT) {
        engineCrashed = 1;
    }
}

int main() {
    TestReporter reporter("Chaos Monkey");
    TestMetrics metrics;

    // Setup signal handlers
    std::signal(SIGSEGV, signalHandler);
    std::signal(SIGABRT, signalHandler);

    // === SETUP ===
    DebugEngine engine;
    engine.loadModule("ChaosModule", "build/modules/libChaosModule.so");

    auto config = createJsonConfig({
        {"seed", 42},  // Reproductible
        {"hotReloadProbability", 0.30},
        {"crashProbability", 0.10},
        {"corruptionProbability", 0.10},
        {"invalidConfigProbability", 0.05}
    });

    engine.initializeModule("ChaosModule", config);

    // === CHAOS LOOP (5 minutes = 18000 frames) ===
    std::cout << "Starting Chaos Monkey (5 minutes)...\n";

    int totalFrames = 18000; // 5 * 60 * 60
    int crashesDetected = 0;
    int reloadsTriggered = 0;
    int recoverySuccesses = 0;
    bool hadDeadlock = false;

    auto testStart = std::chrono::high_resolution_clock::now();

    for (int frame = 0; frame < totalFrames; frame++) {
        auto frameStart = std::chrono::high_resolution_clock::now();

        try {
            // Update engine
            engine.update(1.0f / 60.0f);

            // Check whether the module is requesting a hot-reload
            auto state = engine.getModuleState("ChaosModule");
            auto* jsonNode = dynamic_cast<JsonDataNode*>(state.get());
            const auto& stateJson = jsonNode->getJsonData();
            (void)stateJson; // for now, the reload decision below is rolled externally

            // Simulate a hot-reload (30% probability, checked once per second)
            if (frame % 60 == 0) {
                float roll = (rand() % 100) / 100.0f;
                if (roll < 0.30f) {
                    std::cout << "  [Frame " << frame << "] Hot-reload triggered\n";

                    auto reloadStart = std::chrono::high_resolution_clock::now();

                    engine.reloadModule("ChaosModule");
                    reloadsTriggered++;

                    auto reloadEnd = std::chrono::high_resolution_clock::now();
                    float reloadTime = std::chrono::duration<float, std::milli>(reloadEnd - reloadStart).count();
                    metrics.recordReloadTime(reloadTime);
                }
            }

        } catch (const std::exception& e) {
            // CRASH DÉTECTÉ
            crashesDetected++;
            std::cout << "  [Frame " << frame << "] ⚠️  Crash detected: " << e.what() << "\n";

            // Tentative de recovery
            try {
                std::cout << "  [Frame " << frame << "] 🔄 Attempting recovery...\n";

                // Recharger le module
                engine.reloadModule("ChaosModule");

                // Réinitialiser avec state par défaut
                engine.initializeModule("ChaosModule", config);

                recoverySuccesses++;
                std::cout << "  [Frame " << frame << "] ✅ Recovery successful\n";

            } catch (const std::exception& recoveryError) {
                std::cout << "  [Frame " << frame << "] ❌ Recovery FAILED: " << recoveryError.what() << "\n";
                reporter.addAssertion("recovery_failed", false);
                break; // Arrêter le test
            }
        }

        // Metrics
        auto frameEnd = std::chrono::high_resolution_clock::now();
        float frameTime = std::chrono::duration<float, std::milli>(frameEnd - frameStart).count();
        metrics.recordFPS(1000.0f / frameTime);
        metrics.recordMemoryUsage(getCurrentMemoryUsage());

        // Deadlock detection (frame > 100ms)
        if (frameTime > 100.0f) {
            std::cout << "  [Frame " << frame << "] ⚠️  Potential deadlock (frame time: " << frameTime << "ms)\n";
            hadDeadlock = true;
        }

        // Progress report (every 3000 frames = 50s)
        if (frame % 3000 == 0 && frame > 0) {
            float elapsedMin = frame / 3600.0f;
            std::cout << "Progress: " << elapsedMin << "/5.0 minutes (" << (frame * 100 / totalFrames) << "%)\n";
        }
    }

    auto testEnd = std::chrono::high_resolution_clock::now();
    float totalDuration = std::chrono::duration<float>(testEnd - testStart).count();

    // === FINAL CHECKS ===

    // Engine still alive
    bool engineAlive = !engineCrashed;
    ASSERT_TRUE(engineAlive, "Engine should still be alive");
    reporter.addAssertion("engine_alive", engineAlive);

    // No deadlocks
    ASSERT_FALSE(hadDeadlock, "Should not have deadlocks");
    reporter.addAssertion("no_deadlocks", !hadDeadlock);

    // Recovery rate > 95%
    float recoveryRate = (crashesDetected > 0) ? (recoverySuccesses * 100.0f / crashesDetected) : 100.0f;
    ASSERT_GT(recoveryRate, 95.0f, "Recovery rate should be > 95%");
    reporter.addMetric("recovery_rate_percent", recoveryRate);

    // Memory growth < 10MB
    size_t memGrowth = metrics.getMemoryGrowth();
    ASSERT_LT(memGrowth, 10 * 1024 * 1024, "Memory growth should be < 10MB");
    reporter.addMetric("memory_growth_mb", memGrowth / (1024.0f * 1024.0f));

    // Total duration close to 5 minutes (±10s tolerance)
    ASSERT_WITHIN(totalDuration, 300.0f, 10.0f, "Total duration should be ~5 minutes");
    reporter.addMetric("total_duration_sec", totalDuration);

    // Statistics
    reporter.addMetric("crashes_detected", crashesDetected);
    reporter.addMetric("reloads_triggered", reloadsTriggered);
    reporter.addMetric("recovery_successes", recoverySuccesses);

    std::cout << "\n";
    std::cout << "================================================================================\n";
    std::cout << "CHAOS MONKEY STATISTICS\n";
    std::cout << "================================================================================\n";
    std::cout << "  Total frames:        " << totalFrames << "\n";
    std::cout << "  Duration:            " << totalDuration << "s\n";
    std::cout << "  Crashes detected:    " << crashesDetected << "\n";
    std::cout << "  Reloads triggered:   " << reloadsTriggered << "\n";
    std::cout << "  Recovery successes:  " << recoverySuccesses << "\n";
    std::cout << "  Recovery rate:       " << recoveryRate << "%\n";
    std::cout << "  Memory growth:       " << (memGrowth / (1024.0f * 1024.0f)) << " MB\n";
    std::cout << "  Had deadlocks:       " << (hadDeadlock ? "YES ❌" : "NO ✅") << "\n";
    std::cout << "================================================================================\n\n";

    // === FINAL REPORT ===
    metrics.printReport();
    reporter.printFinalReport();

    return reporter.getExitCode();
}

📊 Collected Metrics

| Metric                | Description                       | Threshold           |
|-----------------------|-----------------------------------|---------------------|
| engine_alive          | Engine still alive at the end     | true                |
| no_deadlocks          | No deadlock detected              | true                |
| recovery_rate_percent | % of crashes recovered            | > 95%               |
| memory_growth_mb      | Total memory growth               | < 10MB              |
| total_duration_sec    | Total test duration               | ~300s (±10s)        |
| crashes_detected      | Number of crashes detected        | N/A (informational) |
| reloads_triggered     | Number of hot-reloads             | ~90 (30% × 300s)    |
| recovery_successes    | Number of successful recoveries   | ~crashes_detected   |
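
The memory_growth_mb metric relies on getCurrentMemoryUsage(), provided by the SystemUtils helper that the commit describes as reading /proc/self/status. A minimal sketch of such a probe, assuming Linux and a VmRSS-based measurement (the real helper's exact behavior may differ):

// Sketch of a /proc/self/status based memory probe (Linux only; assumed behavior).
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

size_t getCurrentMemoryUsage() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {       // line looks like "VmRSS:   123456 kB"
            std::istringstream fields(line.substr(6));
            size_t kb = 0;
            fields >> kb;
            return kb * 1024;                     // convert kB to bytes
        }
    }
    return 0;                                     // unavailable (non-Linux or parse failure)
}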

Success Criteria

MUST PASS

  1. Engine still alive after 5 minutes
  2. No deadlocks
  3. Recovery rate > 95%
  4. Memory growth < 10MB
  5. Logs contain every crash (with stack trace where possible)

NICE TO HAVE

  1. Recovery rate = 100% (no failed recoveries)
  2. Memory growth < 5MB
  3. Average reload time < 500ms even under chaos

🔧 Recovery Strategy

In DebugEngine::update()

void DebugEngine::update(float deltaTime) {
    try {
        moduleSystem->update(deltaTime);

    } catch (const std::exception& e) {
        // CRASH DETECTED
        logger->error("❌ Module crashed: {}", e.what());

        // Attempt automatic recovery
        try {
            logger->info("🔄 Attempting automatic recovery...");

            // 1. Extract the failed module (the crashed instance is discarded)
            auto failedModule = moduleSystem->extractModule();

            // 2. Reload from the .so
            std::string moduleName = "ChaosModule"; // TODO: generalize
            reloadModule(moduleName);

            // 3. Re-initialize with a default config
            auto defaultConfig = createDefaultConfig();
            auto newModule = moduleLoader->getModule(moduleName);
            newModule->initialize(defaultConfig);

            // 4. Re-register
            moduleSystem->registerModule(moduleName, std::move(newModule));

            logger->info("✅ Recovery successful");

        } catch (const std::exception& recoveryError) {
            logger->critical("❌ Recovery failed: {}", recoveryError.what());
            throw; // Re-throw if recovery is impossible
        }
    }
}
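
Note that this recovery path deliberately re-initializes the reloaded module with a default config rather than restoring its previous state: the crash may have been caused by (or may have corrupted) that state, so replaying it via setState() would risk repeating the failure immediately. The regular, non-crash hot-reload path keeps state through ModuleLoader's usual getState() → unload() → load() → setState() sequence.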

🐛 Expected Error Cases

| Error              | Cause                           | Action                 |
|--------------------|---------------------------------|------------------------|
| Unrecovered crash  | Buggy recovery logic            | FAIL - fix recovery    |
| Deadlock           | Mutex held in the crash handler | FAIL - review locks    |
| Memory leak > 10MB | Module not cleaned up properly  | FAIL - fix destructors |
| Duration >> 300s   | Reloads take too long           | WARNING - optimize     |

📝 Expected Output

================================================================================
TEST: Chaos Monkey
================================================================================

Starting Chaos Monkey (5 minutes)...

  [Frame 60] Hot-reload triggered
  [Frame 180] ⚠️  Crash detected: Intentional crash for chaos testing
  [Frame 180] 🔄 Attempting recovery...
  [Frame 180] ✅ Recovery successful
  [Frame 240] Hot-reload triggered
  ...
Progress: 1.0/5.0 minutes (20%)
  [Frame 3420] ⚠️  Crash detected: Intentional crash for chaos testing
  [Frame 3420] 🔄 Attempting recovery...
  [Frame 3420] ✅ Recovery successful
  ...
Progress: 5.0/5.0 minutes (100%)

================================================================================
CHAOS MONKEY STATISTICS
================================================================================
  Total frames:        18000
  Duration:            302.4s
  Crashes detected:    28
  Reloads triggered:   89
  Recovery successes:  28
  Recovery rate:       100%
  Memory growth:       3.2 MB
  Had deadlocks:       NO ✅
================================================================================

METRICS
================================================================================
  Engine alive:        true           ✓
  No deadlocks:        true           ✓
  Recovery rate:       100%           (threshold: > 95%) ✓
  Memory growth:       3.2 MB         (threshold: < 10MB) ✓
  Total duration:      302.4s         (expected: ~300s) ✓

ASSERTIONS
================================================================================
  ✓ engine_alive
  ✓ no_deadlocks
  ✓ recovery_rate > 95%
  ✓ memory_stable

Result: ✅ PASSED

================================================================================

📅 Planning

Day 1 (4h):

  • Implement ChaosModule with configurable probabilities
  • Implement the recovery logic in DebugEngine

Day 2 (4h):

  • Implement test_02_chaos_monkey.cpp
  • Signal handling (SIGSEGV, SIGABRT)
  • Deadlock detection
  • Debugging + validation

Next step: scenario_03_stress_test.md