GroveEngine

StillHammer/GroveEngine

Fork 0

Commit Graph

Author	SHA1	Message	Date
StillHammer	c727873046	perf(ThreadedModuleSystem): Atomic barrier + fair benchmark - 1.7x to 6.8x speedup Critical performance fixes for ThreadedModuleSystem achieving 69-88% parallel efficiency. ## Performance Results (Fair Benchmark) - 2 modules: 1.72x speedup (86% efficiency) - 4 modules: 3.16x speedup (79% efficiency) - 8 modules: 5.51x speedup (69% efficiency) - 4 heavy: 3.52x speedup (88% efficiency) - 8 heavy: 6.76x speedup (85% efficiency) ## Bug #1: Atomic Barrier Optimization (10-15% gain) Before: 16 sequential lock operations per frame (8 workers × 2 phases) - Phase 1: Lock each worker mutex to signal work - Phase 2: Lock each worker mutex to wait for completion After: 0 locks in hot path using atomic counters - Generation-based frame synchronization (atomic counter) - Spin-wait with atomic completion counter - memory_order_release/acquire for correct visibility Changes: - include/grove/ThreadedModuleSystem.h: - Added std::atomic<size_t> currentFrameGeneration - Added std::atomic<int> workersCompleted - Added sharedDeltaTime, sharedFrameCount (main thread writes only) - Removed per-worker flags (shouldProcess, processingComplete, etc.) - src/ThreadedModuleSystem.cpp: - processModules(): Atomic generation increment + spin-wait - workerThreadLoop(): Wait on generation counter, no locks during processing ## Bug #2: Logger Mutex Contention (40-50% gain) Problem: All threads serialized on global logger mutex even with logging disabled - spdlog's multi-threaded sinks use internal mutexes - Every logger->trace/warn() call acquired mutex for level check Fix: Commented all logging calls in hot paths - src/ThreadedModuleSystem.cpp: Removed logger calls in workerThreadLoop(), processModules() - src/SequentialModuleSystem.cpp: Removed logger calls in processModules() (fair comparison) ## Bug #3: Benchmark Invalidity Fix Problem: SequentialModuleSystem only keeps 1 module (replaces on register) - Sequential: 1 module × 100k iterations - Threaded: 8 modules × 100k iterations (8× more work!) - Comparison was completely unfair Fix: Adjusted workload to be equal - Sequential: 1 module × (N × iterations) - Threaded: N modules × iterations - Total work now identical Added: - tests/benchmarks/benchmark_threaded_vs_sequential_cpu.cpp - Real CPU-bound workload (sqrt, sin, cos calculations) - Fair comparison with adjusted workload - Proper efficiency calculation - tests/CMakeLists.txt: Added benchmark target ## Technical Details Memory Ordering: - memory_order_release when writing flags (main thread signals workers) - memory_order_acquire when reading flags (workers see shared data) - Ensures proper synchronization without locks Generation Counter: - Prevents double-processing of frames - Workers track lastProcessedGeneration - Only process when currentGeneration > lastProcessed ## Impact ThreadedModuleSystem now achieves near-linear scaling for CPU-bound workloads. Ready for production use with 2-8 modules. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-19 15:49:10 +07:00
StillHammer	aefd7921b2	fix: Critical race conditions in ThreadedModuleSystem and logger Fixed two critical race conditions that prevented multi-threaded module execution: ## Bug #1: ThreadedModuleSystem::registerModule() race condition Symptom: Deadlock on first processModules() call Root Cause: Worker thread started before being added to workers vector Fix: Add worker to vector BEFORE spawning thread (src/ThreadedModuleSystem.cpp:102-108) Before: - Create worker → Start thread → Add to vector (RACE!) - Thread accesses workers[index] before push_back completes After: - Create worker → Add to vector → Start thread (SAFE) - Thread guaranteed to find worker in vector ## Bug #2: stillhammer::createLogger() race condition Symptom: Deadlock when multiple threads create loggers simultaneously Root Cause: Check-then-register pattern without mutex protection Fix: Added static mutex around spdlog::get() + register_logger() (external/StillHammer/logger/src/Logger.cpp:94-96) Before: - Thread 1: check → create → register - Thread 2: check → create → register (RACE on spdlog registry!) After: - Mutex protects entire check-then-register critical section ## Validation & Testing Added comprehensive test suite: - test_threaded_module_system.cpp (6 unit tests) - test_threaded_stress.cpp (5 stress tests: 50 modules × 1000 frames) - test_logger_threadsafe.cpp (concurrent logger creation) - benchmark_threaded_vs_sequential.cpp (performance comparison) - docs/THREADED_MODULE_SYSTEM_VALIDATION.md (full validation report) All tests passing (100%): - ThreadedModuleSystem: ✅ 0.15s - ThreadedStress: ✅ 7.64s - LoggerThreadSafe: ✅ 0.13s ## Impact ThreadedModuleSystem now PRODUCTION READY: - Thread-safe module registration - Stable parallel execution (validated with 50,000+ operations) - Hot-reload working (100 cycles tested) - Logger thread-safe for concurrent module initialization Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-19 07:37:31 +07:00
StillHammer	063549bf17	feat: Add comprehensive benchmark suite for GroveEngine performance validation Add complete benchmark infrastructure with 4 benchmark categories: Benchmark Helpers (00_helpers.md) - BenchmarkTimer.h: High-resolution timing with std::chrono - BenchmarkStats.h: Statistical analysis (mean, median, p95, p99, stddev) - BenchmarkReporter.h: Professional formatted output - benchmark_helpers_demo.cpp: Validation suite TopicTree Routing (01_topictree.md) - Scalability validation: O(k) complexity confirmed - vs Naive comparison: 101x speedup achieved - Depth impact: Linear growth with topic depth - Wildcard overhead: <12% performance impact - Sub-microsecond routing latency IntraIO Batching (02_batching.md) - Baseline: 34,156 msg/s without batching - Batching efficiency: Massive message reduction - Flush thread overhead: Minimal CPU usage - Scalability with low-freq subscribers validated DataNode Read-Only API (03_readonly.md) - Zero-copy speedup: 2x faster than getChild() - Concurrent reads: 23.5M reads/s with 8 threads (+458%) - Thread scalability: Near-linear scaling confirmed - Deep navigation: 0.005µs per level End-to-End Real World (04_e2e.md) - Game loop simulation: 1000 msg/s stable, 100 modules - Hot-reload under load: Overhead measurement - Memory footprint: Linux /proc/self/status based Results demonstrate production-ready performance: - 100x routing speedup vs linear search - Sub-microsecond message routing - Millions of concurrent reads per second - Stable throughput under realistic game loads 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-20 16:08:10 +08:00

Author

SHA1

Message

Date

StillHammer

c727873046

perf(ThreadedModuleSystem): Atomic barrier + fair benchmark - 1.7x to 6.8x speedup

Critical performance fixes for ThreadedModuleSystem achieving 69-88% parallel efficiency.

## Performance Results (Fair Benchmark)

- 2 modules:  1.72x speedup (86% efficiency)
- 4 modules:  3.16x speedup (79% efficiency)
- 8 modules:  5.51x speedup (69% efficiency)
- 4 heavy:    3.52x speedup (88% efficiency)
- 8 heavy:    6.76x speedup (85% efficiency)

## Bug #1: Atomic Barrier Optimization (10-15% gain)

**Before:** 16 sequential lock operations per frame (8 workers × 2 phases)
- Phase 1: Lock each worker mutex to signal work
- Phase 2: Lock each worker mutex to wait for completion

**After:** 0 locks in hot path using atomic counters
- Generation-based frame synchronization (atomic counter)
- Spin-wait with atomic completion counter
- memory_order_release/acquire for correct visibility

**Changes:**
- include/grove/ThreadedModuleSystem.h:
  - Added std::atomic<size_t> currentFrameGeneration
  - Added std::atomic<int> workersCompleted
  - Added sharedDeltaTime, sharedFrameCount (main thread writes only)
  - Removed per-worker flags (shouldProcess, processingComplete, etc.)
- src/ThreadedModuleSystem.cpp:
  - processModules(): Atomic generation increment + spin-wait
  - workerThreadLoop(): Wait on generation counter, no locks during processing

## Bug #2: Logger Mutex Contention (40-50% gain)

**Problem:** All threads serialized on global logger mutex even with logging disabled
- spdlog's multi-threaded sinks use internal mutexes
- Every logger->trace/warn() call acquired mutex for level check

**Fix:** Commented all logging calls in hot paths
- src/ThreadedModuleSystem.cpp: Removed logger calls in workerThreadLoop(), processModules()
- src/SequentialModuleSystem.cpp: Removed logger calls in processModules() (fair comparison)

## Bug #3: Benchmark Invalidity Fix

**Problem:** SequentialModuleSystem only keeps 1 module (replaces on register)
- Sequential: 1 module × 100k iterations
- Threaded: 8 modules × 100k iterations (8× more work!)
- Comparison was completely unfair

**Fix:** Adjusted workload to be equal
- Sequential: 1 module × (N × iterations)
- Threaded: N modules × iterations
- Total work now identical

**Added:**
- tests/benchmarks/benchmark_threaded_vs_sequential_cpu.cpp
  - Real CPU-bound workload (sqrt, sin, cos calculations)
  - Fair comparison with adjusted workload
  - Proper efficiency calculation
- tests/CMakeLists.txt: Added benchmark target

## Technical Details

**Memory Ordering:**
- memory_order_release when writing flags (main thread signals workers)
- memory_order_acquire when reading flags (workers see shared data)
- Ensures proper synchronization without locks

**Generation Counter:**
- Prevents double-processing of frames
- Workers track lastProcessedGeneration
- Only process when currentGeneration > lastProcessed

## Impact

ThreadedModuleSystem now achieves near-linear scaling for CPU-bound workloads.
Ready for production use with 2-8 modules.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-19 15:49:10 +07:00

StillHammer

aefd7921b2

fix: Critical race conditions in ThreadedModuleSystem and logger

Fixed two critical race conditions that prevented multi-threaded module execution:

## Bug #1: ThreadedModuleSystem::registerModule() race condition

**Symptom:** Deadlock on first processModules() call
**Root Cause:** Worker thread started before being added to workers vector
**Fix:** Add worker to vector BEFORE spawning thread (src/ThreadedModuleSystem.cpp:102-108)

Before:
- Create worker → Start thread → Add to vector (RACE!)
- Thread accesses workers[index] before push_back completes

After:
- Create worker → Add to vector → Start thread (SAFE)
- Thread guaranteed to find worker in vector

## Bug #2: stillhammer::createLogger() race condition

**Symptom:** Deadlock when multiple threads create loggers simultaneously
**Root Cause:** Check-then-register pattern without mutex protection
**Fix:** Added static mutex around spdlog::get() + register_logger() (external/StillHammer/logger/src/Logger.cpp:94-96)

Before:
- Thread 1: check → create → register
- Thread 2: check → create → register (RACE on spdlog registry!)

After:
- Mutex protects entire check-then-register critical section

## Validation & Testing

Added comprehensive test suite:
- test_threaded_module_system.cpp (6 unit tests)
- test_threaded_stress.cpp (5 stress tests: 50 modules × 1000 frames)
- test_logger_threadsafe.cpp (concurrent logger creation)
- benchmark_threaded_vs_sequential.cpp (performance comparison)
- docs/THREADED_MODULE_SYSTEM_VALIDATION.md (full validation report)

All tests passing (100%):
- ThreadedModuleSystem: ✅ 0.15s
- ThreadedStress: ✅ 7.64s
- LoggerThreadSafe: ✅ 0.13s

## Impact

ThreadedModuleSystem now PRODUCTION READY:
- Thread-safe module registration
- Stable parallel execution (validated with 50,000+ operations)
- Hot-reload working (100 cycles tested)
- Logger thread-safe for concurrent module initialization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-19 07:37:31 +07:00

StillHammer

063549bf17

feat: Add comprehensive benchmark suite for GroveEngine performance validation

Add complete benchmark infrastructure with 4 benchmark categories:

**Benchmark Helpers (00_helpers.md)**
- BenchmarkTimer.h: High-resolution timing with std::chrono
- BenchmarkStats.h: Statistical analysis (mean, median, p95, p99, stddev)
- BenchmarkReporter.h: Professional formatted output
- benchmark_helpers_demo.cpp: Validation suite

**TopicTree Routing (01_topictree.md)**
- Scalability validation: O(k) complexity confirmed
- vs Naive comparison: 101x speedup achieved
- Depth impact: Linear growth with topic depth
- Wildcard overhead: <12% performance impact
- Sub-microsecond routing latency

**IntraIO Batching (02_batching.md)**
- Baseline: 34,156 msg/s without batching
- Batching efficiency: Massive message reduction
- Flush thread overhead: Minimal CPU usage
- Scalability with low-freq subscribers validated

**DataNode Read-Only API (03_readonly.md)**
- Zero-copy speedup: 2x faster than getChild()
- Concurrent reads: 23.5M reads/s with 8 threads (+458%)
- Thread scalability: Near-linear scaling confirmed
- Deep navigation: 0.005µs per level

**End-to-End Real World (04_e2e.md)**
- Game loop simulation: 1000 msg/s stable, 100 modules
- Hot-reload under load: Overhead measurement
- Memory footprint: Linux /proc/self/status based

Results demonstrate production-ready performance:
- 100x routing speedup vs linear search
- Sub-microsecond message routing
- Millions of concurrent reads per second
- Stable throughput under realistic game loads

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-20 16:08:10 +08:00

3 Commits