# ARCHITECTURE.md - Feed Generator Technical Design

---

## SYSTEM OVERVIEW

**Feed Generator** aggregates news content from web sources, enriches it with AI-generated image analysis, generates articles through an existing Node.js API, and publishes the results as RSS/JSON feeds.

### High-Level Flow

```
Web Sources → Scraper → Image Analyzer → Aggregator → Node API Client → Publisher
     ↓           ↓            ↓              ↓               ↓              ↓
    HTML    NewsArticle  AnalyzedArticle  Prompt   GeneratedArticle     Feed/RSS
```


### Design Goals

1. **Simplicity** - Clear, readable code over cleverness
2. **Modularity** - Each component has ONE responsibility
3. **Type Safety** - Full type coverage, mypy-compliant
4. **Testability** - Every module independently testable
5. **Prototype Speed** - Working system in 3-5 days
6. **Future-Proof** - Easy to migrate to Node.js later

---

## ARCHITECTURE PRINCIPLES

### 1. Pipeline Architecture

**Linear data flow, no circular dependencies.**

Input → Transform → Transform → Transform → Output


Each stage:
- Takes typed input
- Performs ONE transformation
- Returns typed output
- Can fail explicitly
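
A minimal sketch of this per-stage contract (illustrative only; the codebase does not define a `PipelineStage` protocol):

```python
from typing import Protocol, TypeVar

TIn = TypeVar("TIn", contravariant=True)
TOut = TypeVar("TOut", covariant=True)

class PipelineStage(Protocol[TIn, TOut]):
    """One typed transformation; raises a module-specific error on failure."""

    def run(self, data: TIn) -> TOut:
        ...
```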

### 2. Dependency Injection

**Configuration flows top-down, no global state.**

```python
# Main orchestrator
config = Config.from_env()

scraper = NewsScraper(config.scraper)
analyzer = ImageAnalyzer(config.api.openai_key)
client = ArticleAPIClient(config.api.node_api_url)
publisher = FeedPublisher(config.publisher)

# Pass dependencies explicitly
pipeline = Pipeline(scraper, analyzer, client, publisher)
```

### 3. Explicit Error Boundaries

**Each module defines its failure modes.**

```python
# Module A raises ScrapingError
# Module B catches and handles
try:
    articles = scraper.scrape(url)
except ScrapingError as e:
    logger.error(f"Scraping failed: {e}")
    # Decide: retry, skip, or fail
```

---

## MODULE RESPONSIBILITIES

1. config.py - Configuration Management

Purpose: Centralize all configuration, load from environment.

Responsibilities:

  • Load configuration from .env file
  • Validate required settings
  • Provide immutable config objects
  • NO business logic

Data Structures:

@dataclass(frozen=True)
class APIConfig:
    openai_key: str
    node_api_url: str
    timeout_seconds: int

@dataclass(frozen=True)
class ScraperConfig:
    sources: List[str]
    max_articles: int
    timeout_seconds: int

@dataclass(frozen=True)
class Config:
    api: APIConfig
    scraper: ScraperConfig
    log_level: str

Interface:

def from_env() -> Config:
    """Load and validate configuration from environment."""

2. scraper.py - Web Scraping

Purpose: Extract news articles from web sources.

Responsibilities:

  • HTTP requests to news sites
  • HTML parsing with BeautifulSoup
  • Extract: title, content, image URLs
  • Handle site-specific quirks
  • NO image analysis, NO article generation

Data Structures:

@dataclass
class NewsArticle:
    title: str
    url: str
    content: str
    image_url: Optional[str]
    published_at: Optional[datetime]
    source: str

Interface:

class NewsScraper:
    def scrape(self, url: str) -> List[NewsArticle]:
        """Scrape articles from a news source."""
    
    def scrape_all(self) -> List[NewsArticle]:
        """Scrape all configured sources."""

Error Handling:

  • Raises ScrapingError on failure
  • Logs warnings for individual article failures
  • Returns partial results when possible
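
A hedged sketch of the per-source scrape, assuming `requests` plus BeautifulSoup; the CSS selectors and tag names are placeholders that would be tuned per site:

```python
import requests
from bs4 import BeautifulSoup
from typing import List

def _scrape_source(self, url: str) -> List[NewsArticle]:
    try:
        response = requests.get(url, timeout=self._config.timeout_seconds)
        response.raise_for_status()
    except requests.RequestException as exc:
        raise ScrapingError(f"Request to {url} failed: {exc}") from exc

    soup = BeautifulSoup(response.text, "html.parser")
    articles: List[NewsArticle] = []
    for node in soup.select("article")[: self._config.max_articles]:
        title = node.find("h2")
        link = node.find("a", href=True)
        if not title or not link:
            continue  # skip malformed entries instead of failing the whole source
        image = node.find("img", src=True)
        articles.append(
            NewsArticle(
                title=title.get_text(strip=True),
                url=link["href"],
                content=node.get_text(strip=True),
                image_url=image["src"] if image else None,
                published_at=None,
                source=url,
            )
        )
    return articles
```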

3. image_analyzer.py - AI Image Analysis

Purpose: Generate descriptions of news images using GPT-4 Vision.

Responsibilities:

  • Call OpenAI GPT-4 Vision API
  • Generate contextual image descriptions
  • Handle API rate limits and errors
  • NO scraping, NO article generation

Data Structures:

@dataclass
class ImageAnalysis:
    image_url: str
    description: str
    confidence: float  # 0.0 to 1.0
    analysis_time: datetime

Interface:

class ImageAnalyzer:
    def analyze(self, image_url: str, context: str) -> ImageAnalysis:
        """Analyze single image with context."""
    
    def analyze_batch(
        self, 
        articles: List[NewsArticle]
    ) -> Dict[str, ImageAnalysis]:
        """Analyze multiple images, return dict keyed by URL."""

Error Handling:

  • Raises ImageAnalysisError on API failure
  • Returns None for individual failures in batch
  • Implements retry logic with exponential backoff
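
A sketch of the single-image call using the `openai` (v1+) Python SDK; the model name, prompt wording, `self._api_key` attribute, and the confidence heuristic are assumptions, not taken from the codebase:

```python
from datetime import datetime
from openai import OpenAI

def analyze(self, image_url: str, context: str) -> ImageAnalysis:
    client = OpenAI(api_key=self._api_key)  # assumed attribute holding the key
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Describe this news image. Article context: {context}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    description = response.choices[0].message.content or ""
    return ImageAnalysis(
        image_url=image_url,
        description=description,
        confidence=1.0 if description else 0.0,  # placeholder heuristic, not a real score
        analysis_time=datetime.now(),
    )
```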

4. aggregator.py - Content Aggregation

Purpose: Combine scraped content and image analysis into generation prompts.

Responsibilities:

  • Merge NewsArticle + ImageAnalysis
  • Format prompts for article generation API
  • Apply business logic (e.g., skip low-confidence images)
  • NO external API calls

Data Structures:

@dataclass
class AggregatedContent:
    news: NewsArticle
    image_analysis: Optional[ImageAnalysis]
    
    def to_generation_prompt(self) -> Dict[str, str]:
        """Convert to the format expected by the Node API."""
        prompt = {
            "topic": self.news.title,
            "context": self.news.content,
        }
        if self.image_analysis is not None:
            prompt["image_description"] = self.image_analysis.description
        return prompt

Interface:

class ContentAggregator:
    def aggregate(
        self,
        articles: List[NewsArticle],
        analyses: Dict[str, ImageAnalysis]
    ) -> List[AggregatedContent]:
        """Combine scraped and analyzed content."""

Business Rules:

  • Skip articles without images if image required
  • Skip low-confidence image analyses (< 0.5)
  • Limit prompt length to API constraints
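
A sketch of how `aggregate()` could apply these rules; `MIN_CONFIDENCE`, `MAX_CONTEXT_CHARS`, and `self._require_image` are illustrative names, not from the codebase (only the 0.5 cut-off comes from the rules above):

```python
from dataclasses import replace
from typing import Dict, List

MIN_CONFIDENCE = 0.5
MAX_CONTEXT_CHARS = 4000  # assumed prompt-length budget

def aggregate(
    self,
    articles: List[NewsArticle],
    analyses: Dict[str, ImageAnalysis],
) -> List[AggregatedContent]:
    results: List[AggregatedContent] = []
    for article in articles:
        analysis = analyses.get(article.image_url) if article.image_url else None
        # Rule: drop low-confidence image analyses rather than passing them downstream
        if analysis is not None and analysis.confidence < MIN_CONFIDENCE:
            analysis = None
        # Rule: skip articles without a usable image when images are required
        if self._require_image and analysis is None:
            continue
        # Rule: keep the prompt context within the generation API's limits
        trimmed = replace(article, content=article.content[:MAX_CONTEXT_CHARS])
        results.append(AggregatedContent(news=trimmed, image_analysis=analysis))
    return results
```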

5. article_client.py - Node API Client

Purpose: Call existing Node.js article generation API.

Responsibilities:

  • HTTP POST to Node.js server
  • Request/response serialization
  • Retry logic for transient failures
  • NO content processing, NO publishing

Data Structures:

@dataclass
class GeneratedArticle:
    original_news: NewsArticle
    generated_content: str
    metadata: Dict[str, Any]
    generation_time: datetime

Interface:

class ArticleAPIClient:
    def generate(self, prompt: Dict[str, str]) -> GeneratedArticle:
        """Generate single article."""
    
    def generate_batch(
        self,
        prompts: List[Dict[str, str]]
    ) -> List[GeneratedArticle]:
        """Generate multiple articles with rate limiting."""

Error Handling:

  • Raises APIClientError on failure
  • Implements exponential backoff retry
  • Respects API rate limits
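
A sketch of the underlying HTTP call; the `/generate` endpoint path, the `_base_url`/`_timeout` attribute names, the `_post_generate` helper, and the JSON response shape are assumptions about the Node API:

```python
import requests
from typing import Any, Dict

def _post_generate(self, prompt: Dict[str, str]) -> Dict[str, Any]:
    """POST one prompt to the Node API and return the parsed JSON body."""
    try:
        response = requests.post(
            f"{self._base_url}/generate",
            json=prompt,
            timeout=self._timeout,
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException as exc:
        raise APIClientError(f"Node API call failed: {exc}") from exc
```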

6. publisher.py - Feed Publishing

Purpose: Publish generated articles to output channels.

Responsibilities:

  • Generate RSS/Atom feeds
  • Post to WordPress (if configured)
  • Write to local files
  • NO content generation, NO scraping

Interface:

class FeedPublisher:
    def publish_rss(self, articles: List[GeneratedArticle], path: Path) -> None:
        """Generate RSS feed file."""
    
    def publish_wordpress(self, articles: List[GeneratedArticle]) -> None:
        """Post to WordPress via XML-RPC or REST API."""
    
    def publish_json(self, articles: List[GeneratedArticle], path: Path) -> None:
        """Write articles as JSON for debugging."""

Output Formats:

  • RSS 2.0 feed
  • WordPress posts
  • JSON archive
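
A minimal standard-library sketch of the RSS 2.0 output, assuming the GeneratedArticle dataclass above; the channel metadata is placeholder, and a real implementation might use a feed library instead:

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

def write_rss(articles: List[GeneratedArticle], path: Path, feed_title: str = "Generated Feed") -> None:
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = "https://example.com/feed"  # placeholder
    ET.SubElement(channel, "description").text = "Articles produced by the feed generator"
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article.original_news.title
        ET.SubElement(item, "link").text = article.original_news.url
        ET.SubElement(item, "description").text = article.generated_content
        ET.SubElement(item, "pubDate").text = article.generation_time.strftime("%a, %d %b %Y %H:%M:%S +0000")
    path.parent.mkdir(parents=True, exist_ok=True)
    ET.ElementTree(rss).write(path, encoding="utf-8", xml_declaration=True)
```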

---

## DATA FLOW DETAIL

Complete Pipeline

def run_pipeline(config: Config) -> None:
    """Execute complete feed generation pipeline."""
    
    # 1. Initialize components
    scraper = NewsScraper(config.scraper)
    analyzer = ImageAnalyzer(config.api.openai_key)
    aggregator = ContentAggregator()
    client = ArticleAPIClient(config.api.node_api_url)
    publisher = FeedPublisher(config.publisher)
    
    # 2. Scrape news sources
    logger.info("Scraping news sources...")
    articles: List[NewsArticle] = scraper.scrape_all()
    logger.info(f"Scraped {len(articles)} articles")
    
    # 3. Analyze images
    logger.info("Analyzing images...")
    analyses: Dict[str, ImageAnalysis] = analyzer.analyze_batch(articles)
    logger.info(f"Analyzed {len(analyses)} images")
    
    # 4. Aggregate content
    logger.info("Aggregating content...")
    aggregated: List[AggregatedContent] = aggregator.aggregate(articles, analyses)
    logger.info(f"Aggregated {len(aggregated)} items")
    
    # 5. Generate articles
    logger.info("Generating articles...")
    prompts = [item.to_generation_prompt() for item in aggregated]
    generated: List[GeneratedArticle] = client.generate_batch(prompts)
    logger.info(f"Generated {len(generated)} articles")
    
    # 6. Publish
    logger.info("Publishing...")
    publisher.publish_rss(generated, Path("output/feed.rss"))
    publisher.publish_json(generated, Path("output/articles.json"))
    logger.info("Pipeline complete!")

Error Handling in Pipeline

def run_pipeline_with_recovery(config: Config) -> None:
    """Pipeline with error recovery at each stage (components initialized as in run_pipeline above)."""
    
    try:
        # Stage 1: Scraping
        articles = scraper.scrape_all()
        if not articles:
            logger.warning("No articles scraped, exiting")
            return
    except ScrapingError as e:
        logger.error(f"Scraping failed: {e}")
        return  # Cannot proceed without articles
    
    try:
        # Stage 2: Image Analysis (optional)
        analyses = analyzer.analyze_batch(articles)
    except ImageAnalysisError as e:
        logger.warning(f"Image analysis failed: {e}, proceeding without images")
        analyses = {}  # Continue without image descriptions
    
    # Stage 3: Aggregation (cannot fail with valid inputs)
    aggregated = aggregator.aggregate(articles, analyses)
    
    try:
        # Stage 4: Generation
        prompts = [item.to_generation_prompt() for item in aggregated]
        generated = client.generate_batch(prompts)
        if not generated:
            logger.error("No articles generated, exiting")
            return
    except APIClientError as e:
        logger.error(f"Article generation failed: {e}")
        return  # Cannot publish without generated articles
    
    try:
        # Stage 5: Publishing
        publisher.publish_rss(generated, Path("output/feed.rss"))
        publisher.publish_json(generated, Path("output/articles.json"))
    except PublishingError as e:
        logger.error(f"Publishing failed: {e}")
        # Save to backup location
        publisher.publish_json(generated, Path("backup/articles.json"))

---

## INTERFACE CONTRACTS

Module Input/Output Types

# scraper.py
Input:  str (URL)
Output: List[NewsArticle]
Errors: ScrapingError

# image_analyzer.py
Input:  List[NewsArticle]
Output: Dict[str, ImageAnalysis]  # Keyed by image_url
Errors: ImageAnalysisError

# aggregator.py
Input:  List[NewsArticle], Dict[str, ImageAnalysis]
Output: List[AggregatedContent]
Errors: None (pure transformation)

# article_client.py
Input:  List[Dict[str, str]]  # Prompts
Output: List[GeneratedArticle]
Errors: APIClientError

# publisher.py
Input:  List[GeneratedArticle]
Output: None (side effects: files, API calls)
Errors: PublishingError

Type Safety Guarantees

All interfaces use:

  • Immutable dataclasses for data structures
  • Explicit Optional for nullable values
  • Specific exceptions for error cases
  • Type hints on all function signatures

```python
# Example: type-safe interface
def process_article(
    article: NewsArticle,               # Required
    analysis: Optional[ImageAnalysis],  # Nullable
) -> GeneratedArticle:
    """The signature documents the contract; failures raise FeedGeneratorError subclasses."""
```

---

## CONFIGURATION STRATEGY

Environment Variables

# Required
OPENAI_API_KEY=sk-...
NODE_API_URL=http://localhost:3000
NEWS_SOURCES=https://example.com/news,https://other.com/feed

# Optional
LOG_LEVEL=INFO
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
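
At startup these variables are typically pulled from `.env`; a minimal sketch assuming the python-dotenv package is used:

```python
from dotenv import load_dotenv

load_dotenv()               # read .env into os.environ
config = Config.from_env()  # validation shown under "Configuration Validation" below
```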

Configuration Hierarchy

Default Values → Environment Variables → CLI Arguments (future)
     ↓                    ↓                      ↓
  config.py          .env file             argparse

Configuration Validation

@classmethod
def from_env(cls) -> Config:
    """Load with validation."""
    
    # Required fields
    openai_key = os.getenv("OPENAI_API_KEY")
    if not openai_key:
        raise ValueError("OPENAI_API_KEY required")
    
    # Validated parsing
    node_api_url = os.getenv("NODE_API_URL", "http://localhost:3000")
    if not node_api_url.startswith(('http://', 'https://')):
        raise ValueError(f"Invalid NODE_API_URL: {node_api_url}")
    
    # List parsing
    sources_str = os.getenv("NEWS_SOURCES", "")
    sources = [s.strip() for s in sources_str.split(",") if s.strip()]
    if not sources:
        raise ValueError("NEWS_SOURCES required (comma-separated URLs)")
    
    return cls(...)

---

## ERROR HANDLING ARCHITECTURE

Exception Hierarchy

class FeedGeneratorError(Exception):
    """Base exception - catch-all for system errors."""
    pass

class ScrapingError(FeedGeneratorError):
    """Web scraping failed."""
    pass

class ImageAnalysisError(FeedGeneratorError):
    """GPT-4 Vision analysis failed."""
    pass

class APIClientError(FeedGeneratorError):
    """Node.js API communication failed."""
    pass

class PublishingError(FeedGeneratorError):
    """Feed publishing failed."""
    pass

Retry Strategy

class RetryConfig:
    """Configuration for retry behavior."""
    max_attempts: int = 3
    initial_delay: float = 1.0  # seconds
    backoff_factor: float = 2.0
    max_delay: float = 60.0

def with_retry(config: RetryConfig):
    """Decorator for retryable operations."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(config.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == config.max_attempts - 1:
                        raise
                    delay = min(
                        config.initial_delay * (config.backoff_factor ** attempt),
                        config.max_delay
                    )
                    logger.warning(f"Retry {attempt+1}/{config.max_attempts} after {delay}s: {e}")
                    time.sleep(delay)
        return wrapper
    return decorator
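
Illustrative usage on the analyzer's API call (the `_call_vision_api` method name is hypothetical):

```python
class ImageAnalyzer:
    @with_retry(RetryConfig())
    def _call_vision_api(self, image_url: str, context: str) -> str:
        ...  # single OpenAI request; transient failures are retried by the decorator
```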

Partial Failure Handling

def scrape_all(self) -> List[NewsArticle]:
    """Scrape all sources, continue on individual failures."""
    all_articles = []
    
    for source in self._config.sources:
        try:
            articles = self._scrape_source(source)
            all_articles.extend(articles)
            logger.info(f"Scraped {len(articles)} from {source}")
        except ScrapingError as e:
            logger.warning(f"Failed to scrape {source}: {e}")
            # Continue with other sources
            continue
    
    return all_articles

---

## TESTING STRATEGY

Test Pyramid

         E2E Tests (1-2)
           /          \
      Integration (5-10)
       /                \
  Unit Tests (20-30)

Unit Test Coverage

Each module has:

  • Happy path tests - Normal operation
  • Error condition tests - Each exception type
  • Edge case tests - Empty inputs, null values, limits
  • Mock external dependencies - No real HTTP calls

```python
# Example: scraper_test.py
def test_scrape_success():
    """Test successful scraping."""
    # Mock HTTP response
    # Assert correct NewsArticle returned

def test_scrape_timeout():
    """Test timeout handling."""
    # Mock timeout exception
    # Assert ScrapingError raised

def test_scrape_invalid_html():
    """Test malformed HTML handling."""
    # Mock invalid response
    # Assert error or empty result
```
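
A more concrete version of the happy-path test, assuming the scraper calls `requests.get` internally (an implementation assumption) and pytest-style tests:

```python
from unittest.mock import MagicMock, patch

def test_scrape_success_with_mocked_http() -> None:
    fake_html = "<html><article><h2>Headline</h2><a href='https://example.com/a'>x</a></article></html>"
    fake_response = MagicMock(status_code=200, text=fake_html)

    config = ScraperConfig(sources=["https://example.com/news"], max_articles=5, timeout_seconds=5)
    with patch("requests.get", return_value=fake_response):
        articles = NewsScraper(config).scrape("https://example.com/news")

    assert all(isinstance(a, NewsArticle) for a in articles)
```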

Integration Test Coverage

Test module interactions:

  • Scraper → Aggregator
  • Analyzer → Aggregator
  • Aggregator → API Client
  • End-to-end pipeline

```python
def test_pipeline_integration():
    """Test complete pipeline with mocked external services."""
    config = Config.from_dict(test_config)

    with mock_http_responses():
        with mock_openai_api():
            with mock_node_api():
                result = run_pipeline(config)

                assert len(result) > 0
                assert all(isinstance(a, GeneratedArticle) for a in result)
```

Test Data Strategy

tests/
├── fixtures/
│   ├── sample_news.html      # Mock HTML responses
│   ├── sample_api_response.json
│   └── sample_images.json
└── mocks/
    ├── mock_scraper.py
    ├── mock_analyzer.py
    └── mock_client.py

---

## PERFORMANCE CONSIDERATIONS

Current Targets (V1 Prototype)

  • Scraping: 5-10 articles/source in < 30s
  • Image analysis: < 5s per image (GPT-4V API latency)
  • Article generation: < 10s per article (Node API latency)
  • Total pipeline: < 5 minutes for 50 articles

Bottlenecks Identified

  1. Sequential API calls - GPT-4V and Node API
  2. Network latency - HTTP requests
  3. No caching - Repeated scraping of same sources

Future Optimizations (V2+)

# Parallel image analysis
async def analyze_batch_parallel(
    self,
    articles: List[NewsArticle]
) -> Dict[str, ImageAnalysis]:
    """Analyze images in parallel."""
    urls = [a.image_url for a in articles if a.image_url]
    tasks = [self._analyze_async(url) for url in urls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return {
        url: analysis
        for url, analysis in zip(urls, results)
        if not isinstance(analysis, Exception)
    }

Caching Strategy (Future)

@dataclass
class CacheConfig:
    scraper_ttl: int = 3600  # 1 hour
    analysis_ttl: int = 86400  # 24 hours

# Redis or simple file-based cache
cache = Cache(config.cache)

def scrape_with_cache(self, url: str) -> List[NewsArticle]:
    """Scrape with TTL-based caching."""
    cached = cache.get(f"scrape:{url}")
    if cached and not cache.is_expired(cached):
        return cached.data
    
    fresh = self._scrape_source(url)
    cache.set(f"scrape:{url}", fresh, ttl=self._config.cache.scraper_ttl)
    return fresh

---

## EXTENSIBILITY POINTS

Adding New News Sources

# 1. Add source-specific parser
class BBCParser(NewsParser):
    """Parser for BBC News."""
    
    def parse(self, html: str) -> List[NewsArticle]:
        """Extract articles from BBC HTML."""
        soup = BeautifulSoup(html, 'html.parser')
        articles: List[NewsArticle] = []
        # BBC-specific extraction logic populates the list
        return articles

# 2. Register parser
scraper.register_parser("bbc.com", BBCParser())

# 3. Add to configuration
NEWS_SOURCES=...,https://bbc.com/news

Adding Output Formats

# 1. Implement publisher interface
class JSONPublisher(Publisher):
    """Publish articles as JSON."""
    
    def publish(self, articles: List[GeneratedArticle]) -> None:
        """Write to JSON file."""
        with open(self._path, 'w') as f:
            json.dump([a.to_dict() for a in articles], f, indent=2)

# 2. Use in pipeline
publisher = JSONPublisher(Path("output/feed.json"))
publisher.publish(generated_articles)

Custom Processing Steps

# 1. Implement processor interface
class SEOOptimizer(Processor):
    """Add SEO metadata to articles."""
    
    def process(self, article: GeneratedArticle) -> GeneratedArticle:
        """Enhance with SEO tags."""
        # dataclasses.replace copies the article; copy the metadata dict so the original is untouched
        optimized = replace(article, metadata=dict(article.metadata))
        optimized.metadata['keywords'] = extract_keywords(article.generated_content)
        optimized.metadata['description'] = generate_meta_description(article.generated_content)
        return optimized

# 2. Add to pipeline
pipeline.add_processor(SEOOptimizer())

---

## MIGRATION PATH TO NODE.JS

Why Migrate Later?

This Python prototype will eventually be rewritten in Node.js/TypeScript because:

  1. Consistency - Same stack as article generation API
  2. Maintainability - One language for entire system
  3. Type safety - TypeScript strict mode
  4. Integration - Direct module imports instead of HTTP

What to Preserve

When migrating:

  • Module structure (same responsibilities)
  • Interface contracts (same types)
  • Configuration format (same env vars)
  • Error handling strategy (same exceptions)
  • Test coverage (same test cases)

Migration Strategy

// 1. Create TypeScript interfaces matching Python dataclasses
interface NewsArticle {
    title: string;
    url: string;
    content: string;
    imageUrl?: string;
}

// 2. Port modules one-by-one
class NewsScraper {
    async scrape(url: string): Promise<NewsArticle[]> {
        // Same logic as Python version
    }
}

// 3. Replace HTTP calls with direct imports
import { generateArticle } from './article-generator';

// Instead of HTTP POST
const article = await generateArticle(prompt);

Lessons to Apply

From this Python prototype to Node.js:

  • Use TypeScript strict mode from day 1
  • Define interfaces before implementation
  • Write tests alongside code
  • Use dependency injection
  • Explicit error types
  • No global state

---

## DEPLOYMENT CONSIDERATIONS

Development Environment

# Local development
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with API keys
python scripts/run.py

Production Deployment (Future)

# docker-compose.yml
version: '3.8'
services:
  feed-generator:
    build: .
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - NODE_API_URL=http://article-api:3000
    volumes:
      - ./output:/app/output
    restart: unless-stopped
  
  article-api:
    image: node-article-generator:latest
    ports:
      - "3000:3000"

Scheduling

# Cron job for periodic execution
0 */6 * * * cd /app/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1

---

## MONITORING & OBSERVABILITY

Logging Levels

# DEBUG - Detailed execution flow
logger.debug(f"Scraping URL: {url}")

# INFO - Major pipeline stages
logger.info(f"Scraped {len(articles)} articles")

# WARNING - Recoverable errors
logger.warning(f"Failed to scrape {source}, continuing")

# ERROR - Unrecoverable errors
logger.error(f"Pipeline failed: {e}", exc_info=True)

Metrics to Track

@dataclass
class PipelineMetrics:
    """Metrics for pipeline execution."""
    start_time: datetime
    end_time: datetime
    articles_scraped: int
    images_analyzed: int
    articles_generated: int
    articles_published: int
    errors: List[str]
    
    def duration(self) -> float:
        """Pipeline duration in seconds."""
        return (self.end_time - self.start_time).total_seconds()
    
    def success_rate(self) -> float:
        """Percentage of articles successfully processed."""
        if self.articles_scraped == 0:
            return 0.0
        return (self.articles_published / self.articles_scraped) * 100

Health Checks

def health_check() -> Dict[str, Any]:
    """Check system health."""
    return {
        "status": "healthy",
        "checks": {
            "openai_api": check_openai_connection(),
            "node_api": check_node_api_connection(),
            "disk_space": check_disk_space(),
        },
        "last_run": get_last_run_metrics(),
    }
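
The individual checks are thin wrappers; for example, a hypothetical `check_node_api_connection` variant that takes the base URL explicitly (the `/health` path is an assumption about the Node API):

```python
import requests

def check_node_api_connection(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the Node API answers its health endpoint."""
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).status_code == 200
    except requests.RequestException:
        return False
```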

---

## SECURITY CONSIDERATIONS

API Key Management

# ❌ NEVER commit API keys
OPENAI_API_KEY = "sk-..."  # FORBIDDEN

# ✅ Use environment variables
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable required")

Input Validation

def validate_url(url: str) -> bool:
    """Validate URL is safe to scrape."""
    parsed = urlparse(url)
    
    # Must be HTTP/HTTPS
    if parsed.scheme not in ('http', 'https'):
        return False
    
    # No localhost or private/loopback IPs (requires `import ipaddress`)
    if parsed.hostname in (None, 'localhost'):
        return False
    try:
        ip = ipaddress.ip_address(parsed.hostname)
        if ip.is_private or ip.is_loopback:
            return False
    except ValueError:
        pass  # hostname is not a literal IP address
    
    return True

Rate Limiting

class RateLimiter:
    """Simple rate limiter for API calls."""
    
    def __init__(self, calls_per_minute: int) -> None:
        self._calls_per_minute = calls_per_minute
        self._calls: List[datetime] = []
    
    def wait_if_needed(self) -> None:
        """Block if rate limit would be exceeded."""
        now = datetime.now()
        minute_ago = now - timedelta(minutes=1)
        
        # Remove old calls
        self._calls = [c for c in self._calls if c > minute_ago]
        
        if len(self._calls) >= self._calls_per_minute:
            sleep_time = (self._calls[0] - minute_ago).total_seconds()
            time.sleep(sleep_time)
        
        self._calls.append(now)
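
Illustrative usage around the image analysis calls (the 20 calls/minute figure is an assumption, not an OpenAI quota):

```python
limiter = RateLimiter(calls_per_minute=20)
for article in articles:
    if article.image_url:
        limiter.wait_if_needed()  # blocks until a call slot is available
        analyzer.analyze(article.image_url, article.title)
```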

---

## KNOWN LIMITATIONS (V1)

Scraping Limitations

  • Static HTML only - No JavaScript rendering
  • No anti-bot bypass - May be blocked by Cloudflare/etc
  • No authentication - Cannot access paywalled content
  • Site-specific parsing - Breaks if HTML structure changes

Analysis Limitations

  • Cost - GPT-4V API is expensive at scale
  • Latency - 3-5s per image analysis
  • Rate limits - OpenAI API quotas
  • No caching - Re-analyzes same images

Generation Limitations

  • Dependent on Node API - Single point of failure
  • No fallback - If API down, pipeline fails
  • Sequential processing - One article at a time

Publishing Limitations

  • Local files only - No cloud storage
  • No WordPress integration - RSS only
  • No scheduling - Manual execution

---

## FUTURE ENHANCEMENTS (Post-V1)

Phase 2: Robustness

  • Playwright for JavaScript-rendered sites
  • Retry logic with exponential backoff
  • Persistent queue for failed items
  • Health monitoring dashboard

Phase 3: Performance

  • Async/parallel processing
  • Redis caching layer
  • Connection pooling
  • Batch API requests

Phase 4: Features

  • WordPress integration
  • Multiple output formats
  • Content filtering rules
  • A/B testing for prompts

Phase 5: Migration to Node.js

  • Rewrite in TypeScript
  • Direct integration with article generator
  • Shared types/interfaces
  • Unified deployment

---

## DECISION LOG

Why Python for V1?

Decision: Use Python instead of Node.js

Rationale:

  • Better scraping libraries (BeautifulSoup, requests)
  • Simpler OpenAI SDK
  • Faster prototyping
  • Can be rewritten later

Why Not Async from Start?

Decision: Synchronous code for V1

Rationale:

  • Simpler to understand and debug
  • Performance not critical for prototype
  • Can add async in V2

Why Dataclasses over Dicts?

Decision: Use typed dataclasses everywhere

Rationale:

  • Type safety catches bugs early
  • Better IDE support
  • Self-documenting code
  • Easy to validate

Why No Database?

Decision: File-based storage for V1

Rationale:

  • Simpler deployment
  • No database management
  • Sufficient for prototype
  • Can add later if needed

End of ARCHITECTURE.md