feedgenerator/CLAUDE.md
StillHammer 40138c2d45 Initial implementation: Feed Generator V1
Complete Python implementation with strict type safety and best practices.

Features:
- RSS/Atom/HTML web scraping
- GPT-4 Vision image analysis
- Node.js API integration
- RSS/JSON feed publishing

Modules:
- src/config.py: Configuration with strict validation
- src/exceptions.py: Custom exception hierarchy
- src/scraper.py: Multi-format news scraping (RSS/Atom/HTML)
- src/image_analyzer.py: GPT-4 Vision integration with retry
- src/aggregator.py: Content aggregation and filtering
- src/article_client.py: Node.js API client with retry
- src/publisher.py: RSS/JSON feed generation
- scripts/run.py: Complete pipeline orchestrator
- scripts/validate.py: Code quality validation

Code Quality:
- 100% type hint coverage (mypy strict mode)
- Zero bare except clauses
- Logger throughout (no print statements)
- Comprehensive test suite (598 lines)
- Immutable dataclasses (frozen=True)
- Explicit error handling
- Structured logging

Stats:
- 1,431 lines of source code
- 598 lines of test code
- 15 Python files
- 8 core modules
- 4 test suites

All validation checks pass.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-07 22:28:18 +08:00

CLAUDE.md - Feed Generator Project Instructions

# CLAUDE.md - Feed Generator Development Instructions

> **CRITICAL**: This document contains mandatory rules for AI-assisted development with Claude Code.
> **NEVER** deviate from these rules without explicit human approval.

---

## PROJECT OVERVIEW

**Feed Generator** is a Python-based content aggregation system that:
1. Scrapes news from web sources
2. Analyzes images using GPT-4 Vision
3. Aggregates content into structured prompts
4. Calls existing Node.js article generation API
5. Publishes to feeds (RSS/WordPress)

**Philosophy**: Quick, functional prototype. NOT a production system yet.
**Timeline**: 3-5 days maximum for V1.
**Future**: May be rewritten in Node.js/TypeScript with strict architecture.

---

## CORE PRINCIPLES

### 1. Type Safety is MANDATORY

**NEVER write untyped Python code.**

```python
# ❌ FORBIDDEN - No type hints
def scrape_news(url):
    return requests.get(url)

# ✅ REQUIRED - Full type hints
from typing import Dict, Optional
import requests

def scrape_news(url: str) -> Optional[Dict[str, str]]:
    # Always pass a timeout so a stalled server cannot hang the pipeline
    response: requests.Response = requests.get(url, timeout=10)
    return response.json() if response.ok else None
```

**Rules:**

- Every function MUST have type hints for parameters and return values
- Use the `typing` module: `List`, `Dict`, `Optional`, `Union`, `Tuple`
- Use `from __future__ import annotations` for forward references
- Complex types should use `TypedDict` or dataclasses
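
The last rule can be illustrated with a minimal sketch: a `TypedDict` describes the shape of a raw payload, and a frozen dataclass holds the internal representation. The names `RawArticle`, `ParsedArticle`, and `parse_article` are hypothetical and do not come from the project source.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict


class RawArticle(TypedDict):
    """Shape of a raw dictionary payload (hypothetical field names)."""
    title: str
    url: str


@dataclass(frozen=True)
class ParsedArticle:
    """Immutable internal representation of an article."""
    title: str
    url: str
    image_url: Optional[str] = None


def parse_article(raw: RawArticle) -> ParsedArticle:
    """Convert a raw payload into the internal dataclass."""
    return ParsedArticle(title=raw["title"], url=raw["url"])


article = parse_article({"title": "Hello", "url": "https://example.com/a"})
```

With this split, mypy can check key names on the dictionary side, while the dataclass gives attribute access and immutability on the internal side.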

### 2. Explicit is Better Than Implicit

**NEVER use magic or implicit behavior.**

```python
# ❌ FORBIDDEN - Implicit dictionary keys
def process(data):
    return data['title']  # What if 'title' doesn't exist?

# ✅ REQUIRED - Explicit with error handling
def process(data: Dict[str, str]) -> str:
    if 'title' not in data:
        raise ValueError("Missing required key: 'title'")
    return data['title']
```

### 3. Fail Fast and Loud

**NEVER silently swallow errors.**

```python
# ❌ FORBIDDEN - Silent failure
try:
    result = dangerous_operation()
except:
    result = None

# ✅ REQUIRED - Explicit error handling
try:
    result = dangerous_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}")
    raise
```

### 4. Single Responsibility Modules

Each module has ONE clear purpose.

- `scraper.py` - ONLY scraping logic
- `image_analyzer.py` - ONLY image analysis
- `article_client.py` - ONLY API communication
- `aggregator.py` - ONLY content aggregation
- `publisher.py` - ONLY feed publishing

**NEVER mix responsibilities.**

---

## FORBIDDEN PATTERNS

### NEVER Use These

```python
# 1. Bare except
try:
    something()
except:  # ❌ FORBIDDEN
    pass

# 2. Mutable default arguments
def func(items=[]):  # ❌ FORBIDDEN
    items.append(1)
    return items

# 3. Global state
CACHE = {}  # ❌ FORBIDDEN at module level

def use_cache():
    CACHE['key'] = 'value'

# 4. Star imports
from module import *  # ❌ FORBIDDEN

# 5. Untyped functions
def process(data):  # ❌ FORBIDDEN - no types
    return data

# 6. Magic strings
if mode == "production":  # ❌ FORBIDDEN
    do_something()

# 7. Implicit None returns
def maybe_returns():  # ❌ FORBIDDEN - unclear return
    if condition:
        return value

# 8. Nested functions for reuse
def outer():
    def inner():  # ❌ FORBIDDEN if used multiple times
        pass
    inner()
    inner()
```

### REQUIRED Patterns

```python
# 1. Specific exceptions
try:
    something()
except ValueError as e:  # ✅ REQUIRED
    logger.error(f"Value error: {e}")
    raise

# 2. Immutable defaults
def func(items: Optional[List[str]] = None) -> List[str]:  # ✅ REQUIRED
    if items is None:
        items = []
    items.append('new')
    return items

# 3. Explicit configuration objects
from dataclasses import dataclass

@dataclass
class CacheConfig:
    max_size: int
    ttl_seconds: int

cache = Cache(config=CacheConfig(max_size=100, ttl_seconds=60))

# 4. Explicit imports
from module import SpecificClass, specific_function  # ✅ REQUIRED

# 5. Typed functions
def process(data: Dict[str, Any]) -> Optional[str]:  # ✅ REQUIRED
    return data.get('value')

# 6. Enums for constants
from enum import Enum

class Mode(Enum):  # ✅ REQUIRED
    PRODUCTION = "production"
    DEVELOPMENT = "development"

if mode == Mode.PRODUCTION:
    do_something()

# 7. Explicit Optional returns
def maybe_returns() -> Optional[str]:  # ✅ REQUIRED
    if condition:
        return value
    return None

# 8. Extract functions to module level
def inner_logic() -> None:  # ✅ REQUIRED
    pass

def outer() -> None:
    inner_logic()
    inner_logic()
```
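
Pattern 3 refers to a `Cache` class that is not defined in this document. As a minimal sketch of what replacing module-level global state with an injected configuration object could look like, the class below keeps all state on the instance. The `Cache` implementation, its eviction rule, and its method names are assumptions made for illustration, and `CacheConfig` is frozen here to match the configuration section.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CacheConfig:
    max_size: int
    ttl_seconds: int


class Cache:
    """Small TTL cache; all state lives on the instance, not the module."""

    def __init__(self, config: CacheConfig) -> None:
        self._config = config
        self._entries: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        """Store a value, evicting the oldest inserted key when full."""
        if key not in self._entries and len(self._entries) >= self._config.max_size:
            # dicts preserve insertion order, so the first key is the oldest
            oldest_key = next(iter(self._entries))
            del self._entries[oldest_key]
        self._entries[key] = (time.monotonic(), value)

    def get(self, key: str) -> Optional[str]:
        """Return the cached value, or None if missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self._config.ttl_seconds:
            del self._entries[key]
            return None
        return value


cache = Cache(config=CacheConfig(max_size=2, ttl_seconds=60))
cache.set("a", "1")
cache.set("b", "2")
cache.set("c", "3")  # cache is full, so "a" is evicted
```

Because each `Cache` owns its entries, tests can create a fresh instance per case instead of resetting a shared global.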

## MODULE STRUCTURE

### Standard Module Template

**Every module MUST follow this structure:**

```python
"""
Module: module_name.py
Purpose: [ONE sentence describing ONLY responsibility]
Dependencies: [List external dependencies]
"""

from __future__ import annotations

# Standard library imports
import logging
from typing import Dict, List, Optional

# Third-party imports
import requests
from bs4 import BeautifulSoup

# Local imports
from .config import Config

# Module-level logger
logger = logging.getLogger(__name__)


class ModuleName:
    """[Clear description of class responsibility]"""

    def __init__(self, config: Config) -> None:
        """Initialize with configuration.

        Args:
            config: Configuration object

        Raises:
            ValueError: If config is invalid
        """
        self._config = config
        self._validate_config()

    def _validate_config(self) -> None:
        """Validate configuration."""
        if not self._config.api_key:
            raise ValueError("API key is required")

    def public_method(self, param: str) -> Optional[Dict[str, str]]:
        """[Clear description]

        Args:
            param: [Description]

        Returns:
            [Description of return value]

        Raises:
            [Exceptions that can be raised]
        """
        try:
            result = self._internal_logic(param)
            return result
        except SpecificException as e:
            logger.error(f"Failed to process {param}: {e}")
            raise

    def _internal_logic(self, param: str) -> Dict[str, str]:
        """Private methods use underscore prefix."""
        return {"key": param}
```

## CONFIGURATION MANAGEMENT

**NEVER hardcode values. Use configuration objects.**

### config.py Structure

```python
"""Configuration management for Feed Generator."""

from __future__ import annotations

import os
from dataclasses import dataclass
from typing import List
from pathlib import Path


@dataclass(frozen=True)  # Immutable
class APIConfig:
    """Configuration for external APIs."""
    openai_key: str
    node_api_url: str
    timeout_seconds: int = 30


@dataclass(frozen=True)
class ScraperConfig:
    """Configuration for news scraping."""
    sources: List[str]
    max_articles: int = 10
    timeout_seconds: int = 10


@dataclass(frozen=True)
class Config:
    """Main configuration object."""
    api: APIConfig
    scraper: ScraperConfig
    log_level: str = "INFO"

    @classmethod
    def from_env(cls) -> Config:
        """Load configuration from environment variables.

        Returns:
            Loaded configuration

        Raises:
            ValueError: If required environment variables are missing
        """
        openai_key = os.getenv("OPENAI_API_KEY")
        if not openai_key:
            raise ValueError("OPENAI_API_KEY environment variable required")

        node_api_url = os.getenv("NODE_API_URL", "http://localhost:3000")

        sources_str = os.getenv("NEWS_SOURCES", "")
        sources = [s.strip() for s in sources_str.split(",") if s.strip()]

        if not sources:
            raise ValueError("NEWS_SOURCES environment variable required")

        return cls(
            api=APIConfig(
                openai_key=openai_key,
                node_api_url=node_api_url
            ),
            scraper=ScraperConfig(
                sources=sources
            )
        )
```
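
The effect of `frozen=True` can be seen in a short sketch: assigning to a field raises `dataclasses.FrozenInstanceError`, and `dataclasses.replace` builds a modified copy instead. The key value `"sk-test"` is a placeholder, not a real credential.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class APIConfig:
    """Same fields as the APIConfig shown in config.py."""
    openai_key: str
    node_api_url: str
    timeout_seconds: int = 30


config = APIConfig(openai_key="sk-test", node_api_url="http://localhost:3000")

try:
    config.timeout_seconds = 60  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    # Frozen dataclasses reject attribute assignment after construction
    mutated = False

# To change a value, build a new object instead of mutating in place
slow_config = replace(config, timeout_seconds=60)
```

Here the original `config` keeps its 30-second timeout, and only `slow_config` carries the new value.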

## ERROR HANDLING STRATEGY

### 1. Define Custom Exceptions

```python
"""Custom exceptions for Feed Generator."""

class FeedGeneratorError(Exception):
    """Base exception for all Feed Generator errors."""
    pass


class ScrapingError(FeedGeneratorError):
    """Raised when scraping fails."""
    pass


class ImageAnalysisError(FeedGeneratorError):
    """Raised when image analysis fails."""
    pass


class APIClientError(FeedGeneratorError):
    """Raised when API communication fails."""
    pass
```

### 2. Use Specific Error Handling

```python
def scrape_news(url: str) -> Dict[str, str]:
    """Scrape news from URL.

    Raises:
        ScrapingError: If scraping fails
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.Timeout as e:
        raise ScrapingError(f"Timeout scraping {url}") from e
    except requests.RequestException as e:
        raise ScrapingError(f"Failed to scrape {url}") from e

    try:
        # Annotate the result: response.json() returns Any, which
        # warn_return_any would otherwise flag on the return statement
        data: Dict[str, str] = response.json()
    except ValueError as e:
        raise ScrapingError(f"Invalid JSON from {url}") from e
    return data
```

### 3. Log Before Raising

```python
def critical_operation() -> None:
    """Perform critical operation."""
    try:
        result = dangerous_call()
    except SpecificError as e:
        logger.error(f"Critical operation failed: {e}", exc_info=True)
        raise  # Re-raise after logging
```

## TESTING REQUIREMENTS

### Every Module MUST Have Tests

```python
"""Test module for scraper.py"""

import pytest
import requests  # needed for requests.Timeout below
from unittest.mock import Mock, patch

from src.scraper import NewsScraper
from src.config import ScraperConfig
from src.exceptions import ScrapingError


def test_scraper_success() -> None:
    """Test successful scraping."""
    config = ScraperConfig(sources=["https://example.com"])
    scraper = NewsScraper(config)

    with patch('requests.get') as mock_get:
        mock_response = Mock()
        mock_response.ok = True
        mock_response.json.return_value = {"title": "Test"}
        mock_get.return_value = mock_response

        result = scraper.scrape("https://example.com")

        assert result is not None
        assert result["title"] == "Test"


def test_scraper_timeout() -> None:
    """Test scraping timeout."""
    config = ScraperConfig(sources=["https://example.com"])
    scraper = NewsScraper(config)

    with patch('requests.get', side_effect=requests.Timeout):
        with pytest.raises(ScrapingError):
            scraper.scrape("https://example.com")
```

## LOGGING STRATEGY

### Standard Logger Setup

```python
import logging
import sys

def setup_logging(level: str = "INFO") -> None:
    """Setup logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR)
    """
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.StreamHandler(sys.stdout),
            logging.FileHandler('feed_generator.log')
        ]
    )

# In each module
logger = logging.getLogger(__name__)
```

### Logging Best Practices

```python
# ✅ REQUIRED - Structured logging
logger.info(f"Scraping {url}", extra={"url": url, "attempt": 1})

# ✅ REQUIRED - Log exceptions with context
try:
    result = operation()
except Exception as e:
    logger.error("Operation failed", exc_info=True, extra={"context": data})
    raise

# ❌ FORBIDDEN - Print statements
print("Debug info")  # Use logger.debug() instead
```
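
Note that the format string in `setup_logging` does not print fields passed through `extra`; they are only attached to the `LogRecord` as attributes. One possible way to surface them, sketched here as an assumption rather than as the project's actual setup, is a formatter that serializes each record to JSON. `JsonFormatter` and the logger name `feed_generator.demo` are hypothetical.

```python
import io
import json
import logging
from typing import Any, Dict

# Attribute names present on every LogRecord; anything else came from `extra`
_STANDARD_ATTRS = set(vars(logging.makeLogRecord({}))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including `extra` fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload: Dict[str, Any] = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        # default=str keeps non-JSON-serializable extras from raising
        return json.dumps(payload, default=str)


# Demo: log to an in-memory stream so the output can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

demo_logger = logging.getLogger("feed_generator.demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False

demo_logger.info("Scraping started", extra={"url": "https://example.com", "attempt": 1})
line = json.loads(stream.getvalue())
```

After this runs, `line` contains the `url` and `attempt` keys alongside the level, logger name, and message.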

## DEPENDENCIES MANAGEMENT

### requirements.txt Structure

```text
# Core dependencies
requests==2.31.0
beautifulsoup4==4.12.2
openai==1.3.0

# Utilities
python-dotenv==1.0.0

# Testing
pytest==7.4.3
pytest-cov==4.1.0

# Type checking
mypy==1.7.1
types-requests==2.31.0
```

### Installing Dependencies

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .
```

## TYPE CHECKING WITH MYPY

### mypy.ini Configuration

```ini
[mypy]
python_version = 3.11
warn_return_any = True
warn_unused_configs = True
disallow_untyped_defs = True
disallow_any_unimported = True
no_implicit_optional = True
warn_redundant_casts = True
warn_unused_ignores = True
warn_no_return = True
check_untyped_defs = True
strict_equality = True
```

### Running Type Checks

```bash
# Type check all code
mypy src/

# MUST pass before committing
```

## COMMON PATTERNS

### 1. Retry Logic

```python
import logging
import time
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar('T')

def retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 1.0
) -> T:
    """Retry a function with exponential backoff.

    Args:
        func: Zero-argument function to retry
        max_attempts: Maximum number of attempts (at least 1)
        delay_seconds: Initial delay between retries

    Returns:
        Function result

    Raises:
        ValueError: If max_attempts is less than 1
        Exception: Last exception if all retries fail
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_exception: Optional[Exception] = None

    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as e:
            last_exception = e
            if attempt < max_attempts - 1:
                sleep_time = delay_seconds * (2 ** attempt)
                logger.warning(
                    f"Attempt {attempt + 1} failed, retrying in {sleep_time}s",
                    extra={"exception": str(e)}
                )
                time.sleep(sleep_time)

    # The loop ran at least once, so an exception was recorded
    assert last_exception is not None
    raise last_exception
```

### 2. Data Validation

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Article:
    """Validated article data."""
    title: str
    url: str
    image_url: Optional[str] = None

    def __post_init__(self) -> None:
        """Validate data after initialization."""
        if not self.title:
            raise ValueError("Title cannot be empty")
        if not self.url.startswith(('http://', 'https://')):
            raise ValueError(f"Invalid URL: {self.url}")
```

### 3. Context Managers for Resources

```python
from contextlib import contextmanager
from typing import Generator

@contextmanager
def api_client(config: APIConfig) -> Generator[APIClient, None, None]:
    """Context manager for API client.

    Yields:
        Configured API client
    """
    client = APIClient(config)
    try:
        client.connect()
        yield client
    finally:
        client.disconnect()

# Usage
with api_client(config) as client:
    result = client.call()
```

## WORKING WITH EXTERNAL APIS

### OpenAI GPT-4 Vision

```python
from openai import OpenAI
from typing import Optional

class ImageAnalyzer:
    """Analyze images using GPT-4 Vision."""

    def __init__(self, api_key: str) -> None:
        self._client = OpenAI(api_key=api_key)

    def analyze_image(self, image_url: str, prompt: str) -> Optional[str]:
        """Analyze image with custom prompt.

        Args:
            image_url: URL of image to analyze
            prompt: Analysis prompt

        Returns:
            Analysis text, or None if the response contains no content

        Raises:
            ImageAnalysisError: If analysis fails
        """
        try:
            response = self._client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": image_url}}
                    ]
                }],
                max_tokens=300
            )
            return response.choices[0].message.content
        except Exception as e:
            logger.error(f"Image analysis failed: {e}")
            raise ImageAnalysisError(f"Failed to analyze {image_url}") from e
```

### Calling Node.js API

```python
import requests
from typing import Any, Dict, Optional

class ArticleAPIClient:
    """Client for Node.js article generation API."""

    def __init__(self, base_url: str, timeout: int = 30) -> None:
        self._base_url = base_url.rstrip('/')
        self._timeout = timeout

    def generate_article(
        self,
        topic: str,
        context: str,
        image_description: Optional[str] = None
    ) -> Dict[str, Any]:
        """Generate article via API.

        Args:
            topic: Article topic
            context: Context information
            image_description: Optional image description

        Returns:
            Generated article data

        Raises:
            APIClientError: If API call fails
        """
        payload = {
            "topic": topic,
            "context": context,
        }
        if image_description:
            payload["image_description"] = image_description

        try:
            response = requests.post(
                f"{self._base_url}/api/generate",
                json=payload,
                timeout=self._timeout
            )
            response.raise_for_status()
            # Annotate the result so warn_return_any does not flag the return
            data: Dict[str, Any] = response.json()
            return data
        except requests.RequestException as e:
            logger.error(f"API call failed: {e}")
            raise APIClientError("Article generation failed") from e
```

## WHEN TO ASK FOR HUMAN INPUT

### Claude Code MUST ask before:

1. Changing module structure - Architecture changes
2. Adding new dependencies - New libraries
3. Changing configuration format - Breaking changes
4. Implementing complex logic - Business rules
5. Error handling strategy - Recovery approaches
6. Performance optimizations - Trade-offs

### Claude Code CAN proceed without asking:

1. Adding type hints - Always required
2. Adding logging - Always beneficial
3. Adding tests - Always needed
4. Fixing obvious bugs - Clear errors
5. Improving documentation - Clarity improvements
6. Refactoring for clarity - Same behavior, better code

## DEVELOPMENT WORKFLOW

### 1. Start with Types and Interfaces

```python
# Define data structures FIRST
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NewsArticle:
    title: str
    url: str
    content: str
    image_url: Optional[str] = None

@dataclass
class AnalyzedArticle:
    news: NewsArticle
    image_description: Optional[str] = None
```

### 2. Implement Core Logic

```python
# Then implement with clear types
def scrape_news(url: str) -> List[NewsArticle]:
    """Implementation with clear contract."""
    pass
```

### 3. Add Tests

```python
def test_scrape_news() -> None:
    """Test before considering feature complete."""
    pass
```

### 4. Integrate

```python
def pipeline(url: str) -> None:
    """Combine modules with clear flow."""
    articles = scrape_news(url)
    analyzed = analyze_images(articles)
    generated = generate_articles(analyzed)
    publish_feed(generated)
```
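
The integration step can be made concrete with a small runnable sketch covering only the first two stages (scrape, then analyze). Passing the stages in as callables is an assumption made here so the flow can run without network access; `run_pipeline`, `fake_scrape`, and `fake_describe` are hypothetical names, and the dataclasses mirror the ones defined in step 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class NewsArticle:
    title: str
    url: str
    content: str
    image_url: Optional[str] = None


@dataclass(frozen=True)
class AnalyzedArticle:
    news: NewsArticle
    image_description: Optional[str] = None


def analyze_images(
    articles: List[NewsArticle],
    describe: Callable[[str], str],
) -> List[AnalyzedArticle]:
    """Attach a description to every article that has an image."""
    return [
        AnalyzedArticle(
            news=article,
            image_description=describe(article.image_url) if article.image_url else None,
        )
        for article in articles
    ]


def run_pipeline(
    scrape: Callable[[str], List[NewsArticle]],
    describe: Callable[[str], str],
    source_url: str,
) -> List[AnalyzedArticle]:
    """Scrape one source, then analyze the images of the scraped articles."""
    return analyze_images(scrape(source_url), describe)


# In-memory stand-ins for the real scraper and image analyzer
def fake_scrape(url: str) -> List[NewsArticle]:
    return [
        NewsArticle("With image", f"{url}/1", "body", image_url=f"{url}/1.png"),
        NewsArticle("No image", f"{url}/2", "body"),
    ]


def fake_describe(image_url: str) -> str:
    return f"description of {image_url}"


result = run_pipeline(fake_scrape, fake_describe, "https://example.com")
```

Injecting the stages keeps each module's single responsibility intact and lets a test exercise the flow with stubs.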

## CRITICAL REMINDERS

1. Type hints are NOT optional - Every function must be typed
2. Error handling is NOT optional - Every external call must have error handling
3. Logging is NOT optional - Every significant operation must be logged
4. Tests are NOT optional - Every module must have tests
5. Configuration is NOT optional - No hardcoded values

If you find yourself thinking "I'll add types/tests/docs later" - STOP. Do it now.

If code works but isn't typed/tested/documented - It's NOT done.

This is NOT Node.js with its loose culture - Python gives us the tools for rigor, USE THEM.

---

## SUCCESS CRITERIA

A module is complete when:

- All functions have type hints
- mypy passes with no errors
- All tests pass
- Test coverage > 80%
- No print statements (use logger)
- No bare excepts
- No magic strings (use Enums)
- Documentation is clear and complete
- Error handling is explicit
- Configuration is externalized

If ANY of these is missing, the module is NOT complete.
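
Two of these criteria (no print statements, no bare excepts) can be checked mechanically. The following is a minimal sketch using the standard `ast` module; it is not the content of the project's `scripts/validate.py`, and `find_violations` is a hypothetical name.

```python
import ast
from typing import List, Tuple


def find_violations(source: str) -> List[str]:
    """Return one message per bare `except:` or `print(...)` call in source."""
    found: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            # An ExceptHandler with no exception type is a bare `except:`
            found.append((node.lineno, "bare except"))
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            found.append((node.lineno, "print statement"))
    # Sort by line number so the report reads top to bottom
    return [f"line {lineno}: {message}" for lineno, message in sorted(found)]


sample = (
    "try:\n"
    "    risky()\n"
    "except:\n"
    "    print('failed')\n"
)
report = find_violations(sample)
# report == ["line 3: bare except", "line 4: print statement"]
```

Working on the syntax tree rather than on raw text avoids false positives from the word `print` appearing inside strings or comments.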