feedgenerator/CLAUDE.md
StillHammer 40138c2d45 Initial implementation: Feed Generator V1
Complete Python implementation with strict type safety and best practices.

Features:
- RSS/Atom/HTML web scraping
- GPT-4 Vision image analysis
- Node.js API integration
- RSS/JSON feed publishing

Modules:
- src/config.py: Configuration with strict validation
- src/exceptions.py: Custom exception hierarchy
- src/scraper.py: Multi-format news scraping (RSS/Atom/HTML)
- src/image_analyzer.py: GPT-4 Vision integration with retry
- src/aggregator.py: Content aggregation and filtering
- src/article_client.py: Node.js API client with retry
- src/publisher.py: RSS/JSON feed generation
- scripts/run.py: Complete pipeline orchestrator
- scripts/validate.py: Code quality validation

Code Quality:
- 100% type hint coverage (mypy strict mode)
- Zero bare except clauses
- Logger throughout (no print statements)
- Comprehensive test suite (598 lines)
- Immutable dataclasses (frozen=True)
- Explicit error handling
- Structured logging

Stats:
- 1,431 lines of source code
- 598 lines of test code
- 15 Python files
- 8 core modules
- 4 test suites

All validation checks pass.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-07 22:28:18 +08:00


# CLAUDE.md - Feed Generator Project Instructions
```markdown
# CLAUDE.md - Feed Generator Development Instructions
> **CRITICAL**: This document contains mandatory rules for AI-assisted development with Claude Code.
> **NEVER** deviate from these rules without explicit human approval.
---
## PROJECT OVERVIEW
**Feed Generator** is a Python-based content aggregation system that:
1. Scrapes news from web sources
2. Analyzes images using GPT-4 Vision
3. Aggregates content into structured prompts
4. Calls existing Node.js article generation API
5. Publishes to feeds (RSS/WordPress)
**Philosophy**: Quick, functional prototype. NOT a production system yet.
**Timeline**: 3-5 days maximum for V1.
**Future**: May be rewritten in Node.js/TypeScript with strict architecture.
---
## CORE PRINCIPLES
### 1. Type Safety is MANDATORY
**NEVER write untyped Python code.**
```python
# ❌ FORBIDDEN - No type hints
def scrape_news(url):
    return requests.get(url)

# ✅ REQUIRED - Full type hints
from typing import Dict, Optional

import requests

def scrape_news(url: str) -> Optional[Dict[str, str]]:
    # An explicit timeout keeps a stalled server from hanging the call
    response: requests.Response = requests.get(url, timeout=10)
    return response.json() if response.ok else None
```
**Rules:**
- Every function MUST have type hints for parameters and return values
- Use `typing` module: `List`, `Dict`, `Optional`, `Union`, `Tuple`
- Use `from __future__ import annotations` for forward references
- Complex types should use `TypedDict` or `dataclasses`
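As a minimal sketch of the last rule, a scraped item could be modeled with `TypedDict` while it is still a plain dictionary, and with a frozen dataclass once it is constructed in code. The names `RawArticle`, `ParsedArticle`, and `parse_article` are hypothetical and not part of the project.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, TypedDict


class RawArticle(TypedDict):
    """Shape of a scraped item while it is still a plain dict (hypothetical)."""

    title: str
    url: str


@dataclass(frozen=True)
class ParsedArticle:
    """Typed, immutable representation of an article (hypothetical)."""

    title: str
    url: str
    image_url: Optional[str] = None


def parse_article(raw: RawArticle) -> ParsedArticle:
    """Convert a raw dict into a typed dataclass instance."""
    return ParsedArticle(title=raw["title"], url=raw["url"])
```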
### 2. Explicit is Better Than Implicit
**NEVER use magic or implicit behavior.**
```python
# ❌ FORBIDDEN - Implicit dictionary keys
def process(data):
    return data['title']  # What if 'title' doesn't exist?

# ✅ REQUIRED - Explicit with error handling
def process(data: Dict[str, str]) -> str:
    if 'title' not in data:
        raise ValueError("Missing required key: 'title'")
    return data['title']
```
### 3. Fail Fast and Loud
**NEVER silently swallow errors.**
```python
# ❌ FORBIDDEN - Silent failure
try:
    result = dangerous_operation()
except:
    result = None

# ✅ REQUIRED - Explicit error handling
try:
    result = dangerous_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}")
    raise
```
### 4. Single Responsibility Modules
**Each module has ONE clear purpose.**
- `scraper.py` - ONLY scraping logic
- `image_analyzer.py` - ONLY image analysis
- `article_client.py` - ONLY API communication
- `aggregator.py` - ONLY content aggregation
- `publisher.py` - ONLY feed publishing
**NEVER mix responsibilities.**
---
## FORBIDDEN PATTERNS
### ❌ NEVER Use These
```python
# 1. Bare except
try:
    something()
except:  # ❌ FORBIDDEN
    pass

# 2. Mutable default arguments
def func(items=[]):  # ❌ FORBIDDEN
    items.append(1)
    return items

# 3. Global state
CACHE = {}  # ❌ FORBIDDEN at module level
def use_cache():
    CACHE['key'] = 'value'

# 4. Star imports
from module import *  # ❌ FORBIDDEN

# 5. Untyped functions
def process(data):  # ❌ FORBIDDEN - no types
    return data

# 6. Magic strings
if mode == "production":  # ❌ FORBIDDEN
    do_something()

# 7. Implicit None returns
def maybe_returns():  # ❌ FORBIDDEN - unclear return
    if condition:
        return value

# 8. Nested functions for reuse
def outer():
    def inner():  # ❌ FORBIDDEN if used multiple times
        pass
    inner()
    inner()
```
### ✅ REQUIRED Patterns
```python
from typing import Any, Dict, List, Optional

# 1. Specific exceptions
try:
    something()
except ValueError as e:  # ✅ REQUIRED
    logger.error(f"Value error: {e}")
    raise

# 2. Immutable defaults
def func(items: Optional[List[str]] = None) -> List[str]:  # ✅ REQUIRED
    if items is None:
        items = []
    items.append('new')
    return items

# 3. Explicit configuration objects
from dataclasses import dataclass

@dataclass
class CacheConfig:
    max_size: int
    ttl_seconds: int

cache = Cache(config=CacheConfig(max_size=100, ttl_seconds=60))

# 4. Explicit imports
from module import SpecificClass, specific_function  # ✅ REQUIRED

# 5. Typed functions
def process(data: Dict[str, Any]) -> Optional[str]:  # ✅ REQUIRED
    return data.get('value')

# 6. Enums for constants
from enum import Enum

class Mode(Enum):  # ✅ REQUIRED
    PRODUCTION = "production"
    DEVELOPMENT = "development"

if mode == Mode.PRODUCTION:
    do_something()

# 7. Explicit Optional returns
def maybe_returns() -> Optional[str]:  # ✅ REQUIRED
    if condition:
        return value
    return None

# 8. Extract functions to module level
def inner_logic() -> None:  # ✅ REQUIRED
    pass

def outer() -> None:
    inner_logic()
    inner_logic()
```
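To illustrate why the mutable-default rule exists, this short sketch contrasts a list default with the `None` default. The default list is created once, when the function is defined, so every call that omits the argument shares it. The function names `append_bad` and `append_good` are illustrative only.

```python
from typing import List, Optional


def append_bad(item: str, items: List[str] = []) -> List[str]:
    # The default list is shared across all calls that omit `items`
    items.append(item)
    return items


def append_good(item: str, items: Optional[List[str]] = None) -> List[str]:
    if items is None:
        items = []  # fresh list on every call
    items.append(item)
    return items


first_bad = append_bad("a")
second_bad = append_bad("b")    # same list object as first_bad
first_good = append_good("a")
second_good = append_good("b")  # independent list
```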
---
## MODULE STRUCTURE
### Standard Module Template
Every module MUST follow this structure:
```python
"""
Module: module_name.py
Purpose: [ONE sentence describing ONLY responsibility]
Dependencies: [List external dependencies]
"""
from __future__ import annotations

# Standard library imports
import logging
from typing import Dict, List, Optional

# Third-party imports
import requests
from bs4 import BeautifulSoup

# Local imports
from .config import Config

# Module-level logger
logger = logging.getLogger(__name__)


class ModuleName:
    """[Clear description of class responsibility]"""

    def __init__(self, config: Config) -> None:
        """Initialize with configuration.

        Args:
            config: Configuration object

        Raises:
            ValueError: If config is invalid
        """
        self._config = config
        self._validate_config()

    def _validate_config(self) -> None:
        """Validate configuration."""
        if not self._config.api_key:
            raise ValueError("API key is required")

    def public_method(self, param: str) -> Optional[Dict[str, str]]:
        """[Clear description]

        Args:
            param: [Description]

        Returns:
            [Description of return value]

        Raises:
            [Exceptions that can be raised]
        """
        try:
            result = self._internal_logic(param)
            return result
        except SpecificException as e:
            logger.error(f"Failed to process {param}: {e}")
            raise

    def _internal_logic(self, param: str) -> Dict[str, str]:
        """Private methods use underscore prefix."""
        return {"key": param}
```
---
## CONFIGURATION MANAGEMENT
**NEVER hardcode values. Use configuration objects.**
### config.py Structure
```python
"""Configuration management for Feed Generator."""
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import List
from pathlib import Path


@dataclass(frozen=True)  # Immutable
class APIConfig:
    """Configuration for external APIs."""
    openai_key: str
    node_api_url: str
    timeout_seconds: int = 30


@dataclass(frozen=True)
class ScraperConfig:
    """Configuration for news scraping."""
    sources: List[str]
    max_articles: int = 10
    timeout_seconds: int = 10


@dataclass(frozen=True)
class Config:
    """Main configuration object."""
    api: APIConfig
    scraper: ScraperConfig
    log_level: str = "INFO"

    @classmethod
    def from_env(cls) -> Config:
        """Load configuration from environment variables.

        Returns:
            Loaded configuration

        Raises:
            ValueError: If required environment variables are missing
        """
        openai_key = os.getenv("OPENAI_API_KEY")
        if not openai_key:
            raise ValueError("OPENAI_API_KEY environment variable required")

        node_api_url = os.getenv("NODE_API_URL", "http://localhost:3000")

        sources_str = os.getenv("NEWS_SOURCES", "")
        sources = [s.strip() for s in sources_str.split(",") if s.strip()]
        if not sources:
            raise ValueError("NEWS_SOURCES environment variable required")

        return cls(
            api=APIConfig(
                openai_key=openai_key,
                node_api_url=node_api_url
            ),
            scraper=ScraperConfig(
                sources=sources
            )
        )
```
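A short sketch of what `frozen=True` provides in practice: assigning to a field raises `dataclasses.FrozenInstanceError`, and changed settings are produced with `dataclasses.replace`, which returns a new instance. `ScraperSettings` is a stand-in defined here so the snippet runs on its own; it is not the project's `ScraperConfig`. It uses a tuple for `sources` because a list held by a frozen dataclass can still be mutated in place.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ScraperSettings:
    """Stand-in for a frozen config class (hypothetical)."""

    sources: Tuple[str, ...]
    max_articles: int = 10


settings = ScraperSettings(sources=("https://example.com/rss",))

try:
    settings.max_articles = 50  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False  # frozen dataclasses reject attribute assignment

# replace() builds a new instance and leaves the original untouched
bigger = replace(settings, max_articles=50)
```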
---
## ERROR HANDLING STRATEGY
### 1. Define Custom Exceptions
```python
"""Custom exceptions for Feed Generator."""
class FeedGeneratorError(Exception):
    """Base exception for all Feed Generator errors."""
    pass


class ScrapingError(FeedGeneratorError):
    """Raised when scraping fails."""
    pass


class ImageAnalysisError(FeedGeneratorError):
    """Raised when image analysis fails."""
    pass


class APIClientError(FeedGeneratorError):
    """Raised when API communication fails."""
    pass
```
### 2. Use Specific Error Handling
```python
from typing import Dict

import requests

def scrape_news(url: str) -> Dict[str, str]:
    """Scrape news from URL.

    Raises:
        ScrapingError: If scraping fails
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.Timeout as e:
        raise ScrapingError(f"Timeout scraping {url}") from e
    except requests.RequestException as e:
        raise ScrapingError(f"Failed to scrape {url}") from e

    try:
        return response.json()
    except ValueError as e:
        raise ScrapingError(f"Invalid JSON from {url}") from e
```
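The `raise ... from e` form used above keeps the original error attached as `__cause__`, and the shared base class lets a caller catch every project error in one clause. The following is a minimal sketch of both effects; `parse_article_count` is a hypothetical helper, and the two exception classes are repeated only so the snippet runs on its own.

```python
from typing import Optional


class FeedGeneratorError(Exception):
    """Base exception, mirroring the hierarchy defined earlier."""


class ScrapingError(FeedGeneratorError):
    """Raised when scraping fails."""


def parse_article_count(raw: str) -> int:
    """Parse a count field, translating the low-level error (hypothetical helper)."""
    try:
        return int(raw)
    except ValueError as e:
        raise ScrapingError(f"Invalid article count: {raw!r}") from e


caught: Optional[FeedGeneratorError] = None
try:
    parse_article_count("ten")
except FeedGeneratorError as err:  # the base class catches every project error
    caught = err  # caught.__cause__ still holds the original ValueError
```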
### 3. Log Before Raising
```python
def critical_operation() -> None:
    """Perform critical operation."""
    try:
        result = dangerous_call()
    except SpecificError as e:
        logger.error(f"Critical operation failed: {e}", exc_info=True)
        raise  # Re-raise after logging
```
---
## TESTING REQUIREMENTS
### Every Module MUST Have Tests
```python
"""Test module for scraper.py"""
from unittest.mock import Mock, patch

import pytest
import requests  # needed for requests.Timeout below

from src.scraper import NewsScraper
from src.config import ScraperConfig
from src.exceptions import ScrapingError


def test_scraper_success() -> None:
    """Test successful scraping."""
    config = ScraperConfig(sources=["https://example.com"])
    scraper = NewsScraper(config)

    with patch('requests.get') as mock_get:
        mock_response = Mock()
        mock_response.ok = True
        mock_response.json.return_value = {"title": "Test"}
        mock_get.return_value = mock_response

        result = scraper.scrape("https://example.com")

    assert result is not None
    assert result["title"] == "Test"


def test_scraper_timeout() -> None:
    """Test scraping timeout."""
    config = ScraperConfig(sources=["https://example.com"])
    scraper = NewsScraper(config)

    with patch('requests.get', side_effect=requests.Timeout):
        with pytest.raises(ScrapingError):
            scraper.scrape("https://example.com")
```
---
## LOGGING STRATEGY
### Standard Logger Setup
```python
import logging
import sys

def setup_logging(level: str = "INFO") -> None:
    """Setup logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR)
    """
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.StreamHandler(sys.stdout),
            logging.FileHandler('feed_generator.log')
        ]
    )

# In each module
logger = logging.getLogger(__name__)
```
### Logging Best Practices
```python
# ✅ REQUIRED - Structured logging
logger.info(f"Scraping {url}", extra={"url": url, "attempt": 1})

# ✅ REQUIRED - Log exceptions with context
try:
    result = operation()
except Exception:
    logger.error("Operation failed", exc_info=True, extra={"context": data})
    raise

# ❌ FORBIDDEN - Print statements
print("Debug info")  # Use logger.debug() instead
```
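Keys passed through `extra` become attributes on the log record, but they only appear in the output if the formatter references them; the format string in `setup_logging` above does not. A minimal sketch, assuming a dedicated handler whose format includes `%(url)s` (the logger name `feed_generator.demo` is hypothetical). Every record sent to such a handler must then supply `url`, otherwise formatting fails.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
# %(url)s refers to the key passed through `extra`
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s url=%(url)s"))

demo_logger = logging.getLogger("feed_generator.demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo output out of the root logger

demo_logger.info("Scraping started", extra={"url": "https://example.com"})
output = stream.getvalue()
```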
---
## DEPENDENCIES MANAGEMENT
### requirements.txt Structure
```txt
# Core dependencies
requests==2.31.0
beautifulsoup4==4.12.2
openai==1.3.0
# Utilities
python-dotenv==1.0.0
# Testing
pytest==7.4.3
pytest-cov==4.1.0
# Type checking
mypy==1.7.1
types-requests==2.31.0
```
### Installing Dependencies
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .
```
---
## TYPE CHECKING WITH MYPY
### mypy.ini Configuration
```ini
[mypy]
python_version = 3.11
warn_return_any = True
warn_unused_configs = True
disallow_untyped_defs = True
disallow_any_unimported = True
no_implicit_optional = True
warn_redundant_casts = True
warn_unused_ignores = True
warn_no_return = True
check_untyped_defs = True
strict_equality = True
```
### Running Type Checks
```bash
# Type check all code
mypy src/
# MUST pass before committing
```
---
## COMMON PATTERNS
### 1. Retry Logic
```python
import logging
import time
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar('T')

def retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 1.0
) -> T:
    """Retry a zero-argument callable with exponential backoff.

    Args:
        func: Function to retry (bind arguments with functools.partial)
        max_attempts: Maximum number of attempts (at least 1)
        delay_seconds: Initial delay between retries

    Returns:
        Function result

    Raises:
        ValueError: If max_attempts is less than 1
        Exception: Last exception if all retries fail
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_exception: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as e:
            last_exception = e
            if attempt < max_attempts - 1:
                sleep_time = delay_seconds * (2 ** attempt)
                logger.warning(
                    f"Attempt {attempt + 1} failed, retrying in {sleep_time}s",
                    extra={"exception": str(e)}
                )
                time.sleep(sleep_time)

    # The loop ran at least once and only falls through after a failure
    assert last_exception is not None
    raise last_exception
```
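A usage sketch for the retry pattern, with a compact variant of the helper repeated so the snippet runs on its own. `flaky_fetch` is a hypothetical stand-in for a network call that fails twice before succeeding; the delay is set to zero only to keep the example fast. A callable that needs arguments can be bound with `functools.partial` or a lambda before being passed in.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], max_attempts: int = 3, delay_seconds: float = 1.0) -> T:
    """Compact variant of the retry helper with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds * (2 ** attempt))
    raise AssertionError("unreachable")


calls: List[int] = []


def flaky_fetch() -> str:
    """Fail twice, then succeed (hypothetical stand-in for a network call)."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry(flaky_fetch, max_attempts=3, delay_seconds=0.0)
```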
### 2. Data Validation
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Article:
    """Validated article data."""
    title: str
    url: str
    image_url: Optional[str] = None

    def __post_init__(self) -> None:
        """Validate data after initialization."""
        if not self.title:
            raise ValueError("Title cannot be empty")
        if not self.url.startswith(('http://', 'https://')):
            raise ValueError(f"Invalid URL: {self.url}")
```
### 3. Context Managers for Resources
```python
from contextlib import contextmanager
from typing import Generator

@contextmanager
def api_client(config: APIConfig) -> Generator[APIClient, None, None]:
    """Context manager for API client.

    Yields:
        Configured API client
    """
    client = APIClient(config)
    try:
        client.connect()
        yield client
    finally:
        client.disconnect()

# Usage
with api_client(config) as client:
    result = client.call()
```
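The point of the `try`/`finally` in this pattern is that cleanup runs even when the body of the `with` block raises, while the error still reaches the caller. A minimal sketch of that behavior, using a hypothetical `RecordingClient` in place of `APIClient` so the order of calls can be observed:

```python
from contextlib import contextmanager
from typing import Generator, List


class RecordingClient:
    """Hypothetical stand-in for APIClient that records lifecycle calls."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def connect(self) -> None:
        self.events.append("connect")

    def disconnect(self) -> None:
        self.events.append("disconnect")


@contextmanager
def managed_client(client: RecordingClient) -> Generator[RecordingClient, None, None]:
    """Same shape as api_client above, but takes the client as a parameter."""
    try:
        client.connect()
        yield client
    finally:
        client.disconnect()  # runs on success and on failure


recorder = RecordingClient()
failure_propagated = False
try:
    with managed_client(recorder) as active:
        active.events.append("call")
        raise RuntimeError("simulated API failure")
except RuntimeError:
    failure_propagated = True  # the context manager does not hide the error
```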
---
## WORKING WITH EXTERNAL APIS
### OpenAI GPT-4 Vision
```python
from openai import OpenAI
from typing import Optional

class ImageAnalyzer:
    """Analyze images using GPT-4 Vision."""

    def __init__(self, api_key: str) -> None:
        self._client = OpenAI(api_key=api_key)

    def analyze_image(self, image_url: str, prompt: str) -> Optional[str]:
        """Analyze image with custom prompt.

        Args:
            image_url: URL of image to analyze
            prompt: Analysis prompt

        Returns:
            Analysis text, or None if the response contains no content

        Raises:
            ImageAnalysisError: If analysis fails
        """
        try:
            response = self._client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": image_url}}
                    ]
                }],
                max_tokens=300
            )
            return response.choices[0].message.content
        except Exception as e:
            logger.error(f"Image analysis failed: {e}")
            raise ImageAnalysisError(f"Failed to analyze {image_url}") from e
```
### Calling Node.js API
```python
import requests
from typing import Any, Dict, Optional

class ArticleAPIClient:
    """Client for Node.js article generation API."""

    def __init__(self, base_url: str, timeout: int = 30) -> None:
        self._base_url = base_url.rstrip('/')
        self._timeout = timeout

    def generate_article(
        self,
        topic: str,
        context: str,
        image_description: Optional[str] = None
    ) -> Dict[str, Any]:
        """Generate article via API.

        Args:
            topic: Article topic
            context: Context information
            image_description: Optional image description

        Returns:
            Generated article data

        Raises:
            APIClientError: If API call fails
        """
        payload: Dict[str, str] = {
            "topic": topic,
            "context": context,
        }
        if image_description:
            payload["image_description"] = image_description

        try:
            response = requests.post(
                f"{self._base_url}/api/generate",
                json=payload,
                timeout=self._timeout
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            logger.error(f"API call failed: {e}")
            raise APIClientError("Article generation failed") from e
```
---
## WHEN TO ASK FOR HUMAN INPUT
Claude Code MUST ask before:
1. **Changing module structure** - Architecture changes
2. **Adding new dependencies** - New libraries
3. **Changing configuration format** - Breaking changes
4. **Implementing complex logic** - Business rules
5. **Error handling strategy** - Recovery approaches
6. **Performance optimizations** - Trade-offs
Claude Code CAN proceed without asking:
1. **Adding type hints** - Always required
2. **Adding logging** - Always beneficial
3. **Adding tests** - Always needed
4. **Fixing obvious bugs** - Clear errors
5. **Improving documentation** - Clarity improvements
6. **Refactoring for clarity** - Same behavior, better code
---
## DEVELOPMENT WORKFLOW
### 1. Start with Types and Interfaces
```python
# Define data structures FIRST
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NewsArticle:
    title: str
    url: str
    content: str
    image_url: Optional[str] = None

@dataclass
class AnalyzedArticle:
    news: NewsArticle
    image_description: Optional[str] = None
```
### 2. Implement Core Logic
```python
# Then implement with clear types
def scrape_news(url: str) -> List[NewsArticle]:
    """Implementation with clear contract."""
    pass
```
### 3. Add Tests
```python
def test_scrape_news() -> None:
    """Test before considering feature complete."""
    pass
```
### 4. Integrate
```python
def pipeline() -> None:
    """Combine modules with clear flow."""
    articles = scrape_news(url)
    analyzed = analyze_images(articles)
    generated = generate_articles(analyzed)
    publish_feed(generated)
```
---
## CRITICAL REMINDERS
1. **Type hints are NOT optional** - Every function must be typed
2. **Error handling is NOT optional** - Every external call must have error handling
3. **Logging is NOT optional** - Every significant operation must be logged
4. **Tests are NOT optional** - Every module must have tests
5. **Configuration is NOT optional** - No hardcoded values
**If you find yourself thinking "I'll add types/tests/docs later"** - STOP. Do it now.
**If code works but isn't typed/tested/documented** - It's NOT done.
**This is NOT Node.js with its loose culture** - Python gives us the tools for rigor, USE THEM.
---
## SUCCESS CRITERIA
A module is complete when:
- ✅ All functions have type hints
- ✅ `mypy` passes with no errors
- ✅ All tests pass
- ✅ Test coverage > 80%
- ✅ No print statements (use logger)
- ✅ No bare excepts
- ✅ No magic strings (use Enums)
- ✅ Documentation is clear and complete
- ✅ Error handling is explicit
- ✅ Configuration is externalized
**If ANY of these is missing, the module is NOT complete.**
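
Two of these criteria (no print statements, no bare excepts) can be checked mechanically. The following is a rough sketch using the standard `ast` module; `find_violations` is a hypothetical helper written for illustration, not the project's `scripts/validate.py`, and it only detects direct calls to a name `print`.

```python
import ast
from typing import List, Tuple


def find_violations(source: str) -> List[str]:
    """Return messages for bare excepts and print() calls in the given source."""
    found: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            found.append((node.lineno, "bare except"))
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            found.append((node.lineno, "print statement"))
    # ast.walk does not visit nodes in line order, so sort by line number
    return [f"line {lineno}: {message}" for lineno, message in sorted(found)]


sample = "try:\n    print('x')\nexcept:\n    pass\n"
report = find_violations(sample)
```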