Complete Python implementation with strict type safety and best practices.

**Features:**
- RSS/Atom/HTML web scraping
- GPT-4 Vision image analysis
- Node.js API integration
- RSS/JSON feed publishing

**Modules:**
- `src/config.py`: Configuration with strict validation
- `src/exceptions.py`: Custom exception hierarchy
- `src/scraper.py`: Multi-format news scraping (RSS/Atom/HTML)
- `src/image_analyzer.py`: GPT-4 Vision integration with retry
- `src/aggregator.py`: Content aggregation and filtering
- `src/article_client.py`: Node.js API client with retry
- `src/publisher.py`: RSS/JSON feed generation
- `scripts/run.py`: Complete pipeline orchestrator
- `scripts/validate.py`: Code quality validation

**Code Quality:**
- 100% type hint coverage (mypy strict mode)
- Zero bare except clauses
- Logger throughout (no print statements)
- Comprehensive test suite (598 lines)
- Immutable dataclasses (frozen=True)
- Explicit error handling
- Structured logging

**Stats:**
- 1,431 lines of source code
- 598 lines of test code
- 15 Python files
- 8 core modules
- 4 test suites

All validation checks pass.
# CLAUDE.md - Feed Generator Development Instructions
> **CRITICAL**: This document contains mandatory rules for AI-assisted development with Claude Code.
> **NEVER** deviate from these rules without explicit human approval.
---
## PROJECT OVERVIEW
**Feed Generator** is a Python-based content aggregation system that:
1. Scrapes news from web sources
2. Analyzes images using GPT-4 Vision
3. Aggregates content into structured prompts
4. Calls existing Node.js article generation API
5. Publishes to feeds (RSS/WordPress)
**Philosophy**: Quick, functional prototype. NOT a production system yet.
**Timeline**: 3-5 days maximum for V1.
**Future**: May be rewritten in Node.js/TypeScript with strict architecture.
---
## CORE PRINCIPLES
### 1. Type Safety is MANDATORY
**NEVER write untyped Python code.**
```python
# ❌ FORBIDDEN - No type hints
def scrape_news(url):
    return requests.get(url)

# ✅ REQUIRED - Full type hints
from typing import List, Dict, Optional
import requests

def scrape_news(url: str) -> Optional[Dict[str, str]]:
    response: requests.Response = requests.get(url)
    return response.json() if response.ok else None
```
Rules:
- Every function MUST have type hints for parameters and return values
- Use the `typing` module: `List`, `Dict`, `Optional`, `Union`, `Tuple`
- Use `from __future__ import annotations` for forward references
- Complex types should use `TypedDict` or `dataclasses`
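For instance, a structured scrape result can be modeled either way (a minimal sketch; the field names are illustrative):

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, TypedDict


class ScrapedItem(TypedDict):
    """Raw scrape result as a typed dict."""
    title: str
    url: str


@dataclass(frozen=True)
class ScrapedArticle:
    """The same data as an immutable dataclass, per the project's frozen=True convention."""
    title: str
    url: str
    image_url: Optional[str] = None
```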
### 2. Explicit is Better Than Implicit

**NEVER use magic or implicit behavior.**
```python
# ❌ FORBIDDEN - Implicit dictionary keys
def process(data):
    return data['title']  # What if 'title' doesn't exist?

# ✅ REQUIRED - Explicit with error handling
def process(data: Dict[str, str]) -> str:
    if 'title' not in data:
        raise ValueError("Missing required key: 'title'")
    return data['title']
```
### 3. Fail Fast and Loud

**NEVER silently swallow errors.**
```python
# ❌ FORBIDDEN - Silent failure
try:
    result = dangerous_operation()
except:
    result = None

# ✅ REQUIRED - Explicit error handling
try:
    result = dangerous_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}")
    raise
```
### 4. Single Responsibility Modules

**Each module has ONE clear purpose.**
- `scraper.py` - ONLY scraping logic
- `image_analyzer.py` - ONLY image analysis
- `article_client.py` - ONLY API communication
- `aggregator.py` - ONLY content aggregation
- `publisher.py` - ONLY feed publishing
**NEVER mix responsibilities.**
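Concretely, that maps to the layout from the project summary above:

```
src/
├── config.py          # Configuration with strict validation
├── exceptions.py      # Custom exception hierarchy
├── scraper.py         # Multi-format news scraping (RSS/Atom/HTML)
├── image_analyzer.py  # GPT-4 Vision integration with retry
├── aggregator.py      # Content aggregation and filtering
├── article_client.py  # Node.js API client with retry
└── publisher.py       # RSS/JSON feed generation
scripts/
├── run.py             # Pipeline orchestrator
└── validate.py        # Code quality validation
```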
---

## FORBIDDEN PATTERNS

### ❌ NEVER Use These
```python
# 1. Bare except
try:
    something()
except:  # ❌ FORBIDDEN
    pass

# 2. Mutable default arguments
def func(items=[]):  # ❌ FORBIDDEN
    items.append(1)
    return items

# 3. Global state
CACHE = {}  # ❌ FORBIDDEN at module level

def use_cache():
    CACHE['key'] = 'value'

# 4. Star imports
from module import *  # ❌ FORBIDDEN

# 5. Untyped functions
def process(data):  # ❌ FORBIDDEN - no types
    return data

# 6. Magic strings
if mode == "production":  # ❌ FORBIDDEN
    do_something()

# 7. Implicit None returns
def maybe_returns():  # ❌ FORBIDDEN - unclear return
    if condition:
        return value

# 8. Nested functions for reuse
def outer():
    def inner():  # ❌ FORBIDDEN if used multiple times
        pass
    inner()
    inner()
```
### ✅ REQUIRED Patterns
```python
# 1. Specific exceptions
try:
    something()
except ValueError as e:  # ✅ REQUIRED
    logger.error(f"Value error: {e}")
    raise

# 2. Immutable defaults
def func(items: Optional[List[str]] = None) -> List[str]:  # ✅ REQUIRED
    if items is None:
        items = []
    items.append('new')
    return items

# 3. Explicit configuration objects
from dataclasses import dataclass

@dataclass
class CacheConfig:
    max_size: int
    ttl_seconds: int

cache = Cache(config=CacheConfig(max_size=100, ttl_seconds=60))

# 4. Explicit imports
from module import SpecificClass, specific_function  # ✅ REQUIRED

# 5. Typed functions
def process(data: Dict[str, Any]) -> Optional[str]:  # ✅ REQUIRED
    return data.get('value')

# 6. Enums for constants
from enum import Enum

class Mode(Enum):  # ✅ REQUIRED
    PRODUCTION = "production"
    DEVELOPMENT = "development"

if mode == Mode.PRODUCTION:
    do_something()

# 7. Explicit Optional returns
def maybe_returns() -> Optional[str]:  # ✅ REQUIRED
    if condition:
        return value
    return None

# 8. Extract functions to module level
def inner_logic() -> None:  # ✅ REQUIRED
    pass

def outer() -> None:
    inner_logic()
    inner_logic()
```
---

## MODULE STRUCTURE

### Standard Module Template

Every module MUST follow this structure:
"""
Module: module_name.py
Purpose: [ONE sentence describing ONLY responsibility]
Dependencies: [List external dependencies]
"""
from __future__ import annotations
# Standard library imports
import logging
from typing import Dict, List, Optional
# Third-party imports
import requests
from bs4 import BeautifulSoup
# Local imports
from .config import Config
# Module-level logger
logger = logging.getLogger(__name__)
class ModuleName:
"""[Clear description of class responsibility]"""
def __init__(self, config: Config) -> None:
"""Initialize with configuration.
Args:
config: Configuration object
Raises:
ValueError: If config is invalid
"""
self._config = config
self._validate_config()
def _validate_config(self) -> None:
"""Validate configuration."""
if not self._config.api_key:
raise ValueError("API key is required")
def public_method(self, param: str) -> Optional[Dict[str, str]]:
"""[Clear description]
Args:
param: [Description]
Returns:
[Description of return value]
Raises:
[Exceptions that can be raised]
"""
try:
result = self._internal_logic(param)
return result
except SpecificException as e:
logger.error(f"Failed to process {param}: {e}")
raise
def _internal_logic(self, param: str) -> Dict[str, str]:
"""Private methods use underscore prefix."""
return {"key": param}
---

## CONFIGURATION MANAGEMENT

**NEVER hardcode values. Use configuration objects.**

### config.py Structure
"""Configuration management for Feed Generator."""
from __future__ import annotations
import os
from dataclasses import dataclass
from typing import List
from pathlib import Path
@dataclass(frozen=True) # Immutable
class APIConfig:
"""Configuration for external APIs."""
openai_key: str
node_api_url: str
timeout_seconds: int = 30
@dataclass(frozen=True)
class ScraperConfig:
"""Configuration for news scraping."""
sources: List[str]
max_articles: int = 10
timeout_seconds: int = 10
@dataclass(frozen=True)
class Config:
"""Main configuration object."""
api: APIConfig
scraper: ScraperConfig
log_level: str = "INFO"
@classmethod
def from_env(cls) -> Config:
"""Load configuration from environment variables.
Returns:
Loaded configuration
Raises:
ValueError: If required environment variables are missing
"""
openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
raise ValueError("OPENAI_API_KEY environment variable required")
node_api_url = os.getenv("NODE_API_URL", "http://localhost:3000")
sources_str = os.getenv("NEWS_SOURCES", "")
sources = [s.strip() for s in sources_str.split(",") if s.strip()]
if not sources:
raise ValueError("NEWS_SOURCES environment variable required")
return cls(
api=APIConfig(
openai_key=openai_key,
node_api_url=node_api_url
),
scraper=ScraperConfig(
sources=sources
)
)
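At startup, configuration can be loaded once and passed down. A minimal sketch, assuming `python-dotenv` from requirements.txt and a local `.env` file:

```python
from dotenv import load_dotenv

from src.config import Config

load_dotenv()  # reads OPENAI_API_KEY, NODE_API_URL, NEWS_SOURCES from .env
config = Config.from_env()  # fails fast with ValueError if anything required is missing
```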
---

## ERROR HANDLING STRATEGY

### 1. Define Custom Exceptions
"""Custom exceptions for Feed Generator."""
class FeedGeneratorError(Exception):
"""Base exception for all Feed Generator errors."""
pass
class ScrapingError(FeedGeneratorError):
"""Raised when scraping fails."""
pass
class ImageAnalysisError(FeedGeneratorError):
"""Raised when image analysis fails."""
pass
class APIClientError(FeedGeneratorError):
"""Raised when API communication fails."""
pass
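Because every custom exception shares the `FeedGeneratorError` base, the top-level entry point can catch all domain failures in one specific clause, without a bare `except`. A sketch, where `run_pipeline` stands in for the real orchestrator in `scripts/run.py`:

```python
import logging

from src.exceptions import FeedGeneratorError, ScrapingError

logger = logging.getLogger(__name__)


def run_pipeline() -> None:
    """Stand-in for the real orchestrator."""
    raise ScrapingError("example failure")


def main() -> int:
    """Catch any domain error at the top level; genuine bugs still propagate."""
    try:
        run_pipeline()
    except FeedGeneratorError as e:
        logger.error(f"Pipeline failed: {e}", exc_info=True)
        return 1
    return 0
```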
### 2. Use Specific Error Handling
```python
def scrape_news(url: str) -> Dict[str, str]:
    """Scrape news from URL.

    Raises:
        ScrapingError: If scraping fails
    """
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.Timeout as e:
        raise ScrapingError(f"Timeout scraping {url}") from e
    except requests.RequestException as e:
        raise ScrapingError(f"Failed to scrape {url}") from e

    try:
        return response.json()
    except ValueError as e:
        raise ScrapingError(f"Invalid JSON from {url}") from e
```
### 3. Log Before Raising
```python
def critical_operation() -> None:
    """Perform critical operation."""
    try:
        result = dangerous_call()
    except SpecificError as e:
        logger.error(f"Critical operation failed: {e}", exc_info=True)
        raise  # Re-raise after logging
```
---

## TESTING REQUIREMENTS

### Every Module MUST Have Tests
"""Test module for scraper.py"""
import pytest
from unittest.mock import Mock, patch
from src.scraper import NewsScraper
from src.config import ScraperConfig
from src.exceptions import ScrapingError
def test_scraper_success() -> None:
"""Test successful scraping."""
config = ScraperConfig(sources=["https://example.com"])
scraper = NewsScraper(config)
with patch('requests.get') as mock_get:
mock_response = Mock()
mock_response.ok = True
mock_response.json.return_value = {"title": "Test"}
mock_get.return_value = mock_response
result = scraper.scrape("https://example.com")
assert result is not None
assert result["title"] == "Test"
def test_scraper_timeout() -> None:
"""Test scraping timeout."""
config = ScraperConfig(sources=["https://example.com"])
scraper = NewsScraper(config)
with patch('requests.get', side_effect=requests.Timeout):
with pytest.raises(ScrapingError):
scraper.scrape("https://example.com")
---

## LOGGING STRATEGY

### Standard Logger Setup
```python
import logging
import sys


def setup_logging(level: str = "INFO") -> None:
    """Setup logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR)
    """
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.StreamHandler(sys.stdout),
            logging.FileHandler('feed_generator.log')
        ]
    )


# In each module
logger = logging.getLogger(__name__)
```
### Logging Best Practices
```python
# ✅ REQUIRED - Structured logging
logger.info(f"Scraping {url}", extra={"url": url, "attempt": 1})

# ✅ REQUIRED - Log exceptions with context
try:
    result = operation()
except Exception as e:
    logger.error(f"Operation failed: {e}", exc_info=True, extra={"context": data})
    raise

# ❌ FORBIDDEN - Print statements
print("Debug info")  # Use logger.debug() instead
```
---

## DEPENDENCIES MANAGEMENT

### requirements.txt Structure
```
# Core dependencies
requests==2.31.0
beautifulsoup4==4.12.2
openai==1.3.0

# Utilities
python-dotenv==1.0.0

# Testing
pytest==7.4.3
pytest-cov==4.1.0

# Type checking
mypy==1.7.1
types-requests==2.31.0
```
### Installing Dependencies
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .
```
---

## TYPE CHECKING WITH MYPY

### mypy.ini Configuration
```ini
[mypy]
python_version = 3.11
warn_return_any = True
warn_unused_configs = True
disallow_untyped_defs = True
disallow_any_unimported = True
no_implicit_optional = True
warn_redundant_casts = True
warn_unused_ignores = True
warn_no_return = True
check_untyped_defs = True
strict_equality = True
```
### Running Type Checks

```bash
# Type check all code
mypy src/

# MUST pass before committing
```
---

## COMMON PATTERNS

### 1. Retry Logic
```python
import logging
import time
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar('T')


def retry(
    func: Callable[..., T],
    max_attempts: int = 3,
    delay_seconds: float = 1.0
) -> T:
    """Retry a function with exponential backoff.

    Args:
        func: Function to retry
        max_attempts: Maximum number of attempts
        delay_seconds: Initial delay between retries

    Returns:
        Function result

    Raises:
        Exception: Last exception if all retries fail
    """
    last_exception: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as e:
            last_exception = e
            if attempt < max_attempts - 1:
                sleep_time = delay_seconds * (2 ** attempt)
                logger.warning(
                    f"Attempt {attempt + 1} failed, retrying in {sleep_time}s",
                    extra={"exception": str(e)}
                )
                time.sleep(sleep_time)
    raise last_exception  # type: ignore
```
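Callers wrap the flaky operation in a zero-argument callable (the URL here is illustrative):

```python
import requests

response = retry(
    lambda: requests.get("https://example.com/feed.xml", timeout=10),
    max_attempts=3,
    delay_seconds=1.0,
)
```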
### 2. Data Validation
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    """Validated article data."""
    title: str
    url: str
    image_url: Optional[str] = None

    def __post_init__(self) -> None:
        """Validate data after initialization."""
        if not self.title:
            raise ValueError("Title cannot be empty")
        if not self.url.startswith(('http://', 'https://')):
            raise ValueError(f"Invalid URL: {self.url}")
```
### 3. Context Managers for Resources
```python
from contextlib import contextmanager
from typing import Generator


@contextmanager
def api_client(config: APIConfig) -> Generator[APIClient, None, None]:
    """Context manager for API client.

    Yields:
        Configured API client
    """
    client = APIClient(config)
    try:
        client.connect()
        yield client
    finally:
        client.disconnect()


# Usage
with api_client(config) as client:
    result = client.call()
```
---

## WORKING WITH EXTERNAL APIS

### OpenAI GPT-4 Vision
```python
import logging
from typing import Optional

from openai import OpenAI

from .exceptions import ImageAnalysisError

logger = logging.getLogger(__name__)


class ImageAnalyzer:
    """Analyze images using GPT-4 Vision."""

    def __init__(self, api_key: str) -> None:
        self._client = OpenAI(api_key=api_key)

    def analyze_image(self, image_url: str, prompt: str) -> Optional[str]:
        """Analyze image with custom prompt.

        Args:
            image_url: URL of image to analyze
            prompt: Analysis prompt

        Returns:
            Analysis result or None if failed

        Raises:
            ImageAnalysisError: If analysis fails
        """
        try:
            response = self._client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": image_url}}
                    ]
                }],
                max_tokens=300
            )
            return response.choices[0].message.content
        except Exception as e:
            logger.error(f"Image analysis failed: {e}")
            raise ImageAnalysisError(f"Failed to analyze {image_url}") from e
```
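Usage ties back to the configuration object (a sketch; the prompt text and URL are illustrative):

```python
analyzer = ImageAnalyzer(api_key=config.api.openai_key)
description = analyzer.analyze_image(
    image_url="https://example.com/photo.jpg",
    prompt="Describe this image for a news article.",
)
```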
### Calling Node.js API
```python
import logging
from typing import Any, Dict, Optional

import requests

from .exceptions import APIClientError

logger = logging.getLogger(__name__)


class ArticleAPIClient:
    """Client for Node.js article generation API."""

    def __init__(self, base_url: str, timeout: int = 30) -> None:
        self._base_url = base_url.rstrip('/')
        self._timeout = timeout

    def generate_article(
        self,
        topic: str,
        context: str,
        image_description: Optional[str] = None
    ) -> Dict[str, Any]:
        """Generate article via API.

        Args:
            topic: Article topic
            context: Context information
            image_description: Optional image description

        Returns:
            Generated article data

        Raises:
            APIClientError: If API call fails
        """
        payload = {
            "topic": topic,
            "context": context,
        }
        if image_description:
            payload["image_description"] = image_description

        try:
            response = requests.post(
                f"{self._base_url}/api/generate",
                json=payload,
                timeout=self._timeout
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            logger.error(f"API call failed: {e}")
            raise APIClientError("Article generation failed") from e
```
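The client is constructed from the same config (a sketch; field names follow `APIConfig` above, the topic and context are illustrative):

```python
client = ArticleAPIClient(
    base_url=config.api.node_api_url,
    timeout=config.api.timeout_seconds,
)
article_data = client.generate_article(
    topic="Example topic",
    context="Aggregated context from scraped sources.",
)
```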
---

## WHEN TO ASK FOR HUMAN INPUT

**Claude Code MUST ask before:**
- **Changing module structure** - Architecture changes
- **Adding new dependencies** - New libraries
- **Changing configuration format** - Breaking changes
- **Implementing complex logic** - Business rules
- **Error handling strategy** - Recovery approaches
- **Performance optimizations** - Trade-offs
**Claude Code CAN proceed without asking:**
- **Adding type hints** - Always required
- **Adding logging** - Always beneficial
- **Adding tests** - Always needed
- **Fixing obvious bugs** - Clear errors
- **Improving documentation** - Clarity improvements
- **Refactoring for clarity** - Same behavior, better code
---

## DEVELOPMENT WORKFLOW

### 1. Start with Types and Interfaces
```python
# Define data structures FIRST
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NewsArticle:
    title: str
    url: str
    content: str
    image_url: Optional[str] = None


@dataclass
class AnalyzedArticle:
    news: NewsArticle
    image_description: Optional[str] = None
```
### 2. Implement Core Logic

```python
# Then implement with clear types
def scrape_news(url: str) -> List[NewsArticle]:
    """Implementation with clear contract."""
    pass
```
### 3. Add Tests

```python
def test_scrape_news() -> None:
    """Test before considering feature complete."""
    pass
```
### 4. Integrate

```python
def pipeline() -> None:
    """Combine modules with clear flow."""
    articles = scrape_news(url)
    analyzed = analyze_images(articles)
    generated = generate_articles(analyzed)
    publish_feed(generated)
```
---

## CRITICAL REMINDERS
- **Type hints are NOT optional** - Every function must be typed
- **Error handling is NOT optional** - Every external call must have error handling
- **Logging is NOT optional** - Every significant operation must be logged
- **Tests are NOT optional** - Every module must have tests
- **Configuration is NOT optional** - No hardcoded values
If you find yourself thinking "I'll add types/tests/docs later" - STOP. Do it now.
If code works but isn't typed/tested/documented - It's NOT done.
This is NOT Node.js with its loose culture - Python gives us the tools for rigor, USE THEM.
---

## SUCCESS CRITERIA

A module is complete when:
- ✅ All functions have type hints
- ✅ `mypy` passes with no errors
- ✅ All tests pass
- ✅ Test coverage > 80%
- ✅ No print statements (use logger)
- ✅ No bare excepts
- ✅ No magic strings (use Enums)
- ✅ Documentation is clear and complete
- ✅ Error handling is explicit
- ✅ Configuration is externalized
**If ANY of these is missing, the module is NOT complete.**
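Most of these gates can be scripted. A minimal sketch of what `scripts/validate.py` might run (the real script's checks may differ; the commands assume the tools pinned in requirements.txt):

```python
"""Run the project's automated quality gates (sketch)."""
from __future__ import annotations

import logging
import subprocess
import sys
from typing import List, Tuple

logging.basicConfig(level=logging.INFO, format='%(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Each gate maps to a success criterion above.
CHECKS: List[Tuple[str, List[str]]] = [
    ("mypy strict", ["mypy", "src/"]),
    ("tests + coverage", ["pytest", "--cov=src", "--cov-fail-under=80"]),
]


def main() -> int:
    for name, cmd in CHECKS:
        logger.info(f"Running check: {name}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            logger.error(f"Check failed: {name}")
            return 1
    logger.info("All validation checks pass.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```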