# ARCHITECTURE.md - Feed Generator Technical Design
---
## SYSTEM OVERVIEW
**Feed Generator** aggregates news content from web sources, enriches it with AI-generated image analysis, and produces articles via an existing Node.js API.
### High-Level Flow
```
Web Sources → Scraper → Image Analyzer → Aggregator → Node API Client → Publisher
     ↓           ↓             ↓              ↓               ↓              ↓
    HTML    NewsArticle  AnalyzedArticle    Prompt    GeneratedArticle   Feed/RSS
```
### Design Goals
1. **Simplicity** - Clear, readable code over cleverness
2. **Modularity** - Each component has ONE responsibility
3. **Type Safety** - Full type coverage, mypy-compliant
4. **Testability** - Every module independently testable
5. **Prototype Speed** - Working system in 3-5 days
6. **Future-Proof** - Easy to migrate to Node.js later
---
## ARCHITECTURE PRINCIPLES
### 1. Pipeline Architecture
**Linear data flow, no circular dependencies.**
```
Input → Transform → Transform → Transform → Output
```
Each stage (see the sketch after this list):
- Takes typed input
- Performs ONE transformation
- Returns typed output
- Can fail explicitly
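As a rough illustration (the function name here is hypothetical, not the shipped API), a stage is just a typed callable whose signature states its input, its output, and its failure mode:
```python
from typing import Dict, List

def image_analysis_stage(articles: List[NewsArticle]) -> Dict[str, ImageAnalysis]:
    """One transformation: scraped articles in, image analyses out.

    Fails explicitly by raising ImageAnalysisError instead of returning partial garbage.
    """
    ...
```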
### 2. Dependency Injection
**Configuration flows top-down, no global state.**
```python
# Main orchestrator
config = Config.from_env()
scraper = NewsScraper(config.scraper)
analyzer = ImageAnalyzer(config.api.openai_key)
client = ArticleAPIClient(config.api.node_api_url)
publisher = FeedPublisher(config.publisher)
# Pass dependencies explicitly
pipeline = Pipeline(scraper, analyzer, client, publisher)
```
### 3. Explicit Error Boundaries
**Each module defines its failure modes.**
```python
# Module A raises ScrapingError
# Module B catches and handles
try:
    articles = scraper.scrape(url)
except ScrapingError as e:
    logger.error(f"Scraping failed: {e}")
    # Decide: retry, skip, or fail
```
---
## MODULE RESPONSIBILITIES
### 1. config.py - Configuration Management
**Purpose**: Centralize all configuration, load from environment.
**Responsibilities**:
- Load configuration from `.env` file
- Validate required settings
- Provide immutable config objects
- NO business logic
**Data Structures**:
```python
@dataclass(frozen=True)
class APIConfig:
    openai_key: str
    node_api_url: str
    timeout_seconds: int

@dataclass(frozen=True)
class ScraperConfig:
    sources: List[str]
    max_articles: int
    timeout_seconds: int

@dataclass(frozen=True)
class Config:
    api: APIConfig
    scraper: ScraperConfig
    log_level: str
```
**Interface**:
```python
def from_env() -> Config:
    """Load and validate configuration from environment."""
```
---
### 2. scraper.py - Web Scraping
**Purpose**: Extract news articles from web sources.
**Responsibilities**:
- HTTP requests to news sites
- HTML parsing with BeautifulSoup
- Extract: title, content, image URLs
- Handle site-specific quirks
- NO image analysis, NO article generation
**Data Structures**:
```python
@dataclass
class NewsArticle:
    title: str
    url: str
    content: str
    image_url: Optional[str]
    published_at: Optional[datetime]
    source: str
```
**Interface**:
```python
class NewsScraper:
    def scrape(self, url: str) -> List[NewsArticle]:
        """Scrape articles from a news source."""

    def scrape_all(self) -> List[NewsArticle]:
        """Scrape all configured sources."""
```
**Error Handling**:
- Raises `ScrapingError` on failure
- Logs warnings for individual article failures
- Returns partial results when possible
---
### 3. image_analyzer.py - AI Image Analysis
**Purpose**: Generate descriptions of news images using GPT-4 Vision.
**Responsibilities**:
- Call OpenAI GPT-4 Vision API
- Generate contextual image descriptions
- Handle API rate limits and errors
- NO scraping, NO article generation
**Data Structures**:
```python
@dataclass
class ImageAnalysis:
    image_url: str
    description: str
    confidence: float  # 0.0 to 1.0
    analysis_time: datetime
```
**Interface**:
```python
class ImageAnalyzer:
    def analyze(self, image_url: str, context: str) -> ImageAnalysis:
        """Analyze single image with context."""

    def analyze_batch(
        self,
        articles: List[NewsArticle]
    ) -> Dict[str, ImageAnalysis]:
        """Analyze multiple images, return dict keyed by URL."""
```
**Error Handling** (see the sketch after this list):
- Raises `ImageAnalysisError` on API failure
- Returns None for individual failures in batch
- Implements retry logic with exponential backoff
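A minimal sketch of that per-item behavior (an assumption about the implementation, not the shipped code): failures are logged and simply omitted from the returned dict, so one bad image never fails the whole batch.
```python
from typing import Dict, List

def analyze_batch(self, articles: List[NewsArticle]) -> Dict[str, ImageAnalysis]:
    """Analyze every article image; skip (rather than fail on) individual errors."""
    results: Dict[str, ImageAnalysis] = {}
    for article in articles:
        if not article.image_url:
            continue  # nothing to analyze for this article
        try:
            # analyze() is assumed to wrap the retry/backoff logic internally;
            # the article title serves as minimal context here.
            results[article.image_url] = self.analyze(article.image_url, article.title)
        except ImageAnalysisError as e:
            results.pop(article.image_url, None)
            logger.warning(f"Image analysis failed for {article.image_url}: {e}")
    return results
```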
---
### 4. aggregator.py - Content Aggregation
**Purpose**: Combine scraped content and image analysis into generation prompts.
**Responsibilities**:
- Merge NewsArticle + ImageAnalysis
- Format prompts for article generation API
- Apply business logic (e.g., skip low-confidence images)
- NO external API calls
**Data Structures**:
```python
@dataclass
class AggregatedContent:
    news: NewsArticle
    image_analysis: Optional[ImageAnalysis]

    def to_generation_prompt(self) -> Dict[str, str]:
        """Convert to format expected by Node API."""
        return {
            "topic": self.news.title,
            "context": self.news.content,
            "image_description": self.image_analysis.description if self.image_analysis else None
        }
```
**Interface**:
```python
class ContentAggregator:
    def aggregate(
        self,
        articles: List[NewsArticle],
        analyses: Dict[str, ImageAnalysis]
    ) -> List[AggregatedContent]:
        """Combine scraped and analyzed content."""
```
**Business Rules** (see the sketch after this list):
- Skip articles without images if image required
- Skip low-confidence image analyses (< 0.5)
- Limit prompt length to API constraints
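A minimal sketch of how these rules could be applied inside `ContentAggregator.aggregate()`. The confidence threshold comes from the rule above; the prompt budget and the `REQUIRE_IMAGE` flag are illustrative assumptions, not values taken from the implementation.
```python
from dataclasses import replace
from typing import Dict, List, Optional

MIN_IMAGE_CONFIDENCE = 0.5   # threshold from the rules above
MAX_CONTEXT_CHARS = 8000     # hypothetical prompt budget; the real API limit may differ
REQUIRE_IMAGE = False        # hypothetical flag for the "image required" rule

def aggregate(
    self,
    articles: List[NewsArticle],
    analyses: Dict[str, ImageAnalysis],
) -> List[AggregatedContent]:
    """Merge scraped and analyzed content while applying the rules above."""
    items: List[AggregatedContent] = []
    for article in articles:
        analysis: Optional[ImageAnalysis] = (
            analyses.get(article.image_url) if article.image_url else None
        )
        # Rule: drop low-confidence analyses rather than feed them to generation.
        if analysis is not None and analysis.confidence < MIN_IMAGE_CONFIDENCE:
            analysis = None
        # Rule: optionally skip articles that end up without a usable image.
        if REQUIRE_IMAGE and analysis is None:
            continue
        # Rule: keep the generation context inside the (assumed) prompt budget.
        if len(article.content) > MAX_CONTEXT_CHARS:
            article = replace(article, content=article.content[:MAX_CONTEXT_CHARS])
        items.append(AggregatedContent(news=article, image_analysis=analysis))
    return items
```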
---
### 5. article_client.py - Node API Client
**Purpose**: Call existing Node.js article generation API.
**Responsibilities**:
- HTTP POST to Node.js server
- Request/response serialization
- Retry logic for transient failures
- NO content processing, NO publishing
**Data Structures**:
```python
@dataclass
class GeneratedArticle:
    original_news: NewsArticle
    generated_content: str
    metadata: Dict[str, Any]
    generation_time: datetime
```
**Interface**:
```python
class ArticleAPIClient:
    def generate(self, prompt: Dict[str, str]) -> GeneratedArticle:
        """Generate single article."""

    def generate_batch(
        self,
        prompts: List[Dict[str, str]]
    ) -> List[GeneratedArticle]:
        """Generate multiple articles with rate limiting."""
```
**Error Handling**:
- Raises `APIClientError` on failure
- Implements exponential backoff retry
- Respects API rate limits
---
### 6. publisher.py - Feed Publishing
**Purpose**: Publish generated articles to output channels.
**Responsibilities**:
- Generate RSS/Atom feeds
- Post to WordPress (if configured)
- Write to local files
- NO content generation, NO scraping
**Interface**:
```python
class FeedPublisher:
    def publish_rss(self, articles: List[GeneratedArticle], path: Path) -> None:
        """Generate RSS feed file."""

    def publish_wordpress(self, articles: List[GeneratedArticle]) -> None:
        """Post to WordPress via XML-RPC or REST API."""

    def publish_json(self, articles: List[GeneratedArticle], path: Path) -> None:
        """Write articles as JSON for debugging."""
```
**Output Formats** (a minimal `publish_rss` sketch follows the list):
- RSS 2.0 feed
- WordPress posts
- JSON archive
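For orientation, a `publish_rss` sketch using only the standard library (`xml.etree.ElementTree`); the real implementation may rely on a dedicated feed library, and the channel metadata below is illustrative.
```python
import xml.etree.ElementTree as ET
from email.utils import format_datetime
from pathlib import Path
from typing import List

def publish_rss(self, articles: List[GeneratedArticle], path: Path) -> None:
    """Write a minimal RSS 2.0 feed for the generated articles."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Feed Generator"            # illustrative
    ET.SubElement(channel, "link").text = "https://example.com/feed"   # illustrative
    ET.SubElement(channel, "description").text = "AI-generated news articles"
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article.original_news.title
        ET.SubElement(item, "link").text = article.original_news.url
        ET.SubElement(item, "description").text = article.generated_content
        ET.SubElement(item, "pubDate").text = format_datetime(article.generation_time)
    path.parent.mkdir(parents=True, exist_ok=True)
    ET.ElementTree(rss).write(path, encoding="utf-8", xml_declaration=True)
```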
---
## DATA FLOW DETAIL
### Complete Pipeline
```python
def run_pipeline(config: Config) -> None:
    """Execute complete feed generation pipeline."""
    # 1. Initialize components
    scraper = NewsScraper(config.scraper)
    analyzer = ImageAnalyzer(config.api.openai_key)
    aggregator = ContentAggregator()
    client = ArticleAPIClient(config.api.node_api_url)
    publisher = FeedPublisher(config.publisher)

    # 2. Scrape news sources
    logger.info("Scraping news sources...")
    articles: List[NewsArticle] = scraper.scrape_all()
    logger.info(f"Scraped {len(articles)} articles")

    # 3. Analyze images
    logger.info("Analyzing images...")
    analyses: Dict[str, ImageAnalysis] = analyzer.analyze_batch(articles)
    logger.info(f"Analyzed {len(analyses)} images")

    # 4. Aggregate content
    logger.info("Aggregating content...")
    aggregated: List[AggregatedContent] = aggregator.aggregate(articles, analyses)
    logger.info(f"Aggregated {len(aggregated)} items")

    # 5. Generate articles
    logger.info("Generating articles...")
    prompts = [item.to_generation_prompt() for item in aggregated]
    generated: List[GeneratedArticle] = client.generate_batch(prompts)
    logger.info(f"Generated {len(generated)} articles")

    # 6. Publish
    logger.info("Publishing...")
    publisher.publish_rss(generated, Path("output/feed.rss"))
    publisher.publish_json(generated, Path("output/articles.json"))
    logger.info("Pipeline complete!")
```
### Error Handling in Pipeline
```python
def run_pipeline_with_recovery(config: Config) -> None:
    """Pipeline with error recovery at each stage."""
    # Components (scraper, analyzer, aggregator, client, publisher) are
    # initialized as in run_pipeline() above.
    try:
        # Stage 1: Scraping
        articles = scraper.scrape_all()
        if not articles:
            logger.warning("No articles scraped, exiting")
            return
    except ScrapingError as e:
        logger.error(f"Scraping failed: {e}")
        return  # Cannot proceed without articles

    try:
        # Stage 2: Image Analysis (optional)
        analyses = analyzer.analyze_batch(articles)
    except ImageAnalysisError as e:
        logger.warning(f"Image analysis failed: {e}, proceeding without images")
        analyses = {}  # Continue without image descriptions

    # Stage 3: Aggregation (cannot fail with valid inputs)
    aggregated = aggregator.aggregate(articles, analyses)

    try:
        # Stage 4: Generation
        prompts = [item.to_generation_prompt() for item in aggregated]
        generated = client.generate_batch(prompts)
        if not generated:
            logger.error("No articles generated, exiting")
            return
    except APIClientError as e:
        logger.error(f"Article generation failed: {e}")
        return  # Cannot publish without generated articles

    try:
        # Stage 5: Publishing
        publisher.publish_rss(generated, Path("output/feed.rss"))
        publisher.publish_json(generated, Path("output/articles.json"))
    except PublishingError as e:
        logger.error(f"Publishing failed: {e}")
        # Save to backup location
        publisher.publish_json(generated, Path("backup/articles.json"))
```
---
## INTERFACE CONTRACTS
### Module Input/Output Types
```python
# scraper.py
Input: str (URL)
Output: List[NewsArticle]
Errors: ScrapingError
# image_analyzer.py
Input: List[NewsArticle]
Output: Dict[str, ImageAnalysis] # Keyed by image_url
Errors: ImageAnalysisError
# aggregator.py
Input: List[NewsArticle], Dict[str, ImageAnalysis]
Output: List[AggregatedContent]
Errors: None (pure transformation)
# article_client.py
Input: List[Dict[str, str]] # Prompts
Output: List[GeneratedArticle]
Errors: APIClientError
# publisher.py
Input: List[GeneratedArticle]
Output: None (side effects: files, API calls)
Errors: PublishingError
```
### Type Safety Guarantees
All interfaces use:
- **Immutable dataclasses** for data structures
- **Explicit Optional** for nullable values
- **Specific exceptions** for error cases
- **Type hints** on all function signatures
```python
# Example: Type-safe interface
def process_article(
    article: NewsArticle,                 # Required
    analysis: Optional[ImageAnalysis]     # Nullable
) -> Result[GeneratedArticle, ProcessingError]:  # Explicit result type
    """Type signature guarantees correctness."""
```
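`Result` above is not a standard-library type; a minimal sketch of what such a container could look like (an assumption for illustration, not code from the repository):
```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")
E = TypeVar("E", bound=Exception)

@dataclass(frozen=True)
class Result(Generic[T, E]):
    """Holds either a success value or an error, never both."""
    value: Optional[T] = None
    error: Optional[E] = None

    @property
    def ok(self) -> bool:
        return self.error is None
```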
---
## CONFIGURATION STRATEGY
### Environment Variables
```bash
# Required
OPENAI_API_KEY=sk-...
NODE_API_URL=http://localhost:3000
NEWS_SOURCES=https://example.com/news,https://other.com/feed
# Optional
LOG_LEVEL=INFO
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
```
### Configuration Hierarchy
```
Default Values  →  Environment Variables  →  CLI Arguments (future)
      ↓                     ↓                        ↓
  config.py             .env file                 argparse
```
### Configuration Validation
```python
@classmethod
def from_env(cls) -> Config:
    """Load with validation."""
    # Required fields
    openai_key = os.getenv("OPENAI_API_KEY")
    if not openai_key:
        raise ValueError("OPENAI_API_KEY required")

    # Validated parsing
    node_api_url = os.getenv("NODE_API_URL", "http://localhost:3000")
    if not node_api_url.startswith(('http://', 'https://')):
        raise ValueError(f"Invalid NODE_API_URL: {node_api_url}")

    # List parsing
    sources_str = os.getenv("NEWS_SOURCES", "")
    sources = [s.strip() for s in sources_str.split(",") if s.strip()]
    if not sources:
        raise ValueError("NEWS_SOURCES required (comma-separated URLs)")

    return cls(...)
```
---
## ERROR HANDLING ARCHITECTURE
### Exception Hierarchy
```python
class FeedGeneratorError(Exception):
    """Base exception - catch-all for system errors."""
    pass

class ScrapingError(FeedGeneratorError):
    """Web scraping failed."""
    pass

class ImageAnalysisError(FeedGeneratorError):
    """GPT-4 Vision analysis failed."""
    pass

class APIClientError(FeedGeneratorError):
    """Node.js API communication failed."""
    pass

class PublishingError(FeedGeneratorError):
    """Feed publishing failed."""
    pass
```
### Retry Strategy
```python
class RetryConfig:
    """Configuration for retry behavior."""
    max_attempts: int = 3
    initial_delay: float = 1.0  # seconds
    backoff_factor: float = 2.0
    max_delay: float = 60.0

def with_retry(config: RetryConfig):
    """Decorator for retryable operations."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(config.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == config.max_attempts - 1:
                        raise
                    delay = min(
                        config.initial_delay * (config.backoff_factor ** attempt),
                        config.max_delay
                    )
                    logger.warning(f"Retry {attempt+1}/{config.max_attempts} after {delay}s")
                    time.sleep(delay)
        return wrapper
    return decorator
```
### Partial Failure Handling
```python
def scrape_all(self) -> List[NewsArticle]:
    """Scrape all sources, continue on individual failures."""
    all_articles = []
    for source in self._config.sources:
        try:
            articles = self._scrape_source(source)
            all_articles.extend(articles)
            logger.info(f"Scraped {len(articles)} from {source}")
        except ScrapingError as e:
            logger.warning(f"Failed to scrape {source}: {e}")
            # Continue with other sources
            continue
    return all_articles
```
---
## TESTING STRATEGY
### Test Pyramid
```
       E2E Tests (1-2)
        /          \
   Integration (5-10)
      /              \
 Unit Tests (20-30)
```
### Unit Test Coverage
Each module has:
- **Happy path tests** - Normal operation
- **Error condition tests** - Each exception type
- **Edge case tests** - Empty inputs, null values, limits
- **Mock external dependencies** - No real HTTP calls
```python
# Example: scraper_test.py
def test_scrape_success():
    """Test successful scraping."""
    # Mock HTTP response
    # Assert correct NewsArticle returned

def test_scrape_timeout():
    """Test timeout handling."""
    # Mock timeout exception
    # Assert ScrapingError raised

def test_scrape_invalid_html():
    """Test malformed HTML handling."""
    # Mock invalid response
    # Assert error or empty result
```
### Integration Test Coverage
Test module interactions:
- Scraper → Aggregator
- Analyzer → Aggregator
- Aggregator → API Client
- End-to-end pipeline
```python
def test_pipeline_integration():
    """Test complete pipeline with mocked external services."""
    config = Config.from_dict(test_config)
    with mock_http_responses():
        with mock_openai_api():
            with mock_node_api():
                result = run_pipeline(config)
    assert len(result) > 0
    assert all(isinstance(a, GeneratedArticle) for a in result)
```
### Test Data Strategy
```
tests/
├── fixtures/
│   ├── sample_news.html            # Mock HTML responses
│   ├── sample_api_response.json
│   └── sample_images.json
└── mocks/
    ├── mock_scraper.py
    ├── mock_analyzer.py
    └── mock_client.py
```
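A possible `tests/conftest.py` sketch (assuming the layout and file names shown in the tree above) that exposes these fixtures to the test suite:
```python
import json
from pathlib import Path

import pytest

FIXTURES = Path(__file__).parent / "fixtures"

@pytest.fixture
def sample_news_html() -> str:
    """Raw HTML page for exercising the scraper without network access."""
    return (FIXTURES / "sample_news.html").read_text(encoding="utf-8")

@pytest.fixture
def sample_api_response() -> dict:
    """Canned Node API response for article_client tests."""
    return json.loads((FIXTURES / "sample_api_response.json").read_text(encoding="utf-8"))
```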
---
## PERFORMANCE CONSIDERATIONS
### Current Targets (V1 Prototype)
- Scraping: 5-10 articles/source in < 30s
- Image analysis: < 5s per image (GPT-4V API latency)
- Article generation: < 10s per article (Node API latency)
- Total pipeline: < 5 minutes for 50 articles
### Bottlenecks Identified
1. **Sequential API calls** - GPT-4V and Node API
2. **Network latency** - HTTP requests
3. **No caching** - Repeated scraping of same sources
### Future Optimizations (V2+)
```python
# Parallel image analysis
async def analyze_batch_parallel(
    self,
    articles: List[NewsArticle]
) -> Dict[str, ImageAnalysis]:
    """Analyze images in parallel."""
    urls = [a.image_url for a in articles if a.image_url]
    tasks = [self._analyze_async(url) for url in urls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return {
        url: result
        for url, result in zip(urls, results)
        if not isinstance(result, Exception)
    }
```
### Caching Strategy (Future)
```python
@dataclass
class CacheConfig:
    scraper_ttl: int = 3600     # 1 hour
    analysis_ttl: int = 86400   # 24 hours

# Redis or simple file-based cache
cache = Cache(config.cache)

def scrape_with_cache(self, url: str) -> List[NewsArticle]:
    """Scrape with TTL-based caching."""
    cached = cache.get(f"scrape:{url}")
    if cached and not cache.is_expired(cached):
        return cached.data
    fresh = self._scrape_source(url)
    cache.set(f"scrape:{url}", fresh, ttl=self._config.cache.scraper_ttl)
    return fresh
```
---
## EXTENSIBILITY POINTS
### Adding New News Sources
```python
# 1. Add source-specific parser
class BBCParser(NewsParser):
    """Parser for BBC News."""
    def parse(self, html: str) -> List[NewsArticle]:
        """Extract articles from BBC HTML."""
        soup = BeautifulSoup(html, 'html.parser')
        # BBC-specific extraction logic
        return articles

# 2. Register parser
scraper.register_parser("bbc.com", BBCParser())

# 3. Add to configuration
NEWS_SOURCES=...,https://bbc.com/news
```
### Adding Output Formats
```python
# 1. Implement publisher interface
class JSONPublisher(Publisher):
    """Publish articles as JSON."""
    def publish(self, articles: List[GeneratedArticle]) -> None:
        """Write to JSON file."""
        with open(self._path, 'w') as f:
            json.dump([a.to_dict() for a in articles], f, indent=2)

# 2. Use in pipeline
publisher = JSONPublisher(Path("output/feed.json"))
publisher.publish(generated_articles)
```
### Custom Processing Steps
```python
# 1. Implement processor interface
class SEOOptimizer(Processor):
    """Add SEO metadata to articles."""
    def process(self, article: GeneratedArticle) -> GeneratedArticle:
        """Enhance with SEO tags."""
        optimized = article.copy()
        optimized.metadata['keywords'] = extract_keywords(article.content)
        optimized.metadata['description'] = generate_meta_description(article.content)
        return optimized

# 2. Add to pipeline
pipeline.add_processor(SEOOptimizer())
```
---
## MIGRATION PATH TO NODE.JS
### Why Migrate Later?
This Python prototype will eventually be rewritten in Node.js/TypeScript because:
1. **Consistency** - Same stack as article generation API
2. **Maintainability** - One language for entire system
3. **Type safety** - TypeScript strict mode
4. **Integration** - Direct module imports instead of HTTP
### What to Preserve
When migrating:
- Module structure (same responsibilities)
- Interface contracts (same types)
- Configuration format (same env vars)
- Error handling strategy (same exceptions)
- Test coverage (same test cases)
### Migration Strategy
```typescript
// 1. Create TypeScript interfaces matching Python dataclasses
interface NewsArticle {
  title: string;
  url: string;
  content: string;
  imageUrl?: string;
}

// 2. Port modules one-by-one
class NewsScraper {
  async scrape(url: string): Promise<NewsArticle[]> {
    // Same logic as Python version
  }
}

// 3. Replace HTTP calls with direct imports
import { generateArticle } from './article-generator';

// Instead of HTTP POST
const article = await generateArticle(prompt);
```
### Lessons to Apply
From this Python prototype to Node.js:
- Use TypeScript strict mode from day 1
- Define interfaces before implementation
- Write tests alongside code
- Use dependency injection
- Explicit error types
- No global state
---
## DEPLOYMENT CONSIDERATIONS
### Development Environment
```bash
# Local development
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with API keys
python scripts/run.py
```
### Production Deployment (Future)
```yaml
# docker-compose.yml
version: '3.8'
services:
  feed-generator:
    build: .
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - NODE_API_URL=http://article-api:3000
    volumes:
      - ./output:/app/output
    restart: unless-stopped
  article-api:
    image: node-article-generator:latest
    ports:
      - "3000:3000"
```
### Scheduling
```bash
# Cron job for periodic execution
0 */6 * * * cd /app/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
```
---
## MONITORING & OBSERVABILITY
### Logging Levels
```python
# DEBUG - Detailed execution flow
logger.debug(f"Scraping URL: {url}")
# INFO - Major pipeline stages
logger.info(f"Scraped {len(articles)} articles")
# WARNING - Recoverable errors
logger.warning(f"Failed to scrape {source}, continuing")
# ERROR - Unrecoverable errors
logger.error(f"Pipeline failed: {e}", exc_info=True)
```
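One possible logging setup for the pipeline, driven by the `LOG_LEVEL` setting from the configuration section (a sketch; the actual configuration in scripts/run.py may differ):
```python
import logging

def configure_logging(level: str = "INFO") -> logging.Logger:
    """Configure root logging once and return the pipeline logger."""
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    )
    return logging.getLogger("feed_generator")

logger = configure_logging()
```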
### Metrics to Track
```python
@dataclass
class PipelineMetrics:
    """Metrics for pipeline execution."""
    start_time: datetime
    end_time: datetime
    articles_scraped: int
    images_analyzed: int
    articles_generated: int
    articles_published: int
    errors: List[str]

    def duration(self) -> float:
        """Pipeline duration in seconds."""
        return (self.end_time - self.start_time).total_seconds()

    def success_rate(self) -> float:
        """Percentage of articles successfully processed."""
        if self.articles_scraped == 0:
            return 0.0
        return (self.articles_published / self.articles_scraped) * 100
```
### Health Checks
```python
def health_check() -> Dict[str, Any]:
    """Check system health."""
    return {
        "status": "healthy",
        "checks": {
            "openai_api": check_openai_connection(),
            "node_api": check_node_api_connection(),
            "disk_space": check_disk_space(),
        },
        "last_run": get_last_run_metrics(),
    }
```
---
## SECURITY CONSIDERATIONS
### API Key Management
```python
# ❌ NEVER commit API keys
OPENAI_API_KEY = "sk-..."  # FORBIDDEN

# ✅ Use environment variables
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable required")
```
### Input Validation
```python
def validate_url(url: str) -> bool:
    """Validate URL is safe to scrape."""
    parsed = urlparse(url)
    # Must be HTTP/HTTPS
    if parsed.scheme not in ('http', 'https'):
        return False
    # No localhost or private IPs
    if parsed.hostname in ('localhost', '127.0.0.1'):
        return False
    return True
```
### Rate Limiting
```python
class RateLimiter:
    """Simple rate limiter for API calls."""
    def __init__(self, calls_per_minute: int) -> None:
        self._calls_per_minute = calls_per_minute
        self._calls: List[datetime] = []

    def wait_if_needed(self) -> None:
        """Block if rate limit would be exceeded."""
        now = datetime.now()
        minute_ago = now - timedelta(minutes=1)
        # Remove old calls
        self._calls = [c for c in self._calls if c > minute_ago]
        if len(self._calls) >= self._calls_per_minute:
            sleep_time = (self._calls[0] - minute_ago).total_seconds()
            time.sleep(sleep_time)
        self._calls.append(now)
```
---
## KNOWN LIMITATIONS (V1)
### Scraping Limitations
- **Static HTML only** - No JavaScript rendering
- **No anti-bot bypass** - May be blocked by Cloudflare/etc
- **No authentication** - Cannot access paywalled content
- **Site-specific parsing** - Breaks if HTML structure changes
### Analysis Limitations
- **Cost** - GPT-4V API is expensive at scale
- **Latency** - 3-5s per image analysis
- **Rate limits** - OpenAI API quotas
- **No caching** - Re-analyzes same images
### Generation Limitations
- **Dependent on Node API** - Single point of failure
- **No fallback** - If API down, pipeline fails
- **Sequential processing** - One article at a time
### Publishing Limitations
- **Local files only** - No cloud storage
- **No WordPress integration** - RSS only
- **No scheduling** - Manual execution
---
## FUTURE ENHANCEMENTS (Post-V1)
### Phase 2: Robustness
- [ ] Playwright for JavaScript-rendered sites
- [ ] Retry logic with exponential backoff
- [ ] Persistent queue for failed items
- [ ] Health monitoring dashboard
### Phase 3: Performance
- [ ] Async/parallel processing
- [ ] Redis caching layer
- [ ] Connection pooling
- [ ] Batch API requests
### Phase 4: Features
- [ ] WordPress integration
- [ ] Multiple output formats
- [ ] Content filtering rules
- [ ] A/B testing for prompts
### Phase 5: Migration to Node.js
- [ ] Rewrite in TypeScript
- [ ] Direct integration with article generator
- [ ] Shared types/interfaces
- [ ] Unified deployment
---
## DECISION LOG
### Why Python for V1?
**Decision**: Use Python instead of Node.js
**Rationale**:
- Better scraping libraries (BeautifulSoup, requests)
- Simpler OpenAI SDK
- Faster prototyping
- Can be rewritten later
### Why Not Async from Start?
**Decision**: Synchronous code for V1
**Rationale**:
- Simpler to understand and debug
- Performance not critical for prototype
- Can add async in V2
### Why Dataclasses over Dicts?
**Decision**: Use typed dataclasses everywhere
**Rationale**:
- Type safety catches bugs early
- Better IDE support
- Self-documenting code
- Easy to validate
### Why No Database?
**Decision**: File-based storage for V1
**Rationale**:
- Simpler deployment
- No database management
- Sufficient for prototype
- Can add later if needed
---
End of ARCHITECTURE.md