# SETUP.md - Feed Generator Installation Guide
---
## PREREQUISITES
### Required Software
- **Python 3.11+** recommended (3.10 is the minimum supported version)
  ```bash
  python --version  # Should be 3.10 or higher, ideally 3.11+
  ```
- **pip** (comes with Python)
  ```bash
  pip --version
  ```
- **Git** (for cloning the repository)
  ```bash
  git --version
  ```
### Required Services
- **OpenAI API account** with GPT-4 Vision access
  - Sign up: https://platform.openai.com/signup
  - Generate an API key: https://platform.openai.com/api-keys
- **Node.js Article Generator** (your existing API)
  - Should be running on `http://localhost:3000`
  - Or configure a different URL in `.env`
---
## INSTALLATION
### Step 1: Clone Repository
```bash
# Clone the project
git clone https://github.com/your-org/feed-generator.git
cd feed-generator
# Verify structure
ls -la
# Should see: src/, tests/, requirements.txt, README.md, etc.
```
### Step 2: Create Virtual Environment
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Verify activation (should show (venv) in prompt)
which python # Should point to venv/bin/python
```
### Step 3: Install Dependencies
```bash
# Upgrade pip first
pip install --upgrade pip
# Install project dependencies
pip install -r requirements.txt
# Verify installations
pip list
# Should see: requests, beautifulsoup4, openai, pytest, mypy, etc.
```
### Step 4: Install Development Tools (Optional)
```bash
# For development
pip install -r requirements-dev.txt
# Includes: black, flake8, pylint, ipython
```
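If `requirements-dev.txt` is missing from your checkout, a minimal version covering the tools this guide references would be (unpinned here for brevity; pin versions in practice):
```
# requirements-dev.txt (illustrative)
black
flake8
pylint
ipython
pre-commit
```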
---
## CONFIGURATION
### Step 1: Create Environment File
```bash
# Copy example configuration
cp .env.example .env
# Edit with your settings
nano .env # or vim, code, etc.
```
### Step 2: Configure API Keys
Edit `.env` file:
```bash
# REQUIRED: OpenAI API Key
OPENAI_API_KEY=sk-proj-your-key-here
# REQUIRED: Node.js Article Generator API
NODE_API_URL=http://localhost:3000
# REQUIRED: News sources (comma-separated)
NEWS_SOURCES=https://example.com/news,https://techcrunch.com/feed
# OPTIONAL: Logging level
LOG_LEVEL=INFO
# OPTIONAL: Timeouts and limits
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
```
### Step 3: Verify Configuration
```bash
# Test configuration loading
python -c "from src.config import Config; c = Config.from_env(); print(c)"
# Should print configuration without errors
```
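The loader behind `Config.from_env()` follows a fail-fast pattern: missing required variables raise immediately instead of surfacing later in the pipeline. As a rough sketch of that pattern only (not the project's actual `src/config.py`), assuming variables are already in the process environment or loaded via `python-dotenv`:
```python
# sketch: fail-fast environment loading (illustrative, not src/config.py)
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleConfig:
    openai_api_key: str
    node_api_url: str
    news_sources: tuple[str, ...]

    @classmethod
    def from_env(cls) -> "ExampleConfig":
        def require(name: str) -> str:
            value = os.environ.get(name)
            if not value:
                raise ValueError(f"Missing required environment variable: {name}")
            return value

        return cls(
            openai_api_key=require("OPENAI_API_KEY"),
            node_api_url=require("NODE_API_URL"),
            news_sources=tuple(s.strip() for s in require("NEWS_SOURCES").split(",")),
        )
```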
---
## VERIFICATION
### Step 1: Verify Python Environment
```bash
# Check Python version
python --version
# Output: Python 3.10 or higher (ideally 3.11.x)
# Check virtual environment
which python
# Output: /path/to/feed-generator/venv/bin/python
# Check installed packages
pip list | grep -E "(requests|openai|beautifulsoup4)"
# Should show all three packages
```
### Step 2: Verify API Connections
#### Test OpenAI API
```bash
python scripts/test_openai.py
```
Expected output:
```
Testing OpenAI API connection...
✓ API key loaded
✓ Connection successful
✓ GPT-4 Vision available
All checks passed!
```
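If `scripts/test_openai.py` is missing from your checkout, an equivalent hand-rolled check looks like this (a sketch assuming the `openai` v1 Python client; a bad key or network failure raises immediately):
```python
# sketch: manual OpenAI connectivity check (assumes openai>=1.0)
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
models = [m.id for m in client.models.list()]  # fails fast on a bad key
print(f"✓ Connection successful ({len(models)} models visible)")
```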
#### Test Node.js API
```bash
# Make sure your Node.js API is running first
# In another terminal:
cd /path/to/node-article-generator
npm start
# Then test connection
python scripts/test_node_api.py
```
Expected output:
```
Testing Node.js API connection...
✓ API endpoint reachable
✓ Health check passed
✓ Test article generation successful
All checks passed!
```
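The same check can be done by hand with `requests`; a sketch assuming the `/health` endpoint used elsewhere in this guide:
```python
# sketch: manual Node API health check (assumes a /health endpoint)
import os

import requests

base_url = os.environ.get("NODE_API_URL", "http://localhost:3000")
response = requests.get(f"{base_url}/health", timeout=5)
response.raise_for_status()  # non-2xx means reachable but unhealthy
print(f"✓ {base_url} reachable: {response.status_code}")
```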
### Step 3: Run Component Tests
```bash
# Test individual components
python -m pytest tests/ -v
# Expected output:
# tests/test_config.py::test_config_from_env PASSED
# tests/test_scraper.py::test_scraper_init PASSED
# ...
# ============ X passed in X.XXs ============
```
### Step 4: Test Complete Pipeline
```bash
# Dry run (mock external services)
python scripts/test_pipeline.py --dry-run
# Expected output:
# [INFO] Starting pipeline test (dry run)...
# [INFO] ✓ Configuration loaded
# [INFO] ✓ Scraper initialized
# [INFO] ✓ Image analyzer initialized
# [INFO] ✓ API client initialized
# [INFO] ✓ Publisher initialized
# [INFO] Pipeline test successful!
```
---
## RUNNING THE GENERATOR
### Manual Execution
```bash
# Run complete pipeline
python scripts/run.py
# With custom configuration
python scripts/run.py --config custom.env
# Dry run (no actual API calls)
python scripts/run.py --dry-run
# Verbose output
python scripts/run.py --verbose
```
### Expected Output
```
[2025-01-15 10:00:00] INFO - Starting Feed Generator...
[2025-01-15 10:00:00] INFO - Loading configuration...
[2025-01-15 10:00:01] INFO - Configuration loaded successfully
[2025-01-15 10:00:01] INFO - Scraping 3 news sources...
[2025-01-15 10:00:05] INFO - Scraped 15 articles
[2025-01-15 10:00:05] INFO - Analyzing 15 images...
[2025-01-15 10:00:25] INFO - Analyzed 12 images (3 failed)
[2025-01-15 10:00:25] INFO - Aggregating content...
[2025-01-15 10:00:25] INFO - Aggregated 12 items
[2025-01-15 10:00:25] INFO - Generating articles...
[2025-01-15 10:01:30] INFO - Generated 12 articles
[2025-01-15 10:01:30] INFO - Publishing to RSS...
[2025-01-15 10:01:30] INFO - Published to output/feed.rss
[2025-01-15 10:01:30] INFO - Pipeline complete! (90 seconds)
```
### Output Files
```bash
# Check generated files
ls -l output/
# Should see:
# feed.rss - RSS feed
# articles.json - Full article data
# feed_generator.log - Execution log
```
---
## TROUBLESHOOTING
### Issue: "OPENAI_API_KEY not found"
**Cause**: Environment variable not set
**Solution**:
```bash
# Check .env file exists
ls -la .env
# Verify API key is set
cat .env | grep OPENAI_API_KEY
# .env is read when the pipeline starts, so re-run the command after saving
# changes; also make sure the virtual environment is active
source venv/bin/activate
```
### Issue: "Module not found" errors
**Cause**: Dependencies not installed
**Solution**:
```bash
# Ensure virtual environment is activated
which python # Should point to venv
# Reinstall dependencies
pip install -r requirements.txt
# Verify installation
pip list | grep <missing-module>
```
### Issue: "Connection refused" to Node API
**Cause**: Node.js API not running
**Solution**:
```bash
# Start Node.js API first
cd /path/to/node-article-generator
npm start
# Verify it's running
curl http://localhost:3000/health
# Check configured URL in .env
cat .env | grep NODE_API_URL
```
### Issue: "Rate limit exceeded" from OpenAI
**Cause**: Too many API requests
**Solution**:
```bash
# Reduce MAX_ARTICLES in .env (edit the existing line, or append an override)
echo "MAX_ARTICLES=5" >> .env
# A per-request delay is a planned enhancement (see the backoff sketch below)
# For now, wait a few minutes and retry
```
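Until a built-in delay lands, a caller-side retry with exponential backoff is the standard workaround. A minimal sketch; `call_api` is a hypothetical stand-in for whatever request hits the limit:
```python
# sketch: exponential backoff around a rate-limited call
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call_api: Callable[[], T], max_attempts: int = 5, base_delay: float = 2.0) -> T:
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception as exc:  # narrow to your client's rate-limit error in real code
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
            print(f"Request failed ({exc}); retrying in {delay:.0f}s...")
            time.sleep(delay)
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```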
### Issue: Scraping fails for specific sites
**Cause**: Site structure changed or blocking
**Solution**:
```bash
# Test individual source
python scripts/test_scraper.py --url https://problematic-site.com
# Check logs
cat feed_generator.log | grep ScrapingError
# Remove problematic source from .env temporarily
nano .env # Remove from NEWS_SOURCES
```
### Issue: Type checking fails
**Cause**: Missing or incorrect type hints
**Solution**:
```bash
# Run mypy to see errors
mypy src/
# Fix reported issues
# Every function must have type hints
```
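For reference, this is the level of annotation strict mode expects: parameters, returns, and even `None` returns are spelled out.
```python
# sketch: annotations that satisfy mypy strict mode
def filter_titles(titles: list[str], min_length: int = 10) -> list[str]:
    """Return titles at least min_length characters long."""
    return [t for t in titles if len(t) >= min_length]


def log_summary(count: int) -> None:  # "returns nothing" is still annotated
    print(f"Kept {count} titles")
```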
---
## DEVELOPMENT SETUP
### Additional Tools
```bash
# Code formatting
pip install black
black src/ tests/
# Linting
pip install flake8
flake8 src/ tests/
# Type checking
pip install mypy
mypy src/
# Interactive Python shell
pip install ipython
ipython
```
### Pre-commit Hook (Optional)
```bash
# Install pre-commit
pip install pre-commit
# Setup hooks
pre-commit install
# Now runs automatically on git commit
# Or run manually:
pre-commit run --all-files
```
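`pre-commit install` expects a `.pre-commit-config.yaml` at the repository root. If the project does not ship one, a minimal config wiring up the formatters above might look like this (the `rev` values are placeholders; pin them to real release tags):
```yaml
# .pre-commit-config.yaml (illustrative; pin rev to actual release tags)
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
```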
### IDE Setup
#### VS Code
```json
// .vscode/settings.json
{
  "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
  "python.linting.enabled": true,
  "python.linting.pylintEnabled": false,
  "python.linting.flake8Enabled": true,
  "python.formatting.provider": "black",
  "python.analysis.typeCheckingMode": "strict"
}
```
#### PyCharm
```
1. Open Project
2. File → Settings → Project → Python Interpreter
3. Add Interpreter → Existing Environment
4. Select: /path/to/feed-generator/venv/bin/python
5. Apply
```
---
## SCHEDULED EXECUTION
### Cron Job (Linux/Mac)
```bash
# Edit crontab
crontab -e
# Run every 6 hours
0 */6 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
# Run daily at 8 AM
0 8 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
```
### Systemd Service (Linux)
```ini
# /etc/systemd/system/feed-generator.service
[Unit]
Description=Feed Generator
After=network.target
[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/feed-generator
ExecStart=/path/to/feed-generator/venv/bin/python scripts/run.py
Restart=on-failure
[Install]
WantedBy=multi-user.target
```
```bash
# Enable and start
sudo systemctl enable feed-generator
sudo systemctl start feed-generator
# Check status
sudo systemctl status feed-generator
```
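The unit above runs the pipeline once each time the service starts. To have systemd schedule it periodically (the equivalent of the cron entries above), pair it with a timer unit; a sketch:
```ini
# /etc/systemd/system/feed-generator.timer (sketch; adjust OnCalendar to taste)
[Unit]
Description=Run Feed Generator every 6 hours

[Timer]
OnCalendar=00/6:00
Persistent=true

[Install]
WantedBy=timers.target
```
Then enable the timer rather than the service: `sudo systemctl enable --now feed-generator.timer`.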
### Task Scheduler (Windows)
```powershell
# Create scheduled task
$action = New-ScheduledTaskAction -Execute "C:\path\to\venv\Scripts\python.exe" -Argument "C:\path\to\scripts\run.py" -WorkingDirectory "C:\path\to\feed-generator"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "FeedGenerator" -Description "Run feed generator daily"
```
---
## MONITORING
### Log Files
```bash
# View live logs
tail -f feed_generator.log
# View recent errors
grep ERROR feed_generator.log | tail -20
# View pipeline summary
grep "Pipeline complete" feed_generator.log
```
### Metrics Dashboard (Future)
```bash
# View last run metrics
python scripts/show_metrics.py
# Expected output:
# Last Run: 2025-01-15 10:01:30
# Duration: 90 seconds
# Articles Scraped: 15
# Articles Generated: 12
# Success Rate: 80%
# Errors: 3 (image analysis failures)
```
---
## BACKUP & RECOVERY
### Backup Configuration
```bash
# Backup .env file (CAREFUL - contains API keys)
cp .env .env.backup
# Store securely, NOT in git
# Use password manager or encrypted storage
```
### Backup Output
```bash
# Create daily backup
mkdir -p backups/$(date +%Y-%m-%d)
cp -r output/* backups/$(date +%Y-%m-%d)/
# Automated backup script
./scripts/backup_output.sh
```
### Recovery
```bash
# Restore from backup
cp backups/2025-01-15/feed.rss output/
# Verify integrity
python scripts/verify_feed.py output/feed.rss
```
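If `scripts/verify_feed.py` is unavailable, a basic well-formedness check with the standard library is a reasonable stand-in (it catches truncated or corrupt XML, not RSS semantics):
```python
# sketch: minimal RSS well-formedness check (structure only, not RSS validity)
import xml.etree.ElementTree as ET

tree = ET.parse("output/feed.rss")  # raises ParseError on corrupt XML
items = tree.findall(".//item")
print(f"✓ feed.rss parses; {len(items)} <item> entries found")
```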
---
## UPDATING
### Update Dependencies
```bash
# Activate virtual environment
source venv/bin/activate
# Update pip
pip install --upgrade pip
# Update all packages
pip install --upgrade -r requirements.txt
# Check whether anything is still outdated
pip list --outdated
```
### Update Code
```bash
# Pull latest changes
git pull origin main
# Reinstall if requirements changed
pip install -r requirements.txt
# Run tests
python -m pytest tests/
# Test pipeline
python scripts/test_pipeline.py --dry-run
```
---
## UNINSTALLATION
### Remove Virtual Environment
```bash
# Deactivate first
deactivate
# Remove virtual environment
rm -rf venv/
```
### Remove Generated Files
```bash
# Remove output
rm -rf output/
# Remove logs
rm -rf logs/
# Remove backups
rm -rf backups/
```
### Remove Project
```bash
# Remove entire project directory
cd ..
rm -rf feed-generator/
```
---
## SECURITY CHECKLIST
Before deploying:
- [ ] `.env` file is NOT committed to git
- [ ] `.env.example` has placeholder values only
- [ ] API keys are stored securely
- [ ] `.gitignore` includes `.env`, `venv/`, `output/`, `logs/` (see the sample below)
- [ ] Log files don't contain sensitive data
- [ ] File permissions are restrictive (`chmod 600 .env`)
- [ ] Virtual environment is isolated
- [ ] Dependencies are from trusted sources
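A minimal `.gitignore` covering these items (plus Python's bytecode caches) looks like:
```
# .gitignore essentials for this checklist
.env
.env.backup
venv/
output/
logs/
backups/
__pycache__/
*.pyc
```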
---
## PERFORMANCE BASELINE
Expected performance on standard hardware:
| Metric | Target | Acceptable Range |
|--------|--------|------------------|
| Scraping (10 articles) | 10s | 5-20s |
| Image analysis (10 images) | 30s | 20-50s |
| Article generation (10 articles) | 60s | 40-120s |
| Publishing | 1s | <5s |
| **Total pipeline (10 articles)** | **2 min** | **1-5 min** |
### Performance Testing
```bash
# Benchmark pipeline
python scripts/benchmark.py
# Output:
# Scraping: 8.3s (15 articles)
# Analysis: 42.1s (15 images)
# Generation: 95.7s (12 articles)
# Publishing: 0.8s
# TOTAL: 146.9s
```
---
## NEXT STEPS
After successful setup:
1. **Run first pipeline**
   ```bash
   python scripts/run.py
   ```
2. **Verify output**
   ```bash
   ls -l output/
   cat output/feed.rss | head -20
   ```
3. **Set up scheduling** (cron/systemd/Task Scheduler)
4. **Configure monitoring** (logs, metrics)
5. **Read DEVELOPMENT.md** for extending functionality
---
## GETTING HELP
### Documentation
- **README.md** - Project overview
- **ARCHITECTURE.md** - Technical design
- **CLAUDE.md** - Development guidelines
- **API_INTEGRATION.md** - Node API integration
### Diagnostics
```bash
# Run diagnostics script
python scripts/diagnose.py
# Output:
# ✓ Python version: 3.11.5
# ✓ Virtual environment: active
# ✓ Dependencies: installed
# ✓ Configuration: valid
# ✓ OpenAI API: reachable
# ✓ Node API: reachable
# ✓ Output directory: writable
# All systems operational!
```
### Common Issues
Check troubleshooting section above, or:
```bash
# Generate debug report
python scripts/debug_report.py > debug.txt
# Share debug.txt (remove API keys first!)
```
---
## CHECKLIST: FIRST RUN
Complete setup verification:
- [ ] Python 3.11+ installed
- [ ] Virtual environment created and activated
- [ ] Dependencies installed (`pip list` shows all packages)
- [ ] `.env` file created with API keys
- [ ] OpenAI API connection tested
- [ ] Node.js API running and tested
- [ ] Configuration validated (`Config.from_env()` works)
- [ ] Component tests pass (`pytest tests/`)
- [ ] Dry run successful (`python scripts/run.py --dry-run`)
- [ ] First real run completed
- [ ] Output files generated (`output/feed.rss` exists)
- [ ] Logs are readable (`feed_generator.log`)
**If all checks pass → You're ready to use Feed Generator!**
---
## QUICK START SUMMARY
For experienced developers:
```bash
# 1. Setup
git clone <repo> && cd feed-generator
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# 2. Configure
cp .env.example .env
# Edit .env with your API keys
# 3. Test
python scripts/test_pipeline.py --dry-run
# 4. Run
python scripts/run.py
# 5. Verify
ls -l output/
```
**Time to first run: ~10 minutes**
---
## APPENDIX: EXAMPLE .env FILE
```bash
# .env.example - Copy to .env and fill in your values
# ==============================================
# REQUIRED CONFIGURATION
# ==============================================
# OpenAI API Key (get from https://platform.openai.com/api-keys)
OPENAI_API_KEY=sk-proj-your-actual-key-here
# Node.js Article Generator API URL
NODE_API_URL=http://localhost:3000
# News sources (comma-separated URLs)
NEWS_SOURCES=https://techcrunch.com/feed,https://www.theverge.com/rss/index.xml
# ==============================================
# OPTIONAL CONFIGURATION
# ==============================================
# Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO
# Maximum articles to process per source
MAX_ARTICLES=10
# HTTP timeout for scraping (seconds)
SCRAPER_TIMEOUT=10
# HTTP timeout for API calls (seconds)
API_TIMEOUT=30
# Output directory (default: ./output)
OUTPUT_DIR=./output
# ==============================================
# ADVANCED CONFIGURATION (V2)
# ==============================================
# Enable caching (true/false)
# ENABLE_CACHE=false
# Cache TTL in seconds
# CACHE_TTL=3600
# Enable parallel processing (true/false)
# ENABLE_PARALLEL=false
# Max concurrent workers
# MAX_WORKERS=5
```
---
## APPENDIX: DIRECTORY STRUCTURE
```
feed-generator/
├── .env                     # Configuration (NOT in git)
├── .env.example             # Configuration template
├── .gitignore               # Git ignore rules
├── README.md                # Project overview
├── CLAUDE.md                # Development guidelines
├── ARCHITECTURE.md          # Technical design
├── SETUP.md                 # This file
├── requirements.txt         # Python dependencies
├── requirements-dev.txt     # Development dependencies
├── pyproject.toml           # Python project metadata
├── src/                     # Source code
│   ├── __init__.py
│   ├── config.py            # Configuration management
│   ├── exceptions.py        # Custom exceptions
│   ├── scraper.py           # News scraping
│   ├── image_analyzer.py    # Image analysis
│   ├── aggregator.py        # Content aggregation
│   ├── article_client.py    # Node API client
│   └── publisher.py         # Feed publishing
├── tests/                   # Test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_scraper.py
│   ├── test_image_analyzer.py
│   ├── test_aggregator.py
│   ├── test_article_client.py
│   ├── test_publisher.py
│   └── test_integration.py
├── scripts/                 # Utility scripts
│   ├── run.py               # Main pipeline
│   ├── test_pipeline.py     # Pipeline testing
│   ├── test_openai.py       # OpenAI API test
│   ├── test_node_api.py     # Node API test
│   ├── diagnose.py          # System diagnostics
│   ├── debug_report.py      # Debug information
│   └── benchmark.py         # Performance testing
├── output/                  # Generated files (git-ignored)
│   ├── feed.rss
│   ├── articles.json
│   └── feed_generator.log
├── logs/                    # Log files (git-ignored)
│   └── *.log
└── backups/                 # Backup files (git-ignored)
    └── YYYY-MM-DD/
```
---
## APPENDIX: MINIMAL WORKING EXAMPLE
Test that everything works with minimal code:
```python
# test_minimal.py - Minimal working example
from src.config import Config
from src.scraper import NewsScraper
from src.image_analyzer import ImageAnalyzer

# Load configuration
config = Config.from_env()
print("✓ Configuration loaded")

# Test scraper
scraper = NewsScraper(config.scraper)
print("✓ Scraper initialized")

# Test analyzer
analyzer = ImageAnalyzer(config.api.openai_key)
print("✓ Analyzer initialized")

# Scrape one source
test_url = config.scraper.sources[0]
articles = scraper.scrape(test_url)
print(f"✓ Scraped {len(articles)} articles from {test_url}")

# Analyze one image (if available)
if articles and articles[0].image_url:
    analysis = analyzer.analyze(
        articles[0].image_url,
        context="Test image analysis",
    )
    print(f"✓ Image analyzed: {analysis.description[:50]}...")

print("\n✅ All basic functionality working!")
```
Run with:
```bash
python test_minimal.py
```
---
End of SETUP.md