
# SETUP.md - Feed Generator Installation Guide

---

## PREREQUISITES

### Required Software

- **Python 3.11+** (3.10 minimum)
  ```bash
  python --version  # Should be 3.11 or higher
  ```
- **pip** (comes with Python)
  ```bash
  pip --version
  ```
- **Git** (for cloning the repository)
  ```bash
  git --version
  ```

### Required Services

- **OpenAI API key** with GPT-4 Vision access (for image analysis)
- **Node.js Article Generator API** running and reachable (see API_INTEGRATION.md)

## INSTALLATION

### Step 1: Clone Repository

```bash
# Clone the project
git clone https://github.com/your-org/feed-generator.git
cd feed-generator

# Verify structure
ls -la
# Should see: src/, tests/, requirements.txt, README.md, etc.
```

### Step 2: Create Virtual Environment

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate

# On Windows:
venv\Scripts\activate

# Verify activation (should show (venv) in prompt)
which python  # Should point to venv/bin/python
```

### Step 3: Install Dependencies

```bash
# Upgrade pip first
pip install --upgrade pip

# Install project dependencies
pip install -r requirements.txt

# Verify installations
pip list
# Should see: requests, beautifulsoup4, openai, pytest, mypy, etc.
```

### Step 4: Install Development Tools (Optional)

```bash
# For development
pip install -r requirements-dev.txt

# Includes: black, flake8, pylint, ipython
```

## CONFIGURATION

### Step 1: Create Environment File

```bash
# Copy example configuration
cp .env.example .env

# Edit with your settings
nano .env  # or vim, code, etc.
```

### Step 2: Configure API Keys

Edit the `.env` file:

```bash
# REQUIRED: OpenAI API Key
OPENAI_API_KEY=sk-proj-your-key-here

# REQUIRED: Node.js Article Generator API
NODE_API_URL=http://localhost:3000

# REQUIRED: News sources (comma-separated)
NEWS_SOURCES=https://example.com/news,https://techcrunch.com/feed

# OPTIONAL: Logging level
LOG_LEVEL=INFO

# OPTIONAL: Timeouts and limits
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
```

### Step 3: Verify Configuration

```bash
# Test configuration loading
python -c "from src.config import Config; c = Config.from_env(); print(c)"

# Should print configuration without errors
```
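For orientation, here is a hedged sketch of the kind of frozen, validated settings object `Config.from_env()` is expected to build. The field names and defaults below are illustrative assumptions, not the actual `src/config.py` API:

```python
# config_sketch.py - illustrative only; the real Config lives in src/config.py
# and its exact fields may differ.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSketch:
    openai_api_key: str
    node_api_url: str
    news_sources: tuple[str, ...]
    max_articles: int = 10

    @classmethod
    def from_env(cls) -> "ConfigSketch":
        key = os.environ.get("OPENAI_API_KEY", "")
        if not key:
            raise ValueError("OPENAI_API_KEY is required")
        sources = tuple(
            s.strip() for s in os.environ.get("NEWS_SOURCES", "").split(",") if s.strip()
        )
        if not sources:
            raise ValueError("NEWS_SOURCES must list at least one URL")
        return cls(
            openai_api_key=key,
            node_api_url=os.environ.get("NODE_API_URL", "http://localhost:3000"),
            news_sources=sources,
            max_articles=int(os.environ.get("MAX_ARTICLES", "10")),
        )
```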

## VERIFICATION

### Step 1: Verify Python Environment

```bash
# Check Python version
python --version
# Output: Python 3.11.x or higher

# Check virtual environment
which python
# Output: /path/to/feed-generator/venv/bin/python

# Check installed packages
pip list | grep -E "(requests|openai|beautifulsoup4)"
# Should show all three packages
```

### Step 2: Verify API Connections

#### Test OpenAI API

```bash
python scripts/test_openai.py
```

Expected output:

```
Testing OpenAI API connection...
✓ API key loaded
✓ Connection successful
✓ GPT-4 Vision available
All checks passed!
```
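If `scripts/test_openai.py` is missing from your checkout, a minimal hand-rolled probe looks roughly like this. It assumes the official `openai` Python package (v1.x); the model-name check is an assumption and may need adjusting to whichever vision-capable model your account exposes:

```python
# check_openai.py - minimal connectivity probe (illustrative, not the bundled script)
import os

from openai import OpenAI  # requires the `openai` package, v1.x

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
model_ids = [m.id for m in client.models.list()]
print("✓ Connection successful")

# Model naming changes over time; adjust to the vision-capable model you use.
if any("gpt-4" in model_id for model_id in model_ids):
    print("✓ GPT-4 family models available")
else:
    print("✗ No GPT-4 model visible to this API key")
```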

#### Test Node.js API

```bash
# Make sure your Node.js API is running first
# In another terminal:
cd /path/to/node-article-generator
npm start

# Then test connection
python scripts/test_node_api.py
```

Expected output:

```
Testing Node.js API connection...
✓ API endpoint reachable
✓ Health check passed
✓ Test article generation successful
All checks passed!
```
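To probe the endpoint by hand, a quick sketch using `requests` (it assumes the `/health` route referenced later in the troubleshooting section):

```python
# check_node_api.py - quick reachability probe (illustrative)
import os

import requests

base_url = os.environ.get("NODE_API_URL", "http://localhost:3000")
resp = requests.get(f"{base_url}/health", timeout=5)
resp.raise_for_status()  # raises if the API returned an error status
print(f"✓ {base_url}/health responded: {resp.status_code}")
```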

### Step 3: Run Component Tests

```bash
# Test individual components
python -m pytest tests/ -v

# Expected output:
# tests/test_config.py::test_config_from_env PASSED
# tests/test_scraper.py::test_scraper_init PASSED
# ...
# ============ X passed in X.XXs ============
```

### Step 4: Test Complete Pipeline

```bash
# Dry run (mock external services)
python scripts/test_pipeline.py --dry-run

# Expected output:
# [INFO] Starting pipeline test (dry run)...
# [INFO] ✓ Configuration loaded
# [INFO] ✓ Scraper initialized
# [INFO] ✓ Image analyzer initialized
# [INFO] ✓ API client initialized
# [INFO] ✓ Publisher initialized
# [INFO] Pipeline test successful!
```

## RUNNING THE GENERATOR

### Manual Execution

```bash
# Run complete pipeline
python scripts/run.py

# With custom configuration
python scripts/run.py --config custom.env

# Dry run (no actual API calls)
python scripts/run.py --dry-run

# Verbose output
python scripts/run.py --verbose
```

### Expected Output

```
[2025-01-15 10:00:00] INFO - Starting Feed Generator...
[2025-01-15 10:00:00] INFO - Loading configuration...
[2025-01-15 10:00:01] INFO - Configuration loaded successfully
[2025-01-15 10:00:01] INFO - Scraping 3 news sources...
[2025-01-15 10:00:05] INFO - Scraped 15 articles
[2025-01-15 10:00:05] INFO - Analyzing 15 images...
[2025-01-15 10:00:25] INFO - Analyzed 12 images (3 failed)
[2025-01-15 10:00:25] INFO - Aggregating content...
[2025-01-15 10:00:25] INFO - Aggregated 12 items
[2025-01-15 10:00:25] INFO - Generating articles...
[2025-01-15 10:01:30] INFO - Generated 12 articles
[2025-01-15 10:01:30] INFO - Publishing to RSS...
[2025-01-15 10:01:30] INFO - Published to output/feed.rss
[2025-01-15 10:01:30] INFO - Pipeline complete! (90 seconds)
```

### Output Files

```bash
# Check generated files
ls -l output/

# Should see:
# feed.rss          - RSS feed
# articles.json     - Full article data
# feed_generator.log - Execution log
```

## TROUBLESHOOTING

### Issue: "OPENAI_API_KEY not found"

**Cause:** Environment variable not set

**Solution:**

```bash
# Check .env file exists
ls -la .env

# Verify API key is set
cat .env | grep OPENAI_API_KEY

# Reload environment
source venv/bin/activate
```

### Issue: "Module not found" errors

**Cause:** Dependencies not installed

**Solution:**

```bash
# Ensure virtual environment is activated
which python  # Should point to venv

# Reinstall dependencies
pip install -r requirements.txt

# Verify installation
pip list | grep <missing-module>
```

### Issue: "Connection refused" to Node API

**Cause:** Node.js API not running

**Solution:**

```bash
# Start Node.js API first
cd /path/to/node-article-generator
npm start

# Verify it's running
curl http://localhost:3000/health

# Check configured URL in .env
cat .env | grep NODE_API_URL
```

### Issue: "Rate limit exceeded" from OpenAI

**Cause:** Too many API requests

**Solution:**

```bash
# Reduce MAX_ARTICLES in .env
echo "MAX_ARTICLES=5" >> .env

# Add delay between requests (future enhancement)
# For now, wait a few minutes and retry
```
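If rate limiting keeps recurring, one interim option (not built into the pipeline, per the note above) is to wrap the OpenAI calls in your own exponential backoff. A minimal sketch, assuming the retry lives in the calling code:

```python
# backoff_sketch.py - illustrative retry wrapper, not part of the shipped modules
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], retries: int = 5, base_delay: float = 2.0) -> T:
    """Retry `call`, doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:  # narrow this to the API's rate-limit error in real code
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```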

### Issue: Scraping fails for specific sites

**Cause:** Site structure changed or blocking

**Solution:**

```bash
# Test individual source
python scripts/test_scraper.py --url https://problematic-site.com

# Check logs
cat feed_generator.log | grep ScrapingError

# Remove problematic source from .env temporarily
nano .env  # Remove from NEWS_SOURCES
```

### Issue: Type checking fails

**Cause:** Missing or incorrect type hints

**Solution:**

```bash
# Run mypy to see errors
mypy src/

# Fix reported issues
# Every function must have type hints
```
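As a reference point, this is the level of annotation mypy strict mode expects; a generic example, not code taken from this project:

```python
# typing_example.py - fully annotated function that passes `mypy --strict`
def summarize(titles: list[str], limit: int = 5) -> str:
    """Return a comma-separated preview of at most `limit` titles."""
    return ", ".join(titles[:limit])
```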

## DEVELOPMENT SETUP

### Additional Tools

```bash
# Code formatting
pip install black
black src/ tests/

# Linting
pip install flake8
flake8 src/ tests/

# Type checking
pip install mypy
mypy src/

# Interactive Python shell
pip install ipython
ipython
```

### Pre-commit Hook (Optional)

```bash
# Install pre-commit
pip install pre-commit

# Setup hooks
pre-commit install

# Now runs automatically on git commit
# Or run manually:
pre-commit run --all-files
```

### IDE Setup

#### VS Code

Create `.vscode/settings.json`:

```json
{
    "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
    "python.linting.enabled": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": true,
    "python.formatting.provider": "black",
    "python.analysis.typeCheckingMode": "strict"
}
```

#### PyCharm

1. Open Project
2. File → Settings → Project → Python Interpreter
3. Add Interpreter → Existing Environment
4. Select: /path/to/feed-generator/venv/bin/python
5. Apply

## SCHEDULED EXECUTION

### Cron Job (Linux/Mac)

```bash
# Edit crontab
crontab -e

# Run every 6 hours
0 */6 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1

# Run daily at 8 AM
0 8 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
```

### Systemd Service (Linux)

Create `/etc/systemd/system/feed-generator.service`:

```ini
[Unit]
Description=Feed Generator
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/feed-generator
ExecStart=/path/to/feed-generator/venv/bin/python scripts/run.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
# Enable and start
sudo systemctl enable feed-generator
sudo systemctl start feed-generator

# Check status
sudo systemctl status feed-generator
```

### Task Scheduler (Windows)

```powershell
# Create scheduled task
$action = New-ScheduledTaskAction -Execute "C:\path\to\venv\Scripts\python.exe" -Argument "C:\path\to\scripts\run.py"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "FeedGenerator" -Description "Run feed generator daily"
```

## MONITORING

### Log Files

```bash
# View live logs
tail -f feed_generator.log

# View recent errors
grep ERROR feed_generator.log | tail -20

# View pipeline summary
grep "Pipeline complete" feed_generator.log
```

### Metrics Dashboard (Future)

```bash
# View last run metrics
python scripts/show_metrics.py

# Expected output:
# Last Run: 2025-01-15 10:01:30
# Duration: 90 seconds
# Articles Scraped: 15
# Articles Generated: 12
# Success Rate: 80%
# Errors: 3 (image analysis failures)
```

## BACKUP & RECOVERY

### Backup Configuration

```bash
# Backup .env file (CAREFUL - contains API keys)
cp .env .env.backup

# Store securely, NOT in git
# Use password manager or encrypted storage
```

### Backup Output

```bash
# Create daily backup
mkdir -p backups/$(date +%Y-%m-%d)
cp -r output/* backups/$(date +%Y-%m-%d)/

# Automated backup script
./scripts/backup_output.sh
```
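If the backup script is absent from your checkout, a small Python stand-in under the same assumptions (copy `output/` into a dated folder under `backups/`) could look like this:

```python
# backup_output_sketch.py - dated copy of output/ (illustrative stand-in)
import shutil
from datetime import date
from pathlib import Path

dest = Path("backups") / date.today().isoformat()
dest.mkdir(parents=True, exist_ok=True)
shutil.copytree("output", dest, dirs_exist_ok=True)  # overwrite today's backup if rerun
print(f"✓ output/ copied to {dest}/")
```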

### Recovery

```bash
# Restore from backup
cp backups/2025-01-15/feed.rss output/

# Verify integrity
python scripts/verify_feed.py output/feed.rss
```
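If `scripts/verify_feed.py` is not available, a rough stdlib-only stand-in that checks the RSS file parses and contains items (illustrative of the kind of check it performs, not the bundled script):

```python
# verify_feed_sketch.py - rough integrity check using only the standard library
import sys
import xml.etree.ElementTree as ET

path = sys.argv[1] if len(sys.argv) > 1 else "output/feed.rss"
root = ET.parse(path).getroot()          # raises ParseError if the XML is broken
items = root.findall("./channel/item")   # RSS 2.0 layout: <rss><channel><item>...
print(f"✓ {path} parsed, {len(items)} item(s) found")
```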

## UPDATING

### Update Dependencies

```bash
# Activate virtual environment
source venv/bin/activate

# Update pip
pip install --upgrade pip

# Update all packages
pip install --upgrade -r requirements.txt

# Verify updates
pip list --outdated
```

### Update Code

```bash
# Pull latest changes
git pull origin main

# Reinstall if requirements changed
pip install -r requirements.txt

# Run tests
python -m pytest tests/

# Test pipeline
python scripts/test_pipeline.py --dry-run
```

## UNINSTALLATION

### Remove Virtual Environment

```bash
# Deactivate first
deactivate

# Remove virtual environment
rm -rf venv/
```

### Remove Generated Files

```bash
# Remove output
rm -rf output/

# Remove logs
rm -rf logs/

# Remove backups
rm -rf backups/
```

### Remove Project

```bash
# Remove entire project directory
cd ..
rm -rf feed-generator/
```

## SECURITY CHECKLIST

Before deploying:

- `.env` file is NOT committed to git
- `.env.example` has placeholder values only
- API keys are stored securely
- `.gitignore` includes `.env`, `venv/`, `output/`, `logs/`
- Log files don't contain sensitive data
- File permissions are restrictive (`chmod 600 .env`)
- Virtual environment is isolated
- Dependencies are from trusted sources

## PERFORMANCE BASELINE

Expected performance on standard hardware:

| Metric | Target | Acceptable Range |
|--------|--------|------------------|
| Scraping (10 articles) | 10s | 5-20s |
| Image analysis (10 images) | 30s | 20-50s |
| Article generation (10 articles) | 60s | 40-120s |
| Publishing | 1s | <5s |
| Total pipeline (10 articles) | 2 min | 1-5 min |

### Performance Testing

```bash
# Benchmark pipeline
python scripts/benchmark.py

# Output:
# Scraping: 8.3s (15 articles)
# Analysis: 42.1s (15 images)
# Generation: 95.7s (12 articles)
# Publishing: 0.8s
# TOTAL: 146.9s
```

## NEXT STEPS

After successful setup:

1. Run first pipeline:
   ```bash
   python scripts/run.py
   ```
2. Verify output:
   ```bash
   ls -l output/
   cat output/feed.rss | head -20
   ```
3. Set up scheduling (cron/systemd/Task Scheduler)
4. Configure monitoring (logs, metrics)
5. Read DEVELOPMENT.md for extending functionality


## GETTING HELP

### Documentation

- README.md - Project overview
- ARCHITECTURE.md - Technical design
- CLAUDE.md - Development guidelines
- API_INTEGRATION.md - Node API integration

### Diagnostics

```bash
# Run diagnostics script
python scripts/diagnose.py

# Output:
# ✓ Python version: 3.11.5
# ✓ Virtual environment: active
# ✓ Dependencies: installed
# ✓ Configuration: valid
# ✓ OpenAI API: reachable
# ✓ Node API: reachable
# ✓ Output directory: writable
# All systems operational!
```

### Common Issues

Check the troubleshooting section above, or:

```bash
# Generate debug report
python scripts/debug_report.py > debug.txt

# Share debug.txt (remove API keys first!)
```

## CHECKLIST: FIRST RUN

Complete setup verification:

- Python 3.11+ installed
- Virtual environment created and activated
- Dependencies installed (`pip list` shows all packages)
- `.env` file created with API keys
- OpenAI API connection tested
- Node.js API running and tested
- Configuration validated (`Config.from_env()` works)
- Component tests pass (`pytest tests/`)
- Dry run successful (`python scripts/run.py --dry-run`)
- First real run completed
- Output files generated (`output/feed.rss` exists)
- Logs are readable (`feed_generator.log`)
If all checks pass → You're ready to use Feed Generator!


## QUICK START SUMMARY

For experienced developers:

```bash
# 1. Setup
git clone <repo> && cd feed-generator
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Edit .env with your API keys

# 3. Test
python scripts/test_pipeline.py --dry-run

# 4. Run
python scripts/run.py

# 5. Verify
ls -l output/
```

Time to first run: ~10 minutes


## APPENDIX: EXAMPLE .env FILE

```bash
# .env.example - Copy to .env and fill in your values

# ==============================================
# REQUIRED CONFIGURATION
# ==============================================

# OpenAI API Key (get from https://platform.openai.com/api-keys)
OPENAI_API_KEY=sk-proj-your-actual-key-here

# Node.js Article Generator API URL
NODE_API_URL=http://localhost:3000

# News sources (comma-separated URLs)
NEWS_SOURCES=https://techcrunch.com/feed,https://www.theverge.com/rss/index.xml

# ==============================================
# OPTIONAL CONFIGURATION
# ==============================================

# Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO

# Maximum articles to process per source
MAX_ARTICLES=10

# HTTP timeout for scraping (seconds)
SCRAPER_TIMEOUT=10

# HTTP timeout for API calls (seconds)
API_TIMEOUT=30

# Output directory (default: ./output)
OUTPUT_DIR=./output

# ==============================================
# ADVANCED CONFIGURATION (V2)
# ==============================================

# Enable caching (true/false)
# ENABLE_CACHE=false

# Cache TTL in seconds
# CACHE_TTL=3600

# Enable parallel processing (true/false)
# ENABLE_PARALLEL=false

# Max concurrent workers
# MAX_WORKERS=5
```

## APPENDIX: DIRECTORY STRUCTURE

```
feed-generator/
├── .env                    # Configuration (NOT in git)
├── .env.example            # Configuration template
├── .gitignore              # Git ignore rules
├── README.md               # Project overview
├── CLAUDE.md               # Development guidelines
├── ARCHITECTURE.md         # Technical design
├── SETUP.md                # This file
├── requirements.txt        # Python dependencies
├── requirements-dev.txt    # Development dependencies
├── pyproject.toml          # Python project metadata
│
├── src/                    # Source code
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── exceptions.py       # Custom exceptions
│   ├── scraper.py          # News scraping
│   ├── image_analyzer.py   # Image analysis
│   ├── aggregator.py       # Content aggregation
│   ├── article_client.py   # Node API client
│   └── publisher.py        # Feed publishing
│
├── tests/                  # Test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_scraper.py
│   ├── test_image_analyzer.py
│   ├── test_aggregator.py
│   ├── test_article_client.py
│   ├── test_publisher.py
│   └── test_integration.py
│
├── scripts/                # Utility scripts
│   ├── run.py              # Main pipeline
│   ├── test_pipeline.py    # Pipeline testing
│   ├── test_openai.py      # OpenAI API test
│   ├── test_node_api.py    # Node API test
│   ├── diagnose.py         # System diagnostics
│   ├── debug_report.py     # Debug information
│   └── benchmark.py        # Performance testing
│
├── output/                 # Generated files (git-ignored)
│   ├── feed.rss
│   ├── articles.json
│   └── feed_generator.log
│
├── logs/                   # Log files (git-ignored)
│   └── *.log
│
└── backups/                # Backup files (git-ignored)
    └── YYYY-MM-DD/
```

## APPENDIX: MINIMAL WORKING EXAMPLE

Test that everything works with minimal code:

```python
# test_minimal.py - Minimal working example

from src.config import Config
from src.scraper import NewsScraper
from src.image_analyzer import ImageAnalyzer

# Load configuration
config = Config.from_env()
print("✓ Configuration loaded")

# Test scraper
scraper = NewsScraper(config.scraper)
print("✓ Scraper initialized")

# Test analyzer
analyzer = ImageAnalyzer(config.api.openai_key)
print("✓ Analyzer initialized")

# Scrape one article
test_url = config.scraper.sources[0]
articles = scraper.scrape(test_url)
print(f"✓ Scraped {len(articles)} articles from {test_url}")

# Analyze one image (if available)
if articles and articles[0].image_url:
    analysis = analyzer.analyze(
        articles[0].image_url,
        context="Test image analysis"
    )
    print(f"✓ Image analyzed: {analysis.description[:50]}...")

print("\n✅ All basic functionality working!")
```

Run with:

```bash
python test_minimal.py
```

End of SETUP.md