
# SETUP.md - Feed Generator Installation Guide

---

## PREREQUISITES

### Required Software

- **Python 3.11+** (3.10 minimum)
  ```bash
  python --version  # Should be 3.11 or higher
  ```
- **pip** (comes with Python)
  ```bash
  pip --version
  ```
- **Git** (for cloning the repository)
  ```bash
  git --version
  ```

### Required Services

- **OpenAI API key** with GPT-4 Vision access (for image analysis)
- **Node.js Article Generator API** running and reachable (see API_INTEGRATION.md)

## INSTALLATION

### Step 1: Clone Repository

```bash
# Clone the project
git clone https://github.com/your-org/feed-generator.git
cd feed-generator

# Verify structure
ls -la
# Should see: src/, tests/, requirements.txt, README.md, etc.
```

### Step 2: Create Virtual Environment

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate

# On Windows:
venv\Scripts\activate

# Verify activation (should show (venv) in prompt)
which python  # Should point to venv/bin/python
```

### Step 3: Install Dependencies

```bash
# Upgrade pip first
pip install --upgrade pip

# Install project dependencies
pip install -r requirements.txt

# Verify installations
pip list
# Should see: requests, beautifulsoup4, openai, pytest, mypy, etc.
```

### Step 4: Install Development Tools (Optional)

```bash
# For development
pip install -r requirements-dev.txt

# Includes: black, flake8, pylint, ipython
```

## CONFIGURATION

### Step 1: Create Environment File

```bash
# Copy example configuration
cp .env.example .env

# Edit with your settings
nano .env  # or vim, code, etc.
```

### Step 2: Configure API Keys

Edit the `.env` file:

```bash
# REQUIRED: OpenAI API Key
OPENAI_API_KEY=sk-proj-your-key-here

# REQUIRED: Node.js Article Generator API
NODE_API_URL=http://localhost:3000

# REQUIRED: News sources (comma-separated)
NEWS_SOURCES=https://example.com/news,https://techcrunch.com/feed

# OPTIONAL: Logging level
LOG_LEVEL=INFO

# OPTIONAL: Timeouts and limits
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
```

### Step 3: Verify Configuration

```bash
# Test configuration loading
python -c "from src.config import Config; c = Config.from_env(); print(c)"

# Should print configuration without errors
```
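For orientation, here is a hedged sketch of the kind of frozen, validated settings object `Config.from_env()` is expected to build. The field names and defaults below are illustrative assumptions, not the actual `src/config.py` API:

```python
# config_sketch.py - illustrative only; the real Config lives in src/config.py
# and its exact fields may differ.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSketch:
    openai_api_key: str
    node_api_url: str
    news_sources: tuple[str, ...]
    max_articles: int = 10

    @classmethod
    def from_env(cls) -> "ConfigSketch":
        key = os.environ.get("OPENAI_API_KEY", "")
        if not key:
            raise ValueError("OPENAI_API_KEY is required")
        sources = tuple(
            s.strip() for s in os.environ.get("NEWS_SOURCES", "").split(",") if s.strip()
        )
        if not sources:
            raise ValueError("NEWS_SOURCES must list at least one URL")
        return cls(
            openai_api_key=key,
            node_api_url=os.environ.get("NODE_API_URL", "http://localhost:3000"),
            news_sources=sources,
            max_articles=int(os.environ.get("MAX_ARTICLES", "10")),
        )
```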

## VERIFICATION

### Step 1: Verify Python Environment

```bash
# Check Python version
python --version
# Output: Python 3.11.x or higher

# Check virtual environment
which python
# Output: /path/to/feed-generator/venv/bin/python

# Check installed packages
pip list | grep -E "(requests|openai|beautifulsoup4)"
# Should show all three packages
```

### Step 2: Verify API Connections

#### Test OpenAI API

```bash
python scripts/test_openai.py
```

Expected output:

```
Testing OpenAI API connection...
✓ API key loaded
✓ Connection successful
✓ GPT-4 Vision available
All checks passed!
```
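If `scripts/test_openai.py` is missing from your checkout, a minimal hand-rolled probe looks roughly like this. It assumes the official `openai` Python package (v1.x); the model-name check is an assumption and may need adjusting to whichever vision-capable model your account exposes:

```python
# check_openai.py - minimal connectivity probe (illustrative, not the bundled script)
import os

from openai import OpenAI  # requires the `openai` package, v1.x

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
model_ids = [m.id for m in client.models.list()]
print("✓ Connection successful")

# Model naming changes over time; adjust to the vision-capable model you use.
if any("gpt-4" in model_id for model_id in model_ids):
    print("✓ GPT-4 family models available")
else:
    print("✗ No GPT-4 model visible to this API key")
```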

#### Test Node.js API

```bash
# Make sure your Node.js API is running first
# In another terminal:
cd /path/to/node-article-generator
npm start

# Then test connection
python scripts/test_node_api.py
```

Expected output:

```
Testing Node.js API connection...
✓ API endpoint reachable
✓ Health check passed
✓ Test article generation successful
All checks passed!
```
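To probe the endpoint by hand, a quick sketch using `requests` (it assumes the `/health` route referenced later in the troubleshooting section):

```python
# check_node_api.py - quick reachability probe (illustrative)
import os

import requests

base_url = os.environ.get("NODE_API_URL", "http://localhost:3000")
resp = requests.get(f"{base_url}/health", timeout=5)
resp.raise_for_status()  # raises if the API returned an error status
print(f"✓ {base_url}/health responded: {resp.status_code}")
```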

### Step 3: Run Component Tests

```bash
# Test individual components
python -m pytest tests/ -v

# Expected output:
# tests/test_config.py::test_config_from_env PASSED
# tests/test_scraper.py::test_scraper_init PASSED
# ...
# ============ X passed in X.XXs ============
```

### Step 4: Test Complete Pipeline

```bash
# Dry run (mock external services)
python scripts/test_pipeline.py --dry-run

# Expected output:
# [INFO] Starting pipeline test (dry run)...
# [INFO] ✓ Configuration loaded
# [INFO] ✓ Scraper initialized
# [INFO] ✓ Image analyzer initialized
# [INFO] ✓ API client initialized
# [INFO] ✓ Publisher initialized
# [INFO] Pipeline test successful!
```

## RUNNING THE GENERATOR

### Manual Execution

```bash
# Run complete pipeline
python scripts/run.py

# With custom configuration
python scripts/run.py --config custom.env

# Dry run (no actual API calls)
python scripts/run.py --dry-run

# Verbose output
python scripts/run.py --verbose
```

### Expected Output

```
[2025-01-15 10:00:00] INFO - Starting Feed Generator...
[2025-01-15 10:00:00] INFO - Loading configuration...
[2025-01-15 10:00:01] INFO - Configuration loaded successfully
[2025-01-15 10:00:01] INFO - Scraping 3 news sources...
[2025-01-15 10:00:05] INFO - Scraped 15 articles
[2025-01-15 10:00:05] INFO - Analyzing 15 images...
[2025-01-15 10:00:25] INFO - Analyzed 12 images (3 failed)
[2025-01-15 10:00:25] INFO - Aggregating content...
[2025-01-15 10:00:25] INFO - Aggregated 12 items
[2025-01-15 10:00:25] INFO - Generating articles...
[2025-01-15 10:01:30] INFO - Generated 12 articles
[2025-01-15 10:01:30] INFO - Publishing to RSS...
[2025-01-15 10:01:30] INFO - Published to output/feed.rss
[2025-01-15 10:01:30] INFO - Pipeline complete! (90 seconds)
```

### Output Files

```bash
# Check generated files
ls -l output/

# Should see:
# feed.rss          - RSS feed
# articles.json     - Full article data
# feed_generator.log - Execution log
```

## TROUBLESHOOTING

### Issue: "OPENAI_API_KEY not found"

**Cause:** Environment variable not set

**Solution:**

```bash
# Check .env file exists
ls -la .env

# Verify API key is set
cat .env | grep OPENAI_API_KEY

# Reload environment
source venv/bin/activate
```

### Issue: "Module not found" errors

**Cause:** Dependencies not installed

**Solution:**

```bash
# Ensure virtual environment is activated
which python  # Should point to venv

# Reinstall dependencies
pip install -r requirements.txt

# Verify installation
pip list | grep <missing-module>
```

### Issue: "Connection refused" to Node API

**Cause:** Node.js API not running

**Solution:**

```bash
# Start Node.js API first
cd /path/to/node-article-generator
npm start

# Verify it's running
curl http://localhost:3000/health

# Check configured URL in .env
cat .env | grep NODE_API_URL
```

### Issue: "Rate limit exceeded" from OpenAI

**Cause:** Too many API requests

**Solution:**

```bash
# Reduce MAX_ARTICLES in .env
echo "MAX_ARTICLES=5" >> .env

# Add delay between requests (future enhancement)
# For now, wait a few minutes and retry
```
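If rate limiting keeps recurring, one interim option (not built into the pipeline, per the note above) is to wrap the OpenAI calls in your own exponential backoff. A minimal sketch, assuming the retry lives in the calling code:

```python
# backoff_sketch.py - illustrative retry wrapper, not part of the shipped modules
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], retries: int = 5, base_delay: float = 2.0) -> T:
    """Retry `call`, doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:  # narrow this to the API's rate-limit error in real code
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```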

### Issue: Scraping fails for specific sites

**Cause:** Site structure changed or blocking

**Solution:**

```bash
# Test individual source
python scripts/test_scraper.py --url https://problematic-site.com

# Check logs
cat feed_generator.log | grep ScrapingError

# Remove problematic source from .env temporarily
nano .env  # Remove from NEWS_SOURCES
```

### Issue: Type checking fails

**Cause:** Missing or incorrect type hints

**Solution:**

```bash
# Run mypy to see errors
mypy src/

# Fix reported issues
# Every function must have type hints
```
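As a reference point, this is the level of annotation mypy strict mode expects; a generic example, not code taken from this project:

```python
# typing_example.py - fully annotated function that passes `mypy --strict`
def summarize(titles: list[str], limit: int = 5) -> str:
    """Return a comma-separated preview of at most `limit` titles."""
    return ", ".join(titles[:limit])
```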

## DEVELOPMENT SETUP

### Additional Tools

```bash
# Code formatting
pip install black
black src/ tests/

# Linting
pip install flake8
flake8 src/ tests/

# Type checking
pip install mypy
mypy src/

# Interactive Python shell
pip install ipython
ipython
```

### Pre-commit Hook (Optional)

```bash
# Install pre-commit
pip install pre-commit

# Setup hooks
pre-commit install

# Now runs automatically on git commit
# Or run manually:
pre-commit run --all-files
```

### IDE Setup

#### VS Code

Create `.vscode/settings.json`:

```json
{
    "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
    "python.linting.enabled": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": true,
    "python.formatting.provider": "black",
    "python.analysis.typeCheckingMode": "strict"
}
```

#### PyCharm

1. Open Project
2. File → Settings → Project → Python Interpreter
3. Add Interpreter → Existing Environment
4. Select: /path/to/feed-generator/venv/bin/python
5. Apply

## SCHEDULED EXECUTION

### Cron Job (Linux/Mac)

```bash
# Edit crontab
crontab -e

# Run every 6 hours
0 */6 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1

# Run daily at 8 AM
0 8 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
```

### Systemd Service (Linux)

Create `/etc/systemd/system/feed-generator.service`:

```ini
[Unit]
Description=Feed Generator
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/feed-generator
ExecStart=/path/to/feed-generator/venv/bin/python scripts/run.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
# Enable and start
sudo systemctl enable feed-generator
sudo systemctl start feed-generator

# Check status
sudo systemctl status feed-generator
```

### Task Scheduler (Windows)

```powershell
# Create scheduled task
$action = New-ScheduledTaskAction -Execute "C:\path\to\venv\Scripts\python.exe" -Argument "C:\path\to\scripts\run.py"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "FeedGenerator" -Description "Run feed generator daily"
```

## MONITORING

### Log Files

```bash
# View live logs
tail -f feed_generator.log

# View recent errors
grep ERROR feed_generator.log | tail -20

# View pipeline summary
grep "Pipeline complete" feed_generator.log
```

### Metrics Dashboard (Future)

```bash
# View last run metrics
python scripts/show_metrics.py

# Expected output:
# Last Run: 2025-01-15 10:01:30
# Duration: 90 seconds
# Articles Scraped: 15
# Articles Generated: 12
# Success Rate: 80%
# Errors: 3 (image analysis failures)
```

## BACKUP & RECOVERY

### Backup Configuration

```bash
# Backup .env file (CAREFUL - contains API keys)
cp .env .env.backup

# Store securely, NOT in git
# Use password manager or encrypted storage
```

### Backup Output

```bash
# Create daily backup
mkdir -p backups/$(date +%Y-%m-%d)
cp -r output/* backups/$(date +%Y-%m-%d)/

# Automated backup script
./scripts/backup_output.sh
```
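If the backup script is absent from your checkout, a small Python stand-in under the same assumptions (copy `output/` into a dated folder under `backups/`) could look like this:

```python
# backup_output_sketch.py - dated copy of output/ (illustrative stand-in)
import shutil
from datetime import date
from pathlib import Path

dest = Path("backups") / date.today().isoformat()
dest.mkdir(parents=True, exist_ok=True)
shutil.copytree("output", dest, dirs_exist_ok=True)  # overwrite today's backup if rerun
print(f"✓ output/ copied to {dest}/")
```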

### Recovery

```bash
# Restore from backup
cp backups/2025-01-15/feed.rss output/

# Verify integrity
python scripts/verify_feed.py output/feed.rss
```
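If `scripts/verify_feed.py` is not available, a rough stdlib-only stand-in that checks the RSS file parses and contains items (illustrative of the kind of check it performs, not the bundled script):

```python
# verify_feed_sketch.py - rough integrity check using only the standard library
import sys
import xml.etree.ElementTree as ET

path = sys.argv[1] if len(sys.argv) > 1 else "output/feed.rss"
root = ET.parse(path).getroot()          # raises ParseError if the XML is broken
items = root.findall("./channel/item")   # RSS 2.0 layout: <rss><channel><item>...
print(f"✓ {path} parsed, {len(items)} item(s) found")
```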

## UPDATING

### Update Dependencies

```bash
# Activate virtual environment
source venv/bin/activate

# Update pip
pip install --upgrade pip

# Update all packages
pip install --upgrade -r requirements.txt

# Verify updates
pip list --outdated
```

### Update Code

```bash
# Pull latest changes
git pull origin main

# Reinstall if requirements changed
pip install -r requirements.txt

# Run tests
python -m pytest tests/

# Test pipeline
python scripts/test_pipeline.py --dry-run
```

## UNINSTALLATION

### Remove Virtual Environment

```bash
# Deactivate first
deactivate

# Remove virtual environment
rm -rf venv/
```

### Remove Generated Files

```bash
# Remove output
rm -rf output/

# Remove logs
rm -rf logs/

# Remove backups
rm -rf backups/
```

### Remove Project

```bash
# Remove entire project directory
cd ..
rm -rf feed-generator/
```

## SECURITY CHECKLIST

Before deploying:

- `.env` file is NOT committed to git
- `.env.example` has placeholder values only
- API keys are stored securely
- `.gitignore` includes `.env`, `venv/`, `output/`, `logs/`
- Log files don't contain sensitive data
- File permissions are restrictive (`chmod 600 .env`)
- Virtual environment is isolated
- Dependencies are from trusted sources

## PERFORMANCE BASELINE

Expected performance on standard hardware:

| Metric | Target | Acceptable Range |
|--------|--------|------------------|
| Scraping (10 articles) | 10s | 5-20s |
| Image analysis (10 images) | 30s | 20-50s |
| Article generation (10 articles) | 60s | 40-120s |
| Publishing | 1s | <5s |
| Total pipeline (10 articles) | 2 min | 1-5 min |

### Performance Testing

```bash
# Benchmark pipeline
python scripts/benchmark.py

# Output:
# Scraping: 8.3s (15 articles)
# Analysis: 42.1s (15 images)
# Generation: 95.7s (12 articles)
# Publishing: 0.8s
# TOTAL: 146.9s
```

## NEXT STEPS

After successful setup:

1. Run first pipeline:
   ```bash
   python scripts/run.py
   ```
2. Verify output:
   ```bash
   ls -l output/
   cat output/feed.rss | head -20
   ```
3. Set up scheduling (cron/systemd/Task Scheduler)
4. Configure monitoring (logs, metrics)
5. Read DEVELOPMENT.md for extending functionality


## GETTING HELP

### Documentation

- README.md - Project overview
- ARCHITECTURE.md - Technical design
- CLAUDE.md - Development guidelines
- API_INTEGRATION.md - Node API integration

### Diagnostics

```bash
# Run diagnostics script
python scripts/diagnose.py

# Output:
# ✓ Python version: 3.11.5
# ✓ Virtual environment: active
# ✓ Dependencies: installed
# ✓ Configuration: valid
# ✓ OpenAI API: reachable
# ✓ Node API: reachable
# ✓ Output directory: writable
# All systems operational!
```

### Common Issues

Check the troubleshooting section above, or:

```bash
# Generate debug report
python scripts/debug_report.py > debug.txt

# Share debug.txt (remove API keys first!)
```

## CHECKLIST: FIRST RUN

Complete setup verification:

- Python 3.11+ installed
- Virtual environment created and activated
- Dependencies installed (`pip list` shows all packages)
- `.env` file created with API keys
- OpenAI API connection tested
- Node.js API running and tested
- Configuration validated (`Config.from_env()` works)
- Component tests pass (`pytest tests/`)
- Dry run successful (`python scripts/run.py --dry-run`)
- First real run completed
- Output files generated (`output/feed.rss` exists)
- Logs are readable (`feed_generator.log`)
If all checks pass → You're ready to use Feed Generator!


## QUICK START SUMMARY

For experienced developers:

```bash
# 1. Setup
git clone <repo> && cd feed-generator
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Edit .env with your API keys

# 3. Test
python scripts/test_pipeline.py --dry-run

# 4. Run
python scripts/run.py

# 5. Verify
ls -l output/
```

Time to first run: ~10 minutes


## APPENDIX: EXAMPLE .env FILE

```bash
# .env.example - Copy to .env and fill in your values

# ==============================================
# REQUIRED CONFIGURATION
# ==============================================

# OpenAI API Key (get from https://platform.openai.com/api-keys)
OPENAI_API_KEY=sk-proj-your-actual-key-here

# Node.js Article Generator API URL
NODE_API_URL=http://localhost:3000

# News sources (comma-separated URLs)
NEWS_SOURCES=https://techcrunch.com/feed,https://www.theverge.com/rss/index.xml

# ==============================================
# OPTIONAL CONFIGURATION
# ==============================================

# Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO

# Maximum articles to process per source
MAX_ARTICLES=10

# HTTP timeout for scraping (seconds)
SCRAPER_TIMEOUT=10

# HTTP timeout for API calls (seconds)
API_TIMEOUT=30

# Output directory (default: ./output)
OUTPUT_DIR=./output

# ==============================================
# ADVANCED CONFIGURATION (V2)
# ==============================================

# Enable caching (true/false)
# ENABLE_CACHE=false

# Cache TTL in seconds
# CACHE_TTL=3600

# Enable parallel processing (true/false)
# ENABLE_PARALLEL=false

# Max concurrent workers
# MAX_WORKERS=5
```

## APPENDIX: DIRECTORY STRUCTURE

```
feed-generator/
├── .env                    # Configuration (NOT in git)
├── .env.example            # Configuration template
├── .gitignore              # Git ignore rules
├── README.md               # Project overview
├── CLAUDE.md               # Development guidelines
├── ARCHITECTURE.md         # Technical design
├── SETUP.md                # This file
├── requirements.txt        # Python dependencies
├── requirements-dev.txt    # Development dependencies
├── pyproject.toml          # Python project metadata
│
├── src/                    # Source code
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── exceptions.py       # Custom exceptions
│   ├── scraper.py          # News scraping
│   ├── image_analyzer.py   # Image analysis
│   ├── aggregator.py       # Content aggregation
│   ├── article_client.py   # Node API client
│   └── publisher.py        # Feed publishing
│
├── tests/                  # Test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_scraper.py
│   ├── test_image_analyzer.py
│   ├── test_aggregator.py
│   ├── test_article_client.py
│   ├── test_publisher.py
│   └── test_integration.py
│
├── scripts/                # Utility scripts
│   ├── run.py              # Main pipeline
│   ├── test_pipeline.py    # Pipeline testing
│   ├── test_openai.py      # OpenAI API test
│   ├── test_node_api.py    # Node API test
│   ├── diagnose.py         # System diagnostics
│   ├── debug_report.py     # Debug information
│   └── benchmark.py        # Performance testing
│
├── output/                 # Generated files (git-ignored)
│   ├── feed.rss
│   ├── articles.json
│   └── feed_generator.log
│
├── logs/                   # Log files (git-ignored)
│   └── *.log
│
└── backups/                # Backup files (git-ignored)
    └── YYYY-MM-DD/
```

## APPENDIX: MINIMAL WORKING EXAMPLE

Test that everything works with minimal code:

```python
# test_minimal.py - Minimal working example

from src.config import Config
from src.scraper import NewsScraper
from src.image_analyzer import ImageAnalyzer

# Load configuration
config = Config.from_env()
print("✓ Configuration loaded")

# Test scraper
scraper = NewsScraper(config.scraper)
print("✓ Scraper initialized")

# Test analyzer
analyzer = ImageAnalyzer(config.api.openai_key)
print("✓ Analyzer initialized")

# Scrape one article
test_url = config.scraper.sources[0]
articles = scraper.scrape(test_url)
print(f"✓ Scraped {len(articles)} articles from {test_url}")

# Analyze one image (if available)
if articles and articles[0].image_url:
    analysis = analyzer.analyze(
        articles[0].image_url,
        context="Test image analysis"
    )
    print(f"✓ Image analyzed: {analysis.description[:50]}...")

print("\n✅ All basic functionality working!")
```

Run with:

```bash
python test_minimal.py
```

End of SETUP.md