# SETUP.md - Feed Generator Installation Guide
---
## PREREQUISITES
### Required Software
- **Python 3.11+** recommended (3.10 is the minimum supported version)
  ```bash
  python --version  # Should be 3.10 or higher, ideally 3.11+
  ```
- **pip** (comes with Python)
  ```bash
  pip --version
  ```
- **Git** (for cloning the repository)
  ```bash
  git --version
  ```
### Required Services
- **OpenAI API account** with GPT-4 Vision access
  - Sign up: https://platform.openai.com/signup
  - Generate an API key: https://platform.openai.com/api-keys
- **Node.js Article Generator** (your existing API)
  - Should be running on `http://localhost:3000`
  - Or configure a different URL in `.env`
---
## INSTALLATION
### Step 1: Clone Repository
```bash
# Clone the project
git clone https://github.com/your-org/feed-generator.git
cd feed-generator
# Verify structure
ls -la
# Should see: src/, tests/, requirements.txt, README.md, etc.
```
### Step 2: Create Virtual Environment
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Verify activation (should show (venv) in prompt)
which python # Should point to venv/bin/python
```
### Step 3: Install Dependencies
```bash
# Upgrade pip first
pip install --upgrade pip
# Install project dependencies
pip install -r requirements.txt
# Verify installations
pip list
# Should see: requests, beautifulsoup4, openai, pytest, mypy, etc.
```
### Step 4: Install Development Tools (Optional)
```bash
# For development
pip install -r requirements-dev.txt
# Includes: black, flake8, pylint, ipython
```
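If `requirements-dev.txt` is missing from your checkout, a minimal version covering the tools this guide references would be (unpinned here for brevity; pin versions in practice):
```
# requirements-dev.txt (illustrative)
black
flake8
pylint
ipython
pre-commit
```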
---
## CONFIGURATION
### Step 1: Create Environment File
```bash
# Copy example configuration
cp .env.example .env
# Edit with your settings
nano .env # or vim, code, etc.
```
### Step 2: Configure API Keys
Edit `.env` file:
```bash
# REQUIRED: OpenAI API Key
OPENAI_API_KEY=sk-proj-your-key-here
# REQUIRED: Node.js Article Generator API
NODE_API_URL=http://localhost:3000
# REQUIRED: News sources (comma-separated)
NEWS_SOURCES=https://example.com/news,https://techcrunch.com/feed
# OPTIONAL: Logging level
LOG_LEVEL=INFO
# OPTIONAL: Timeouts and limits
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
```
### Step 3: Verify Configuration
```bash
# Test configuration loading
python -c "from src.config import Config; c = Config.from_env(); print(c)"
# Should print configuration without errors
```
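The loader behind `Config.from_env()` follows a fail-fast pattern: missing required variables raise immediately instead of surfacing later in the pipeline. As a rough sketch of that pattern only (not the project's actual `src/config.py`), assuming variables are already in the process environment or loaded via `python-dotenv`:
```python
# sketch: fail-fast environment loading (illustrative, not src/config.py)
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleConfig:
    openai_api_key: str
    node_api_url: str
    news_sources: tuple[str, ...]

    @classmethod
    def from_env(cls) -> "ExampleConfig":
        def require(name: str) -> str:
            value = os.environ.get(name)
            if not value:
                raise ValueError(f"Missing required environment variable: {name}")
            return value

        return cls(
            openai_api_key=require("OPENAI_API_KEY"),
            node_api_url=require("NODE_API_URL"),
            news_sources=tuple(s.strip() for s in require("NEWS_SOURCES").split(",")),
        )
```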
---
## VERIFICATION
### Step 1: Verify Python Environment
```bash
# Check Python version
python --version
# Output: Python 3.10 or higher (ideally 3.11.x)
# Check virtual environment
which python
# Output: /path/to/feed-generator/venv/bin/python
# Check installed packages
pip list | grep -E "(requests|openai|beautifulsoup4)"
# Should show all three packages
```
### Step 2: Verify API Connections
#### Test OpenAI API
```bash
python scripts/test_openai.py
```
Expected output:
```
Testing OpenAI API connection...
✓ API key loaded
✓ Connection successful
✓ GPT-4 Vision available
All checks passed!
```
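If `scripts/test_openai.py` is missing from your checkout, an equivalent hand-rolled check looks like this (a sketch assuming the `openai` v1 Python client; a bad key or network failure raises immediately):
```python
# sketch: manual OpenAI connectivity check (assumes openai>=1.0)
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
models = [m.id for m in client.models.list()]  # fails fast on a bad key
print(f"✓ Connection successful ({len(models)} models visible)")
```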
#### Test Node.js API
```bash
# Make sure your Node.js API is running first
# In another terminal:
cd /path/to/node-article-generator
npm start
# Then test connection
python scripts/test_node_api.py
```
Expected output:
```
Testing Node.js API connection...
✓ API endpoint reachable
✓ Health check passed
✓ Test article generation successful
All checks passed!
```
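The same check can be done by hand with `requests`; a sketch assuming the `/health` endpoint used elsewhere in this guide:
```python
# sketch: manual Node API health check (assumes a /health endpoint)
import os

import requests

base_url = os.environ.get("NODE_API_URL", "http://localhost:3000")
response = requests.get(f"{base_url}/health", timeout=5)
response.raise_for_status()  # non-2xx means reachable but unhealthy
print(f"✓ {base_url} reachable: {response.status_code}")
```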
### Step 3: Run Component Tests
```bash
# Test individual components
python -m pytest tests/ -v
# Expected output:
# tests/test_config.py::test_config_from_env PASSED
# tests/test_scraper.py::test_scraper_init PASSED
# ...
# ============ X passed in X.XXs ============
```
### Step 4: Test Complete Pipeline
```bash
# Dry run (mock external services)
python scripts/test_pipeline.py --dry-run
# Expected output:
# [INFO] Starting pipeline test (dry run)...
# [INFO] ✓ Configuration loaded
# [INFO] ✓ Scraper initialized
# [INFO] ✓ Image analyzer initialized
# [INFO] ✓ API client initialized
# [INFO] ✓ Publisher initialized
# [INFO] Pipeline test successful!
```
---
## RUNNING THE GENERATOR
### Manual Execution
```bash
# Run complete pipeline
python scripts/run.py
# With custom configuration
python scripts/run.py --config custom.env
# Dry run (no actual API calls)
python scripts/run.py --dry-run
# Verbose output
python scripts/run.py --verbose
```
### Expected Output
```
[2025-01-15 10:00:00] INFO - Starting Feed Generator...
[2025-01-15 10:00:00] INFO - Loading configuration...
[2025-01-15 10:00:01] INFO - Configuration loaded successfully
[2025-01-15 10:00:01] INFO - Scraping 3 news sources...
[2025-01-15 10:00:05] INFO - Scraped 15 articles
[2025-01-15 10:00:05] INFO - Analyzing 15 images...
[2025-01-15 10:00:25] INFO - Analyzed 12 images (3 failed)
[2025-01-15 10:00:25] INFO - Aggregating content...
[2025-01-15 10:00:25] INFO - Aggregated 12 items
[2025-01-15 10:00:25] INFO - Generating articles...
[2025-01-15 10:01:30] INFO - Generated 12 articles
[2025-01-15 10:01:30] INFO - Publishing to RSS...
[2025-01-15 10:01:30] INFO - Published to output/feed.rss
[2025-01-15 10:01:30] INFO - Pipeline complete! (90 seconds)
```
### Output Files
```bash
# Check generated files
ls -l output/
# Should see:
# feed.rss - RSS feed
# articles.json - Full article data
# feed_generator.log - Execution log
```
---
## TROUBLESHOOTING
### Issue: "OPENAI_API_KEY not found"
**Cause**: Environment variable not set
**Solution**:
```bash
# Check .env file exists
ls -la .env
# Verify API key is set
cat .env | grep OPENAI_API_KEY
# .env is read when the pipeline starts, so re-run the command after saving
# changes; also make sure the virtual environment is active
source venv/bin/activate
```
### Issue: "Module not found" errors
**Cause**: Dependencies not installed
**Solution**:
```bash
# Ensure virtual environment is activated
which python # Should point to venv
# Reinstall dependencies
pip install -r requirements.txt
# Verify installation
pip list | grep <missing-module>
```
### Issue: "Connection refused" to Node API
**Cause**: Node.js API not running
**Solution**:
```bash
# Start Node.js API first
cd /path/to/node-article-generator
npm start
# Verify it's running
curl http://localhost:3000/health
# Check configured URL in .env
cat .env | grep NODE_API_URL
```
### Issue: "Rate limit exceeded" from OpenAI
**Cause**: Too many API requests
**Solution**:
```bash
# Reduce MAX_ARTICLES in .env (edit the existing line, or append an override)
echo "MAX_ARTICLES=5" >> .env
# A per-request delay is a planned enhancement (see the backoff sketch below)
# For now, wait a few minutes and retry
```
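Until a built-in delay lands, a caller-side retry with exponential backoff is the standard workaround. A minimal sketch; `call_api` is a hypothetical stand-in for whatever request hits the limit:
```python
# sketch: exponential backoff around a rate-limited call
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call_api: Callable[[], T], max_attempts: int = 5, base_delay: float = 2.0) -> T:
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception as exc:  # narrow to your client's rate-limit error in real code
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
            print(f"Request failed ({exc}); retrying in {delay:.0f}s...")
            time.sleep(delay)
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```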
### Issue: Scraping fails for specific sites
**Cause**: Site structure changed or blocking
**Solution**:
```bash
# Test individual source
python scripts/test_scraper.py --url https://problematic-site.com
# Check logs
cat feed_generator.log | grep ScrapingError
# Remove problematic source from .env temporarily
nano .env # Remove from NEWS_SOURCES
```
### Issue: Type checking fails
**Cause**: Missing or incorrect type hints
**Solution**:
```bash
# Run mypy to see errors
mypy src/
# Fix reported issues
# Every function must have type hints
```
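For reference, this is the level of annotation strict mode expects: parameters, returns, and even `None` returns are spelled out.
```python
# sketch: annotations that satisfy mypy strict mode
def filter_titles(titles: list[str], min_length: int = 10) -> list[str]:
    """Return titles at least min_length characters long."""
    return [t for t in titles if len(t) >= min_length]


def log_summary(count: int) -> None:  # "returns nothing" is still annotated
    print(f"Kept {count} titles")
```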
---
## DEVELOPMENT SETUP
### Additional Tools
```bash
# Code formatting
pip install black
black src/ tests/
# Linting
pip install flake8
flake8 src/ tests/
# Type checking
pip install mypy
mypy src/
# Interactive Python shell
pip install ipython
ipython
```
### Pre-commit Hook (Optional)
```bash
# Install pre-commit
pip install pre-commit
# Setup hooks
pre-commit install
# Now runs automatically on git commit
# Or run manually:
pre-commit run --all-files
```
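`pre-commit install` expects a `.pre-commit-config.yaml` at the repository root. If the project does not ship one, a minimal config wiring up the formatters above might look like this (the `rev` values are placeholders; pin them to real release tags):
```yaml
# .pre-commit-config.yaml (illustrative; pin rev to actual release tags)
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
```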
### IDE Setup
#### VS Code
```json
// .vscode/settings.json
{
  "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
  "python.linting.enabled": true,
  "python.linting.pylintEnabled": false,
  "python.linting.flake8Enabled": true,
  "python.formatting.provider": "black",
  "python.analysis.typeCheckingMode": "strict"
}
```
#### PyCharm
```
1. Open Project
2. File → Settings → Project → Python Interpreter
3. Add Interpreter → Existing Environment
4. Select: /path/to/feed-generator/venv/bin/python
5. Apply
```
---
## SCHEDULED EXECUTION
### Cron Job (Linux/Mac)
```bash
# Edit crontab
crontab -e
# Run every 6 hours
0 */6 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
# Run daily at 8 AM
0 8 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
```
### Systemd Service (Linux)
```ini
# /etc/systemd/system/feed-generator.service
[Unit]
Description=Feed Generator
After=network.target
[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/feed-generator
ExecStart=/path/to/feed-generator/venv/bin/python scripts/run.py
Restart=on-failure
[Install]
WantedBy=multi-user.target
```
```bash
# Enable and start
sudo systemctl enable feed-generator
sudo systemctl start feed-generator
# Check status
sudo systemctl status feed-generator
```
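The unit above runs the pipeline once each time the service starts. To have systemd schedule it periodically (the equivalent of the cron entries above), pair it with a timer unit; a sketch:
```ini
# /etc/systemd/system/feed-generator.timer (sketch; adjust OnCalendar to taste)
[Unit]
Description=Run Feed Generator every 6 hours

[Timer]
OnCalendar=00/6:00
Persistent=true

[Install]
WantedBy=timers.target
```
Then enable the timer rather than the service: `sudo systemctl enable --now feed-generator.timer`.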
### Task Scheduler (Windows)
```powershell
# Create scheduled task
$action = New-ScheduledTaskAction -Execute "C:\path\to\venv\Scripts\python.exe" -Argument "C:\path\to\scripts\run.py" -WorkingDirectory "C:\path\to\feed-generator"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "FeedGenerator" -Description "Run feed generator daily"
```
---
## MONITORING
### Log Files
```bash
# View live logs
tail -f feed_generator.log
# View recent errors
grep ERROR feed_generator.log | tail -20
# View pipeline summary
grep "Pipeline complete" feed_generator.log
```
### Metrics Dashboard (Future)
```bash
# View last run metrics
python scripts/show_metrics.py
# Expected output:
# Last Run: 2025-01-15 10:01:30
# Duration: 90 seconds
# Articles Scraped: 15
# Articles Generated: 12
# Success Rate: 80%
# Errors: 3 (image analysis failures)
```
---
## BACKUP & RECOVERY
### Backup Configuration
```bash
# Backup .env file (CAREFUL - contains API keys)
cp .env .env.backup
# Store securely, NOT in git
# Use password manager or encrypted storage
```
### Backup Output
```bash
# Create daily backup
mkdir -p backups/$(date +%Y-%m-%d)
cp -r output/* backups/$(date +%Y-%m-%d)/
# Automated backup script
./scripts/backup_output.sh
```
### Recovery
```bash
# Restore from backup
cp backups/2025-01-15/feed.rss output/
# Verify integrity
python scripts/verify_feed.py output/feed.rss
```
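If `scripts/verify_feed.py` is unavailable, a basic well-formedness check with the standard library is a reasonable stand-in (it catches truncated or corrupt XML, not RSS semantics):
```python
# sketch: minimal RSS well-formedness check (structure only, not RSS validity)
import xml.etree.ElementTree as ET

tree = ET.parse("output/feed.rss")  # raises ParseError on corrupt XML
items = tree.findall(".//item")
print(f"✓ feed.rss parses; {len(items)} <item> entries found")
```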
---
## UPDATING
### Update Dependencies
```bash
# Activate virtual environment
source venv/bin/activate
# Update pip
pip install --upgrade pip
# Update all packages
pip install --upgrade -r requirements.txt
# Check whether anything is still outdated
pip list --outdated
```
### Update Code
```bash
# Pull latest changes
git pull origin main
# Reinstall if requirements changed
pip install -r requirements.txt
# Run tests
python -m pytest tests/
# Test pipeline
python scripts/test_pipeline.py --dry-run
```
---
## UNINSTALLATION
### Remove Virtual Environment
```bash
# Deactivate first
deactivate
# Remove virtual environment
rm -rf venv/
```
### Remove Generated Files
```bash
# Remove output
rm -rf output/
# Remove logs
rm -rf logs/
# Remove backups
rm -rf backups/
```
### Remove Project
```bash
# Remove entire project directory
cd ..
rm -rf feed-generator/
```
---
## SECURITY CHECKLIST
Before deploying:
- [ ] `.env` file is NOT committed to git
- [ ] `.env.example` has placeholder values only
- [ ] API keys are stored securely
- [ ] `.gitignore` includes `.env`, `venv/`, `output/`, `logs/` (see the sample below)
- [ ] Log files don't contain sensitive data
- [ ] File permissions are restrictive (`chmod 600 .env`)
- [ ] Virtual environment is isolated
- [ ] Dependencies are from trusted sources
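A minimal `.gitignore` covering these items (plus Python's bytecode caches) looks like:
```
# .gitignore essentials for this checklist
.env
.env.backup
venv/
output/
logs/
backups/
__pycache__/
*.pyc
```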
---
## PERFORMANCE BASELINE
Expected performance on standard hardware:
| Metric | Target | Acceptable Range |
|--------|--------|------------------|
| Scraping (10 articles) | 10s | 5-20s |
| Image analysis (10 images) | 30s | 20-50s |
| Article generation (10 articles) | 60s | 40-120s |
| Publishing | 1s | <5s |
| **Total pipeline (10 articles)** | **2 min** | **1-5 min** |
### Performance Testing
```bash
# Benchmark pipeline
python scripts/benchmark.py
# Output:
# Scraping: 8.3s (15 articles)
# Analysis: 42.1s (15 images)
# Generation: 95.7s (12 articles)
# Publishing: 0.8s
# TOTAL: 146.9s
```
---
## NEXT STEPS
After successful setup:
1. **Run first pipeline**
   ```bash
   python scripts/run.py
   ```
2. **Verify output**
   ```bash
   ls -l output/
   cat output/feed.rss | head -20
   ```
3. **Set up scheduling** (cron/systemd/Task Scheduler)
4. **Configure monitoring** (logs, metrics)
5. **Read DEVELOPMENT.md** for extending functionality
---
## GETTING HELP
### Documentation
- **README.md** - Project overview
- **ARCHITECTURE.md** - Technical design
- **CLAUDE.md** - Development guidelines
- **API_INTEGRATION.md** - Node API integration
### Diagnostics
```bash
# Run diagnostics script
python scripts/diagnose.py
# Output:
# ✓ Python version: 3.11.5
# ✓ Virtual environment: active
# ✓ Dependencies: installed
# ✓ Configuration: valid
# ✓ OpenAI API: reachable
# ✓ Node API: reachable
# ✓ Output directory: writable
# All systems operational!
```
### Common Issues
Check troubleshooting section above, or:
```bash
# Generate debug report
python scripts/debug_report.py > debug.txt
# Share debug.txt (remove API keys first!)
```
---
## CHECKLIST: FIRST RUN
Complete setup verification:
- [ ] Python 3.11+ installed
- [ ] Virtual environment created and activated
- [ ] Dependencies installed (`pip list` shows all packages)
- [ ] `.env` file created with API keys
- [ ] OpenAI API connection tested
- [ ] Node.js API running and tested
- [ ] Configuration validated (`Config.from_env()` works)
- [ ] Component tests pass (`pytest tests/`)
- [ ] Dry run successful (`python scripts/run.py --dry-run`)
- [ ] First real run completed
- [ ] Output files generated (`output/feed.rss` exists)
- [ ] Logs are readable (`feed_generator.log`)
**If all checks pass → You're ready to use Feed Generator!**
---
## QUICK START SUMMARY
For experienced developers:
```bash
# 1. Setup
git clone <repo> && cd feed-generator
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# 2. Configure
cp .env.example .env
# Edit .env with your API keys
# 3. Test
python scripts/test_pipeline.py --dry-run
# 4. Run
python scripts/run.py
# 5. Verify
ls -l output/
```
**Time to first run: ~10 minutes**
---
## APPENDIX: EXAMPLE .env FILE
```bash
# .env.example - Copy to .env and fill in your values
# ==============================================
# REQUIRED CONFIGURATION
# ==============================================
# OpenAI API Key (get from https://platform.openai.com/api-keys)
OPENAI_API_KEY=sk-proj-your-actual-key-here
# Node.js Article Generator API URL
NODE_API_URL=http://localhost:3000
# News sources (comma-separated URLs)
NEWS_SOURCES=https://techcrunch.com/feed,https://www.theverge.com/rss/index.xml
# ==============================================
# OPTIONAL CONFIGURATION
# ==============================================
# Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO
# Maximum articles to process per source
MAX_ARTICLES=10
# HTTP timeout for scraping (seconds)
SCRAPER_TIMEOUT=10
# HTTP timeout for API calls (seconds)
API_TIMEOUT=30
# Output directory (default: ./output)
OUTPUT_DIR=./output
# ==============================================
# ADVANCED CONFIGURATION (V2)
# ==============================================
# Enable caching (true/false)
# ENABLE_CACHE=false
# Cache TTL in seconds
# CACHE_TTL=3600
# Enable parallel processing (true/false)
# ENABLE_PARALLEL=false
# Max concurrent workers
# MAX_WORKERS=5
```
---
## APPENDIX: DIRECTORY STRUCTURE
```
feed-generator/
├── .env                     # Configuration (NOT in git)
├── .env.example             # Configuration template
├── .gitignore               # Git ignore rules
├── README.md                # Project overview
├── CLAUDE.md                # Development guidelines
├── ARCHITECTURE.md          # Technical design
├── SETUP.md                 # This file
├── requirements.txt         # Python dependencies
├── requirements-dev.txt     # Development dependencies
├── pyproject.toml           # Python project metadata
├── src/                     # Source code
│   ├── __init__.py
│   ├── config.py            # Configuration management
│   ├── exceptions.py        # Custom exceptions
│   ├── scraper.py           # News scraping
│   ├── image_analyzer.py    # Image analysis
│   ├── aggregator.py        # Content aggregation
│   ├── article_client.py    # Node API client
│   └── publisher.py         # Feed publishing
├── tests/                   # Test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_scraper.py
│   ├── test_image_analyzer.py
│   ├── test_aggregator.py
│   ├── test_article_client.py
│   ├── test_publisher.py
│   └── test_integration.py
├── scripts/                 # Utility scripts
│   ├── run.py               # Main pipeline
│   ├── test_pipeline.py     # Pipeline testing
│   ├── test_openai.py       # OpenAI API test
│   ├── test_node_api.py     # Node API test
│   ├── diagnose.py          # System diagnostics
│   ├── debug_report.py      # Debug information
│   └── benchmark.py         # Performance testing
├── output/                  # Generated files (git-ignored)
│   ├── feed.rss
│   ├── articles.json
│   └── feed_generator.log
├── logs/                    # Log files (git-ignored)
│   └── *.log
└── backups/                 # Backup files (git-ignored)
    └── YYYY-MM-DD/
```
---
## APPENDIX: MINIMAL WORKING EXAMPLE
Test that everything works with minimal code:
```python
# test_minimal.py - Minimal working example
from src.config import Config
from src.scraper import NewsScraper
from src.image_analyzer import ImageAnalyzer

# Load configuration
config = Config.from_env()
print("✓ Configuration loaded")

# Test scraper
scraper = NewsScraper(config.scraper)
print("✓ Scraper initialized")

# Test analyzer
analyzer = ImageAnalyzer(config.api.openai_key)
print("✓ Analyzer initialized")

# Scrape one source
test_url = config.scraper.sources[0]
articles = scraper.scrape(test_url)
print(f"✓ Scraped {len(articles)} articles from {test_url}")

# Analyze one image (if available)
if articles and articles[0].image_url:
    analysis = analyzer.analyze(
        articles[0].image_url,
        context="Test image analysis",
    )
    print(f"✓ Image analyzed: {analysis.description[:50]}...")

print("\n✅ All basic functionality working!")
```
Run with:
```bash
python test_minimal.py
```
---
End of SETUP.md