# SETUP.md - Feed Generator Installation Guide

---
## PREREQUISITES

### Required Software

- **Python 3.11+** recommended (3.10 is the minimum supported version)

```bash
python --version  # Should be 3.11 or higher
```

- **pip** (comes with Python)

```bash
pip --version
```

- **Git** (for cloning the repository)

```bash
git --version
```

### Required Services

- **OpenAI API account** with GPT-4 Vision access
  - Sign up: https://platform.openai.com/signup
  - Generate an API key: https://platform.openai.com/api-keys

- **Node.js Article Generator** (your existing API)
  - Should be running on `http://localhost:3000` (a quick reachability check is sketched after this list)
  - Or configure a different URL in `.env`
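If you want a quick sanity check before installing anything, the hedged sketch below (a hypothetical helper, not shipped with the repository) confirms that an OpenAI-style key is set and that the Node.js API answers on the `/health` endpoint used later in this guide:

```python
# check_prereqs.py - optional, hypothetical helper (not part of the repository)
import os
import urllib.request

NODE_API_URL = os.environ.get("NODE_API_URL", "http://localhost:3000")

# OpenAI: only verify that a key is set; a real API call would cost tokens.
if os.environ.get("OPENAI_API_KEY", "").startswith("sk-"):
    print("✓ OPENAI_API_KEY looks set")
else:
    print("✗ OPENAI_API_KEY missing or malformed")

# Node.js Article Generator: hit the /health endpoint referenced later in this guide.
try:
    with urllib.request.urlopen(f"{NODE_API_URL}/health", timeout=5) as resp:
        print(f"✓ Node API reachable (HTTP {resp.status})")
except OSError as exc:
    print(f"✗ Node API unreachable: {exc}")
```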
---

## INSTALLATION

### Step 1: Clone Repository

```bash
# Clone the project
git clone https://github.com/your-org/feed-generator.git
cd feed-generator

# Verify structure
ls -la
# Should see: src/, tests/, requirements.txt, README.md, etc.
```

### Step 2: Create Virtual Environment

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate

# On Windows:
venv\Scripts\activate

# Verify activation (should show (venv) in prompt)
which python  # Should point to venv/bin/python
```

### Step 3: Install Dependencies

```bash
# Upgrade pip first
pip install --upgrade pip

# Install project dependencies
pip install -r requirements.txt

# Verify installations
pip list
# Should see: requests, beautifulsoup4, openai, pytest, mypy, etc.
```
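To double-check that the core libraries actually import (not just appear in `pip list`), a small, hypothetical helper like this can be run; the package list mirrors the one above:

```python
# verify_imports.py - optional, hypothetical sanity check (not part of the repository)
import importlib

# Distribution name -> importable module name
packages = {"requests": "requests", "beautifulsoup4": "bs4", "openai": "openai"}

for dist, module in packages.items():
    try:
        importlib.import_module(module)
        print(f"✓ {dist} importable")
    except ImportError:
        print(f"✗ {dist} missing - run: pip install -r requirements.txt")
```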
### Step 4: Install Development Tools (Optional)

```bash
# For development
pip install -r requirements-dev.txt

# Includes: black, flake8, pylint, ipython
```

---

## CONFIGURATION

### Step 1: Create Environment File

```bash
# Copy example configuration
cp .env.example .env

# Edit with your settings
nano .env  # or vim, code, etc.
```

### Step 2: Configure API Keys

Edit the `.env` file:

```bash
# REQUIRED: OpenAI API Key
OPENAI_API_KEY=sk-proj-your-key-here

# REQUIRED: Node.js Article Generator API
NODE_API_URL=http://localhost:3000

# REQUIRED: News sources (comma-separated)
NEWS_SOURCES=https://example.com/news,https://techcrunch.com/feed

# OPTIONAL: Logging level
LOG_LEVEL=INFO

# OPTIONAL: Timeouts and limits
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
```

### Step 3: Verify Configuration

```bash
# Test configuration loading
python -c "from src.config import Config; c = Config.from_env(); print(c)"
# Should print the loaded configuration without errors
```
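For a check that reports configuration problems instead of a raw traceback, a hedged sketch along these lines can be run from the project root; it relies only on the `Config.from_env()` entry point shown above:

```python
# show_config.py - optional, hypothetical helper; relies only on Config.from_env() shown above
import sys

from src.config import Config

try:
    config = Config.from_env()
except Exception as exc:  # the project defines custom exceptions; Exception keeps this sketch generic
    print(f"Configuration failed to load: {exc}", file=sys.stderr)
    sys.exit(1)

print("Configuration loaded:")
print(config)  # careful: depending on Config's repr, this output may include secrets
```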
---

## VERIFICATION

### Step 1: Verify Python Environment

```bash
# Check Python version
python --version
# Output: Python 3.11.x or higher

# Check virtual environment
which python
# Output: /path/to/feed-generator/venv/bin/python

# Check installed packages
pip list | grep -E "(requests|openai|beautifulsoup4)"
# Should show all three packages
```

### Step 2: Verify API Connections

#### Test OpenAI API

```bash
python scripts/test_openai.py
```

Expected output:

```
Testing OpenAI API connection...
✓ API key loaded
✓ Connection successful
✓ GPT-4 Vision available
All checks passed!
```

#### Test Node.js API

```bash
# Make sure your Node.js API is running first
# In another terminal:
cd /path/to/node-article-generator
npm start

# Then test connection
python scripts/test_node_api.py
```

Expected output:

```
Testing Node.js API connection...
✓ API endpoint reachable
✓ Health check passed
✓ Test article generation successful
All checks passed!
```

### Step 3: Run Component Tests

```bash
# Test individual components
python -m pytest tests/ -v

# Expected output:
# tests/test_config.py::test_config_from_env PASSED
# tests/test_scraper.py::test_scraper_init PASSED
# ...
# ============ X passed in X.XXs ============
```
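If you later add tests of your own, they can follow the same pattern as the suite above. The sketch below is illustrative only and assumes `Config.from_env()` fails loudly when a required variable such as `OPENAI_API_KEY` is missing:

```python
# tests/test_missing_key.py - illustrative sketch only; the shipped suite already covers configuration
import pytest

from src.config import Config


def test_config_requires_openai_key(monkeypatch: pytest.MonkeyPatch) -> None:
    """Assumed behaviour: Config.from_env() raises when OPENAI_API_KEY is absent."""
    monkeypatch.delenv("OPENAI_API_KEY", raising=False)
    with pytest.raises(Exception):  # use the project's specific exception type in real tests
        Config.from_env()
```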
### Step 4: Test Complete Pipeline

```bash
# Dry run (mock external services)
python scripts/test_pipeline.py --dry-run

# Expected output:
# [INFO] Starting pipeline test (dry run)...
# [INFO] ✓ Configuration loaded
# [INFO] ✓ Scraper initialized
# [INFO] ✓ Image analyzer initialized
# [INFO] ✓ API client initialized
# [INFO] ✓ Publisher initialized
# [INFO] Pipeline test successful!
```

---

## RUNNING THE GENERATOR

### Manual Execution

```bash
# Run complete pipeline
python scripts/run.py

# With custom configuration
python scripts/run.py --config custom.env

# Dry run (no actual API calls)
python scripts/run.py --dry-run

# Verbose output
python scripts/run.py --verbose
```

### Expected Output

```
[2025-01-15 10:00:00] INFO - Starting Feed Generator...
[2025-01-15 10:00:00] INFO - Loading configuration...
[2025-01-15 10:00:01] INFO - Configuration loaded successfully
[2025-01-15 10:00:01] INFO - Scraping 3 news sources...
[2025-01-15 10:00:05] INFO - Scraped 15 articles
[2025-01-15 10:00:05] INFO - Analyzing 15 images...
[2025-01-15 10:00:25] INFO - Analyzed 12 images (3 failed)
[2025-01-15 10:00:25] INFO - Aggregating content...
[2025-01-15 10:00:25] INFO - Aggregated 12 items
[2025-01-15 10:00:25] INFO - Generating articles...
[2025-01-15 10:01:30] INFO - Generated 12 articles
[2025-01-15 10:01:30] INFO - Publishing to RSS...
[2025-01-15 10:01:30] INFO - Published to output/feed.rss
[2025-01-15 10:01:30] INFO - Pipeline complete! (90 seconds)
```

### Output Files

```bash
# Check generated files
ls -l output/

# Should see:
# feed.rss            - RSS feed
# articles.json       - Full article data
# feed_generator.log  - Execution log
```
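For a quick programmatic look at the generated files, a hypothetical helper like this one (standard library only) counts the entries in `articles.json` and lists the `<item>` titles in `feed.rss`:

```python
# inspect_output.py - optional, hypothetical helper for a quick look at the generated files
import json
import xml.etree.ElementTree as ET

# articles.json - full article data (exact structure is project-specific; a top-level list is assumed here)
with open("output/articles.json", encoding="utf-8") as fh:
    articles = json.load(fh)
print(f"articles.json: {len(articles)} top-level entries")

# feed.rss - print the item titles
tree = ET.parse("output/feed.rss")
for title in tree.findall(".//item/title"):
    print("-", title.text)
```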
---

## TROUBLESHOOTING

### Issue: "OPENAI_API_KEY not found"

**Cause**: Environment variable not set

**Solution**:
```bash
# Check that the .env file exists
ls -la .env

# Verify the API key is set
grep OPENAI_API_KEY .env

# Re-activate the virtual environment and re-run from the project root (where .env lives)
source venv/bin/activate
```

### Issue: "Module not found" errors

**Cause**: Dependencies not installed

**Solution**:
```bash
# Ensure the virtual environment is activated
which python  # Should point to venv

# Reinstall dependencies
pip install -r requirements.txt

# Verify installation
pip list | grep <missing-module>
```

### Issue: "Connection refused" to Node API

**Cause**: Node.js API not running

**Solution**:
```bash
# Start the Node.js API first
cd /path/to/node-article-generator
npm start

# Verify it's running
curl http://localhost:3000/health

# Check the configured URL in .env
grep NODE_API_URL .env
```

### Issue: "Rate limit exceeded" from OpenAI

**Cause**: Too many API requests

**Solution**:
```bash
# Reduce MAX_ARTICLES in .env
# (edit the existing MAX_ARTICLES line rather than appending a duplicate entry)
nano .env  # set MAX_ARTICLES=5

# Built-in request throttling is a planned enhancement;
# for now, wait a few minutes and retry
```
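Until built-in throttling lands, a generic retry-with-exponential-backoff wrapper can be placed around the OpenAI calls. The sketch below is illustrative and not part of the repository; narrow the caught exception to the OpenAI rate-limit error in real code:

```python
# backoff.py - illustrative retry helper, not part of the repository; adapt the names to your code
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], retries: int = 5, base_delay: float = 2.0) -> T:
    """Retry fn() with exponential backoff and jitter when it raises."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:  # narrow this to the OpenAI rate-limit exception in real code
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("unreachable")  # only reached if retries <= 0
```

At the call site this would look like `call_with_backoff(lambda: analyzer.analyze(image_url, context=...))`, reusing the analyzer API shown in the minimal working example at the end of this guide.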
### Issue: Scraping fails for specific sites

**Cause**: Site structure changed or the site is blocking requests

**Solution**:
```bash
# Test the individual source
python scripts/test_scraper.py --url https://problematic-site.com

# Check the logs
grep ScrapingError feed_generator.log

# Temporarily remove the problematic source from .env
nano .env  # Remove it from NEWS_SOURCES
```

### Issue: Type checking fails

**Cause**: Missing or incorrect type hints

**Solution**:
```bash
# Run mypy to see the errors
mypy src/

# Fix reported issues
# Every function must have type hints
```

---
## DEVELOPMENT SETUP

### Additional Tools

```bash
# (all of these are also included in requirements-dev.txt)

# Code formatting
pip install black
black src/ tests/

# Linting
pip install flake8
flake8 src/ tests/

# Type checking
pip install mypy
mypy src/

# Interactive Python shell
pip install ipython
ipython
```

### Pre-commit Hook (Optional)

pre-commit reads its hook definitions from `.pre-commit-config.yaml` in the repository root, so make sure that file exists before installing the hooks.

```bash
# Install pre-commit
pip install pre-commit

# Set up the hooks
pre-commit install

# Hooks now run automatically on git commit
# Or run them manually:
pre-commit run --all-files
```

### IDE Setup

#### VS Code

```json
// .vscode/settings.json
{
  "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
  "python.linting.enabled": true,
  "python.linting.pylintEnabled": false,
  "python.linting.flake8Enabled": true,
  "python.formatting.provider": "black",
  "python.analysis.typeCheckingMode": "strict"
}
```

#### PyCharm

1. Open the project
2. File → Settings → Project → Python Interpreter
3. Add Interpreter → Existing Environment
4. Select: /path/to/feed-generator/venv/bin/python
5. Apply

---
## SCHEDULED EXECUTION

### Cron Job (Linux/Mac)

```bash
# Edit crontab
crontab -e

# Run every 6 hours (create the logs/ directory first: mkdir -p logs)
0 */6 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1

# Run daily at 8 AM
0 8 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
```

### Systemd Service (Linux)

```ini
# /etc/systemd/system/feed-generator.service
[Unit]
Description=Feed Generator
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/feed-generator
ExecStart=/path/to/feed-generator/venv/bin/python scripts/run.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
# Enable and start
sudo systemctl enable feed-generator
sudo systemctl start feed-generator

# Check status
sudo systemctl status feed-generator
```

Note: this service unit runs the pipeline once each time it is started; for recurring runs, pair it with a systemd timer or use the cron approach above.

### Task Scheduler (Windows)

```powershell
# Create scheduled task
$action = New-ScheduledTaskAction -Execute "C:\path\to\venv\Scripts\python.exe" -Argument "C:\path\to\scripts\run.py"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "FeedGenerator" -Description "Run feed generator daily"
```

---
## MONITORING

### Log Files

```bash
# View live logs
tail -f feed_generator.log

# View recent errors
grep ERROR feed_generator.log | tail -20

# View pipeline summary
grep "Pipeline complete" feed_generator.log
```
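If you want a quick summary without reading the whole log, a hypothetical helper like this one parses the `[timestamp] LEVEL - message` format shown earlier:

```python
# log_summary.py - optional, hypothetical helper; assumes the "[timestamp] LEVEL - message" format above
from collections import Counter
from pathlib import Path

counts: Counter[str] = Counter()
last_complete = None

for line in Path("feed_generator.log").read_text(encoding="utf-8").splitlines():
    parts = line.split(" ", 3)  # ["[2025-01-15", "10:00:00]", "INFO", "- message"]
    if len(parts) >= 3:
        counts[parts[2]] += 1
    if "Pipeline complete" in line:
        last_complete = line

print("Log levels:", dict(counts))
print("Last completed run:", last_complete or "none found")
```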
### Metrics Dashboard (Future)

```bash
# View last run metrics
python scripts/show_metrics.py

# Expected output:
# Last Run: 2025-01-15 10:01:30
# Duration: 90 seconds
# Articles Scraped: 15
# Articles Generated: 12
# Success Rate: 80%
# Errors: 3 (image analysis failures)
```

---

## BACKUP & RECOVERY

### Backup Configuration

```bash
# Back up the .env file (CAREFUL - it contains API keys)
cp .env .env.backup

# Store it securely, NOT in git
# Use a password manager or encrypted storage
```

### Backup Output

```bash
# Create a daily backup
mkdir -p backups/$(date +%Y-%m-%d)
cp -r output/* backups/$(date +%Y-%m-%d)/

# Automated backup script
./scripts/backup_output.sh
```

### Recovery

```bash
# Restore from backup
cp backups/2025-01-15/feed.rss output/

# Verify integrity
python scripts/verify_feed.py output/feed.rss
```
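`scripts/verify_feed.py` is the project's own integrity check. If you need a standalone sanity check (for example on a machine without the repository), a minimal sketch using only the standard library looks like this:

```python
# verify_rss.py - illustrative standalone check; scripts/verify_feed.py remains the real tool
import sys
import xml.etree.ElementTree as ET

path = sys.argv[1] if len(sys.argv) > 1 else "output/feed.rss"

try:
    root = ET.parse(path).getroot()
except ET.ParseError as exc:
    sys.exit(f"✗ {path} is not well-formed XML: {exc}")

items = root.findall(".//item")
if not items:
    sys.exit(f"✗ {path} parsed but contains no <item> elements")

print(f"✓ {path} is well-formed and contains {len(items)} items")
```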
---

## UPDATING

### Update Dependencies

```bash
# Activate the virtual environment
source venv/bin/activate

# Update pip
pip install --upgrade pip

# Update all packages
pip install --upgrade -r requirements.txt

# Check for any remaining outdated packages
pip list --outdated
```

### Update Code

```bash
# Pull latest changes
git pull origin main

# Reinstall if requirements changed
pip install -r requirements.txt

# Run tests
python -m pytest tests/

# Test the pipeline
python scripts/test_pipeline.py --dry-run
```

---
## UNINSTALLATION

### Remove Virtual Environment

```bash
# Deactivate first
deactivate

# Remove virtual environment
rm -rf venv/
```

### Remove Generated Files

```bash
# Remove output
rm -rf output/

# Remove logs
rm -rf logs/

# Remove backups
rm -rf backups/
```

### Remove Project

```bash
# Remove the entire project directory
cd ..
rm -rf feed-generator/
```

---
## SECURITY CHECKLIST

Before deploying:

- [ ] `.env` file is NOT committed to git
- [ ] `.env.example` has placeholder values only
- [ ] API keys are stored securely
- [ ] `.gitignore` includes `.env`, `venv/`, `output/`, `logs/`
- [ ] Log files don't contain sensitive data
- [ ] File permissions are restrictive (`chmod 600 .env`)
- [ ] Virtual environment is isolated
- [ ] Dependencies are from trusted sources

Two of these checks (git tracking and file permissions) can be automated with the sketch below.
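The following hypothetical helper automates the git-tracking and file-permission checks on Linux/macOS; the remaining items still need a manual review:

```python
# security_check.py - optional, hypothetical helper (Linux/macOS); run it from the project root
import os
import stat
import subprocess

# .env must not be tracked by git (exit code 0 means git knows about the file)
tracked = subprocess.run(
    ["git", "ls-files", "--error-unmatch", ".env"],
    capture_output=True,
).returncode == 0
print("✗ .env is tracked by git!" if tracked else "✓ .env is not tracked by git")

# Permissions should be restrictive (chmod 600 .env); raises FileNotFoundError if .env is missing
mode = stat.S_IMODE(os.stat(".env").st_mode)
print(f"{'✓' if mode == 0o600 else '✗'} .env permissions are {oct(mode)} (expected 0o600)")
```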
---

## PERFORMANCE BASELINE

Expected performance on standard hardware:

| Metric | Target | Acceptable Range |
|--------|--------|------------------|
| Scraping (10 articles) | 10s | 5-20s |
| Image analysis (10 images) | 30s | 20-50s |
| Article generation (10 articles) | 60s | 40-120s |
| Publishing | 1s | <5s |
| **Total pipeline (10 articles)** | **2 min** | **1-5 min** |

### Performance Testing

```bash
# Benchmark pipeline
python scripts/benchmark.py

# Output:
# Scraping:   8.3s (15 articles)
# Analysis:   42.1s (15 images)
# Generation: 95.7s (12 articles)
# Publishing: 0.8s
# TOTAL:      146.9s
```
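To compare one of your own runs against the baseline table, a small, hypothetical wrapper can time `scripts/run.py` end to end (300 s corresponds to the 5-minute upper bound above):

```python
# time_pipeline.py - optional, hypothetical wrapper; times one full run against the baseline above
import subprocess
import sys
import time

start = time.perf_counter()
result = subprocess.run([sys.executable, "scripts/run.py"])
elapsed = time.perf_counter() - start

print(f"Pipeline exited with code {result.returncode} after {elapsed:.1f}s")
print("Within baseline" if elapsed <= 300 else "Slower than the 5-minute acceptable range")
```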
---

## NEXT STEPS

After successful setup:

1. **Run the first pipeline**

   ```bash
   python scripts/run.py
   ```

2. **Verify the output**

   ```bash
   ls -l output/
   head -20 output/feed.rss
   ```

3. **Set up scheduling** (cron/systemd/Task Scheduler)

4. **Configure monitoring** (logs, metrics)

5. **Read DEVELOPMENT.md** for extending functionality

---
## GETTING HELP

### Documentation

- **README.md** - Project overview
- **ARCHITECTURE.md** - Technical design
- **CLAUDE.md** - Development guidelines
- **API_INTEGRATION.md** - Node API integration

### Diagnostics

```bash
# Run the diagnostics script
python scripts/diagnose.py

# Output:
# ✓ Python version: 3.11.5
# ✓ Virtual environment: active
# ✓ Dependencies: installed
# ✓ Configuration: valid
# ✓ OpenAI API: reachable
# ✓ Node API: reachable
# ✓ Output directory: writable
# All systems operational!
```

### Common Issues

Check the troubleshooting section above, or:

```bash
# Generate a debug report
python scripts/debug_report.py > debug.txt

# Share debug.txt (remove API keys first!)
```
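Since the report may embed secrets, a small, hypothetical filter can mask OpenAI-style keys before you share it, e.g. `python scripts/debug_report.py | python redact.py > debug.txt`:

```python
# redact.py - optional, hypothetical filter; masks OpenAI-style keys read from stdin
import re
import sys

text = sys.stdin.read()
# Anything that looks like an OpenAI key (sk-...) is replaced with a placeholder.
print(re.sub(r"sk-[A-Za-z0-9_-]{10,}", "sk-***REDACTED***", text), end="")
```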
---

## CHECKLIST: FIRST RUN

Complete setup verification:

- [ ] Python 3.11+ installed
- [ ] Virtual environment created and activated
- [ ] Dependencies installed (`pip list` shows all packages)
- [ ] `.env` file created with API keys
- [ ] OpenAI API connection tested
- [ ] Node.js API running and tested
- [ ] Configuration validated (`Config.from_env()` works)
- [ ] Component tests pass (`pytest tests/`)
- [ ] Dry run successful (`python scripts/run.py --dry-run`)
- [ ] First real run completed
- [ ] Output files generated (`output/feed.rss` exists)
- [ ] Logs are readable (`feed_generator.log`)

**If all checks pass → You're ready to use Feed Generator!**

---
## QUICK START SUMMARY

For experienced developers:

```bash
# 1. Setup
git clone <repo> && cd feed-generator
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Edit .env with your API keys

# 3. Test
python scripts/test_pipeline.py --dry-run

# 4. Run
python scripts/run.py

# 5. Verify
ls -l output/
```

**Time to first run: ~10 minutes**

---
## APPENDIX: EXAMPLE .env FILE

```bash
# .env.example - Copy to .env and fill in your values

# ==============================================
# REQUIRED CONFIGURATION
# ==============================================

# OpenAI API Key (get from https://platform.openai.com/api-keys)
OPENAI_API_KEY=sk-proj-your-actual-key-here

# Node.js Article Generator API URL
NODE_API_URL=http://localhost:3000

# News sources (comma-separated URLs)
NEWS_SOURCES=https://techcrunch.com/feed,https://www.theverge.com/rss/index.xml

# ==============================================
# OPTIONAL CONFIGURATION
# ==============================================

# Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO

# Maximum articles to process per source
MAX_ARTICLES=10

# HTTP timeout for scraping (seconds)
SCRAPER_TIMEOUT=10

# HTTP timeout for API calls (seconds)
API_TIMEOUT=30

# Output directory (default: ./output)
OUTPUT_DIR=./output

# ==============================================
# ADVANCED CONFIGURATION (V2)
# ==============================================

# Enable caching (true/false)
# ENABLE_CACHE=false

# Cache TTL in seconds
# CACHE_TTL=3600

# Enable parallel processing (true/false)
# ENABLE_PARALLEL=false

# Max concurrent workers
# MAX_WORKERS=5
```

---
## APPENDIX: DIRECTORY STRUCTURE

```
feed-generator/
├── .env                      # Configuration (NOT in git)
├── .env.example              # Configuration template
├── .gitignore                # Git ignore rules
├── README.md                 # Project overview
├── CLAUDE.md                 # Development guidelines
├── ARCHITECTURE.md           # Technical design
├── SETUP.md                  # This file
├── requirements.txt          # Python dependencies
├── requirements-dev.txt      # Development dependencies
├── pyproject.toml            # Python project metadata
│
├── src/                      # Source code
│   ├── __init__.py
│   ├── config.py             # Configuration management
│   ├── exceptions.py         # Custom exceptions
│   ├── scraper.py            # News scraping
│   ├── image_analyzer.py     # Image analysis
│   ├── aggregator.py         # Content aggregation
│   ├── article_client.py     # Node API client
│   └── publisher.py          # Feed publishing
│
├── tests/                    # Test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_scraper.py
│   ├── test_image_analyzer.py
│   ├── test_aggregator.py
│   ├── test_article_client.py
│   ├── test_publisher.py
│   └── test_integration.py
│
├── scripts/                  # Utility scripts
│   ├── run.py                # Main pipeline
│   ├── test_pipeline.py      # Pipeline testing
│   ├── test_openai.py        # OpenAI API test
│   ├── test_node_api.py      # Node API test
│   ├── diagnose.py           # System diagnostics
│   ├── debug_report.py       # Debug information
│   └── benchmark.py          # Performance testing
│
├── output/                   # Generated files (git-ignored)
│   ├── feed.rss
│   ├── articles.json
│   └── feed_generator.log
│
├── logs/                     # Log files (git-ignored)
│   └── *.log
│
└── backups/                  # Backup files (git-ignored)
    └── YYYY-MM-DD/
```

---
## APPENDIX: MINIMAL WORKING EXAMPLE

Test that everything works with minimal code:

```python
# test_minimal.py - Minimal working example

from src.config import Config
from src.scraper import NewsScraper
from src.image_analyzer import ImageAnalyzer

# Load configuration
config = Config.from_env()
print("✓ Configuration loaded")

# Test scraper
scraper = NewsScraper(config.scraper)
print("✓ Scraper initialized")

# Test analyzer
analyzer = ImageAnalyzer(config.api.openai_key)
print("✓ Analyzer initialized")

# Scrape one article
test_url = config.scraper.sources[0]
articles = scraper.scrape(test_url)
print(f"✓ Scraped {len(articles)} articles from {test_url}")

# Analyze one image (if available)
if articles and articles[0].image_url:
    analysis = analyzer.analyze(
        articles[0].image_url,
        context="Test image analysis",
    )
    print(f"✓ Image analyzed: {analysis.description[:50]}...")

print("\n✅ All basic functionality working!")
```

Run with:

```bash
python test_minimal.py
```

---

End of SETUP.md