# SETUP.md - Feed Generator Installation Guide

---

## PREREQUISITES

### Required Software

- **Python 3.11+** recommended (3.10 is the minimum supported)

  ```bash
  python --version  # Should be 3.11 or higher
  ```

- **pip** (comes with Python)

  ```bash
  pip --version
  ```

- **Git** (for cloning the repository)

  ```bash
  git --version
  ```

### Required Services

- **OpenAI API account** with GPT-4 Vision access
  - Sign up: https://platform.openai.com/signup
  - Generate an API key: https://platform.openai.com/api-keys
- **Node.js Article Generator** (your existing API)
  - Should be running on `http://localhost:3000`
  - Or configure a different URL in `.env`

---

## INSTALLATION

### Step 1: Clone Repository

```bash
# Clone the project
git clone https://github.com/your-org/feed-generator.git
cd feed-generator

# Verify structure
ls -la
# Should see: src/, tests/, requirements.txt, README.md, etc.
```

### Step 2: Create Virtual Environment

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Verify activation (prompt should show (venv))
which python
# Should point to venv/bin/python
```

### Step 3: Install Dependencies

```bash
# Upgrade pip first
pip install --upgrade pip

# Install project dependencies
pip install -r requirements.txt

# Verify installations
pip list
# Should see: requests, beautifulsoup4, openai, pytest, mypy, etc.
```

### Step 4: Install Development Tools (Optional)

```bash
# For development
pip install -r requirements-dev.txt
# Includes: black, flake8, pylint, ipython
```

---

## CONFIGURATION

### Step 1: Create Environment File

```bash
# Copy example configuration
cp .env.example .env

# Edit with your settings
nano .env  # or vim, code, etc.
```

### Step 2: Configure API Keys

Edit the `.env` file:

```bash
# REQUIRED: OpenAI API Key
OPENAI_API_KEY=sk-proj-your-key-here

# REQUIRED: Node.js Article Generator API
NODE_API_URL=http://localhost:3000

# REQUIRED: News sources (comma-separated)
NEWS_SOURCES=https://example.com/news,https://techcrunch.com/feed

# OPTIONAL: Logging level
LOG_LEVEL=INFO

# OPTIONAL: Timeouts and limits
MAX_ARTICLES=10
SCRAPER_TIMEOUT=10
API_TIMEOUT=30
```

### Step 3: Verify Configuration

```bash
# Test configuration loading
python -c "from src.config import Config; c = Config.from_env(); print(c)"
# Should print the configuration without errors
```

---

## VERIFICATION

### Step 1: Verify Python Environment

```bash
# Check Python version
python --version
# Output: Python 3.11.x or higher

# Check virtual environment
which python
# Output: /path/to/feed-generator/venv/bin/python

# Check installed packages
pip list | grep -E "(requests|openai|beautifulsoup4)"
# Should show all three packages
```

### Step 2: Verify API Connections

#### Test OpenAI API

```bash
python scripts/test_openai.py
```

Expected output:

```
Testing OpenAI API connection...
✓ API key loaded
✓ Connection successful
✓ GPT-4 Vision available
All checks passed!
```

#### Test Node.js API

```bash
# Make sure your Node.js API is running first
# In another terminal:
cd /path/to/node-article-generator
npm start

# Then test the connection
python scripts/test_node_api.py
```

Expected output:

```
Testing Node.js API connection...
✓ API endpoint reachable
✓ Health check passed
✓ Test article generation successful
All checks passed!
```
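If you want to script this connectivity check yourself, here is a minimal sketch. It assumes the `/health` endpoint shown in the troubleshooting section below; the real `scripts/test_node_api.py` may check more (for example, test article generation).

```python
# Hypothetical sketch of a Node API connectivity check, similar in
# spirit to scripts/test_node_api.py. Assumes the API exposes the
# /health endpoint referenced elsewhere in this guide.
import os
import sys

import requests


def check_node_api(base_url: str, timeout: int = 5) -> bool:
    """Return True if the Node.js article generator answers its health check."""
    try:
        resp = requests.get(f"{base_url}/health", timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"✗ Node API unreachable at {base_url}: {exc}")
        return False
    print(f"✓ Node API reachable at {base_url}")
    return True


if __name__ == "__main__":
    url = os.environ.get("NODE_API_URL", "http://localhost:3000")
    sys.exit(0 if check_node_api(url) else 1)
```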
### Step 3: Run Component Tests

```bash
# Test individual components
python -m pytest tests/ -v

# Expected output:
# tests/test_config.py::test_config_from_env PASSED
# tests/test_scraper.py::test_scraper_init PASSED
# ...
# ============ X passed in X.XXs ============
```

### Step 4: Test Complete Pipeline

```bash
# Dry run (mock external services)
python scripts/test_pipeline.py --dry-run

# Expected output:
# [INFO] Starting pipeline test (dry run)...
# [INFO] ✓ Configuration loaded
# [INFO] ✓ Scraper initialized
# [INFO] ✓ Image analyzer initialized
# [INFO] ✓ API client initialized
# [INFO] ✓ Publisher initialized
# [INFO] Pipeline test successful!
```

---

## RUNNING THE GENERATOR

### Manual Execution

```bash
# Run complete pipeline
python scripts/run.py

# With custom configuration
python scripts/run.py --config custom.env

# Dry run (no actual API calls)
python scripts/run.py --dry-run

# Verbose output
python scripts/run.py --verbose
```

### Expected Output

```
[2025-01-15 10:00:00] INFO - Starting Feed Generator...
[2025-01-15 10:00:00] INFO - Loading configuration...
[2025-01-15 10:00:01] INFO - Configuration loaded successfully
[2025-01-15 10:00:01] INFO - Scraping 3 news sources...
[2025-01-15 10:00:05] INFO - Scraped 15 articles
[2025-01-15 10:00:05] INFO - Analyzing 15 images...
[2025-01-15 10:00:25] INFO - Analyzed 12 images (3 failed)
[2025-01-15 10:00:25] INFO - Aggregating content...
[2025-01-15 10:00:25] INFO - Aggregated 12 items
[2025-01-15 10:00:25] INFO - Generating articles...
[2025-01-15 10:01:30] INFO - Generated 12 articles
[2025-01-15 10:01:30] INFO - Publishing to RSS...
[2025-01-15 10:01:30] INFO - Published to output/feed.rss
[2025-01-15 10:01:30] INFO - Pipeline complete! (90 seconds)
```

### Output Files

```bash
# Check generated files
ls -l output/
# Should see:
# feed.rss            - RSS feed
# articles.json       - Full article data
# feed_generator.log  - Execution log
```

---

## TROUBLESHOOTING

### Issue: "OPENAI_API_KEY not found"

**Cause**: Environment variable not set

**Solution**:

```bash
# Check that the .env file exists
ls -la .env

# Verify the API key is set
grep OPENAI_API_KEY .env

# Reload the environment
source venv/bin/activate
```

### Issue: "Module not found" errors

**Cause**: Dependencies not installed

**Solution**:

```bash
# Ensure the virtual environment is activated
which python  # Should point to venv

# Reinstall dependencies
pip install -r requirements.txt

# Verify installation
pip list | grep -E "(requests|openai|beautifulsoup4)"
```

### Issue: "Connection refused" to Node API

**Cause**: Node.js API not running

**Solution**:

```bash
# Start the Node.js API first
cd /path/to/node-article-generator
npm start

# Verify it's running
curl http://localhost:3000/health

# Check the configured URL in .env
grep NODE_API_URL .env
```

### Issue: "Rate limit exceeded" from OpenAI

**Cause**: Too many API requests

**Solution**:

```bash
# Reduce MAX_ARTICLES in .env
echo "MAX_ARTICLES=5" >> .env

# Add delay between requests (future enhancement)
# For now, wait a few minutes and retry
```

### Issue: Scraping fails for specific sites

**Cause**: Site structure changed, or the site is blocking requests

**Solution**:

```bash
# Test the individual source
python scripts/test_scraper.py --url https://problematic-site.com

# Check the logs
grep ScrapingError feed_generator.log

# Temporarily remove the problematic source from .env
nano .env  # Remove it from NEWS_SOURCES
```

### Issue: Type checking fails

**Cause**: Missing or incorrect type hints

**Solution**:

```bash
# Run mypy to see the errors
mypy src/

# Fix the reported issues
# Every function must have type hints
```
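As a reminder of what mypy expects under the "every function must have type hints" rule, a minimal illustration (the function name is made up for this example and is not part of the project API):

```python
# Illustration only: a fully annotated function of the kind mypy accepts
# in strict mode. scrape_titles is a hypothetical name.
def scrape_titles(urls: list[str], timeout: int = 10) -> dict[str, str]:
    """Map each URL to a placeholder title; real scraping omitted."""
    return {url: "untitled" for url in urls}
```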
---

## DEVELOPMENT SETUP

### Additional Tools

```bash
# Code formatting
pip install black
black src/ tests/

# Linting
pip install flake8
flake8 src/ tests/

# Type checking
pip install mypy
mypy src/

# Interactive Python shell
pip install ipython
ipython
```

### Pre-commit Hook (Optional)

```bash
# Install pre-commit
pip install pre-commit

# Set up hooks
pre-commit install

# Now runs automatically on git commit
# Or run manually:
pre-commit run --all-files
```

### IDE Setup

#### VS Code

```json
// .vscode/settings.json
{
    "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
    "python.linting.enabled": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": true,
    "python.formatting.provider": "black",
    "python.analysis.typeCheckingMode": "strict"
}
```

#### PyCharm

```
1. Open Project
2. File → Settings → Project → Python Interpreter
3. Add Interpreter → Existing Environment
4. Select: /path/to/feed-generator/venv/bin/python
5. Apply
```

---

## SCHEDULED EXECUTION

### Cron Job (Linux/Mac)

```bash
# Edit crontab
crontab -e

# Run every 6 hours
0 */6 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1

# Run daily at 8 AM
0 8 * * * cd /path/to/feed-generator && venv/bin/python scripts/run.py >> logs/cron.log 2>&1
```

### Systemd Service (Linux)

```ini
# /etc/systemd/system/feed-generator.service
[Unit]
Description=Feed Generator
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/feed-generator
ExecStart=/path/to/feed-generator/venv/bin/python scripts/run.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

```bash
# Enable and start
sudo systemctl enable feed-generator
sudo systemctl start feed-generator

# Check status
sudo systemctl status feed-generator
```

### Task Scheduler (Windows)

```powershell
# Create a scheduled task
$action = New-ScheduledTaskAction -Execute "C:\path\to\venv\Scripts\python.exe" -Argument "C:\path\to\scripts\run.py"
$trigger = New-ScheduledTaskTrigger -Daily -At 8am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "FeedGenerator" -Description "Run feed generator daily"
```

---

## MONITORING

### Log Files

```bash
# View live logs
tail -f feed_generator.log

# View recent errors
grep ERROR feed_generator.log | tail -20

# View pipeline summary
grep "Pipeline complete" feed_generator.log
```

### Metrics Dashboard (Future)

```bash
# View last run metrics
python scripts/show_metrics.py

# Expected output:
# Last Run: 2025-01-15 10:01:30
# Duration: 90 seconds
# Articles Scraped: 15
# Articles Generated: 12
# Success Rate: 80%
# Errors: 3 (image analysis failures)
```

---

## BACKUP & RECOVERY

### Backup Configuration

```bash
# Back up the .env file (CAREFUL - contains API keys)
cp .env .env.backup

# Store securely, NOT in git
# Use a password manager or encrypted storage
```

### Backup Output

```bash
# Create a daily backup
mkdir -p backups/$(date +%Y-%m-%d)
cp -r output/* backups/$(date +%Y-%m-%d)/

# Automated backup script
./scripts/backup_output.sh
```

### Recovery

```bash
# Restore from a backup
cp backups/2025-01-15/feed.rss output/

# Verify integrity
python scripts/verify_feed.py output/feed.rss
```

---

## UPDATING

### Update Dependencies

```bash
# Activate the virtual environment
source venv/bin/activate

# Update pip
pip install --upgrade pip

# Update all packages
pip install --upgrade -r requirements.txt

# Check for remaining outdated packages
pip list --outdated
```

### Update Code

```bash
# Pull the latest changes
git pull origin main

# Reinstall if requirements changed
pip install -r requirements.txt

# Run tests
python -m pytest tests/

# Test the pipeline
python scripts/test_pipeline.py --dry-run
```
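The integrity check referenced in the recovery steps above can be approximated with the standard library if `scripts/verify_feed.py` is unavailable in your checkout. A minimal sketch, assuming the feed is plain RSS 2.0; the real script may validate more deeply:

```python
# Minimal stand-in for scripts/verify_feed.py: checks that the feed is
# well-formed XML and has the basic RSS structure. Illustration only.
import sys
import xml.etree.ElementTree as ET


def verify_feed(path: str) -> bool:
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError as exc:
        print(f"✗ Not well-formed XML: {exc}")
        return False
    channel = root.find("channel")
    if root.tag != "rss" or channel is None:
        print("✗ Missing <rss>/<channel> structure")
        return False
    items = channel.findall("item")
    print(f"✓ Valid RSS with {len(items)} item(s)")
    return True


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "output/feed.rss"
    sys.exit(0 if verify_feed(path) else 1)
```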
---

## UNINSTALLATION

### Remove Virtual Environment

```bash
# Deactivate first
deactivate

# Remove the virtual environment
rm -rf venv/
```

### Remove Generated Files

```bash
# Remove output
rm -rf output/

# Remove logs
rm -rf logs/

# Remove backups
rm -rf backups/
```

### Remove Project

```bash
# Remove the entire project directory
cd ..
rm -rf feed-generator/
```

---

## SECURITY CHECKLIST

Before deploying:

- [ ] `.env` file is NOT committed to git
- [ ] `.env.example` has placeholder values only
- [ ] API keys are stored securely
- [ ] `.gitignore` includes `.env`, `venv/`, `output/`, `logs/`
- [ ] Log files don't contain sensitive data
- [ ] File permissions are restrictive (`chmod 600 .env`)
- [ ] Virtual environment is isolated
- [ ] Dependencies are from trusted sources

---

## PERFORMANCE BASELINE

Expected performance on standard hardware:

| Metric | Target | Acceptable Range |
|--------|--------|------------------|
| Scraping (10 articles) | 10s | 5-20s |
| Image analysis (10 images) | 30s | 20-50s |
| Article generation (10 articles) | 60s | 40-120s |
| Publishing | 1s | <5s |
| **Total pipeline (10 articles)** | **2 min** | **1-5 min** |

### Performance Testing

```bash
# Benchmark the pipeline
python scripts/benchmark.py

# Output:
# Scraping:   8.3s (15 articles)
# Analysis:   42.1s (15 images)
# Generation: 95.7s (12 articles)
# Publishing: 0.8s
# TOTAL:      146.9s
```

---

## NEXT STEPS

After successful setup:

1. **Run the first pipeline**

   ```bash
   python scripts/run.py
   ```

2. **Verify the output**

   ```bash
   ls -l output/
   head -20 output/feed.rss
   ```

3. **Set up scheduling** (cron/systemd/Task Scheduler)
4. **Configure monitoring** (logs, metrics)
5. **Read DEVELOPMENT.md** for extending functionality

---

## GETTING HELP

### Documentation

- **README.md** - Project overview
- **ARCHITECTURE.md** - Technical design
- **CLAUDE.md** - Development guidelines
- **API_INTEGRATION.md** - Node API integration

### Diagnostics

```bash
# Run the diagnostics script
python scripts/diagnose.py

# Output:
# ✓ Python version: 3.11.5
# ✓ Virtual environment: active
# ✓ Dependencies: installed
# ✓ Configuration: valid
# ✓ OpenAI API: reachable
# ✓ Node API: reachable
# ✓ Output directory: writable
# All systems operational!
```

### Common Issues

Check the troubleshooting section above, or:

```bash
# Generate a debug report
python scripts/debug_report.py > debug.txt

# Share debug.txt (remove API keys first!)
```

---

## CHECKLIST: FIRST RUN

Complete setup verification:

- [ ] Python 3.11+ installed
- [ ] Virtual environment created and activated
- [ ] Dependencies installed (`pip list` shows all packages)
- [ ] `.env` file created with API keys
- [ ] OpenAI API connection tested
- [ ] Node.js API running and tested
- [ ] Configuration validated (`Config.from_env()` works)
- [ ] Component tests pass (`pytest tests/`)
- [ ] Dry run successful (`python scripts/run.py --dry-run`)
- [ ] First real run completed
- [ ] Output files generated (`output/feed.rss` exists)
- [ ] Logs are readable (`feed_generator.log`)

**If all checks pass → You're ready to use Feed Generator!**

---

## QUICK START SUMMARY

For experienced developers:

```bash
# 1. Setup
git clone https://github.com/your-org/feed-generator.git && cd feed-generator
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Edit .env with your API keys

# 3. Test
python scripts/test_pipeline.py --dry-run

# 4. Run
python scripts/run.py

# 5. Verify
ls -l output/
```

**Time to first run: ~10 minutes**
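If `scripts/diagnose.py` is missing from your checkout, a rough equivalent of the environment checks it prints can be scripted as below. This is a sketch under stated assumptions: it only sees variables actually exported into the process environment (a plain `.env` file is not auto-loaded), and the real script tests more, such as API reachability.

```python
# Hypothetical, simplified version of the checks scripts/diagnose.py
# reports. Assumes .env variables have been exported into the shell.
import os
import sys


def diagnose() -> bool:
    ok = True
    if sys.version_info >= (3, 10):
        print(f"✓ Python version: {sys.version.split()[0]}")
    else:
        print("✗ Python 3.10+ required")
        ok = False
    if os.environ.get("VIRTUAL_ENV"):
        print("✓ Virtual environment: active")
    else:
        print("✗ Virtual environment: not active")
        ok = False
    for var in ("OPENAI_API_KEY", "NODE_API_URL", "NEWS_SOURCES"):
        if os.environ.get(var):
            print(f"✓ {var} set")
        else:
            print(f"✗ {var} missing")
            ok = False
    if os.access("output", os.W_OK):
        print("✓ Output directory: writable")
    else:
        print("✗ Output directory: missing or not writable")
        ok = False
    return ok


if __name__ == "__main__":
    sys.exit(0 if diagnose() else 1)
```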
---

## APPENDIX: EXAMPLE .env FILE

```bash
# .env.example - Copy to .env and fill in your values

# ==============================================
# REQUIRED CONFIGURATION
# ==============================================

# OpenAI API Key (get from https://platform.openai.com/api-keys)
OPENAI_API_KEY=sk-proj-your-actual-key-here

# Node.js Article Generator API URL
NODE_API_URL=http://localhost:3000

# News sources (comma-separated URLs)
NEWS_SOURCES=https://techcrunch.com/feed,https://www.theverge.com/rss/index.xml

# ==============================================
# OPTIONAL CONFIGURATION
# ==============================================

# Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO

# Maximum articles to process per source
MAX_ARTICLES=10

# HTTP timeout for scraping (seconds)
SCRAPER_TIMEOUT=10

# HTTP timeout for API calls (seconds)
API_TIMEOUT=30

# Output directory (default: ./output)
OUTPUT_DIR=./output

# ==============================================
# ADVANCED CONFIGURATION (V2)
# ==============================================

# Enable caching (true/false)
# ENABLE_CACHE=false

# Cache TTL in seconds
# CACHE_TTL=3600

# Enable parallel processing (true/false)
# ENABLE_PARALLEL=false

# Max concurrent workers
# MAX_WORKERS=5
```

---

## APPENDIX: DIRECTORY STRUCTURE

```
feed-generator/
├── .env                    # Configuration (NOT in git)
├── .env.example            # Configuration template
├── .gitignore              # Git ignore rules
├── README.md               # Project overview
├── CLAUDE.md               # Development guidelines
├── ARCHITECTURE.md         # Technical design
├── SETUP.md                # This file
├── requirements.txt        # Python dependencies
├── requirements-dev.txt    # Development dependencies
├── pyproject.toml          # Python project metadata
│
├── src/                    # Source code
│   ├── __init__.py
│   ├── config.py           # Configuration management
│   ├── exceptions.py       # Custom exceptions
│   ├── scraper.py          # News scraping
│   ├── image_analyzer.py   # Image analysis
│   ├── aggregator.py       # Content aggregation
│   ├── article_client.py   # Node API client
│   └── publisher.py        # Feed publishing
│
├── tests/                  # Test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_scraper.py
│   ├── test_image_analyzer.py
│   ├── test_aggregator.py
│   ├── test_article_client.py
│   ├── test_publisher.py
│   └── test_integration.py
│
├── scripts/                # Utility scripts
│   ├── run.py              # Main pipeline
│   ├── test_pipeline.py    # Pipeline testing
│   ├── test_openai.py      # OpenAI API test
│   ├── test_node_api.py    # Node API test
│   ├── diagnose.py         # System diagnostics
│   ├── debug_report.py     # Debug information
│   └── benchmark.py        # Performance testing
│
├── output/                 # Generated files (git-ignored)
│   ├── feed.rss
│   ├── articles.json
│   └── feed_generator.log
│
├── logs/                   # Log files (git-ignored)
│   └── *.log
│
└── backups/                # Backup files (git-ignored)
    └── YYYY-MM-DD/
```
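For orientation, here is a simplified sketch of how a `Config.from_env()` loader could map the variables above onto a dataclass. The field names are illustrative only; see `src/config.py` for the real definition.

```python
# Illustrative sketch only - the real Config lives in src/config.py and
# its fields and structure may differ.
import os
from dataclasses import dataclass


@dataclass
class ExampleConfig:
    openai_api_key: str
    node_api_url: str
    news_sources: list[str]
    max_articles: int = 10

    @classmethod
    def from_env(cls) -> "ExampleConfig":
        # Required variables raise KeyError if missing, failing fast.
        return cls(
            openai_api_key=os.environ["OPENAI_API_KEY"],
            node_api_url=os.environ["NODE_API_URL"],
            news_sources=os.environ["NEWS_SOURCES"].split(","),
            max_articles=int(os.environ.get("MAX_ARTICLES", "10")),
        )
```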
---

## APPENDIX: MINIMAL WORKING EXAMPLE

Test that everything works with minimal code:

```python
# test_minimal.py - Minimal working example
from src.config import Config
from src.scraper import NewsScraper
from src.image_analyzer import ImageAnalyzer

# Load configuration
config = Config.from_env()
print("✓ Configuration loaded")

# Test scraper
scraper = NewsScraper(config.scraper)
print("✓ Scraper initialized")

# Test analyzer
analyzer = ImageAnalyzer(config.api.openai_key)
print("✓ Analyzer initialized")

# Scrape one article
test_url = config.scraper.sources[0]
articles = scraper.scrape(test_url)
print(f"✓ Scraped {len(articles)} articles from {test_url}")

# Analyze one image (if available)
if articles and articles[0].image_url:
    analysis = analyzer.analyze(
        articles[0].image_url,
        context="Test image analysis"
    )
    print(f"✓ Image analyzed: {analysis.description[:50]}...")

print("\n✅ All basic functionality working!")
```

Run with:

```bash
python test_minimal.py
```

---

End of SETUP.md