Installation Guide for Firecrawl RAGFlow Integration
This guide explains how to install and configure the Firecrawl integration plugin for RAGFlow.
Prerequisites
- RAGFlow instance running (version 0.20.5 or later)
- Python 3.8 or higher
- Firecrawl API key (get one at firecrawl.dev)
Installation Methods
Method 1: Manual Installation
- Download the plugin:
    git clone https://github.com/firecrawl/firecrawl.git
    cd firecrawl/ragflow-firecrawl-integration
- Install dependencies:
    pip install -r plugin/firecrawl/requirements.txt
- Copy plugin to RAGFlow:
    # Assuming RAGFlow is installed in /opt/ragflow
    cp -r plugin/firecrawl /opt/ragflow/plugin/
- Restart RAGFlow:
    # Restart RAGFlow services
    docker compose -f /opt/ragflow/docker/docker-compose.yml restart
Method 2: Using pip (if available)
pip install ragflow-firecrawl-integration
Method 3: Development Installation
- Clone the repository:
    git clone https://github.com/firecrawl/firecrawl.git
    cd firecrawl/ragflow-firecrawl-integration
- Install in development mode:
    pip install -e .
Configuration
1. Get Firecrawl API Key
- Visit firecrawl.dev
- Sign up for a free account
- Navigate to your dashboard
- Copy your API key (starts with fc-)
2. Configure in RAGFlow
- Access RAGFlow UI:
  - Open your browser and go to your RAGFlow instance
  - Log in with your credentials
- Add Firecrawl Data Source:
  - Go to "Data Sources" → "Add New Source"
  - Select "Firecrawl Web Scraper"
  - Enter your API key
  - Configure additional options if needed
- Test Connection:
  - Click "Test Connection" to verify your setup
  - You should see a success message
Configuration Options
| Option | Description | Default | Required |
|---|---|---|---|
| api_key | Your Firecrawl API key | - | Yes |
| api_url | Firecrawl API endpoint | https://api.firecrawl.dev | No |
| max_retries | Maximum retry attempts | 3 | No |
| timeout | Request timeout (seconds) | 30 | No |
| rate_limit_delay | Delay between requests (seconds) | 1.0 | No |
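The options above can be sanity-checked before they are entered in the UI. The following is a minimal sketch only: `FirecrawlSettings` and its `problems()` method are hypothetical names made up for illustration and are not the plugin's actual configuration class. The defaults are copied from the table, and the `fc-` prefix check follows the API key format described earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FirecrawlSettings:
    """Hypothetical container mirroring the options table (not the plugin's real class)."""

    api_key: str                                 # required, no default
    api_url: str = "https://api.firecrawl.dev"   # default from the table
    max_retries: int = 3
    timeout: float = 30.0                        # seconds
    rate_limit_delay: float = 1.0                # seconds

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the values look sane."""
        found = []
        if not self.api_key.startswith("fc-"):
            found.append("api_key should start with 'fc-'")
        if not self.api_url.startswith(("http://", "https://")):
            found.append("api_url should be an http(s) URL")
        if self.max_retries < 0:
            found.append("max_retries must not be negative")
        if self.timeout <= 0:
            found.append("timeout must be positive")
        if self.rate_limit_delay < 0:
            found.append("rate_limit_delay must not be negative")
        return found
```

Calling `FirecrawlSettings(api_key="fc-your-api-key-here").problems()` returns an empty list when only the required option is supplied and the defaults are kept.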
Environment Variables
You can also configure the plugin using environment variables:
export FIRECRAWL_API_KEY="fc-your-api-key-here"
export FIRECRAWL_API_URL="https://api.firecrawl.dev"
export FIRECRAWL_MAX_RETRIES="3"
export FIRECRAWL_TIMEOUT="30"
export FIRECRAWL_RATE_LIMIT_DELAY="1.0"
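How the plugin itself reads these variables is not shown in this guide. As a rough sketch of the idea, the snippet below merges the environment with the documented defaults and converts the numeric values. The function name `settings_from_env` and the returned dictionary keys are assumptions for illustration, not the plugin's API.

```python
import os
from typing import Dict, Mapping, Union

# Defaults copied from the configuration options table.
DEFAULTS = {
    "FIRECRAWL_API_URL": "https://api.firecrawl.dev",
    "FIRECRAWL_MAX_RETRIES": "3",
    "FIRECRAWL_TIMEOUT": "30",
    "FIRECRAWL_RATE_LIMIT_DELAY": "1.0",
}


def settings_from_env(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[str, int, float]]:
    """Build a settings dict from environment variables (hypothetical helper)."""
    api_key = env.get("FIRECRAWL_API_KEY", "")
    if not api_key:
        # The API key is the only required option, so fail early without it.
        raise ValueError("FIRECRAWL_API_KEY is required")

    # Values set in the environment override the documented defaults.
    merged = {**DEFAULTS, **{k: v for k, v in env.items() if k in DEFAULTS}}
    return {
        "api_key": api_key,
        "api_url": merged["FIRECRAWL_API_URL"],
        "max_retries": int(merged["FIRECRAWL_MAX_RETRIES"]),
        "timeout": float(merged["FIRECRAWL_TIMEOUT"]),
        "rate_limit_delay": float(merged["FIRECRAWL_RATE_LIMIT_DELAY"]),
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.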
Verification
1. Check Plugin Installation
# Check if the plugin directory exists
ls -la /opt/ragflow/plugin/firecrawl/
# Should show:
# __init__.py
# firecrawl_connector.py
# firecrawl_config.py
# firecrawl_processor.py
# firecrawl_ui.py
# ragflow_integration.py
# requirements.txt
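The same check can be scripted so that missing files are reported by name. This is a small sketch, assuming the file list shown above and the /opt/ragflow install path used throughout this guide; `missing_plugin_files` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List

# File names taken from the expected directory listing above.
EXPECTED_FILES = [
    "__init__.py",
    "firecrawl_connector.py",
    "firecrawl_config.py",
    "firecrawl_processor.py",
    "firecrawl_ui.py",
    "ragflow_integration.py",
    "requirements.txt",
]


def missing_plugin_files(plugin_dir: str) -> List[str]:
    """Return the expected files that are absent from plugin_dir."""
    root = Path(plugin_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Path assumes RAGFlow is installed in /opt/ragflow, as in the steps above.
    missing = missing_plugin_files("/opt/ragflow/plugin/firecrawl")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All expected files present")
```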
2. Test the Integration
# Run the example script
cd /opt/ragflow/plugin/firecrawl/
python example_usage.py
3. Check RAGFlow Logs
# Check RAGFlow server logs
docker logs ragflow-server
# Look for messages like:
# "Firecrawl plugin loaded successfully"
# "Firecrawl data source registered"
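Scanning a long log by eye is error-prone, so the search can be automated. The sketch below assumes the two messages quoted above appear verbatim in the log output; `missing_log_messages` is a hypothetical helper, and the exact wording in a given RAGFlow version may differ.

```python
from typing import Iterable, List

# Messages quoted in the log check above.
EXPECTED_MESSAGES = [
    "Firecrawl plugin loaded successfully",
    "Firecrawl data source registered",
]


def missing_log_messages(log_lines: Iterable[str]) -> List[str]:
    """Return the expected messages that never appear in log_lines."""
    remaining = list(EXPECTED_MESSAGES)
    for line in log_lines:
        # Drop every message that is found as a substring of this line.
        remaining = [message for message in remaining if message not in line]
        if not remaining:
            break  # everything was found, no need to read further
    return remaining
```

One way to use it is to save the function in a script that passes `sys.stdin` as `log_lines`, then pipe the output of `docker logs ragflow-server 2>&1` into that script.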
Troubleshooting
Common Issues
- Plugin not appearing in RAGFlow:
  - Check if the plugin directory is in the correct location
  - Restart RAGFlow services
  - Check RAGFlow logs for errors
- API Key Invalid:
  - Ensure your API key starts with fc-
  - Verify the key is active in your Firecrawl dashboard
  - Check for typos in the configuration
- Connection Timeout:
  - Increase the timeout value in configuration
  - Check your network connection
  - Verify the API URL is correct
- Rate Limiting:
  - Increase the rate_limit_delay value
  - Reduce the number of concurrent requests
  - Check your Firecrawl usage limits
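To illustrate how max_retries and rate_limit_delay interact when requests are rate limited, here is a minimal sketch of a retry loop. It is not the plugin's implementation: `RateLimited` and `call_with_retries` are hypothetical names, and doubling the delay after each failed attempt (exponential backoff) is a choice made for this example rather than documented plugin behaviour.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a caller raises when the API answers HTTP 429."""


def call_with_retries(
    request: Callable[[], T],
    max_retries: int = 3,
    rate_limit_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying up to max_retries times when it is rate limited."""
    attempt = 0
    while True:
        try:
            return request()
        except RateLimited:
            if attempt >= max_retries:
                raise  # retries exhausted, surface the error to the caller
            # Wait rate_limit_delay, then 2x, 4x, ... before the next attempt.
            sleep(rate_limit_delay * (2 ** attempt))
            attempt += 1
```

With the default values, a request that keeps failing is attempted four times in total, with waits of 1, 2, and 4 seconds in between. A larger rate_limit_delay stretches every wait proportionally.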
Debug Mode
Enable debug logging to see detailed information:
import logging
logging.basicConfig(level=logging.DEBUG)
Check Dependencies
# Verify all dependencies are installed
pip list | grep -E "(aiohttp|pydantic|requests)"
# Should list each package with a version that meets these minimums:
# aiohttp   3.8.0 or later
# pydantic  2.0.0 or later
# requests  2.28.0 or later
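The minimum versions can also be checked from Python using the standard library's `importlib.metadata`. This is a sketch under stated assumptions: the minimums are the ones listed above, `parse_version` is a deliberately simple hypothetical helper that only reads the leading numeric components, and pre-release suffixes are ignored.

```python
from importlib import metadata
from typing import Dict, Mapping, Tuple

# Minimum versions taken from the dependency list above.
MINIMUM_VERSIONS = {
    "aiohttp": (3, 8, 0),
    "pydantic": (2, 0, 0),
    "requests": (2, 28, 0),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '2.28.1' -> (2, 28, 1)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '3.8' compares equal to (3, 8, 0)
    return tuple(parts)


def dependency_status(
    minimums: Mapping[str, Tuple[int, ...]] = MINIMUM_VERSIONS,
) -> Dict[str, str]:
    """Report each package as missing, ok, or too old."""
    status = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "missing"
            continue
        verdict = "ok" if parse_version(installed) >= minimum else "too old"
        status[name] = f"{installed} ({verdict})"
    return status


if __name__ == "__main__":
    for package, result in dependency_status().items():
        print(f"{package}: {result}")
```

Run the script with the same Python interpreter that RAGFlow uses, otherwise it reports on a different set of installed packages.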
Uninstallation
To remove the plugin:
- Remove plugin directory:
    rm -rf /opt/ragflow/plugin/firecrawl/
- Restart RAGFlow:
    docker compose -f /opt/ragflow/docker/docker-compose.yml restart
- Remove dependencies (optional):
    pip uninstall ragflow-firecrawl-integration
Support
If you encounter issues:
- Check the troubleshooting section
- Review RAGFlow logs for error messages
- Verify your Firecrawl API key and configuration
- Check the Firecrawl documentation
- Open an issue in the Firecrawl repository
Next Steps
After successful installation:
- Read the README.md for usage examples
- Try scraping a simple URL to test the integration
- Explore the different scraping options (single URL, crawl, batch)
- Configure your RAGFlow workflows to use the scraped content