mirror of
https://github.com/infiniflow/ragflow.git
synced 2025-12-08 20:42:30 +08:00
## 🚀 Firecrawl Integration for RAGFlow This PR implements the Firecrawl integration for RAGFlow as requested in issue https://github.com/firecrawl/firecrawl/issues/2167 ### ✅ Features Implemented - **Data Source Integration**: Firecrawl appears as a selectable data source in RAGFlow - **Configuration Management**: Users can input Firecrawl API keys through RAGFlow's interface - **Web Scraping**: Supports single URL scraping, website crawling, and batch processing - **Content Processing**: Converts scraped content to RAGFlow's document format with chunking - **Error Handling**: Comprehensive error handling for rate limits, failed requests, and malformed content - **UI Components**: Complete UI schema and workflow components for RAGFlow integration ### 📁 Files Added - `intergrations/firecrawl/` - Complete integration package - `intergrations/firecrawl/integration.py` - RAGFlow integration entry point - `intergrations/firecrawl/firecrawl_connector.py` - API communication - `intergrations/firecrawl/firecrawl_config.py` - Configuration management - `intergrations/firecrawl/firecrawl_processor.py` - Content processing - `intergrations/firecrawl/firecrawl_ui.py` - UI components - `intergrations/firecrawl/ragflow_integration.py` - Main integration class - `intergrations/firecrawl/README.md` - Complete documentation - `intergrations/firecrawl/example_usage.py` - Usage examples ### 🧪 Testing The integration has been thoroughly tested with: - Configuration validation - Connection testing - Content processing and chunking - UI component rendering - Error handling scenarios ### 📋 Acceptance Criteria Met - ✅ Integration appears as selectable data source in RAGFlow's data source options - ✅ Users can input Firecrawl API keys through RAGFlow's configuration interface - ✅ Successfully scrapes content from provided URLs and imports into RAGFlow's document store - ✅ Handles common edge cases (rate limits, failed requests, malformed content) - ✅ Includes basic documentation and README updates - ✅ Code follows RAGFlow's existing patterns and coding standards ### �� Related Issue https://github.com/firecrawl/firecrawl/issues/2167 --------- Co-authored-by: AB <aj@Ajays-MacBook-Air.local>
223 lines
5.1 KiB
Markdown
223 lines
5.1 KiB
Markdown
# Installation Guide for Firecrawl RAGFlow Integration
|
|
|
|
This guide will help you install and configure the Firecrawl integration plugin for RAGFlow.
|
|
|
|
## Prerequisites
|
|
|
|
- RAGFlow instance running (version 0.20.5 or later)
|
|
- Python 3.8 or higher
|
|
- Firecrawl API key (get one at [firecrawl.dev](https://firecrawl.dev))
|
|
|
|
## Installation Methods
|
|
|
|
### Method 1: Manual Installation
|
|
|
|
1. **Download the plugin**:
|
|
```bash
|
|
git clone https://github.com/firecrawl/firecrawl.git
|
|
cd firecrawl/ragflow-firecrawl-integration
|
|
```
|
|
|
|
2. **Install dependencies**:
|
|
```bash
|
|
pip install -r plugin/firecrawl/requirements.txt
|
|
```
|
|
|
|
3. **Copy plugin to RAGFlow**:
|
|
```bash
|
|
# Assuming RAGFlow is installed in /opt/ragflow
|
|
cp -r plugin/firecrawl /opt/ragflow/plugin/
|
|
```
|
|
|
|
4. **Restart RAGFlow**:
|
|
```bash
|
|
# Restart RAGFlow services
|
|
docker compose -f /opt/ragflow/docker/docker-compose.yml restart
|
|
```
|
|
|
|
### Method 2: Using pip (if available)
|
|
|
|
```bash
|
|
pip install ragflow-firecrawl-integration
|
|
```
|
|
|
|
### Method 3: Development Installation
|
|
|
|
1. **Clone the repository**:
|
|
```bash
|
|
git clone https://github.com/firecrawl/firecrawl.git
|
|
cd firecrawl/ragflow-firecrawl-integration
|
|
```
|
|
|
|
2. **Install in development mode**:
|
|
```bash
|
|
pip install -e .
|
|
```
|
|
|
|
## Configuration
|
|
|
|
### 1. Get Firecrawl API Key
|
|
|
|
1. Visit [firecrawl.dev](https://firecrawl.dev)
|
|
2. Sign up for a free account
|
|
3. Navigate to your dashboard
|
|
4. Copy your API key (starts with `fc-`)
|
|
|
|
### 2. Configure in RAGFlow
|
|
|
|
1. **Access RAGFlow UI**:
|
|
- Open your browser and go to your RAGFlow instance
|
|
- Log in with your credentials
|
|
|
|
2. **Add Firecrawl Data Source**:
|
|
- Go to "Data Sources" → "Add New Source"
|
|
- Select "Firecrawl Web Scraper"
|
|
- Enter your API key
|
|
- Configure additional options if needed
|
|
|
|
3. **Test Connection**:
|
|
- Click "Test Connection" to verify your setup
|
|
- You should see a success message
|
|
|
|
## Configuration Options
|
|
|
|
| Option | Description | Default | Required |
|
|
|--------|-------------|---------|----------|
|
|
| `api_key` | Your Firecrawl API key | - | Yes |
|
|
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
|
|
| `max_retries` | Maximum retry attempts | 3 | No |
|
|
| `timeout` | Request timeout (seconds) | 30 | No |
|
|
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
|
|
|
|
## Environment Variables
|
|
|
|
You can also configure the plugin using environment variables:
|
|
|
|
```bash
|
|
export FIRECRAWL_API_KEY="fc-your-api-key-here"
|
|
export FIRECRAWL_API_URL="https://api.firecrawl.dev"
|
|
export FIRECRAWL_MAX_RETRIES="3"
|
|
export FIRECRAWL_TIMEOUT="30"
|
|
export FIRECRAWL_RATE_LIMIT_DELAY="1.0"
|
|
```
|
|
|
|
## Verification
|
|
|
|
### 1. Check Plugin Installation
|
|
|
|
```bash
|
|
# Check if the plugin directory exists
|
|
ls -la /opt/ragflow/plugin/firecrawl/
|
|
|
|
# Should show:
|
|
# __init__.py
|
|
# firecrawl_connector.py
|
|
# firecrawl_config.py
|
|
# firecrawl_processor.py
|
|
# firecrawl_ui.py
|
|
# ragflow_integration.py
|
|
# requirements.txt
|
|
```
|
|
|
|
### 2. Test the Integration
|
|
|
|
```bash
|
|
# Run the example script
|
|
cd /opt/ragflow/plugin/firecrawl/
|
|
python example_usage.py
|
|
```
|
|
|
|
### 3. Check RAGFlow Logs
|
|
|
|
```bash
|
|
# Check RAGFlow server logs
|
|
docker logs ragflow-server
|
|
|
|
# Look for messages like:
|
|
# "Firecrawl plugin loaded successfully"
|
|
# "Firecrawl data source registered"
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Common Issues
|
|
|
|
1. **Plugin not appearing in RAGFlow**:
|
|
- Check if the plugin directory is in the correct location
|
|
- Restart RAGFlow services
|
|
- Check RAGFlow logs for errors
|
|
|
|
2. **API Key Invalid**:
|
|
- Ensure your API key starts with `fc-`
|
|
- Verify the key is active in your Firecrawl dashboard
|
|
- Check for typos in the configuration
|
|
|
|
3. **Connection Timeout**:
|
|
- Increase the timeout value in configuration
|
|
- Check your network connection
|
|
- Verify the API URL is correct
|
|
|
|
4. **Rate Limiting**:
|
|
- Increase the `rate_limit_delay` value
|
|
- Reduce the number of concurrent requests
|
|
- Check your Firecrawl usage limits
|
|
|
|
### Debug Mode
|
|
|
|
Enable debug logging to see detailed information:
|
|
|
|
```python
|
|
import logging
|
|
logging.basicConfig(level=logging.DEBUG)
|
|
```
|
|
|
|
### Check Dependencies
|
|
|
|
```bash
|
|
# Verify all dependencies are installed
|
|
pip list | grep -E "(aiohttp|pydantic|requests)"
|
|
|
|
# Should show:
|
|
# aiohttp>=3.8.0
|
|
# pydantic>=2.0.0
|
|
# requests>=2.28.0
|
|
```
|
|
|
|
## Uninstallation
|
|
|
|
To remove the plugin:
|
|
|
|
1. **Remove plugin directory**:
|
|
```bash
|
|
rm -rf /opt/ragflow/plugin/firecrawl/
|
|
```
|
|
|
|
2. **Restart RAGFlow**:
|
|
```bash
|
|
docker compose -f /opt/ragflow/docker/docker-compose.yml restart
|
|
```
|
|
|
|
3. **Remove dependencies** (optional):
|
|
```bash
|
|
pip uninstall ragflow-firecrawl-integration
|
|
```
|
|
|
|
## Support
|
|
|
|
If you encounter issues:
|
|
|
|
1. Check the [troubleshooting section](#troubleshooting)
|
|
2. Review RAGFlow logs for error messages
|
|
3. Verify your Firecrawl API key and configuration
|
|
4. Check the [Firecrawl documentation](https://docs.firecrawl.dev)
|
|
5. Open an issue in the [Firecrawl repository](https://github.com/firecrawl/firecrawl/issues)
|
|
|
|
## Next Steps
|
|
|
|
After successful installation:
|
|
|
|
1. Read the [README.md](README.md) for usage examples
|
|
2. Try scraping a simple URL to test the integration
|
|
3. Explore the different scraping options (single URL, crawl, batch)
|
|
4. Configure your RAGFlow workflows to use the scraped content
|