mirror of https://github.com/infiniflow/ragflow.git synced 2025-12-08 20:42:30 +08:00

Ajay ed6a76dcc0 Add Firecrawl integration for RAGFlow (#10152)

## 🚀 Firecrawl Integration for RAGFlow

This PR implements the Firecrawl integration for RAGFlow as requested in
issue https://github.com/firecrawl/firecrawl/issues/2167

### ✅ Features Implemented

- **Data Source Integration**: Firecrawl appears as a selectable data
source in RAGFlow
- **Configuration Management**: Users can input Firecrawl API keys
through RAGFlow's interface
- **Web Scraping**: Supports single URL scraping, website crawling, and
batch processing
- **Content Processing**: Converts scraped content to RAGFlow's document
format with chunking
- **Error Handling**: Comprehensive error handling for rate limits,
failed requests, and malformed content
- **UI Components**: Complete UI schema and workflow components for
RAGFlow integration

### 📁 Files Added

- `intergrations/firecrawl/` - Complete integration package
- `intergrations/firecrawl/integration.py` - RAGFlow integration entry
point
- `intergrations/firecrawl/firecrawl_connector.py` - API communication
- `intergrations/firecrawl/firecrawl_config.py` - Configuration
management
- `intergrations/firecrawl/firecrawl_processor.py` - Content processing
- `intergrations/firecrawl/firecrawl_ui.py` - UI components
- `intergrations/firecrawl/ragflow_integration.py` - Main integration
class
- `intergrations/firecrawl/README.md` - Complete documentation
- `intergrations/firecrawl/example_usage.py` - Usage examples

### 🧪 Testing

The integration has been thoroughly tested with:
- Configuration validation
- Connection testing
- Content processing and chunking
- UI component rendering
- Error handling scenarios

### 📋 Acceptance Criteria Met

- ✅ Integration appears as selectable data source in RAGFlow's data
source options
- ✅ Users can input Firecrawl API keys through RAGFlow's configuration
interface
- ✅ Successfully scrapes content from provided URLs and imports into
RAGFlow's document store
- ✅ Handles common edge cases (rate limits, failed requests, malformed
content)
- ✅ Includes basic documentation and README updates
- ✅ Code follows RAGFlow's existing patterns and coding standards

### 🔗 Related Issue

https://github.com/firecrawl/firecrawl/issues/2167

---------

Co-authored-by: AB <aj@Ajays-MacBook-Air.local>

2025-09-19 09:58:17 +08:00

Installation Guide for Firecrawl RAGFlow Integration

This guide will help you install and configure the Firecrawl integration plugin for RAGFlow.

Prerequisites

- A running RAGFlow instance (version 0.20.5 or later)
- Python 3.8 or higher
- A Firecrawl API key (get one at firecrawl.dev)

Installation Methods

Method 1: Manual Installation

Download the plugin:

git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl/ragflow-firecrawl-integration

Install dependencies:

pip install -r plugin/firecrawl/requirements.txt

Copy plugin to RAGFlow:

# Assuming RAGFlow is installed in /opt/ragflow
cp -r plugin/firecrawl /opt/ragflow/plugin/

Restart RAGFlow:

# Restart RAGFlow services
docker compose -f /opt/ragflow/docker/docker-compose.yml restart

Method 2: Using pip (if available)

pip install ragflow-firecrawl-integration

Method 3: Development Installation

Clone the repository:

git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl/ragflow-firecrawl-integration

Install in development mode:
```
pip install -e .
```

Configuration

1. Get Firecrawl API Key

- Visit firecrawl.dev
- Sign up for a free account
- Navigate to your dashboard
- Copy your API key (it starts with `fc-`)

2. Configure in RAGFlow

Access RAGFlow UI:
- Open your browser and go to your RAGFlow instance
- Log in with your credentials

Add Firecrawl Data Source:
- Go to "Data Sources" → "Add New Source"
- Select "Firecrawl Web Scraper"
- Enter your API key
- Configure additional options if needed

Test Connection:
- Click "Test Connection" to verify your setup
- You should see a success message

Configuration Options

| Option | Description | Default | Required |
|--------|-------------|---------|----------|
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | 3 | No |
| `timeout` | Request timeout (seconds) | 30 | No |
| `rate_limit_delay` | Delay between requests (seconds) | 1.0 | No |
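
The options in the table can be mirrored in a small settings object. The following is a minimal sketch, not the plugin's actual configuration class: `FirecrawlSettings` and `validate` are hypothetical names, and the only facts taken from this guide are the option names, their defaults, and the `fc-` key prefix.

```python
from dataclasses import dataclass


@dataclass
class FirecrawlSettings:
    """Hypothetical container mirroring the options table (not the plugin's real class)."""

    api_key: str                                # required, no default
    api_url: str = "https://api.firecrawl.dev"  # default from the table
    max_retries: int = 3
    timeout: int = 30                           # seconds
    rate_limit_delay: float = 1.0               # seconds between requests

    def validate(self):
        """Return a list of problems; an empty list means the settings look sane."""
        problems = []
        if not self.api_key.startswith("fc-"):
            problems.append("api_key should start with 'fc-'")
        if not self.api_url.startswith(("http://", "https://")):
            problems.append("api_url should be an http(s) URL")
        if self.max_retries < 0:
            problems.append("max_retries must not be negative")
        if self.timeout <= 0:
            problems.append("timeout must be positive")
        if self.rate_limit_delay < 0:
            problems.append("rate_limit_delay must not be negative")
        return problems
```

Calling `FirecrawlSettings(api_key="fc-your-api-key-here").validate()` returns an empty list, while a key without the `fc-` prefix is reported before any request is made.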

Environment Variables

You can also configure the plugin using environment variables:

export FIRECRAWL_API_KEY="fc-your-api-key-here"
export FIRECRAWL_API_URL="https://api.firecrawl.dev"
export FIRECRAWL_MAX_RETRIES="3"
export FIRECRAWL_TIMEOUT="30"
export FIRECRAWL_RATE_LIMIT_DELAY="1.0"
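
To illustrate how these variables might be consumed, here is a hedged sketch that reads them into a plain dictionary and falls back to the documented defaults. `settings_from_env` is a hypothetical helper written for this guide; how the plugin itself reads its environment is not shown here.

```python
import os

# Defaults taken from the configuration options in this guide (kept as strings,
# because environment variables are always strings).
DEFAULTS = {
    "api_url": "https://api.firecrawl.dev",
    "max_retries": "3",
    "timeout": "30",
    "rate_limit_delay": "1.0",
}


def settings_from_env(env=None):
    """Build a settings dict from FIRECRAWL_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("FIRECRAWL_API_KEY")
    if not api_key:
        raise ValueError("FIRECRAWL_API_KEY is required")
    # FIRECRAWL_<OPTION> overrides the default for each optional setting.
    raw = {
        name: env.get("FIRECRAWL_" + name.upper(), default)
        for name, default in DEFAULTS.items()
    }
    return {
        "api_key": api_key,
        "api_url": raw["api_url"],
        "max_retries": int(raw["max_retries"]),
        "timeout": int(raw["timeout"]),
        "rate_limit_delay": float(raw["rate_limit_delay"]),
    }
```

Passing a dictionary instead of relying on `os.environ` makes the helper easy to try out without touching the shell environment.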

Verification

1. Check Plugin Installation

# Check if the plugin directory exists
ls -la /opt/ragflow/plugin/firecrawl/

# Should list at least:
# __init__.py
# firecrawl_connector.py
# firecrawl_config.py
# firecrawl_processor.py
# firecrawl_ui.py
# ragflow_integration.py
# requirements.txt
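
The same check can be scripted. This is a minimal sketch assuming the file names listed above and the `/opt/ragflow` install location used in this guide; `missing_plugin_files` is a hypothetical helper, not part of the plugin.

```python
from pathlib import Path

# File names copied from the expected directory listing in this guide.
EXPECTED_FILES = [
    "__init__.py",
    "firecrawl_connector.py",
    "firecrawl_config.py",
    "firecrawl_processor.py",
    "firecrawl_ui.py",
    "ragflow_integration.py",
    "requirements.txt",
]


def missing_plugin_files(plugin_dir):
    """Return the expected file names that are absent from plugin_dir."""
    root = Path(plugin_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_plugin_files("/opt/ragflow/plugin/firecrawl")
    print("missing files:", missing if missing else "none")
```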

2. Test the Integration

# Run the example script
cd /opt/ragflow/plugin/firecrawl/
python example_usage.py

3. Check RAGFlow Logs

# Check RAGFlow server logs
docker logs ragflow-server

# Look for messages like:
# "Firecrawl plugin loaded successfully"
# "Firecrawl data source registered"

Troubleshooting

Common Issues

Plugin not appearing in RAGFlow:
- Check if the plugin directory is in the correct location
- Restart RAGFlow services
- Check RAGFlow logs for errors

API Key Invalid:
- Ensure your API key starts with `fc-`
- Verify the key is active in your Firecrawl dashboard
- Check for typos in the configuration

Connection Timeout:
- Increase the `timeout` value in configuration
- Check your network connection
- Verify the API URL is correct

Rate Limiting:
- Increase the `rate_limit_delay` value
- Reduce the number of concurrent requests
- Check your Firecrawl usage limits
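
To make the interplay of `max_retries` and `rate_limit_delay` concrete, here is a hedged sketch of a retry loop. It is not the plugin's implementation: `RateLimited` and `call_with_retries` are hypothetical names, and doubling the pause after each failure (exponential backoff) is a design choice made for this example rather than documented plugin behavior.

```python
import time


class RateLimited(Exception):
    """Hypothetical marker for a rate-limit response (for example HTTP 429)."""


def call_with_retries(request, max_retries=3, rate_limit_delay=1.0, sleep=time.sleep):
    """Call request() and retry with a growing pause when it raises RateLimited.

    Makes one initial attempt plus up to max_retries retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            # Pause grows after each failure: delay, 2*delay, 4*delay, ...
            sleep(rate_limit_delay * (2 ** attempt))
```

The `sleep` parameter exists only so the pauses can be observed or skipped when experimenting; in normal use the default `time.sleep` applies.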

Debug Mode

Enable debug logging to see detailed information:

import logging
logging.basicConfig(level=logging.DEBUG)

Check Dependencies

# Verify all dependencies are installed
pip list | grep -E "(aiohttp|pydantic|requests)"

# Should list each package at or above these minimum versions:
# aiohttp 3.8.0
# pydantic 2.0.0
# requests 2.28.0
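
The version comparison can also be done from Python with the standard library. This is a minimal sketch assuming the three minimum versions given in this guide; `check_dependencies` and its helpers are hypothetical names, and the simple numeric comparison ignores pre-release suffixes such as `rc1`.

```python
from importlib import metadata

# Minimum versions taken from this guide.
MINIMUMS = {"aiohttp": "3.8.0", "pydantic": "2.0.0", "requests": "2.28.0"}


def version_tuple(version):
    """Turn '3.8.0' into (3, 8, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:  # pad so that '2.28' compares equal to '2.28.0'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_dependencies(minimums=None):
    """Return {package: problem} for every package that is missing or too old."""
    minimums = MINIMUMS if minimums is None else minimums
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = "found " + installed + ", need >= " + minimum
    return problems
```

An empty dictionary from `check_dependencies()` means all three packages are present at a sufficient version.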

Uninstallation

To remove the plugin:

Remove plugin directory:
```
rm -rf /opt/ragflow/plugin/firecrawl/
```