ragflow/intergrations/firecrawl/INSTALLATION.md
Ajay ed6a76dcc0 Add Firecrawl integration for RAGFlow (#10152)
## 🚀 Firecrawl Integration for RAGFlow

This PR implements the Firecrawl integration for RAGFlow as requested in
issue https://github.com/firecrawl/firecrawl/issues/2167

### Features Implemented

- **Data Source Integration**: Firecrawl appears as a selectable data
source in RAGFlow
- **Configuration Management**: Users can input Firecrawl API keys
through RAGFlow's interface
- **Web Scraping**: Supports single URL scraping, website crawling, and
batch processing
- **Content Processing**: Converts scraped content to RAGFlow's document
format with chunking
- **Error Handling**: Comprehensive error handling for rate limits,
failed requests, and malformed content
- **UI Components**: Complete UI schema and workflow components for
RAGFlow integration

### 📁 Files Added

- `intergrations/firecrawl/` - Complete integration package
- `intergrations/firecrawl/integration.py` - RAGFlow integration entry
point
- `intergrations/firecrawl/firecrawl_connector.py` - API communication
- `intergrations/firecrawl/firecrawl_config.py` - Configuration
management
- `intergrations/firecrawl/firecrawl_processor.py` - Content processing
- `intergrations/firecrawl/firecrawl_ui.py` - UI components
- `intergrations/firecrawl/ragflow_integration.py` - Main integration
class
- `intergrations/firecrawl/README.md` - Complete documentation
- `intergrations/firecrawl/example_usage.py` - Usage examples

### 🧪 Testing

The integration has been thoroughly tested with:
- Configuration validation
- Connection testing
- Content processing and chunking
- UI component rendering
- Error handling scenarios

### 📋 Acceptance Criteria Met

- ✅ Integration appears as a selectable data source in RAGFlow's data source options
- ✅ Users can input Firecrawl API keys through RAGFlow's configuration interface
- ✅ Successfully scrapes content from provided URLs and imports it into RAGFlow's document store
- ✅ Handles common edge cases (rate limits, failed requests, malformed content)
- ✅ Includes basic documentation and README updates
- ✅ Code follows RAGFlow's existing patterns and coding standards

### Related Issue

https://github.com/firecrawl/firecrawl/issues/2167

---------

Co-authored-by: AB <aj@Ajays-MacBook-Air.local>
2025-09-19 09:58:17 +08:00


Installation Guide for Firecrawl RAGFlow Integration

This guide will help you install and configure the Firecrawl integration plugin for RAGFlow.

Prerequisites

  • RAGFlow instance running (version 0.20.5 or later)
  • Python 3.8 or higher
  • Firecrawl API key (get one at firecrawl.dev)

Installation Methods

Method 1: Manual Installation

  1. Download the plugin:

    git clone https://github.com/firecrawl/firecrawl.git
    cd firecrawl/ragflow-firecrawl-integration
    
  2. Install dependencies:

    pip install -r plugin/firecrawl/requirements.txt
    
  3. Copy plugin to RAGFlow:

    # Assuming RAGFlow is installed in /opt/ragflow
    cp -r plugin/firecrawl /opt/ragflow/plugin/
    
  4. Restart RAGFlow:

    # Restart RAGFlow services
    docker compose -f /opt/ragflow/docker/docker-compose.yml restart
    

Method 2: Using pip (if available)

pip install ragflow-firecrawl-integration

Method 3: Development Installation

  1. Clone the repository:

    git clone https://github.com/firecrawl/firecrawl.git
    cd firecrawl/ragflow-firecrawl-integration
    
  2. Install in development mode:

    pip install -e .
    

Configuration

1. Get Firecrawl API Key

  1. Visit firecrawl.dev
  2. Sign up for a free account
  3. Navigate to your dashboard
  4. Copy your API key (starts with fc-)

2. Configure in RAGFlow

  1. Access RAGFlow UI:

    • Open your browser and go to your RAGFlow instance
    • Log in with your credentials
  2. Add Firecrawl Data Source:

    • Go to "Data Sources" → "Add New Source"
    • Select "Firecrawl Web Scraper"
    • Enter your API key
    • Configure additional options if needed
  3. Test Connection:

    • Click "Test Connection" to verify your setup
    • You should see a success message
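If you want to sanity-check a key before entering it in the UI, the following offline sketch captures the two checks this guide mentions: the `fc-` prefix convention and the bearer-token header shape. The helper names are hypothetical, not part of the plugin's API.

```python
# Offline sanity checks for a Firecrawl API key. Helper names are
# illustrative; only the fc- prefix convention comes from this guide.
def looks_like_firecrawl_key(api_key: str) -> bool:
    """Cheap format check before making any network call."""
    return api_key.startswith("fc-") and len(api_key) > 3

def auth_headers(api_key: str) -> dict:
    """Build the Authorization header used for API requests."""
    return {"Authorization": f"Bearer {api_key}"}

print(looks_like_firecrawl_key("fc-example-key"))   # True
print(looks_like_firecrawl_key("sk-wrong-prefix"))  # False
```

A check like this only catches typos; "Test Connection" in the UI remains the authoritative verification.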

Configuration Options

| Option | Description | Default | Required |
| --- | --- | --- | --- |
| `api_key` | Your Firecrawl API key | - | Yes |
| `api_url` | Firecrawl API endpoint | `https://api.firecrawl.dev` | No |
| `max_retries` | Maximum retry attempts | `3` | No |
| `timeout` | Request timeout (seconds) | `30` | No |
| `rate_limit_delay` | Delay between requests (seconds) | `1.0` | No |
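As a rough illustration, the options above can be collected into a single config mapping. The field names and defaults mirror the table; the merging helper itself is a sketch, not the plugin's actual code.

```python
# Sketch of a config mapping built from the documented options.
# Defaults come from the table above; build_config is illustrative.
DEFAULTS = {
    "api_url": "https://api.firecrawl.dev",
    "max_retries": 3,
    "timeout": 30,
    "rate_limit_delay": 1.0,
}

def build_config(api_key: str, **overrides) -> dict:
    """Merge user overrides onto the defaults; api_key is mandatory."""
    if not api_key:
        raise ValueError("api_key is required")
    cfg = {**DEFAULTS, "api_key": api_key}
    cfg.update(overrides)
    return cfg

cfg = build_config("fc-your-key", timeout=60)
print(cfg["timeout"])      # 60
print(cfg["max_retries"])  # 3
```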

Environment Variables

You can also configure the plugin using environment variables:

export FIRECRAWL_API_KEY="fc-your-api-key-here"
export FIRECRAWL_API_URL="https://api.firecrawl.dev"
export FIRECRAWL_MAX_RETRIES="3"
export FIRECRAWL_TIMEOUT="30"
export FIRECRAWL_RATE_LIMIT_DELAY="1.0"
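Assuming the plugin reads these variables at startup, loading them might look like the sketch below. The variable names come from the list above; the loader function is hypothetical. Note that environment values arrive as strings, so the numeric options need casting.

```python
import os

# Read the documented environment variables, falling back to the
# defaults from the options table. This loader is illustrative only.
def load_env_config(env=os.environ) -> dict:
    return {
        "api_key": env.get("FIRECRAWL_API_KEY", ""),
        "api_url": env.get("FIRECRAWL_API_URL", "https://api.firecrawl.dev"),
        "max_retries": int(env.get("FIRECRAWL_MAX_RETRIES", "3")),
        "timeout": int(env.get("FIRECRAWL_TIMEOUT", "30")),
        "rate_limit_delay": float(env.get("FIRECRAWL_RATE_LIMIT_DELAY", "1.0")),
    }

cfg = load_env_config({"FIRECRAWL_API_KEY": "fc-demo", "FIRECRAWL_TIMEOUT": "60"})
print(cfg["timeout"])  # 60
```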

Verification

1. Check Plugin Installation

# Check if the plugin directory exists
ls -la /opt/ragflow/plugin/firecrawl/

# Should show:
# __init__.py
# firecrawl_connector.py
# firecrawl_config.py
# firecrawl_processor.py
# firecrawl_ui.py
# ragflow_integration.py
# requirements.txt

2. Test the Integration

# Run the example script
cd /opt/ragflow/plugin/firecrawl/
python example_usage.py

3. Check RAGFlow Logs

# Check RAGFlow server logs
docker logs ragflow-server

# Look for messages like:
# "Firecrawl plugin loaded successfully"
# "Firecrawl data source registered"

Troubleshooting

Common Issues

  1. Plugin not appearing in RAGFlow:

    • Check if the plugin directory is in the correct location
    • Restart RAGFlow services
    • Check RAGFlow logs for errors
  2. API Key Invalid:

    • Ensure your API key starts with fc-
    • Verify the key is active in your Firecrawl dashboard
    • Check for typos in the configuration
  3. Connection Timeout:

    • Increase the timeout value in configuration
    • Check your network connection
    • Verify the API URL is correct
  4. Rate Limiting:

    • Increase the rate_limit_delay value
    • Reduce the number of concurrent requests
    • Check your Firecrawl usage limits
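The rate-limiting advice above amounts to client-side backoff. A generic sketch of that idea, using the `max_retries` and `rate_limit_delay` option names (this is not the plugin's own retry code):

```python
import time

# Generic retry-with-backoff sketch illustrating how max_retries and
# rate_limit_delay interact; the plugin's actual logic may differ.
def with_retries(fn, max_retries=3, rate_limit_delay=1.0):
    """Call fn(), sleeping between attempts; re-raise after the last failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(rate_limit_delay * (2 ** attempt))  # exponential backoff

# Simulate a request that fails twice with a rate-limit error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

print(with_retries(flaky, max_retries=3, rate_limit_delay=0.01))  # ok
```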

Debug Mode

Enable debug logging to see detailed information:

import logging
logging.basicConfig(level=logging.DEBUG)

Check Dependencies

# Verify all dependencies are installed
pip list | grep -E "(aiohttp|pydantic|requests)"

# Should list each package with an installed version meeting:
# aiohttp  >= 3.8.0
# pydantic >= 2.0.0
# requests >= 2.28.0

Uninstallation

To remove the plugin:

  1. Remove plugin directory:

    rm -rf /opt/ragflow/plugin/firecrawl/
    
  2. Restart RAGFlow:

    docker compose -f /opt/ragflow/docker/docker-compose.yml restart
    
  3. Remove dependencies (optional):

    pip uninstall ragflow-firecrawl-integration
    

Support

If you encounter issues:

  1. Check the troubleshooting section
  2. Review RAGFlow logs for error messages
  3. Verify your Firecrawl API key and configuration
  4. Check the Firecrawl documentation
  5. Open an issue in the Firecrawl repository

Next Steps

After successful installation:

  1. Read the README.md for usage examples
  2. Try scraping a simple URL to test the integration
  3. Explore the different scraping options (single URL, crawl, batch)
  4. Configure your RAGFlow workflows to use the scraped content