mirror of
https://github.com/infiniflow/ragflow.git
synced 2025-12-08 20:42:30 +08:00
## 🚀 Firecrawl Integration for RAGFlow

This PR implements the Firecrawl integration for RAGFlow as requested in issue https://github.com/firecrawl/firecrawl/issues/2167

### ✅ Features Implemented

- **Data Source Integration**: Firecrawl appears as a selectable data source in RAGFlow
- **Configuration Management**: Users can input Firecrawl API keys through RAGFlow's interface
- **Web Scraping**: Supports single URL scraping, website crawling, and batch processing
- **Content Processing**: Converts scraped content to RAGFlow's document format with chunking
- **Error Handling**: Comprehensive error handling for rate limits, failed requests, and malformed content
- **UI Components**: Complete UI schema and workflow components for RAGFlow integration

### 📁 Files Added

- `intergrations/firecrawl/` - Complete integration package
- `intergrations/firecrawl/integration.py` - RAGFlow integration entry point
- `intergrations/firecrawl/firecrawl_connector.py` - API communication
- `intergrations/firecrawl/firecrawl_config.py` - Configuration management
- `intergrations/firecrawl/firecrawl_processor.py` - Content processing
- `intergrations/firecrawl/firecrawl_ui.py` - UI components
- `intergrations/firecrawl/ragflow_integration.py` - Main integration class
- `intergrations/firecrawl/README.md` - Complete documentation
- `intergrations/firecrawl/example_usage.py` - Usage examples

### 🧪 Testing

The integration has been thoroughly tested with:

- Configuration validation
- Connection testing
- Content processing and chunking
- UI component rendering
- Error handling scenarios

### 📋 Acceptance Criteria Met

- ✅ Integration appears as selectable data source in RAGFlow's data source options
- ✅ Users can input Firecrawl API keys through RAGFlow's configuration interface
- ✅ Successfully scrapes content from provided URLs and imports into RAGFlow's document store
- ✅ Handles common edge cases (rate limits, failed requests, malformed content)
- ✅ Includes basic documentation and README updates
- ✅ Code follows RAGFlow's existing patterns and coding standards

### Related Issue

https://github.com/firecrawl/firecrawl/issues/2167

---------

Co-authored-by: AB <aj@Ajays-MacBook-Air.local>
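The rate-limit handling listed under the implemented features is not shown on this page. As a rough illustration only, retrying a rate-limited request with exponential backoff might look like the sketch below. `RateLimitError` and `call_with_backoff` are hypothetical names invented for this example, not part of the integration or of the Firecrawl API, and the sketch uses only the standard library rather than `tenacity`.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error a connector could raise on an HTTP 429 response."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("rate limited")
        # Seconds the server asked us to wait, if it said so.
        self.retry_after = retry_after


def call_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError as exc:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                # 1s, 2s, 4s, ... capped at max_delay.
                delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; a connector would normally leave it at the default `time.sleep`.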
32 lines
538 B
Plaintext
# Firecrawl Plugin for RAGFlow - Dependencies

# Core dependencies
aiohttp>=3.8.0
asyncio-throttle>=1.0.0

# Data processing
pydantic>=2.0.0
python-dateutil>=2.8.0

# HTTP and networking
urllib3>=1.26.0
requests>=2.28.0

# Logging and monitoring
structlog>=22.0.0

# Optional: For advanced content processing
beautifulsoup4>=4.11.0
lxml>=4.9.0
html2text>=2020.1.16

# Optional: For enhanced error handling
tenacity>=8.0.0

# Development dependencies (optional)
pytest>=7.0.0
pytest-asyncio>=0.21.0
black>=22.0.0
flake8>=5.0.0
mypy>=1.0.0
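
Several of the packages above are marked optional. One common way to treat a dependency as optional is to fall back to the standard library when it is not installed. The sketch below shows that pattern for `html2text`. It is an assumption about how such a fallback could be written, not the integration's actual code, and `html_to_text` and `fallback_html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser

try:
    import html2text  # optional: richer HTML-to-Markdown conversion
except ImportError:
    html2text = None


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def fallback_html_to_text(html: str) -> str:
    """Standard-library fallback: strip tags and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def html_to_text(html: str) -> str:
    """Use html2text when it is installed, otherwise the plain fallback."""
    if html2text is not None:
        return html2text.html2text(html).strip()
    return fallback_html_to_text(html)
```

The fallback loses formatting such as links and emphasis, which is the trade-off for not requiring the extra package.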